RE: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Cheng, Hao
+1

-----Original Message-----
From: Jacques Nadeau [mailto:jacq...@apache.org] 
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament wrote:

> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept 
> > CarbonData into the Incubator.
> >
> > [ ] +1 Accept CarbonData into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows, you can also access the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Hadoop-native file format for fast
> > interactive queries. It uses advanced columnar storage, indexing,
> > compression, and encoding techniques to improve computing efficiency,
> > which in turn helps speed up queries by an order of magnitude over
> > petabytes of data.
> >
> > CarbonData GitHub address:
> > https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider committed to enhancing customer
> > experiences for telecom carriers, enterprises, and consumers through
> > big data. In order to satisfy the following customer requirements, we
> > created a new Hadoop-native file format:
> >
> >   * Support interactive OLAP-style queries over big data in seconds.
> >   * Support fast queries on individual records that require touching
> > all fields.
> >   * Fast data loading, with support for incremental loads at intervals
> > of minutes.
> >   * Support HDFS so that customers can leverage existing Hadoop clusters.
> >   * Support time-based data retention.
> >
> > Based on these requirements, we investigated existing file formats
> > in the Hadoop ecosystem, but we could not find one that satisfied all
> > of the requirements at the same time, so we started designing
> > CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> >   1. CarbonData file format: the core implementation of the format,
> > including columnar layout, indexes, dictionary, encoding and
> > compression, and the API for reading/writing files.
> >   2. CarbonData integration with big data processing frameworks such
> > as Apache Spark and Apache Hive. Apache Beam is also planned, to
> > abstract the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > The CarbonData file format is a columnar store on HDFS. It has many
> > features of a modern columnar format, such as splittability,
> > compression schemes, and complex data types. In addition, CarbonData
> > has the following unique features:
> >
> > ==== Indexing ====
> >
> > In order to support fast interactive queries, CarbonData leverages
> > indexing technology to reduce I/O scans. CarbonData files store data
> > along with the index; the index is not stored separately, but is
> > contained within the CarbonData file itself. The current
> > implementation supports three types of indexes:
> >
> > 1. Multi-Dimensional Key (B+ tree index)
> >   Data blocks are written to disk in sequence, and within each data
> > block each column block is written in sequence. Finally, the metadata
> > block for the file is written with the byte positions of each block
> > in the file, the min-max statistics index, and the start and end MDK
> > of each data block. Since the entire data in the file is in sorted
> > order, the start and end MDK of each data block can be used to
> > construct a B+ tree, and the file can be logically represented as a
> > B+ tree with the data blocks as leaf nodes (on disk) and the
> > remaining non-leaf nodes in memory.
> > 2. Inverted index
> >   Inverted indexes are widely used in search engines. This index
> > helps the processing/query engine filter within a single HDFS block.
> > Furthermore, combining a bitmap with the inverted index at query time
> > accelerates count-distinct-like operations.
> > 3. Min-max index
> >   A min-max index is created for every column so that the
> > processing/query engine can skip blocks it does not need to scan; a
> > minimal sketch of this pruning follows the list.
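> >
> > The following is a minimal, runnable Java sketch of min-max pruning
> > (these are illustrative classes, not CarbonData's actual API): the
> > scanner consults each block's min-max statistics and reads only the
> > blocks whose range can contain the predicate value.
> >
> > {{{
> > import java.util.ArrayList;
> > import java.util.Arrays;
> > import java.util.List;
> >
> > public class MinMaxPruneDemo {
> >     static final class BlockStats {
> >         final long offset;  // byte position of the data block in the file
> >         final int min, max; // min-max statistics for one column
> >         BlockStats(long offset, int min, int max) {
> >             this.offset = offset; this.min = min; this.max = max;
> >         }
> >     }
> >
> >     // Keep only the blocks whose [min, max] range can contain v;
> >     // every other block is skipped without any I/O.
> >     static List<BlockStats> prune(List<BlockStats> blocks, int v) {
> >         List<BlockStats> hits = new ArrayList<>();
> >         for (BlockStats b : blocks) {
> >             if (v >= b.min && v <= b.max) hits.add(b);
> >         }
> >         return hits;
> >     }
> >
> >     public static void main(String[] args) {
> >         List<BlockStats> blocks = Arrays.asList(
> >             new BlockStats(0, 1, 100), new BlockStats(4096, 101, 200));
> >         // The predicate "col = 150" touches only the second block.
> >         System.out.println(prune(blocks, 150).size()); // prints 1
> >     }
> > }
> > }}}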
> >
> > ==== Global Dictionary ====
> >
> > Besides I/O reduction, CarbonData accelerates computation by using a
> > global dictionary, which enables processing/query engines to perform
> > all processing on encoded data without having to decode it (late
> > materialization). We have observed dramatic performance improvements
> > in OLAP analytic scenarios where tables contain many columns of
> > string data type. The data is converted back to its user-readable
> > form only when the final result is returned to the user.
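> >
> > A minimal sketch of late materialization (illustrative only, not
> > CarbonData code): a group-by runs entirely on the compact integer
> > dictionary codes and decodes them to strings only when producing the
> > final, user-visible result.
> >
> > {{{
> > import java.util.HashMap;
> > import java.util.Map;
> >
> > public class LateMaterializationDemo {
> >     public static void main(String[] args) {
> >         String[] dictionary = {"beijing", "shenzhen", "shanghai"};
> >         int[] encodedColumn = {0, 1, 1, 2, 0, 1}; // dictionary codes
> >
> >         // Aggregate on the integer codes, never on the strings.
> >         Map<Integer, Integer> counts = new HashMap<>();
> >         for (int code : encodedColumn) {
> >             counts.merge(code, 1, Integer::sum);
> >         }
> >
> >         // Decode only once, for the user-visible output.
> >         counts.forEach((code, count) ->
> >             System.out.println(dictionary[code] + " -> " + count));
> >     }
> > }
> > }}}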

RE: [VOTE] Accept Mnemonic into the Apache Incubator

2016-03-03 Thread Cheng, Hao
> > interested in innovative memory projects that fit large-sized
> > persistent memory and storage devices. Various Apache projects such
> > as Apache Spark™, Apache HBase™, Apache Phoenix™, Apache Flink™, and
> > Apache Cassandra™ can take good advantage of this project to overcome
> > serialization/deserialization, Java GC, and caching issues. We expect
> > a wide range of interest to be generated after we open-source this
> > project at Apache.
> >
> > ==== Reliance on Salaried Developers ====
> >
> > All developers are paid by their employers to contribute to this
> > project. We welcome all others to contribute to this project after it
> > is open sourced.
> >
> > ==== Relationships with Other Apache Products ====
> >
> > Relationship with Apache™ Arrow:
> > Arrow's columnar data layout makes great use of CPU caches and SIMD.
> > It places all data relevant to a column operation in a compact format
> > in memory.
> >
> > Mnemonic puts whole business object graphs directly on external
> > heterogeneous storage media, e.g. off-heap memory or SSD. It is not
> > necessary to normalize the structure of an object graph for caching,
> > checkpointing, or storing, so developers do not have to normalize
> > their data object graphs. Compared to traditional approaches,
> > Mnemonic applications can avoid indexing and joining datasets; a
> > small sketch of the idea follows.
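> >
> > As a runnable Java illustration of the idea (not Mnemonic code), the
> > sketch below keeps a small linked object graph in off-heap memory via
> > a direct ByteBuffer: the nodes live outside the Java heap, add no GC
> > pressure, and need no per-node serialization.
> >
> > {{{
> > import java.nio.ByteBuffer;
> >
> > public class OffHeapListDemo {
> >     // Off-heap region; each node is 16 bytes: an 8-byte value followed
> >     // by the 8-byte offset of the next node (-1 marks the end).
> >     static final ByteBuffer REGION = ByteBuffer.allocateDirect(1 << 20);
> >     static int top = 0; // bump allocator: next free offset in the region
> >
> >     static int newNode(long value, long nextOffset) {
> >         int off = top;
> >         REGION.putLong(off, value);
> >         REGION.putLong(off + 8, nextOffset);
> >         top += 16;
> >         return off;
> >     }
> >
> >     public static void main(String[] args) {
> >         int head = newNode(1, newNode(2, newNode(3, -1)));
> >         for (long off = head; off != -1; off = REGION.getLong((int) off + 8)) {
> >             System.out.println(REGION.getLong((int) off)); // 1, 2, 3
> >         }
> >     }
> > }
> > }}}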
> >
> > Mnemonic can leverage Arrow to transparently re-lay-out qualified
> > data objects, or to create special containers that can efficiently
> > hold those data records in columnar form, as one of its major
> > performance optimization constructs.
> >
> > Mnemonic can be integrated into various Big Data and Cloud 
> > frameworks and applications.
> > We are currently working on several Apache projects with Mnemonic:
> > For Apache Spark™ we are integrating Mnemonic to improve:
> > a) Local checkpoints
> > b) Memory management for caching
> > c) Persistent memory datasets input
> > d) Non-Volatile RDD operations
> > The best use case for Apache Spark™ is storing the input data in
> > Mnemonic native storage, which avoids caching its row data for
> > iterative processing. Moreover, Spark applications can leverage
> > Mnemonic to perform data transformations in persistent or
> > non-persistent memory without SerDes (illustrated in the sketch
> > below).
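> >
> > A purely illustrative sketch of the intended usage pattern follows;
> > "durableRdd" is a hypothetical stand-in, not Mnemonic's actual Spark
> > API. The point is that an RDD backed by Mnemonic native storage can
> > be iterated repeatedly without cache() of row data on the Java heap
> > and without SerDes between iterations.
> >
> > {{{
> > import java.util.Arrays;
> > import org.apache.spark.SparkConf;
> > import org.apache.spark.api.java.JavaRDD;
> > import org.apache.spark.api.java.JavaSparkContext;
> >
> > public class MnemonicSparkSketch {
> >     // Stand-in so the sketch compiles; a real integration would expose
> >     // records already laid out in Mnemonic native storage instead of
> >     // parallelizing heap data.
> >     static JavaRDD<Long> durableRdd(JavaSparkContext sc) {
> >         return sc.parallelize(Arrays.asList(1L, 2L, 3L));
> >     }
> >
> >     public static void main(String[] args) {
> >         JavaSparkContext sc = new JavaSparkContext(
> >             new SparkConf().setAppName("mnemonic-sketch").setMaster("local[2]"));
> >         JavaRDD<Long> input = durableRdd(sc);
> >         long total = 0;
> >         for (int iter = 0; iter < 10; iter++) {
> >             // Each pass re-reads the same durable dataset; no re-caching
> >             // or (de)serialization of row data is required in between.
> >             total += input.map(v -> v * 2).reduce(Long::sum);
> >         }
> >         System.out.println(total);
> >         sc.stop();
> >     }
> > }
> > }}}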
> >
> > For Apache™ Hadoop®, we are integrating HDFS caching with Mnemonic
> > instead of mmap, to take advantage of persistent-memory features. We
> > also plan to evaluate persisting the NameNode edit log and FSImage
> > data into a Mnemonic persistent memory area.
> >
> > For Apache HBase™, we are using Mnemonic for BucketCache and 
> > evaluating performance improvements.
> >
> > We expect Mnemonic to be further developed and integrated into many
> > Apache big data projects, enhancing their memory management for much
> > improved performance and reliability.
> >
> > ==== An Excessive Fascination with the Apache Brand ====
> >
> > While we expect the Apache brand to help attract more contributors,
> > our interest in starting this project is based on the factors
> > mentioned in the Rationale section.
> >
> > We would like Mnemonic to become an Apache project to further foster
> > a healthy community of contributors and consumers in big data R&D
> > areas. Since Mnemonic can directly benefit many Apache projects and
> > solves major performance problems, we expect it to increase the
> > Apache Software Foundation's interaction with the larger community as
> > well.
> >
> > === Documentation ===
> > The documentation is currently available at Intel and will be posted
> > under: https://mnemonic.incubator.apache.org/docs
> >
> > === Initial Source ===
> > The initial source code is temporarily hosted on GitHub for general
> > viewing: https://github.com/NonVolatileComputing/Mnemonic.git
> > It will be moved to Apache (http://git.apache.org/) once the podling
> > is established.
> >
> > The initial source is written in Java (88%), mixed with JNI C code
> > (11%) and shell scripts (1%) for the underlying native allocation
> > libraries; the Java side of such a binding is sketched below.
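> >
> > For illustration, the typical Java-side JNI pattern for wrapping a
> > native allocation library looks like the sketch below; the library
> > name and native methods are hypothetical, not Mnemonic's actual
> > classes.
> >
> > {{{
> > public class NativeAllocator {
> >     static {
> >         // Loads libnativealloc.so (or nativealloc.dll) from
> >         // java.library.path; the implementation lives in JNI C code.
> >         System.loadLibrary("nativealloc");
> >     }
> >
> >     // Implemented in JNI C; returns the address of a region allocated
> >     // in native (off-heap or persistent) memory.
> >     public static native long allocate(long sizeInBytes);
> >
> >     // Releases a region previously returned by allocate().
> >     public static native void free(long address);
> > }
> > }}}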
> >
> > === Source and Intellectual Property Submission Plan ===
> >
> > As soon as Mnemonic is approved to join the Incubator, the source
> > code will be transitioned via the Software Grant Agreement onto ASF
> > infrastructure and in turn made available under the Apache License,
> > version 2.0.
> >
> > === External Dependencies ===
> > The required external dependencies are all