+1
-----Original Message-----
From: Jacques Nadeau [mailto:jacq...@apache.org]
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)
On Wed, May 25, 2016 at 4:04 PM, John D. Ament
wrote:
> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré
> wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept
> > CarbonData into the Incubator.
> >
> > [ ] +1 Accept CarbonData into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows; you can also access it on the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Apache Hadoop native file format for
> > faster interactive queries. It uses advanced columnar storage, index,
> > compression and encoding techniques to improve computing efficiency,
> > which in turn helps speed up queries by an order of magnitude
> > over petabytes of data.
> >
> > CarbonData github address:
> > https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider. We are committed to enhancing
> > customer experiences for telecom carriers, enterprises, and
> > consumers on big data. In order to satisfy the following customer
> > requirements, we created a new Hadoop native file format:
> >
> > * Support interactive OLAP-style queries over big data in seconds.
> > * Support fast queries on individual records, which require touching
> > all fields.
> > * Fast data loading, with support for incremental loads at intervals
> > of minutes.
> > * Support HDFS so that customers can leverage existing Hadoop clusters.
> > * Support time-based data retention.
> >
> > Based on these requirements, we investigated existing file formats
> > in the Hadoop ecosystem, but we could not find a suitable solution
> > that satisfies all of the requirements at the same time, so we started
> > designing CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> > 1. CarbonData File Format: contains the core implementation of the
> > file format, such as columnar storage, index, dictionary,
> > encoding and compression, and the API for reading/writing.
> > 2. CarbonData integration with big data processing frameworks such
> > as Apache Spark and Apache Hive. Apache Beam is also planned to
> > abstract the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > The CarbonData file format is a columnar store in HDFS. It has many
> > features that a modern columnar format has, such as being splittable,
> > a compression schema, and complex data types. CarbonData also has the
> > following unique features:
> >
> > Indexing
> >
> > In order to support fast interactive queries, CarbonData leverages
> > indexing technology to reduce I/O scans. CarbonData files store
> > data along with the index: the index is not stored separately, but the
> > CarbonData file itself contains the index. In the current
> > implementation, CarbonData supports 3 types of indexing:
> >
> > 1. Multi-dimensional Key (B+ Tree index)
> > The data blocks are written in sequence to the disk, and within each
> > data block each column block is written in sequence. Finally, the
> > metadata block for the file is written with information about the byte
> > positions of each block in the file, the Min-Max statistics index and
> > the start and end MDK (multi-dimensional key) of each data block.
> > Since the entire data in the file is in sorted order, the start and
> > end MDK of each data block can be used to construct a B+Tree, and the
> > file can be logically represented as a B+Tree with the data blocks as
> > leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> > 2. Inverted index
> > Inverted indexes are widely used in search engines. Using this
> > index helps the processing/query engine to filter inside one HDFS
> > block. Furthermore, query acceleration for count-distinct-like
> > operations is made possible by combining bitmap and inverted index
> > at query time.
> > 3. MinMax index
> > A MinMax index is created for all columns so that the processing/query
> > engine can skip scans that are not required.
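
A minimal sketch of how the three indexes described above could cooperate at query time. This is not CarbonData's actual implementation. It assumes a single integer sort key as a stand-in for the MDK, a binary search over block start/end keys in place of a real B+Tree, and a hypothetical `city` column. All function and field names are invented for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_blocks(rows, block_size):
    """Sort (key, city) rows and split them into blocks with per-block indexes.

    Hypothetical layout: the integer key stands in for the MDK.
    """
    rows = sorted(rows)
    blocks = []
    for i in range(0, len(rows), block_size):
        chunk = rows[i:i + block_size]
        inverted = defaultdict(list)  # city value -> row ids inside this block
        for row_id, (_, city) in enumerate(chunk):
            inverted[city].append(row_id)
        cities = [city for _, city in chunk]
        blocks.append({
            "rows": chunk,
            "start_key": chunk[0][0],   # start "MDK" of the block
            "end_key": chunk[-1][0],    # end "MDK" of the block
            "city_min": min(cities),    # min-max statistics for the column
            "city_max": max(cities),
            "inverted": dict(inverted),
        })
    return blocks


def candidate_blocks(blocks, lo, hi):
    """Find blocks whose [start_key, end_key] range overlaps [lo, hi].

    Because the data is globally sorted, two binary searches are enough.
    """
    ends = [b["end_key"] for b in blocks]
    starts = [b["start_key"] for b in blocks]
    first = bisect_left(ends, lo)     # first block with end_key >= lo
    last = bisect_right(starts, hi)   # one past last block with start_key <= hi
    return blocks[first:last]


def query(blocks, lo, hi, city):
    """Return rows with lo <= key <= hi and the given city value."""
    hits = []
    for block in candidate_blocks(blocks, lo, hi):
        # Min-max index: skip the block if the value cannot be present.
        if not (block["city_min"] <= city <= block["city_max"]):
            continue
        # Inverted index: jump straight to the matching rows in the block.
        for row_id in block["inverted"].get(city, []):
            key, _ = block["rows"][row_id]
            if lo <= key <= hi:
                hits.append((key, city))
    return hits


ROWS = [(1, "paris"), (2, "berlin"), (3, "paris"), (4, "tokyo"),
        (5, "tokyo"), (6, "berlin"), (7, "oslo"), (8, "paris")]
BLOCKS = build_blocks(ROWS, block_size=3)
print(query(BLOCKS, 2, 7, "berlin"))  # [(2, 'berlin'), (6, 'berlin')]
```

In this example the third block is never scanned for "berlin": its min-max range for `city` ("oslo" to "paris") excludes the value, so only the inverted-index entries of the first two blocks are read.
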
> >
> > Global Dictionary
> >
> > Besides I/O reduction, CarbonData accelerates computation by using a
> > global dictionary, which enables processing/query engines to perform
> > all processing on encoded data without having to convert the data
> > (late materialization). We have observed dramatic performance
> > improvements for OLAP analytic scenarios where the table contains many
> > columns of string data type. The data is converted back to the
> > user-readable form