[RESULT][VOTE] Accept CarbonData into the Apache Incubator

Jean-Baptiste Onofré Thu, 02 Jun 2016 12:29:12 -0700

Hi,

I am closing this vote with only +1 votes: welcome to Apache CarbonData in the Incubator!


I will request the creation of the resources.

Thanks to all for your votes.

Regards
JB

On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:

Hi all,

following the discussion thread, I'm now calling a vote to accept
CarbonData into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows; you can also access the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.

CarbonData github address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider. We are committed to enhancing
customer experiences for telecom carriers, enterprises, and consumers on
big data. In order to satisfy the following customer requirements, we
created a new Hadoop native file format:

  * Support interactive OLAP-style queries over big data in seconds.
  * Support fast queries on individual records, which require touching all
fields.
  * Fast data loading, with support for incremental loads in periods of
minutes.
  * Support HDFS so that customers can leverage existing Hadoop clusters.
  * Support time-based data retention.

Based on these requirements, we investigated existing file formats in
the Hadoop ecosystem, but we could not find a suitable solution that
satisfies all of the requirements at the same time, so we started
designing CarbonData.

== Rationale ==

CarbonData contains multiple modules, which are classified into two
categories:

  1. CarbonData File Format: contains the core implementation of the file
format, such as columnar storage, index, dictionary, encoding and
compression, and the API for reading/writing.
  2. CarbonData integration with big data processing frameworks such as
Apache Spark and Apache Hive. Apache Beam integration is also planned, to
abstract the execution runtime.

=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many of
the features that a modern columnar format has, such as being splittable,
compression schemas, complex data types, etc. In addition, CarbonData has
the following unique features:

==== Indexing ====

In order to support fast interactive queries, CarbonData leverages
indexing technology to reduce I/O scans. A CarbonData file stores data
along with its index: the index is not stored separately, but is
contained in the CarbonData file itself. In the current implementation,
CarbonData supports 3 types of indexing:

1. Multi-dimensional Key (B+ Tree index)
  The data blocks are written to disk in sequence, and within each
data block each column block is written in sequence. Finally, the
metadata block for the file is written with the byte positions of each
block in the file, the Min-Max statistics index and the start and end
multi-dimensional key (MDK) of each data block. Since the entire data in
the file is in sorted order, the start and end MDK of each data block can
be used to construct a B+Tree, and the file can be logically represented
as a B+Tree with the data blocks as leaf nodes (on disk) and the
remaining non-leaf nodes in memory.
2. Inverted index
  Inverted indexes are widely used in search engines. This index helps
the processing/query engine filter inside one HDFS block. Furthermore,
combining bitmaps with the inverted index at query time makes it possible
to accelerate count-distinct-like operations.
3. MinMax index
  A min-max index is created for all columns so that the processing/query
engine can skip scans that are not required.
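
To illustrate how the start/end keys and min-max statistics can be used to skip blocks, here is a minimal Python sketch. It is not CarbonData code: the `BlockMeta` class, both functions and the sample data are hypothetical, and a binary search over a sorted list stands in for the in-memory B+Tree.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Hypothetical per-block metadata: sort-key range plus min/max per column."""
    start_key: tuple
    end_key: tuple
    min_max: dict  # column name -> (min, max)


def blocks_for_key_range(blocks, lo, hi):
    """Indices of blocks whose [start_key, end_key] range overlaps [lo, hi].

    Assumes blocks are sorted by key and do not overlap, so two binary
    searches are enough to find the candidate blocks.
    """
    start_keys = [b.start_key for b in blocks]
    end_keys = [b.end_key for b in blocks]
    first = bisect_left(end_keys, lo)    # first block with end_key >= lo
    last = bisect_right(start_keys, hi)  # one past last block with start_key <= hi
    return list(range(first, last))


def blocks_for_column_range(blocks, column, lo, hi):
    """Indices of blocks whose min/max for `column` can contain values in [lo, hi]."""
    return [
        i for i, b in enumerate(blocks)
        if b.min_max[column][1] >= lo and b.min_max[column][0] <= hi
    ]


# Made-up metadata for three data blocks, sorted on a two-part key.
blocks = [
    BlockMeta((1, 1), (1, 9), {"price": (5, 20)}),
    BlockMeta((2, 0), (2, 7), {"price": (30, 45)}),
    BlockMeta((3, 2), (4, 1), {"price": (10, 60)}),
]

key_candidates = blocks_for_key_range(blocks, (2, 3), (3, 5))
price_candidates = blocks_for_column_range(blocks, "price", 21, 29)
```

In this sketch only the blocks returned by either function would have to be read from disk; all other blocks are skipped using metadata alone.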

==== Global Dictionary ====

Besides I/O reduction, CarbonData accelerates computation by using a
global dictionary, which enables processing/query engines to perform all
processing on encoded data without having to convert the data (late
materialization). We have observed dramatic performance improvements in
OLAP analytic scenarios where tables contain many columns of string data
type. The data is converted back to user-readable form just before the
processing/query engine returns results to the user.
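
The idea of processing on encoded data can be sketched as follows. This is a simplified Python illustration with made-up data, not the CarbonData implementation; `build_dictionary` and the variable names are hypothetical.

```python
def build_dictionary(values):
    """Assign each distinct value an integer code, in sorted order."""
    distinct = sorted(set(values))
    encode = {value: code for code, value in enumerate(distinct)}
    return encode, distinct  # `distinct` doubles as the code -> value table


cities = ["Shenzhen", "Paris", "Shenzhen", "Bangalore", "Paris", "Shenzhen"]
encode, decode = build_dictionary(cities)
encoded_column = [encode[c] for c in cities]  # what would be stored on disk

# Filter: translate the predicate once, then compare integers only.
target = encode["Shenzhen"]
matching_rows = [i for i, code in enumerate(encoded_column) if code == target]

# Group-by count, still on integer codes.
counts = {}
for code in encoded_column:
    counts[code] = counts.get(code, 0) + 1

# Late materialization: decode only the final (small) result.
result = {decode[code]: n for code, n in counts.items()}
```

String comparisons happen once per predicate rather than once per row, and the strings themselves are only touched when the result is returned.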

==== Column Group ====

Sometimes users want to run processing/queries on multiple columns of one
table, for example scanning for an individual record in a
troubleshooting scenario. In this case, a row format is more efficient
than a columnar format, since all columns will be touched by the workload.
To accelerate this, CarbonData supports storing a group of columns in row
format, so the data in a column group is stored together, enabling fast
retrieval.
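
The difference between the two layouts can be shown with a small Python sketch. The table, column names and helper functions are hypothetical and only illustrate the access pattern, not the on-disk format.

```python
columns = ("user", "day", "status", "latency_ms")
rows = [
    ("u1", "2016-05-25", "OK", 120),
    ("u2", "2016-05-25", "FAIL", 950),
    ("u3", "2016-05-26", "OK", 80),
]

# Pure columnar layout: one array per column.
columnar = {name: [r[i] for r in rows] for i, name in enumerate(columns)}


def fetch_record_columnar(row_id):
    """Rebuild one record: one lookup per column."""
    return tuple(columnar[name][row_id] for name in columns)


def build_column_group(group):
    """Store the chosen columns together, one tuple per row."""
    idx = [columns.index(name) for name in group]
    return [tuple(r[i] for i in idx) for r in rows]


group = ("user", "status", "latency_ms")
column_group = build_column_group(group)


def fetch_record_grouped(row_id):
    """All grouped columns of a record come back in a single lookup."""
    return column_group[row_id]


full_record = fetch_record_columnar(1)
grouped_record = fetch_record_grouped(1)
```

With the columnar layout a record lookup touches every column array, while the column group returns the grouped fields from one contiguous entry.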

==== Optimized for multiple use cases ====

CarbonData indices and dictionaries are highly configurable. To optimize
storage for different use cases, users can configure what to index, and
so decide on and tune the format before loading data into CarbonData.

For example:

|| Use Case || Supporting Features ||
|| Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
|| High throughput scan || Global dictionary, Minmax index ||
|| Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
|| Individual record query || Column group, Global dictionary ||

=== BigData Processing Framework Integration ===

  * CarbonData provides InputFormat/OutputFormat interfaces for
reading/writing data from CarbonData files, and also provides an
abstract API for processing data stored in CarbonData format with data
processing frameworks.
  * CarbonData provides deep integration with Apache Spark, including
predicate push down, column pruning, aggregation push down, etc., so
users can use Spark SQL to connect to and query CarbonData.
  * CarbonData can integrate with various big data query/processing
frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.

Example:
https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala


== Initial Goals ==

Our initial goals are to bring CarbonData into the ASF, transition
internal engineering processes into the open, and foster a collaborative
development model according to the "Apache Way".

== Current Status ==

CarbonData is production ready and already provides a large set of features.
The current license is already Apache 2.0.

== Meritocracy ==

We intend to radically expand the initial developer and user community
by running the project in accordance with the "Apache Way". Users and
new contributors will be treated with respect and welcomed. By
participating in the community and providing quality patches/support
that move the project forward, they will earn merit. They also will be
encouraged to provide non-code contributions (documentation, events,
community management, etc.) and will gain merit for doing so. Those with
a proven support and quality track record will be encouraged to become
committers.

== Community ==

If CarbonData is accepted for incubation, the primary initial goal is to
build a large community. We truly believe that CarbonData will become a
key project for big data column-like platforms, and so we are betting on
a large community of users and developers.

== Known Risks ==

Development has been sponsored mostly by a single company. For the project
to fully transition to the Apache Way governance model, development must
shift towards the meritocracy-centric model of growing a community of
contributors balanced with the needs for extreme stability and core
implementation coherency.

== Orphaned products ==

Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
interest in making CarbonData succeed by driving its close integration
with sister ASF projects. We expect this to further reduce the risk of
orphaning the product.

== Inexperience with Open Source ==

Huawei has been developing and using open source software for a long
time. Additionally, several ASF veterans agreed to mentor the project
and are listed in this proposal. The project will rely on their guidance
and collective wisdom to quickly transition the entire team of initial
committers towards practicing the Apache Way.

== Reliance on Salaried Developers ==

Most of the contributors are paid to work in the big data space. While
they might wander from their current employers, they are unlikely to
venture far from their core expertise and thus will continue to be
engaged with the project regardless of their current employers.

== An Excessive Fascination with the Apache Brand ==

While we intend to leverage the Apache ‘branding’ when talking to other
projects as a testament to our project’s ‘neutrality’, we have no plans
for making use of the Apache brand in press releases or posting
billboards advertising acceptance of CarbonData into the Apache Incubator.

== Initial Source ==

https://github.com/HuaweiBigData/carbondata.git

== External Dependencies ==

All external dependencies are licensed under the Apache 2.0 license or an
Apache-compatible license. As we grow the CarbonData community we will
configure our build process to require and validate that all
contributions and dependencies are licensed under the Apache 2.0 license
or an Apache-compatible license.

  * Apache Spark
  * Apache Hadoop
  * Apache Maven
  * Apache Commons
  * Apache Log4j
  * Apache Thrift
  * Apache Zookeeper
  * Scala
  * Snappy
  * Kettle (Pentaho)
  * Eigenbase
  * Fastutil
  * GSON
  * Jmockit
  * Junit

== Required Resources ==

=== Mailing lists ===

  * priv...@carbondata.incubator.apache.org (moderated subscriptions)
  * comm...@carbondata.incubator.apache.org
  * d...@carbondata.incubator.apache.org
  * iss...@carbondata.incubator.apache.org

=== Git Repository ===

  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git

=== Issue Tracking ===

  * JIRA Project CarbonData (CarbonData)

=== Initial Committers ===

  * Liang Chenliang
  * Jean-Baptiste Onofré
  * Henry Saputra
  * Uma Maheswara Rao G
  * Jenny MA
  * Jacky Likun
  * Vimal Das Kammath
  * Jarray Qiuheng

=== Affiliations ===

  * Huawei: Liang Chenliang
  * Talend: Jean-Baptiste Onofré
  * eBay: Henry Saputra
  * Intel: Uma Maheswara Rao G

=== Sponsors ===

=== Champion ===

  * Jean-Baptiste Onofré - Apache Member

=== Mentors ===

  * Henry Saputra (eBay)
  * Jean-Baptiste Onofré (Talend)
  * Uma Maheswara Rao G (Intel)

=== Sponsoring Entity ===

The Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
