Hi Lars

Thanks for participating in the discussion.

We investigated the existing file formats in the Hadoop ecosystem against
the requirements below, but we could not find one that satisfies all of
them at the same time, so we started designing CarbonData.
R1. Support big scans that fetch only a few columns.
R2. Support primary key lookups with sub-second response.
R3. Support interactive OLAP-style queries over big data that involve many
filters in a single query; this type of workload should respond in seconds.
R4. Support fast extraction of individual records, fetching all columns of
the record.
R5. Support HDFS, so that customers can leverage their existing Hadoop
clusters.

When we investigated Parquet and ORC, they seemed to work very well for R1
and R5, but they do not meet R2, R3, and R4. So we designed CarbonData
mainly to add the following differentiating features:

1. Stores data along with an index: where a query contains filters, the
index can significantly accelerate it and reduce I/O scans and CPU usage.
The CarbonData index consists of multiple levels. A processing framework
can leverage this index to reduce the number of tasks it needs to schedule
and process, and on the task side it can also do skip scans at a
finer-grained unit (called a blocklet) instead of scanning the whole file.

2. Operable encoded data: by supporting efficient compression and global
encoding schemes, CarbonData can run queries directly on compressed/encoded
data. The data is converted only just before the results are returned to
the user, i.e. it is "late materialized".

3. Column groups: multiple columns can form a column group that is stored
in row format, which reduces the cost of reconstructing records from
separate columns.

4. Supports various use cases with a single data format: interactive
OLAP-style queries, sequential access (big scans), and random access
(narrow scans).
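
To make features 1 and 2 a bit more concrete, here is a minimal Python
sketch. It is not CarbonData code. It assumes each blocklet keeps min/max
statistics over dictionary codes (the index layout is not described above,
so this is only one plausible form), and all names in it (Blocklet,
build_dictionary, make_blocklets, scan_equals) are hypothetical. The filter
is evaluated on the encoded integers, blocklets whose range cannot contain
the target are skipped, and values are decoded only for the final result.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Blocklet:
    """Hypothetical scan unit: encoded column values plus min/max stats."""
    codes: List[int]   # dictionary codes, not the raw strings
    min_code: int      # assumed per-blocklet index entry
    max_code: int      # assumed per-blocklet index entry


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Global dictionary: each distinct value gets one integer code."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def make_blocklets(values: List[str], dictionary: Dict[str, int],
                   size: int) -> List[Blocklet]:
    """Encode the column, sort it, and cut it into fixed-size blocklets."""
    codes = sorted(dictionary[v] for v in values)  # sorting keeps ranges tight
    blocklets = []
    for start in range(0, len(codes), size):
        chunk = codes[start:start + size]
        blocklets.append(Blocklet(chunk, min(chunk), max(chunk)))
    return blocklets


def scan_equals(blocklets: List[Blocklet], dictionary: Dict[str, int],
                literal: str) -> Tuple[List[str], int]:
    """Return (decoded matches, number of blocklets actually scanned)."""
    target = dictionary.get(literal)
    if target is None:
        return [], 0  # value is not in the dictionary, nothing to scan
    matched_codes: List[int] = []
    scanned = 0
    for blocklet in blocklets:
        if target < blocklet.min_code or target > blocklet.max_code:
            continue  # skip scan: the index rules this blocklet out
        scanned += 1
        # The comparison runs on encoded integers, not on strings.
        matched_codes.extend(c for c in blocklet.codes if c == target)
    # Late materialization: decode only the rows that are returned.
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in matched_codes], scanned


cities = ["paris", "oslo", "lima", "oslo", "rome", "lima", "oslo", "cairo"]
dictionary = build_dictionary(cities)
blocklets = make_blocklets(cities, dictionary, size=2)
rows, scanned = scan_equals(blocklets, dictionary, "oslo")
print(rows, scanned, len(blocklets))
```

In this toy data set only two of the four blocklets are read for the
filter city = "oslo"; the other two are ruled out by their min/max range
alone.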

Please let me know if the above information answers your questions.

Regards
Liang






