Re: [DISCUSS] CarbonData incubation proposal

Julien Le Dem Thu, 19 May 2016 17:47:42 -0700

I have a similar comment regarding the file format specification: it looks
like it is derived from the Parquet file format.

That is fine as long as we follow the terms of the license:

https://github.com/apache/parquet-format/blob/master/LICENSE#L101
        (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

For example, compare CarbonData:

https://github.com/HuaweiBigData/carbondata/wiki/CarbonData-File-Structure-and-Format

https://github.com/HuaweiBigData/carbondata/blob/master/format/src/main/thrift/carbondata.thrift

Parquet:

https://github.com/apache/parquet-format/blob/master/README.md

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

On Thu, May 19, 2016 at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:

> I see code derived from Mondrian in the org.carbondata.core.carbon
> package [1] (I’m familiar with Mondrian’s code structure because I wrote
> it). Mondrian was originally EPL-licensed and as such cannot be re-licensed
> under ASL. Everything is probably fine, but as part of incubation, we will
> need to make sure that this and other code has a clear provenance.
>
> Julian
>
> [1]
> https://github.com/HuaweiBigData/carbondata/tree/master/core/src/main/java/org/carbondata/core/carbon
>
> > On May 19, 2016, at 10:04 AM, Liang Chen <chenliang...@huawei.com> wrote:
> >
> > Hi Lars
> >
> > Thanks for participating in the discussion.
> >
> > Based on the requirements below, we investigated the existing file formats
> > in the Hadoop ecosystem, but we could not find a solution that satisfies
> > all of the requirements at the same time, so we started designing
> > CarbonData.
> > R1. Support big scans that fetch only a few columns.
> > R2. Support primary key lookups with sub-second response.
> > R3. Support interactive OLAP-style queries over big data that involve many
> > filters per query; this type of workload should respond in seconds.
> > R4. Support fast extraction of individual records, fetching all columns
> > of the record.
> > R5. Support HDFS so that customers can leverage their existing Hadoop
> > clusters.
> >
> > When we investigated Parquet and ORC, they seemed to work very well for R1
> > and R5, but they do not meet R2, R3, and R4. So we designed CarbonData
> > mainly to add the following differentiating features:
> >
> > 1. Stores data along with an index: this can significantly accelerate
> > query performance and reduce I/O scans and CPU usage when there are
> > filters in the query. The CarbonData index consists of multiple levels; a
> > processing framework can leverage this index to reduce the number of tasks
> > it needs to schedule and process, and it can also do skip scans at a
> > finer-grained unit (called a blocklet) during task-side scanning instead
> > of scanning the whole file.
> >
> > 2. Operable encoded data: by supporting efficient compression and global
> > encoding schemes, queries can run on compressed/encoded data, and the data
> > is converted only just before the results are returned to the user, which
> > is "late materialization".
> >
> > 3. Column groups: multiple columns can form a column group that is stored
> > in row format, which reduces the cost of column reconstruction.
> >
> > 4. Support for various use cases with a single data format: interactive
> > OLAP-style queries, sequential access (big scans), and random access
> > (narrow scans).
> >
> > Please kindly let me know if the above info answers your questions.
> >
> > Regards
> > Liang
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-CarbonData-incubation-proposal-tp49643p49652.html
> > Sent from the Apache Incubator - General mailing list archive at Nabble.com.
> >
> >
>
>


-- 
Julien
