[jira] [Commented] (TAJO-30) Parquet Integration

Hyunsik Choi (JIRA) Tue, 18 Mar 2014 20:08:30 -0700

    [ 
https://issues.apache.org/jira/browse/TAJO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940105#comment-13940105
 ]


Hyunsik Choi commented on TAJO-30:
----------------------------------

Hi [~davidzchen],

I have a plan for nested schemas. Currently, Tajo supports only a flat schema, 
like a relational DBMS. So, even if Tajo is extended to a nested data model, 
it will not break compatibility.

I'm thinking that Tajo should adopt the Parquet data model (the same as that 
of protobuf or BigQuery). When considering a nested data model, I had two main 
points in mind, and the Parquet data model satisfies both. The first point is 
the processing model for nested data. The Parquet data model is the same as 
that of BigQuery, and BigQuery has already defined a concrete processing 
model, including flattening, cross products on repeated fields, and 
aggregation on repeated fields [1][2]. The second point is the file format. 
Parquet is a native file format for this model and already includes an 
efficient record assembly method. Besides, Parquet is already mature and 
widely used in many systems.

[1] http://research.google.com/pubs/pub36632.html
[2] https://developers.google.com/bigquery/docs/data
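
To make two of the operations named above more concrete, here is a rough Python sketch of a cross product over repeated fields and an aggregation within a single record. It is only an illustration, not BigQuery's or Tajo's implementation: the record layout, the function names (cross_product, aggregate_within), and the choice to emit None for an empty repeated field are all assumptions.

```python
from itertools import product

# Hypothetical nested record: one scalar field and two repeated fields.
record = {"name": "a", "tags": ["x", "y"], "scores": [1, 2, 3]}

def cross_product(rec, repeated):
    """Yield one flat row per combination of values of the repeated fields."""
    scalars = {k: v for k, v in rec.items() if k not in repeated}
    # Assumption: an empty repeated field contributes a single None,
    # so the record is not dropped from the output.
    value_lists = [rec[f] or [None] for f in repeated]
    for combo in product(*value_lists):
        yield {**scalars, **dict(zip(repeated, combo))}

def aggregate_within(rec, field, fn):
    """Apply an aggregate function to one repeated field inside one record."""
    return fn(rec[field])

rows = list(cross_product(record, ["tags", "scores"]))
total = aggregate_within(record, "scores", sum)
```

Here the two repeated fields (2 and 3 values) expand into 2 x 3 flat rows, while the within-record aggregate produces one value per record without any flattening.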

I think we need three stages for this work. First, we can start with a small 
change to improve our schema system. Then, we will add a physical operator 
that simply flattens one nested row into a number of flat rows. Finally, we 
will solve query optimization issues such as projection/filter push-down on 
nested schemas, and add physical operators that process nested rows directly.
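
As a minimal sketch of what the second stage could look like, the Python below expands one nested row into several flat rows over a single repeated field. The function name flatten, the dotted column naming ("links.url"), and the sample row are hypothetical and only illustrate the idea; a real operator would work on Tajo tuples and schemas.

```python
from typing import Any, Dict, Iterator

Row = Dict[str, Any]

def flatten(row: Row, field: str) -> Iterator[Row]:
    """Expand one nested row into flat rows, one per element of row[field]."""
    rest = {k: v for k, v in row.items() if k != field}
    for child in row.get(field) or []:
        if isinstance(child, dict):
            # Nested struct: prefix child columns with the parent field name.
            yield {**rest, **{f"{field}.{k}": v for k, v in child.items()}}
        else:
            # Repeated scalar: keep the original column name.
            yield {**rest, field: child}

# Hypothetical nested row with one repeated struct field.
nested = {"id": 1, "links": [{"url": "a", "rank": 1}, {"url": "b", "rank": 2}]}
flat = list(flatten(nested, "links"))
```

In this sketch a row whose repeated field is empty or missing produces no output rows; whether such rows should instead be kept with null columns is a design decision left open here.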

If you have any ideas, feel free to share them with us.

Thanks,
Hyunsik

> Parquet Integration
> -------------------
>
>                 Key: TAJO-30
>                 URL: https://issues.apache.org/jira/browse/TAJO-30
>             Project: Tajo
>          Issue Type: New Feature
>            Reporter: Hyunsik Choi
>            Assignee: David Chen
>              Labels: Parquet
>
> Parquet is a very promising file format developed by Twitter. We need to 
> investigate the applicability of Parquet. If possible, we will implement a 
> Parquet port.
> http://parquet.io/



--
This message was sent by Atlassian JIRA
(v6.2#6252)
