[
https://issues.apache.org/jira/browse/TAJO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943833#comment-13943833
]
David Chen edited comment on TAJO-30 at 3/22/14 1:35 AM:
---------------------------------------------------------
Hi Hyunsik,
That's an interesting idea. Do you mean that Tajo will use Parquet as the
default storage format or have all storage formats deserialize into a
representation that follows the Dremel model? Parquet doesn't really have its
own in-memory representation. Each of the Parquet packages basically
deserializes into a given in-memory representation using its readers and
writers. For example, parquet-avro deserializes into Avro GenericRecords (or
SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code
deserializes into Tajo Tuples.
My changes are currently in the {{parquet}} branch in my fork on GitHub:
https://github.com/davidzchen/incubator-tajo/tree/parquet
They are almost ready. During further testing, I found a few more issues, most
of which I have now fixed. One thing I noticed was that when reading a
projection, the resulting Tuple still has all the columns of the table schema
but the non-projected fields are simply null. What is the motivation for
retaining all the columns in the Tuple rather than having the Tuple only
contain the projected columns?
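To illustrate the question, here is a minimal sketch of the two layouts I am contrasting. The class and method names are mine for illustration only, not Tajo APIs; assume a table schema of (id, name, score) with a projection that reads only (id, score):

```java
// Hypothetical sketch of the two projection layouts; not Tajo code.
public class ProjectionSketch {
    // Layout 1 (current behavior): the tuple keeps every schema column,
    // and the non-projected slot (name) is simply null.
    static Object[] fullWidthTuple(Object id, Object score) {
        return new Object[] { id, null, score };
    }

    // Layout 2 (the alternative): the tuple holds only the projected
    // columns, plus an index map from projected position back to the
    // schema position.
    static final int[] PROJECTION_INDEX = { 0, 2 };

    static Object[] compactTuple(Object id, Object score) {
        return new Object[] { id, score };
    }

    // Reading column "score" (schema position 2) under each layout:
    static Object readScoreFull(Object[] t) {
        return t[2]; // schema position works directly
    }

    static Object readScoreCompact(Object[] t) {
        // must translate schema position 2 into the compact position
        for (int i = 0; i < PROJECTION_INDEX.length; i++) {
            if (PROJECTION_INDEX[i] == 2) {
                return t[i];
            }
        }
        return null;
    }
}
```

The full-width layout lets downstream operators keep using schema positions, at the cost of carrying null slots; the compact layout saves space but needs the extra index translation.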
There is one last failing test, caused by the fact that I am not handling the
{{NULL_TYPE}} data type when converting the Tajo schema to a
Parquet schema on write. What is {{NULL_TYPE}} used for? I wasn't able to find
much documentation on its use. I can always write this as a placeholder column
or special-case it. Once I fix this, I will post a review request.
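As a rough sketch of the placeholder approach I have in mind (the type names and the choice of an optional binary placeholder are my assumptions, not Tajo's actual converter):

```java
// Hypothetical sketch of special-casing NULL_TYPE during schema
// conversion; not the actual Tajo-to-Parquet converter.
public class SchemaConversionSketch {
    static String toParquetType(String tajoType) {
        switch (tajoType) {
            case "INT4":   return "optional int32";
            case "INT8":   return "optional int64";
            case "FLOAT8": return "optional double";
            case "TEXT":   return "optional binary";
            // Placeholder: NULL_TYPE has no Parquet equivalent, so emit
            // a dummy optional binary column that is always written as
            // null rather than failing the conversion.
            case "NULL_TYPE": return "optional binary";
            default:
                throw new IllegalArgumentException(
                    "unsupported type: " + tajoType);
        }
    }
}
```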
There are some follow-up work items that I plan to do, most likely as review
changes:
* Add TableStats to ParquetAppender.
* Figure out whether ParquetAppender.flush() is needed.
* Additional end-to-end testing.
Thanks,
David
> Parquet Integration
> -------------------
>
> Key: TAJO-30
> URL: https://issues.apache.org/jira/browse/TAJO-30
> Project: Tajo
> Issue Type: New Feature
> Reporter: Hyunsik Choi
> Assignee: David Chen
> Labels: Parquet
>
> Parquet is a very promising file format developed by Twitter. We need to
> investigate the applicability of Parquet. If possible, we should implement a
> Parquet port.
> http://parquet.io/
--
This message was sent by Atlassian JIRA
(v6.2#6252)