[
https://issues.apache.org/jira/browse/TAJO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943833#comment-13943833
]
David Chen edited comment on TAJO-30 at 3/22/14 1:35 AM:
---------------------------------------------------------
Hi Hyunsik,
That's an interesting idea. Do you mean that Tajo will use Parquet as the
default storage format or have all storage formats deserialize into a
representation that follows the Dremel model? Parquet doesn't really have its
own in-memory representation. Each of the Parquet packages basically
deserializes into a given in-memory representation using its readers and
writers. For example, parquet-avro deserializes into Avro GenericRecords (or
SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code
deserializes into Tajo Tuples.
My changes are currently in the {{parquet}} branch in my fork on GitHub:
https://github.com/davidzchen/incubator-tajo/tree/parquet
They are almost ready. During further testing, I found a few more issues, most
of which I have now fixed. One thing I noticed was that when reading a
projection, the resulting Tuple still has all the columns of the table schema
but the non-projected fields are simply null. What is the motivation for
retaining all the columns in the Tuple rather than having the Tuple only
contain the projected columns?
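To illustrate the question, here is a minimal sketch of the two layouts I am contrasting. The class and method names are mine for illustration only, not Tajo APIs; assume a table schema of (id, name, score) with a projection that reads only (id, score):

```java
// Hypothetical sketch of the two projection layouts; not Tajo code.
public class ProjectionSketch {
    // Layout 1 (current behavior): the tuple keeps every schema column,
    // and the non-projected slot (name) is simply null.
    static Object[] fullWidthTuple(Object id, Object score) {
        return new Object[] { id, null, score };
    }

    // Layout 2 (the alternative): the tuple holds only the projected
    // columns, plus an index map from projected position back to the
    // schema position.
    static final int[] PROJECTION_INDEX = { 0, 2 };

    static Object[] compactTuple(Object id, Object score) {
        return new Object[] { id, score };
    }

    // Reading column "score" (schema position 2) under each layout:
    static Object readScoreFull(Object[] t) {
        return t[2]; // schema position works directly
    }

    static Object readScoreCompact(Object[] t) {
        // must translate schema position 2 into the compact position
        for (int i = 0; i < PROJECTION_INDEX.length; i++) {
            if (PROJECTION_INDEX[i] == 2) {
                return t[i];
            }
        }
        return null;
    }
}
```

The full-width layout lets downstream operators keep using schema positions, at the cost of carrying null slots; the compact layout saves space but needs the extra index translation.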
There is one last failing test, caused by the fact that I am not handling the
{{NULL_TYPE}} data type when converting the Tajo schema to a
Parquet schema on write. What is {{NULL_TYPE}} used for? I wasn't able to find
much documentation on its use. I can always write this as a placeholder column
or special-case it. Once I fix this, I will post a review request.
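As a rough sketch of the placeholder approach I have in mind (the type names and the choice of an optional binary placeholder are my assumptions, not Tajo's actual converter):

```java
// Hypothetical sketch of special-casing NULL_TYPE during schema
// conversion; not the actual Tajo-to-Parquet converter.
public class SchemaConversionSketch {
    static String toParquetType(String tajoType) {
        switch (tajoType) {
            case "INT4":   return "optional int32";
            case "INT8":   return "optional int64";
            case "FLOAT8": return "optional double";
            case "TEXT":   return "optional binary";
            // Placeholder: NULL_TYPE has no Parquet equivalent, so emit
            // a dummy optional binary column that is always written as
            // null rather than failing the conversion.
            case "NULL_TYPE": return "optional binary";
            default:
                throw new IllegalArgumentException(
                    "unsupported type: " + tajoType);
        }
    }
}
```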
There are some follow-up work items that I plan to do, most likely as review
changes:
* Add TableStats to ParquetAppender.
* Figure out whether ParquetAppender.flush() is needed.
* Additional end-to-end testing.
Thanks,
David
> Parquet Integration
> -------------------
>
> Key: TAJO-30
> URL: https://issues.apache.org/jira/browse/TAJO-30
> Project: Tajo
> Issue Type: New Feature
> Reporter: Hyunsik Choi
> Assignee: David Chen
> Labels: Parquet
>
> Parquet is a very promising file format developed by Twitter. We need to
> investigate the applicability of Parquet. If possible, we should implement a
> Parquet port.
> http://parquet.io/
--
This message was sent by Atlassian JIRA
(v6.2#6252)