[ 
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260732#comment-13260732
 ] 

Raymie Stata commented on AVRO-806:
-----------------------------------

This is the second attempt at a column-major codec.  The whole goal of 
column-major formats is to optimize performance.  Thus, to drive this exercise 
forward, it seems necessary to have some kind of benchmark for testing.  
(I don't think a micro-benchmark is sufficient; rather, the right benchmark 
runs against a query planner (Hive?) that can take advantage of these formats.)  
With such a benchmark in place, we could compare the existing row-major Avro 
formats, as a baseline, against the various proposed column-major formats to 
make sure that we're getting the kind of performance improvements (2x, 4x or 
more) that would justify the complexity of a column-major format.

Some comments more specific to this proposal: First, I'd like to see the Type 
Mapping section for Avro filled in; this would give us a much better idea of 
what you're trying to do.  Second, at first glance, it seems like your design 
replicates some of the features of RCFiles that the CIF paper claims cause 
performance problems (but, again, maybe this issue is better addressed via 
benchmarking).

Regarding your implementation of this proposal, it re-implements all the 
lower levels of Avro.  It seems like this duplicate implementation will be a 
maintenance problem.
                
> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.7.0
>
>         Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.
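
For readers unfamiliar with the idea in the issue description, here is a
minimal Python sketch of what writing a block in column-major order means. It
is not the encoding used in the attached patches; the function names
(to_column_major, read_fields) and the sample records are hypothetical and
chosen only for illustration. It assumes every record in a block has the same
fields.

```python
def to_column_major(records, field_names):
    """Transpose a block of records into one list of values per field.

    Values of the same field end up adjacent, which is what tends to
    help a general-purpose compressor.
    """
    return {name: [rec[name] for rec in records] for name in field_names}


def read_fields(columns, wanted):
    """Return only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


# A hypothetical block of three records with identical fields.
block = [
    {"id": 1, "country": "US", "score": 0.5},
    {"id": 2, "country": "US", "score": 0.7},
    {"id": 3, "country": "DE", "score": 0.9},
]

columns = to_column_major(block, ["id", "country", "score"])
selected = read_fields(columns, ["country"])
```

A reader interested only in "country" looks at one list and skips the rest,
which is the "efficient skipping of fields" the description refers to. Whether
this pays off in practice is exactly what the benchmark discussed in the
comment would have to show.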


        
