[ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260732#comment-13260732 ]
Raymie Stata commented on AVRO-806:
-----------------------------------
This is the second attempt at a column-major codec. The whole goal of
col-major formats is to optimize performance. Thus, to drive this exercise
forward, it seems necessary to have a benchmark to test against. (I don't
think a micro-benchmark is sufficient; rather, the right benchmark is one run
with a query planner (Hive?) that can take advantage of these formats.) With
such a benchmark in place, we'd compare the performance of the existing
row-major Avro formats (as a baseline) with the various proposed col-major
formats to make sure that we're getting the kind of performance improvements
(2x, 4x or more) that justify the complexity of a col-major format.

Some comments more specific to this proposal: First, I'd like to see the Type
Mapping section for Avro filled in; this would give us a much better idea of
what you're trying to do. Second, at first glance, it seems like your design
replicates some of the features of RCFiles that the CIF paper claims cause
performance problems (but, again, maybe this issue is better addressed via
benchmarking).

Regarding your implementation of this proposal, it re-implements all the
lower levels of Avro. It seems like this duplicate implementation will be a
maintenance problem.

> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Fix For: 1.7.0
>
> Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.