[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448220#comment-13448220
]
Doug Cutting commented on AVRO-806:
-----------------------------------
Jakob, this is discussed in the spec and also in the CIF paper cited there.
http://cutting.github.com/trevni/spec.html
RCFile, with relatively small row groups, can save CPU time when skipping
unprocessed columns, but isn't able to save much i/o time. To save i/o time
you need adjacent sequences of values that are larger than 10MB or so (so that
transfer time >> seek time). When values in some columns are considerably
smaller than others (e.g., an int or long versus a blob) then it can take a lot
of records to get 10MB or more of the smaller values.

Put another way, i/o performance is optimal when there's one row group per unit
of parallelism, so that there's just one seek per column. So a job with 100 map
tasks should ideally have 100 row groups. And we want to localize these map
tasks, so each row group should ideally reside on a single HDFS datanode.

The CIF folks achieve this by having a single-block file per column in the row
group, then using a custom block placement strategy to place these on the same
datanodes. Trevni instead works with stock HDFS by placing all the columns of
each row group in a single file that fits in a single block. Does that make
sense?
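
To put rough numbers on the seek-versus-transfer argument, here is a
back-of-envelope sketch. The seek time, the transfer rate, and the fixed value
widths are illustrative assumptions, not figures from the Trevni spec or the
CIF paper (Avro actually encodes ints and longs with a variable length), and
the function names are made up for this example.

```python
# Illustrative numbers only; none of these come from the Trevni spec.
SEEK_SECONDS = 0.010                         # assumed average disk seek
TRANSFER_BYTES_PER_SEC = 100 * 1024 * 1024   # assumed sequential read rate
TARGET_RUN_BYTES = 10 * 1024 * 1024          # the "10MB or so" from the comment


def records_needed(value_bytes: int, run_bytes: int = TARGET_RUN_BYTES) -> int:
    """Records required before one column holds run_bytes of adjacent values."""
    return -(-run_bytes // value_bytes)  # ceiling division


def seek_fraction(run_bytes: int) -> float:
    """Share of the time to read one column run that is spent seeking."""
    transfer_seconds = run_bytes / TRANSFER_BYTES_PER_SEC
    return SEEK_SECONDS / (SEEK_SECONDS + transfer_seconds)


if __name__ == "__main__":
    # Assumed fixed widths: 4-byte int, 8-byte long, 64KB blob.
    for name, size in [("int", 4), ("long", 8), ("blob", 64 * 1024)]:
        print(f"{name}: {records_needed(size):,} records per 10MB run")
    for run in (64 * 1024, 1024 * 1024, TARGET_RUN_BYTES):
        print(f"{run:>9} byte run: {seek_fraction(run):.1%} of time seeking")
```

Under these assumptions a column of 4-byte ints needs millions of records to
reach a 10MB run while a column of 64KB blobs needs only a few hundred, and a
64KB run spends most of its read time seeking, which is why small row groups
save little i/o.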
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-806.patch, AVRO-806-v2.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.