[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022395#comment-13022395
]
Doug Cutting commented on AVRO-806:
-----------------------------------
The question is not whether the elements of depth > 1 are included, but whether
they're each stored in a distinct column. Regardless, one will read the data
file in the same way, using a schema with a subset of the fields, even if
you're not using the column-major codec at all. So if you have a query that
scans only field x.y.z, then storing values for x.y in a column will still make
things faster than row-major order, but perhaps not as fast as if x.y.z values were
stored in their own column, especially if y has a lot of other fields. Note
that Avro's already fast at skipping string and binary values that are not
desired: it reads the length and increments the buffer pointer. So
column-major will provide the biggest speedup for structures that have a lot of
numeric fields that are often ignored by queries.
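For illustration, here is a minimal Java sketch of that subset-schema read,
using DataFileReader with a reader schema that contains only the fields a
query needs. The file name "data.avro" and the x/y/z field and record names
are just placeholders; the record names would have to match those in the
file's writer schema for resolution to succeed.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectedRead {
  public static void main(String[] args) throws Exception {
    // Reader schema containing only the fields the query needs (here, x.y.z).
    // Writer-schema fields absent from this schema are skipped during decoding.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":[" +
        " {\"name\":\"x\",\"type\":{\"type\":\"record\",\"name\":\"X\",\"fields\":[" +
        "   {\"name\":\"y\",\"type\":{\"type\":\"record\",\"name\":\"Y\",\"fields\":[" +
        "     {\"name\":\"z\",\"type\":\"long\"}]}}]}}]}");

    // The data file supplies the writer schema from its header; the datum
    // reader resolves it against the reader schema above, skipping unwanted
    // fields (for strings/bytes, by reading the length and advancing).
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(null, readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
             new DataFileReader<>(new File("data.avro"), datumReader)) {
      for (GenericRecord rec : fileReader) {
        GenericRecord x = (GenericRecord) rec.get("x");
        GenericRecord y = (GenericRecord) x.get("y");
        System.out.println(y.get("z"));
      }
    }
  }
}

The same read code works unchanged whether or not the file was written with a
column-major codec; the codec only changes how much data must be decoded or
skipped to satisfy the projection.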
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira