[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022298#comment-13022298
]
Doug Cutting commented on AVRO-806:
-----------------------------------
I was thinking of just creating columns for the fields of the top-level
record. In this approach, a union would be written inline as a union, each
value prefixed with a varint indicating the branch taken.

If we instead stored union branches in separate columns then we'd also need a
column containing the varints. Iterators would then use this to decide whether
a column has a value for a given row. For nested unions I think the iterators
would need a list of pointers to varints.
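
As a rough illustration of the inline approach (a sketch, not from any patch;
the class and method names are made up), here is how a single union value
might land in its field's column buffer, with the branch index written as an
Avro-style zig-zag varint:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    // Sketch: one growable buffer per column; names are illustrative.
    public class UnionColumnSketch {
      // Avro's long encoding: zig-zag, then base-128 varint, low bits first.
      static void writeVarint(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);           // zig-zag encode
        while ((z & ~0x7FL) != 0) {
          out.write((int) ((z & 0x7F) | 0x80));  // 7 data bits + continuation bit
          z >>>= 7;
        }
        out.write((int) z);                      // final byte, high bit clear
      }

      // Write one union value: branch index first, then the value's bytes
      // (already encoded against that branch's schema).
      static void writeUnionValue(ByteArrayOutputStream column, int branch,
                                  byte[] valueBytes) throws IOException {
        writeVarint(column, branch);
        column.write(valueBytes);
      }
    }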
The use case is to accelerate scans of a subset of fields. Further
acceleration is possible if things are columnized more deeply, but we probably
want to stop at some fixed depth in each block regardless, so I'm effectively
proposing a depth of 1. Increasing the depth increases the number of buffer
pointers and the complexity of row iteration; I don't have a clear sense of
when that becomes significant. One way to limit the depth would be to specify
a maximum number of columns and use a breadth-first walk of the schema until
that number of columns is reached. However, I wonder whether we're
over-engineering this.
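
For concreteness, that breadth-first walk could look roughly like this (a
sketch against the org.apache.avro.Schema API, assuming the top-level schema
is a record; the stopping rule is my guess at one workable budget check):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.avro.Schema;

    // Sketch: split record fields breadth-first until splitting any further
    // would exceed maxColumns; whatever remains is stored as one column each.
    public class ColumnChooserSketch {
      static List<Schema.Field> chooseColumns(Schema root, int maxColumns) {
        List<Schema.Field> columns = new ArrayList<>();
        Deque<Schema.Field> queue = new ArrayDeque<>(root.getFields());
        while (!queue.isEmpty()) {
          Schema.Field f = queue.remove();
          Schema s = f.schema();
          // Expand a nested record only if the eventual total still fits:
          // every queued field yields at least one column.
          if (s.getType() == Schema.Type.RECORD
              && columns.size() + queue.size() + s.getFields().size()
                  <= maxColumns) {
            queue.addAll(s.getFields());  // its fields become candidates
          } else {
            columns.add(f);               // stored as a single column
          }
        }
        return columns;
      }
    }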
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.