[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022911#comment-13022911
]
Doug Cutting commented on AVRO-806:
-----------------------------------
Thiru, so you're proposing multiple, parallel files for columns? I'm proposing
a single file whose format is as it is today except for the encoding of blocks
of records, which would use a new codec: in addition to the current "null",
"gzip" and "snappy" we'd add a "column" codec. As you note, existing
implementations would not be able to read this until they've implemented this
codec, while, with multiple files, they would. However, I'm not sure that
folks would appreciate the increase in the number of files that parallel
files would create.
Note that this codec could be implemented using the existing compression codec
API: it could accept a buffer of serialized records, then parse the records
using the file's schema, splitting their fields into separate buffers, and
finally appending all of these buffers with an index at the front. This can be
optimized to avoid the extra copy of data.
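The regrouping step described above can be sketched roughly as follows. This
is only an illustration of the idea, not the proposed implementation: the
function names are hypothetical, the records are taken as pre-split per-field
byte strings rather than parsed from the serialized buffer with the file's
schema, and the index format (big-endian 4-byte field count and column
lengths) is an assumption.

```python
import struct

def column_encode(records):
    """Hypothetical sketch of a "column" codec: given a block of records
    (each a list of per-field byte strings, already serialized), regroup
    the data column-major and prepend an index of column lengths.
    A real codec would instead parse the incoming buffer of serialized
    records using the file's schema."""
    num_fields = len(records[0]) if records else 0
    # One buffer per column, concatenating that field's bytes from every record.
    columns = [b"".join(rec[i] for rec in records) for i in range(num_fields)]
    # Index at the front: field count, then each column's length, so a
    # reader can seek directly to the columns it wants and skip the rest.
    index = struct.pack(">I", num_fields) + b"".join(
        struct.pack(">I", len(c)) for c in columns)
    return index + b"".join(columns)

def column_offsets(block):
    """Decode the index: return (num_fields, [(offset, length), ...])."""
    (num_fields,) = struct.unpack_from(">I", block, 0)
    lengths = struct.unpack_from(">%dI" % num_fields, block, 4)
    offsets, pos = [], 4 + 4 * num_fields
    for length in lengths:
        offsets.append((pos, length))
        pos += length
    return num_fields, offsets
```

Because fields of the same type are stored adjacently, a block encoded this
way should compress better, and a reader interested in one field touches only
that field's byte range.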
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.