[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052721#comment-13052721
]
Scott Carey commented on AVRO-806:
----------------------------------
{quote}
For now, unions have a single column, irrespective of the number and type of
branches.
{quote}
I suspect "for now" will turn out to mean forever. This is a binary format, so
I think we should make unions columnar as well from the start. Changing the
layout later while keeping backwards compatibility will be hard and
high-maintenance.
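As a rough illustration of what a columnar union could look like: one column
of branch indexes plus one value column per branch. This layout is an
assumption made for the example, not something the patch or the spec defines,
and the names split_union and merge_union are hypothetical.

```python
# Sketch only: an assumed layout for a columnar union, not the AVRO-806 format.
# A union column is split into a column of branch indexes ("tags") and one
# value column per branch.

def split_union(values, branches):
    """Split union values into a tag column and per-branch value columns.

    `branches` is a list of Python types standing in for union branches;
    type(None) stands in for Avro's "null" branch.
    """
    tags = []
    columns = [[] for _ in branches]
    for value in values:
        # Exact type match, so a bool is never mistaken for an int.
        index = next(i for i, t in enumerate(branches) if type(value) is t)
        tags.append(index)
        if value is not None:  # null carries no payload
            columns[index].append(value)
    return tags, columns


def merge_union(tags, columns):
    """Rebuild the original values from the tag column and branch columns."""
    readers = [iter(column) for column in columns]
    # The null branch's column is always empty, so next(..., None) yields
    # None for it; other branches yield their values in order.
    return [next(readers[tag], None) for tag in tags]


values = [None, 3, "a", 7, None, "b"]
tags, columns = split_union(values, [type(None), int, str])
print(tags)     # [0, 1, 2, 1, 0, 2]
print(columns)  # [[], [3, 7], ['a', 'b']]
```

With a layout like this, each branch column holds values of a single type, so
a reader interested in only one branch can skip the others.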
{quote}
How does this integrate with compression? I suspect we should compress each
column separately, so the compression codec needs to be invoked on each buffer
before it's written. This means that the Encoder must know about the
compression codec.
{quote}
This depends on what the goal is. If we want to avoid decompressing columns
that are not accessed, we will need to compress each column separately.
Otherwise it is not necessary, and compression ratios will be best if large
blocks are compressed as a unit with all columns.
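A minimal sketch of the two options, to make the trade-off concrete. zlib
stands in for whatever codec the file would use, and the column buffers are
made-up data; neither is taken from the patch.

```python
import zlib

# Sketch only: zlib stands in for the real codec, and these buffers are
# hypothetical serialized columns from a single block.
columns = {
    "id": b"".join(i.to_bytes(4, "big") for i in range(1000)),
    "name": b"".join(b"user%04d" % i for i in range(1000)),
    "flag": bytes(i % 2 for i in range(1000)),
}

# Option 1: compress each column buffer separately.
per_column = {name: zlib.compress(buf) for name, buf in columns.items()}

# A reader that only wants "flag" decompresses just that buffer.
flag = zlib.decompress(per_column["flag"])

# Option 2: compress the whole block as a unit.
whole_block = zlib.compress(b"".join(columns.values()))

# Reading any one column now means decompressing all of them first.
block = zlib.decompress(whole_block)
flag_from_block = block[-len(columns["flag"]):]  # "flag" is the last column

# Compare the compressed sizes of the two options for this data.
print(sum(len(buf) for buf in per_column.values()), len(whole_block))
```

Which option compresses better depends on the data and the codec, so the
printed sizes are only meaningful for this particular input.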
{quote}
and Snappy's fast enough that using it all of the time doesn't cost much.
{quote}
Yes, the CRC currently used costs more than Snappy.
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.