[ 
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052566#comment-13052566
 ] 

Doug Cutting commented on AVRO-806:
-----------------------------------

Thiru, this looks great.  Some issues we still need to resolve:
 - How does one specify this format to DataFileWriter?  Perhaps rather than 
extending the codec API we might add EncoderFactory.getEncoderNamed(String) and 
DecoderFactory.getDecoderNamed(String)?  Then we can add a setEncoding(String) 
method to DataFileWriter?
 - How does this integrate with compression?  I suspect we should compress each 
column separately, so the compression codec needs to be invoked on each buffer 
before it's written.  This means that the Encoder must know about the 
compression codec.
 - How is the format indicated in the file itself?  While it may not have made 
sense to implement this as a codec, it might make sense to use the "avro.codec" 
metadata field, as readers should already check this.  We might use, e.g., 
values like "column+snappy".
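
To make the second and third points concrete, here is a minimal sketch, not 
taken from the patch.  It compresses each column buffer separately before it 
would be written, and splits a metadata value such as "column+snappy" into its 
parts.  The class and method names are hypothetical, reading "+" as a separator 
is an assumption based only on the example value above, and java.util.zip's 
Deflater stands in for Snappy so that the example needs nothing beyond the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: none of these names exist in Avro.
class ColumnBlockSketch {

    // Compress a single column buffer on its own.
    // Assumption: Deflate is used here only as a stand-in for Snappy.
    static byte[] compress(byte[] column) {
        Deflater deflater = new Deflater();
        deflater.setInput(column);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inverse of compress(); the caller supplies the uncompressed length.
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int off = 0;
        try {
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, off, originalLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input; stop rather than spin
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt column buffer", e);
        } finally {
            inflater.end();
        }
        return result;
    }

    // Invoke the compression codec once per column buffer, so that values
    // of the same field are compressed together.
    static byte[][] compressColumns(byte[][] columns) {
        byte[][] packed = new byte[columns.length][];
        for (int i = 0; i < columns.length; i++) {
            packed[i] = compress(columns[i]);
        }
        return packed;
    }

    // Split a metadata value such as "column+snappy" at the first '+'.
    // Assumption: the part before '+' names the encoding, the rest the codec.
    static String[] splitCodec(String value) {
        return value.split("\\+", 2);
    }

    public static void main(String[] args) {
        byte[][] columns = {
            "alice,bob,carol,dave".getBytes(StandardCharsets.UTF_8), // a "name" column
            new byte[] {30, 30, 30, 31}                               // an "age" column
        };
        byte[][] packed = compressColumns(columns);
        for (int i = 0; i < packed.length; i++) {
            byte[] restored = decompress(packed[i], columns[i].length);
            System.out.println("column " + i + " round-trips: "
                + Arrays.equals(restored, columns[i]));
        }
        System.out.println(Arrays.toString(splitCodec("column+snappy")));
    }
}
```

A real implementation would also need to record each column's compressed and 
uncompressed lengths in the block so that a reader can skip columns it does not 
need; the sketch leaves that out.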

I think it would be perfectly acceptable if the initial version only supported 
a particular compression codec, Snappy, and that compression codec was always 
turned on.  A big advantage of the column representation should be improved 
compression, and Snappy's fast enough that using it all of the time doesn't 
cost much.

> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
