[ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022911#comment-13022911 ]

Doug Cutting commented on AVRO-806:
-----------------------------------

Thiru, so you're proposing multiple, parallel files for columns?  I'm proposing 
a single file whose format is as it is today except for the encoding of blocks 
of records, which would use a new codec: in addition to the current "null", 
"gzip" and "snappy" we'd add a "column" codec.  As you note, existing 
implementations would not be able to read such files until they've implemented 
this codec, whereas with multiple files they could.  However, I'm not sure that 
folks would appreciate the increase in the number of files that the 
parallel-file approach would create.

Note that this codec could be implemented using the existing compression codec 
API: it could accept a buffer of serialized records, then parse the records 
using the file's schema, splitting their fields into separate buffers, and 
finally appending all of these buffers with an index at the front.  This can be 
optimized to avoid the extra copy of data.
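To make the buffer-splitting idea concrete, here is a minimal sketch (in Python, for illustration only; the real codec would be implemented against Avro's Java codec API and would reuse Avro's binary encoding per field).  The record layout, length prefixes, and index format below are all hypothetical assumptions, not the proposed wire format:

```python
import struct

def encode_column_block(records, fields):
    """Split a block of records into per-field (column) buffers and
    prepend an index, mirroring the codec described above.
    Assumption: values are encoded as length-prefixed UTF-8 strings;
    a real codec would use Avro's binary encoding for each field."""
    columns = []
    for f in fields:
        buf = bytearray()
        for rec in records:
            val = str(rec[f]).encode("utf-8")
            buf += struct.pack(">I", len(val)) + val
        columns.append(bytes(buf))
    # Index at the front: column count, then each column's byte length.
    index = struct.pack(">I", len(columns)) + b"".join(
        struct.pack(">I", len(c)) for c in columns)
    return index + b"".join(columns)

def read_column(block, col):
    """Decode a single column, using the index to skip the others --
    the 'efficient skipping of fields' the issue describes."""
    (ncols,) = struct.unpack_from(">I", block, 0)
    lengths = [struct.unpack_from(">I", block, 4 + 4 * i)[0]
               for i in range(ncols)]
    offset = 4 + 4 * ncols + sum(lengths[:col])
    end = offset + lengths[col]
    values = []
    while offset < end:
        (n,) = struct.unpack_from(">I", block, offset)
        offset += 4
        values.append(block[offset:offset + n].decode("utf-8"))
        offset += n
    return values
```

For example, after `block = encode_column_block([{"name": "ann", "age": "30"}, {"name": "bob", "age": "25"}], ["name", "age"])`, calling `read_column(block, 1)` decodes only the "age" column without touching the "name" bytes.  Grouping a column's values together is also what would let a general-purpose compressor exploit their similarity.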

> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
