[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022298#comment-13022298
]
Doug Cutting commented on AVRO-806:
-----------------------------------
I was thinking of just creating columns for the fields of the top-level
record. In this approach, a union would be written inline as a union, each
value prefixed with a varint indicating the branch taken.

If we instead stored union branches in separate columns then we'd also need a
column containing the varints. Iterators would then use this to decide whether
a column has a value for a given row. For nested unions I think the iterators
would need a list of pointers to varints.
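
As a rough illustration of the inline approach (a sketch, not from any patch;
the class and method names are made up), here is how a single union value
might land in its field's column buffer, with the branch index written as an
Avro-style zig-zag varint:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    // Sketch: one growable buffer per column; names are illustrative.
    public class UnionColumnSketch {
      // Avro's long encoding: zig-zag, then base-128 varint, low bits first.
      static void writeVarint(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);           // zig-zag encode
        while ((z & ~0x7FL) != 0) {
          out.write((int) ((z & 0x7F) | 0x80));  // 7 data bits + continuation bit
          z >>>= 7;
        }
        out.write((int) z);                      // final byte, high bit clear
      }

      // Write one union value: branch index first, then the value's bytes
      // (already encoded against that branch's schema).
      static void writeUnionValue(ByteArrayOutputStream column, int branch,
                                  byte[] valueBytes) throws IOException {
        writeVarint(column, branch);
        column.write(valueBytes);
      }
    }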
The use case is to accelerate scans of a subset of fields. Further
acceleration is possible if things are columnized more deeply, but we probably
want to stop at some fixed depth in each block regardless, so I'm effectively
proposing a depth of 1. Increasing the depth increases the number of buffer
pointers and the complexity of row iteration; I don't have a clear sense of
when that becomes significant. One way to limit the depth would be to specify
a maximum number of columns and use a breadth-first walk of the schema until
that number of columns is reached. However, I wonder whether we're
over-engineering this.
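
For concreteness, that breadth-first walk could look roughly like this (a
sketch against the org.apache.avro.Schema API, assuming the top-level schema
is a record; the stopping rule is my guess at one workable budget check):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.avro.Schema;

    // Sketch: split record fields breadth-first until splitting any further
    // would exceed maxColumns; whatever remains is stored as one column each.
    public class ColumnChooserSketch {
      static List<Schema.Field> chooseColumns(Schema root, int maxColumns) {
        List<Schema.Field> columns = new ArrayList<>();
        Deque<Schema.Field> queue = new ArrayDeque<>(root.getFields());
        while (!queue.isEmpty()) {
          Schema.Field f = queue.remove();
          Schema s = f.schema();
          // Expand a nested record only if the eventual total still fits:
          // every queued field yields at least one column.
          if (s.getType() == Schema.Type.RECORD
              && columns.size() + queue.size() + s.getFields().size()
                  <= maxColumns) {
            queue.addAll(s.getFields());  // its fields become candidates
          } else {
            columns.add(f);               // stored as a single column
          }
        }
        return columns;
      }
    }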
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.