[jira] [Commented] (AVRO-806) add a column-major codec for data files

Thiruvalluvan M. G. (JIRA) Wed, 20 Apr 2011 19:02:47 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022593#comment-13022593
 ]


Thiruvalluvan M. G. commented on AVRO-806:
------------------------------------------

It would be nice to make each column datafile compatible with the current 
datafile standard (using the column's schema, of course). A "meta" datafile 
would then serve as a catalog of its constituent column datafiles. A single 
datafile is, by definition, its own catalog. This approach, which I suppose is 
not incompatible with the original proposal, has some advantages:

   - We can reuse almost all the current code.
   - Whether to split columns into further columns is left entirely to the 
writer, as both are valid.

However, this approach makes splitting unions harder. We have to cater to 
"holes" in the branch columns. With a little space overhead, we can address 
the problem: [A, B, C] is split into [null, A], [null, B] and [null, C]. (If 
A, B, or C is null, there won't be a corresponding column datafile.) The 
"holes" in a column are then encoded as the null branch, so we do not need a 
separate branch-index column.
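
A minimal sketch of the union split described above, in Python rather than the 
Avro Java API. The names split_union and merge_union are hypothetical, and the 
representation of a union value as a (branch_name, datum) pair is an 
assumption made only for illustration. It also assumes that a datum of a 
non-null branch is never None.

```python
# Hypothetical helpers, not part of Avro: split a column of union values into
# one nullable column per non-null branch, and merge them back.

def split_union(values, branches):
    """values: list of (branch_name, datum) pairs.
    branches: ordered branch names; a "null" branch gets no column of its own.
    Returns {branch_name: [datum or None, ...]}, one entry per row."""
    columns = {b: [] for b in branches if b != "null"}
    for branch, datum in values:
        for name, col in columns.items():
            # A "hole" (this row used another branch) is stored as None,
            # i.e. the null branch of the [null, X] column.
            col.append(datum if name == branch else None)
    return columns

def merge_union(columns, row_count):
    """Rebuild the (branch_name, datum) pairs without a branch-index column."""
    rows = []
    for i in range(row_count):
        hits = [(name, col[i]) for name, col in columns.items()
                if col[i] is not None]
        # No branch column holds a value: the row was the null branch.
        rows.append(hits[0] if hits else ("null", None))
    return rows

values = [("A", 1), ("B", "x"), ("null", None), ("C", 2.5)]
columns = split_union(values, ["A", "B", "C", "null"])
restored = merge_union(columns, len(values))
```

The space overhead mentioned above shows up here as the None entries: every 
branch column has one entry per row, whether or not that row used the branch.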

An array of record { A, B, C } can be decomposed into three columns: array of 
A, array of B, and array of C. The idea can be extended to maps.
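
A minimal sketch of this decomposition, assuming records are plain Python 
dicts and each row holds one array of records. The function names are 
hypothetical. Because each per-field array keeps the length of the original 
array, the rows can be rebuilt without a separate length column.

```python
# Hypothetical helpers: decompose rows of array-of-record into one
# array-valued column per field, and reassemble them.

def split_array_of_records(rows, fields):
    """rows: list of arrays, each array a list of dict records.
    Returns {field: [array of that field's values, one array per row]}."""
    return {f: [[rec[f] for rec in arr] for arr in rows] for f in fields}

def join_array_of_records(columns, fields):
    """Inverse of split_array_of_records."""
    row_count = len(columns[fields[0]])
    rows = []
    for i in range(row_count):
        per_field = [columns[f][i] for f in fields]
        # zip walks the per-field arrays in step, one record at a time.
        rows.append([dict(zip(fields, vals)) for vals in zip(*per_field)])
    return rows

fields = ["A", "B", "C"]
rows = [
    [{"A": 1, "B": "x", "C": True}, {"A": 2, "B": "y", "C": False}],
    [],  # an empty array stays an empty array in every column
]
columns = split_array_of_records(rows, fields)
rebuilt = join_array_of_records(columns, fields)
```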

The good thing about this is that it defines a recursive column-splitting 
algorithm, and it is flexible. For example, if B and C of a record {A, B, C} 
always come together, we can split it into A and {B, C}.
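
A minimal sketch of a writer-chosen split, assuming records are Python dicts 
and the grouping is passed in as a list of field-name lists. The name 
split_record and the grouping format are hypothetical. A group with more than 
one field stays a (smaller) record, which could itself be split again.

```python
# Hypothetical helper: split rows of a record into columns according to a
# grouping chosen by the writer.

def split_record(rows, groups):
    """rows: list of dict records.
    groups: list of field-name lists, e.g. [["A"], ["B", "C"]].
    Returns one column per group; each column holds one sub-record per row."""
    return [[{f: row[f] for f in group} for row in rows] for group in groups]

rows = [{"A": 1, "B": 2, "C": 3}, {"A": 4, "B": 5, "C": 6}]

# B and C always come together, so keep them in one column.
col_a, col_bc = split_record(rows, [["A"], ["B", "C"]])

# Splitting fully is equally valid; the choice is the writer's.
col_a2, col_b, col_c = split_record(rows, [["A"], ["B"], ["C"]])
```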

On another point, while it may be a good strategy for the writer to write 
column blocks of data in lock-step fashion, the reader should not rely on it. 
This allows different columns to be written at different points in time, which 
should perform better because different columns have different sizes and 
compress differently. But with this, skipping to the next sync on one column 
could lead to skipping into the middle of another column. That could be 
challenging.

> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
