[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022593#comment-13022593
]
Thiruvalluvan M. G. commented on AVRO-806:
------------------------------------------
It would be nice if we could make the column datafile compatible with the
current datafile standard (with the column's schema, of course). There would be
a "meta" datafile serving as a catalog of its constituent column datafiles. A
single datafile is, by definition, its own catalog. This approach, which I
suppose is not incompatible with the original proposal, has some advantages:
- We can reuse almost all the current code.
- Whether to split columns into further columns is left entirely to the
writer, as both are valid.
However, this approach will make splitting unions harder. We have to cater to
"holes" in the branch columns. With a little bit of space overhead, we can
address the problem: [A, B, C] will be split into [null, A], [null, B] and
[null, C]. (If A, B, or C is null, there won't be a corresponding column
datafile.) The "holes" in a column will then be encoded as the null branch. We
do not need a separate branch-index column.
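To make the union idea concrete, here is a minimal sketch in plain Python (no
Avro library involved). The names split_union, merge_union and branch_of are
hypothetical, and branches are identified by Python type name purely for
illustration. Each non-null branch gets its own nullable column, and a row that
chose another branch leaves a None "hole".

```python
def branch_of(value):
    # Hypothetical branch resolver: the Python type name stands in for the
    # union branch; None maps to the "null" branch.
    return "null" if value is None else type(value).__name__


def split_union(values):
    """Split one union column into a nullable column per non-null branch."""
    branches = []
    for v in values:
        b = branch_of(v)
        if b != "null" and b not in branches:
            branches.append(b)
    columns = {b: [] for b in branches}
    for v in values:
        chosen = branch_of(v)
        for b in branches:
            # Rows that picked a different branch leave a null "hole" here.
            columns[b].append(v if b == chosen else None)
    return columns


def merge_union(columns):
    """Rebuild the union column; a row that is null everywhere was null.

    Assumes at least one branch column exists; with no columns the row
    count would have to come from elsewhere.
    """
    rows = zip(*columns.values())
    return [next((v for v in row if v is not None), None) for row in rows]
```

For example, split_union([1, "x", None, 2.5]) yields three columns of four
entries each, and merge_union on the result gives back the original list. No
branch-index column is needed because at most one column is non-null per row.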
An array of record { A, B, C } can be decomposed into three columns: array of A,
array of B, and array of C. The idea can be extended to maps.
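The array decomposition can be sketched as follows, again in plain Python with
hypothetical helper names (split_array_of_records, join_array_of_records) and
with records modelled as dicts. Each row holds one array of records; each
output column holds, per row, the array of one field's values.

```python
def split_array_of_records(rows, fields):
    """Turn rows of [record, ...] into one array-valued column per field."""
    return {f: [[rec[f] for rec in arr] for arr in rows] for f in fields}


def join_array_of_records(columns):
    """Inverse of split_array_of_records: zip the per-field arrays back."""
    fields = list(columns)
    n_rows = len(columns[fields[0]])
    out = []
    for i in range(n_rows):
        arrays = [columns[f][i] for f in fields]
        # All per-field arrays in a row have the same length by construction.
        out.append([dict(zip(fields, vals)) for vals in zip(*arrays)])
    return out
```

Because every per-field array in a given row has the same length, the array
lengths are stored once per column, which is part of the space overhead of this
scheme.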
The good thing about this is that it defines a recursive column-splitting
algorithm, and it is flexible. For example, if B and C of a record {A, B, C}
always come together, we can split it as A and {B, C}.
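A rough sketch of the writer-chosen grouping, with a record schema modelled as
a dict from field name to type name. The function split_record_schema and the
groups parameter are hypothetical; the point is only that any partition of the
fields is a valid split, and that a resulting group can be split again.

```python
def split_record_schema(fields, groups):
    """Partition a record's fields into column groups chosen by the writer.

    fields: dict mapping field name to a type name.
    groups: list of tuples of field names; together they must cover every
            field exactly once.
    """
    named = [name for group in groups for name in group]
    if sorted(named) != sorted(fields):
        raise ValueError("groups must cover every field exactly once")
    return [{name: fields[name] for name in group} for group in groups]


record = {"A": "int", "B": "string", "C": "long"}

# B and C always come together, so keep them in one column datafile.
a_col, bc_col = split_record_schema(record, [("A",), ("B", "C")])

# The recursive step: the {B, C} group may itself be split further.
b_col, c_col = split_record_schema(bc_col, [("B",), ("C",)])
```

Both the two-column and the three-column layouts describe the same data, which
is why the choice can be left to the writer.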
On another point, while it may be a good strategy for the writer to write
column blocks of data in lock-step fashion, the reader should not rely on it.
Not relying on it allows different columns to be written at different points in
time. This will perform better because different columns have different sizes
and compress differently. But with this, skipping to the next sync on one
column could lead to skipping to the middle of another column. That could be
challenging.
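One possible way around the alignment problem, sketched under the assumption
that the reader knows how many values each block holds (Avro datafile blocks do
record an object count): align columns by row number rather than by block
position. Columns are modelled as lists of blocks, and the names read_rows and
locate are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate, chain


def read_rows(columns):
    """Reassemble rows from columns whose block boundaries do not line up."""
    names = list(columns)
    streams = [chain.from_iterable(columns[n]) for n in names]
    for values in zip(*streams):
        yield dict(zip(names, values))


def locate(blocks, row):
    """Map a global row number to (block index, offset inside that block)."""
    ends = list(accumulate(len(b) for b in blocks))  # cumulative row counts
    i = bisect_right(ends, row)
    if i == len(blocks):
        raise IndexError("row out of range")
    start = ends[i - 1] if i else 0
    return i, row - start


# Column A was flushed in blocks of 3 and 1 rows, column B in 1, 2 and 1.
columns = {"A": [[1, 2, 3], [4]], "B": [["w"], ["x", "y"], ["z"]]}
```

Here locate(columns["A"], 2) is (0, 2) while locate(columns["B"], 2) is (1, 1):
the same row sits in different blocks of each column, so after a seek each
column has to be positioned independently.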
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.