[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022253#comment-13022253
]
Douglas Creager commented on AVRO-806:
--------------------------------------
Avro's formalism is a bit different than protobuf's, so we wouldn't be able to
use Dremel's data model as-is. But it's definitely a good start.
My hunch is that we'd want each columnar chunk to store only primitive values,
so we'd have to break down a complex type into its primitive pieces. For most
schemas, there's a unique name that identifies each primitive piece. So if you
have
{code}
record Test {
  int a;
  union {float, string} b;
  long c;
  array<double> d;
  map<string> e;
  record Subtest {
    boolean f;
    bytes g;
  } sub1;
  union {null, Subtest} sub2;
}
{code}
then your primitive names would be
{code}
a
b.float
b.string
c
d.[]
e.{}
sub1.f
sub1.g
sub2.null
sub2.Subtest.f
sub2.Subtest.g
{code}
There are only four compound types in Avro: For records, there's a child for
each field. For arrays, there's a single child, which I called []. For maps,
it's {}. For unions, there's a child for each branch, with the same name as
the branch's schema.
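
As a rough sketch of the naming rules above, here is one way the primitive
names could be enumerated from a schema written in Avro's JSON form. This is
not part of any Avro API: the function names, the choice to treat enum and
fixed as leaves, and the decision to simply reject recursive types are all
assumptions made for illustration.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def _join(prefix, part):
    """Append one path component, omitting the dot at the top level."""
    return f"{prefix}.{part}" if prefix else part


def _branch_name(schema):
    """Name of a union branch: the type name, or the named type's own name."""
    if isinstance(schema, str):
        return schema
    return schema.get("name", schema["type"])


def column_names(schema, prefix="", named=None, stack=()):
    """Return the dotted name of every primitive piece of `schema`.

    `named` remembers record definitions so later references by name resolve;
    `stack` holds the records currently being expanded, to detect recursion.
    """
    named = {} if named is None else named
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            return [prefix]
        # Otherwise it is a reference to a previously defined named type.
        return column_names(named[schema], prefix, named, stack)
    if isinstance(schema, list):
        # Union: one child per branch, named after the branch's schema.
        out = []
        for branch in schema:
            out += column_names(branch, _join(prefix, _branch_name(branch)), named, stack)
        return out
    kind = schema["type"]
    if kind == "record":
        name = schema["name"]
        if name in stack:
            # Assumption: recursive types are rejected rather than handled.
            raise ValueError(f"recursive type {name!r} is not handled in this sketch")
        named[name] = schema
        out = []
        for field in schema["fields"]:
            out += column_names(field["type"], _join(prefix, field["name"]), named, stack + (name,))
        return out
    if kind == "array":
        return column_names(schema["items"], _join(prefix, "[]"), named, stack)
    if kind == "map":
        return column_names(schema["values"], _join(prefix, "{}"), named, stack)
    # Assumption: enum, fixed, and {"type": "int"}-style primitives are leaves.
    return [prefix]


# The Test record from the example, in Avro's JSON schema form.
SUBTEST = {"type": "record", "name": "Subtest", "fields": [
    {"name": "f", "type": "boolean"},
    {"name": "g", "type": "bytes"},
]}
TEST = {"type": "record", "name": "Test", "fields": [
    {"name": "a", "type": "int"},
    {"name": "b", "type": ["float", "string"]},
    {"name": "c", "type": "long"},
    {"name": "d", "type": {"type": "array", "items": "double"}},
    {"name": "e", "type": {"type": "map", "values": "string"}},
    {"name": "sub1", "type": SUBTEST},
    {"name": "sub2", "type": ["null", "Subtest"]},
]}

if __name__ == "__main__":
    print("\n".join(column_names(TEST)))
```

Run on the Test record, this should produce the same eleven names listed
above, in the same order. A self-referencing record raises an error instead,
since the handling of recursion is still undecided.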
The only issue is with recursive types, since for those, the nesting can be
arbitrarily deep at runtime. We can probably add a “nesting level” to handle
these, but we'll have to hammer out the details.
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.