[jira] [Commented] (AVRO-806) add a column-major codec for data files

Douglas Creager (JIRA) Wed, 20 Apr 2011 10:45:45 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022253#comment-13022253
 ]


Douglas Creager commented on AVRO-806:
--------------------------------------

Avro's formalism is a bit different than protobuf's, so we wouldn't be able to 
use Dremel's data model as-is.  But it's definitely a good start.

My hunch is that we'd want each columnar chunk to only store primitive values, 
so we'd have to break down a complex type into its primitive pieces.  For most 
schemas, there's a unique name that identifies each primitive piece.  So if you 
have

{code}
record Test {
  int  a;
  union {float, string}  b;
  long  c;
  array<double>  d;
  map<string>  e;
  record Subtest {
    boolean f;
    bytes g;
  } sub1;
  union {null, Subtest}  sub2;
}
{code}

then your primitive names would be

{code}
a
b.float
b.string
c
d.[]
e.{}
sub1.f
sub1.g
sub2.null
sub2.Subtest.f
sub2.Subtest.g
{code}

Avro has only four compound types, and each one names its children 
differently.  For records, there's a child for each field, named after the 
field.  For arrays, there's a single child, which I called [].  For maps, the 
single child is {}.  For unions, there's a child for each branch, with the same 
name as the branch's schema.
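
As a rough sketch of those naming rules (not part of any Avro library), the 
walk below takes a schema in Avro's JSON form, already parsed into Python 
dicts and lists, and yields one dotted name per primitive piece.  The names 
primitive_columns, branch_name, and TEST_SCHEMA are hypothetical.  It makes 
three assumptions that go beyond the comment above: enum and fixed are treated 
as leaves, namespaces are ignored, and a recursive type is rejected outright 
because the nesting-level handling is still undecided.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def branch_name(schema):
    """Name of a union branch: the primitive or named type, else the type keyword."""
    if isinstance(schema, str):
        return schema
    return schema.get("name", schema["type"])


def primitive_columns(schema, path=(), named=None, active=frozenset()):
    """Yield a dotted name for every primitive piece of a parsed Avro JSON schema."""
    if named is None:
        named = {}  # named types seen so far, so later references can be resolved
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            yield ".".join(path)
        else:
            # Reference to a previously defined named type (e.g. "Subtest").
            yield from primitive_columns(named[schema], path, named, active)
    elif isinstance(schema, list):
        # Union: one child per branch, named after the branch's schema.
        for branch in schema:
            yield from primitive_columns(
                branch, path + (branch_name(branch),), named, active)
    elif schema["type"] == "record":
        name = schema["name"]
        if name in active:
            # Assumption: recursive types are unsupported in this sketch.
            raise ValueError("recursive type: " + name)
        named[name] = schema
        for field in schema["fields"]:
            yield from primitive_columns(
                field["type"], path + (field["name"],), named, active | {name})
    elif schema["type"] == "array":
        yield from primitive_columns(schema["items"], path + ("[]",), named, active)
    elif schema["type"] == "map":
        yield from primitive_columns(schema["values"], path + ("{}",), named, active)
    elif schema["type"] in ("enum", "fixed"):
        # Assumption: enum and fixed have no children, so treat them as leaves.
        named[schema["name"]] = schema
        yield ".".join(path)
    else:
        # Wrapped form such as {"type": "string"}.
        yield from primitive_columns(schema["type"], path, named, active)


# The Test record from above, written as Avro JSON.
TEST_SCHEMA = {
    "type": "record", "name": "Test",
    "fields": [
        {"name": "a", "type": "int"},
        {"name": "b", "type": ["float", "string"]},
        {"name": "c", "type": "long"},
        {"name": "d", "type": {"type": "array", "items": "double"}},
        {"name": "e", "type": {"type": "map", "values": "string"}},
        {"name": "sub1", "type": {
            "type": "record", "name": "Subtest",
            "fields": [
                {"name": "f", "type": "boolean"},
                {"name": "g", "type": "bytes"},
            ]}},
        {"name": "sub2", "type": ["null", "Subtest"]},
    ],
}

if __name__ == "__main__":
    for column in primitive_columns(TEST_SCHEMA):
        print(column)
```

Run against TEST_SCHEMA, this should print the same eleven names listed 
above, in the same order.  Raising on a recursive reference is only a 
placeholder until the nesting-level details are worked out.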

The only issue is with recursive types, since for those, the nesting can be 
arbitrarily deep at runtime.  We can probably add a “nesting level” to handle 
these, but we'll have to hammer out the details.

> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
