[ 
https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919886#action_12919886
 ] 

Stu Hood commented on AVRO-679:
-------------------------------

> Adding a new fundamental type or encoding is hard to do compatibly.
Agreed, but this particular optimization is only possible with support in Avro 
itself, and it opens up a lot of other interesting possibilities. For instance, 
in your prefix encoding example, encoding a block of <int,string,long> as a 
record {{array<int>, array<long>, array<string>}} might give a 3-6x increase in 
decode speed (based on the numbers suggested in the link).
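
To make the column-wise layout concrete, here is a rough sketch in Python. It 
is not Avro API: ColumnBlock, to_columns and to_rows are hypothetical names, and 
the sketch only shows the regrouping of values, not any wire encoding or the 
speedup itself.

```python
from typing import List, NamedTuple, Tuple


class ColumnBlock(NamedTuple):
    """Hypothetical column-wise form of a block of <int, string, long> rows."""
    ints: List[int]
    longs: List[int]
    strings: List[str]


def to_columns(rows: List[Tuple[int, str, int]]) -> ColumnBlock:
    """Regroup row-wise records so that each field's values are contiguous."""
    ints = [row[0] for row in rows]
    strings = [row[1] for row in rows]
    longs = [row[2] for row in rows]
    return ColumnBlock(ints=ints, longs=longs, strings=strings)


def to_rows(block: ColumnBlock) -> List[Tuple[int, str, int]]:
    """Inverse of to_columns: rebuild the original <int, string, long> rows."""
    return list(zip(block.ints, block.strings, block.longs))
```

Once the values are grouped like this, each column could be handed to an 
encoder specialized for its type, which is where the array encodings discussed 
in this issue would apply.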

It is also worth considering how the specification can evolve in a 
backwards-compatible way: perhaps the next revision of the specification could 
require a 'spec revision' number to be present in all schemas, and a schema 
without one would be assumed to be in the legacy format? This would allow 
readers and writers to communicate across spec revision boundaries by disabling 
optimizations/encodings that the other side does not support.
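
A minimal sketch of how such a negotiation might look, in Python. Everything 
here is an assumption for illustration: the "spec_rev" attribute name, the 
legacy value of 0, and the encoding names are hypothetical and are not part of 
the Avro specification.

```python
import json
from typing import Set

LEGACY_REV = 0  # assumed meaning of "no revision number present"

# Hypothetical optional encodings, mapped to the revision that introduced them.
ENCODING_MIN_REV = {
    "group_varint_arrays": 1,
    "length_prefixed_string_blocks": 1,
}


def spec_rev(schema_json: str) -> int:
    """Return the schema's revision, treating a missing number as legacy."""
    schema = json.loads(schema_json)
    if isinstance(schema, dict):
        return int(schema.get("spec_rev", LEGACY_REV))
    # Bare type names and unions carry no attributes, so assume legacy.
    return LEGACY_REV


def usable_encodings(writer_schema: str, reader_schema: str) -> Set[str]:
    """Keep only the encodings that both sides' revisions support."""
    common = min(spec_rev(writer_schema), spec_rev(reader_schema))
    return {name for name, rev in ENCODING_MIN_REV.items() if rev <= common}
```

With this shape, a new writer talking to a legacy reader ends up with an empty 
set of optional encodings and falls back to the existing format.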

> One might automatically rewrite schemas and have a layer that transforms 
> datastructures accordingly?
Yes: there is probably room for a schema translation layer above Avro for 
things like RLE / prefix encoding, but I think it is a separate area of focus.

> Improved encodings for arrays
> -----------------------------
>
>                 Key: AVRO-679
>                 URL: https://issues.apache.org/jira/browse/AVRO-679
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Stu Hood
>            Priority: Minor
>
> There are better ways to encode arrays of varints [1] which are faster to 
> decode, and more space efficient than encoding varints independently.
> Extending the idea to other types of variable length data like 'bytes' and 
> 'string', you could encode the entries for an array block as an array of 
> lengths, followed by contiguous byte/utf8 data.
> [1] group varint encoding: slides 57-63 of 
> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
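
The "array of lengths, followed by contiguous byte/utf8 data" layout from the 
description above can be sketched as follows. This is an illustration under 
stated assumptions, not a proposed wire format: lengths are written as plain 
base-128 varints (not Avro's zigzag longs, and not group varint), and the 
function names are hypothetical.

```python
from typing import List, Tuple


def write_varint(n: int, out: bytearray) -> None:
    """Append a non-negative int as a plain base-128 varint (no zigzag)."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_strings(items: List[str]) -> bytes:
    """Write the count, then all lengths, then all UTF-8 payloads back to back."""
    payloads = [s.encode("utf-8") for s in items]
    out = bytearray()
    write_varint(len(payloads), out)
    for payload in payloads:
        write_varint(len(payload), out)
    for payload in payloads:
        out += payload
    return bytes(out)


def decode_strings(buf: bytes) -> List[str]:
    """Read all lengths first, then slice the contiguous payload region."""
    count, pos = read_varint(buf, 0)
    lengths = []
    for _ in range(count):
        length, pos = read_varint(buf, pos)
        lengths.append(length)
    items = []
    for length in lengths:
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items
```

Because the lengths sit together at the front, they form exactly the kind of 
integer array that a group varint style encoding could then be applied to.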
