As a very interested future user of Avro (and potential minor contributor), I 
was wondering if there would be more than one possible way to serialize an int.

In particular, a large number of use cases involve only positive integers, or 
expect only very small negative numbers to exist.  Is a positive-biased 
serialization of much use, or is it more trouble than it's worth?
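
To make that concrete, here is a rough sketch of the two encodings I have
in mind: a plain base-128 varint for values known to be non-negative,
versus a zig-zag varint (the trick Protocol Buffers uses for signed ints)
that keeps small negative numbers short.  The class and method names are
mine, purely for illustration, not anything in Avro:

// Sketch only: a plain ("positive-biased") varint versus a zig-zag varint.
// Class and method names are mine for illustration; this is not Avro's API.
import java.io.ByteArrayOutputStream;

public class VarintSketch {

  // Plain base-128 varint: one byte per 7 bits, so values 0..127 take one
  // byte.  Callers must guarantee n >= 0; a negative long would emit 10 bytes.
  static void writeUnsignedVarint(long n, ByteArrayOutputStream out) {
    while ((n & ~0x7FL) != 0) {
      out.write((int) ((n & 0x7F) | 0x80)); // low 7 bits + continuation bit
      n >>>= 7;
    }
    out.write((int) n); // final byte, continuation bit clear
  }

  // Zig-zag maps small-magnitude signed values to small unsigned ones
  // (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...) before the varint step, so small
  // negatives stay short, but every non-negative value doubles first.
  static void writeZigZagVarint(long n, ByteArrayOutputStream out) {
    writeUnsignedVarint((n << 1) ^ (n >> 63), out);
  }

  public static void main(String[] args) {
    ByteArrayOutputStream plain = new ByteArrayOutputStream();
    ByteArrayOutputStream zigzag = new ByteArrayOutputStream();
    writeUnsignedVarint(100, plain); // 1 byte: 100 < 128
    writeZigZagVarint(100, zigzag);  // 2 bytes: zig-zag doubles 100 to 200
    System.out.println(plain.size() + " byte vs " + zigzag.size() + " bytes");
  }
}

Values in the 64..127 range are exactly where the two diverge: one byte
plain, two bytes after zig-zag.  That gap is what I am asking about.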

In general, having a couple of serialization options for other types that 
currently have only a single, fixed encoding might be useful as well.  
"encoding": "bias-unsigned"?
Given knowledge of the data stored, I can see some variation in what a user 
would want for float, double, string, datetime, or int with respect to 
space / time tradeoffs.
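
To sketch where such a hint could live (the attribute name and the schema
here are entirely made up; nothing like this exists in Avro today):

// Purely hypothetical: a per-field "encoding" hint in a record schema.
// Neither the attribute nor this schema is part of Avro.
public class EncodingHintSketch {
  public static void main(String[] args) {
    String schema =
        "{ \"type\": \"record\", \"name\": \"PageHit\", \"fields\": [\n"
      + "    { \"name\": \"count\", \"type\": \"int\",\n"
      + "      \"encoding\": \"bias-unsigned\" }\n"
      + "] }";
    System.out.println(schema);
  }
}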

I don't think that a first version of Avro should do much work on this front.  
But I would like it not to preclude future options or extensibility on the 
encoding side of things per data type.
In the distant future, Strings could even have an encoding type that compresses 
by spanning across records of the same column in stream formats -- which is 
often much more efficient than compressing the whole row, since similarity 
within a column (think URLs) can be very high.
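
As a toy illustration of that point (the field name and data are invented),
deflating a column of similar URLs as one block beats deflating each value
per record by a wide margin in this contrived case:

// Toy comparison: deflating each record's string separately vs. deflating
// the whole column as one block.  Data is invented for illustration.
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class ColumnCompressionSketch {

  // Deflate the input and return the compressed size in bytes.
  static int compressedSize(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buf = new byte[8192];
    int size = 0;
    while (!deflater.finished()) {
      size += deflater.deflate(buf); // we only need the count, not the bytes
    }
    deflater.end();
    return size;
  }

  public static void main(String[] args) {
    String[] urls = new String[1000]; // stand-in for a "url" column
    for (int i = 0; i < urls.length; i++) {
      urls[i] = "http://example.com/products/item?id=" + i;
    }

    // Row-style: compress each record's value on its own.
    int perRecord = 0;
    for (String u : urls) {
      perRecord += compressedSize(u.getBytes(StandardCharsets.UTF_8));
    }

    // Column-style: compress all values of the column as a single block.
    StringBuilder column = new StringBuilder();
    for (String u : urls) {
      column.append(u).append('\n');
    }
    int columnar =
        compressedSize(column.toString().getBytes(StandardCharsets.UTF_8));

    System.out.println("per-record: " + perRecord
        + " bytes, columnar: " + columnar + " bytes");
  }
}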

In my mind, the most valuable aspect of Avro is a common format for very 
compact, space-efficient (high-performance) serialization of data across the 
Hadoop infrastructure AND for custom use cases in the Hadoop ecosystem.
RPC and other such things are just use cases above this layer that share the 
cross-language need.
How a schema decodes/encodes a record is the core of it, regardless of what 
context that record is in.  I can see many cases where one may want to copy 
the raw bytes of a record of a known schema type from one location or format 
to another without even decoding/re-encoding.  Records in disk-backed hashes 
or indexes, records stored in ZooKeeper's byte[] data, records in custom 
stream formats or file formats...  If it's all the same format, much encoding 
and decoding can be avoided, and tools like Pig, Hive, CloudBase, Cascading, 
HBase, and others can easily share data seamlessly without custom 
reader/writer adapters and constant serialization/deserialization when it's 
not necessary to inspect the contents of the record.
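
Here is a minimal sketch of the kind of pass-through I mean; the 4-byte
length-prefix framing is invented purely for illustration and is not an
existing Avro format:

// Sketch: moving a record between containers as opaque bytes when both
// sides already know the schema.  The framing here is invented.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawRecordCopy {

  // Copy one record's raw bytes from input to output without decoding fields.
  static void copyRecord(DataInputStream in, DataOutputStream out)
      throws IOException {
    int length = in.readInt();         // framing: 4-byte length prefix
    byte[] record = new byte[length];  // the encoded record, treated as opaque
    in.readFully(record);
    out.writeInt(length);
    out.write(record);
    // No schema-driven decode or re-encode happened; these same bytes could
    // land in a file, a ZooKeeper znode, or an index entry unchanged.
  }
}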

Just some comments from someone lurking with interest --

-Scott


On 4/24/09 11:25 AM, "Doug Cutting" <[email protected]> wrote:

I've now added all the spec changes I can think of to Jira:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?pid=12310911&component=12312779

I propose to complete these in roughly the following order:

AVRO-1 - switch to JSON arrays for record definitions
AVRO-9 - restrict map keys to be strings
AVRO-17 - remove single-float type

AVRO-8 - add default values to record field definitions
AVRO-2 - switch to optimized RPC handshake
AVRO-10 - add fixed-size type

My goal is to make the incompatible, inner changes sooner, and implement
new features and higher-level changes later.

I've completed the documentation and Java changes for AVRO-1, and will
now start on similar patches for AVRO-9 and AVRO-17.  Hopefully Sharad
will add Python support to these patches and we can commit all three
early next week, then start on the last three.

With luck, we might get all of these done by the end of the first week
in May, and by May 15th at the latest.  Does that sound like a
reasonable goal?

Cheers,

Doug
