As a very interested future user of Avro (and potential minor contributor), I was wondering if there would be more than one possible way to serialize an int.
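To make that concrete, here is a rough sketch in plain Java -- nothing here is Avro API, and zigZag/varIntSize are just my own illustrative helpers. If I read the spec right, an int is zig-zagged and then written as a base-128 varint, so every non-negative value pays roughly one bit for sign support; a hypothetical unsigned varint would not:

    // Sketch only: compares zig-zag varint sizes (Avro-style) with a
    // hypothetical unsigned varint for values known to be non-negative.
    public class VarIntSketch {
      // Zig-zag maps signed to unsigned so small negatives stay small:
      // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
      static long zigZag(int n) {
        return ((long) (n << 1) ^ (n >> 31)) & 0xFFFFFFFFL;
      }

      // Base-128 varint: 7 bits per byte, high bit marks continuation.
      static int varIntSize(long v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) bytes++;
        return bytes;
      }

      public static void main(String[] args) {
        for (int n : new int[] {63, 64, 1000, 1 << 20}) {
          System.out.printf("%8d: zig-zag %d byte(s), unsigned %d byte(s)%n",
              n, varIntSize(zigZag(n)), varIntSize(n));
        }
      }
    }

For values such as 64 through 127 (and at every power-of-two boundary above), the unsigned form is a byte shorter; whether that half-bit is worth a second wire format is exactly my question.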
In particular, a large number of use cases involve only positive integers, or expect only very small negative numbers to occur. Is a positive-number-biased serialization of much use, or is it more trouble than it's worth? More generally, having a couple of encoding options for other types that are currently fixed-size only might be useful as well: "encoding": "bias-unsigned"? Given knowledge of the data being stored, I can see some variation in what a user would want for float, double, string, datetime, or int with respect to space/time tradeoffs.

I don't think a first version of Avro should do much work on this front, but I would like it not to preclude future options or extensibility on the encoding side of things per data type. In the distant future, strings could even have an encoding type that compresses by spanning across records of the same column in stream formats -- which is often much more efficient than compressing the whole row, since column similarity (think URLs) can be very high.

In my mind, the most valuable aspect of Avro is a common format for very compact, space-efficient (high-performance) serialization of data across the Hadoop infrastructure AND for custom use cases in the Hadoop ecosystem. RPC and other such things are just use cases above this layer that share the cross-language need. How a schema encodes/decodes a record is the core of it, regardless of what context that record is in.

I can see many cases where one may want to copy the raw bytes of a record of a known schema type from one location or format to another without even decoding/re-encoding: records in disk-backed hashes or indexes, records stored in ZooKeeper's byte[] data, records in custom stream or file formats. If it's all the same format, much encoding and decoding can be avoided, and tools like Pig, Hive, CloudBase, Cascading, HBase, and others can share data seamlessly, without custom reader/writer adapters and without constant serialization/deserialization when it's not necessary to inspect the contents of a record.

Just some comments from someone lurking with interest.

-Scott

On 4/24/09 11:25 AM, "Doug Cutting" <[email protected]> wrote:

> I've now added all the spec changes I can think of to Jira:
>
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?pid=12310911&component=12312779
>
> I propose to complete these in roughly the following order:
>
> AVRO-1  - switch to JSON arrays for record definitions
> AVRO-9  - restrict map keys to be strings
> AVRO-17 - remove single-float type
> AVRO-8  - add default values to record field definitions
> AVRO-2  - switch to optimized RPC handshake
> AVRO-10 - add fixed-size type
>
> My goal is to make incompatible, internal changes sooner, and implement
> new features and higher-level changes later. I've completed the
> documentation and Java changes for AVRO-1, and will now start on similar
> patches for AVRO-9 and AVRO-17. Hopefully Sharad will add Python support
> to these patches and we can commit all three early next week, then start
> on the last three.
>
> With luck, we might get all of these done by the end of the first week
> in May, and by May 15th at the latest. Does that sound like a reasonable
> goal?
>
> Cheers,
>
> Doug
