Scott Carey wrote:
In particular, a large number of use cases involve only positive integers, or expect only very small negative numbers to exist. Is a positive-biased serialization of much use, or is it more trouble than it's worth?
I did spend a few days trying to define a positive-biased integer format that was sufficiently simple to describe and efficient to implement, and was unable to. That's not to say it's impossible.
In general, having a couple of serialization types for other types that currently have only a fixed-size encoding might be useful as well. "encoding": "bias-unsigned"? Given knowledge of the data stored, I can see some variation in what a user would want for float, double, string, datetime, or int WRT space/time tradeoffs.
We need to strike a reasonable balance between simplicity and maximal performance for every case. When things are unclear, I tend to opt for simplicity over a potential minor performance improvement. A biased representation might save a few percent in size for some applications, but at the cost of forcing every implementation in every language to support that encoding. Avro's about interchange between applications, and one might need to make some compromises over what's ideal for each application.
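To make the size tradeoff concrete, here is a rough sketch that compares byte counts. It assumes ints are written with a zig-zag mapping followed by a 7-bits-per-byte variable-length code. The plain unsigned variant is hypothetical and only there for comparison, and the function names are made up for illustration.

```python
def zigzag(n):
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)


def varint_len(u):
    """Number of bytes a non-negative int needs at 7 payload bits per byte."""
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length


def compare(values):
    """Return (zig-zag bytes, hypothetical unsigned bytes) for non-negative values."""
    zz = sum(varint_len(zigzag(v)) for v in values)
    unsigned = sum(varint_len(v) for v in values)
    return zz, unsigned


if __name__ == "__main__":
    # Only values whose top bit falls on a 7-bit boundary gain a byte.
    for v in (5, 63, 64, 100, 127, 128, 10000):
        print(v, varint_len(zigzag(v)), varint_len(v))
```

Under these assumptions the unsigned form saves one byte only for values in ranges like 64..127 or 8192..16383, where the extra sign bit pushes the zig-zag value over a 7-bit boundary. Everywhere else the two sizes are equal.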
I don't think that a first version of Avro should do much work on this front. But I would like it to not preclude future options or extensibility on the encoding side of things per data type. In the distant future, strings could even have an encoding type that compresses across records of the same column in stream formats, which is often much more efficient than compressing the whole row, since column similarity (think URLs) can be very high.
Such optimizations can be implemented in other ways too, complementary to Avro. For example, a container might store a sequence of Avro records that contain <int,string> pairs, the int indicating how much of the previous record's string is a prefix of the current, and the string providing the suffix. The container's API could then decode these to the full string.
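A minimal sketch of that container-side idea, in plain Python rather than any real Avro API. Each string becomes an (int, string) pair of shared-prefix length and suffix, and decoding rebuilds the full strings. The function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(strings):
    """Turn strings into (prefix_len, suffix) pairs relative to the previous string."""
    pairs = []
    prev = ""
    for s in strings:
        n = common_prefix_len(prev, s)
        pairs.append((n, s[n:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the full strings from (prefix_len, suffix) pairs."""
    out = []
    prev = ""
    for n, suffix in pairs:
        s = prev[:n] + suffix
        out.append(s)
        prev = s
    return out


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/ab",
        "http://example.org/",
    ]
    pairs = front_encode(urls)
    print(pairs)
    assert front_decode(pairs) == urls
```

The pairs themselves could be stored as ordinary records with an int field and a string field, so the file format needs no new encoding. The cost is that records must then be decoded in order.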
Doug
