Scott Carey wrote:
In particular, a large number of use cases are only positive
integers, or expect only very small negative numbers to exist.  Is a
positive number biased serialization of much use, or is it more
trouble than it's worth?

I did spend a few days trying to define a positive-biased integer format that was sufficiently simple to describe and efficient to implement, and was unable to. That's not to say it's impossible.

In general, having a couple serialization types for other currently
fixed-only size types might be useful as well.  "encoding":
"bias-unsigned"? Given the knowledge of the data stored I can see
some variation in what a user would want for float, double, string,
datetime, or int WRT space / time tradeoffs.

We need to strike a reasonable balance between simplicity and maximal performance for every case. When things are unclear, I tend to opt for simplicity over a potential minor performance improvement. A biased representation might save a few percent in size for some applications, but at the cost of forcing every implementation in every language to support that encoding. Avro's about interchange between applications, and one might need to make some compromises over what's ideal for each application.
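
To make the size tradeoff concrete, here is a rough sketch in Python. It assumes the zig-zag variable-length coding that the Avro specification describes for int and long, and compares it with a plain unsigned varint standing in for a hypothetical "bias-unsigned" encoding. The function names are illustrative only and are not part of any Avro API.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low-order group first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_zigzag(n: int) -> bytes:
    """Zig-zag then varint; small magnitudes of either sign stay short."""
    return varint(zigzag(n))


def encode_unsigned(n: int) -> bytes:
    """Hypothetical unsigned-only encoding; rejects negative values outright."""
    if n < 0:
        raise ValueError("unsigned encoding cannot represent negative values")
    return varint(n)


if __name__ == "__main__":
    for value in (1, 63, 64, 127, 128, 8191, 8192):
        print(value, len(encode_zigzag(value)), len(encode_unsigned(value)))
```

Zig-zag spends one bit on the sign, so a non-negative value only costs an extra byte when it falls in the upper half of a 7-bit group boundary (64 to 127, 8192 to 16383, and so on). That is the saving a positive-biased format would be competing for.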

I don't think that a first version of Avro should do much work on
this front.  But I would like it to not preclude future options or
extensibility on the encoding side of things per data type. In the
distant future, Strings could even have an encoding type that
compresses by spanning across records of the same column in stream
formats -- (which is often much more efficient than compressing the
whole row since column similarity - think URLs - can be very high).

Such optimizations can be implemented in other ways too, complementary to Avro. For example, a container might store a sequence of Avro records that contain <int,string> pairs, the int indicating how much of the previous record's string is a prefix of the current, and the string providing the suffix. The container's API could then decode these to the full string.
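
A minimal sketch of that container-level idea, in Python. The names encode_prefixes and decode_prefixes are hypothetical, and the pairs are shown as plain tuples rather than serialized Avro records, so this only illustrates the prefix/suffix logic and not a real container API.

```python
from typing import Iterable, Iterator, Tuple


def encode_prefixes(strings: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (shared_prefix_length, suffix) pairs relative to the previous string."""
    prev = ""
    for s in strings:
        n = 0
        limit = min(len(prev), len(s))
        # Count how many leading characters match the previous string.
        while n < limit and prev[n] == s[n]:
            n += 1
        yield n, s[n:]
        prev = s


def decode_prefixes(pairs: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Rebuild the full strings from (shared_prefix_length, suffix) pairs."""
    prev = ""
    for n, suffix in pairs:
        s = prev[:n] + suffix
        yield s
        prev = s


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    pairs = list(encode_prefixes(urls))
    print(pairs)
    assert list(decode_prefixes(pairs)) == urls
```

Decoding depends on the previous record, so a container built this way would presumably need to restart the prefix chain at whatever points it wants to allow seeking to.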

Doug

