Avro int vs long and varint encoding

Spencer Nelson Fri, 26 Mar 2021 14:35:48 -0700

Why does Avro have separate 'int' and 'long' types, and what does it mean
for them to represent 32- and 64-bit integers?


Since their binary encoding is variable-length, it appears to me that they
are unbounded in practice. Encoder and decoder functions for int and long
are completely identical, other than checking that a value doesn't exceed
the 32- and 64-bit maximum values. In fact, the Python implementation's
read_int is just an alias for read_long [1], and the same goes for
write_int / write_long.
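
To make the point above concrete, here is a minimal sketch of the zig-zag varint scheme that the Avro spec describes for both int and long. The function names (write_varint, read_varint, write_int, write_long) are hypothetical and are not the avro library's API. The only thing that differs between the two types is the range check.

```python
def write_varint(n: int, bits: int) -> bytes:
    """Zig-zag + base-128 varint encode n, checked against a signed `bits` range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= n <= hi:
        raise ValueError(f"{n} does not fit in a signed {bits}-bit integer")
    # Zig-zag: map signed to unsigned so small magnitudes stay small
    # (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    z = (n << 1) ^ (n >> (bits - 1))
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits left
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)                     # final byte, continuation bit clear
    return bytes(out)


def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one zig-zag varint from buf at pos; return (value, new_pos)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:              # per-byte conditional
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos   # undo zig-zag


def write_int(n: int) -> bytes:
    return write_varint(n, 32)


def write_long(n: int) -> bytes:
    return write_varint(n, 64)
```

For any value that fits in 32 bits, write_int and write_long produce the same bytes, and a single read_varint decodes either one.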

Just as a guess, I suppose that int and long exist to make life a little
easier when you're deserializing. They act as promises that, despite being
variable-length, the integer you read out will fit in a certain-sized chunk
of memory.

But if this is the goal, it seems like fixed-size integers would do a
better job, and would let deserializers be *much* more efficient since they
wouldn't need to execute multiple instructions (including a conditional!) on
every single byte of the input.
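
As a rough illustration of the difference, the sketch below puts a varint decoder next to a decoder for a hypothetical fixed-width 64-bit integer. Avro has no such fixed-width integer type today; the little-endian layout and the names read_varint_long and read_fixed_long are assumptions made for the example.

```python
import struct


def read_varint_long(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Zig-zag varint decode: a mask, a shift, and a branch for every byte."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:              # continuation bit checked on each byte
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def read_fixed_long(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Hypothetical fixed-width decode: one 8-byte load, no per-byte branching."""
    (value,) = struct.unpack_from("<q", buf, pos)
    return value, pos + 8


def read_fixed_longs(buf: bytes, count: int, pos: int = 0) -> tuple[int, ...]:
    """With a fixed width, a whole run of values can be unpacked in one call."""
    return struct.unpack_from(f"<{count}q", buf, pos)
```

The fixed-width form also makes the position of the i-th value known up front (pos + 8 * i), which a varint stream cannot offer without decoding everything before it. The trade-off is size: small values cost 8 bytes instead of 1.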

I don't know how open folks are to dramatic changes (is Avro 2 a concept
that anyone is talking about?) but I think that Avro would be significantly
improved by a rethink of its integer types. One or more fixed-length
integer types, plus just one variable-length integer type, would be
clearer, more expressive, and more efficient.



[1]
https://github.com/apache/avro/blob/5bd7cfe0bf742d0482bf6f54b4541b4d22cc87d9/lang/py/avro/io.py#L251-L255
