Re: Avro int vs long and varint encoding

Micah Kornfield Fri, 26 Mar 2021 14:56:56 -0700

Hi Spencer,

> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.

I agree this is the reason.  A lot of languages (Python is not one of them)
need integers to have a fixed, known size for efficiency.
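To make the "promise" concrete, here is a minimal sketch of a decoder that
reads a zigzag varint and then checks that it fits in 32 bits.  The names
read_varint and read_int_checked are hypothetical, chosen for illustration;
this is not the code of any Avro implementation.

```python
import io

# Bounds of a signed 32-bit integer, which is what Avro's "int" promises.
INT_MIN, INT_MAX = -(1 << 31), (1 << 31) - 1


def read_varint(stream) -> int:
    """Decode one zigzag-encoded, variable-length integer from a byte stream."""
    z = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated varint")
        b = raw[0]
        z |= (b & 0x7F) << shift      # low 7 bits carry the payload
        if not b & 0x80:              # high bit clear marks the last byte
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)        # undo the zigzag mapping


def read_int_checked(stream) -> int:
    """Hypothetical 'int' reader: same decoding as a long, plus a range check."""
    n = read_varint(stream)
    if not INT_MIN <= n <= INT_MAX:
        raise OverflowError(f"{n} does not fit in a 32-bit int")
    return n


print(read_int_checked(io.BytesIO(b"\x02")))  # one byte on the wire
```

The only difference between the two readers is the range check, which is
what lets a statically typed language hand the value straight to a 32-bit
variable.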

> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.

There is a tradeoff here between storage/wire efficiency and CPU
efficiency.  In a lot of cases integers tend to be small, and branch
predictors can do a good job of hiding the cost of the branches.
When this applies, increasing the storage and on-the-wire cost by 4x-8x
doesn't make sense.  Other serialization formats like protobuf [1] do allow
users to choose between fixed- and variable-width data.
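As a rough illustration of the size side of that tradeoff, the sketch below
encodes a few values as zigzag varints and compares the result with a
fixed 8-byte little-endian layout.  The function names are hypothetical, and
write_fixed_long in particular is an assumption made for comparison only:
Avro does not define a fixed-width integer type like this.

```python
import struct


def write_long(n: int) -> bytes:
    """Encode a signed 64-bit value as a zigzag varint."""
    if not -(1 << 63) <= n < (1 << 63):
        raise OverflowError(f"{n} does not fit in 64 bits")
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes map to small codes
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits remain
        out.append((z & 0x7F) | 0x80) # emit 7 bits and set the continuation bit
        z >>= 7
    out.append(z)                     # final byte, continuation bit clear
    return bytes(out)


def write_fixed_long(n: int) -> bytes:
    """Hypothetical fixed-width alternative: always 8 bytes, little-endian."""
    return struct.pack("<q", n)


for value in (1, -1, 64, 1_000_000, (1 << 63) - 1):
    print(value, len(write_long(value)), len(write_fixed_long(value)))
```

Small values take one or two bytes as varints against a constant eight
bytes fixed, while the largest 64-bit values take ten bytes as varints, so
the saving depends on the values actually being small.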

I think the general assumptions about the cost of compute versus
storage/networking have changed to a certain degree since the time when a
lot of serialization formats were created, and that in many cases making
more use of fixed-size types makes sense.  Flatbuffers [2] and Capnproto [3]
are serialization formats that take this to the extreme.

I can't speak to plans for an Avro 2 implementation, but based on my
experience it is non-trivial to get a new serialization format to the point
where it is useful (i.e. has good support across a lot of languages), and
even harder to get it adopted.

[1] https://developers.google.com/protocol-buffers/docs/proto3#scalar
[2] https://google.github.io/flatbuffers/
[3] https://capnproto.org/

On Fri, Mar 26, 2021 at 2:35 PM Spencer Nelson <[email protected]> wrote:

> Why does Avro have separate 'int' and 'long' types, and what does it mean
> for them to represent 32- and 64-bit integers?
>
> Since their binary encoding is variable-length, it appears to me that they
> are unbounded in practice. Encoder and decoder functions for int and long
> are completely identical, other than checking that a value doesn't exceed
> the 32- and 64-bit maximum values. In fact, the Python implementation's
> read_int is just an alias for read_long [1], and the same goes for
> write_int / write_long.
>
> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.
>
> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.
>
> I don't know how open folks are to dramatic changes (is Avro 2 a concept
> that anyone is talking about?) but I think that Avro would be significantly
> improved with a rethink of its integer types. Some sort of fixed-length
> integer type[s], and just one variable-length integer type, would be
> clarifying, more expressive, and more efficient, I think.
>
>
>
> [1]
>
> https://github.com/apache/avro/blob/5bd7cfe0bf742d0482bf6f54b4541b4d22cc87d9/lang/py/avro/io.py#L251-L255
>
