Hi Spencer,

> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.

I agree this is the reason. A lot of languages (Python is not one of them) need integers to have a consistent size for efficiency.

> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.

There is a tradeoff here between storage/wire efficiency and CPU efficiency. In a lot of cases integers tend to be small, and branch predictors can do a good job of eliminating the inefficiencies of branches. When this applies, increasing the storage/on-the-wire cost by 4x-8x doesn't make sense. Other serialization formats like protobuf [1] do allow users to choose between fixed- and variable-width data.

I think the general assumptions about the cost of compute and storage/networking have changed to a certain degree since the time when a lot of serialization formats were created, and in many cases making more use of fixed-size types makes sense. Flatbuffers [2] and Cap'n Proto [3] are serialization formats that take this to the extreme.

I can't speak to plans for an Avro 2 implementation, but based on my experience it is non-trivial to get a new serialization format to the point where it is useful (i.e. has good support across a lot of languages), and even more so to get it adopted.

[1] https://developers.google.com/protocol-buffers/docs/proto3#scalar
[2] https://google.github.io/flatbuffers/
[3] https://capnproto.org/

On Fri, Mar 26, 2021 at 2:35 PM Spencer Nelson wrote:

> Why does Avro have separate 'int' and 'long' types, and what does it mean
> for them to represent 32- and 64-bit integers?
>
> Since their binary encoding is variable-length, it appears to me that they
> are unbounded in practice. Encoder and decoder functions for int and long
> are completely identical, other than checking that a value doesn't exceed
> the 32- and 64-bit maximum values.
> In fact, the Python implementation's
> read_int is just an alias for read_long [1], and the same goes for
> write_int / write_long.
>
> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.
>
> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.
>
> I don't know how open folks are to dramatic changes (is Avro 2 a concept
> that anyone is talking about?) but I think that Avro would be significantly
> improved with a rethink of its integer types. Some sort of fixed-length
> integer type[s], and just one variable-length integer type, would be
> clarifying, more expressive, and more efficient, I think.
>
> [1]
> https://github.com/apache/avro/blob/5bd7cfe0bf742d0482bf6f54b4541b4d22cc87d9/lang/py/avro/io.py#L251-L255
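
To make the size tradeoff discussed in this thread concrete, here is a minimal sketch of the zigzag variable-length encoding that Avro's binary format uses for int and long, next to a fixed 8-byte encoding built with Python's struct module. The names encode_long, decode_long and encode_fixed64 are hypothetical and are not the Avro Python library's API; treat this as an illustration under those assumptions rather than the library's implementation.

```python
import struct


def encode_long(n: int) -> bytes:
    """Zigzag + varint encode a signed 64-bit integer (hypothetical helper)."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in 64 bits")
    # Zigzag: interleave negatives and positives so small magnitudes stay small
    # (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte, low bits first; the high bit means "more bytes follow".
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: bytes) -> int:
    """Inverse of encode_long; note the per-byte shift, mask and branch."""
    z = 0
    shift = 0
    for b in buf:
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: this was the last byte
            break
        shift += 7
    # Undo zigzag.
    return (z >> 1) ^ -(z & 1)


def encode_fixed64(n: int) -> bytes:
    """Fixed-width alternative: always 8 bytes, little-endian."""
    return struct.pack("<q", n)


if __name__ == "__main__":
    for value in (0, 1, -1, 64, 1_000_000, 2**63 - 1):
        print(value, len(encode_long(value)), len(encode_fixed64(value)))
```

Small values such as 0, 1 and -1 take a single byte in the variable-length form and 8 bytes in the fixed form, while the largest 64-bit values take 10 bytes in the variable-length form. The loop in decode_long is the per-byte work Spencer is pointing at.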
