Chad Walters wrote:
> -- You suggest that there is not a lot in Thrift that Avro can
> leverage. I think you may be overlooking the fact that Thrift has a
> user base and a community of developers who are very interested in
> issues of cross-language data serialization and interoperability.
I meant that in terms of common code, not coders. Coders can belong to
more than one community but code should generally not. Hadoop Core has
become a sprawling community that we're trying to split. It's more
productive to have many small communities than a few large ones. A
project needs a handful of active developers, but with too many it
becomes ungainly. So, if it's technically possible for a codebase to be
distinct, and it can attract enough active developers to sustain itself,
that is a preferable structure.
> At the code level, Thrift contains a transport abstraction and
> multiple different transport and server implementations in many
> different target languages. If there were closer collaboration, Avro
> could certainly benefit from leveraging the existing ones and any
> additional contributions in this area would benefit both projects.
The transport and server implementations are indeed an area where code
could potentially be shared between Avro and Thrift. Perhaps someone
could start a separate project with reusable transport and server
implementations to support RPC? In any case, Avro primarily specifies a
binary message format, not a full transport. We hope to piggyback off
other transport implementations, like HTTP servers, etc. Full
transports involve authentication, authorization, encryption, etc.,
which are outside of the scope of Avro.
> The most significant issue is that both of them specify a type
> system. At a very minimum I would like to see Avro and Thrift make
> agreements on that type system.
This makes good sense. It would be good if these were interoperable.
Thrift has byte and i16, which Avro does not currently. I'd like to add
a fixed<n> primitive type to Avro, where n is the number of bytes and is
specified in the schema, so that one could, e.g., define a byte as
fixed<1>, an i16 as fixed<2>, and an md5 as fixed<16>.
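To make the fixed<n> idea concrete, here is a small sketch (my own illustration, not an actual Avro implementation): a fixed<n> value occupies exactly n bytes on the wire, with n carried by the schema rather than the data, so a byte, an i16, and an md5 all reduce to fixed-size byte strings.

```python
import struct

def write_fixed(buf, value, n):
    # A fixed<n> value is exactly n raw bytes; the schema, not the
    # data stream, says how many bytes to read back.
    assert len(value) == n, "fixed<%d> requires exactly %d bytes" % (n, n)
    buf.extend(value)

buf = bytearray()
write_fixed(buf, b"\x7f", 1)                  # a byte as fixed<1>
write_fixed(buf, struct.pack(">h", -300), 2)  # an i16 as fixed<2>
write_fixed(buf, bytes(16), 16)               # an md5 digest as fixed<16>

# No per-value length prefixes or tags: 1 + 2 + 16 = 19 bytes total.
assert len(buf) == 19
assert struct.unpack(">h", bytes(buf[1:3]))[0] == -300
```

The point of the sketch is that none of these values needs a length prefix or type tag, since the schema fully determines the layout.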
Thrift has both lists and sets, while Avro has just arrays, which are
equivalent to lists (they're ordered). Perhaps Avro could add sets.
Are they leveraged heavily in Thrift? I've not heard much call for them
in Avro yet.
Avro has a single-precision float, which Thrift does not. Avro could
perhaps lose this.
Avro distinguishes UTF-8 text strings from byte strings, while Thrift
does not. I am reluctant to lose this distinction.
Avro has unions and a null type, while Thrift does not. Does Thrift
support recursive data structures?
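To illustrate why unions and null matter together, here is a hypothetical sketch (not Avro's actual wire format): encode a union as a branch index followed by the chosen branch's value. A union of null and a record then models both optional fields and recursive structures such as a linked list, where each node's "next" is itself that union.

```python
def encode_union(branches, branch, value):
    # Illustrative encoding: the index of the chosen branch, then the value.
    return [branches.index(branch), value]

NEXT = ["null", "node"]  # hypothetical union: either null or another node

# A cons-style linked list: null terminates, a nested node continues it.
node2 = {"value": 2, "next": encode_union(NEXT, "null", None)}
node1 = {"value": 1, "next": encode_union(NEXT, "node", node2)}

assert node2["next"] == [0, None]  # branch 0 (null) ends the list
assert node1["next"][0] == 1       # branch 1 (node) means a node follows
```

Without a null branch in the union, there would be no way to terminate such a recursive structure.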
> Furthermore, you say that last part ("Thrift would have yet another
> serialization format...") like it is a bad thing...
When faced with multiple programming and scripting languages, multiple
serialization formats should be discouraged, or one ends up with
multiplicative compatibility problems. A single, primary data format
would vastly simplify the Hadoop ecosystem. Yes, folks need to be able
to easily import and export data, but expecting scripts in arbitrary
languages to be able to process data in arbitrary formats seems unwise.
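The multiplicative concern can be made concrete with illustrative (assumed) numbers:

```python
# Assumed counts, purely for illustration.
languages, formats = 6, 4

# Full interoperability requires every language to read and write every
# format: the work grows multiplicatively.
assert languages * formats == 24

# With a single primary data format, each language needs one codec.
assert languages * 1 == 6
```

Each added format multiplies, rather than adds to, the implementation and testing burden across the ecosystem.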
> Note that it is
> an explicit design goal of Thrift to allow for multiple different
> serialization formats so that lots of different use cases can be
> supported by the same fundamental framework.
That's not a design goal of Avro, which intends to provide a single,
well-specified, easy to implement serialization format. This is not in
conflict with Thrift, it's just a different goal.
> Also, doesn't Avro essentially contain "another serialization format
> that every language would need to implement for it to be useful"?
> Seems like the same basic set of work to me, whether it is in Avro or
> Thrift.
None of Thrift's existing formats solve the problems Avro seeks to
solve. Thrift may be able to incorporate Avro's format, if its format
abstractions generalize well, ideally by reusing Avro's code. So there
should be little duplication of effort in such an approach.
> The simplification comes simply from not having the field IDs in the
> IDL? I am not sure why having sequential id numbers after each field
> is considered to be so onerous.
I didn't say it was onerous, I said that, like in most data structure
languages (e.g., programming languages), Avro permits folks to name
fields with symbolic names alone. In human-authored software, symbolic
naming is generally preferable to numeric naming. Is that really a
matter of dispute?
> If the field IDs are really so
> objectionable, Thrift could allow them to be optional for purely
> dynamic usages.
Optional features increase compatibility complexity and are harder to
maintain and test. A Thrift IDL without numbers would not provide
versioning features to non-dynamic languages.
> I also don't see why matching names is considered easier than
> matching numbers, which is essentially what the versioning semantics
> come down to in the end. Am I missing something here?
They are formally equivalent. For machines, matching numbers is easier,
but people usually prefer to operate on names, and names can be
automatically mapped to numbers.
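One way this mapping could work, sketched here as my own illustration rather than anything Avro or Thrift specifies: derive each field's number from its position in the writer's declared order, so tools that want numeric ids get them for free while humans keep using names.

```python
def number_fields(field_names):
    # Assign each field name its position in the writer's declaration
    # order; this yields stable numeric ids without anyone writing them.
    return {name: i for i, name in enumerate(field_names)}

writer = ["name", "email", "age"]   # fields as the writer declared them
reader = ["age", "name"]            # reader matches by name; order differs

ids = number_fields(writer)
resolution = {name: ids[name] for name in reader if name in ids}

# The reader's symbolic names resolve to the writer's numeric positions.
assert resolution == {"age": 2, "name": 0}
```

The formal equivalence is visible here: the name-based match and the number-based match produce the same pairing, so nothing is lost by letting people write names.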
> Consider an alternative: making Avro more like a sub-project of
> Thrift or just implementing it directly in Thrift.
I looked into changing Thrift to support Avro's features, and it was
very messy. Perhaps someone else could do this more easily.
Building Avro as a part of Thrift would take considerably more effort
for me and I think offer little more than it does separately. If you
feel differently, you are free to fork Avro, start a competitor, provide
patches that integrate it into Thrift, or whatever.
> In that case, I
> think the end result will be a powerful and flexible "one-stop shop"
> for data serialization for RPC and archival purposes with the ability
> to bring both static and dynamic capabilities as needed for
> particular application purposes. To me this seems like a bigger win
> for both Hadoop and for Thrift.
It could be a floor wax and a dessert topping!
Doug