Chad Walters wrote:
> -- You suggest that there is not a lot in Thrift that Avro can
> leverage. I think you may be overlooking the fact that Thrift has a
> user base and a community of developers who are very interested in
> issues of cross-language data serialization and interoperability.
I meant that in terms of common code, not coders. Coders can belong to
more than one community but code should generally not. Hadoop Core has
become a sprawling community that we're trying to split. It's more
productive to have many small communities than a few large ones. A
project needs a handful of active developers, but with too many it
becomes ungainly. So, if it's technically possible for a codebase to be
distinct, and it can attract enough active developers to sustain itself,
that is a preferable structure.
> At the code level, Thrift contains a transport abstraction and
> multiple different transport and server implementations in many
> different target languages. If there were closer collaboration, Avro
> could certainly benefit from leveraging the existing ones and any
> additional contributions in this area would benefit both projects.
The transport and server implementations are indeed an area where code
could potentially be shared between Avro and Thrift. Perhaps someone
could start a separate project with reusable transport and server
implementations to support RPC? In any case, Avro primarily specifies a
binary message format, not a full transport. We hope to piggyback off
other transport implementations, like HTTP servers, etc. Full
transports involve authentication, authorization, encryption, etc.,
which are outside of the scope of Avro.
> The most significant issue is that both of them specify a type
> system. At a very minimum I would like to see Avro and Thrift make
> agreements on that type system.
This makes good sense. It would be good if these were interoperable.
Thrift has byte and i16, which Avro does not currently. I'd like to add
a fixed<n> primitive type to Avro, where n is the number of bytes and is
specified in the schema, so that one could, e.g., define a byte as
fixed<1>, an i16 as fixed<2>, and an md5 as fixed<16>.
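To make the fixed<n> idea concrete, here is a small sketch (my own illustration, not an actual Avro implementation): a fixed<n> value occupies exactly n bytes on the wire, with n carried by the schema rather than the data, so a byte, an i16, and an md5 all reduce to fixed-size byte strings.

```python
import struct

def write_fixed(buf, value, n):
    # A fixed<n> value is exactly n raw bytes; the schema, not the
    # data stream, says how many bytes to read back.
    assert len(value) == n, "fixed<%d> requires exactly %d bytes" % (n, n)
    buf.extend(value)

buf = bytearray()
write_fixed(buf, b"\x7f", 1)                  # a byte as fixed<1>
write_fixed(buf, struct.pack(">h", -300), 2)  # an i16 as fixed<2>
write_fixed(buf, bytes(16), 16)               # an md5 digest as fixed<16>

# No per-value length prefixes or tags: 1 + 2 + 16 = 19 bytes total.
assert len(buf) == 19
assert struct.unpack(">h", bytes(buf[1:3]))[0] == -300
```

The point of the sketch is that none of these values needs a length prefix or type tag, since the schema fully determines the layout.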
Thrift has both lists and sets, while Avro has just arrays, which are
equivalent to lists (they're ordered). Perhaps Avro could add sets.
Are they leveraged heavily in Thrift? I've not heard much call for them
in Avro yet.
Avro has a single-precision float, which Thrift does not. Avro could
perhaps lose this.
Avro distinguishes UTF-8 text strings from byte strings, while Thrift
does not. I am reluctant to lose this distinction.
Avro has unions and a null type, while Thrift does not. Does Thrift
support recursive data structures?
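To illustrate why unions and null matter together, here is a hypothetical sketch (not Avro's actual wire format): encode a union as a branch index followed by the chosen branch's value. A union of null and a record then models both optional fields and recursive structures such as a linked list, where each node's "next" is itself that union.

```python
def encode_union(branches, branch, value):
    # Illustrative encoding: the index of the chosen branch, then the value.
    return [branches.index(branch), value]

NEXT = ["null", "node"]  # hypothetical union: either null or another node

# A cons-style linked list: null terminates, a nested node continues it.
node2 = {"value": 2, "next": encode_union(NEXT, "null", None)}
node1 = {"value": 1, "next": encode_union(NEXT, "node", node2)}

assert node2["next"] == [0, None]  # branch 0 (null) ends the list
assert node1["next"][0] == 1       # branch 1 (node) means a node follows
```

Without a null branch in the union, there would be no way to terminate such a recursive structure.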
> Furthermore, you say that last part ("Thrift would have yet another
> serialization format...") like it is a bad thing...
When faced with multiple programming and scripting languages, multiple
serialization formats should be discouraged, or one ends up with
multiplicative compatibility problems. A single, primary data format
would vastly simplify the Hadoop ecosystem. Yes, folks need to be able
to easily import and export data, but expecting scripts in arbitrary
languages to be able to process data in arbitrary formats seems unwise.
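The multiplicative concern can be made concrete with illustrative (assumed) numbers:

```python
# Assumed counts, purely for illustration.
languages, formats = 6, 4

# Full interoperability requires every language to read and write every
# format: the work grows multiplicatively.
assert languages * formats == 24

# With a single primary data format, each language needs one codec.
assert languages * 1 == 6
```

Each added format multiplies, rather than adds to, the implementation and testing burden across the ecosystem.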
> Note that it is
> an explicit design goal of Thrift to allow for multiple different
> serialization formats so that lots of different use cases can be
> supported by the same fundamental framework.
That's not a design goal of Avro, which intends to provide a single,
well-specified, easy to implement serialization format. This is not in
conflict with Thrift, it's just a different goal.
> Also, doesn't Avro essentially contain "another serialization format
> that every language would need to implement for it to be useful"?
> Seems like the same basic set of work to me, whether it is in Avro or
> Thrift.
None of Thrift's existing formats solve the problems Avro seeks to
solve. Thrift may be able to incorporate Avro's format, if its format
abstractions generalize well, ideally by reusing Avro's code. So there
should be little duplication of effort in such an approach.
> The simplification comes simply from not having the field IDs in the
> IDL? I am not sure why having sequential id numbers after each field
> is considered to be so onerous.
I didn't say it was onerous, I said that, like in most data structure
languages (e.g., programming languages), Avro permits folks to name
fields with symbolic names alone. In human-authored software, symbolic
naming is generally preferable to numeric naming. Is that really a
matter of dispute?
> If the field IDs are really so
> objectionable, Thrift could allow them to be optional for purely
> dynamic usages.
Optional features increase compatibility complexity and are harder to
maintain and test. A Thrift IDL without numbers would not provide
versioning features to non-dynamic languages.
> I also don't see why matching names is considered easier than
> matching numbers, which is essentially what the versioning semantics
> come down to in the end. Am I missing something here?
They are formally equivalent. For machines, matching numbers is easier,
but people usually prefer to operate on names, and names can be
automatically mapped to numbers.
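One way this mapping could work, sketched here as my own illustration rather than anything Avro or Thrift specifies: derive each field's number from its position in the writer's declared order, so tools that want numeric ids get them for free while humans keep using names.

```python
def number_fields(field_names):
    # Assign each field name its position in the writer's declaration
    # order; this yields stable numeric ids without anyone writing them.
    return {name: i for i, name in enumerate(field_names)}

writer = ["name", "email", "age"]   # fields as the writer declared them
reader = ["age", "name"]            # reader matches by name; order differs

ids = number_fields(writer)
resolution = {name: ids[name] for name in reader if name in ids}

# The reader's symbolic names resolve to the writer's numeric positions.
assert resolution == {"age": 2, "name": 0}
```

The formal equivalence is visible here: the name-based match and the number-based match produce the same pairing, so nothing is lost by letting people write names.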
> Consider an alternative: making Avro more like a sub-project of
> Thrift or just implementing it directly in Thrift.
I looked into changing Thrift to support Avro's features, and it was
very messy. Perhaps someone else could do this more easily.
Building Avro as a part of Thrift would take considerably more effort
for me and I think offer little more than it does separately. If you
feel differently, you are free to fork Avro, start a competitor, provide
patches that integrate it into Thrift, or whatever.
> In that case, I
> think the end result will be a powerful and flexible "one-stop shop"
> for data serialization for RPC and archival purposes with the ability
> to bring both static and dynamic capabilities as needed for
> particular application purposes. To me this seems like a bigger win
> for both Hadoop and for Thrift.
It could be a floor wax and a dessert topping!
Doug