Doug,

> I have never said I was not interested in working together.

That's great -- glad to hear that you are open to collaboration. My concern, 
however, is that making Avro a separate (sub)project may make it difficult for 
us to work together in practice, and in particular difficult for Thrift to 
leverage Avro's source code.

> I've said that I think Avro is fundamentally different from Thrift.
> Avro is a specific format, Thrift is a generic API for various formats,
> none like Avro.  They might be made to work together.  But at this point I
> see no point in forcing them together.

I don't think that they are as far apart as you are making it sound with this 
statement. I do think, however, that it will be very difficult for them to work 
together properly if code reuse by Thrift is not an explicit goal of Avro. The 
easiest way I can come up with to guarantee this is simply to incorporate 
Avro's feature set into Thrift. If you have other mechanisms for doing this, 
I'd love to hear them.

> If TProtocol's API is a good match for Avro's format and features,
> then it should be easy for folks to implement TProtocol using Avro's code and
> include Avro in Thrift.  If the match is not good then perhaps we can adjust
> Thrift and/or Avro to improve it.

Absolutely. And right now, there are sufficient differences in the type system 
and other areas that do require some adjustments, likely some on both sides 
(although, as I said in my previous email, we need to account for the fact that 
Thrift has current users to support so backwards-compatibility will need to be 
a consideration).

> Communities form around code, and, if Avro's code is largely disjoint
> from Thrift's, we should not assume that everyone in the Thrift
> community cares about Avro or vice versa.

IMO communities form around shared goals and purposes. Code and designs are 
created to achieve those purposes; they are also malleable and can be bent to 
achieve new goals and purposes. If we can find common cause, then we form a 
common community.

You have some features that you want to satisfy for Hadoop's purposes: compact 
serialization of large files containing many records of identical structure; 
partial deserialization in support of projection; dynamic interpretation of 
object schemas; better/more efficient RPC -- all delivered across multiple 
languages. The first three are also use cases that are of interest to some 
portion of the Thrift community and the fourth is something that Thrift already 
provides.

Avro at this point is fairly nascent -- you have a design, some code, a couple 
of developers, and a target group of future users who seem very receptive to 
what you are working on. You do not have current users, however, and that 
should mean that you have some degree of flexibility to your design where it 
doesn't make a material difference to the use cases you are trying to solve.

If you are willing to make some modifications to that design and
code, the work on Avro could also work directly towards extending
Thrift's functionality. I am pretty certain that the Thrift community
would be willing to make some reasonable modifications and extensions
to Thrift to smooth the way for this as well.

I think that by working closely with the Thrift community directly in the 
Thrift code base, you will get several significant benefits. You will be able 
to directly leverage the transport and server implementations in Thrift today, 
and you will benefit from any future work in this area. You will have a built-in 
set of developers and committers across many languages who are already familiar 
with issues in cross-language serialization (and I agree with Kevin that this 
is not as portable as you seem to think it is). You will be able to avoid 
writing many parts of an RPC framework in multiple languages that you would 
otherwise need to write to make Avro a stand-alone solution for Hadoop. You would have a 
significant role in shaping the direction of Thrift to make sure that it 
remains a strong solution for Hadoop.

> I've said that I think Avro is fundamentally different from Thrift.  Avro
> is a specific format, Thrift is a generic API for various formats, none like 
> Avro.

It is clear to me that a slightly modified version of Avro's data format should 
fit just fine as a Thrift TProtocol implementation. Out of the box this would, 
of course, only provide for statically generated bindings, but this is enough 
to satisfy the first of the desired features I described above.

The second feature, partial deserialization, is a feature that I would like to 
see in Thrift for a variety of use cases, not just your projection use case -- 
for example, message routing where only a message header is deserialized to 
determine where to pass along an otherwise uninterpreted block of data. This 
feature is not tightly coupled to the Avro data format in any way. As you have 
stated, this is possible to do when you have the schema in hand. Note that the 
static bindings in Thrift are another way that the schema can be transmitted -- 
in fact, the whole schema could just be retrievable from the bindings directly 
and fed into whatever mechanism is available for dynamic interpretation. But we 
wouldn't have to go so far as that for field lookup by name -- as Kevin 
pointed out, the Java and Ruby Thrift libraries already have mechanisms for 
sufficient introspection to accomplish the right kind of lookups, I believe, 
and the other libraries could be extended to do the same quite easily. So 
partial deserialization can be supported via dynamic interpretation and/or via 
introspection features of the static bindings.
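
To make the routing case concrete, here is a minimal sketch in Python. It 
assumes a made-up wire layout (a 4-byte big-endian header length, a UTF-8 
routing key, then an opaque body); this is not Thrift's or Avro's actual 
encoding, and the names encode_message and peek_route are hypothetical. The 
point is only that the router reads the header and never interprets the body.

```python
import struct
from typing import Tuple


def encode_message(route: str, payload: bytes) -> bytes:
    """Prefix an opaque payload with a length-delimited routing key.

    Hypothetical layout: [4-byte big-endian header length][header][payload].
    """
    header = route.encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def peek_route(message: bytes) -> Tuple[str, bytes]:
    """Decode only the routing key; return the body as uninterpreted bytes."""
    (header_len,) = struct.unpack_from(">I", message, 0)
    start = 4
    end = start + header_len
    route = message[start:end].decode("utf-8")
    return route, message[end:]


# A router would call peek_route and forward the body untouched.
route, body = peek_route(encode_message("billing", b"\x00\x01opaque"))
```

The same idea would carry over to a real protocol as long as the header 
fields can be located without decoding everything that follows them.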

To support the third use case, dynamic schema interpretation, there is 
definitely significant new code to be written. Note that this code is 
essentially the same code wherever you are writing it. Whatever work you are 
doing in Avro to be able to dynamically interpret JSON IDL could just be 
directly implemented in Thrift -- we would just define a JSON version of the 
Thrift IDL which would look a lot like Avro's IDL. To help further with 
interoperability we could make the Thrift compiler generate the JSON IDL from 
the Thrift IDL as another output target.
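
As a rough illustration of what "dynamically interpreting a JSON IDL" means, 
here is a small Python sketch. The JSON schema shape below is only loosely 
modeled on the idea of a JSON IDL and is an assumption, as is the toy binary 
encoding (fixed 4-byte ints, length-prefixed UTF-8 strings); neither is 
Avro's or Thrift's real format. The encoder and decoder have no generated 
code and are driven entirely by the schema loaded at runtime.

```python
import json
import struct

# Hypothetical JSON schema; the field names and type names are illustrative.
SCHEMA_JSON = """
{"name": "User",
 "fields": [{"name": "id", "type": "int"},
            {"name": "name", "type": "string"}]}
"""


def encode(schema: dict, record: dict) -> bytes:
    """Serialize a dict by walking the schema's fields in order."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "int":
            out += struct.pack(">i", value)
        elif field["type"] == "string":
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
        else:
            raise ValueError("unsupported type: " + field["type"])
    return bytes(out)


def decode(schema: dict, data: bytes) -> dict:
    """Rebuild a dict from bytes, using only the schema to find the fields."""
    record = {}
    offset = 0
    for field in schema["fields"]:
        if field["type"] == "int":
            (value,) = struct.unpack_from(">i", data, offset)
            offset += 4
        elif field["type"] == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            value = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError("unsupported type: " + field["type"])
        record[field["name"]] = value
    return record


schema = json.loads(SCHEMA_JSON)
roundtrip = decode(schema, encode(schema, {"id": 7, "name": "doug"}))
```

Nothing in this interpreter depends on the wire format chosen, which is why 
the same code could sit behind either project's IDL.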

The basic upshot of the above is that it is not that hard to see how Avro could 
be directly integrated into Thrift if you were willing to entertain that 
option, and I believe that you would see significant benefits that would more 
than offset the impact on your own ease of development, about which you 
expressed concerns.

To touch on a couple of specific responses from your previous email to me:

>> If the field IDs are really so
>> objectionable, Thrift could allow them to be optional for purely
>> dynamic usages.
>
> Optional features increase compatibility complexity and are harder
> to maintain and test. A Thrift IDL without numbers would not provide
> versioning features to non-dynamic languages.

Let me rephrase my suggestion because I think I may not have put it across as 
clearly as I could have. I am proposing that the IDL would only allow for field 
IDs to be omitted in the case where the schema was being interpreted 
dynamically -- no static bindings could be generated from IDL without fully 
specified field IDs. So if you are only interested in dynamic interpretation, 
you never have to look at or even think about field IDs. Does that in any way 
alter your stance here?
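
In case it helps, the rule I have in mind can be sketched in a few lines of 
Python. The function name validate_fields, the "static"/"dynamic" mode 
strings, and the dict-based field representation are all hypothetical; this 
is not how the Thrift compiler is structured, just the check itself.

```python
def validate_fields(fields, mode):
    """Return the names of fields without IDs.

    Missing IDs are tolerated for dynamic interpretation but rejected
    when static bindings are to be generated.
    """
    missing = [f["name"] for f in fields if f.get("id") is None]
    if mode == "static" and missing:
        raise ValueError(
            "field IDs required for static bindings: " + ", ".join(missing)
        )
    return missing


# A schema written without IDs is accepted for purely dynamic use.
fields = [{"name": "user", "id": 1}, {"name": "email"}]
missing_for_dynamic = validate_fields(fields, "dynamic")
```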

> It could be a floor wax and a dessert topping!

Love the SNL reference, but I don't think it is really apropos. My vision for 
Thrift with Avro's features folded in as a unified framework for cross-language 
serialization, covering a variety of use cases, is not jamming two completely 
heterogeneous things together. I can easily see wanting to take structures 
represented in one serialization format from disk and send them out over RPC. 
Thrift provides the means to do this kind of thing seamlessly, with formats 
appropriate to both use cases, rather than selecting a format that is good for 
one use case and so-so for the other.

Chad



----- Original Message ----
From: Doug Cutting <[email protected]>
To: [email protected]
Sent: Monday, April 6, 2009 9:15:01 PM
Subject: Re: [PROPOSAL] new subproject: Avro

Kevin Clark wrote:
> The overhead for those people (or some
> equivalent group) to pay attention to another mailing list, another
> bug tracker, another irc channel, and another community isn't trivial.

Communities form around code, and, if Avro's code is largely disjoint from 
Thrift's, we should not assume that everyone in the Thrift community cares 
about Avro or vice versa.

> Of course, this assumes that one of the primary goals of Avro is to be
> cross language. Is that the case, or have I misunderstood?

Yes, that is a goal.

> It would be perfectly reasonable for Hadoop to specify that they
> use the Avro data format for transmissions, and the cross language
> library to provide the API could be Thrift. I think you said something
> similar in your post, but if not please do clarify.

Yes, perhaps this could be done.  I am not convinced that TProtocol is an ideal 
API for reading and writing Avro data, but it could perhaps be made to work 
reasonably well.

> That being said, I'm fairly confident we'll be providing an Avro
> protocol on our own at some point if you're not interested in working
> together. But I think if we go down that path we're doing a disservice
> to users of both Thrift and Avro.

I have never said I was not interested in working together.  I've said that I 
think Avro is fundamentally different from Thrift.  Avro is a specific format, 
Thrift is a generic API for various formats, none like Avro.  They might be 
made to work together.  But at this point I see no point in forcing them 
together.  If TProtocol's API is a good match for Avro's format and features, 
then it should be easy for folks to implement TProtocol using Avro's code and 
include Avro in Thrift.  If the match is not good then perhaps we can adjust 
Thrift and/or Avro to improve it.

Doug
