Doug,

> I have never said I was not interested in working together.
That's great -- glad to hear that you are open to collaboration. My concern is that by making a separate (sub)project, however, it may be difficult for us to work together in practice, and in particular it may be difficult for Thrift to leverage Avro's source code.

> I've said that I think Avro is fundamentally different from Thrift.
> Avro is a specific format, Thrift is a generic API for various formats, none like Avro
> They might be made to work together. But at this point I see no point in forcing
> them together.

I don't think that they are as far apart as you are making it sound with this statement. I do think, however, that it will be very difficult for them to work together properly if the goal of code reuse by Thrift is not an explicit goal of Avro. The easiest way I can come up with to guarantee this is simply to incorporate Avro's feature set into Thrift. If you have other mechanisms for doing this, I'd love to hear them.

> If TProtocol's API is a good match for Avro's format and features,
> then it should be easy for folks to implement TProtocol using Avro's code and
> include Avro in Thrift. If the match is not good then perhaps we can adjust Thrift
> and/or Avro to improve it.

Absolutely. And right now, there are sufficient differences in the type system and other areas that do require some adjustments, likely some on both sides (although, as I said in my previous email, we need to account for the fact that Thrift has current users to support so backwards-compatibility will need to be a consideration).

> Communities form around code, and, if Avro's code is largely disjoint
> from Thrift's, we should not assume that everyone in the Thrift
> community cares about Avro or vice versa.

IMO communities form around shared goals and purposes. Code and designs are created to achieve those purposes; they are also malleable and can be bent to achieve new goals and purposes. If we can find common cause, then we form a common community.
You have some features that you want to satisfy for Hadoop's purposes: compact serialization of large files containing many records of identical structure; partial deserialization in support of projection; dynamic interpretation of object schemas; better/more efficient RPC -- all delivered across multiple languages. The first three are also use cases that are of interest to some portion of the Thrift community and the fourth is something that Thrift already provides.

Avro at this point is fairly nascent -- you have a design, some code, a couple of developers, and a target group of future users who seem very receptive to what you are working on. You do not have current users, however, and that should mean that you have some degree of flexibility in your design where it doesn't make a material difference to the use cases you are trying to solve. If you are willing to make some modifications to that design and code, the work on Avro could also work directly towards extending Thrift's functionality. I am pretty certain that the Thrift community would be willing to make some reasonable modifications and extensions to Thrift to smooth the way for this as well.

I think that by working closely with the Thrift community directly in the Thrift code base, you will get several significant benefits. You will be able to directly leverage the transport and server implementations in Thrift today, and you will also benefit from any future work in this area. You will have a built-in set of developers and committers across many languages who are already familiar with issues in cross-language serialization (and I agree with Kevin that this is not as portable as you seem to think it is). You will be able to avoid writing lots of parts of an RPC framework in multiple languages that you would need to write to make Avro a stand-alone solution for Hadoop. You would have a significant role in shaping the direction of Thrift to make sure that it remains a strong solution for Hadoop.
> I've said that I think Avro is fundamentally different from Thrift. Avro
> is a specific format, Thrift is a generic API for various formats, none like
> Avro.

It is clear to me that a slightly modified version of Avro's data format should fit just fine as a Thrift TProtocol implementation. Out of the box this would, of course, only provide for statically generated bindings, but this is enough to satisfy the first of the desired features I described above.

The second feature, partial deserialization, is a feature that I would like to see in Thrift for a variety of use cases, not just your projection use case -- for example, message routing where only a message header is deserialized to determine where to pass along an otherwise uninterpreted block of data. This feature is not tightly coupled to the Avro data format in any way. As you have stated, this is possible to do when you have the schema in hand. Note that the static bindings in Thrift are another way that the schema can be transmitted -- in fact, the whole schema could just be retrievable from the bindings directly and fed into whatever mechanism is available for dynamic interpretation. But we wouldn't have to go so far as that for field lookup by name -- as Kevin pointed out, the Java and Ruby Thrift libraries already have mechanisms for sufficient introspection to accomplish the right kind of lookups, I believe, and the other libraries could be extended to do the same quite easily. So partial deserialization can be supported via dynamic interpretation and/or via introspection features of the static bindings.

To support the third use case, dynamic schema interpretation, there is definitely significant new code to be written. Note that this code is essentially the same code wherever you are writing it.
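To make the routing case concrete, here is a minimal Python sketch of schema-driven partial deserialization. It is only an illustration under stated assumptions: the tagless, length-prefixed layout is invented for the example and is neither Avro's nor Thrift's wire format, and the names MESSAGE_SCHEMA, encode, and project are hypothetical.

```python
import struct

# Hypothetical schema: ordered (name, type) pairs. Only "int" and "string"
# exist in this toy layout, which is NOT Avro's or Thrift's actual format.
MESSAGE_SCHEMA = [("route", "string"), ("priority", "int"), ("payload", "string")]


def encode(schema, record):
    """Serialize a dict field by field, in schema order, with no field tags."""
    out = bytearray()
    for name, ftype in schema:
        value = record[name]
        if ftype == "int":
            out += struct.pack(">i", value)  # 4-byte big-endian signed int
        elif ftype == "string":
            raw = value.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length prefix, then bytes
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return bytes(out)


def project(schema, data, wanted):
    """Decode only the fields named in `wanted`; step over everything else."""
    wanted = set(wanted)
    result = {}
    offset = 0
    for name, ftype in schema:
        if ftype == "int":
            if name in wanted:
                (result[name],) = struct.unpack_from(">i", data, offset)
            offset += 4
        elif ftype == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            if name in wanted:
                result[name] = data[offset:offset + length].decode("utf-8")
            offset += length  # skipped strings are never decoded
        else:
            raise ValueError(f"unsupported type: {ftype}")
        if len(result) == len(wanted):
            break  # every requested field has been read; ignore the rest
    return result


if __name__ == "__main__":
    blob = encode(
        MESSAGE_SCHEMA,
        {"route": "queue-a", "priority": 3, "payload": "x" * 1000},
    )
    # A router reads only the header field and forwards `blob` unchanged.
    print(project(MESSAGE_SCHEMA, blob, {"route"}))
```

Because the schema drives the reader, the same loop serves both projection and header-only routing; whether the schema comes from a JSON IDL or from introspection of static bindings does not change this code.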
Whatever work you are doing in Avro to be able to dynamically interpret JSON IDL could just be directly implemented in Thrift -- we would just define a JSON version of the Thrift IDL which would look a lot like Avro's IDL. To help further with interoperability we could make the Thrift compiler generate the JSON IDL from the Thrift IDL as another output target.

The basic upshot of the above is that it is not that hard to see how Avro could be directly integrated into Thrift if you were willing to entertain that option, and I believe that you would see significant benefits that would more than offset the impact to your own ease of development about which you expressed concerns.

To touch on a couple of specific responses from your previous email to me:

>> If the field IDs are really so
>> objectionable, Thrift could allow them to be optional for purely
>> dynamic usages.
>
> Optional features increase compatibility complexity and are harder
> to maintain and test. A Thrift IDL without numbers would not provide
> versioning features to non-dynamic languages.

Let me rephrase my suggestion because I think I may not have put it across as clearly as I could have. I am proposing that the IDL would only allow for field IDs to be omitted in the case where the schema was being interpreted dynamically -- no static bindings could be generated from IDL without fully specified field IDs. So if you are only interested in dynamic interpretation, you never have to look at or even think about field IDs. Does that in any way alter your stance here?

> It could be a floor wax and a dessert topping!

Love the SNL reference, but I don't think it is really apropos. My vision for Thrift with Avro's features folded in, as a unified framework for cross-language serialization covering a variety of use cases, is not jamming two completely heterogeneous things together. I can easily see wanting to take structures represented in one serialization format from disk and send them out over RPC.
Thrift provides the means to do this kind of thing seamlessly, with formats appropriate to both use cases, rather than selecting a format that is good for one use case and so-so for the other.

Chad

----- Original Message ----
From: Doug Cutting <[email protected]>
To: [email protected]
Sent: Monday, April 6, 2009 9:15:01 PM
Subject: Re: [PROPOSAL] new subproject: Avro

Kevin Clark wrote:
> The overhead for those people (or some
> equivalent group) to pay attention to another mailing list, another
> bug tracker, another irc channel, and another community isn't trivial.

Communities form around code, and, if Avro's code is largely disjoint from Thrift's, we should not assume that everyone in the Thrift community cares about Avro or vice versa.

> Of course, this assumes that one of the primary goals of Avro is to be
> cross language. Is that the case, or have I misunderstood?

Yes, that is a goal.

> It would be perfectly reasonable for Hadoop to specify that they
> use the Avro data format for transmissions, and the cross language
> library to provide the API could be Thrift. I think you said something
> similar in your post, but if not please do clarify.

Yes, perhaps this could be done. I am not convinced that TProtocol is an ideal API for reading and writing Avro data, but it could perhaps be made to work reasonably well.

> That being said, I'm fairly confident we'll be providing an Avro
> protocol on our own at some point if you're not interested in working
> together. But I think if we go down that path we're doing a disservice
> to users of both Thrift and Avro.

I have never said I was not interested in working together. I've said that I think Avro is fundamentally different from Thrift. Avro is a specific format, Thrift is a generic API for various formats, none like Avro. They might be made to work together. But at this point I see no point in forcing them together.
If TProtocol's API is a good match for Avro's format and features, then it should be easy for folks to implement TProtocol using Avro's code and include Avro in Thrift. If the match is not good then perhaps we can adjust Thrift and/or Avro to improve it.

Doug
