Thanks for that, Ted. Correct - an internal wire format doesn't mean 'Drill only supports protobuf-encoded data'.
Part of the reason to favor protobuf is that a lot of people in the broader 'big data' community are building a lot of experience with it. Hadoop and HBase have both moved, or are moving, to protobuf on the wire. Being able to leverage this expertise is valuable.

There is a JIRA in Hadoop-land where someone did a deep-dive 'bake-off' between thrift, protobuf and avro. The ultimate choice was protobuf for a number of reasons. If people want to re-do the analysis, I'd like to see it in the context of THAT analysis (e.g., why the assumptions there don't hold for Drill)... if nothing else, it would give concrete form to what can be a mire.

For what it's worth, I've had many discussions along these lines with a variety of people, including committers on Thrift, and the consensus is that both are good choices.

-ryan

On Fri, Sep 14, 2012 at 2:31 PM, Ted Dunning <[email protected]> wrote:
> I think that it is important to ask a few questions leading up to a decision
> here.
>
> The first is a (rhetorical) show of hands about how many people believe
> that there are no serious performance or expressivity killers when
> comparing alternative serialization frameworks.  As far as I know,
> performance differences are not massive (and protobufs is one of the
> leaders in any case) and the expressivity differences are essentially nil.
> If somebody feels that there is a serious show-stopper with any option,
> they should speak.
>
> The second is to ask the sense of the community whether they judge progress
> or perfection in this decision to be most important to the project.  My guess
> is that almost everybody would prefer to see progress as long as the
> technical choice is not subject to some horrid missing bit.
>
> The final question is whether it is reasonable to go along with protobufs
> given that several very experienced engineers prefer it and would like to
> produce code based on it.  If the first two questions are answered to the
> effect that protobufs is about as good as we will find and that progress
> trumps small differences, then it seems that moving to follow this
> preference of Jason and Ryan for protobufs might be a reasonable thing to
> do.
>
> The question of an internal wire format, btw, does not constrain the
> project relative to external access.  I think it is important to support
> JDBC and ODBC and whatever is in common use for querying.  For external
> access the question is quite different.  Whereas for the internal format
> consensus around a single choice has large benefits, the external format
> choice is nearly the opposite.  For an external format, limiting ourselves
> to a single choice seems like a bad idea and increasing the audience seems
> like a better choice.
>
> On Fri, Sep 14, 2012 at 12:44 PM, Ryan Rawson <[email protected]> wrote:
>
>> Hi folks,
>>
>> I just commented on this first JIRA.  Here is my text:
>>
>> This issue has been hashed over a lot in the Hadoop projects.  There
>> was work done to compare thrift vs avro vs protobuf.  The conclusion
>> was to use protobuf.
>>
>> Prior to this move, there had been a lot of noise about pluggable RPC
>> transports, and whatnot.  It held up adoption of a backwards compatible
>> serialization framework for a long time.  The problem ended up being
>> the analysis-paralysis, rather than the specific implementation
>> problem.  In other words, the problem was a LACK of implementation rather
>> than actual REAL problems.
>>
>> Based on this experience, I'd strongly suggest adopting protobuf and
>> moving on.  Forget about pluggable RPC implementations, the complexity
>> doesn't deliver benefits.  The benefit of protobuf is that it's the RPC
>> format for Hadoop and HBase, which allows Drill to draw on the broad
>> experience of those communities who need to implement high performance
>> backwards compatible RPC serialization.
>>
>> ====
>>
>> Expanding a bit, I've looked into this issue a lot, and there are very
>> few significant concrete reasons to choose protobuf vs thrift.  Tiny
>> percent faster of this, and that, etc.  I'd strongly suggest protobuf
>> for the expanded community.  There is no particular Apache imperative
>> that Apache projects re-use libraries.  Use what makes sense for your
>> project.
>>
>> As regards Avro, it's a fine serialization format for long term
>> data retention, but the complexities that exist to enable that make it
>> non-ideal for an RPC.  I know of no one who uses AvroRPC in any form.
>>
>> -ryan
>>
>> On Tue, Sep 4, 2012 at 12:30 PM, Tomer Shiran <[email protected]> wrote:
>> > We plan to propose the architecture and interfaces in the next couple
>> > of weeks, which will make it easy to divide the project into clear
>> > building blocks. At that point it will be easier to start contributing
>> > different data sources, data formats, operators, query languages, etc.
>> >
>> > The contributions are done in the usual Apache way. It's best to open a
>> > JIRA and then post a patch so that others can review and then a committer
>> > can check it in.
>> >
>> > On Tue, Sep 4, 2012 at 12:23 PM, Chandan Madhesia <[email protected]> wrote:
>> >
>> >> Hi
>> >>
>> >> What is the process to become a contributor to Drill?
>> >>
>> >> Regards
>> >> chandan
>> >>
>> >> On Tue, Sep 4, 2012 at 9:51 PM, Ted Dunning <[email protected]> wrote:
>> >>
>> >> > Suffice it to say that if *you* think it is important enough to
>> >> > implement and maintain, then the group shouldn't say nay.  The
>> >> > consensus stuff should only block things that break something else.
>> >> > Additive features that are highly maintainable (or which come with
>> >> > commitments) shouldn't generally be blocked.
>> >> >
>> >> > On Tue, Sep 4, 2012 at 9:14 AM, Michael Hausenblas <[email protected]> wrote:
>> >> >
>> >> > > Good. Feel free to put me down for that, if the group as a whole
>> >> > > thinks that (supporting Thrift) makes sense.
>> >
>> > --
>> > Tomer Shiran
>> > Director of Product Management | MapR Technologies | 650-804-8657
