Re: [PROPOSAL] new subproject: Avro

George Porter Fri, 03 Apr 2009 13:36:51 -0700


On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:

George Porter wrote:
While this representation would certainly be as compact as possible, wouldn't it prevent evolving the data structure over time? One of the nice features of Google Protocol Buffers and Thrift is that you can evolve the set of fields over time, and older/newer clients can talk to older/newer services. If the proposed Avro is evolvable, then perhaps I'm misunderstanding your statement about the lack of IDs in the serialized data.
Avro supports schema evolution. In Avro, the schema used to write the data must be available when the data is read. (In files, it is typically stored in the file metadata.)
If you have the schema that was used to write the data, and you're expecting a slightly different schema, then you simply keep those fields that are in both schemas and skip those not. This is equivalent to Thrift and Protocol Buffer's support for schema evolution, but does not require manually assigning numeric field ids.
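The resolution rule described above can be sketched in a few lines of Python. This is a toy illustration only, not the Avro API: the list-of-names schemas, the field names, and the helpers `write_record` and `read_record` are all hypothetical, and the payload is a plain Python list rather than Avro's binary encoding.

```python
# Toy stand-in for schema resolution; none of these names come from Avro itself.

def write_record(writer_schema, record):
    """Serialize values in writer-schema order; no field names or ids are stored."""
    return [record[name] for name in writer_schema]


def read_record(writer_schema, reader_schema, payload):
    """Decode with the writer's schema, then keep only fields in both schemas."""
    decoded = dict(zip(writer_schema, payload))
    return {name: decoded[name] for name in reader_schema if name in decoded}


writer_schema = ["id", "name", "email"]  # hypothetical schema used to write
reader_schema = ["id", "name", "phone"]  # hypothetical, slightly different schema

payload = write_record(
    writer_schema, {"id": 7, "name": "ann", "email": "a@example.org"}
)
result = read_record(writer_schema, reader_schema, payload)
print(result)
```

The payload carries no field identifiers at all; the reader can still interpret it because it has the writer's schema, which is the point being made about not needing numeric field ids.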
This feature can also be used to support projection. If you have records with many large fields, but only need a single field in a particular computation, then you can specify an expected schema with only that field, and the runtime will efficiently skip all of the other fields, returning a record with just the single, expected field.
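A rough sketch of how such projection might skip unwanted fields without decoding them, assuming each field is stored as a 4-byte big-endian length followed by its bytes. That layout, the field names, and the helpers `encode` and `project` are assumptions made for illustration; the actual Avro binary encoding differs.

```python
import struct


def encode(writer_schema, record):
    """Write each field as a 4-byte length prefix followed by its raw bytes."""
    out = bytearray()
    for name in writer_schema:
        data = record[name]
        out += struct.pack(">I", len(data)) + data
    return bytes(out)


def project(writer_schema, expected_fields, buf):
    """Return only the expected fields, jumping over the rest by length."""
    result = {}
    offset = 0
    for name in writer_schema:
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if name in expected_fields:
            result[name] = buf[offset:offset + length]
        offset += length  # unwanted fields are skipped, never copied or parsed
    return result


writer_schema = ["title", "body", "thumbnail"]  # hypothetical record layout
record = {"title": b"avro", "body": b"x" * 1_000_000, "thumbnail": b"\x00" * 50_000}
buf = encode(writer_schema, record)

projected = project(writer_schema, {"title"}, buf)
print(projected)
```

Only the length prefixes of the large fields are read here; on disk, the same idea would let a reader seek past them rather than load them.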

Thanks for the clarification--I better understand the schema relationship. The projection feature is a nice feature, especially since it seems like it would be able to support "sparse files" where you want to just peek at large structs without invoking a lot of disk-io (for data serialized on-disk).

I also agree with Bryan, in that it would be unfortunate to have two different Apache projects with overlapping goals.
We already have both Thrift and Etch in the incubator, which have similar goals. Apache does not attempt to mandate that projects have disjoint goals. There are many ways to slice things, and Apache prefers to rely on survival of the fittest rather than forcing things together.
Regardless of features, both protocol buffers and thrift have the advantage of being debugged in mission-critical production environments.
Yes, but, as I've argued in other messages in this thread, they do not support the dynamic features we need. Adding those features would add new code that would share little with existing code in those projects. So, while the projects are conceptually similar, the implementations are necessarily different, and, without significant code overlap, separate projects seem more natural.
Doug


Makes sense.  Thanks,
George
