On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:
George Porter wrote:
While this representation would certainly be as compact as
possible, wouldn't it prevent evolving the data structure over
time? One of the nice features of Google Protocol Buffers and
Thrift is that you can evolve the set of fields over time, and
older/newer clients can talk to older/newer services. If the
proposed Avro is evolvable, then perhaps I'm misunderstanding your
statement about the lack of IDs in the serialized data.
Avro supports schema evolution. In Avro, the schema used to write
the data must be available when the data is read. (In files, it is
typically stored in the file metadata.)
If you have the schema that was used to write the data, and you're
expecting a slightly different schema, then you simply keep those
fields that are in both schemas and skip those that are not. This is
equivalent to the schema evolution support in Thrift and Protocol
Buffers, but does not require manually assigning numeric field IDs.
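
To illustrate the resolution rule just described, here is a minimal
sketch in Python. It is a toy model, not Avro's actual API or wire
format: the list-of-pairs schema representation and the write_record
and read_record helpers are hypothetical names made up for this
example. The point is only that the serialized bytes carry no field
names or IDs, and the reader matches fields by name using the writer's
schema.

```python
import io
import struct

# Toy model only: NOT Avro's real encoding or API.
# A "schema" is an ordered list of (field_name, type) pairs.


def write_record(schema, record):
    """Serialize values in schema order, with no field names or ids."""
    out = io.BytesIO()
    for name, ftype in schema:
        value = record[name]
        if ftype == "long":
            out.write(struct.pack(">q", value))
        elif ftype == "string":
            data = value.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return out.getvalue()


def read_record(writer_schema, reader_schema, payload):
    """Walk the payload using the writer's schema; keep only fields
    that the reader's schema also names, and skip the rest."""
    wanted = {name for name, _ in reader_schema}
    buf = io.BytesIO(payload)
    result = {}
    for name, ftype in writer_schema:
        if ftype == "long":
            (value,) = struct.unpack(">q", buf.read(8))
            if name in wanted:
                result[name] = value
        elif ftype == "string":
            (length,) = struct.unpack(">I", buf.read(4))
            if name in wanted:
                result[name] = buf.read(length).decode("utf-8")
            else:
                buf.seek(length, io.SEEK_CUR)  # skip without decoding
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return result


# The writer and reader disagree slightly about the set of fields.
writer_schema = [("id", "long"), ("name", "string"), ("email", "string")]
reader_schema = [("id", "long"), ("email", "string"), ("phone", "string")]

payload = write_record(
    writer_schema, {"id": 7, "name": "George", "email": "g@example.org"}
)
record = read_record(writer_schema, reader_schema, payload)
print(record)
```

In this sketch "name" is skipped because the reader does not expect
it, and "phone" is simply absent because the writer never wrote it.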
This feature can also be used to support projection. If you have
records with many large fields, but only need a single field in a
particular computation, then you can specify an expected schema with
only that field, and the runtime will efficiently skip all of the
other fields, returning a record with just the single, expected field.
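
A rough sketch of projection, again as a toy model rather than Avro's
real encoding or API. It assumes every field is a length-prefixed byte
string, so an unwanted field can be skipped with a seek; the encode
and project helpers and the field names are hypothetical.

```python
import io
import struct

# Toy model only: every field is a 4-byte length followed by raw bytes.


def encode(field_order, record):
    """Write fields in the writer's order, with no names or ids."""
    out = io.BytesIO()
    for name in field_order:
        out.write(struct.pack(">I", len(record[name])))
        out.write(record[name])
    return out.getvalue()


def project(field_order, expected, stream):
    """Return (record, bytes_read). Only fields named in `expected`
    are read; all others are skipped by seeking past them."""
    result = {}
    bytes_read = 0
    for name in field_order:
        (length,) = struct.unpack(">I", stream.read(4))
        bytes_read += 4
        if name in expected:
            result[name] = stream.read(length)
            bytes_read += length
        else:
            stream.seek(length, io.SEEK_CUR)  # large field never loaded
    return result, bytes_read


writer_fields = ["title", "body", "thumbnail"]
big_record = {
    "title": b"Avro",
    "body": b"x" * 1_000_000,
    "thumbnail": b"y" * 500_000,
}
payload = encode(writer_fields, big_record)

# The expected schema names a single field.
projected, bytes_read = project(writer_fields, {"title"}, io.BytesIO(payload))
print(projected, bytes_read, len(payload))
```

Here only the three length prefixes and the four bytes of "title" are
read, out of a payload of roughly 1.5 MB. How cheaply a field can be
skipped in practice depends on the encoding of its type.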
Thanks for the clarification. I now better understand the schema
relationship. Projection is a nice feature, especially since it
seems like it could support "sparse files" where you want to just
peek at large structs without incurring a lot of disk I/O (for data
serialized on disk).
I also agree with Bryan, in that it would be unfortunate to have
two different Apache projects with overlapping goals.
We already have both Thrift and Etch in the incubator, which have
similar goals. Apache does not attempt to mandate that projects
have disjoint goals. There are many ways to slice things, and
Apache prefers to rely on survival of the fittest rather than
forcing things together.
Regardless of features, both Protocol Buffers and Thrift have the
advantage of having been debugged in mission-critical production
environments.
Yes, but, as I've argued in other messages in this thread, they do
not support the dynamic features we need. Adding those features
would add new code that would share little with existing code in
those projects. So, while the projects are conceptually similar, the
implementations are necessarily different, and, without significant
code overlap, separate projects seem more natural.
Doug
Makes sense. Thanks,
George