Owen O'Malley wrote:
2. Protocol buffers (and Thrift) encode the field names as id numbers. That means that if you read them into a dynamic language like Python, it has to use the field numbers instead of the field names. In Avro, the field names are saved and there are no field ids.

This hints at a related problem with Thrift and Protocol Buffers: they require you to generate code for each datatype you process. That is awkward in dynamic environments, where you would like to write a script (Pig, Python, Perl, Hive, whatever) that processes input data and generates output data, without having to locate the IDL for each input file, run an IDL compiler, load the generated code, write an IDL file for the output, run the compiler again, load the output code, and only then write your output. Avro instead lets you simply open your inputs, examine their datatypes, specify output types, and write them.
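That workflow can be sketched with a toy self-describing format (this is not Avro's actual file format or API; the container and function names here are illustrative only): the schema travels with the data, so a script discovers the input's field names and types from the file itself, and writes typed output, with no IDL compiler in the loop.

```python
# Toy sketch of a self-describing container, in the spirit of an Avro data
# file (NOT Avro's real encoding): the first line is the JSON schema, and
# each following line is one JSON record. A script can open the input,
# inspect its datatype, and write typed output without generated classes.
import io
import json

def write_container(buf, schema, records):
    buf.write(json.dumps(schema) + "\n")
    for rec in records:
        buf.write(json.dumps(rec) + "\n")

def read_container(buf):
    lines = buf.getvalue().splitlines()
    schema = json.loads(lines[0])          # the datatype comes from the file itself
    records = [json.loads(line) for line in lines[1:]]
    return schema, records

in_schema = {"type": "record", "name": "User",
             "fields": [{"name": "name", "type": "string"},
                        {"name": "age", "type": "long"}]}
inp = io.StringIO()
write_container(inp, in_schema, [{"name": "alice", "age": 30},
                                 {"name": "bob", "age": 25}])

# Examine the input's type, then specify an output type and write it.
schema, records = read_container(inp)
field_names = [f["name"] for f in schema["fields"]]
out_schema = {"type": "record", "name": "Name",
              "fields": [{"name": "name", "type": "string"}]}
out = io.StringIO()
write_container(out, out_schema, [{"name": r["name"]} for r in records])
```

The point is only the shape of the workflow: open, inspect, transform, write, with the schema carried alongside the data rather than compiled into the program.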

Avro's Java implementation currently includes three different data representations:

- a "generic" representation uses a standard set of data structures for all datatypes: records are represented as Map<String,Object>, arrays as List<Object>, longs as Long, etc.

- a "reflect" representation uses Java reflection to permit one to read and write existing Java classes with Avro.

- a "specific" representation generates Java classes that are compiled and loaded, much like Thrift and Protocol Buffers.
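A rough Python analogue of the "generic" representation (names here are illustrative, not Avro's API) makes the idea concrete: one fixed set of data structures covers every schema, so no per-type classes ever need to be generated.

```python
# Rough analogue of the "generic" representation: a single fixed set of
# data structures handles any schema. Records become dicts, arrays become
# lists, scalars become plain Python values. (Illustrative, not Avro's API.)
def generic_instance(schema):
    """Build an empty generic value for any schema, recursively."""
    if isinstance(schema, dict) and schema["type"] == "record":
        return {f["name"]: generic_instance(f["type"])
                for f in schema["fields"]}
    if isinstance(schema, dict) and schema["type"] == "array":
        return []
    t = schema["type"] if isinstance(schema, dict) else schema
    return {"long": 0, "int": 0, "double": 0.0,
            "string": "", "boolean": False}[t]

user = generic_instance(
    {"type": "record", "name": "User",
     "fields": [{"name": "name", "type": "string"},
                {"name": "friends",
                 "type": {"type": "array", "items": "string"}}]})
```

Because the structures are the language's own dicts and lists, any script can manipulate decoded data directly by field name.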

We don't expect most scripting languages to use more than a single representation. Implementing Avro is quite simple, by design. We have a Python implementation, and hope to add more soon.

Doug
