Owen O'Malley wrote:
2. Protocol buffers (and Thrift) encode the field names as id numbers. That means that if you read them into a dynamic language like Python, it has to use the field numbers instead of the field names. In Avro, the field names are saved and there are no field ids.

This hints at a related problem with Thrift and Protocol Buffers: they require you to generate code for each datatype you process. That is awkward in dynamic environments, where you would like to write a script (Pig, Python, Perl, Hive, whatever) that processes input data and generates output data, without having to locate the IDL for each input file, run an IDL compiler, load the generated code, write an IDL file for the output, run the compiler again, load the output code, and only then write your output. Avro instead lets you simply open your inputs, examine their datatypes, specify output types, and write them.
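That workflow can be sketched with a toy self-describing format (this is not Avro's actual file format or API; the container and function names here are illustrative only): the schema travels with the data, so a script discovers the input's field names and types from the file itself, and writes typed output, with no IDL compiler in the loop.

```python
# Toy sketch of a self-describing container, in the spirit of an Avro data
# file (NOT Avro's real encoding): the first line is the JSON schema, and
# each following line is one JSON record. A script can open the input,
# inspect its datatype, and write typed output without generated classes.
import io
import json

def write_container(buf, schema, records):
    buf.write(json.dumps(schema) + "\n")
    for rec in records:
        buf.write(json.dumps(rec) + "\n")

def read_container(buf):
    lines = buf.getvalue().splitlines()
    schema = json.loads(lines[0])          # the datatype comes from the file itself
    records = [json.loads(line) for line in lines[1:]]
    return schema, records

in_schema = {"type": "record", "name": "User",
             "fields": [{"name": "name", "type": "string"},
                        {"name": "age", "type": "long"}]}
inp = io.StringIO()
write_container(inp, in_schema, [{"name": "alice", "age": 30},
                                 {"name": "bob", "age": 25}])

# Examine the input's type, then specify an output type and write it.
schema, records = read_container(inp)
field_names = [f["name"] for f in schema["fields"]]
out_schema = {"type": "record", "name": "Name",
              "fields": [{"name": "name", "type": "string"}]}
out = io.StringIO()
write_container(out, out_schema, [{"name": r["name"]} for r in records])
```

The point is only the shape of the workflow: open, inspect, transform, write, with the schema carried alongside the data rather than compiled into the program.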

Avro's Java implementation currently includes three different data representations:

- a "generic" representation uses a standard set of data structures for all datatypes: records are represented as Map<String,Object>, arrays as List<Object>, longs as Long, etc.

- a "reflect" representation uses Java reflection to permit one to read and write existing Java classes with Avro.

- a "specific" representation generates Java classes that are compiled and loaded, much like Thrift and Protocol Buffers.
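A rough Python analogue of the "generic" representation (names here are illustrative, not Avro's API) makes the idea concrete: one fixed set of data structures covers every schema, so no per-type classes ever need to be generated.

```python
# Rough analogue of the "generic" representation: a single fixed set of
# data structures handles any schema. Records become dicts, arrays become
# lists, scalars become plain Python values. (Illustrative, not Avro's API.)
def generic_instance(schema):
    """Build an empty generic value for any schema, recursively."""
    if isinstance(schema, dict) and schema["type"] == "record":
        return {f["name"]: generic_instance(f["type"])
                for f in schema["fields"]}
    if isinstance(schema, dict) and schema["type"] == "array":
        return []
    t = schema["type"] if isinstance(schema, dict) else schema
    return {"long": 0, "int": 0, "double": 0.0,
            "string": "", "boolean": False}[t]

user = generic_instance(
    {"type": "record", "name": "User",
     "fields": [{"name": "name", "type": "string"},
                {"name": "friends",
                 "type": {"type": "array", "items": "string"}}]})
```

Because the structures are the language's own dicts and lists, any script can manipulate decoded data directly by field name.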

We don't expect most scripting languages to use more than a single representation. Implementing Avro is quite simple, by design. We have a Python implementation, and hope to add more soon.

Doug
