Hi all,
We're looking at creating a Cascading Scheme for Avro, and have a few
questions below. These are very general, as this is more of a scoping
phase (as in, are we crazy to try this?), so apologies in advance for
the lack of detail.
For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple,
which corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping
from Cascading-land to/from the storage format.
So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.
An update on status: there's a working Cascading tap at
http://github.com/bixolabs/cascading.avro. See the README
(http://github.com/bixolabs/cascading.avro/blob/master/README) for
more details.
One open issue: it would be great to be able to set metadata in the
headers of the resulting Avro files. But it wasn't obvious how to do
that, given our (intentionally) arms-length approach of using the
Avro mapred code as-is.
One idea would be to have job conf values with keys prefixed by
avro.metadata.xxx, which the Avro mapred support could automagically
copy into the file headers when creating the file. But this would
break our goal of using unmodified Avro source, so I'm curious
whether support for setting the file metadata would also be useful
for the standard (Hadoop) use of Avro as an output format, and if
so, whether there's a better approach.
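As a rough sketch of what that idea might look like (using a plain Map in place of Hadoop's JobConf, and with the avro.metadata. prefix being purely hypothetical), the output format could collect the prefixed keys and then hand each entry to Avro's DataFileWriter.setMeta() before writing records:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class AvroMetadataConf {
    // Hypothetical key prefix -- not part of the actual Avro mapred code.
    static final String PREFIX = "avro.metadata.";

    // Pull user metadata entries out of job-conf-style key/value pairs.
    // In a real output format, each resulting entry would be passed to
    // DataFileWriter.setMeta(key, value) before records are written.
    static Map<String, String> extractMetadata(Map<String, String> conf) {
        Map<String, String> meta = new TreeMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                meta.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.output.dir", "/tmp/out");
        conf.put("avro.metadata.creator", "cascading.avro");
        conf.put("avro.metadata.build", "42");
        System.out.println(extractMetadata(conf));
        // prints {build=42, creator=cascading.avro}
    }
}
```

The upside of a convention like this is that the Cascading scheme would only need to set conf values, with no changes on its side; the downside is exactly the one above, that it requires a patch to the Avro mapred code to do the scan-and-copy.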
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g