Hi all,

We're looking at creating a Cascading Scheme for Avro, and have a few questions below. These are very general, as this is more of a scoping phase (as in, are we crazy to try this?), so apologies in advance for the lack of detail.

For context, Cascading is an open source project that provides a workflow API on top of Hadoop. The key unit of data is a tuple, which corresponds to a record - it has fields (names) and values. Cascading uses a generalized "tap" concept for reading & writing tuples, where a tap uses a scheme to handle the low-level mapping between Cascading-land and the storage format.
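
To make the tap/scheme split concrete, here's roughly what wiring one up looks like with the stock TextLine scheme (Cascading 1.x-era API; the path is just a placeholder):

    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    // The tap binds a location to a scheme; the scheme does the low-level
    // translation between tuples and the on-disk format.
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs://some/input/path");

The Avro scheme we're after would slot in where TextLine does, mapping Avro records to/from tuple fields.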

So the goal here is to define a Cascading Scheme that will run on Hadoop 0.18.3 and later, and provide general support for reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from org.apache.avro.mapred (thanks Doug & Scott), and began building the Cascading scheme to bridge between AvroWrapper<T> keys and Cascading tuples.
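
The heart of that bridge is fairly small. This isn't the exact code in the scheme, just a sketch of the read side, assuming generic records and the Cascading 1.x source() signature (schema/field handling is hand-waved):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroWrapper;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;

    // The Avro input format hands the scheme an AvroWrapper key whose datum()
    // is the Avro record; we copy its fields into a Cascading Tuple in
    // source-field order.
    public Tuple source(Object key, Object value) {
        GenericRecord record = (GenericRecord) ((AvroWrapper<?>) key).datum();
        Fields fields = getSourceFields();
        Tuple tuple = new Tuple();
        for (int i = 0; i < fields.size(); i++) {
            // Cascading 1.x tuple entries are Comparables, hence the cast.
            tuple.add((Comparable) record.get(fields.get(i).toString()));
        }
        return tuple;
    }

Writing is roughly the reverse: sink() populates a record from the tuple, wraps it in an AvroWrapper, and hands it to the output collector.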

An update on status - there's a working Cascading tap at http://github.com/bixolabs/cascading.avro . See the README (http://github.com/bixolabs/cascading.avro/blob/master/README ) for more details.

One open issue - it would be great to be able to set metadata in the headers of the resulting Avro files. But it wasn't obvious how to do that, given our (intentionally) arm's-length approach of building on the Avro mapred code as-is.
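
For reference, the hook at the container-file level is DataFileWriter.setMeta(), which has to be called before create(); the catch is that the mapred output format builds that writer internally, so there's no obvious place for us to call it. A standalone sketch (the schema and path are just stand-ins):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // Trivial stand-in schema; a real job would use the record schema it writes.
    Schema schema = Schema.parse("{\"type\":\"record\",\"name\":\"Rec\",\"fields\":[]}");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setMeta("generated.by", "cascading.avro");  // must be called before create()
    writer.create(schema, new File("part-00000.avro"));
    writer.close();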

One idea would be to have job conf values with keys prefixed avro.metadata.xxx, which the Avro mapred support could automagically pick up when creating the file. But implementing that ourselves would mean modifying Avro, which breaks our goal of using unmodified Avro source. So I'm curious whether support for setting file metadata would also be useful for the standard (Hadoop) use of Avro as an output format, and if so, whether there's a better approach.
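
To be concrete about the idea (purely illustrative - none of this exists in the Avro mapred code today, and the prefix and helper name are made up):

    import java.util.Map;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical helper that would live in the Avro output format: copy any
    // job conf entries under a reserved prefix into the file metadata, before
    // the writer's create() is called.
    static void copyFileMetadata(JobConf conf, DataFileWriter<?> writer) {
        String prefix = "avro.metadata.";
        // Assumes Configuration's entry iterator is available on the target
        // Hadoop version; otherwise you'd enumerate the keys some other way.
        for (Map.Entry<String, String> entry : conf) {
            if (entry.getKey().startsWith(prefix)) {
                writer.setMeta(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
    }

A job (or our tap) would then just set conf values like avro.metadata.generator=cascading.avro and they'd show up in the file headers.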

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



