Hi all,
We're looking at creating a Cascading Scheme for Avro, and have a few
questions below. These are very general, as this is more of a scoping
phase (as in, are we crazy to try this?), so apologies in advance for
the lack of detail.
For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple,
which corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping
from Cascading-land to/from the storage format.
So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.
An update on status: there's a working Cascading tap at
http://github.com/bixolabs/cascading.avro. See the README
(http://github.com/bixolabs/cascading.avro/blob/master/README) for
more details.
One open issue: it would be great to be able to set metadata in the
headers of the resulting Avro files. But it wasn't obvious how to do
that, given our (intentionally) arms-length approach of using the
Avro mapred code as-is.
One idea would be to have job conf values with keys prefixed by
avro.metadata.xxx, which the Avro mapred support could automagically
copy into the file headers when creating the file. But this would
break our goal of using unmodified Avro source, so I'm curious
whether support for setting the file metadata would also be useful
for the standard (Hadoop) use of Avro as an output format, and if
so, whether there's a better approach.
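As a rough sketch of what that idea might look like (using a plain Map in place of Hadoop's JobConf, and with the avro.metadata. prefix being purely hypothetical), the output format could collect the prefixed keys and then hand each entry to Avro's DataFileWriter.setMeta() before writing records:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class AvroMetadataConf {
    // Hypothetical key prefix -- not part of the actual Avro mapred code.
    static final String PREFIX = "avro.metadata.";

    // Pull user metadata entries out of job-conf-style key/value pairs.
    // In a real output format, each resulting entry would be passed to
    // DataFileWriter.setMeta(key, value) before records are written.
    static Map<String, String> extractMetadata(Map<String, String> conf) {
        Map<String, String> meta = new TreeMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                meta.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.output.dir", "/tmp/out");
        conf.put("avro.metadata.creator", "cascading.avro");
        conf.put("avro.metadata.build", "42");
        System.out.println(extractMetadata(conf));
        // prints {build=42, creator=cascading.avro}
    }
}
```

The upside of a convention like this is that the Cascading scheme would only need to set conf values, with no changes on its side; the downside is exactly the one above, that it requires a patch to the Avro mapred code to do the scan-and-copy.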
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g