Hi all,

We're looking at creating a Cascading Scheme for Avro, and have a few questions below. These are very general, as this is more of a scoping phase (as in, are we crazy to try this?), so apologies in advance for the lack of detail.

For context, Cascading is an open source project that provides a workflow API on top of Hadoop. The key unit of data is a tuple, which corresponds to a record - you have fields (names) and values. Cascading uses a generalized "tap" concept for reading & writing tuples, where a tap uses a scheme to handle the low-level mapping between Cascading-land and the storage format.

So the goal here is to define a Cascading Scheme that will run on 0.18.3 and later versions of Hadoop, and provide general support for reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from org.apache.avro.mapred (thanks Doug & Scott), and began building the Cascading scheme to bridge between AvroWrapper<T> keys and Cascading tuples.
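
Roughly, we're expecting the scheme's source/sink initialization to boil down to pushing the Avro formats and schemas into the JobConf - a minimal sketch below, where the class name and the configureXXX method names are just placeholders of ours:

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch of what we think the scheme's source/sink init needs to push into
// the JobConf. The configureXXX names are placeholders of ours.
public class AvroSchemeConfig {

    // Source side: records should come back as AvroWrapper<T> keys
    // (with NullWritable values) from AvroInputFormat.
    public static void configureSource(JobConf conf, Schema schema) {
        conf.setInputFormat(AvroInputFormat.class);
        AvroJob.setInputSchema(conf, schema);
    }

    // Sink side: we'd emit AvroWrapper<T> keys through AvroOutputFormat.
    public static void configureSink(JobConf conf, Schema schema) {
        conf.setOutputFormat(AvroOutputFormat.class);
        AvroJob.setOutputSchema(conf, schema);
    }
}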

1. What's the best approach if we want to dynamically define the Avro schema, based on a list of field names and types (classes)?

This assumes it's possible to dynamically define & use a schema, of course.
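
To make the question concrete, here's the sort of helper we were imagining - just a sketch, with a made-up class name and an obviously incomplete type mapping, built around Schema.createRecord and Schema.Field:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.avro.Schema;

// Sketch only: build an Avro record schema from Cascading field names + classes.
public class CascadingAvroSchema {

    // Incomplete mapping from Java classes to Avro primitive types.
    private static final Map<Class<?>, Schema.Type> TYPES = new HashMap<Class<?>, Schema.Type>();
    static {
        TYPES.put(String.class, Schema.Type.STRING);
        TYPES.put(Integer.class, Schema.Type.INT);
        TYPES.put(Long.class, Schema.Type.LONG);
        TYPES.put(Float.class, Schema.Type.FLOAT);
        TYPES.put(Double.class, Schema.Type.DOUBLE);
        TYPES.put(Boolean.class, Schema.Type.BOOLEAN);
    }

    public static Schema createSchema(String recordName, String[] fieldNames, Class<?>[] fieldClasses) {
        List<Schema.Field> fields = new ArrayList<Schema.Field>();
        for (int i = 0; i < fieldNames.length; i++) {
            Schema fieldSchema = Schema.create(TYPES.get(fieldClasses[i]));
            fields.add(new Schema.Field(fieldNames[i], fieldSchema, null, null));
        }
        Schema record = Schema.createRecord(recordName, null, "cascading.avro", false);
        record.setFields(fields);
        return record;
    }
}

Is that the intended way to build a schema programmatically, or would it be better to generate the schema JSON and run it through Schema.parse()?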

2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running on 0.18.3, 0.19.2, etc.?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar, and that causing problems. Anything else?

4. The key integration point, besides the fields+classes to schema issue above, is mapping between Cascading tuples and AvroWrapper<T>.

If we're using (I assume) the generic format, any input on how we'd do this two-way conversion?
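
Here's the rough two-way conversion we're picturing with the generic API - class and method names are ours, and we're guessing at some of the details (e.g. Utf8 handling):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;

import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Sketch only: two-way mapping between Cascading tuples and generic Avro records.
public class AvroTupleConverter {

    // Sink side: Cascading tuple -> AvroWrapper around a GenericData.Record.
    public static AvroWrapper<GenericRecord> toAvro(Fields fields, Tuple tuple, Schema schema) {
        GenericRecord record = new GenericData.Record(schema);
        for (int i = 0; i < fields.size(); i++) {
            record.put(fields.get(i).toString(), tuple.get(i));
        }
        return new AvroWrapper<GenericRecord>(record);
    }

    // Source side: AvroWrapper<GenericRecord> key -> Cascading tuple.
    public static Tuple fromAvro(Fields fields, AvroWrapper<GenericRecord> wrapper) {
        GenericRecord record = wrapper.datum();
        Tuple tuple = new Tuple();
        for (int i = 0; i < fields.size(); i++) {
            Object value = record.get(fields.get(i).toString());
            // Guessing that strings come back as Utf8 and need converting,
            // since Cascading tuples hold Comparable values like String.
            if (value instanceof Utf8) {
                value = value.toString();
            }
            tuple.add((Comparable) value);
        }
        return tuple;
    }
}

Does that look like a sane use of the generic API, or is there a better hook we're missing?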

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



