Hi Scott,
Thanks for the response. See below for my comments...
We're looking at creating a Cascading Scheme for Avro, and have got a
few questions below. These are very general, as this is more of a
scoping phase (as in, are we crazy to try this) so apologies in
advance for lack of detail.
For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple, which
corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping
between Cascading-land and the storage format.
I am somewhat familiar with Cascading as a user. I am not familiar
with how it is implemented or how to customize things like a Tap or
Sink.
Correct me if I'm wrong, but its notion of a record is very simple
-- there are no arrays or maps -- just a list of fields.
This maps to Avro easily.
Correct - currently Cascading doesn't have built-in support for
arrays, maps or unions - though I believe arrays & maps are on the list.
So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.
You might be fine without the org.apache.avro.mapred stuff --
specifically if you only need the sinks and taps to use Avro and not
the stuff in between a map and reduce. For example, I have a custom
LoadFunc in Pig that can read/write Avro data files, working off Avro
1.3.0 -- but only with a static schema.
1. What's the best approach if we want to dynamically define the Avro
schema, based on a list of field names and types (classes)?
Creating an Avro schema programmatically is fairly straightforward
-- especially without arrays, maps, or unions. If the code has
access to the Cascading record definition, transforming that into an
Avro schema dynamically should be straightforward. Schema has
various static factory methods for building schemas, and from a
Schema object you can get the JSON representation or just pass
Schema objects around.
We're currently using the string rep, since a Schema isn't
serializable, and Cascading needs that to save the defined workflow in
the job conf.
[snip]
3. Will there be issues with running in 0.18.3, 0.19.2, etc?
I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
which then creates problems. Anything else?
I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and
the only problem was the above library conflict. This is without
the new o.a.avro.mapred stuff however.
Great, good to know.
4. The key integration point, besides the fields+classes to schema
issue above, is mapping between Cascading tuples and AvroWrapper<T>.
If we're using (I assume) the generic format, any input on how we'd
do this two-way conversion?
I'd suggest thinking about using Avro container files for input and
output, which may not require the above depending on how Cascading
is built internally. In Pig, for example, the LoadFunc defines a Pig
schema on input for reading, and everything else from there requires
no change. Although this means the default Pig types and
serialization are used for all the intermediate work, reading and
writing inputs and outputs can be done with Avro with minimal effort.
Cascading is already defining the M/R jobs, the keys, values, etc...
so you may only have to modify the Tap to translate from an Avro
schema to the Cascading record to get it to read or write an Avro
file.
So far one issue is that we need to translate between Cascading
Strings and Avro Utf8 types, but most everything else works just fine.
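That translation is trivial -- something like this (a sketch, with
made-up names):

import org.apache.avro.util.Utf8;

// Two-way String <-> Utf8 translation for tuple/record values.
public class AvroStringConversion {
  // Sinking: Cascading tuple value -> value to put in a GenericData.Record.
  public static Object toAvro(Object value) {
    return (value instanceof String) ? new Utf8((String) value) : value;
  }

  // Sourcing: value read from a GenericData.Record -> Cascading tuple value.
  public static Object fromAvro(Object value) {
    return (value instanceof Utf8) ? value.toString() : value;
  }
}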
One can go farther and use AvroWrapper and o.a.avro.mapred to define
the M/R jobs, enabling a lot of other possibilities. I can't
confidently state what all the requirements are here, outside of
doing the Cascading record <-> Avro schema translation and changing
all the touch points that Cascading has on the K/V types.
It's pretty much four routines in the scheme:
- sinkInit (setting up the conf properly, for which we're using the
AvroJob support)
- sourceInit (same thing)
- sink (mapping from Tuple to o.a.avro.generic.GenericData)
- source (mapping from o.a.avro.generic.GenericData to Tuple)
The above is all based on the Avro mapred support, so we just have to
do the translation work for Fields <-> Schema and Tuple <-> GenericData.
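To make that concrete, here's the rough shape we have in mind (a
sketch only -- plain arrays stand in for Cascading's Fields/Tuple so
I don't misquote the Cascading API, and the AvroJob calls are the
setInputSchema/setOutputSchema ones we believe apply):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;

// Rough shape of the four Scheme routines.
public class AvroSchemeSketch {

  private final Schema schema;       // built from field names + classes
  private final String[] fieldNames; // the Cascading field names, in order

  public AvroSchemeSketch(Schema schema, String[] fieldNames) {
    this.schema = schema;
    this.fieldNames = fieldNames;
  }

  // sourceInit: tell the Avro mapred support what schema to read.
  public void sourceInit(JobConf conf) {
    AvroJob.setInputSchema(conf, schema);
  }

  // sinkInit: tell the Avro mapred support what schema to write.
  public void sinkInit(JobConf conf) {
    AvroJob.setOutputSchema(conf, schema);
  }

  // sink: tuple values -> AvroWrapper around a GenericData.Record.
  public AvroWrapper<GenericData.Record> sink(Object[] tupleValues) {
    GenericData.Record record = new GenericData.Record(schema);
    for (int i = 0; i < fieldNames.length; i++) {
      Object value = tupleValues[i];
      // Same String -> Utf8 translation noted above.
      record.put(fieldNames[i],
          (value instanceof String) ? new Utf8((String) value) : value);
    }
    return new AvroWrapper<GenericData.Record>(record);
  }

  // source: AvroWrapper key -> tuple values.
  public Object[] source(AvroWrapper<GenericData.Record> key) {
    GenericData.Record record = key.datum();
    Object[] tupleValues = new Object[fieldNames.length];
    for (int i = 0; i < fieldNames.length; i++) {
      Object value = record.get(fieldNames[i]);
      // And Utf8 -> String on the way back out.
      tupleValues[i] = (value instanceof Utf8) ? value.toString() : value;
    }
    return tupleValues;
  }
}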
It looks pretty doable, thanks for the help!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g