Re: Questions re integrating Avro into Cascading process

Scott Carey Fri, 16 Apr 2010 11:06:20 -0700

On Apr 15, 2010, at 10:33 AM, Ken Krugler wrote:

> Hi all,
> 
> We're looking at creating a Cascading Scheme for Avro, and have got a  
> few questions below. These are very general, as this is more of a  
> scoping phase (as in, are we crazy to try this) so apologies in  
> advance for lack of detail.
> 
> For context, Cascading is an open source project that provides a  
> workflow API on top of Hadoop. The key unit of data is a tuple, which  
> corresponds to a record - you have fields (names) and values.  
> Cascading uses a generalized "tap" concept for reading & writing  
> tuples, where a tap uses a scheme to handle the low-level mapping from  
> Cascading-land to/from the storage format.


I am somewhat familiar with Cascading as a user.  I am not familiar with how it 
is implemented or how to customize things like a Tap or Sink.

Correct me if I'm wrong, but its notion of a record is very simple -- there are 
no arrays or maps -- just a list of fields.
That maps onto an Avro record easily.

> 
> So the goal here is to define a Cascading Scheme that will run on  
> 0.18.3 and later versions of Hadoop, and provide general support for  
> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
> 
> We grabbed the recently committed AvroXXX code from  
> org.apache.avro.mapred (thanks Doug & Scott), and began building the  
> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading  
> tuples.

You might be fine without the org.apache.avro.mapred stuff -- specifically if 
you only need the source and sink taps to use Avro, and not the data passed 
between a map and reduce.  For example, I have a custom LoadFunc in Pig that 
can read/write Avro data files using Avro 1.3.0 -- but it only works for a 
static schema.

> 
> 1. What's the best approach if we want to dynamically define the Avro  
> schema, based on a list of field names and types (classes)?
> 

Creating an Avro schema programmatically is fairly straightforward -- 
especially without arrays, maps, or unions.  If the code has access to the 
Cascading record definition, transforming that into an Avro schema dynamically 
should not be hard. Schema has various constructors and static methods, so you 
can either build the JSON schema representation or just pass around Schema 
objects.
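
A minimal sketch of the JSON route, assuming a flat record whose fields are 
all non-null primitives. The class name AvroSchemaSketch, its methods, and the 
Java-class-to-Avro-type mapping are illustrative assumptions, not part of Avro 
or Cascading. It uses only the JDK and produces the schema text; parsing that 
text into a Schema object is left to Avro.

```java
import java.util.Map;

/** Hypothetical helper: derives flat Avro record-schema JSON from field names and classes. */
final class AvroSchemaSketch {

    /** Maps a Java class to the name of the matching Avro primitive type (assumed mapping). */
    static String avroType(Class<?> type) {
        if (type == String.class)  return "string";
        if (type == Integer.class) return "int";
        if (type == Long.class)    return "long";
        if (type == Float.class)   return "float";
        if (type == Double.class)  return "double";
        if (type == Boolean.class) return "boolean";
        if (type == byte[].class)  return "bytes";
        throw new IllegalArgumentException("No Avro mapping assumed for " + type.getName());
    }

    /** Avro names must start with a letter or underscore, followed by letters, digits, or underscores. */
    static String checkName(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a valid Avro name: " + name);
        }
        return name;
    }

    /** Builds the schema JSON; field order follows the map's iteration order. */
    static String toSchemaJson(String recordName, Map<String, Class<?>> fields) {
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"")
            .append(checkName(recordName))
            .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Class<?>> field : fields.entrySet()) {
            if (!first) {
                json.append(',');
            }
            first = false;
            json.append("{\"name\":\"").append(checkName(field.getKey()))
                .append("\",\"type\":\"").append(avroType(field.getValue()))
                .append("\"}");
        }
        return json.append("]}").toString();
    }
}
```

Passing a LinkedHashMap keeps the fields in the same order as the Cascading 
field list. Field names that are not valid Avro names would need some renaming 
rule first, and nullable fields would need a union with "null", which this 
sketch leaves out.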


> This assumes it's possible to dynamically define & use a schema, of  
> course.
> 
> 2. How much has the new Hadoop map-reduce support code been tested?
> 

I can't speak for all of what Doug has done here, but there are unit tests for 
basic stuff -- word count, etc.


> 3. Will there be issues with running in 0.18.3, 0.19.2, etc?
> 
> I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,  
> and that then creating problems. Anything else?

I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and the only 
problem was the above library conflict.  This is without the new 
o.a.avro.mapred stuff however.

> 
> 4. The key integration point, besides the fields+classes to schema  
> issue above, is mapping between Cascading tuples and AvroWrapper<T>
> 
> If we're using (I assume) the generic format, any input on how we'd do  
> this two-way conversion?
> 

I'd suggest thinking about using Avro container files for input and output, 
which may not require the above depending on how Cascading is built internally. 
In Pig, for example, the LoadFunc defines a Pig schema on input for reading, 
and everything else from there requires no change. This means Pig uses its 
default types and serialization for all the intermediate work, but reading 
inputs and writing outputs can be done with Avro with minimal effort. 
Cascading is already defining the M/R jobs, the keys, values, etc., so you may 
only have to modify the Tap to translate from an Avro schema to the Cascading 
record to get it to read or write an Avro file.
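
The record-to-tuple translation can be sketched roughly as follows. This is 
an illustration under stated assumptions, not Cascading or Avro code: a 
Map<String, Object> stands in for an Avro generic record, a List<Object> 
stands in for a Cascading tuple, and TupleRecordSketch and its methods are 
made-up names. The ordered field list plays the role of the schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: converts between a named record and a positional tuple. */
final class TupleRecordSketch {

    /** Named record -> positional tuple, with values laid out in schema field order. */
    static List<Object> toTuple(List<String> fieldOrder, Map<String, Object> record) {
        List<Object> tuple = new ArrayList<>(fieldOrder.size());
        for (String name : fieldOrder) {
            if (!record.containsKey(name)) {
                throw new IllegalArgumentException("Record lacks field " + name);
            }
            tuple.add(record.get(name));
        }
        return tuple;
    }

    /** Positional tuple -> named record, pairing each value with the field at the same index. */
    static Map<String, Object> toRecord(List<String> fieldOrder, List<Object> tuple) {
        if (fieldOrder.size() != tuple.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldOrder.size() + " values, got " + tuple.size());
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldOrder.size(); i++) {
            record.put(fieldOrder.get(i), tuple.get(i));
        }
        return record;
    }
}
```

In a real Tap or Scheme the same two loops would read from and write to the 
actual Avro and Cascading types, and some values would likely need conversion 
along the way (Avro's own string type, for instance).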

One can go further and use AvroWrapper and o.a.avro.mapred to define the M/R 
jobs, enabling a lot of other possibilities.  I can't confidently state what 
all the requirements are here beyond doing the Cascading record <> Avro schema 
translation and changing all the touch points that Cascading has on the K/V 
types.


> Thanks!
> 
> -- Ken
> 
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 
>
