On May 14, 2010, at 12:18 PM, Doug Cutting wrote:
> On 05/14/2010 11:50 AM, Scott Carey wrote:
>> It doesn't cost on the serialization size but currently it costs a lot on
>> the performance side.
>
> Not necessarily. For Pig data in Java I think one might reasonably write
> a custom reader and writer that reads and writes Pig data structures
> directly. This could look something like readSchema, writeSchema,
> readJson and writeJson methods in the patch for AVRO-251.
>
> https://issues.apache.org/jira/browse/AVRO-251
>
> Start by looking at json.avsc (an Avro schema for arbitrary JSON data)
> then see the readJson() method. It directly reads Avro data
> corresponding to that schema into a Jackson JsonNode. ResolvingDecoder
> enforces the schema.
This has to work through Hadoop via InputFormat / OutputFormat in the 0.20 APIs,
and the o.a.h.mapreduce implementations I have made closely resemble the ones
you wrote for the old o.a.h.mapred API. These currently require the Generic or
Specific API, though one could write a variant that produces a Pig Tuple
directly instead of an Avro object.
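For what it's worth, that Tuple-producing variant could start as a thin
converter layered on the Generic API. A rough sketch against recent Avro and
Pig APIs (the class name is made up, and nested Avro types would need real
conversion rather than a pass-through):

import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/** Hypothetical helper: copy a GenericRecord's fields into a Pig Tuple. */
public class AvroToPigTuple {
  private static final TupleFactory FACTORY = TupleFactory.getInstance();

  public static Tuple toTuple(GenericRecord record) throws ExecException {
    List<Schema.Field> fields = record.getSchema().getFields();
    Tuple tuple = FACTORY.newTuple(fields.size());
    for (Schema.Field field : fields) {
      // Real code would map nested records/maps/arrays/unions to Pig types here.
      tuple.set(field.pos(), record.get(field.pos()));
    }
    return tuple;
  }
}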
Longer term, I think we can get the Specific and Reflect APIs to perform as
well as the above by dynamically compiling a schema's serialization and
deserialization via ASM into direct decoder/encoder API calls. But that's the
sort of research project I only wish I had time for :)
>
>> Yes, it helps a lot. One question remains, how can I construct a recursive
>> schema programmatically?
>> I have a couple options for the pig Tuple avro schema -- write it in JSON
>> and put that in the source code or programmatically construct it.
>> I'm currently programmatically constructing a schema specific to the Pig
>> schema that is serialized, which is straightforward until I hit the map type
>> and recursion.
>
> If you're not using a universal Pig schema then the above strategy may
> or may not work. It might still work if the specific schema is always a
> subset of the universal Pig schema, which I suspect it is.
One requirement I may not have communicated well is that I want to persist the
Pig schema, not just the Pig data. So if someone in Pig does:
STORE FOO into 'directory' using AvroStorage();
and FOO has the Pig schema:
{ firstName: string, lastName: string, departmentId: int, salary: long }
The avro schema would ideally be:
{ "name": "FOO", "type": "record", "fields": [
{ "name": "firstName", "type": "string" },
{ "name": "lastName", "type": "string" },
{ "name": "departmentId", "type": "int" },
{ "name": "salary", "type": "long"}
]
Then, when the Avro container file is loaded back into Pig (or into Java M/R,
Hive, Cascading, external applications, etc.) the field names are preserved.
Without this feature much of the power of Avro goes away -- schema evolution
has to be resolved manually and in lock-step across all data consumers.
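A rough sketch of that mapping (the class and the type mapping are illustrative
only; tuples, bags, maps, and nullable fields are where the real work is):

import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

/** Hypothetical sketch: build a named Avro record from a flat Pig schema. */
public class PigSchemaToAvro {
  public static Schema toAvroRecord(String name,
      org.apache.pig.impl.logicalLayer.schema.Schema pigSchema) {
    Schema record = Schema.createRecord(name, null, "pig", false);
    List<Field> fields = new ArrayList<Field>();
    for (FieldSchema fs : pigSchema.getFields()) {
      fields.add(new Field(fs.alias, toAvroType(fs.type), null, null));
    }
    record.setFields(fields);
    return record;
  }

  private static Schema toAvroType(byte pigType) {
    switch (pigType) {
      case DataType.CHARARRAY: return Schema.create(Type.STRING);
      case DataType.INTEGER:   return Schema.create(Type.INT);
      case DataType.LONG:      return Schema.create(Type.LONG);
      case DataType.DOUBLE:    return Schema.create(Type.DOUBLE);
      // Tuples, bags, maps, and nullable fields need unions and recursion.
      default: throw new IllegalArgumentException(
          "unhandled Pig type: " + DataType.findTypeName(pigType));
    }
  }
}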
I will create a unique schema for each store, and when a map is encountered its
value will be an Avro record that can contain generic, nameless Pig data, using
something much like Thiru's schema.
>
> To construct a recursive schema programmatically you need to do what the
> schema parser does: create a record schema with createSchema(), create
> its fields, including one or more that references the record schema,
> then call setFields() with the fields.
Excellent. I'll do that. This means that the recursion points Avro supports can
only be records, right? Only a record has the equivalent of setFields();
everything else that can contain another element (arrays, unions) must declare
the inner element(s) at creation time.
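Something like this, then (illustrative names; a "Node" record that refers back
to itself through an array field):

import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;

/** Illustrative example of a programmatically built recursive record. */
public class RecursiveSchemaExample {
  public static Schema buildNodeSchema() {
    // Create the record first, with no fields yet...
    Schema node = Schema.createRecord("Node", null, "example", false);
    // ...so the array element type can reference it...
    Schema children = Schema.createArray(node);
    // ...and only then attach the fields, closing the recursion.
    node.setFields(Arrays.asList(
        new Field("value", Schema.create(Type.STRING), null, null),
        new Field("children", children, null, null)));
    return node;
  }
}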
>
> Doug
>
>
Thanks Doug,
-Scott