[jira] [Updated] (PIG-2875) Add recursive record support to AvroStorage

Cheolsoo Park (JIRA) Thu, 16 Aug 2012 15:48:40 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Cheolsoo Park updated PIG-2875:
-------------------------------

    Attachment: PIG-2875.patch
    
> Add recursive record support to AvroStorage
> -------------------------------------------
>
>                 Key: PIG-2875
>                 URL: https://issues.apache.org/jira/browse/PIG-2875
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>    Affects Versions: 0.10.0
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>         Attachments: avro_test_files.tar.gz, PIG-2869.patch, PIG-2875.patch
>
>
> Currently, AvroStorage does not allow recursive records in an Avro schema
> because a Pig schema cannot describe a recursive record: converting a record
> that has a self-referencing field causes an infinite loop, so such records
> are rejected.
> Even though there is no natural way of handling recursive records in Pig 
> schema, I'd like to propose the following workaround: mapping recursive 
> records to bytearray.
> Take for example the following Avro schema:
> {code}
> {
>   "type" : "record",
>   "name" : "RECURSIVE_RECORD",
>   "fields" : [ {
>     "name" : "value",
>     "type" : [ "null", "int" ]
>   }, {
>     "name" : "next",
>     "type" : [ "null", "RECURSIVE_RECORD" ]
>   } ]
> }
> {code}
> and the following data:
> {code}
> {"value":1,"next":{"RECURSIVE_RECORD":{"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}}}
> {"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}
> {"value":3,"next":null}
> {code}
> Then, we can define Pig schema as follows:
> {code}
> {value: int,next: bytearray}
> {code}
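
A minimal sketch of this mapping, not the AvroStorage code: it walks the Avro schema as parsed JSON and emits bytearray when a field refers back to a record that is still being converted. The names pig_schema, to_pig and PRIMITIVES are hypothetical, the primitive table is an assumption that covers only a few types, and only [null, T] unions are handled.

```python
import json

AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "RECURSIVE_RECORD",
  "fields": [
    {"name": "value", "type": ["null", "int"]},
    {"name": "next", "type": ["null", "RECURSIVE_RECORD"]}
  ]
}
""")

# Assumed Avro-to-Pig primitive mapping, limited to what this sketch needs.
PRIMITIVES = {"int": "int", "long": "long", "float": "float",
              "double": "double", "string": "chararray", "bytes": "bytearray"}


def to_pig(schema, open_records):
    """Return the Pig type for an Avro schema node.

    open_records holds the names of records currently being converted;
    a reference to one of them is a recursive reference.
    """
    if isinstance(schema, list):  # union
        branches = [s for s in schema if s != "null"]
        if len(branches) != 1:
            raise ValueError("this sketch only handles [null, T] unions")
        return to_pig(branches[0], open_records)
    if isinstance(schema, str):  # primitive or named reference
        if schema in open_records:
            return "bytearray"  # recursive reference: give up on typing it
        if schema not in PRIMITIVES:
            raise ValueError("unsupported type in this sketch: " + schema)
        return PRIMITIVES[schema]
    if schema.get("type") == "record":  # nested record becomes a tuple
        return "(" + fields_to_pig(schema, open_records) + ")"
    return to_pig(schema["type"], open_records)


def fields_to_pig(record, open_records):
    inner = open_records | {record["name"]}
    return ",".join(f"{f['name']}: {to_pig(f['type'], inner)}"
                    for f in record["fields"])


def pig_schema(record):
    """Render the top-level record in the brace style used in this issue."""
    return "{" + fields_to_pig(record, frozenset()) + "}"


print(pig_schema(AVRO_SCHEMA))
```

Tracking only the records on the current path (rather than every record seen so far) keeps non-recursive nested records typed as tuples.
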
> Even though Pig thinks that the "next" fields are bytearray, they're actually 
> loaded as tuples since AvroStorage uses Avro schema when loading files.
> {code}
> grunt> in = LOAD 'test_recursive_schema.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
> grunt> dump in;
> (1,(2,(3,)))
> (2,(3,))
> (3,)
> {code}
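
To make the loaded shape concrete, here is a rough sketch (plain Python, not AvroStorage) that turns the sample data lines quoted in this issue into nested tuples and prints them the way dump does. The helpers to_tuple and render are hypothetical, and the sketch assumes the data is exactly in the form quoted, with each non-null "next" wrapped in its branch name.

```python
import json

LINES = [
    '{"value":1,"next":{"RECURSIVE_RECORD":{"value":2,"next":'
    '{"RECURSIVE_RECORD":{"value":3,"next":null}}}}}',
    '{"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}',
    '{"value":3,"next":null}',
]


def to_tuple(record):
    """Convert one decoded record into a nested (value, next) tuple."""
    if record is None:
        return None
    nxt = record["next"]
    if nxt is not None:
        nxt = nxt["RECURSIVE_RECORD"]  # unwrap the union branch name
    return (record["value"], to_tuple(nxt))


def render(value):
    """Format a value roughly like Pig's dump: nulls print as nothing."""
    if value is None:
        return ""
    if isinstance(value, tuple):
        return "(" + ",".join(render(v) for v in value) + ")"
    return str(value)


rows = [to_tuple(json.loads(line)) for line in LINES]
for row in rows:
    print(render(row))
```
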
> At this point, the Avro schema and the Pig schema disagree; nevertheless, we
> can still refer to each field of the tuples by position as follows:
> {code}
> grunt> first = FOREACH in GENERATE $0;
> grunt> dump first;
> (1)
> (2)
> (3)
> or
> grunt> second = FOREACH in GENERATE $1.$0;
> grunt> dump second;
> (2)
> (3)
> ()
> {code}
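
The positional lookups above can be mimicked in a few lines of Python. This is only an illustration of how $0 and $1.$0 walk the nested tuples, assuming a null "next" simply yields null; project is a hypothetical helper, not a Pig API.

```python
# Tuples as shown by `dump in`; None stands for a Pig null.
ROWS = [(1, (2, (3, None))), (2, (3, None)), (3, None)]


def project(row, *path):
    """Follow positional indexes, e.g. $1.$0 is path (1, 0)."""
    value = row
    for index in path:
        if value is None:  # nothing to descend into
            return None
        value = value[index]
    return value


first = [(project(r, 0),) for r in ROWS]       # GENERATE $0
second = [(project(r, 1, 0),) for r in ROWS]   # GENERATE $1.$0
print(first)
print(second)
```
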
> Lastly, we can store these tuples as Avro files by specifying a schema. Since
> the Avro schema can no longer be constructed from the Pig schema, the user
> must provide it via the 'schema' parameter of the STORE function.
> {code}
> grunt> STORE first INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema', '[ "null", "int" ]');
> or
> grunt> STORE in INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema', '
> {
>   "type" : "record",
>   "name" : "recursive_schema",
>   "fields" : [ { 
>     "name" : "value",
>     "type" : [ "null", "int" ]
>   }, {
>     "name" : "next",
>     "type" : [ "null", "recursive_schema" ]
>   } ] 
> }
> ' );
> {code}
> To implement this workaround, the following work is required:
> - Update the current generic union check so that it can handle recursive
> records. Currently, AvroStorage checks whether the Avro schema contains 1)
> recursive records or 2) generic unions, and fails if it finds either. Since I
> am going to remove the first check, the second check must be able to traverse
> recursive records without a stack overflow.
> - Update AvroSchema2Pig so that recursive records can be detected and mapped 
> to bytearrays in Pig schema.
> - Add the 'no_schema_check' parameter to the STORE function so that results
> can be stored even though the Avro schema and the Pig schema disagree. Since
> the Avro schema for STORE cannot be constructed from the Pig schema, the user
> has to specify it via the 'schema' parameter and disable the schema check
> with 'no_schema_check'.
> - Update AvroStorage wiki.
> - Add unit tests.
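
One possible shape for the first work item, sketched in Python rather than taken from the patch: a union check that resolves named references and remembers which records it has already entered, so a self-reference does not recurse forever. The function name, the named/visited bookkeeping, and the rule that a union is acceptable only if it has one branch or is [null, T] are assumptions of this sketch.

```python
def contains_generic_union(schema, named=None, visited=None):
    """Return True if the schema contains a union other than [T] or [null, T].

    named maps record names to their definitions; visited holds the names of
    records already entered, which is what stops infinite recursion.
    """
    named = {} if named is None else named
    visited = set() if visited is None else visited

    if isinstance(schema, str):
        if schema in named:  # named reference: resolve and descend
            return contains_generic_union(named[schema], named, visited)
        return False  # primitive type

    if isinstance(schema, list):  # union
        acceptable = len(schema) == 1 or (len(schema) == 2 and "null" in schema)
        if not acceptable:
            return True
        return any(contains_generic_union(s, named, visited) for s in schema)

    kind = schema.get("type")
    if kind == "record":
        name = schema["name"]
        if name in visited:  # already being checked: do not loop
            return False
        visited.add(name)
        named[name] = schema
        return any(contains_generic_union(f["type"], named, visited)
                   for f in schema["fields"])
    if kind == "array":
        return contains_generic_union(schema["items"], named, visited)
    if kind == "map":
        return contains_generic_union(schema["values"], named, visited)
    return False


RECURSIVE = {
    "type": "record",
    "name": "RECURSIVE_RECORD",
    "fields": [
        {"name": "value", "type": ["null", "int"]},
        {"name": "next", "type": ["null", "RECURSIVE_RECORD"]},
    ],
}

# Same record, but "value" is a generic union that should be rejected.
WITH_GENERIC_UNION = {
    "type": "record",
    "name": "RECURSIVE_RECORD",
    "fields": [
        {"name": "value", "type": ["int", "string"]},
        {"name": "next", "type": ["null", "RECURSIVE_RECORD"]},
    ],
}

print(contains_generic_union(RECURSIVE))
print(contains_generic_union(WITH_GENERIC_UNION))
```

Without the visited set, resolving "RECURSIVE_RECORD" inside its own "next" field would re-enter the same record indefinitely.
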
> I do not think that any incompatibility issues will be introduced by this.
> P.S. I chose to map recursive records to bytearray instead of an empty tuple
> because I cannot refer to any field if I use an empty tuple. For example, if
> the Pig schema is defined as follows:
> {code}
> {value: int,next: ()}
> {code}
> I get an exception when I attempt to refer to any field in loaded tuples 
> since their schema is not defined (i.e. empty tuple).
> {code}
> ERROR 1127: Index 0 out of range in schema
> {code}
> This is all based on what I found by trial and error, so there might be
> something that I am missing here. If so, please let me know.
> Thanks!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
