[
https://issues.apache.org/jira/browse/PIG-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheolsoo Park updated PIG-2875:
-------------------------------
Attachment: PIG-2875.patch
> Add recursive record support to AvroStorage
> -------------------------------------------
>
> Key: PIG-2875
> URL: https://issues.apache.org/jira/browse/PIG-2875
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Affects Versions: 0.10.0
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Attachments: avro_test_files.tar.gz, PIG-2869.patch, PIG-2875.patch
>
>
> Currently, AvroStorage does not allow recursive records in Avro schema
> because it is not possible to define Pig schema for recursive records. (i.e.
> records that have self-referencing fields cause an infinite loop, so they are
> not supported.)
> Even though there is no natural way of handling recursive records in Pig
> schema, I'd like to propose the following workaround: mapping recursive
> records to bytearray.
> Take for example the following Avro schema:
> {code}
> {
> "type" : "record",
> "name" : "RECURSIVE_RECORD",
> "fields" : [ {
> "name" : "value",
> "type" : [ "null", "int" ]
> }, {
> "name" : "next",
> "type" : [ "null", "RECURSIVE_RECORD" ]
> } ]
> }
> {code}
> and the following data:
> {code}
> {"value":1,"next":{"RECURSIVE_RECORD":{"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}}}
>
> {"value":2,"next":{"RECURSIVE_RECORD":{"value":3,"next":null}}}
> {"value":3,"next":null}
> {code}
> Then, we can define Pig schema as follows:
> {code}
> {value: int,next: bytearray}
> {code}
> Even though Pig thinks that the "next" fields are bytearray, they're actually
> loaded as tuples since AvroStorage uses Avro schema when loading files.
> {code}
> grunt> in = LOAD 'test_recursive_schema.avro' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage ();
> grunt> dump in;
> (1,(2,(3,)))
> (2,(3,))
> (3,)
> {code}
> At this point, we have discrepancy between Avro schema and Pig schema;
> nevertheless, we can still refer to each field of tuples as follows:
> {code}
> grunt> first = FOREACH in GENERATE $0;
> grunt> dump first;
> (1)
> (2)
> (3)
> or
> grunt> second = FOREACH in GENERATE $1.$0;
> grunt> dump second;
> (2)
> (3)
> ()
> {code}
> Lastly, we can store these tuples as Avro files by specifying schema. Since
> we can no longer construct Avro schema from Pig schema, it is required for
> the user to provide Avro schema via the 'schema' parameter in STORE function.
> {code}
> grunt> STORE first INTO 'output' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage ( 'schema', '[ "null",
> "int" ]' );
> or
> grunt> STORE in INTO 'output' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage ( 'schema', '
> {
> "type" : "record",
> "name" : "recursive_schema",
> "fields" : [ {
> "name" : "value",
> "type" : [ "null", "int" ]
> }, {
> "name" : "next",
> "type" : [ "null", "recursive_schema" ]
> } ]
> }
> ' );
> {code}
> To implement this workaround, the following work is required:
> - Update the current generic union check so that it can handle recursive
> records. Currently, AvroStorage checks if the Avro schema contains 1)
> recursive records and 2) generic unions, and fails if so. But since I am
> going to remove the 1st check, the 2nd check should be able to handle
> recursive records without stack overflow.
> - Update AvroSchema2Pig so that recursive records can be detected and mapped
> to bytearrays in Pig schema.
> - Add the 'no_schema_check' parameter to STORE function so that results can
> be stored even though there exists discrepancy between Avro schema and Pig
> schema. Since Avro schema for STORE function cannot be constructed from Pig
> schema, it has to be specified by the user via the 'schema' parameter, and
> schema check has to be disabled by 'no_schema_check'.
> - Update AvroStorage wiki.
> - Add unit tests.
> I do not think that any incompatibility issues will be introduced by this.
> P.S. The reason why I chose to map recursive records to bytearray instead of
> empty tuple is because I cannot refer to any field if I use empty tuple. For
> example, if Pig schema is defined as follows:
> {code}
> {value: int,next: ()}
> {code}
> I get an exception when I attempt to refer to any field in loaded tuples
> since their schema is not defined (i.e. empty tuple).
> {code}
> ERROR 1127: Index 0 out of range in schema
> {code}
> This is all what I found by trials and errors, so there might be something
> that I am missing here. If so, please let me know.
> Thanks!
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira