[
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904680#action_12904680
]
Scott Carey commented on PIG-794:
---------------------------------
Here is a quick summary of the differences I can see:
h5. Schema usage:
The patch on this issue (PIG-794) creates a 'generic' Avro schema that can be
used for any Pig data. Each field in a Tuple is a union of all possible Pig
types, and each Tuple is a list of fields. It does not preserve the field names
or types, which are not important for intermediate data anyway.
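To make the 'generic' shape concrete, here is a rough sketch of what such a schema could look like, written as a Python dict in Avro's JSON schema notation. The record and field names are made up for illustration, maps are left out to keep it small, and the schema in the actual patch may well differ.

```python
import json

# Hypothetical, simplified stand-in for a 'generic' Pig schema.
# Not taken from the PIG-794 patch; maps are omitted for brevity.
GENERIC_TUPLE_SCHEMA = {
    "type": "record",
    "name": "Tuple",
    "fields": [
        {
            "name": "fields",
            "type": {
                "type": "array",
                # Every slot may hold any supported type, so no per-field
                # name or type survives in the schema.
                "items": [
                    "null", "int", "long", "float", "double",
                    "string", "bytes",
                    "Tuple",                              # nested tuple
                    {"type": "array", "items": "Tuple"},  # bag of tuples
                ],
            },
        }
    ],
}

print(json.dumps(GENERIC_TUPLE_SCHEMA, indent=2))
```

The same schema describes every relation, which is why it works for intermediate data without knowing anything about the script.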
AVRO-592 translates the Pig schema into a specific Avro schema that persists
the field names and types, so that:
STORE foo INTO 'file' USING AvroStorage();
will create a file from which
foo2 = LOAD 'file' USING AvroStorage();
can re-create the exact schema for use in a script.
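As a rough illustration of that translation (not the AVRO-592 code itself), the sketch below maps a flat Pig schema to an Avro record schema. The `PIG_TO_AVRO` table, the `pig_schema_to_avro` helper, and the choice to make every field nullable are assumptions made for the example.

```python
# Assumed mapping from Pig scalar types to Avro primitive types.
PIG_TO_AVRO = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "chararray": "string",
    "bytearray": "bytes",
}


def pig_schema_to_avro(record_name, pig_fields):
    """Build an Avro record schema (as a dict) from (name, pig_type) pairs."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            # Assumption: Pig values may be null, so each field becomes a
            # union with "null".
            {"name": name, "type": ["null", PIG_TO_AVRO[pig_type]]}
            for name, pig_type in pig_fields
        ],
    }


# Hypothetical relation: foo: (user: chararray, clicks: long)
schema = pig_schema_to_avro("foo", [("user", "chararray"), ("clicks", "long")])
```

Because the names and types live in the file's schema, a later LOAD can rebuild the Pig schema by running the mapping in reverse.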
h5. Serialization and Deserialization:
The patch on this issue uses the same style as Avro's GenericRecord, which
traverses the schema on the fly and writes fields for each record.
AVRO-592 constructs a state machine for each specific schema to optimally
traverse a Tuple to serialize a record or create a Tuple when deserializing.
This should be faster, but the code is definitely harder to read. It is easy to
unit test, though: AVRO-592 has 98% unit test code coverage on that portion.
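The difference between the two styles can be sketched roughly as follows. This is a toy byte encoding, not Avro's binary format, and the "compiled" writer is a plain list of per-field functions rather than the state machine AVRO-592 builds; all names here are hypothetical.

```python
import struct


def write_generic(value, out):
    """On-the-fly style: inspect each value's runtime type for every record."""
    if isinstance(value, int):
        out.append(b"I" + struct.pack(">q", value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        out.append(b"S" + struct.pack(">i", len(data)) + data)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")


def write_tuple_generic(tup, out):
    for value in tup:
        write_generic(value, out)


def _write_long(value, out):
    out.append(struct.pack(">q", value))


def _write_string(value, out):
    data = value.encode("utf-8")
    out.append(struct.pack(">i", len(data)) + data)


WRITERS = {"long": _write_long, "string": _write_string}


def compile_writer(field_types):
    """Schema-specific style: pick one writer per field once, then reuse it."""
    plan = [WRITERS[t] for t in field_types]

    def write_tuple(tup, out):
        for writer, value in zip(plan, tup):
            writer(value, out)

    return write_tuple


write = compile_writer(["string", "long"])
buf = []
write(("alice", 3), buf)
encoded = b"".join(buf)
```

The compiled writer does its type dispatch once per schema instead of once per value, and it needs no per-value type tag because the schema fixes the layout.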
Integrating these should not be too hard. I'll try to post my latest version
of AVRO-592 there late today or tomorrow.
> Use Avro serialization in Pig
> -----------------------------
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.2.0
> Reporter: Rakesh Setty
> Assignee: Dmitriy V. Ryaboy
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch,
> AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java,
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs
> instead of the current BinStorage. Attached is an implementation of
> AvroBinStorage which performs significantly better compared to BinStorage on
> our benchmarks.