[
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700
]
Alan Gates commented on PIG-794:
--------------------------------
I agree with Doug's comments that it's better to use an API to build the schema
that will give us compile-time checking. I think it will also (hopefully) be
easier to figure out the schema when reading the code, as it will avoid the
need to read JSON directly.
I have a general question on the approach. This is a direct port of Pig's
BinStorage to use Avro, including the writing of indicator bytes for types. I
do not have a deep knowledge of Avro. But I had assumed that since it was a
de/serialization framework with types, part of what it would provide was type
recognition. That is, can't this code rely on Avro to set the type for it? Do
we need to be writing those indicator bytes ourselves? Perhaps this is the
same comment that Doug is making about using GenericDatumReader and addField.
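
To make the question about indicator bytes concrete, here is a minimal sketch
contrasting a tagged encoding (each value carries its own type byte) with a
schema-driven one (types come from a schema known to both writer and reader).
Everything in it is an assumption made for illustration: the tag values, the
function names, and the fixed-width layout are hypothetical, and this is neither
BinStorage's format nor Avro's actual wire format.

```python
import struct

# Hypothetical one-byte type indicators, in the spirit of a tagged format.
TAG_LONG, TAG_STRING = 0x01, 0x02


def encode_tagged(values):
    """Self-describing encoding: every value is preceded by an indicator byte."""
    out = bytearray()
    for v in values:
        if isinstance(v, int):
            out += bytes([TAG_LONG]) + struct.pack(">q", v)
        elif isinstance(v, str):
            raw = v.encode("utf-8")
            out += bytes([TAG_STRING]) + struct.pack(">i", len(raw)) + raw
        else:
            raise TypeError("unsupported type: %r" % type(v))
    return bytes(out)


def encode_with_schema(values, schema):
    """Schema-driven encoding: the schema supplies the types, so no tags."""
    out = bytearray()
    for v, t in zip(values, schema):
        if t == "long":
            out += struct.pack(">q", v)
        elif t == "string":
            raw = v.encode("utf-8")
            out += struct.pack(">i", len(raw)) + raw
        else:
            raise ValueError("unsupported schema type: %r" % t)
    return bytes(out)


def decode_with_schema(data, schema):
    """Walk the schema to know which type to read next from the bytes."""
    values, pos = [], 0
    for t in schema:
        if t == "long":
            (v,) = struct.unpack_from(">q", data, pos)
            pos += 8
        elif t == "string":
            (n,) = struct.unpack_from(">i", data, pos)
            pos += 4
            v = data[pos:pos + n].decode("utf-8")
            pos += n
        else:
            raise ValueError("unsupported schema type: %r" % t)
        values.append(v)
    return values


record = [42, "pig"]
schema = ["long", "string"]
tagged = encode_tagged(record)
plain = encode_with_schema(record, schema)
```

In this sketch the schema-driven bytes are shorter by one byte per field, and
the reader recovers the types from the schema alone. Whether the patch can drop
its indicator bytes in the same way depends on how it uses Avro's readers.
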
In response to Hong's comment, the sync marks are vulnerable as you point out.
But the loader needs some way to find a proper starting place when it's handed
any block but the initial block of a file. I wonder if we could create a new
sync type. It would always consist of a 100-byte marker (say the first 25
prime numbers, or the first 25 digits of pi or something). We could then write
a tuple with that sync type every 1000 records in the data. A loader that
doesn't start at position 0 could then seek to the first sync type it finds
before it begins reading. All loaders would read past the end of their assigned
block until they saw a sync type.
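
The sync scheme described above could be sketched roughly as follows. This is
an illustration under stated assumptions, not a proposed implementation: it
assumes the marker is the first 25 primes packed as 4-byte big-endian integers
(which gives 100 bytes), that records are length-prefixed byte strings, and
that a marker is also written at the very start of the file so every loader
follows the same rule. The names write_records and read_split are hypothetical.
It does not address the vulnerability Hong raised: a record payload that
happens to contain the marker bytes would still confuse the initial seek.

```python
import struct

# Assumption: the marker is the first 25 primes as 4-byte big-endian ints.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
SYNC = struct.pack(">25i", *PRIMES)  # 25 * 4 = 100 bytes
SYNC_INTERVAL = 1000                 # records between markers


def write_records(records, interval=SYNC_INTERVAL):
    """Serialize length-prefixed records, with a marker before every
    `interval` records (including the first)."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i % interval == 0:
            out += SYNC
        out += struct.pack(">i", len(rec)) + rec
    return bytes(out)


def read_split(data, start, end):
    """Read the records owned by the byte range [start, end).

    A loader owns every sync block whose marker begins inside its range.
    It seeks forward to the first marker at or after `start`, then keeps
    reading, possibly past `end`, until it meets a marker at or beyond `end`.
    """
    records = []
    pos = data.find(SYNC, start)
    if pos == -1 or pos >= end:
        return records  # no marker begins inside this range
    while pos < len(data):
        if data.startswith(SYNC, pos):
            if pos >= end:
                break  # this block belongs to the next loader
            pos += len(SYNC)
            continue
        (n,) = struct.unpack_from(">i", data, pos)
        pos += 4
        records.append(data[pos:pos + n])
        pos += n
    return records


records = [("rec" + str(i)).encode("ascii") for i in range(2500)]
data = write_records(records)
mid = len(data) // 2
first_half = read_split(data, 0, mid)
second_half = read_split(data, mid, len(data))
```

With these rules the two loaders together return every record exactly once,
even though the split point falls in the middle of a sync block. The cost is
that split sizes become uneven, since a loader's work is rounded to whole sync
blocks.
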
As for this being compatible with non-Pig apps, that isn't the purpose of
this AvroStorage function. This is for Pig to pass data between MR jobs for
itself. Having a tool-independent storage format is a bigger project, as it
requires agreeing on things like sync marks, how to represent different Avro
objects, etc.
> Use Avro serialization in Pig
> -----------------------------
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.2.0
> Reporter: Rakesh Setty
> Fix For: 0.2.0
>
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch,
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs
> instead of the current BinStorage. Attached is an implementation of
> AvroBinStorage which performs significantly better compared to BinStorage on
> our benchmarks.