[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Cheolsoo Park (JIRA) Tue, 30 Oct 2012 23:59:19 -0700

    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487592#comment-13487592 ]


Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

The list of options that you described looks like a good start. I think that we 
should definitely start with a small set of options, but it may be a good idea 
to keep in mind what options we eventually want to add. So here are my 
questions:

*LoadFunc*
{quote}
(a) Just pick the schema from the most recent file
(b) Check all the files to make sure the schemas are compatible
{quote}
I haven't checked out your repository, so please correct me if I am wrong. I 
assume that your storage converts the Avro schema to a Pig schema during the 
load? If so, how do you convert multiple (compatible but different) schemas to 
one Pig schema? The current storage has an option called 'multiple_schemas' 
that merges multiple schemas into one.
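
To make the question concrete, here is a rough Python sketch of one way to merge compatible record schemas. It is only an illustration under my own assumptions, not the logic behind 'multiple_schemas': the helper name merge_record_schemas is hypothetical, schemas are plain dicts parsed from Avro JSON, a field missing from some inputs becomes nullable, and a field whose type differs between inputs is rejected rather than promoted.

```python
def merge_record_schemas(schemas):
    """Merge Avro record schemas (dicts parsed from JSON) by unioning fields.

    Hypothetical helper for illustration only. Assumptions: all inputs are
    flat records; a field absent from some schemas becomes nullable; a field
    with conflicting types raises instead of being promoted.
    """
    merged = {}  # field name -> type, kept in first-seen order
    counts = {}  # field name -> number of schemas that contain it
    for schema in schemas:
        for field in schema["fields"]:
            name, ftype = field["name"], field["type"]
            if name in merged and merged[name] != ftype:
                raise ValueError(f"incompatible types for field {name!r}")
            merged[name] = ftype
            counts[name] = counts.get(name, 0) + 1

    fields = []
    for name, ftype in merged.items():
        if counts[name] < len(schemas):
            # Field is missing somewhere, so make it nullable without
            # nesting unions (Avro does not allow a union inside a union).
            if isinstance(ftype, list):
                if "null" not in ftype:
                    ftype = ["null"] + ftype
            else:
                ftype = ["null", ftype]
        fields.append({"name": name, "type": ftype})

    # The merged record reuses the name of the first input schema.
    return {"type": "record", "name": schemas[0]["name"], "fields": fields}
```

With this approach the resulting Pig schema would be derived once from the merged record, so every input file maps onto the same set of columns.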
{quote}
(2) Use a schema manually provided by the user
{quote}
Do we need this option for LoadFunc? Is this for when the input Avro files do 
not have an embedded schema?

Does your storage also restrict unions and recursive records, as the current 
storage does? (Recursive records are in fact now supported, as of PIG-2875.)

How about corrupted files? Currently, we have an option to skip corrupted files 
(ignore_bad_files) instead of failing on them.
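
The behavior I have in mind is roughly the following Python sketch. It is an assumption-laden illustration, not the current implementation: read_all and the read_file callback are hypothetical names, and the real option may skip at a different granularity than whole files.

```python
def read_all(paths, read_file, ignore_bad_files=False):
    """Read records from every path, optionally skipping unreadable files.

    Hypothetical sketch: read_file(path) is an assumed callback that yields
    records and raises an exception when the file is corrupted.
    """
    records, skipped = [], []
    for path in paths:
        try:
            # Read the whole file first so that a file failing halfway
            # through contributes no partial records.
            file_records = list(read_file(path))
        except Exception:
            if not ignore_bad_files:
                raise  # default: fail the load on the first bad file
            skipped.append(path)
            continue
        records.extend(file_records)
    return records, skipped
```

Returning the skipped paths alongside the records lets the caller log which files were ignored instead of dropping them silently.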

*StoreFunc*
{quote}
(2) Use a schema manually provided by the user
{quote}
The current storage provides three ways of specifying the output schema:
# A JSON string can be given (option: schema).
# The schema of an existing Avro file (.avro) can be used (option: same).
# An Avro schema file (.avsc) can be used (option: schema_file).

Are you going to support the same?

How about multiple stores with different output schemas? The current storage 
has an 'index' option that lets the user specify a different output schema 
for each store.

Thanks!
                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.

