[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Cheolsoo Park (JIRA) Wed, 31 Oct 2012 13:35:14 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488201#comment-13488201
 ]


Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

1) Using different functions sounds OK to me, but couldn't we handle them via 
args using CommandLineParser? IMHO, this is simpler and more scalable. Another 
advantage of using CommandLineParser is that we don't have to infer the meaning 
of arguments based on the number of arguments. Other built-in storages (e.g. 
HBaseStorage) use CommandLineParser, so why don't we do the same to provide a 
uniform syntax to users across the project? Thoughts?
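
To illustrate the option-string style, here is a rough sketch. The HBaseStorage 
call uses its existing '-loadKey' and '-limit' options; the AvroStorage option 
names ('-schemafile', '-allowrecursive') and all paths and table names are 
hypothetical, shown only to suggest what a flag-based syntax could look like.
{code}
-- Existing built-in storage: options are passed as a single flag string.
raw = load 'hbase://SampleTable' using 
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'info:first_name info:last_name', '-loadKey true -limit 5')
    as (id:bytearray, first_name:chararray, last_name:chararray);

-- Hypothetical AvroStorage equivalent (option names are assumptions): each 
-- flag names its own meaning, so nothing is inferred from the argument count.
data = load 'input.avro' using 
AvroStorage('-schemafile /path/to/schema.avsc -allowrecursive');
{code}
With named flags, adding a new option later does not change the meaning of 
existing calls, which is not the case when arguments are positional.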

2) Multiple schema support
{quote}
this brings up another question: what does "compatible" mean in this case?
{quote}
Please refer to the rules listed in 
[PIG-2579|https://issues.apache.org/jira/browse/PIG-2579?focusedCommentId=13446546&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13446546].
 I did this because several people asked for it. The use case is that people 
define Avro schemas, but they evolve over time. Since AvroStorage used to 
assume that all the input files have exactly the same schema, they couldn't 
load them. PIG-2579 was trying to address that inconvenience. Do you think that 
we should include similar functionality as an option in the new storage?

3) Recursive record support
{quote}
You can't specify a recursive schema in Pig, so why allow users to load files 
with recursive schemas in Pig? By default, recursive schema definitions should 
result in an error, or at least a warning message. I'd propose that this be 
allowed only as an option.
{quote}
Agreed (and guilty :-)). In fact, this was a feature request from one of my 
customers. The rationale was that people couldn't change their already-defined 
recursive schemas, but they wanted to do some processing on non-recursive parts 
of the data. Providing it as an option sounds good to me.

4) Multiple store support
{quote}
Can you explain the use case for multiple stores with different output schemas? 
I'm having a hard time understanding why it makes sense to do something 
complicated like that.
{quote}
I think that I wasn't clear. All I wanted to say is that if we have more than 
one relation to store in a script, we should be able to do it.
{code}
set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using 
org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using 
org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
{code}
The current storage supports multiple stores via the 'index' option. In fact, 
this is very hacky, and we should get rid of it. Nevertheless, I wanted to know 
if this will still be supported. On second thought, I think that your 
proposal already implies multiple store support because:
- The output schema will be derived from the Pig schema per store, or
- The user will specify the output schema per store.

So I don't see any problem.

Thanks!
                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.
