[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697631#action_12697631 ]

David Ciemiewicz commented on PIG-760:
--------------------------------------

Sure, you could do that: create a PigStorageSchema class.

The thing is, I don't think that's necessary, and it is possible to do this in
a backward-compatible way.

First, if the user specifies a schema in a LOAD ... AS clause, then PigStorage
could simply use that "casting" to override what is in the serialized .schema
file.  Of course, PigStorage might want to warn at run time that an override is
happening, or issue a "smart" warning only when there are incompatible
differences between the serialized schema and the explicit AS clause schema.
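
For example, assuming 'data-2' carries a serialized .schema of ( a: int, b: int )
as in the example from the issue description below, the explicit AS clause wins:

B = load 'data-2' using PigStorage() as ( a: int, b: chararray );
-- the AS clause overrides the serialized schema; the "smart" warning would
-- fire only for b, where the declared chararray conflicts with the serialized int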

Next, is there really any harm in creating the serialized schema file on each
and every STORE?
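
For concreteness, each STORE would then simply leave the .schema file sitting
next to the part files, something like:

data-2/
    .schema
    part-00000
    part-00001

and, as the issue description below notes, both Hadoop HDFS and Pig in
-exectype local mode already ignore a file called .schema in a directory of
part files.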

Finally, why subclass when we could parameterize?

In other words, instead of writing:

store A into 'file' using PigStorageSchema();

Why not do:

store A into 'file' using PigStorage('schema=yes');  -- redundant, since schema=yes is the default

I think a single class with parameterized options would be more useful than a
proliferation of classes.

Or, better yet, why can't I just define the behavior of PigStorage() for all of 
the instances in my script:

define PigStorage PigStorage(
        'sep=\t',
        'schema=yes',
        'erroronmissingcolumn=no'
);

I have recently done similar things for other functions, and it turns out to
be a nice way of capturing "global" parameterizations for cleaner Pig code.
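
For instance, with the define above in place (and assuming the proposed 'sep'
and 'schema' options actually existed), every later bare use of PigStorage
picks up the parameterization:

A = load 'data-1' using PigStorage();      -- tab-separated, reads the .schema file
store A into 'data-2' using PigStorage();  -- writes a .schema file with the data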




> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> I'm finding PigStorage() really convenient for storage and data interchange 
> because it compresses well and imports into Excel and other analysis 
> environments well.
> However, it is a pain when it comes to maintenance because the columns are in 
> fixed locations and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a 
> .schema file stored with the data and if store PigStorage() could store a 
> .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode 
> will ignore a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }
