[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770573#action_12770573 ]

Alan Gates commented on PIG-760:
--------------------------------

I know I'm wandering dangerously close to being fanatical here, but I really 
dislike taking a struct, making all the members private/protected, and then 
adding getters and setters.  If some tools need getters and setters, feel free 
to add them.  But please leave the members public.
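
To be concrete, here's a minimal sketch of the pattern I'm arguing for (the 
class and field names below are illustrative placeholders, not the actual 
members from the patch):

    public class ResourceFieldSchema {
        // Members stay public so the class can be used like a plain struct.
        public String name;
        public byte type;

        // Bean-style accessors added only because some tools want them;
        // they supplement direct field access rather than replace it.
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public byte getType() { return type; }
        public void setType(byte type) { this.type = type; }
    }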

I notice you snuck in your names for LoadMetadata and StoreMetadata.  I'm fine 
with motions to change the names.  But let's get everyone to agree on the new 
names before we start using them.

On the StoreMetadata interface, Pradeep had some thoughts on getting rid of it, 
as he felt all the necessary information could be communicated in 
StoreFunc.allFinished().  He should be publishing an update to the load/store 
redesign wiki ( http://wiki.apache.org/pig/LoadStoreRedesignProposal ) soon.  
He also wanted to change LoadMetadata.getSchema() to take a location so that 
the loader could find the file.
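
If we go that route, the signature change would presumably look something like 
the sketch below (just the idea; the exact parameters are still up for 
discussion on the wiki, and the ResourceSchema type is assumed from the 
load/store redesign proposal):

    import java.io.IOException;

    public interface LoadMetadata {
        // Passing the location lets the loader find the underlying file(s)
        // (and any side files, e.g. a .schema file) when building the schema.
        ResourceSchema getSchema(String location) throws IOException;

        // ... statistics and partition-related methods unchanged ...
    }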

Other changes all look good.  

One general thought.  I want to figure out how to keep the ResourceStatistics 
object flexible enough that it's easy to add new statistics to it.  One thought 
I'd had previously (I can't remember if we discussed this or not) was to add a 
Map<String, Object> to it.  That way we can add new stats between versions of 
the object.  Once the stats are accepted as valid and take hold, they could be 
moved into the object proper.  The upside of this is that it's flexible.  The 
downside is that we risk devolving into an unknown-properties object, and every 
stat has to go through a transition.  Thoughts?
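
Roughly what I'm picturing (a sketch only; the concrete members here are 
placeholders, not a proposed final list):

    import java.util.HashMap;
    import java.util.Map;

    public class ResourceStatistics {
        // Established, agreed-upon statistics live as ordinary public members.
        public Long numRecords;
        public Long sizeInBytes;

        // New or experimental statistics start out here, keyed by name, so
        // they can be added between versions without changing the class.
        // Once a stat takes hold it gets promoted to a member above.
        public Map<String, Object> additionalStats = new HashMap<String, Object>();
    }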

> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.6.0
>
>         Attachments: pigstorageschema-2.patch, pigstorageschema.patch
>
>
> I'm finding PigStorage() really convenient for storage and data interchange 
> because it compresses well and imports into Excel and other analysis 
> environments well.
> However, it is a pain when it comes to maintenance because the columns are in 
> fixed locations and I'd like to add columns in some cases.
> It would be great if a load using PigStorage() could read a default schema 
> from a .schema file stored with the data, and if a store using PigStorage() 
> could write a .schema file alongside the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode 
> will ignore a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
