[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.

Dmitriy V. Ryaboy (JIRA) Tue, 27 Oct 2009 12:45:23 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770618#action_12770618
 ]


Dmitriy V. Ryaboy commented on PIG-760:
---------------------------------------

bq. If some tools need getters and setters, feel free to add them. But please 
leave the members public. 

I'll revert the change.

bq. I notice you snuck in your names for LoadMetadata and StoreMetadata. I'm 
fine with motions to change the names. But let's get everyone to agree on the 
new names before we start using them.

Yeah, I kind of figured we'd get to discuss this again if I did that :-). It 
seems like we didn't really reach a final decision last time. Are we sure the 
only times it might be reasonable to read or write metadata are during Loads 
and Stores? I am not. I can envision future uses where the "storage" is some 
ephemeral state that we have operators reporting stats into to enable adaptive 
optimizations. Also, and I know I am nitpicking, "LoadMetadata" is an 
instruction, whereas "MetadataLoader" is a thing. Same with "StoreMetadata" and 
"MetadataStorer" (but "storer" isn't a real word, so I chose Writer).

bq. On the StoreMetadata interface, Pradeep had some thoughts on getting rid of 
it, as he felt all the necessary information could be communicated in 
StoreFunc.allFinished(). He should be publishing an update to the load/store 
redesign wiki ( http://wiki.apache.org/pig/LoadStoreRedesignProposal ) soon. 

I was envisioning the setStatistics() and setSchema() methods as methods used 
to alter state, whereas allFinished()  essentially does the job of flushing 
whatever is needed (you'll notice I fake an allFinished() method in my finish() 
implementation by simply checking if any other task has started creating the 
necessary file yet -- a suboptimal workaround, but the best that can be done 
with the current interface).
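
A minimal sketch of that kind of workaround, assuming a local filesystem in 
place of HDFS. The class name, the writeSchemaOnce() helper and the use of 
java.nio.file are illustrative assumptions, not code from the attached patch; 
only the idea of skipping the write when another task has already started the 
file comes from the comment above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FinishWorkaroundSketch {

    /**
     * Hypothetical helper: called from each task's finish(). Returns true if
     * this call wrote the schema file, false if another task got there first.
     */
    public static boolean writeSchemaOnce(Path outputDir, String schemaJson)
            throws IOException {
        Path schemaFile = outputDir.resolve(".schema");
        if (Files.exists(schemaFile)) {
            return false; // another task has already started creating the file
        }
        try {
            Files.createFile(schemaFile); // fails if the file appeared meanwhile
        } catch (FileAlreadyExistsException e) {
            return false; // lost the race to a concurrent task
        }
        Files.write(schemaFile, schemaJson.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pig-760-sketch");
        System.out.println("first task wrote schema:  " + writeSchemaOnce(dir, "{}"));
        System.out.println("second task wrote schema: " + writeSchemaOnce(dir, "{}"));
    }
}
```

With a real allFinished() callback, none of the existence checks would be 
needed, since the write would happen exactly once after all tasks complete.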

bq. He also wanted to change LoadMetadata.getSchema() to take a location so 
that the loader could find the file.

A location by itself may not be sufficient -- for example, for the JsonMetadata 
implementation, I need the DataStorage as well. 
I solved that by passing the location and storage into JsonMetadata's 
constructor. 
There is something to be said for being able to reuse the same MetadataLoader 
to load schemas for multiple locations, however. I agree with the change, 
assuming we can't come up with any scenario where we no longer have the 
location by the time we need to get the schema (for example, because we 
created the MetadataLoader beforehand and set its internal location then).
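
The two options can be sketched side by side. This is only an illustration 
under stated assumptions: Storage is a hypothetical stand-in for Pig's 
DataStorage, both loader classes and their signatures are made up for the 
example, and the ".schema" file name is taken from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataLoaderSketch {

    /** Hypothetical stand-in for DataStorage: returns the text at a path, or null. */
    public interface Storage {
        String read(String path);
    }

    /** Option 1: location and storage are both fixed in the constructor. */
    public static final class BoundMetadataLoader {
        private final String location;
        private final Storage storage;

        public BoundMetadataLoader(String location, Storage storage) {
            this.location = location;
            this.storage = storage;
        }

        public String getSchema() {
            return storage.read(location + "/.schema");
        }
    }

    /** Option 2: storage is fixed once, the location is passed on each call. */
    public static final class ReusableMetadataLoader {
        private final Storage storage;

        public ReusableMetadataLoader(Storage storage) {
            this.storage = storage;
        }

        public String getSchema(String location) {
            return storage.read(location + "/.schema");
        }
    }

    public static void main(String[] args) {
        // In-memory map standing in for files on the storage layer.
        Map<String, String> files = new HashMap<>();
        files.put("data-1/.schema", "a: int, b: int");
        files.put("data-2/.schema", "a: int, b: int, c: chararray");
        Storage storage = files::get;

        // Option 1 needs one loader per location.
        System.out.println(new BoundMetadataLoader("data-1", storage).getSchema());

        // Option 2 reuses a single loader across locations.
        ReusableMetadataLoader loader = new ReusableMetadataLoader(storage);
        System.out.println(loader.getSchema("data-1"));
        System.out.println(loader.getSchema("data-2"));
    }
}
```

Option 2 only works if the caller still has the location in hand when 
getSchema() is called, which is the caveat raised above.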

bq. One thought I'd had previously (I can't remember if we discussed this or 
not) was to add a Map<String, Object> to it

I have a feeling we did discuss this, or something like this, possibly in a 
different context, but I can't find the mention either. I am not sure what we 
would gain by this -- the only consumers of stats would be various 
optimizers/compilers/translators, right? So they would need to be updated to 
deal with new stats, and code that propagates / estimates stats down a logical 
plan would need to get updated, whenever a new statistic is added. That sounds 
pretty extensive. If we instead assume that any field is nullable (or, if it is 
a collection, can be empty), and make sure that all missing fields are filled 
in as nulls/empties when the stat objects are deserialized, we should be OK 
with upgrades.
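
A minimal sketch of the nullable-fields idea. The Stats class, its field names 
(numRecords, sizeInBytes, sortKeys) and the Map<String, String> standing in for 
the serialized form are all hypothetical; the point is only that fields absent 
from older serialized data come back as null or as an empty collection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsUpgradeSketch {

    /** Hypothetical stats object: every scalar may be null, every list may be empty. */
    public static final class Stats {
        public final Long numRecords;       // null when unknown
        public final Long sizeInBytes;      // null when unknown
        public final List<String> sortKeys; // empty when unknown, never null

        Stats(Long numRecords, Long sizeInBytes, List<String> sortKeys) {
            this.numRecords = numRecords;
            this.sizeInBytes = sizeInBytes;
            this.sortKeys = sortKeys;
        }
    }

    /** Builds Stats from a serialized key/value form, tolerating missing keys. */
    public static Stats fromSerialized(Map<String, String> raw) {
        String n = raw.get("numRecords");
        String b = raw.get("sizeInBytes");
        String k = raw.get("sortKeys");
        return new Stats(
                n == null ? null : Long.valueOf(n),
                b == null ? null : Long.valueOf(b),
                (k == null || k.isEmpty())
                        ? Collections.<String>emptyList()
                        : Arrays.asList(k.split(",")));
    }

    public static void main(String[] args) {
        // Data written by an older version that only knew about numRecords.
        Map<String, String> old = new HashMap<>();
        old.put("numRecords", "1000");

        Stats stats = fromSerialized(old);
        System.out.println("numRecords  = " + stats.numRecords);
        System.out.println("sizeInBytes = " + stats.sizeInBytes);
        System.out.println("sortKeys    = " + stats.sortKeys);
    }
}
```

Consumers then have to check for null or empty before using a statistic, but 
data written before a field existed still deserializes cleanly.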


> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.6.0
>
>         Attachments: pigstorageschema-2.patch, pigstorageschema.patch
>
>
> I'm finding PigStorage() really convenient for storage and data interchange 
> because it compresses well and imports into Excel and other analysis 
> environments well.
> However, it is a pain when it comes to maintenance because the columns are in 
> fixed locations and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a 
> .schema file stored with the data and if store PigStorage() could store a 
> .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode 
> will ignore a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
