Dmitriy V. Ryaboy updated PIG-760:

    Attachment: pigstorageschema.patch

I am attaching a preliminary patch for this issue.

It implements a new Load/StoreFunc, PigStorageSchema, that extends 
PigStorage and serializes the schema to JSON; currently it only works 
for flat schemas (a JSON parser limitation that can probably be overcome with a 
bit of elbow grease). It also only works in MR mode, due to limitations of the 
StoreFunc interface: in local mode there is no way I am aware of to get the 
directory name you are writing to from the StoreFunc, whereas in MR mode I can 
get it from the JobConf.
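To make the flat-schema case concrete, here is a minimal sketch of what a 
serialized .schema file might contain. The class, method, and JSON field names 
below are illustrative, not the exact format the patch emits, and it uses plain 
string assembly rather than Jackson so the sketch stands alone:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: serialize a flat (non-nested) schema as a JSON
// object of the form {"fields":[{"name":"a","type":"int"},...]}.
public class FlatSchemaSketch {

    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append("{\"name\":\"").append(e.getKey())
              .append("\",\"type\":\"").append(e.getValue()).append("\"}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("a", "int");
        schema.put("b", "int");
        // prints {"fields":[{"name":"a","type":"int"},{"name":"b","type":"int"}]}
        System.out.println(toJson(schema));
    }
}
```

A nested schema would require recursing into tuple and bag fields, which is 
where the current flat-schema limitation comes in.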

It also writes the headers as described above, but at the moment it does not 
provide the nice constructors (like the ones David suggested) that would let 
users turn the functionality on and off.

Implementation notes:

I chose Jackson for JSON parsing because that's what Avro uses, so once Avro is 
used in Pig, we won't have two parsers that do the same thing.
I didn't modify the zip targets in build.xml to package the Avro libs, so if 
you want to use PigStorageSchema, you will want to register 
build/ivy/lib/Pig/jackson-mapper-asl-1.0.1.jar and 

This patch also uses a number of the interfaces (MetadataLoader/Writer, 
ResourceStatistics, ResourceSchema) from the Load/Store redesign proposal. I 
simply dumped them into org.apache.pig -- we may want to come up with an 
appropriate package.

As expected, implementing this raised a number of issues with the interfaces as 
proposed, most notably the need for getters and setters in order to enable Java 
tools that work with POJOs to interact with these interfaces.
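The getter/setter point can be shown with a cut-down, hypothetical version of a 
ResourceFieldSchema-style class (the real interfaces in the proposal have more 
fields; the names here are just stand-ins). With plain bean accessors, Jackson 
or any other POJO-aware tool can serialize it without custom glue:

```java
// Hypothetical bean-style field schema. Getters and setters following the
// JavaBeans naming convention are what let generic tools discover properties.
public class FieldSchemaBean {
    private String name;
    private byte type;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public byte getType() { return type; }
    public void setType(byte type) { this.type = type; }
}
```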

I indulged in some Class.cast trickery in DataType to avoid large swaths of 
copy-and-paste code. Despite what the patch appears to say, the changes to 
determineFieldSchema are fairly minimal; I just made it work on Object and 
ResourceFieldSchema at the same time.
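The Class.cast approach can be sketched as follows; the schema classes and the 
describe method below are illustrative stand-ins, not the patch's actual 
signatures. The idea is that one method body serves two schema types, with the 
caller supplying the expected class and Class.cast performing a checked downcast:

```java
// Sketch: a single generic implementation shared by two schema types,
// avoiding a copy-pasted variant per type. Types here are hypothetical.
public class CastSketch {
    static class OldSchema { String label = "old"; }
    static class NewSchema { String label = "new"; }

    static <T> String describe(Object schema, Class<T> klass) {
        // Checked downcast: throws ClassCastException on a mismatch instead
        // of relying on an unchecked (T) cast.
        T typed = klass.cast(schema);
        if (typed instanceof OldSchema) return ((OldSchema) typed).label;
        if (typed instanceof NewSchema) return ((NewSchema) typed).label;
        return "unknown";
    }

    public static void main(String[] args) {
        // prints old
        System.out.println(describe(new OldSchema(), OldSchema.class));
    }
}
```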

> Serialize schemas for PigStorage() and other storage types.
> -----------------------------------------------------------
>                 Key: PIG-760
>                 URL: https://issues.apache.org/jira/browse/PIG-760
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>         Attachments: pigstorageschema.patch
> I'm finding PigStorage() really convenient for storage and data interchange 
> because it compresses well and imports into Excel and other analysis 
> environments well.
> However, it is a pain when it comes to maintenance because the columns are in 
> fixed locations and I'd like to add columns in some cases.
> It would be great if load PigStorage() could read a default schema from a 
> .schema file stored with the data and if store PigStorage() could store a 
> .schema file with the data.
> I have tested this out and both Hadoop HDFS and Pig in -exectype local mode 
> will ignore a file called .schema in a directory of part files.
> So, for example, if I have a chain of Pig scripts I execute such as:
> A = load 'data-1' using PigStorage() as ( a: int , b: int );
> store A into 'data-2' using PigStorage();
> B = load 'data-2' using PigStorage();
> describe B;
> describe B should output something like { a: int, b: int }

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
