Currently we use text (in Pig 0.7+, PigStorage just wraps Hadoop's text output format), but it is annoying to maintain. Column changes or reordering break things, and the output itself carries no information about what the columns are. Some of our intermediates last for a year or more, and having to regenerate them whenever enhancements are made is a burden.
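To make the brittleness concrete: with text intermediates, every consuming script has to restate the schema positionally, and nothing checks it against what the producer actually wrote. A sketch (field names here are made up for illustration):

```pig
-- The consumer must restate the schema by position.
-- If the producing job adds, drops, or reorders a column,
-- this LOAD silently misparses the data rather than failing.
raw = LOAD 'intermediate/events/part-*' USING PigStorage('\t')
      AS (user_id:long, url:chararray, ts:long);
```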
I'm in the process of writing an Avro reader/writer for Pig (https://issues.apache.org/jira/browse/AVRO-592), but it most likely won't make it into Avro 1.4 -- I don't have time to work on it for a few weeks. This would make it easier to maintain intermediate data, since the schema would be persisted in the files -- column ordering and schema evolution are part of Avro, so your Java and Pig don't have to be kept in sync, and 'long term' intermediate outputs don't necessarily have to be regenerated when the code changes.

In the long term, I'd like to be able to share such data between Java M/R, Hive, Pig, and Cascading relatively easily and not rely on text formats, which have issues with delimiters/escapes, performance, and schema maintenance.

On Jul 28, 2010, at 8:25 AM, Corbin Hoenes wrote:

> Mridul -
>
> What file format do you use to exchange data between pig and java? Text or
> something else?
>
> On Jul 25, 2010, at 1:52 PM, Mridul Muralidharan wrote:
>
>> In some of our pipelines, pig jobs are part of the pipeline - which consist
>> of other hadoop jobs, shell executions, etc.
>> We currently do this by using intermediate file dumps.
>>
>> Regards,
>> Mridul
>>
>> On Friday 23 July 2010 10:45 PM, Corbin Hoenes wrote:
>>> What are some strategies to have pig and java mapreduce jobs exchange data?
>>> E.g. we find a particular pig script in a chain is too slow and we could
>>> optimize with a custom mapreduce job we'd want pig to write the data out in
>>> a format that mapreduce could access and vice versa.
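For anyone curious, here's roughly what I'd like Pig usage to look like once the AVRO-592 reader/writer exists. The `AvroStorage` name and the field names are assumptions on my part -- nothing here is a final API:

```pig
-- Hypothetical: store an intermediate with its Avro schema
-- embedded in the data files themselves.
events = LOAD 'input' AS (user_id:long, url:chararray, ts:long);
STORE events INTO 'intermediate/events' USING AvroStorage();

-- Later, from this or another job: the schema travels with the data,
-- so no AS clause is needed, and added/reordered columns are handled
-- by Avro's schema resolution instead of breaking the load.
events2 = LOAD 'intermediate/events' USING AvroStorage();
```

The same files would be readable from a Java M/R job via Avro's mapred input format, which is the point: one self-describing format shared across tools instead of per-consumer delimiter conventions.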
