Currently we use Text (in Pig 0.7+, PigStorage just wraps Hadoop's text output 
format). But this can be annoying to maintain: column changes or reordering 
break things, and the output itself carries no information about what the 
columns are. Some of our intermediates last for a year or more, and having to 
re-generate them whenever enhancements are made is painful.

I'm in the process of writing an Avro reader/writer for Pig 
(https://issues.apache.org/jira/browse/AVRO-592), but it most likely won't make 
it into Avro 1.4 -- I don't have time to work on it for a few weeks.  
This would make it easier to maintain intermediate data, since the schema would 
be persisted in the files -- column ordering and schema evolution are part of 
Avro, so your Java and Pig don't have to stay in sync, and 'long term' 
intermediate outputs don't necessarily have to be re-generated when the code 
changes. In the long term, I'd like to be able to share such data between Java 
M/R, Hive, Pig, and Cascading relatively easily and not rely on text formats, 
which have issues with delimiters/escapes, performance, and schema maintenance.
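To make the schema-evolution point concrete: Avro readers resolve fields by name 
against the writer's schema, not by position, and reader-only fields fall back to 
their declared defaults. Here's a toy sketch of that resolution rule in Python 
(the schemas, field names, and `resolve` helper are all hypothetical illustrations, 
not Avro's actual API):

```python
# Toy model of Avro-style schema resolution: fields are matched by
# name (so column reordering is harmless), and fields that exist only
# in the newer reader schema take their declared defaults.

WRITER_SCHEMA = [("user_id", None), ("clicks", None)]  # schema the data was written with
READER_SCHEMA = [("clicks", None), ("user_id", None), ("country", "unknown")]  # newer schema

def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    out = {}
    for name, default in reader_schema:
        if name in record:
            out[name] = record[name]   # matched by name, not by position
        elif default is not None:
            out[name] = default        # reader-only field -> default value
        else:
            raise ValueError("no match and no default for %r" % name)
    return out

old_record = {"user_id": 42, "clicks": 7}  # written under the old schema
print(resolve(old_record, READER_SCHEMA))
# {'clicks': 7, 'user_id': 42, 'country': 'unknown'}
```

This is why the year-old intermediates wouldn't need re-generation: the old files 
keep their writer schema embedded, and newer jobs just resolve against it.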


On Jul 28, 2010, at 8:25 AM, Corbin Hoenes wrote:

> Mridul -
> 
> What file format do you use to exchange data between pig and java?  Text or 
> something else?
> 
> On Jul 25, 2010, at 1:52 PM, Mridul Muralidharan wrote:
> 
>> 
>> 
>> In some of our pipelines, pig jobs are part of the pipeline - which consist 
>> of other hadoop jobs, shell executions, etc.
>> We currently do this by using intermediate file dumps.
>> 
>> 
>> Regards,
>> Mridul
>> 
>> 
>> 
>> On Friday 23 July 2010 10:45 PM, Corbin Hoenes wrote:
>>> What are some strategies to have pig and java mapreduce jobs exchange data? 
>>> E.g. if we find a particular pig script in a chain is too slow and could 
>>> optimize it with a custom mapreduce job, we'd want pig to write the data out 
>>> in a format that mapreduce could access, and vice versa.
>>> 
>> 
> 