PERFORMANCE: Bag creation can be more efficiently handled in order by

                 Key: PIG-744
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.0
            Reporter: Pradeep Kamath
             Fix For: 0.3.0

Currently, order by results in multiple MapReduce jobs (2 or 3 depending on the 
script), of which the last one does the actual ordering. In the reduce phase of 
this last job, POPackage creates a bag of values for each sort key (or 
combination of sort keys), where each value is the entire tuple being sorted. 
The foreach that follows the package then immediately flattens that bag, so 
the bag itself is never really needed. To stay generic and keep using the 
existing operators, we can instead tag the POPackage so that it creates bags 
backed by the Hadoop iterator itself. This avoids building a bag by copying 
each tuple from the Hadoop iterator, which should help both performance and 
scalability by making better use of memory.
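
The difference between the two approaches can be sketched in plain Java. This 
is only an illustration of the idea, not Pig's POPackage code: the names 
ReadOnceBagSketch, ReadOnceBag, copyingBag and flatten are hypothetical, and 
the String values stand in for tuples arriving from the reduce-side iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from Pig or Hadoop.
public class ReadOnceBagSketch {

    // Copying approach: every value is pulled from the iterator and
    // stored, so the whole group is held in memory at once.
    static <T> List<T> copyingBag(Iterator<T> values) {
        List<T> bag = new ArrayList<>();
        while (values.hasNext()) {
            bag.add(values.next());
        }
        return bag;
    }

    // Iterator-backed approach: the bag holds only a reference to the
    // underlying iterator, so it can be traversed exactly once.
    static final class ReadOnceBag<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        ReadOnceBag(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                throw new IllegalStateException("bag can be read only once");
            }
            consumed = true;
            return source;
        }
    }

    // Stand-in for the foreach that flattens the bag: each element is
    // handed to the consumer as soon as it is read.
    static <T> void flatten(Iterable<T> bag, Consumer<? super T> emit) {
        for (T value : bag) {
            emit.accept(value);
        }
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("tuple-1", "tuple-2", "tuple-3");

        // Materializes all three values before flattening.
        flatten(copyingBag(group.iterator()), System.out::println);

        // Streams the values straight through without an intermediate copy.
        flatten(new ReadOnceBag<>(group.iterator()), System.out::println);
    }
}
```

The trade-off in this sketch is that the iterator-backed bag supports a single 
pass only, which is enough when the only consumer is a flatten that reads each 
value once.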
