reducing number of MR stages with ORDER BY

                 Key: PIG-791
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.0
            Reporter: Olga Natkovich

When an ORDER BY is not the only operation in a Pig script, it is done in two 
additional MR jobs. The first job samples the data using a sampling loader; the 
second does the sort. The sample is used to construct a partitioner that evenly 
balances the data in the sort. The sampler needs to be changed to be an EvalFunc 
instead of a loader. This way a split can be put in the preceding MR job, with 
the main data being written out and the other branch flowing to the sampler 
function, which can then write out the sample. The final MR job can then be the 
sort.
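
The idea above can be sketched outside of Pig as a small in-memory Python
example. This is only an illustration under stated assumptions, not Pig's
implementation: the function names (pass_through_and_sample,
partition_boundaries, sampled_sort), the use of reservoir sampling, and the
quantile-based boundaries are all hypothetical choices made for the sketch. It
shows one pass that both keeps the main data and collects a sample (standing in
for the split in the preceding job), followed by a sort whose partitions are
chosen from that sample.

```python
import bisect
import random


def pass_through_and_sample(records, sample_size, rng):
    """One pass over the input: keep the main data and a reservoir sample.

    Stand-in for the proposed split: one branch writes the main data out,
    the other feeds a sampler function. Names here are hypothetical.
    """
    main_output = []
    sample = []
    for i, rec in enumerate(records):
        main_output.append(rec)  # stand-in for writing the main data out
        if len(sample) < sample_size:
            sample.append(rec)
        else:
            # Reservoir sampling: keep each record with equal probability.
            j = rng.randint(0, i)
            if j < sample_size:
                sample[j] = rec
    return main_output, sample


def partition_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced sample quantiles."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(k * len(ordered)) // num_partitions]
            for k in range(1, num_partitions)]


def partition_index(key, boundaries):
    """Map a key to a partition; larger keys never map to a smaller index."""
    return bisect.bisect_right(boundaries, key)


def sampled_sort(records, num_partitions=4, sample_size=100, seed=0):
    """Stand-in for the final sort job: partition by range, sort each part."""
    rng = random.Random(seed)
    main_output, sample = pass_through_and_sample(records, sample_size, rng)
    boundaries = partition_boundaries(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for rec in main_output:
        partitions[partition_index(rec, boundaries)].append(rec)
    return [sorted(part) for part in partitions]
```

Because the partition index never decreases as the key grows, concatenating
the sorted partitions in order gives a globally sorted result; how evenly the
partitions are filled depends on how representative the sample is.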

This change depends on multiquery code.
