Dick King commented on PIG-791:

I am considering a modification to Hadoop that would allow users to designate 
that a map/reduce output is (a) sorted, (b) likely to be the input to some 
other map/reduce in which selected keys are re-emitted by the second mapper 
unchanged, with a selection probability that is not correlated with the 
ordering, and (c) sorted in the same order that the second map/reduce uses.

It would work by writing a sample file as a secondary output of the mapper in 
the first map/reduce.
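
To make the idea concrete, here is a minimal sketch in plain Python, not Hadoop 
code. The function name pass_through_with_sample and its parameters are 
hypothetical; it assumes the sample is a uniform reservoir sample of keys, 
which is only one possible choice.

```python
import random

def pass_through_with_sample(records, sample_size, seed=0):
    """Emit every (key, value) record unchanged and, as a side output,
    keep a uniform random sample of the keys (reservoir sampling)."""
    rng = random.Random(seed)
    primary_output = []   # stands in for the normal job output
    reservoir = []        # stands in for the secondary "sample file"
    for i, (key, value) in enumerate(records):
        primary_output.append((key, value))
        if i < sample_size:
            # Fill the reservoir with the first sample_size keys.
            reservoir.append(key)
        else:
            # Replace a random slot with probability sample_size / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = key
    return primary_output, sorted(reservoir)
```

The sample is collected in the same pass that writes the main output, so no 
separate sampling job is needed to obtain it.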

This proposal is on my back burner, but could potentially be moved up.

Would that functionality be generally useful here?

> reducing number of MR stages with ORDER BY
> ------------------------------------------
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
> When an order by is not the only operation in a pig script, it is done in two 
> additional MR jobs. The first job samples using a sampling loader, the second 
> does the sort. The sample is used to construct a partitioner that equally 
> balances the data in the sort. The sampler needs to be changed to be an 
> EvalFunc instead of a loader. This way a split can be put in the preceding 
> MR job, with the main data being written out and the other part flowing to 
> the sampler func, which can then write out the sample. The final MR job can 
> then be the sort. 
> This change depends on multiquery code.
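
As an illustration of how a sample can drive a balanced sort partitioner, here 
is a minimal sketch in plain Python. It is not Pig's actual implementation; 
the names cut_points and partition_for are hypothetical, and it assumes the 
sample is roughly representative of the key distribution.

```python
from bisect import bisect_right

def cut_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys at evenly spaced
    positions in the sorted sample."""
    s = sorted(sample)
    if not s or num_partitions <= 1:
        return []
    return [s[(i * len(s)) // num_partitions] for i in range(1, num_partitions)]

def partition_for(key, cuts):
    """Return the index of the partition whose key range contains key.
    Keys below the first cut go to partition 0; a key equal to a cut
    goes to the partition that starts at that cut."""
    return bisect_right(cuts, key)
```

Because partition_for never decreases as the key increases, sorting each 
partition independently and concatenating the partitions in index order yields 
a total order.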

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
