Pradeep Kamath commented on PIG-841:

This mechanism can be used for any join that requires sampling, such as the 
skewed join described in http://wiki.apache.org/pig/PigSkewedJoinSpec

> PERFORMANCE: The sample MR job in order by implementation can use Hadoop 
> sorting instead of doing a POSort
> ----------------------------------------------------------------------------------------------------------
>                 Key: PIG-841
>                 URL: https://issues.apache.org/jira/browse/PIG-841
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>             Fix For: 0.3.0
> Currently the sample map reduce job in the order by implementation does the 
> following:
>  1. samples 100 records from each map
>  2. does a group all on the above output
>  3. sorts the output bag from the above grouping on the keys of the order by
>  4. gives the sorted bag to the FindQuantiles udf
> Steps 2 and 3 above can be replaced by the following:
>  - group the sample output by the order by key and set the parallelism of the 
>    group to 1 so that the output of the group goes to one reducer. Since Hadoop 
>    ensures that the output of the group is sorted by key, we get sorting for 
>    free without using POSort.
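
The difference between the two plans in the description above can be sketched 
in a few lines of Python. This is only an illustrative simulation, not Pig or 
Hadoop code: the function names (sample_per_map, current_plan, proposed_plan) 
are hypothetical, records are assumed to be plain tuples, and the shuffle of a 
single-reducer job is stood in for by a stable sort on the key.

```python
import random
from itertools import groupby


def sample_per_map(splits, n=100, seed=0):
    """Take up to n sample records from each map's input split."""
    rng = random.Random(seed)
    samples = []
    for split in splits:
        samples.extend(rng.sample(split, min(n, len(split))))
    return samples


def current_plan(samples, key):
    """Group all samples into one bag, then sort the bag explicitly.

    The explicit sort is a stand-in for the POSort step.
    """
    bag = list(samples)           # "group all": a single unsorted bag
    return sorted(bag, key=key)   # explicit in-memory sort of the bag


def proposed_plan(samples, key):
    """Group by the order-by key with a single reducer.

    Assumption: the framework's shuffle hands the lone reducer its groups
    in key order; sorted() here only simulates that shuffle, so the
    reducer itself never sorts.
    """
    shuffled = sorted(samples, key=key)  # simulated shuffle, not user code
    reducer_input = []
    for _, group in groupby(shuffled, key=key):
        reducer_input.extend(group)      # groups arrive already ordered
    return reducer_input
```

Under these assumptions both functions hand the same key-ordered sequence to 
the quantile step; the proposed plan just moves the sorting work into the 
shuffle that the job performs anyway.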
