PERFORMANCE: The sample MR job in order by implementation can use Hadoop
sorting instead of doing a POSort
----------------------------------------------------------------------------------------------------------
Key: PIG-841
URL: https://issues.apache.org/jira/browse/PIG-841
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Fix For: 0.3.0
Currently, the sample MapReduce job in the order by implementation does the
following:
1. sample 100 records from each map
2. group all on the above output
3. sort the output bag from the above grouping on the keys of the order by
4. give the sorted bag to the FindQuantiles udf
Steps 2 and 3 above can be replaced by the following:
- group the sample output by the order by key and set the parallelism of the
group to 1, so that the output of the group goes to one reducer. Since Hadoop
delivers keys to a reducer in sorted order, the output of the group is already
sorted by key and we get the sorting for free, without using POSort
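
The proposed change can be sketched outside of Pig as a small simulation. This
is only an illustration under stated assumptions, not Pig's implementation:
sample_map, shuffle_to_single_reducer and find_quantiles are hypothetical
stand-ins for the per-map sampling, Hadoop's shuffle into a single reducer, and
the FindQuantiles udf, and the boundary-picking rule is a guess at the simplest
possible quantile selection.

```python
import random
from itertools import groupby


def sample_map(records, n=100, seed=0):
    """Hypothetical stand-in for step 1: sample up to n records from one map."""
    records = list(records)
    if len(records) <= n:
        return records
    return random.Random(seed).sample(records, n)


def shuffle_to_single_reducer(samples, key):
    """Imitate grouping by the order by key with parallelism 1.

    With a single reducer, Hadoop presents every key to that reducer in
    sorted order. The sorted() call stands in for the framework's own sort;
    no separate sort operator (POSort in Pig) runs afterwards.
    """
    ordered = sorted(samples, key=key)
    for k, group in groupby(ordered, key=key):
        yield k, list(group)


def find_quantiles(sorted_keys, num_partitions):
    """Assumed quantile rule: pick num_partitions - 1 evenly spaced boundaries
    from keys that already arrive in sorted order."""
    n = len(sorted_keys)
    return [sorted_keys[(i * n) // num_partitions] for i in range(1, num_partitions)]


# Three maps whose inputs cover different, unordered key ranges.
maps = [[(k, "row") for k in range(start, start + 50)] for start in (100, 0, 50)]
samples = [rec for m in maps for rec in sample_map(m)]

# Keys in the order the single reducer would see them, one entry per record.
keys_in_arrival_order = [
    k
    for k, group in shuffle_to_single_reducer(samples, key=lambda r: r[0])
    for _ in group
]
boundaries = find_quantiles(keys_in_arrival_order, 3)
print(boundaries)  # [50, 100]
```

The point of the sketch is that find_quantiles never sorts: it relies entirely
on the order in which the single reducer receives its keys.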