[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-545:
-------------------------------
    Attachment: PIG-545-v3.patch

Attached a revised version of the last patch with the following changes:

1) When parallel is not specified, the code now consults jobClient to get defaultReduces() and uses 0.9 times that value as the number of reducers (and hence the number of quantiles).

2) There was a bug in the way order by * was handled in MRCompiler, which is now fixed.

3) In WeightedRangePartitioner, the basic idea is to first set up the quantiles array so that each entry is the last element of a quantile (partition). The code then iterates over all the sample items, and if it finds an item which equals the quantile element for its partition, there is a good chance this item repeats in the next quantile. The occurrences of such sample items in each partition are recorded and used when deciding which partition such an item in the real data should go to. The occurrences in a given partition divided by the total occurrences of such an element give the probability that the element should go to that partition. In the earlier version of the patch, to set this up, the code was comparing a sample item with the quantile element of the next partition instead of the quantile element of the partition in which the sample element falls. Since the quantile element is the last element of the partition, it is the one that should be used in the comparison to decide whether the element is likely to cross over to the next partition. This has been fixed.

4) The earlier patch was not handling the case where the number of samples < quantiles. This is handled now.
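To make the weighting scheme in item 3 concrete, here is a minimal sketch in Python. It is not the code in the patch (which is Java inside Pig). The function names, the integer rounding of the 0.9 factor, the way the sorted sample is cut into partitions, and the choice to shrink the partition count when there are fewer samples than quantiles are all assumptions made for illustration. As a simplification, it counts every sample item that equals any quantile element.

```python
import bisect
import random
from collections import Counter, defaultdict


def default_parallelism(default_reduces):
    """Hypothetical helper: 0.9 times the cluster default, at least 1.
    Integer rounding down is an assumption, not taken from the patch."""
    return max(1, (default_reduces * 9) // 10)


def build_partitioner(samples, num_partitions):
    """Return (quantiles, weights) computed from a list of sample keys.

    quantiles[i] is the last sample element of partition i.
    weights[key][i] is the probability that `key` is sent to partition i,
    recorded only for keys that occur in more than one partition.
    """
    ordered = sorted(samples)
    if not ordered:
        return [], {}
    # Assumed handling of "samples < quantiles": use fewer partitions.
    num_partitions = max(1, min(num_partitions, len(ordered)))
    # Cut the sorted sample into contiguous, nearly equal partitions.
    bounds = [(i * len(ordered)) // num_partitions
              for i in range(num_partitions + 1)]
    parts = [ordered[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]
    quantiles = [part[-1] for part in parts]

    # Count, per partition, the sample items that equal a quantile element.
    boundary_keys = set(quantiles)
    counts = defaultdict(Counter)
    for idx, part in enumerate(parts):
        for key in part:
            if key in boundary_keys:
                counts[key][idx] += 1

    # Occurrences in a partition over total occurrences = probability.
    weights = {}
    for key, per_part in counts.items():
        if len(per_part) > 1:
            total = sum(per_part.values())
            weights[key] = {idx: n / total for idx, n in per_part.items()}
    return quantiles, weights


def get_partition(key, quantiles, weights, rng=random):
    """Pick a partition for a key from the real data."""
    if not quantiles:
        return 0
    if key in weights:
        # Key spans several partitions: choose one at random, weighted by
        # how often the key was seen in each partition of the sample.
        candidates = list(weights[key])
        probs = [weights[key][p] for p in candidates]
        return rng.choices(candidates, weights=probs, k=1)[0]
    # Otherwise: first partition whose quantile element is >= key,
    # with keys above every quantile falling into the last partition.
    idx = bisect.bisect_left(quantiles, key)
    return min(idx, len(quantiles) - 1)
```

For example, with the sample [1, 2, 5, 5, 5, 5, 5, 5, 8, 9] and three partitions, the key 5 lands in all three partitions of the sample, so rows with key 5 in the real data would be spread across those reducers in proportion to the sample counts rather than all going to one reducer.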

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-545-v3.patch, WRP.patch, WRP1.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an
> order by has skewed partitions. Some reduces finish in a few seconds while
> some run for 20 minutes. Getting a better distribution should lead to much
> better performance for order by.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.