[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shravan Matthur Narayanamurthy updated PIG-545:
-----------------------------------------------

    Attachment: WRP.patch

This patch implements the Weighted Range Partitioner as described in the DeWitt et al. paper "Practical Skew Handling in Parallel Joins". The JobControlCompiler has been modified to use the new partitioner for order by, so the old unit tests should still be valid.

One caveat: the number of reducers has to be specified via the parallel keyword when doing an order by. Currently, if it is not specified, there is just one partition by default, which messes up the distribution. We need to do something about this.

Another thing: when the Partitioner gets configured it reads the entire sample file from HDFS, but it currently doesn't do any reporting, as I could not think of a way to do it right now.

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: WRP.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an
> order by has skewed partitions. Some reduces finish in a few seconds while
> some run for 20 minutes. Getting a better distribution should lead to much
> better performance for order by.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
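
For readers unfamiliar with the technique named in the comment above, here is a minimal sketch of the general idea behind weighted range partitioning. It is not the code in WRP.patch, and it is written in Python purely for brevity. The class name `WeightedRangePlan`, its methods, and the sample data are all hypothetical. The assumption is that a sorted key sample is cut into one equal-sized slice per reducer, and that a key whose sample occurrences straddle several slices is sent to those reducers at random, in proportion to how many of its occurrences fell in each slice.

```python
import bisect
import random
from collections import Counter


class WeightedRangePlan:
    """Hypothetical sketch: range partitioning with weights for heavy keys."""

    def __init__(self, sample, num_reducers, seed=None):
        if len(sample) < num_reducers:
            raise ValueError("need at least one sample key per reducer")
        ordered = sorted(sample)
        n = len(ordered)
        last_key = [None] * num_reducers   # largest sample key in each slice
        spread = {}                        # key -> Counter(reducer -> occurrences)
        for i, key in enumerate(ordered):
            reducer = i * num_reducers // n   # slice that owns sample slot i
            last_key[reducer] = key
            spread.setdefault(key, Counter())[reducer] += 1
        # Upper bound of every slice except the last one.
        self.boundaries = last_key[:-1]
        # Only keys whose sample occurrences span several slices need weights.
        self.weighted = {
            key: (sorted(c), [c[r] for r in sorted(c)])
            for key, c in spread.items()
            if len(c) > 1
        }
        self.rng = random.Random(seed)

    def partition(self, key):
        if key in self.weighted:
            reducers, weights = self.weighted[key]
            # Spread a heavy key over its slices in proportion to the sample.
            return self.rng.choices(reducers, weights=weights)[0]
        # Ordinary key: first slice whose upper bound is >= key.
        return bisect.bisect_left(self.boundaries, key)


# Key 5 dominates the sample, so it is shared by reducers 1, 2 and 3.
plan = WeightedRangePlan([1, 2, 5, 5, 5, 5, 5, 5, 8, 9], num_reducers=5, seed=0)
print(plan.boundaries)
print([plan.partition(k) for k in (1, 3, 8)])
```

In this sketch, constructing the plan with `num_reducers=1` leaves the boundary list empty, so every key lands in partition 0. That is the toy equivalent of the single-partition default described in the comment.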