[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670938#action_12670938 ]
Alan Gates commented on PIG-545:
--------------------------------
I ran PigMix L9 (order by on a single field) and L10 (order by on multiple
fields). L9 went from 14 minutes to 8, so this patch holds huge promise. But
L10 went from 8 minutes to 11, so it doesn't seem to work well in the
multiple-field case. (It could also be related to the fact that L10 sorts one
of the columns in descending order; I don't know whether the new partitioner
can handle that.) I also ran our end-to-end order by tests on it, and all
passed except bigdata_1, which fails with an IndexOutOfBounds exception in the
new WeightedRangePartitioner class.
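
To make the descending-order and out-of-bounds concerns concrete, here is a minimal sketch of range partitioning over sampled keys. It is not the code in WRP.patch: the class and method names are hypothetical, keys are plain integers rather than Pig tuples, and it does not spread repeated boundary keys across reducers as a weighted partitioner would. It assumes a non-empty sample and at least one reducer.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of range partitioning for an order by; not the code in WRP.patch. */
final class RangePartitionSketch {

    /** Sort order shared by boundary selection and lookup; both must agree. */
    static Comparator<Integer> comparator(boolean descending) {
        return descending ? Comparator.<Integer>reverseOrder()
                          : Comparator.<Integer>naturalOrder();
    }

    /**
     * Picks numReducers - 1 boundary keys from the sample at evenly spaced
     * positions, after sorting the sample in the requested order.
     */
    static Integer[] boundaries(Integer[] sample, int numReducers, boolean descending) {
        Integer[] sorted = sample.clone();
        Arrays.sort(sorted, comparator(descending));
        Integer[] cuts = new Integer[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            // i < numReducers, so the index is always below sorted.length.
            int idx = (int) ((long) i * sorted.length / numReducers);
            cuts[i - 1] = sorted[idx];
        }
        return cuts;
    }

    /**
     * Maps a key to a reducer in [0, cuts.length]. A key equal to a boundary
     * goes to the partition that ends at that boundary.
     */
    static int partition(Integer key, Integer[] cuts, boolean descending) {
        int pos = Arrays.binarySearch(cuts, key, comparator(descending));
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

Because the insertion point is never larger than the number of boundaries, the returned index stays within the reducer range even for keys outside the sampled range. With descending set, larger keys map to lower-numbered reducers, so concatenating reducer outputs in order still yields a descending result.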

As for the caveat that it needs to know the number of reducers up front, I
believe that in cases where the user doesn't specify PARALLEL, we can determine
the parallelism of the reduces using JobClient.getDefaultReduces(). We need to
double-check that this gives us the right information in both the HOD and
non-HOD cases.
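
The fallback described above could be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes that a non-positive value means the script gave no PARALLEL clause. The cluster lookup is passed in as a supplier so the example runs without Hadoop; in a real job it might wrap JobClient.getDefaultReduces(), which can throw IOException and would need handling.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of choosing the reducer count before building the partitioner. */
final class ReducerCountSketch {

    /**
     * Returns the PARALLEL value when one was given; otherwise asks the
     * cluster for its default. Assumption: requestedParallelism <= 0 means
     * the script did not specify PARALLEL.
     */
    static int resolveReducers(int requestedParallelism, IntSupplier clusterDefault) {
        if (requestedParallelism > 0) {
            return requestedParallelism;
        }
        // Consulted only when PARALLEL is absent; never return fewer than one reducer.
        return Math.max(1, clusterDefault.getAsInt());
    }
}
```

For example, `ReducerCountSketch.resolveReducers(-1, () -> 10)` falls back to the supplied cluster value, while `resolveReducers(4, () -> 10)` keeps the explicit setting.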
> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
> Key: PIG-545
> URL: https://issues.apache.org/jira/browse/PIG-545
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Amir Youssefi
> Fix For: types_branch
>
> Attachments: WRP.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an
> order by has skewed partitions. Some reduces finish in a few seconds while
> some run for 20 minutes. Getting a better distribution should lead to much
> better performance for order by.