[jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

Alan Gates (JIRA) Thu, 05 Feb 2009 15:04:25 -0800

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670938#action_12670938 ]


Alan Gates commented on PIG-545:
--------------------------------

I ran PigMix L9 (order by on a single field) and L10 (order by on multiple 
fields).  L9 went from 14 minutes to 8, so this patch holds huge promise.  But 
L10 went from 8 minutes to 11, so it doesn't seem to be working well in the 
multiple-field case.  (It could also be related to the fact that L10 sorts one 
of the columns in descending order; I don't know whether the new partitioner 
can handle that.)  I also ran our end-to-end order by tests on it, and all 
passed except bigdata_1, which fails with an IndexOutOfBounds exception in the 
new WeightedRangePartitioner class.

As for the caveat that it needs to know the number of reducers up front, I 
believe that in cases where the user doesn't specify parallel, we can determine 
the parallelism of the reduce phase using JobClient.getDefaultReduces().  We 
need to double-check that this gives us the right information in both the HOD 
and non-HOD cases.
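
For illustration only, here is a minimal sketch of plain quantile-based range 
partitioning from a sample.  It is not the code in WRP.patch and does not 
reproduce WeightedRangePartitioner; the class and method names are 
hypothetical.  It is meant to show three points raised above: the cut points 
depend on the reducer count (numReducers - 1 boundaries), a descending sort 
can be handled by passing the same comparator to both steps, and the partition 
index should always fall in the range 0 to numReducers - 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not taken from Pig or from WRP.patch.
class RangePartitionSketch {

    // Pick (numReducers - 1) cut points at evenly spaced quantiles of the
    // sample, sorted in the same order the final output will use.
    static <K> List<K> chooseBoundaries(List<K> sample, int numReducers,
                                        Comparator<? super K> order) {
        if (numReducers < 1) {
            throw new IllegalArgumentException("numReducers must be >= 1");
        }
        List<K> sorted = new ArrayList<>(sample);
        sorted.sort(order);
        List<K> boundaries = new ArrayList<>();
        if (sorted.isEmpty()) {
            return boundaries; // no sample: every key goes to partition 0
        }
        for (int i = 1; i < numReducers; i++) {
            int idx = (int) ((long) i * sorted.size() / numReducers);
            // Clamp defensively so the index never runs past the sample.
            boundaries.add(sorted.get(Math.min(idx, sorted.size() - 1)));
        }
        return boundaries;
    }

    // Map a key to a partition in [0, boundaries.size()].
    static <K> int partition(K key, List<K> boundaries,
                             Comparator<? super K> order) {
        int pos = Collections.binarySearch(boundaries, key, order);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(5, 1, 9, 3, 7, 2, 8, 4, 6, 10);
        Comparator<Integer> desc = Comparator.reverseOrder();
        List<Integer> cuts = chooseBoundaries(sample, 4, desc);
        System.out.println("descending cut points: " + cuts);
        System.out.println("partition of 10: " + partition(10, cuts, desc));
        System.out.println("partition of 1: " + partition(1, cuts, desc));
    }
}
```

Note that in this simple scheme every key equal to a cut point lands in a 
single partition, so a heavily repeated key can still produce a skewed reduce.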

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: WRP.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
