Shravan Matthur Narayanamurthy commented on PIG-545:

Thanks for running the patch, Alan. I tracked down the IndexOutOfBounds 
exception and fixed it; it should not happen anymore.

I was also working on the L10 issue. I tried the partitioner outside of Pig by 
sending it tuples (int, string) with the required ordering (desc, asc), and it 
works fine, so I don't think the problem is in the partitioner itself. Cases 
like asc, desc and user comparators should all be handled, since I use the 
comparator passed to me through the jobConf. So I checked the samples file that 
was generated. It's not sorted at all. The main assumption, that the samples 
are sorted, is therefore invalid, and the partitioner will definitely get 
messed up.
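
To illustrate why unsorted samples break things, here is a minimal sketch of 
quantile-based range partitioning. It is not the code from WRP.patch; the class 
QuantilePartitionSketch, the Row type and all method names are hypothetical, 
and plain Java collections stand in for Pig tuples and the jobConf comparator. 
The cut points only split the key space evenly if the samples were sorted with 
the same (desc, asc) comparator that is used for lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class QuantilePartitionSketch {

    /** Hypothetical stand-in for a Pig tuple of (int, string). */
    static final class Row {
        final int num;
        final String name;
        Row(int num, String name) { this.num = num; this.name = name; }
        @Override public String toString() { return "(" + num + "," + name + ")"; }
    }

    /** Ordering (desc, asc): first field descending, second field ascending. */
    static final Comparator<Row> DESC_ASC =
        Comparator.comparingInt((Row r) -> r.num).reversed()
                  .thenComparing((Row r) -> r.name);

    /** Checks the assumption the partitioner relies on: samples are in comparator order. */
    static boolean isSorted(List<Row> rows, Comparator<Row> cmp) {
        for (int i = 1; i < rows.size(); i++) {
            if (cmp.compare(rows.get(i - 1), rows.get(i)) > 0) return false;
        }
        return true;
    }

    /** Picks numPartitions - 1 evenly spaced cut points from samples that are already sorted. */
    static List<Row> cutPoints(List<Row> sortedSamples, int numPartitions) {
        List<Row> cuts = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            cuts.add(sortedSamples.get(i * sortedSamples.size() / numPartitions));
        }
        return cuts;
    }

    /** Partition p receives keys k with cuts[p-1] <= k < cuts[p] under cmp. */
    static int partitionFor(Row key, List<Row> cuts, Comparator<Row> cmp) {
        int idx = Collections.binarySearch(cuts, key, cmp);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        List<Row> samples = new ArrayList<>(Arrays.asList(
            new Row(5, "a"), new Row(9, "b"), new Row(1, "c"), new Row(7, "d"),
            new Row(3, "e"), new Row(7, "a"), new Row(9, "a"), new Row(2, "z")));

        // This is the situation found in the samples file: not in comparator order.
        System.out.println("sorted before sort? " + isSorted(samples, DESC_ASC));

        samples.sort(DESC_ASC);
        List<Row> cuts = cutPoints(samples, 4);
        System.out.println("cut points: " + cuts);
        System.out.println("partition of (6,q): "
            + partitionFor(new Row(6, "q"), cuts, DESC_ASC));
    }
}
```

If cutPoints is fed the unsorted list instead, the cut points are no longer 
monotonic under the comparator, so the binary search sends keys to arbitrary 
partitions and the reducers end up skewed.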

I finally figured out that the way we compile order by in MRCompiler is wrong. 
When we do the nested sort using the input POSort, we convert it into 
"order by *"; instead, it should be "order by $0, $1, $2 ...".

I have started L10 with the changes. Will update with the results.

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>         Attachments: WRP.patch
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.