[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-545:
-----------------------------------------------

    Attachment: WRP1.patch

Ran some tests, and this quantiles scheme seems to have the least deviation from 
a perfect distribution. Also, the time taken for L10 has reduced. It took 8 mins 
vs 7 mins for the old code, but it produces a good distribution, as the 
per-partition listing under "New algorithm" below shows. 
The patch also modifies MRCompiler to fix sorting on multiple fields with a 
different order for each column.
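
For readers unfamiliar with quantile-based range partitioning, here is a minimal sketch of the general idea: sort a sample of the sort keys, take cut points at evenly spaced ranks, and route each key to the range it falls in. This is not the code in WRP1.patch. The function names are hypothetical, and the sketch assumes scalar keys and a single ascending sort order. It also does nothing special for a single key that dominates the data, which a real partitioner would need to handle.

```python
import bisect
import random
from collections import Counter


def quantile_boundaries(sample, num_partitions):
    """Return num_partitions - 1 cut points taken at evenly spaced ranks
    of the sorted sample (hypothetical helper, not Pig's API)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(boundaries, key)


# Small demonstration on synthetic, skewed keys (made-up data).
rng = random.Random(0)
keys = [int(rng.expovariate(0.01)) for _ in range(10000)]
sample = rng.sample(keys, 500)
cuts = quantile_boundaries(sample, 4)
counts = Counter(partition_for(k, cuts) for k in keys)
print(cuts, sorted(counts.items()))
```

The closer the sample tracks the real key distribution, the more even the per-partition counts become, which is the property the listings below are comparing.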
New algorithm:
{noformat}
/part-00000<r 3>        396866140
/part-00001<r 3>        388565356
/part-00002<r 3>        412419093
/part-00003<r 3>        404673062
/part-00004<r 3>        407805613
/part-00005<r 3>        399685590
/part-00006<r 3>        374470156
/part-00007<r 3>        407210410
/part-00008<r 3>        392022575
/part-00009<r 3>        403592598
/part-00010<r 3>        407005509
/part-00011<r 3>        392739807
/part-00012<r 3>        407132246
/part-00013<r 3>        393974442
/part-00014<r 3>        394310422
/part-00015<r 3>        397676923
/part-00016<r 3>        408960794
/part-00017<r 3>        407120924
/part-00018<r 3>        398555578
/part-00019<r 3>        398831802
/part-00020<r 3>        381319493
/part-00021<r 3>        397961816
/part-00022<r 3>        408716378
/part-00023<r 3>        401850651
/part-00024<r 3>        394624621
/part-00025<r 3>        411533286
/part-00026<r 3>        397598333
/part-00027<r 3>        402013011
/part-00028<r 3>        412664722
/part-00029<r 3>        390615865
/part-00030<r 3>        402257701
/part-00031<r 3>        404278892
/part-00032<r 3>        408376085
/part-00033<r 3>        403230193
/part-00034<r 3>        396062725
/part-00035<r 3>        403166437
/part-00036<r 3>        396123295
/part-00037<r 3>        400208557
/part-00038<r 3>        396028297
/part-00039<r 3>        428541846
{noformat}
Old algorithm:
{noformat}
/part-00000<r 3>        39703
/part-00001<r 3>        396917259
/part-00002<r 3>        388958263
/part-00003<r 3>        412109839
/part-00004<r 3>        405626251
/part-00005<r 3>        411808194
/part-00006<r 3>        385084639
/part-00007<r 3>        618796205
/part-00008<r 3>        59754649
/part-00009<r 3>        506719655
/part-00010<r 3>        403039137
/part-00011<r 3>        406540458
/part-00012<r 3>        395629722
/part-00013<r 3>        404795418
/part-00014<r 3>        394881722
/part-00015<r 3>        393959841
/part-00016<r 3>        398194260
/part-00017<r 3>        408370148
/part-00018<r 3>        334248039
/part-00019<r 3>        260118680
/part-00020<r 3>        642453106
/part-00021<r 3>        383168594
/part-00022<r 3>        364791108
/part-00023<r 3>        408601454
/part-00024<r 3>        404588449
/part-00025<r 3>        392940424
/part-00026<r 3>        413354408
/part-00027<r 3>        412538285
/part-00028<r 3>        385894942
/part-00029<r 3>        412674723
/part-00030<r 3>        392572446
/part-00031<r 3>        403012671
/part-00032<r 3>        398679596
/part-00033<r 3>        410864380
/part-00034<r 3>        405389743
/part-00035<r 3>        397248129
/part-00036<r 3>        401438264
/part-00037<r 3>        396456821
/part-00038<r 3>        402122621
/part-00039<r 3>        816408998
{noformat}
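
To put a number on "deviation from perfect distribution", one rough option is to compare each partition against the mean. The sketch below parses lines in the format of the listings above and reports the largest and smallest partition relative to the mean, plus the coefficient of variation. The helper names are hypothetical, and it assumes the trailing number on each line is the size of that part file, which the message does not state explicitly.

```python
import re
import statistics

# Matches lines such as "/part-00000<r 3>        396866140".
LINE = re.compile(r"^(\S+?)<r \d+>\s+(\d+)$")


def parse_sizes(listing):
    """Extract the trailing number from every part-file line; other lines are skipped."""
    sizes = []
    for line in listing.splitlines():
        match = LINE.match(line.strip())
        if match:
            sizes.append(int(match.group(2)))
    return sizes


def skew_summary(sizes):
    """Summarise how far the partitions are from a perfectly even split."""
    mean = statistics.fmean(sizes)
    return {
        "max_over_mean": max(sizes) / mean,   # 1.0 would be perfectly even
        "min_over_mean": min(sizes) / mean,
        "cv": statistics.pstdev(sizes) / mean,  # coefficient of variation
    }


# Three lines excerpted from the old-algorithm listing, for illustration only.
excerpt = """\
/part-00000<r 3>        39703
/part-00001<r 3>        396917259
/part-00039<r 3>        816408998
"""
print(skew_summary(parse_sizes(excerpt)))
```

Pasting each full listing into `parse_sizes` gives one summary per algorithm, so the two runs can be compared on the same scale.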

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: WRP.patch, WRP1.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
