[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shravan Matthur Narayanamurthy updated PIG-545:
-----------------------------------------------

    Attachment: WRP1.patch

Ran some tests, and this quantiles scheme seems to have the least deviation from a perfect distribution. Also, the time taken for L10 has reduced. It took 8 mins vs 7 mins for the old code, but it produces a good distribution, as shown below.

The patch also modifies MRCompiler to fix sorting on multiple fields with a different order for each column.

New algorithm:
{noformat}
/part-00000<r 3>   396866140
/part-00001<r 3>   388565356
/part-00002<r 3>   412419093
/part-00003<r 3>   404673062
/part-00004<r 3>   407805613
/part-00005<r 3>   399685590
/part-00006<r 3>   374470156
/part-00007<r 3>   407210410
/part-00008<r 3>   392022575
/part-00009<r 3>   403592598
/part-00010<r 3>   407005509
/part-00011<r 3>   392739807
/part-00012<r 3>   407132246
/part-00013<r 3>   393974442
/part-00014<r 3>   394310422
/part-00015<r 3>   397676923
/part-00016<r 3>   408960794
/part-00017<r 3>   407120924
/part-00018<r 3>   398555578
/part-00019<r 3>   398831802
/part-00020<r 3>   381319493
/part-00021<r 3>   397961816
/part-00022<r 3>   408716378
/part-00023<r 3>   401850651
/part-00024<r 3>   394624621
/part-00025<r 3>   411533286
/part-00026<r 3>   397598333
/part-00027<r 3>   402013011
/part-00028<r 3>   412664722
/part-00029<r 3>   390615865
/part-00030<r 3>   402257701
/part-00031<r 3>   404278892
/part-00032<r 3>   408376085
/part-00033<r 3>   403230193
/part-00034<r 3>   396062725
/part-00035<r 3>   403166437
/part-00036<r 3>   396123295
/part-00037<r 3>   400208557
/part-00038<r 3>   396028297
/part-00039<r 3>   428541846
{noformat}

Old algorithm:
{noformat}
/part-00000<r 3>   39703
/part-00001<r 3>   396917259
/part-00002<r 3>   388958263
/part-00003<r 3>   412109839
/part-00004<r 3>   405626251
/part-00005<r 3>   411808194
/part-00006<r 3>   385084639
/part-00007<r 3>   618796205
/part-00008<r 3>   59754649
/part-00009<r 3>   506719655
/part-00010<r 3>   403039137
/part-00011<r 3>   406540458
/part-00012<r 3>   395629722
/part-00013<r 3>   404795418
/part-00014<r 3>   394881722
/part-00015<r 3>   393959841
/part-00016<r 3>   398194260
/part-00017<r 3>   408370148
/part-00018<r 3>   334248039
/part-00019<r 3>   260118680
/part-00020<r 3>   642453106
/part-00021<r 3>   383168594
/part-00022<r 3>   364791108
/part-00023<r 3>   408601454
/part-00024<r 3>   404588449
/part-00025<r 3>   392940424
/part-00026<r 3>   413354408
/part-00027<r 3>   412538285
/part-00028<r 3>   385894942
/part-00029<r 3>   412674723
/part-00030<r 3>   392572446
/part-00031<r 3>   403012671
/part-00032<r 3>   398679596
/part-00033<r 3>   410864380
/part-00034<r 3>   405389743
/part-00035<r 3>   397248129
/part-00036<r 3>   401438264
/part-00037<r 3>   396456821
/part-00038<r 3>   402122621
/part-00039<r 3>   816408998
{noformat}

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: WRP.patch, WRP1.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an
> order by has skewed partitions. Some reduces finish in a few seconds while
> some run for 20 minutes. Getting a better distribution should lead to much
> better performance for order by.
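
One way to quantify "deviation from perfect distribution" for the two listings above is to compare each partition size with the mean partition size. The following is a minimal Python sketch and is not part of the patch. The helper names `parse_sizes` and `skew_summary` are hypothetical, and the sketch assumes each listing line has the form `/part-NNNNN<r N> <bytes>`.

```python
import re
from statistics import mean

# Assumed line format: "/part-00007<r 3>   618796205"
# (part file name, replication factor, size in bytes).
LINE_RE = re.compile(r"^/part-(\d+)<r \d+>\s+(\d+)$")


def parse_sizes(listing):
    """Return the byte counts found in a listing, in order of appearance."""
    sizes = []
    for line in listing.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            sizes.append(int(match.group(2)))
    return sizes


def skew_summary(sizes):
    """Summarize how far the partitions are from a perfectly even split.

    A perfectly even distribution gives max_over_mean == min_over_mean == 1.0.
    """
    avg = mean(sizes)
    return {
        "partitions": len(sizes),
        "mean": avg,
        "max_over_mean": max(sizes) / avg,
        "min_over_mean": min(sizes) / avg,
    }


if __name__ == "__main__":
    # Three lines copied from the old-algorithm listing, as a usage example.
    sample = """/part-00000<r 3>   39703
/part-00007<r 3>   618796205
/part-00039<r 3>   816408998"""
    print(skew_summary(parse_sizes(sample)))
```

Pasting each full listing into `parse_sizes` and comparing the two `max_over_mean` values gives a rough idea of how long the slowest reduce runs relative to the average, since the final reduce time is bounded by the largest partition.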