[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amir Youssefi updated PIG-460: ------------------------------ Attachment: (was: sampler2.patch) > PERFORMANCE: Order by done in 3 MR jobs, could be done in 2 > ------------------------------------------------------------ > > Key: PIG-460 > URL: https://issues.apache.org/jira/browse/PIG-460 > Project: Pig > Issue Type: Bug > Affects Versions: types_branch > Reporter: Alan Gates > Assignee: Amir Youssefi > Fix For: types_branch > > Attachments: sampler.patch > > > Currently order by is done in three MR jobs: > job 1: read data in whatever loader the user requests, store using BinStorage > job 2: load using RandomSampleLoader, find quantiles > job 3: load data again and sort > It is done this way because RandomSampleLoader extends BinStorage, and so > needs the data in that format to read it. > If the logic in RandomSampleLoader was made into an operator instead of being > in a loader then jobs 1 and 2 could be merged. On average job 1 takes about > 15% of the time of an order by script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.