[ https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-791.
----------------------------

    Resolution: Won't Fix

After some testing by Amir Youssefi, we determined that making this change actually makes performance worse. Changing RandomSampleLoader into an EvalFunc means that all records in the file have to be read and parsed. Because Hadoop efficiently supports skipping in the input stream, a loader can avoid most of that work, so reading and parsing every record is very expensive by comparison. Instead we will pursue making RandomSampleLoader subsume the user's loader to avoid requiring a third MR job (see PIG-820).

> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> When an ORDER BY is not the only operation in a Pig script, it is done in two
> additional MR jobs. The first job samples using a sampling loader, the second
> does the sort. The sample is used to construct a partitioner that evenly
> balances the data in the sort. The sampler needs to be changed to be an
> EvalFunc instead of a loader. This way a split can be put in the preceding
> MR job, with the main data being written out and the other part flowing to
> the sampler func, which can then write out the sample. The final MR job can
> then be the sort.
> This change depends on multiquery code.
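
To illustrate the sampling step described in the issue, here is a minimal sketch of how a sample of keys can be turned into a range partitioner that roughly balances a sort. This is not Pig's implementation: the function names, the in-memory lists, and the default sample size are all assumptions made for illustration only.

```python
import bisect
import random


def sample_keys(keys, sample_size, rng):
    """Draw a uniform random sample of keys, without replacement."""
    return rng.sample(keys, min(sample_size, len(keys)))


def partition_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced quantiles
    of the sorted sample. Returns [] if the sample is empty."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def assign_partition(key, boundaries):
    """Return the index of the range partition that key falls into."""
    return bisect.bisect_right(boundaries, key)


def range_partitioned_sort(keys, num_partitions, sample_size=100, seed=0):
    """Hypothetical stand-in for the sample job followed by the sort job:
    sample, build boundaries, route each key, then sort each partition."""
    rng = random.Random(seed)
    boundaries = partition_boundaries(
        sample_keys(keys, sample_size, rng), num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[assign_partition(key, boundaries)].append(key)
    # Each partition covers a disjoint key range, so sorting each one
    # independently and concatenating them yields a total order.
    return [sorted(p) for p in partitions]
```

Concatenating the returned partitions in order gives the fully sorted data. How even the partition sizes are depends on how representative the sample is, which is why the sampling step matters for balancing the sort.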