[ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399593#comment-13399593 ]
Jie Li commented on PIG-483:
----------------------------

As Dmitriy pointed out, we need to optimize this at runtime. For order-by, we can simply remove the sampling job. One problem is that PigStats has a copy of the job graph before the jobs run, so if we remove any job at runtime, we may need to update the job graph info in PigStats. Will that affect any external tools, like Ambrose?

> PERFORMANCE: different strategies for large and small order bys
> ---------------------------------------------------------------
>
>                 Key: PIG-483
>                 URL: https://issues.apache.org/jira/browse/PIG-483
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>              Labels: gsoc2011, performance
>
> Currently pig always does a multi-pass order by where it first determines a
> distribution for the keys and then orders in a second pass. This avoids the
> necessity of having a single reducer. However, in cases where the data is
> small enough to fit into a single reducer, this is inefficient. For small
> data sets it would be good to realize the small size of the set and do the
> order by in a single pass with a single reducer.
> This is a candidate project for Google summer of code 2011. More information
> about the program can be found at http://wiki.apache.org/pig/GSoc2011
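
To make the two strategies in the issue description concrete, here is a minimal sketch in plain Python rather than Pig or MapReduce code. It is not Pig's implementation. The threshold SMALL_INPUT_MAX_RECORDS, the function names, and the sample size are all assumptions made for illustration; the issue does not say how "small" should be decided or how the sampling job picks its boundaries.

```python
import bisect
import random

# Hypothetical threshold: the issue does not define what counts as "small".
SMALL_INPUT_MAX_RECORDS = 1000


def single_pass_order_by(records):
    """Sort everything in one 'reducer'; no sampling pass is needed."""
    return sorted(records)


def two_pass_order_by(records, num_reducers, sample_size=100, seed=0):
    """Pass 1 samples keys to pick range boundaries; pass 2 partitions and sorts."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Pick num_reducers - 1 boundaries at evenly spaced quantiles of the sample.
    boundaries = [sample[len(sample) * i // num_reducers]
                  for i in range(1, num_reducers)]
    partitions = [[] for _ in range(num_reducers)]
    for rec in records:
        # Route each record to the key range (simulated reducer) it falls into.
        partitions[bisect.bisect_right(boundaries, rec)].append(rec)
    # Each range is sorted independently; concatenating them is globally sorted.
    out = []
    for part in partitions:
        out.extend(sorted(part))
    return out


def order_by(records, num_reducers=4):
    """Skip the sampling pass when the input is small enough for one reducer."""
    if len(records) <= SMALL_INPUT_MAX_RECORDS:
        return single_pass_order_by(records)
    return two_pass_order_by(records, num_reducers)
```

Both paths return the same ordering; the only difference is whether the sampling pass runs. In this sketch the decision is made from the actual input size at call time, which is the analogue of making the choice at runtime as suggested in the comment.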