[ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426749#comment-13426749 ]
Bill Graham commented on PIG-483:
---------------------------------

Instead of modifying the job graph at runtime (in this case to swap in a {{SkipJob}}), a better approach would be to build the DAG with multiple possible paths of execution, connected by something like a new {{RuntimeDecision}} operator. {{RuntimeDecision}} decides which subpath should be executed and marks the other path as {{IGNORED}} or some other new state. Each subpath would then be connected back to something like a {{RuntimeDecisionSink}} operator, which knows how to pipe the data to the next operator.

The main advantage of this approach would be that we could support other types of non-skip optimizations like join selection. Also, the graph is built with all possible paths represented and is still immutable. This is similar to the approach Hive took AFAIK. The multiple paths would also make sense when viewed through Ambrose.

> PERFORMANCE: different strategies for large and small order bys
> ---------------------------------------------------------------
>
>                 Key: PIG-483
>                 URL: https://issues.apache.org/jira/browse/PIG-483
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>              Labels: gsoc2011, performance
>         Attachments: PIG-483.0.patch, PIG-483.1.patch
>
>
> Currently pig always does a multi-pass order by where it first determines a
> distribution for the keys and then orders in a second pass. This avoids the
> necessity of having a single reducer. However, in cases where the data is
> small enough to fit into a single reducer, this is inefficient. For small
> data sets it would be good to realize the small size of the set and do the
> order by in a single pass with a single reducer.
> This is a candidate project for Google summer of code 2011. More information
> about the program can be found at http://wiki.apache.org/pig/GSoc2011
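
To make the {{RuntimeDecision}} / {{RuntimeDecisionSink}} idea from the comment more concrete, here is a minimal, self-contained Java sketch. It is not Pig code and does not use Pig's plan or operator classes. Every name in it (State, SubPath, Run, SMALL_INPUT_THRESHOLD, decide, execute) is hypothetical, and sorting an in-memory list of integers stands in for the single-reducer and multi-pass order-by jobs. The point it tries to show is that both subpaths exist in the plan from the start, the plan itself is never modified, and only per-run state records which path ran and which was marked IGNORED.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class RuntimeDecisionSketch {
    /** Hypothetical per-run states; IGNORED is the new state suggested in the comment. */
    enum State { DONE, IGNORED }

    /** One candidate subpath of the plan (assumption: modelled as a plain function). */
    static final class SubPath {
        final String name;
        final UnaryOperator<List<Integer>> body;
        SubPath(String name, UnaryOperator<List<Integer>> body) {
            this.name = name;
            this.body = body;
        }
    }

    /** Per-run bookkeeping, kept outside the plan so the plan stays immutable. */
    static final class Run {
        final Map<String, State> states = new LinkedHashMap<>();
        List<Integer> output;
    }

    /** Small-input strategy: one pass, one "reducer". */
    static List<Integer> singlePass(List<Integer> in) {
        List<Integer> out = new ArrayList<>(in);
        Collections.sort(out);
        return out;
    }

    /** Large-input strategy: estimate a split key first, then sort two ranges. */
    static List<Integer> multiPass(List<Integer> in) {
        if (in.isEmpty()) return new ArrayList<>();
        // Pass 1: take a small sample and use its median as the range boundary.
        List<Integer> sample = new ArrayList<>(in.subList(0, Math.min(in.size(), 10)));
        Collections.sort(sample);
        int pivot = sample.get(sample.size() / 2);
        // Pass 2: range-partition into two "reducers", sort each, concatenate.
        List<Integer> low = new ArrayList<>();
        List<Integer> high = new ArrayList<>();
        for (int v : in) (v < pivot ? low : high).add(v);
        Collections.sort(low);
        Collections.sort(high);
        low.addAll(high);
        return low;
    }

    /** Hypothetical cutoff for "fits into a single reducer". */
    static final int SMALL_INPUT_THRESHOLD = 4;

    /** The plan: all possible paths are present up front and the list is unmodifiable. */
    static final List<SubPath> PATHS = Collections.unmodifiableList(Arrays.asList(
            new SubPath("single-pass", RuntimeDecisionSketch::singlePass),
            new SubPath("multi-pass", RuntimeDecisionSketch::multiPass)));

    /** RuntimeDecision: pick a subpath from information only known at runtime. */
    static SubPath decide(List<Integer> input) {
        return input.size() <= SMALL_INPUT_THRESHOLD ? PATHS.get(0) : PATHS.get(1);
    }

    /** Runs the chosen path, marks the other IGNORED, and acts as the sink by
     *  exposing a single output regardless of which path produced it. */
    static Run execute(List<Integer> input) {
        Run run = new Run();
        SubPath chosen = decide(input);
        for (SubPath p : PATHS) {
            if (p == chosen) {
                run.output = p.body.apply(input);
                run.states.put(p.name, State.DONE);
            } else {
                run.states.put(p.name, State.IGNORED);
            }
        }
        return run;
    }

    public static void main(String[] args) {
        Run small = execute(Arrays.asList(3, 1, 2));
        System.out.println(small.states + " -> " + small.output);
        Run large = execute(Arrays.asList(9, 4, 7, 1, 8, 2, 6));
        System.out.println(large.states + " -> " + large.output);
    }
}
```

Keeping the DONE/IGNORED marks in a per-run object rather than on the operators is one way to keep the graph immutable. A real implementation could equally store state elsewhere, and a join-selection decision would only need a different decide() and a different set of subpaths.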