[ https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743348#action_12743348 ]
Daniel Dai commented on PIG-922:
--------------------------------

Design for the push up projection rule:

Presumptions:

* Pruning columns at the loader saves record-parsing time.

    a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
    b = foreach a generate n1, n2;
  =>
    a = load 'a' as (n1:chararray, n2:chararray);

  We do not need to parse n3 in our loader.

* Pruning columns across a map-reduce boundary (between map-reduce jobs, or between map and reduce within a job) saves bandwidth.

    a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
    b = group a by n1;
    c = order b by n2;
    d = foreach c generate n2, n3;
  =>
    a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
    b = group a by n1;
    b1 = foreach b generate n2, n3;
    c = order b1 by n2;
    d = foreach c generate n2, n3;

* Pruning columns within a map-reduce boundary does not seem to be helpful.

    store a into 'a';
    b = filter a by n1 == '1';
    c = foreach b generate n2;
    dump c;
  =>
    store a into 'a';
    a1 = foreach a generate n1, n2;
    b = filter a1 by n1 == '1';
    c = foreach b generate n2;
    dump c;

  In this case an extra foreach step is processed, but we gain no benefit.

Algorithm description:

1. Divide all logical operators into two categories: those that create a map-reduce boundary and those that do not.
     boundary = true:  LOCoGroup, LOCross, LOJoin, LODistinct, LOSort
     boundary = false: LOFilter, LOForEach, LODefine, LOLoad, LOStore, LOSplit, LOSplitOutput, LOStream, LOUnion
   LOJoin may or may not be a boundary, depending on the type of join.

2. We collect required fields from the bottom up. A reverse dependency order walker is required to do this.

3. We do not actually start from the leaf. We start from the last LOForEach. Only LOForEach prunes columns. If there is no LOForEach in the script, we cannot prune anything.

4. Given the required output fields of an operator, we need an algorithm to figure out its required input fields.

                                         <= require $0, $2, $3
     b = foreach a generate $0, $2+$3;
                                         <= require $0, $1

5. From the bottom LOForEach, we collect required fields all the way up. When we move over a boundary operator, we save the position, because it is possible to put a projection there.

     ......
                        => projection here
     x = CoGroup .....
     ......
                        => projection here
     y = order ......

   Putting the projection right before the boundary makes sure less data crosses the boundary. However, we do not make this decision or do the actual pruning at this point; the actual pruning is done top down.

6. While traversing up, if we see an operator with more than one input, we trace required fields in all directions. We rely on the output schema of this operator to figure out which required fields belong to which input. If we see an operator with more than one output, we collect required fields until all outputs have been traced.

7. If we see LOStream or LOStore, we stop.

8. If we see LOLoad, we stop and set the required fields in LOLoad.

9. From LOLoad, we do a top-down traversal to decide whether we need to put a projection, and if so, insert a ForEach.

10. We only add a projection if it is necessary. It is only necessary when the boundary operator requires fewer fields than the operator before it outputs.

     Filter ......       (output fields: n1, n2, n3)
                        <= we can prune n3 here
     x = CoGroup ....    (required fields: n1, n2)

11. It is possible that we create a foreach that could be combined into the previous foreach; however, we do not handle this in the PushUpProject rule.

     ForEach ......
                        <= we will add a ForEach here anyway
     x = CoGroup .....

12. Every time we insert a LOForEach, we need to adjust the projection map all the way down.

13. To fit PushUpProject into the current optimizer framework, we hook the check rule to LOForEach. Every time, we start from a LOForEach and never push up over another LOForEach, so we stop at a LOForEach and save the required fields up to that point.
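As an illustration of step 4, here is a minimal Python sketch of mapping required output columns of a foreach back to required input columns. It is not Pig code: the function name `required_inputs` and the representation of each generate expression as the set of input positions it reads are assumptions made only for this example.

```python
from typing import List, Set


def required_inputs(generate_refs: List[Set[int]],
                    required_outputs: Set[int]) -> Set[int]:
    """Return the input positions needed to produce the required outputs.

    generate_refs[i] is the set of input positions ($n) read by the i-th
    generate expression (hypothetical representation, not Pig's own).
    """
    needed: Set[int] = set()
    for out_pos in required_outputs:
        needed |= generate_refs[out_pos]
    return needed


# b = foreach a generate $0, $2+$3;
# Output $0 reads input $0; output $1 reads inputs $2 and $3.
generate_refs = [{0}, {2, 3}]

# Requiring outputs $0 and $1 means requiring inputs $0, $2, $3.
print(sorted(required_inputs(generate_refs, {0, 1})))  # prints [0, 2, 3]
```

If only output $0 were required below the foreach, the same call with `{0}` would report that only input $0 is needed, so $2 and $3 could be pruned further up.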
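Steps 1, 5 and 8 to 10 can be sketched together for the simplest case: a single chain of operators between one LOLoad and the bottom LOForEach. The Python below is only a sketch under loud assumptions. The `Op` class, the `uses` field and `plan_projections` are hypothetical names, field names are assumed to pass through every operator unchanged (true for filter, order and distinct, not for cogroup), LOJoin is treated as always being a boundary, and multi-input or multi-output operators (step 6) are not handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Step 1 classification. Assumption: LOJoin is always a boundary here.
BOUNDARY = {"LOCoGroup", "LOCross", "LOJoin", "LODistinct", "LOSort"}


@dataclass
class Op:
    kind: str                                    # e.g. "LOFilter", "LOSort"
    uses: Set[str] = field(default_factory=set)  # fields the operator itself reads


def plan_projections(ops: List[Op], final_required: Set[str]
                     ) -> Tuple[Set[str], List[Tuple[int, List[str]]]]:
    """ops: operators between the load and the bottom foreach, top to bottom.

    Returns (fields the loader must produce,
             [(index, fields)] meaning: insert a projection before ops[index]).
    """
    # Bottom-up pass (step 5): accumulate required fields and remember
    # what is required right before each boundary operator.
    required = set(final_required)
    candidates: Dict[int, Set[str]] = {}
    for i in range(len(ops) - 1, -1, -1):
        required |= ops[i].uses
        if ops[i].kind in BOUNDARY:
            candidates[i] = set(required)

    # Step 8: whatever is still required at the top is set on the loader.
    load_required = set(required)

    # Top-down pass (steps 9 and 10): project only where the boundary
    # needs strictly fewer fields than are currently flowing into it.
    live = set(load_required)
    inserts: List[Tuple[int, List[str]]] = []
    for i in range(len(ops)):
        need = candidates.get(i)
        if need is not None and need < live:
            inserts.append((i, sorted(need)))
            live = need
    return load_required, inserts


# filter on n3, then order by n2, then a foreach that generates n1, n2
ops = [Op("LOFilter", {"n3"}), Op("LOSort", {"n2"})]
load_required, inserts = plan_projections(ops, {"n1", "n2"})
print(sorted(load_required))  # prints ['n1', 'n2', 'n3']
print(inserts)                # prints [(1, ['n1', 'n2'])]
```

In this example n3 is needed only by the filter, so the loader cannot drop it, but a projection right before the sort keeps it from crossing the boundary. If the filter read n1 instead, the loader itself would prune n3 and no projection would be inserted.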
> Logical optimizer: push up project
> ----------------------------------
>
>                 Key: PIG-922
>                 URL: https://issues.apache.org/jira/browse/PIG-922
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.4.0
>
>
> This is a continuation of
> [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add
> another rule to the logical optimizer: push up project, i.e., prune columns as
> early as possible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.