[ https://issues.apache.org/jira/browse/PIG-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695231#comment-16695231 ]
Koji Noguchi commented on PIG-5370:
-----------------------------------

Pig does create the correct result if we skip the ColumnMapKeyPrune optimization.

Running "explain D" with '-t ColumnMapKeyPrune' shows
{noformat}
|---D: (Name: LOUnion Schema: A#40:bag{tuple#41:tuple(a1#42:int,a2#43:chararray,a3#44:int)},a2#43:chararray,a3#44:int)
    |
    |---B: (Name: LOForEach Schema: A#36:bag{#37:tuple(a1#9:int,a2#*10*:chararray,a3#*11*:int)},a2#*10*:chararray,a3#*11*:int)
    |
    |---C: (Name: LOForEach Schema: A#17:bag{tuple#49:tuple(a1#50:int,a2#51:chararray,a3#52:int)},a2#22:chararray,a3#23:int)
{noformat}

This issue only happens when we have a relation like B, whose inner schema contains a field with the same uid as one at the root level. In the above example, these are uids {{\*10\*}} and {{\*11\*}}.

Before PIG-5312, the schema of the inner bag was set to null, so we didn't have this issue. With PIG-5312, and the way LOUnion determines the output UIDs based on input UIDs, two issues are happening.
# The schema of LOUnion uses the same uid for the inner bag and the outer level. (UID 43 & 44)
# ColumnMapKeyPrune is (incorrectly) determining that a2#22 & a3#23 are not being used and dropping them.

Reading {{DuplicateForEachColumnRewriteVisitor.java}}, "relation B using the same uid" is correct behavior since the fields are not at the same level. So I'm guessing the required fix would be in LOUnion.

> Union onschema + columnprune dropping used fields
> --------------------------------------------------
>
>                 Key: PIG-5370
>                 URL: https://issues.apache.org/jira/browse/PIG-5370
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>
> After PIG-5312, the query below started failing.
> {code}
> A = load 'input.txt' as (a1:int, a2:chararray, a3:int);
> B = FOREACH (GROUP A by (a1,a2)) {
>     A_FOREACH = FOREACH A GENERATE a2,a3;
>     GENERATE A, FLATTEN(A_FOREACH) as (a2,a3);
> }
> C = load 'input2.txt' as (A:bag{tuple:(a1: int,a2: chararray,a3:int)},a2: chararray,a3:int);
> D = UNION ONSCHEMA B, C;
> dump D;
> {code}
> {code:title=input1.txt}
> 1 a 3
> 2 b 4
> 2 c 5
> 1 a 6
> 2 b 7
> 1 c 8
> {code}
> {code:title=input2.txt}
> {(10,a0,30),(20,b0,40)} zzz 222
> {code}
> {noformat:title=Expected output}
> ({(10,a0,30),(20,b0,40)},zzz,222)
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}
> {noformat:title=Actual (incorrect) output}
> ({(10,a0,30),(20,b0,40)}) ****ONLY 1 Field ****
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
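
The uid overlap described in the comment can be checked by hand against the "explain D" schemas. The sketch below is only a toy model in Python, not Pig's own schema classes: {{SchemaField}}, {{relation}} and {{shared_uids}} are hypothetical names introduced for illustration, and the uids are copied from the explain output quoted in the comment. It reports which uids appear both at the root level and inside the nested bag, which is the condition the comment associates with the problem.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SchemaField:
    """Toy stand-in for one schema field; NOT a Pig class."""
    name: str
    uid: int
    inner: List["SchemaField"] = field(default_factory=list)  # nested bag/tuple fields


def nested_uids(f: SchemaField) -> Set[int]:
    """Collect the uids of every field nested anywhere below f."""
    found: Set[int] = set()
    for child in f.inner:
        found.add(child.uid)
        found |= nested_uids(child)
    return found


def shared_uids(schema: List[SchemaField]) -> Set[int]:
    """Return uids used both at the root level and inside a nested field."""
    root = {f.uid for f in schema}
    nested: Set[int] = set()
    for f in schema:
        nested |= nested_uids(f)
    return root & nested


def relation(bag_uid, tuple_uid, inner_uids, outer_uids) -> List[SchemaField]:
    """Build the shape A:bag{tuple(a1,a2,a3)},a2,a3 with the given uids."""
    inner = [SchemaField(n, u) for n, u in zip(("a1", "a2", "a3"), inner_uids)]
    bag = SchemaField("A", bag_uid, [SchemaField("tuple", tuple_uid, inner)])
    return [bag] + [SchemaField(n, u) for n, u in zip(("a2", "a3"), outer_uids)]


# uids copied from the "explain D" output in the comment
B = relation(36, 37, (9, 10, 11), (10, 11))
C = relation(17, 49, (50, 51, 52), (22, 23))
D = relation(40, 41, (42, 43, 44), (43, 44))

print(sorted(shared_uids(B)))  # [10, 11]
print(sorted(shared_uids(C)))  # []
print(sorted(shared_uids(D)))  # [43, 44]
```

In this model B and D share uids across levels while C does not, which matches the observation that the union output reuses uids 43 and 44 at both levels. The sketch does not reproduce how ColumnMapKeyPrune actually decides which columns to drop.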