[ https://issues.apache.org/jira/browse/PIG-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701435#comment-16701435 ]
Koji Noguchi edited comment on PIG-5370 at 11/28/18 6:22 AM:
-------------------------------------------------------------

I can think of two different approaches.

(i) Disallow overlapping uids even when they sit at different nesting levels, and force IdentityColumn. This way, all uids will be unique.
(ii) Change the LOUnion uidMapping logic from (output_uid, input_uid) lists to (output_uid, nested_uids).

Attaching a patch that tries (ii). If possible, I'd like to avoid (i), which is already creating more uids to keep track of.

Taking one relation as an example,
{noformat}
B: (Name: LOForEach Schema: A#36:bag{#37:tuple(a1#9:int,a2#*10*:chararray,a3#*11*:int)},a2#*10*:chararray,a3#*11*:int)
{noformat}
Before the patch, the input_uids 36,9,10,11,10,11 were used for uidMapping.
After the patch, it will use the nested_uids _36, _36_9, _36_10, _36_11, _10, _11.
This way, there won't be any incorrect list lookup.

[~daijy], would this approach work?
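
To illustrate why the nested_uids in approach (ii) avoid the incorrect lookup, here is a minimal sketch in Python. It is not Pig's Java code or the attached patch: the tuple-based schema model and the helper names `flat_uids` and `nested_uids` are assumptions made for illustration, and the bag's inner tuple level (#37) is left out because the uid lists in the comment do not include it.

```python
# Hypothetical model of relation B's schema from the comment (not Pig's classes).
# Each field is (alias, uid, children); the inner tuple level (#37) is omitted
# because the uid lists quoted in the comment skip it.
SCHEMA_B = [
    ("A", 36, [("a1", 9, []), ("a2", 10, []), ("a3", 11, [])]),
    ("a2", 10, []),
    ("a3", 11, []),
]


def flat_uids(fields):
    """Collect uids depth-first, ignoring nesting (the pre-patch view)."""
    out = []
    for _alias, uid, children in fields:
        out.append(uid)
        out.extend(flat_uids(children))
    return out


def nested_uids(fields, prefix=""):
    """Collect uids depth-first, prefixing each with its parents' uids."""
    out = []
    for _alias, uid, children in fields:
        key = f"{prefix}_{uid}"
        out.append(key)
        out.extend(nested_uids(children, key))
    return out


flat = flat_uids(SCHEMA_B)
nested = nested_uids(SCHEMA_B)

# In the flat list, uids 10 and 11 each appear twice, so a search by value
# for the top-level a2 (uid 10) stops at the copy nested inside bag A.
flat_hit = flat.index(10)

# With nested keys every entry is unique, so the top-level a2 is found
# at its own position.
nested_hit = nested.index("_10")
```

In this sketch the flat lookup lands on the field inside the bag, while the nested key identifies the top-level field unambiguously.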
> Union onschema + columnprune dropping used fields
> --------------------------------------------------
>
>                 Key: PIG-5370
>                 URL: https://issues.apache.org/jira/browse/PIG-5370
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>         Attachments: pig-5370-v1.patch
>
>
> After PIG-5312, below query started failing.
> {code}
> A = load 'input.txt' as (a1:int, a2:chararray, a3:int);
> B = FOREACH (GROUP A by (a1,a2)) {
>     A_FOREACH = FOREACH A GENERATE a2,a3;
>     GENERATE A, FLATTEN(A_FOREACH) as (a2,a3);
> }
> C = load 'input2.txt' as (A:bag{tuple:(a1: int,a2: chararray,a3:int)},a2: chararray,a3:int);
> D = UNION ONSCHEMA B, C;
> dump D;
> {code}
> {code:title=input1.txt}
> 1 a 3
> 2 b 4
> 2 c 5
> 1 a 6
> 2 b 7
> 1 c 8
> {code}
> {code:title=input2.txt}
> {(10,a0,30),(20,b0,40)} zzz 222
> {code}
> {noformat:title=Expected output}
> ({(10,a0,30),(20,b0,40)},zzz,222)
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}
> {noformat:title=Actual (incorrect) output}
> ({(10,a0,30),(20,b0,40)}) ****ONLY 1 Field ****
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)