[ 
https://issues.apache.org/jira/browse/PIG-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695231#comment-16695231
 ] 

Koji Noguchi commented on PIG-5370:
-----------------------------------

Pig does produce the correct result if we skip the ColumnMapKeyPrune optimization.

Running "explain D" with '-t ColumnMapKeyPrune' shows:
{noformat}
|---D: (Name: LOUnion Schema: A#40:bag{tuple#41:tuple(a1#42:int,a2#43:chararray,a3#44:int)},a2#43:chararray,a3#44:int)
    |
    |---B: (Name: LOForEach Schema: A#36:bag{#37:tuple(a1#9:int,a2#*10*:chararray,a3#*11*:int)},a2#*10*:chararray,a3#*11*:int)
    |
    |---C: (Name: LOForEach Schema: A#17:bag{tuple#49:tuple(a1#50:int,a2#51:chararray,a3#52:int)},a2#22:chararray,a3#23:int)
{noformat}

This issue only happens when we have a relation like B, whose inner schema 
contains a field with the same uid as one at the root level.  In the above 
example, those are uids {{\*10\*}} and {{\*11\*}}.

Before PIG-5312, the schema of the inner bag was set to null, so we didn't have 
this issue.
With PIG-5312, and the way LOUnion determines the output UIDs based on input 
UIDs, two issues are happening.
# The schema of LOUnion uses the same uid for a field inside the inner bag and 
a field at the root level (UIDs 43 & 44).
# ColumnMapKeyPrune (incorrectly) determines that a2#22 & a3#23 are not 
being used and drops them.

Reading {{DuplicateForEachColumnRewriteVisitor.java}}, "relation B using the 
same uid" is correct behavior since the fields are not at the same level.  So I'm 
guessing the required fix would be in the LOUnion.
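
To make the uid overlap in the explain output easier to see, here is a small 
standalone sketch. It is a toy model only: the {{Field}} class and the 
{{sharedUids}} and {{schema}} helpers are hypothetical names, not Pig's 
LogicalSchema API, and the code does not reproduce what ColumnMapKeyPrune or 
LOUnion actually do. It just reports which uids appear both at the root level 
and inside a nested schema, using the uids from the plan above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical classes, NOT Pig's LogicalSchema API.
public class UidCollisionCheck {

    // A field with a name, a uid, and (for bags/tuples) nested fields.
    public static final class Field {
        final String name;
        final long uid;
        final List<Field> inner;

        Field(String name, long uid, Field... inner) {
            this.name = name;
            this.uid = uid;
            this.inner = Arrays.asList(inner);
        }
    }

    // Collect the uids of the given fields and of everything nested below them.
    static void collect(List<Field> fields, Set<Long> out) {
        for (Field f : fields) {
            out.add(f.uid);
            collect(f.inner, out);
        }
    }

    // Return the root-level uids that also appear inside a nested schema.
    public static Set<Long> sharedUids(List<Field> root) {
        Set<Long> nested = new TreeSet<>();
        for (Field f : root) {
            collect(f.inner, nested);
        }
        Set<Long> shared = new TreeSet<>();
        for (Field f : root) {
            if (nested.contains(f.uid)) {
                shared.add(f.uid);
            }
        }
        return shared;
    }

    // Build a schema shaped like A:bag{tuple(a1,a2,a3)},a2,a3 with the given uids.
    public static List<Field> schema(long bag, long tuple, long a1, long a2, long a3,
                                     long rootA2, long rootA3) {
        Field t = new Field("tuple", tuple,
                new Field("a1", a1), new Field("a2", a2), new Field("a3", a3));
        return Arrays.asList(new Field("A", bag, t),
                new Field("a2", rootA2), new Field("a3", rootA3));
    }

    public static void main(String[] args) {
        // uids copied from the explain output for B, C and D
        System.out.println("B: " + sharedUids(schema(36, 37, 9, 10, 11, 10, 11)));
        System.out.println("C: " + sharedUids(schema(17, 49, 50, 51, 52, 22, 23)));
        System.out.println("D: " + sharedUids(schema(40, 41, 42, 43, 44, 43, 44)));
    }
}
```

With the uids above, B and D each report two shared uids (10 and 11 for B, 43 
and 44 for D), while C reports none.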

> Union onschema + columnprune dropping used fields 
> --------------------------------------------------
>
>                 Key: PIG-5370
>                 URL: https://issues.apache.org/jira/browse/PIG-5370
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>
> After PIG-5312, the query below started failing.
> {code}
> A = load 'input1.txt' as (a1:int, a2:chararray, a3:int);
> B = FOREACH (GROUP A by (a1,a2)) {
>     A_FOREACH = FOREACH A GENERATE a2,a3;
>     GENERATE A, FLATTEN(A_FOREACH) as (a2,a3);
> }
> C = load 'input2.txt' as (A:bag{tuple:(a1: int,a2: chararray,a3:int)},a2: 
> chararray,a3:int);
> D = UNION ONSCHEMA B, C;
> dump D;
> {code}
> {code:title=input1.txt}
> 1       a       3
> 2       b       4
> 2       c       5
> 1       a       6
> 2       b       7
> 1       c       8
> {code}
> {code:title=input2.txt}
> {(10,a0,30),(20,b0,40)} zzz     222
> {code}
> {noformat:title=Expected output}
> ({(10,a0,30),(20,b0,40)},zzz,222)
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}
> {noformat:title=Actual (incorrect) output}
> ({(10,a0,30),(20,b0,40)})    ****ONLY 1 Field ****
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
