[jira] [Comment Edited] (PIG-5370) Union onschema + columnprune dropping used fields

Koji Noguchi (JIRA) Tue, 27 Nov 2018 22:25:03 -0800


    [ https://issues.apache.org/jira/browse/PIG-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701435#comment-16701435 ]


Koji Noguchi edited comment on PIG-5370 at 11/28/18 6:22 AM:
-------------------------------------------------------------

I can think of two different approaches.

(i) Disallow overlapping uids even when they are on different nesting levels,
 and force IdentityColumn. This way, all uids will be unique.

(ii) Change LOUnion uidMapping logic from (output_uid, input_uid) lists to
 (output_uid, nested_uids).

Attaching a patch that tries (ii). If possible, I'd like to avoid (i), which is
 already creating more uids to keep track of.

Taking one relation as an example:
{noformat}
B: (Name: LOForEach Schema: 
A#36:bag{#37:tuple(a1#9:int,a2#*10*:chararray,a3#*11*:int)},a2#*10*:chararray,a3#*11*:int)
{noformat}
Before the patch, the input_uids
 36,9,10,11,10,11
were used for uidMapping.

After the patch, it'll use nested_uids,
 _36, _36_9, _36_10, _36_11, _10, _11

This way, uids 10 and 11, which appear both inside the bag and at the top
level, map to distinct keys, so there won't be any incorrect list lookup.
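
To make the difference concrete, here is a minimal Python sketch of the two
uid lists for relation B above. It is not Pig's LOUnion code: the
(uid, children) schema encoding and the helper names flat_uids and
nested_uid_keys are hypothetical, and the bag's inner tuple (#37) is left out
to match the lists quoted above.

```python
# Illustrative sketch only; not Pig's actual LOUnion implementation.
# Each field is (uid, children). The bag's inner tuple (#37) is omitted
# so that the output matches the uid lists quoted in the comment.
B_SCHEMA = [
    (36, [(9, []), (10, []), (11, [])]),  # A#36:bag{tuple(a1#9,a2#10,a3#11)}
    (10, []),                             # a2#10
    (11, []),                             # a3#11
]


def flat_uids(fields):
    """Pre-patch style: one bare uid per field, nesting ignored."""
    out = []
    for uid, children in fields:
        out.append(uid)
        out.extend(flat_uids(children))
    return out


def nested_uid_keys(fields, prefix=""):
    """Post-patch style: each uid is prefixed with its ancestors' uids."""
    out = []
    for uid, children in fields:
        key = f"{prefix}_{uid}"
        out.append(key)
        out.extend(nested_uid_keys(children, key))
    return out


flat = flat_uids(B_SCHEMA)
nested = nested_uid_keys(B_SCHEMA)

# The top-level a2 sits at position 4 in both lists.
flat_hit = flat.index(10)          # finds the a2 nested inside the bag first
nested_hit = nested.index("_10")   # finds the top-level a2

print(flat, flat_hit)
print(nested, nested_hit)
```

With bare uids, a first-match lookup for 10 stops at the copy inside the bag;
with the prefixed keys, the top-level field and the nested field no longer
share an entry.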

 [~daijy], would this approach work? 




> Union onschema + columnprune dropping used fields 
> --------------------------------------------------
>
>                 Key: PIG-5370
>                 URL: https://issues.apache.org/jira/browse/PIG-5370
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>         Attachments: pig-5370-v1.patch
>
>
> After PIG-5312, the query below started failing.
> {code}
> A = load 'input.txt' as (a1:int, a2:chararray, a3:int);
> B = FOREACH (GROUP A by (a1,a2)) {
>     A_FOREACH = FOREACH A GENERATE a2,a3;
>     GENERATE A, FLATTEN(A_FOREACH) as (a2,a3);
> }
> C = load 'input2.txt' as (A:bag{tuple:(a1: int,a2: chararray,a3:int)},a2: chararray,a3:int);
> D = UNION ONSCHEMA B, C;
> dump D;
> {code}
> {code:title=input.txt}
> 1       a       3
> 2       b       4
> 2       c       5
> 1       a       6
> 2       b       7
> 1       c       8
> {code}
> {code:title=input2.txt}
> {(10,a0,30),(20,b0,40)} zzz     222
> {code}
> {noformat:title=Expected output}
> ({(10,a0,30),(20,b0,40)},zzz,222)
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}
> {noformat:title=Actual (incorrect) output}
> ({(10,a0,30),(20,b0,40)})    ****ONLY 1 Field ****
> ({(1,a,6),(1,a,3)},a,6)
> ({(1,a,6),(1,a,3)},a,3)
> ({(1,c,8)},c,8)
> ({(2,b,7),(2,b,4)},b,7)
> ({(2,b,7),(2,b,4)},b,4)
> ({(2,c,5)},c,5)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
