[ https://issues.apache.org/jira/browse/PIG-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183778#comment-13183778 ]

David Wahler commented on PIG-2465:
-----------------------------------
Thanks for taking a look at this. In the script that led me to discover the
problem, A is the result of a cogroup operation, so I've worked around the
problem by using JOINs instead.

The bug is present in 0.8.1, as far as I can tell. I originally discovered it
while using Cloudera's pig-0.8.1-cdh3u2, but the official release shows the
same behavior. I'm attaching output from 0.8.1, including the output of
EXPLAIN.

After poking around in the code a bit, it looks like this can be fixed by
always generating new uids in LOUnion.getSchema(). I don't understand the
planner deeply enough to know whether that's a good idea, but it fixes the bug
without breaking the core unit tests. I'd be happy to provide a patch if this
sounds reasonable.
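
As a rough illustration of the symptom and of why fresh uids would avoid it,
here is a toy model in Python. It is not Pig's planner code: the functions
resolve(), group_by_all() and fresh_uid() are hypothetical, and the model
simply assumes that GROUP BY * locates each key column by looking up its uid
in the schema and taking the first match. Under that assumption, a union
schema in which both columns carry uid 37 (as in the reported plan) yields
($0,$0) as the key, while a schema with a new uid per column yields ($0,$1).

```python
from collections import OrderedDict
from itertools import count

# Toy model only: none of these names come from Pig's source code.
_next_uid = count(1)


def fresh_uid():
    """Hand out a uid that has not been used before."""
    return next(_next_uid)


def resolve(schema, uid):
    """Return the position of the first column carrying this uid."""
    for pos, (_name, col_uid) in enumerate(schema):
        if col_uid == uid:
            return pos
    raise KeyError(uid)


def group_by_all(schema, rows):
    """Model of GROUP ... BY *: each key column is looked up by uid (assumed)."""
    positions = [resolve(schema, uid) for _name, uid in schema]
    groups = OrderedDict()
    for row in rows:
        key = tuple(row[p] for p in positions)
        groups.setdefault(key, []).append(row)
    return groups


# Contents of C from the issue description.
rows = [(1, 2), (2, 1), (1, 3), (3, 1)]

# Union schema as printed in the reported plan: both columns share uid 37.
conflicting = [("x::x", 37), ("y::y", 37)]

# Union schema with a new uid per output column (the idea behind the fix).
regenerated = [(name, fresh_uid()) for name, _uid in conflicting]

bad = group_by_all(conflicting, rows)
good = group_by_all(regenerated, rows)

for key, bag in bad.items():
    print("conflicting uids:", key, bag)
for key, bag in good.items():
    print("fresh uids:      ", key, bag)
```

With the conflicting schema the model reproduces the keys (1,1), (2,2) and
(3,3) from the report. With regenerated uids each of the four tuples becomes
its own group.
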
> FLATTEN, reorder columns, UNION causes uid conflict
> ---------------------------------------------------
>
> Key: PIG-2465
> URL: https://issues.apache.org/jira/browse/PIG-2465
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.1, 0.10
> Reporter: David Wahler
> Assignee: Daniel Dai
>
> This is a regression in the new logical plan that causes incorrect results in
> 0.8/0.9, and a fatal "duplicate uid in schema" error on trunk. The following
> script demonstrates the problem (extracted and simplified from a much larger
> script):
> {code}A = LOAD 'bug.in' AS (x:{t:(x:int)}, y:{t:(y:int)});
> B1 = FOREACH A GENERATE FLATTEN(x),FLATTEN(y);
> B2 = FOREACH A GENERATE FLATTEN(y),FLATTEN(x);
> C = UNION B1, B2;
> D = GROUP C BY *;{code}
> Input data:
> {code}{(1)} {(2)}
> {(1)} {(3)}{code}
> C contains the correct data:
> {code}(1,2)
> (2,1)
> (1,3)
> (3,1){code}
> D should use the entire tuple as the group key (making it essentially a
> DISTINCT) but instead the output is:
> {code}((1,1),{(1,2),(1,3)})
> ((2,2),{(2,1)})
> ((3,3),{(3,1)}){code}
> The GROUP operation is using ($0,$0) as the key instead of ($0,$1). The
> logical plan includes the line: {{C: (Name: LOUnion Schema:
> x::x#37:int,y::y#37:int)}}. Switching to the old logical plan produces the
> correct output in 0.8, but apparently this is no longer possible in 0.9 and
> later versions.