[ https://issues.apache.org/jira/browse/PIG-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy resolved PIG-4789.
-------------------------------------
Resolution: Fixed
Fix Version/s: 0.16.0
> Pig on TEZ creates wrong result with replicated join
> ----------------------------------------------------
>
> Key: PIG-4789
> URL: https://issues.apache.org/jira/browse/PIG-4789
> Project: Pig
> Issue Type: Bug
> Components: tez
> Affects Versions: 0.15.0
> Reporter: Michael Prim
> Priority: Critical
> Fix For: 0.16.0
>
> Attachments: tez_bug.pig, tez_bug_input1.csv, tez_bug_input2.csv,
> tez_bug_input3.csv
>
>
> Below is a minimal Pig script that combines SPLIT with replicated joins and
> produces different output depending on whether MapReduce or TEZ is used as
> the execution engine. The attachments also contain the sample input data.
> The expected output, as produced by the MapReduce engine, is:
> {code}
> (id1,123,A,)
> (id2,234,,B)
> (id3,456,,)
> (id4,567,A,)
> {code}
> whereas TEZ produces:
> {code}
> (id1,123,A,A)
> (id2,234,B,B)
> (id3,456,,)
> (id4,567,A,A)
> {code}
> Removing the {{USING 'replicated'}} clause and using a regular join yields
> correct results. I am not sure whether this is a Pig issue or a TEZ issue.
> However, because this issue can silently lead to data corruption, I rated it
> critical. So far, searching has not turned up a similar bug report or anyone
> who is aware of it.
> {code}
> classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',')
>     AS (classid:chararray, class:chararray);
> data = LOAD '/tez_bug_input2.csv' USING PigStorage(',')
>     AS (eventid:chararray, classid:chararray);
> basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',')
>     AS (eventid:chararray, foo:int);
> dataJclassdata = JOIN classdata BY classid, data BY classid;
> SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
> dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid
>     USING 'replicated';
> dataA = FOREACH dataA GENERATE basedata::eventid AS eventid,
>     basedata::foo AS foo,
>     classA::classdata::class AS classA;
> dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid
>     USING 'replicated';
> dataB = FOREACH dataB GENERATE dataA::eventid AS eventid,
>     dataA::foo AS foo,
>     dataA::classA AS classA,
>     classB::classdata::class AS classB;
> DUMP dataB;
> {code}
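
To make the expected semantics of the script quoted above concrete, here is a small Python model of its joins and split. It is only a sketch and not Pig or TEZ code. The attached CSV files are not reproduced in this message, so the input rows below are an assumption, chosen so that the model reproduces the MapReduce output quoted above. The helper names (`inner_join`, `left_outer_join`, `run`) are hypothetical.

```python
# Hypothetical inputs: the real attachments (tez_bug_input*.csv) are not
# included in this message; these rows are assumed for illustration only.
classdata = [("c1", "A"), ("c2", "B")]                 # (classid, class)
data = [("id1", "c1"), ("id2", "c2"), ("id4", "c1")]   # (eventid, classid)
basedata = [("id1", 123), ("id2", 234),
            ("id3", 456), ("id4", 567)]                # (eventid, foo)


def inner_join(left, right, left_key, right_key):
    """Inner join two lists of tuples, concatenating matching rows."""
    return [l + r for l in left for r in right if l[left_key] == r[right_key]]


def left_outer_join(left, right, left_key, right_key, right_width):
    """Left outer join; unmatched left rows are padded with None."""
    out = []
    for l in left:
        matches = [r for r in right if r[right_key] == l[left_key]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None,) * right_width)
    return out


def run():
    # dataJclassdata columns: classid, class, eventid, classid
    joined = inner_join(classdata, data, 0, 1)

    # SPLIT: each branch is an independent filter over the same relation.
    class_a = [r for r in joined if r[1] == "A"]
    class_b = [r for r in joined if r[1] == "B"]

    # dataA: basedata LEFT OUTER JOIN classA on eventid, then project.
    data_a = [(r[0], r[1], r[3])
              for r in left_outer_join(basedata, class_a, 0, 2, 4)]

    # dataB: dataA LEFT OUTER JOIN classB on eventid, then project.
    return [(r[0], r[1], r[2], r[4])
            for r in left_outer_join(data_a, class_b, 0, 2, 4)]


if __name__ == "__main__":
    for row in run():
        print(row)
```

Under these assumed inputs, each event carries a class value in at most one of the two class columns, which is the shape of the MapReduce result. The TEZ result quoted above instead shows the same value in both columns.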