Diamond splitter does not generate correct results when using Multi-query 
optimization
--------------------------------------------------------------------------------------

                 Key: PIG-1252
                 URL: https://issues.apache.org/jira/browse/PIG-1252
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Viraj Bhat
             Fix For: 0.7.0


I have script which uses split but somehow does not use one of the split 
branch. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
                               orderedData = ORDER trueDataTmp BY col1,col2;
                               GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
                              }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with no-Multiquery (-M) option I get the right result.  
This could be the result of complex BinCond's in the POLoad. We can get rid of 
this error by using  FILTER instead of SPIT.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to