[ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1252:
------------------------------

    Attachment: PIG-1252.patch

The incorrect results occur when the diamond query optimizer merges a job that 
has secondary key optimization applied. This patch disallows such merges.

In practice, users should weigh the performance trade-off between multiquery 
optimization and secondary key optimization. The secondary key optimizer 
currently runs before the multiquery optimizer, and with this patch the 
multiquery optimizer no longer merges any job that has secondary key 
optimization.

To disable multiquery optimization, use option -M. To disable secondary key 
optimization, use option -Dpig.exec.nosecondarykey=true.
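
For example, the two options might be passed on the command line as sketched 
below. The script name myscript.pig is a placeholder and does not come from the 
report. Placing the -D property ahead of the script name is an assumption about 
argument order.

{code}
# Run with multiquery optimization disabled (script name is hypothetical)
pig -M myscript.pig

# Run with secondary key optimization disabled (script name is hypothetical)
pig -Dpig.exec.nosecondarykey=true myscript.pig
{code}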

> Diamond splitter does not generate correct results when using Multi-query 
> optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>
> I have a script which uses SPLIT but does not use one of the split branches. 
> The skeleton of the script is as follows:
> {code}
> loadData = load '/user/viraj/zebradata' using
>     org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
> prjData = FOREACH loadData GENERATE
>     (chararray) col1,
>     (chararray) col2,
>     (chararray) col3,
>     (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond,
>     (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO
>     trueDataTmp IF (validRec == '1' AND splitcond != ''),
>     falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>     orderedData = ORDER trueDataTmp BY col1,col2;
>     GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
> }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is never used.
> When I run this script with the no-multiquery (-M) option I get the right 
> result. This could be the result of complex BinConds in the POLoad. We can 
> avoid this error by using FILTER instead of SPLIT.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.