[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1252:
------------------------------

    Attachment: PIG-1252.patch

This is the result of the diamond query optimizer merging a job that has secondary key optimization. This patch disallows such a merge.

In practice, users should consider the performance trade-off between using multiquery optimization and using secondary key optimization. Right now the secondary key optimizer runs before the multiquery optimizer, which no longer merges any job that has secondary key optimization.

To disable multiquery optimization, use option -M. To disable secondary key optimization, use option -Dpig.exec.nosecondarykey=true.

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>
> I have a script which uses split but somehow does not use one of the split branches. The skeleton of the script is as follows:
> {code}
> loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
> prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>     orderedData = ORDER trueDataTmp BY col1,col2;
>     GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
> }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with the no-multiquery (-M) option I get the right result.
> This could be the result of complex BinCond's in the POLoad. We can get rid of this error by using FILTER instead of SPLIT.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
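
The FILTER workaround mentioned in the description might look roughly like the sketch below. It reuses the relation names from the reporter's script and assumes the two SPLIT conditions carry over unchanged into FILTER ... BY; it is an illustration only and has not been run against the reporter's data or UDFs.

```pig
-- Sketch of the workaround: one FILTER per SPLIT branch, same conditions.
-- prjData, validRec and splitcond are the names from the reporter's script.
trueDataTmp  = FILTER prjData BY (validRec == '1' AND splitcond != '');
falseDataTmp = FILTER prjData BY (validRec == '1' AND splitcond == '');

-- The rest of the script (GROUP trueDataTmp BY splitcond, ...) stays as it was.
```

Since falseDataTmp is never consumed in the reported script, the second statement is shown only to mirror the original SPLIT.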