-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/519/#review346
-----------------------------------------------------------

http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment703>

    Do you mean why I enumerate entries instead of keys?

I will make changes to address all the other comments.

- Daniel


On 2011-03-21 16:16:23, Daniel Dai wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/519/
> -----------------------------------------------------------
>
> (Updated 2011-03-21 16:16:23)
>
>
> Review request for pig and thejas.
>
>
> Summary
> -------
>
> I have a small demonstration script (actually, a directory with one main
> script and several other scripts that it calls) where the output (STOREd to a
> file) is not consistent between runs. I will paste the files below this
> message, and I can also email the tarball to anybody who would like it; I
> wanted to just upload the tarball but I don't see a way to do that.
>
> The problem appears to be that when a dataset X gets LOADed twice, with
> things other than LOADs occurring between the loads (like a FOREACH
> GENERATE), a FOREACH GENERATE that is later performed on X doesn't always
> choose the correct columns. The correctness of the output was highly variable
> on my computer, for one of my co-workers it almost always failed, and for two
> other of my co-workers they didn't see any failures, so it's likely to be a
> race condition or something like that.
>
> -- FILES FOR REPLICATING THE PROBLEM
> -- I will paste the name of the file as a comment, with the content of the
> file beneath it.
> -- I will put the contents of the following files:
> -- 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and
> load_raw_data.pig)
> -- 2) The input data file (data.csv)
> -- 3) The correct output file (correct_output.csv)
> -- 4) The shell script that runs the pig files and compares their output to
> what it should be
> -- 5) README
>
> -- main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(','); -- this is wrong sometimes
>
> -- calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
>
> -- calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
>
> -- load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
>     AS (
>         x,
>         y,
>         w
>     );
>
> -- data.csv
> x1,CORRECT ANSWER,21148.59
> x2,CORRECT OUTPUT,27219.98
> x3,RIGHT ANSWER,10818.15
>
> -- correct_output.csv
> x1,CORRECT ANSWER
> x2,CORRECT OUTPUT
> x3,RIGHT ANSWER
>
> -- testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
>     rm -rf output/*
>     pig -x local -d WARN -e "set debug off;run main.pig" || break
>     diff correct_output.csv output/Y/part-m-00000 && echo good
>     results[$i]=$?
>     i=$((i+1))
> done;
> echo ${results[*]}
>
> -- README
>
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the output is right, but sometimes it's wrong for no apparent reason.
>
> The scripts and dataset included in this directory demonstrate the
> issue. The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs. In particular, this file SHOULD just be
> the first and second columns of data.csv, but it sometimes uses the
> third column in place of the second.
>
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
>
> * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in
> main.pig
> * making a copy of data.csv called data2.csv, and a file
> load_daw_data2.pig
> that loads data2.csv and having calc_x_W.pig use that instead.
>
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
>
> I have a shell script testmany.sh which runs my script multiple times
> and reports the runs for which the output agreed with the file
> correct_output.csv.
>
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0. On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs. Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computers never actually observed the wrong output. We suspect
> this is likely a race condition.
>
> Thanks!
>
>
> This addresses bug PIG-1912.
> https://issues.apache.org/jira/browse/PIG-1912
>
>
> Diffs
> -----
>
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1083943
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java 1083943
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1083943
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java PRE-CREATION
>   http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1083943
>
> Diff: https://reviews.apache.org/r/519/diff
>
>
> Testing
> -------
>
> Test-patch:
> [exec] +1 overall.
> [exec]
> [exec] +1 @author. The patch does not contain any @author tags.
> [exec]
> [exec] +1 tests included. The patch appears to include 3 new or
> modified tests.
> [exec]
> [exec] +1 javadoc. The javadoc tool did not generate any warning
> messages.
> [exec]
> [exec] +1 javac. The applied patch does not increase the total
> number of javac compiler warnings.
> [exec]
> [exec] +1 findbugs. The patch does not introduce any new Findbugs
> warnings.
> [exec]
> [exec] +1 release audit. The applied patch does not increase the
> total number of release audit warnings.
>
> Unit test:
> all pass
>
> End-to-end test:
> all pass
>
>
> Thanks,
>
> Daniel
>
>
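
For readers following the entries-versus-keys question, here is a minimal sketch of the general idea. It is not the code under review. It assumes, going only by the class name, that the validator's job is to make sure no two load/store functions end up sharing a signature. The class DupSignatureSketch, the method makeUnique, the alias strings, and the "_N" suffix scheme are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the code in
// LoadStoreFuncDupSignatureValidator and uses no Pig classes.
public class DupSignatureSketch {

    // Takes a map of operator alias -> signature and returns a map in which
    // every signature used by more than one alias gets a numeric suffix.
    // Assumption: a suffixed name never collides with an existing signature;
    // a real implementation would have to check for that.
    static Map<String, String> makeUnique(Map<String, String> aliasToSignature) {
        // signature -> aliases that currently use it, in encounter order
        Map<String, List<String>> bySignature = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : aliasToSignature.entrySet()) {
            bySignature.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                       .add(e.getKey());
        }

        Map<String, String> result = new LinkedHashMap<>();
        // Enumerating entries yields the key and its value together, so no
        // second lookup per key is needed (as there would be with keySet()).
        for (Map.Entry<String, List<String>> e : bySignature.entrySet()) {
            List<String> aliases = e.getValue();
            for (int i = 0; i < aliases.size(); i++) {
                String signature = aliases.size() == 1
                        ? e.getKey()
                        : e.getKey() + "_" + i;
                result.put(aliases.get(i), signature);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two loads of the same file with the same loader, as in the report.
        Map<String, String> loads = new LinkedHashMap<>();
        loads.put("raw_data (first RUN)", "data.csv_PigStorage");
        loads.put("raw_data (second RUN)", "data.csv_PigStorage");
        System.out.println(makeUnique(loads));
    }
}
```

Enumerating entries versus keys gives the same result here; the entry form just reads the key and the value in one step.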
