Re: Review Request: non-deterministic output when a file is loaded multiple times

Daniel Dai Mon, 21 Mar 2011 18:00:51 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/519/#review346
-----------------------------------------------------------




http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment703>

    You mean why I enumerate entry instead of keys?


I will make change to address all other comments.

- Daniel


On 2011-03-21 16:16:23, Daniel Dai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/519/
> -----------------------------------------------------------
> 
> (Updated 2011-03-21 16:16:23)
> 
> 
> Review request for pig and thejas.
> 
> 
> Summary
> -------
> 
> I have a small demonstration script (actually, a directory with one main 
> script and several other scripts that it calls) where the output (STOREd to a 
> file) is not consistent between runs. I will paste the files below this 
> message, and I can also email the tarball to anybody who would like it; I 
> wanted to just upload the tarball but I don't see a way to do that.
> 
> The problem appears to be that when a dataset X gets LOADed twice, with 
> things other than LOADs occurring between the loads (like a FOREACH 
> GENERATE), a FOREACH GENERATE that is later performed on X doesn't always 
> choose the correct columns. The correctness of the output was highly variable 
> on my computer, for one of my co-workers it almost always failed, and for two 
> other of my co-workers they didn't see any failures, so it's likely to be a 
> race condition or something like that.
> 
> – FILES FOR REPLICATING THE PROBLEM
> – I will paste the name of the file as a comment, with the content of the 
> file beneath it.
> – I will put the contents of the following files:
> – 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and 
> load_raw_data.pig)
> – 2) The input data file (data.csv)
> – 3) The correct output file (correct_output.csv)
> – 4) The shell script that runs the pig files and compares their output to 
> what it should be
> – 5) README
> 
> – main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(','); – this is wrong sometimes
> 
> – calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
> 
> – calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
> 
> – load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
> AS (
> x,
> y,
> w
> );
> 
> – data.csv
> x1,CORRECT ANSWER,21148.59
> x2,CORRECT OUTPUT,27219.98
> x3,RIGHT ANSWER,10818.15
> 
> – correct_output.csv
> x1,CORRECT ANSWER
> x2,CORRECT OUTPUT
> x3,RIGHT ANSWER
> 
> – testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
> rm -rf output/*
> pig -x local -d WARN -e "set debug off;run main.pig" || break
> diff correct_output.csv output/Y/part-m-00000 && echo good
> results[$i]=$?
> i=$((i+1))
> done;
> echo ${results[*]}
> 
> – README
> 
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the input is right, but sometimes it's wrong for no apparent reason.
> 
> The scripts and dataset included in this directory demonstrate the
> issue. The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs. In particular, this file SHOULD just be
> the first and third columns of data.csv, but it sometimes uses the
> second column in place of the third.
> 
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
> 
>     * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in 
> main.pig
>     * making a copy of data.csv called data2.csv, and a file 
> load_daw_data2.pig
>       that loads data2.csv and having having calc_x_W.pig use that instead.
> 
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
> 
> I have a shell script testmany.sh which runs my script multiple times
> and reports for which runs the output agrreed with the file 
> correct_output.csv.
> 
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0. On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs. Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computer never actually observed the wrong output. We suspect
> this is likely a race condition.
> 
> Thanks!
> 
> 
> This addresses bug PIG-1912.
>     https://issues.apache.org/jira/browse/PIG-1912
> 
> 
> Diffs
> -----
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java
>  1083943 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java
>  1083943 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java
>  1083943 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestEvalPipeline2.java
>  1083943 
> 
> Diff: https://reviews.apache.org/r/519/diff
> 
> 
> Testing
> -------
> 
> Test-patch:
>      [exec] +1 overall.  
>      [exec] 
>      [exec]     +1 @author.  The patch does not contain any @author tags.
>      [exec] 
>      [exec]     +1 tests included.  The patch appears to include 3 new or 
> modified tests.
>      [exec] 
>      [exec]     +1 javadoc.  The javadoc tool did not generate any warning 
> messages.
>      [exec] 
>      [exec]     +1 javac.  The applied patch does not increase the total 
> number of javac compiler warnings.
>      [exec] 
>      [exec]     +1 findbugs.  The patch does not introduce any new Findbugs 
> warnings.
>      [exec] 
>      [exec]     +1 release audit.  The applied patch does not increase the 
> total number of release audit warnings.
> 
> Unit test:
>     all pass
> 
> End-to-end test:
>     all pass
> 
> 
> Thanks,
> 
> Daniel
> 
>

Re: Review Request: non-deterministic output when a file is loaded multiple times

Reply via email to