[
https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647866#comment-13647866
]
Daniel Dai commented on PIG-1912:
---------------------------------
[~field.cady] Did you attach to the wrong Jira?
> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
> Key: PIG-1912
> URL: https://issues.apache.org/jira/browse/PIG-1912
> Project: Pig
> Issue Type: Bug
> Environment: Ubuntu 10.04
> Reporter: Field Cady
> Assignee: Daniel Dai
> Fix For: 0.8.1
>
> Attachments: PIG-1912-1.patch, PIG-1912-2.patch,
> PIG-1912-3_0.8.patch, PIG-1912-3.patch, pigbug.zip
>
>
> I have a small demonstration script (actually, a directory with one main
> script and several other scripts that it calls) where the output (STOREd to a
> file) is not consistent between runs. I will paste the files below this
> message, and I can also email the tarball to anybody who would like it; I
> wanted to just upload the tarball but I don't see a way to do that.
> The problem appears to be that when a dataset X gets LOADed twice, with
> things other than LOADs occurring between the loads (like a FOREACH
> GENERATE), a FOREACH GENERATE that is later performed on X doesn't always
> choose the correct columns. The correctness of the output was highly
> variable on my computer, for one of my co-workers it *almost* always failed,
> and for two other of my co-workers they didn't see any failures, so it's
> likely to be a race condition or something like that.
> -- FILES FOR REPLICATING THE PROBLEM
> -- I will paste the name of the file as a comment, with the content of the
> file beneath it.
> -- I will put the contents of the following files:
> -- 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and
> load_raw_data.pig)
> -- 2) The input data file (data.csv)
> -- 3) The correct output file (correct_output.csv)
> -- 4) The shell script that runs the pig files and compares their output to
> what it should be
> -- 5) README
> -- main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(','); -- this is wrong sometimes
> -- calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
> -- calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
> -- load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
> AS (
> x,
> y,
> w
> );
> -- data.csv
> x1,CORRECT ANSWER,21148.59
> x2,CORRECT OUTPUT,27219.98
> x3,RIGHT ANSWER,10818.15
> -- correct_output.csv
> x1,CORRECT ANSWER
> x2,CORRECT OUTPUT
> x3,RIGHT ANSWER
> -- testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
> rm -rf output/*
> pig -x local -d WARN -e "set debug off;run main.pig" || break
> diff correct_output.csv output/Y/part-m-00000 && echo good
> results[$i]=$?
> i=$((i+1))
> done;
> echo ${results[*]}
> -- README
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the input is right, but sometimes it's wrong for no apparent reason.
> The scripts and dataset included in this directory demonstrate the
> issue. The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs. In particular, this file SHOULD just be
> the first and third columns of data.csv, but it sometimes uses the
> second column in place of the third.
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
> * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in
> main.pig
> * making a copy of data.csv called data2.csv, and a file load_daw_data2.pig
> that loads data2.csv and having having calc_x_W.pig use that instead.
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
> I have a shell script testmany.sh which runs my script multiple times
> and reports for which runs the output agrreed with the file
> correct_output.csv.
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0. On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs. Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computer never actually observed the wrong output. We suspect
> this is likely a race condition.
> Thanks!
> USAGE
> $ cd pigbug
> $ bash testmany.sh
> $ # the last line of output will be a sequence of 0s and 1s, with 1
> $ # meaning that there was disagreement between the output and
> $ # correct_output.csv
> Field Cady
> [email protected]
> [email protected]
> (360)621-4810
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira