[ https://issues.apache.org/jira/browse/PIG-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pi Song updated PIG-153:
------------------------
Attachment: PIG_153_fix_optimization.patch
This one-line fix took me an hour to hunt down! The bug is in the MapReduce
optimization in which already-executed sub-plans are cached and reused as
input for new plans.
The unit test only covers the reported scenario. Beyond that, I couldn't find
a non-intrusive way to test it.
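To illustrate the caching problem described above, here is a minimal Python sketch. It is a toy model, not Pig's code and not the attached patch: the Session class, its materialized map, and the assumed failure mode (the parent's cache entry being overwritten with the child's output) are all hypothetical, chosen only because they reproduce the symptom in the report.

```python
# Toy model only: hypothetical names, not Pig's classes or the actual patch.

# Rows of data/test/test2.txt as shown in the report (third field of c2 is empty).
ROWS = [("a1", "1", "5700"), ("b1", "2", "2001"), ("c2", "2", "")]


class Session:
    """Executes plans and keeps the stored output of already-executed plans."""

    def __init__(self, buggy):
        self.buggy = buggy
        self.materialized = {}  # plan name -> rows stored for that plan

    def dump(self, name, parent=None, predicate=None):
        if parent is None:
            rows = list(ROWS)  # stands in for reading the input file
        else:
            rows = self.materialized[parent]  # reuse the parent's stored output
        if predicate is not None:
            rows = [r for r in rows if predicate(r)]
        self.materialized[name] = rows
        if self.buggy and parent is not None:
            # Assumed failure mode: the parent's cache entry is repointed at
            # the child's output, so later plans on the parent read filtered data.
            self.materialized[parent] = rows
        return rows


def run_scenario(buggy):
    """Replay the three dumps from the report and return their results."""
    s = Session(buggy)
    a = s.dump("a")
    b = s.dump("b", parent="a", predicate=lambda r: r[0] == "a1")
    c = s.dump("c", parent="a", predicate=lambda r: r[0] == "b1")
    return a, b, c
```

In this model, run_scenario(True) returns an empty list for c, as in the report, while run_scenario(False) returns the b1 row.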
> Incorrect results when there is a dump in between statements.
> --------------------------------------------------------------
>
> Key: PIG-153
> URL: https://issues.apache.org/jira/browse/PIG-153
> Project: Pig
> Issue Type: Bug
> Environment: Pig + Hadoop
> Reporter: Amir Youssefi
> Attachments: PIG_153_fix_optimization.patch
>
>
> Following scenario is with Pig + Hadoop.
> A similar run with Local Pig showed correct results.
> Here is test file data/test/test2.txt:
> a1 1 5700
> b1 2 2001
> c2 2
> I run the following script step by step:
> grunt> a = load 'data/test/test2.txt';
> grunt> dump a;
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce
> Job -----
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
> [/user/amiry/data/test/test2.txt:org.apache.pig.builtin.PigStorage()]
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
> /tmp/temp1359677959/tmp-246846292:org.apache.pig.builtin.BinStorage
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism:
> -1
> 2008-03-18 06:41:55,163 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce
> parallelism: -1
> 2008-03-18 06:41:57,472 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 0%
> 2008-03-18 06:41:58,477 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 50%
> 2008-03-18 06:42:04,495 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 100%
> (a1, 1, 5700)
> (b1, 2, 2001)
> (c2, 2, )
> grunt> b = filter a by $0 eq 'a1';
> grunt> dump b;
> 2008-03-18 06:42:23,881 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce
> Job -----
> 2008-03-18 06:42:23,881 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
> [/tmp/temp1359677959/tmp-246846292:org.apache.pig.builtin.BinStorage]
> 2008-03-18 06:42:23,881 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map:
> [[*]->[FILTER BY ( [PROJECT $0] eq ['a1'])]]
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
> /tmp/temp1359677959/tmp1851797397:org.apache.pig.builtin.BinStorage
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism:
> -1
> 2008-03-18 06:42:23,882 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce
> parallelism: -1
> 2008-03-18 06:42:25,938 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 0%
> 2008-03-18 06:42:28,946 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 50%
> 2008-03-18 06:42:34,963 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 100%
> (a1, 1, 5700)
> grunt> c = filter a by $0 eq 'b1';
> grunt> dump c;
> 2008-03-18 06:42:59,884 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce
> Job -----
> 2008-03-18 06:42:59,884 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
> [/tmp/temp1359677959/tmp1851797397:org.apache.pig.builtin.BinStorage]
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map:
> [[*]->[FILTER BY ( [PROJECT $0] eq ['b1'])]]
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
> /tmp/temp1359677959/tmp-1157182212:org.apache.pig.builtin.BinStorage
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism:
> -1
> 2008-03-18 06:42:59,885 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce
> parallelism: -1
> 2008-03-18 06:43:01,964 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 0%
> 2008-03-18 06:43:04,974 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 50%
> 2008-03-18 06:43:06,980 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher
> - Pig progress = 100%
> grunt>
> Meaning c is empty, although it should contain (b1, 2, 2001).
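The job logs above already show where the data went wrong. As a rough sketch, the following Python snippet links each job's Input to the earlier job that wrote that path. The paths are copied from the log (with the stray space from line wrapping removed); the job labels and the input_sources helper are made up for illustration.

```python
# (label, Input path, Output path) for the three jobs in the log above.
JOBS = [
    ("dump a",
     "/user/amiry/data/test/test2.txt",
     "/tmp/temp1359677959/tmp-246846292"),
    ("dump b",
     "/tmp/temp1359677959/tmp-246846292",
     "/tmp/temp1359677959/tmp1851797397"),
    ("dump c",
     "/tmp/temp1359677959/tmp1851797397",
     "/tmp/temp1359677959/tmp-1157182212"),
]


def input_sources(jobs):
    """Map each job to the earlier job whose output it reads (None = original file)."""
    written_by = {}  # output path -> label of the job that wrote it
    sources = {}
    for label, input_path, output_path in jobs:
        sources[label] = written_by.get(input_path)
        written_by[output_path] = label
    return sources


for label, source in input_sources(JOBS).items():
    print(label, "reads", source or "the original input file")
```

Both b and c are defined as filters on a, so both jobs would be expected to read the output of dump a. The third job reading the output of dump b, which holds only the a1 row, is what makes c empty.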