[ https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547012 ]
Utkarsh Srivastava commented on PIG-7:
--------------------------------------
Unfortunately, this patch has problems. It can kick off the combiner even in
situations where it is not applicable. Following is the test sequence I tried:
finishship-lm-corp-yahoo-com:~/Documents/workspace/Test Pig $ java -cp pig.jar org.apache.pig.Main -
Connecting to hadoop file system at: localhost:9000
Connecting to map-reduce job tracker at: localhost:9001
grunt> a = load 'file:b';
grunt> b = group a by $0;
grunt> c = foreach b generate group, a;
grunt> dump c;
As the output below shows, the combiner is kicked off even though it should
not be, and the job then fails.
----- MapReduce Job -----
Input: [/tmp/temp-1447320079/tmp-1892534978:org.apache.pig.builtin.PigStorage()]
Map: [[*]]
Group: [GENERATE {[PROJECT $0],[*]}]
Combine: GENERATE {[PROJECT $0],[PROJECT $1]}
Reduce: GENERATE {[PROJECT $0],[PROJECT $1]}
Output: /tmp/temp-1447320079/tmp840894904:org.apache.pig.builtin.BinStorage
Split: null
Map parallelism: -1
Reduce parallelism: -1
Job jar size = 476828
Pig progress = 0%
Pig progress = 50%
Error message from task (map) tip_200711292202_0003_m_000000
Error message from task (reduce) tip_200711292202_0003_r_000000
java.io.IOException: Unexpected data while reading tuple from binary file
at org.apache.pig.data.Tuple.readFields(Tuple.java:294)
at org.apache.pig.data.DataBag.read(DataBag.java:251)
at org.apache.pig.data.Tuple.readDatum(Tuple.java:322)
at org.apache.pig.data.Tuple.read(Tuple.java:308)
at org.apache.pig.data.Tuple.readFields(Tuple.java:295)
at org.apache.pig.data.IndexedTuple.readFields(IndexedTuple.java:52)
at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:210)
at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.<init>(ReduceTask.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.<init>(ReduceTask.java:228)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:320)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
java.io.IOException: Unexpected data while reading tuple from binary file
I think the problem is that ProjectSpec unconditionally returns true for
amenableToCombiner(), whereas in the above example the projection is not
amenable to the combiner.
Another, much smaller problem is that in the visitSortDistinct() method, the
sortSpec can be null (if the operator is carrying out a distinct), and that
throws a NullPointerException (EvalSpecVisitor.java:62).
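A minimal sketch of the kind of null guard that would avoid this. The classes
below (EvalSpecStub, SortDistinctStub, NullSafeVisitor) are hypothetical
stand-ins written for illustration, not the real Pig types, and the actual
body of visitSortDistinct() may look different:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real Pig classes.
class EvalSpecStub {
    final String name;
    EvalSpecStub(String name) { this.name = name; }
}

class SortDistinctStub {
    // Assumed to be null when the operator performs a distinct, not a sort.
    final EvalSpecStub sortSpec;
    SortDistinctStub(EvalSpecStub sortSpec) { this.sortSpec = sortSpec; }
}

class NullSafeVisitor {
    // Records what was visited, so the behaviour is easy to inspect.
    final List<String> visited = new ArrayList<>();

    void visitSortDistinct(SortDistinctStub sd) {
        visited.add("sortDistinct");
        // Guard: only descend into the sort spec when there is one.
        if (sd.sortSpec != null) {
            visited.add(sd.sortSpec.name);
        }
    }
}
```

With this guard, visiting a distinct (null sortSpec) records only the operator
itself instead of throwing.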
Another problem (though it does not strictly need to be solved in the first
version) is that the combiner is kicked off only in very restricted
situations. The condition in MapreducePlanCompiler.java is
if (mro.toReduce == null && spec.amenableToCombiner() &&
spec instanceof GenerateSpec &&
mro.groupFuncs != null && mro.groupFuncs.size() == 1) {
But in most cases, users will follow up a GENERATE of SUM, AVG etc. with a
filter, another foreach, and so on. In these cases spec will be an instance of
CompositeEvalSpec whose first element is a GenerateSpec, and the combiner
won't fire. It would be easy to replace the instanceof check with a more
general condition:
spec instanceof GenerateSpec || (spec instanceof CompositeEvalSpec &&
((CompositeEvalSpec)spec).getSpecs().get(0) instanceof GenerateSpec)
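To make the more general condition concrete, here is a rough, self-contained
sketch. The classes are empty stand-ins that only borrow the names used above
(FilterSpec is a placeholder for any follow-up operator), and the helper
startsWithGenerate() and the empty-list check are my own additions, not
something in the patch:

```java
import java.util.Arrays;
import java.util.List;

// Empty stand-ins that mirror only the class names mentioned above.
class EvalSpec {}
class GenerateSpec extends EvalSpec {}
class FilterSpec extends EvalSpec {}   // placeholder follow-up operator

class CompositeEvalSpec extends EvalSpec {
    private final List<EvalSpec> specs;
    CompositeEvalSpec(EvalSpec... specs) { this.specs = Arrays.asList(specs); }
    List<EvalSpec> getSpecs() { return specs; }
}

class CombinerCheck {
    // Hypothetical helper: true if spec is a GenerateSpec, or a composite
    // whose first element is a GenerateSpec.
    static boolean startsWithGenerate(EvalSpec spec) {
        if (spec instanceof GenerateSpec) {
            return true;
        }
        if (spec instanceof CompositeEvalSpec) {
            List<EvalSpec> specs = ((CompositeEvalSpec) spec).getSpecs();
            // The isEmpty() check is an extra safety assumption.
            return !specs.isEmpty() && specs.get(0) instanceof GenerateSpec;
        }
        return false;
    }
}
```

A GENERATE followed by a filter would then pass the check, while a composite
that starts with a filter would not. The amenableToCombiner() test would
still have to hold for the part that actually runs in the combiner.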
> Optimize execution of algebraic functions
> -----------------------------------------
>
> Key: PIG-7
> URL: https://issues.apache.org/jira/browse/PIG-7
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Attachments: combiner.patch
>
>
> Algebraic functions are functions that can be computed incrementally, like
> COUNT(X), SUM(X), etc. They can be computed efficiently by doing the
> first-level computation in the Hadoop combiner. This can give a significant
> (2-3x) speedup for many aggregation queries.
> Several users asked us for this feature so it is pretty high priority.