[jira] Commented: (PIG-7) Optimize execution of algebraic functions

Utkarsh Srivastava (JIRA) Thu, 29 Nov 2007 22:34:06 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547012
 ]


Utkarsh Srivastava commented on PIG-7:
--------------------------------------

Unfortunately, this patch has problems. It can kick off the combiner even in 
situations where it is not applicable. Here is the test sequence I tried:

finishship-lm-corp-yahoo-com:~/Documents/workspace/Test Pig $ java -cp pig.jar org.apache.pig.Main -
Connecting to hadoop file system at: localhost:9000
Connecting to map-reduce job tracker at: localhost:9001
grunt> a = load 'file:b';
grunt> b = group a by $0;
grunt> c = foreach b generate group, a; 
grunt> dump c;


As you will see below, the combiner is kicked off when it shouldn't be, and 
then the job fails.




----- MapReduce Job -----
Input: [/tmp/temp-1447320079/tmp-1892534978:org.apache.pig.builtin.PigStorage()]
Map: [[*]]
Group: [GENERATE {[PROJECT $0],[*]}]
Combine: GENERATE {[PROJECT $0],[PROJECT $1]}
Reduce: GENERATE {[PROJECT $0],[PROJECT $1]}
Output: /tmp/temp-1447320079/tmp840894904:org.apache.pig.builtin.BinStorage
Split: null
Map parallelism: -1
Reduce parallelism: -1
Job jar size = 476828
Pig progress = 0%
Pig progress = 50%
Error message from task (map) tip_200711292202_0003_m_000000
Error message from task (reduce) tip_200711292202_0003_r_000000 
java.io.IOException: Unexpected data while reading tuple from binary file
        at org.apache.pig.data.Tuple.readFields(Tuple.java:294)
        at org.apache.pig.data.DataBag.read(DataBag.java:251)
        at org.apache.pig.data.Tuple.readDatum(Tuple.java:322)
        at org.apache.pig.data.Tuple.read(Tuple.java:308)
        at org.apache.pig.data.Tuple.readFields(Tuple.java:295)
        at org.apache.pig.data.IndexedTuple.readFields(IndexedTuple.java:52)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:210)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.<init>(ReduceTask.java:160)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.<init>(ReduceTask.java:228)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:320)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
 java.io.IOException: Unexpected data while reading tuple from binary file

I think the problem is that ProjectSpec unconditionally returns true for 
amenableToCombiner(), while in the above example the projection is not amenable.
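
To illustrate the point, here is a minimal sketch of a per-item amenability 
check. The classes below (Spec, KeyProject, BagProject, AlgebraicCall, Generate) 
are hypothetical stand-ins, not Pig's actual classes, and the rule they encode 
(a projection is combinable only when it selects the group key) is an 
assumption about what a fix might look like, not what the patch does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; not Pig's real classes.
interface Spec {
    boolean amenableToCombiner();
}

// Projection of the group key (e.g. "group"): identical for every tuple of a group.
final class KeyProject implements Spec {
    public boolean amenableToCombiner() { return true; }
}

// Projection of the whole grouped bag (e.g. "a"): the reducer needs every tuple.
final class BagProject implements Spec {
    public boolean amenableToCombiner() { return false; }
}

// An algebraic function such as SUM applied to the bag.
final class AlgebraicCall implements Spec {
    final String name;
    AlgebraicCall(String name) { this.name = name; }
    public boolean amenableToCombiner() { return true; }
}

// A GENERATE is amenable only if every one of its items is.
final class Generate implements Spec {
    final List<Spec> items;
    Generate(Spec... items) { this.items = Arrays.asList(items); }
    public boolean amenableToCombiner() {
        for (Spec item : items) {
            if (!item.amenableToCombiner()) {
                return false;
            }
        }
        return true;
    }
}

final class AmenabilityDemo {
    public static void main(String[] args) {
        // Models "generate group, a" from the failing script.
        Spec failing = new Generate(new KeyProject(), new BagProject());
        // Models "generate group, SUM(a)".
        Spec summing = new Generate(new KeyProject(), new AlgebraicCall("SUM"));
        System.out.println(failing.amenableToCombiner());
        System.out.println(summing.amenableToCombiner());
    }
}
```

Under this assumed rule, the failing script's GENERATE would be rejected 
because one of its items passes the whole bag through.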


Another, much smaller problem is that in the visitSortDistinct() method, the 
sortSpec can be null (if the operator is carrying out a distinct), and that 
throws a NullPointerException (EvalSpecVisitor.java:62).
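
A null guard along the following lines would avoid the exception. This is only 
a sketch: SimpleSpec, SortDistinctSpec, getSortSpec() and SketchVisitor are 
hypothetical stand-ins, since the actual code at EvalSpecVisitor.java:62 is not 
shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real EvalSpecVisitor is not reproduced here.
class SimpleSpec {
    final String label;
    SimpleSpec(String label) { this.label = label; }
    // Records that this spec was visited.
    void visit(SketchVisitor v) { v.visited.add(label); }
}

class SortDistinctSpec extends SimpleSpec {
    private final SimpleSpec sortSpec; // null when the operator is a plain distinct
    SortDistinctSpec(SimpleSpec sortSpec) {
        super("sortDistinct");
        this.sortSpec = sortSpec;
    }
    SimpleSpec getSortSpec() { return sortSpec; }
}

class SketchVisitor {
    final List<String> visited = new ArrayList<String>();

    void visitSortDistinct(SortDistinctSpec sd) {
        visited.add(sd.label);
        SimpleSpec sortSpec = sd.getSortSpec();
        if (sortSpec != null) { // guard: a distinct carries no sort spec
            sortSpec.visit(this);
        }
    }
}

final class NullGuardDemo {
    public static void main(String[] args) {
        SketchVisitor v = new SketchVisitor();
        v.visitSortDistinct(new SortDistinctSpec(null)); // no exception
        v.visitSortDistinct(new SortDistinctSpec(new SimpleSpec("byKey")));
        System.out.println(v.visited);
    }
}
```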


Another problem (though not strictly required to be solved in the first 
version) is that the combiner is kicked off only in very restricted situations. 
The condition in MapreducePlanCompiler.java is:

if (mro.toReduce == null && spec.amenableToCombiner() &&
                    spec instanceof GenerateSpec &&
                    mro.groupFuncs != null && mro.groupFuncs.size() == 1) {

But in most cases, users will follow up a GENERATE of SUM, AVG, etc. with a 
filter, another foreach, etc. In these cases spec will be an instance of 
CompositeEvalSpec whose first element is a GenerateSpec, and the combiner 
won't fire. It would be just as easy to use a more general condition:

spec instanceof GenerateSpec || (spec instanceof CompositeEvalSpec && 
((CompositeEvalSpec)spec).getSpecs().get(0) instanceof GenerateSpec)
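
The generalized test could be factored into a small helper. The sketch below 
uses hypothetical stand-in classes that only borrow the names from the 
condition above; FilterSpec, CombinerCondition and startsWithGenerate() are 
made up for illustration, and the empty-list check is an added assumption, 
since getSpecs().get(0) would throw on an empty composite.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins that only borrow the names used above.
class EvalSpec {}
class GenerateSpec extends EvalSpec {}
class FilterSpec extends EvalSpec {}

class CompositeEvalSpec extends EvalSpec {
    private final List<EvalSpec> specs;
    CompositeEvalSpec(EvalSpec... specs) { this.specs = Arrays.asList(specs); }
    List<EvalSpec> getSpecs() { return specs; }
}

final class CombinerCondition {
    // True if spec is a GenerateSpec, or a composite whose first element is one.
    static boolean startsWithGenerate(EvalSpec spec) {
        if (spec instanceof GenerateSpec) {
            return true;
        }
        if (spec instanceof CompositeEvalSpec) {
            List<EvalSpec> specs = ((CompositeEvalSpec) spec).getSpecs();
            return !specs.isEmpty() && specs.get(0) instanceof GenerateSpec;
        }
        return false;
    }

    public static void main(String[] args) {
        // GENERATE followed by a filter: the case the original condition misses.
        EvalSpec chained = new CompositeEvalSpec(new GenerateSpec(), new FilterSpec());
        System.out.println(startsWithGenerate(chained));
        System.out.println(startsWithGenerate(new FilterSpec()));
    }
}
```

The helper would replace only the "spec instanceof GenerateSpec" term; the 
other parts of the original condition would stay as they are.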




> Optimize execution of algebraic functions
> -----------------------------------------
>
>                 Key: PIG-7
>                 URL: https://issues.apache.org/jira/browse/PIG-7
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>         Attachments: combiner.patch
>
>
> Algebraic functions are those that can be computed incrementally, like count(X), 
> SUM(X), etc. They can be computed efficiently by doing the first-level 
> computation using the Hadoop combiner. This can give a significant (2-3x) speedup 
> for many aggregation queries. 
> Several users have asked us for this feature, so it is pretty high priority.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
