[
https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547602
]
Utkarsh Srivastava commented on PIG-7:
--------------------------------------
I'm still having a similar problem. The sequence I tried is below. The combiner
shouldn't have kicked in, but it did, and then the job fails.
I think the correct solution is for CombinerVisiter to know which Project specs
are amenable to the combiner and which are not. Thus, immediately after a group,
$0 is amenable, but $1, $2, etc. are not. Another thing the patch doesn't do yet
is examine the arguments to the function. I am worried that bad things might
happen if it's something like SUM(blah($1)) instead of simply SUM($1). For now
we can just target the simplest form, i.e. an algebraic function applied
directly to $1.
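To make the proposed rule concrete, here is a minimal Java sketch of such a check. It does not use Pig's real classes: Item, project, call, isCombinable and canUseCombiner are hypothetical names invented for illustration, and the real CombinerVisiter would have to walk actual spec objects. The sketch assumes the combiner is allowed only when every generated item is either the group key ($0) or an algebraic function applied directly to the bag ($1).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; these are not Pig's actual classes.
final class CombinerEligibilitySketch {

    /** One GENERATE item, reduced to the facts the check needs. */
    static final class Item {
        final String function;    // null for a bare projection such as $0
        final boolean algebraic;  // true if the function is algebraic (SUM, COUNT, ...)
        final int column;         // column of the grouped tuple: 0 = group key, 1 = bag
        final boolean nestedCall; // true if the argument wraps another call, e.g. SUM(blah($1))

        Item(String function, boolean algebraic, int column, boolean nestedCall) {
            this.function = function;
            this.algebraic = algebraic;
            this.column = column;
            this.nestedCall = nestedCall;
        }
    }

    /** A bare projection of one column, e.g. $0 or $1. */
    static Item project(int column) {
        return new Item(null, false, column, false);
    }

    /** A function call whose argument is rooted at the given column. */
    static Item call(String function, boolean algebraic, int column, boolean nestedCall) {
        return new Item(function, algebraic, column, nestedCall);
    }

    static boolean isCombinable(Item item) {
        if (item.function == null) {
            // Bare projection: only the group key ($0) is safe to pass through.
            return item.column == 0;
        }
        // Function call: must be algebraic and applied directly to the bag ($1).
        return item.algebraic && item.column == 1 && !item.nestedCall;
    }

    /** The combiner is used only if every generated item is combinable. */
    static boolean canUseCombiner(List<Item> items) {
        for (Item item : items) {
            if (!isCombinable(item)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // generate $0, SUM($1.$1)
        List<Item> simple = Arrays.asList(project(0), call("SUM", true, 1, false));
        // generate $0, $1, SUM($1.$1)  (the failing sequence in this comment)
        List<Item> failing = Arrays.asList(project(0), project(1), call("SUM", true, 1, false));
        // generate $0, SUM(blah($1))
        List<Item> nested = Arrays.asList(project(0), call("SUM", true, 1, true));

        System.out.println(canUseCombiner(simple));  // true
        System.out.println(canUseCombiner(failing)); // false
        System.out.println(canUseCombiner(nested));  // false
    }
}
```

Under this rule, the bare $1 in the sequence below would disqualify the whole foreach from using the combiner, which is the behavior I would expect.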
finishship-lm-corp-yahoo-com:~/Documents/workspace/Test Pig $ java -cp eeyore/:pig.jar org.apache.pig.Main -
grunt> a = load 'a';
grunt> b = group a by $0;
grunt> c = foreach b generate $0, $1, SUM($1.$1);
grunt> dump c;
----- MapReduce Job -----
Input: [/user/utkarsh/a:org.apache.pig.builtin.PigStorage()]
Map: [[*]]
Group: [GENERATE {[PROJECT $0],[*]}]
Combine: GENERATE {[PROJECT $0],[PROJECT $1],[org.apache.pig.builtin.SUM$Initial(GENERATE {[PROJECT $1]->[PROJECT $1]})]}
Reduce: GENERATE {[PROJECT $0],[PROJECT $1],[org.apache.pig.builtin.SUM$Final(GENERATE {[PROJECT $1]->[PROJECT $2]})]}
Output: /tmp/temp1297608287/tmp242183746:org.apache.pig.builtin.BinStorage
Split: null
Map parallelism: -1
Reduce parallelism: -1
Job jar size = 477719
Pig progress = 0%
Pig progress = 50%
Error message from task (map) tip_200711301728_0003_m_000000
Error message from task (reduce) tip_200711301728_0003_r_000000
java.io.IOException: Unexpected data while reading tuple from binary file
at org.apache.pig.data.Tuple.readFields(Tuple.java:294)
at org.apache.pig.data.DataBag.read(DataBag.java:251)
> Optimize execution of algebraic functions
> -----------------------------------------
>
> Key: PIG-7
> URL: https://issues.apache.org/jira/browse/PIG-7
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Attachments: combiner.patch, combiner2.patch
>
>
> Algebraic functions are functions that can be computed incrementally, like
> count(X), SUM(X), etc. They can be computed efficiently by doing the first-level
> computation in the Hadoop combiner. This can give a significant (2-3x) speedup
> for many aggregation queries.
> Several users have asked us for this feature, so it is a pretty high priority.
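As an illustration of what "computed incrementally" means in the issue description, here is a small Java sketch of a sum evaluated in two stages. It is not Pig's SUM implementation: the names initial, finalStage and sumInChunks are hypothetical, and the chunks stand in for the partial inputs a combiner would see. The point it tries to show is that summing partial sums gives the same answer as summing everything at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical two-stage sum; a sketch of the idea, not Pig's SUM$Initial/SUM$Final.
final class AlgebraicSumSketch {

    /** First-level computation: partial sum over one chunk (combiner side). */
    static long initial(List<Long> chunk) {
        long sum = 0;
        for (long value : chunk) {
            sum += value;
        }
        return sum;
    }

    /** Final computation: merge the partial sums (reducer side). */
    static long finalStage(List<Long> partials) {
        long sum = 0;
        for (long partial : partials) {
            sum += partial;
        }
        return sum;
    }

    /** Splits the input into chunks, sums each chunk, then merges the partials. */
    static long sumInChunks(List<Long> values, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Long> partials = new ArrayList<>();
        for (int start = 0; start < values.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, values.size());
            partials.add(initial(values.subList(start, end)));
        }
        return finalStage(partials);
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        System.out.println(sumInChunks(values, 2)); // 15
        System.out.println(initial(values));        // 15
    }
}
```

The chunked result matches the single-pass result for any chunk size, which is the property that lets the first-level computation move into the combiner.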