Awesome!!
On Nov 29, 2007, at 12:25 PM, Alan Gates (JIRA) wrote:
[ https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-7:
-------------------------
Patch Info: [Patch Available]
Attaching a patch that implements use of the combiner for algebraic
functions in limited situations. The combiner is only applied when
all functions to be evaluated in a given generate line are
algebraic and when there is one and only one relation being grouped
(i.e., it is not applied in cogroup situations).
Initial, very simple performance tests show a speedup of ~40%
(13m -> 7.5m for 4G on 10 machines) with the following script:
a = load '/user/pig/tests/data/perf/studenttab200M';
b = group a by $0;
c = foreach b generate group, COUNT($1), SUM($1.$2), AVG($1.$2), MIN($1.$1), MAX($1.$2);
store c into 'bla';
Optimize execution of algebraic functions
-----------------------------------------
Key: PIG-7
URL: https://issues.apache.org/jira/browse/PIG-7
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Olga Natkovich
Assignee: Alan Gates
Attachments: combiner.patch
Algebraic functions are functions that can be computed incrementally, like
COUNT(X), SUM(X), etc. They can be computed efficiently by doing
the first-level computation in the Hadoop combiner. This can give a
significant (2-3x) speedup for many aggregation queries.
Several users asked us for this feature so it is pretty high
priority.
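
To illustrate what "computed incrementally" means here, below is a minimal
Python sketch of AVG split into three phases, so that partial results can
be merged in a combiner before the reduce step. The function names
(avg_initial, avg_intermediate, avg_final) and the sample data are
hypothetical and chosen only for illustration; this is not the code in
combiner.patch and not Pig's actual API.

```python
# Hypothetical three-phase decomposition of AVG (illustrative names only).

def avg_initial(value):
    """Map side: turn one input value into a partial (sum, count) pair."""
    return (value, 1)


def avg_intermediate(partials):
    """Combiner: merge partial (sum, count) pairs into a single pair.

    Merging is associative, so this step can run any number of times
    without changing the final answer.
    """
    total, count = 0, 0
    for s, c in partials:
        total += s
        count += c
    return (total, count)


def avg_final(partials):
    """Reduce side: merge the remaining partials and produce the average."""
    total, count = avg_intermediate(partials)
    return total / count if count else None


# Simulate two map tasks that both hold values for the same group key.
map_task_1 = [3.0, 5.0]
map_task_2 = [4.0]

# Each map task's output is combined locally before being shuffled.
combined_1 = avg_intermediate(avg_initial(v) for v in map_task_1)
combined_2 = avg_intermediate(avg_initial(v) for v in map_task_2)

# The reducer sees one small pair per map task instead of every raw value.
result = avg_final([combined_1, combined_2])
print(result)
```

The point of the split is that only one (sum, count) pair per map task
crosses the network for each group, rather than every raw value. COUNT,
SUM, MIN and MAX decompose the same way, with simpler partial states.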
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research