[
https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated PIG-7:
-------------------------
Patch Info: [Patch Available]
Attaching a patch that implements use of the combiner for algebraic functions in
limited situations. The combiner is only applied when all functions to be
evaluated in a given generate line are algebraic and when there is one and only
one relation being grouped (i.e., it is not applied in cogroup situations).
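
The eligibility rule above can be sketched as a small predicate. This is only an
illustration in Python, not code from the patch: the function name
can_use_combiner, the ALGEBRAIC set, and its contents are hypothetical.

```python
# Hypothetical set of function names treated as algebraic in this sketch.
ALGEBRAIC = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


def can_use_combiner(generate_funcs, num_grouped_relations):
    """Return True only if every function in the generate line is algebraic
    and exactly one relation is being grouped (so cogroup is excluded)."""
    all_algebraic = all(f.upper() in ALGEBRAIC for f in generate_funcs)
    return all_algebraic and num_grouped_relations == 1
```

Under these assumptions, a generate line that mixes in a single non-algebraic
function, or a cogroup of two relations, falls back to the path without the combiner.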
Initial, very simple performance tests show a speedup of ~40% (13m -> 7.5m
for 4G on 10 machines) with the following script:
a = load '/user/pig/tests/data/perf/studenttab200M';
b = group a by $0;
c = foreach b generate group, COUNT($1), SUM($1.$2), AVG($1.$2), MIN($1.$1),
MAX($1.$2);
store c into 'bla';
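
To illustrate why the aggregates in this script can be pushed into the combiner,
here is a minimal Python sketch of partial aggregation. It is not the patch's
implementation: the names initial, combine, final, combiner, and reducer are
assumptions made for the example, and it aggregates a single value column per
key, whereas the script above reads more than one column.

```python
def initial(value):
    # Partial state for a single record: (count, sum, min, max).
    return (1, value, value, value)


def combine(a, b):
    # Merge two partial states. The merge is associative and commutative,
    # so it can run in the combiner and again in the reducer.
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


def final(state):
    # Turn the merged partial state into the final aggregate values.
    count, total, lo, hi = state
    return {"COUNT": count, "SUM": total, "AVG": total / count,
            "MIN": lo, "MAX": hi}


def combiner(records):
    """Pre-aggregate one map task's (key, value) records into one state per key."""
    partial = {}
    for key, value in records:
        state = initial(value)
        partial[key] = combine(partial[key], state) if key in partial else state
    return partial


def reducer(partials):
    """Merge the per-task partial states and produce final results per key."""
    merged = {}
    for partial in partials:
        for key, state in partial.items():
            merged[key] = combine(merged[key], state) if key in merged else state
    return {key: final(state) for key, state in merged.items()}


# Two hypothetical map tasks, each pre-aggregated before the shuffle.
map1 = [("a", 1), ("b", 10), ("a", 3)]
map2 = [("a", 5), ("b", 20)]
result = reducer([combiner(map1), combiner(map2)])
```

Each map task ships one small state per key instead of every record, which is
where the reduction in shuffled data would come from. AVG is carried as a
(count, sum) pair and divided only at the end, since averaging partial averages
would give the wrong answer.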
> Optimize execution of algebraic functions
> -----------------------------------------
>
> Key: PIG-7
> URL: https://issues.apache.org/jira/browse/PIG-7
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Attachments: combiner.patch
>
>
> Algebraic functions are those that can be computed incrementally, like COUNT(X),
> SUM(X), etc. They can be computed efficiently by doing the first-level
> computation in the Hadoop combiner. This can give a significant (2-3x) speedup
> for many aggregation queries.
> Several users have asked us for this feature, so it is a pretty high priority.