[ https://issues.apache.org/jira/browse/PIG-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-484:
-------------------------------
Assignee: Pradeep Kamath
Status: Patch Available (was: Open)
Patch details:
- The idea is to stream values in small bags (chunks) to the "Initial" version
of the algebraic function in the combiner. Previously, when the map produced an
explosion of values, the original bag could grow large and cause costly spills.
This is now avoided because only small chunks are sent to the aggregate
function's "Initial" method.
- The code checks for a combine plan and, if one is present, replaces the
POPackage and POForEach in that plan with POJoinPackage, a combination of the
two that is customized for streaming small bags between the package and the
foreach.
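
As a rough illustration of the chunking idea only (this is not the patch code, which is Java inside Pig), the sketch below feeds small bags to an "Initial"-style function and folds the partial results, so the full set of values is never held in one bag. The names `chunks`, `min_initial`, `min_intermediate`, `streamed_min` and the `CHUNK_SIZE` value are hypothetical and do not come from PIG-484.patch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional

# Hypothetical chunk size; the value used by the actual patch is not stated here.
CHUNK_SIZE = 3


def chunks(values: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive small bags (lists) holding at most `size` values."""
    it = iter(values)
    while True:
        bag = list(islice(it, size))
        if not bag:
            return
        yield bag


def min_initial(bag: List[int]) -> int:
    """Stand-in for the "Initial" stage: reduce one small bag to a partial result."""
    return min(bag)


def min_intermediate(partials: Iterable[int]) -> Optional[int]:
    """Fold partial results one at a time; returns None if there were no values."""
    result: Optional[int] = None
    for partial in partials:
        if result is None or partial < result:
            result = partial
    return result


def streamed_min(values: Iterable[int], chunk_size: int = CHUNK_SIZE) -> Optional[int]:
    """Compute MIN while holding only one small bag in memory at a time."""
    return min_intermediate(min_initial(bag) for bag in chunks(values, chunk_size))
```

Because MIN is algebraic, the minimum of the per-chunk minima equals the minimum over all values, which is why the large bag does not need to be materialized.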
> PERFORMANCE: streaming data to aggregate functions
> --------------------------------------------------
>
> Key: PIG-484
> URL: https://issues.apache.org/jira/browse/PIG-484
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Assignee: Pradeep Kamath
> Fix For: types_branch
>
> Attachments: PIG-484.patch
>
>
> Currently, for queries like
> A = load 'data';
> B = group A by $0;
> C = foreach B generate group, MIN(A.$1), MAX(A.$1);
> the data is put into a bag before being passed to the aggregate functions.
> This is unnecessary and inefficient. In this case, the data can simply be
> streamed to the functions.