[ 
https://issues.apache.org/jira/browse/PIG-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-484:
-------------------------------

    Assignee: Pradeep Kamath
      Status: Patch Available  (was: Open)

Patch details:
- The idea is to stream small chunks in bags to the "Initial" version of the 
algebraic function in the combiner. Previously, when the map produced an 
explosion of values, the original bag could grow large and cause costly 
spills. This is now avoided, since only small chunks are sent to the 
aggregate function's "Initial" method.
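
To illustrate the chunking idea, here is a minimal Python sketch. It is not 
code from the patch: the names chunks, min_initial, min_intermediate and 
streamed_min, and the chunk size of 3, are hypothetical stand-ins chosen only 
to show how an algebraic MIN can be fed small bags rather than one large bag.

```python
from itertools import islice


def chunks(values, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-ins for two stages of an algebraic MIN function.
def min_initial(bag):
    # Sees only one small bag at a time.
    return min(bag)


def min_intermediate(partials):
    # Combines the partial results produced by min_initial.
    return min(partials)


def streamed_min(values, chunk_size=3):
    # Only one chunk is held in memory at a time; the full bag is never built.
    # Raises ValueError on empty input, like the built-in min().
    partials = (min_initial(c) for c in chunks(values, chunk_size))
    return min_intermediate(partials)
```

Because MIN is algebraic, the minimum of the per-chunk minimums equals the 
minimum over all values, so chunking does not change the result.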

The code checks for a combine plan and, if one is present, replaces the 
POPackage and POForEach in that plan with POJoinPackage, a combination of the 
two operators customized for streaming small bags between the package and the 
foreach.
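
The operator replacement described above can be pictured with a toy model. 
This is only a sketch under strong assumptions: the plan is modeled as a flat 
list of operator names, rewrite_combine_plan is a hypothetical helper, and 
"OtherOp" is a placeholder. Pig's actual plan classes are not used.

```python
def rewrite_combine_plan(plan):
    """Return a copy of `plan` in which each adjacent POPackage, POForEach
    pair is replaced by a single POJoinPackage entry.

    `plan` is a list of operator names, or None when there is no combine plan.
    """
    if plan is None:
        # No combine plan present, so there is nothing to rewrite.
        return None
    rewritten = []
    i = 0
    while i < len(plan):
        is_pair = (
            plan[i] == "POPackage"
            and i + 1 < len(plan)
            and plan[i + 1] == "POForEach"
        )
        if is_pair:
            rewritten.append("POJoinPackage")
            i += 2  # skip both operators of the replaced pair
        else:
            rewritten.append(plan[i])
            i += 1
    return rewritten
```

For example, rewrite_combine_plan(["POPackage", "POForEach", "OtherOp"]) 
gives ["POJoinPackage", "OtherOp"], and a missing plan is left as None.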



> PERFORMANCE: streaming data to aggregate functions
> --------------------------------------------------
>
>                 Key: PIG-484
>                 URL: https://issues.apache.org/jira/browse/PIG-484
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Olga Natkovich
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-484.patch
>
>
> Currently, for queries like
> A = load 'data';
> B = group A by $0;
> C = foreach B generate group, MIN(A.$1), MAX(A.$1);
> The data will be put into the bag before being passed to aggregate functions. 
> This is unnecessary and inefficient. In this case, data can be just streamed 
> to the functions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
