David Ciemiewicz commented on PIG-979:

This JIRA doesn't quite capture why I believe the Accumulator interface 
is of interest.  It isn't just about performance and avoiding repeated passes 
over the same data.

It is also about providing an interface to support CUMULATIVE_SUM, RANK, and 
other functions of their ilk.

A better code example for justifying this would be:

A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = order A by count desc parallel 1;
C = foreach B generate
        CUMULATIVE_SUM(count) as cumulative_count,
        RANK(count) as rank;

The RANK and CUMULATIVE_SUM functions would keep persistent state and yet 
would emit one value for each value or tuple passed in.  Bags would not be appropriate as 
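
To illustrate the idea of a function that keeps state between calls yet emits 
one result per input, here is a minimal sketch in Python.  It is not Pig's API: 
the class names, the emit() method, and the tie handling in Rank (equal counts 
share a rank, and the next distinct count skips ahead) are all assumptions made 
for illustration.  It also assumes the values arrive already sorted, as they 
would after the "order A by count desc" step above.

```python
class CumulativeSum:
    """Hypothetical stateful function: keeps a running total across calls."""

    def __init__(self):
        self.total = 0

    def emit(self, value):
        # One output per input: the sum of everything seen so far.
        self.total += value
        return self.total


class Rank:
    """Hypothetical stateful function: 1-based rank over pre-sorted input."""

    def __init__(self):
        self.seen = 0         # number of values processed so far
        self.rank = 0         # rank assigned to the current run of equal values
        self.previous = None  # last value seen, used to detect ties

    def emit(self, value):
        self.seen += 1
        if value != self.previous:
            # A new distinct value takes its position in the sorted stream.
            self.rank = self.seen
            self.previous = value
        return self.rank


# Illustrative counts, already sorted in descending order.
counts = [50, 30, 30, 10]
cumulative_sum, rank = CumulativeSum(), Rank()
rows = [(cumulative_sum.emit(c), rank.emit(c)) for c in counts]
```

Each element of rows pairs a cumulative count with a rank, one row per input 
value, which is the shape a foreach over the sorted relation would need.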

Additionally, the reason for the Accumulator interface is to avoid multiple 
passes over the same data:

For instance, consider the example:

A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = group A all;
C = foreach B generate

Repeatedly shuffling the same values just isn't an optimal way to process data.
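
As a rough sketch of the single-pass idea, the Python below feeds the same 
batches of values to several accumulators at once, so the data is read one time 
no matter how many aggregates are computed.  This is not Pig's API: the 
accumulate() and get_value() method names, the batch size, and the choice of 
sum, count, and max as the aggregates are assumptions for illustration only.

```python
class SumAccumulator:
    """Hypothetical accumulator: adds up values one batch at a time."""

    def __init__(self):
        self.total = 0

    def accumulate(self, batch):
        self.total += sum(batch)

    def get_value(self):
        return self.total


class CountAccumulator:
    """Hypothetical accumulator: counts values one batch at a time."""

    def __init__(self):
        self.count = 0

    def accumulate(self, batch):
        self.count += len(batch)

    def get_value(self):
        return self.count


class MaxAccumulator:
    """Hypothetical accumulator: tracks the largest value seen so far."""

    def __init__(self):
        self.best = None

    def accumulate(self, batch):
        for value in batch:
            if self.best is None or value > self.best:
                self.best = value

    def get_value(self):
        return self.best


def run_single_pass(values, accumulators, batch_size=2):
    """Walk the values once, handing each batch to every accumulator."""
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for accumulator in accumulators:
            accumulator.accumulate(batch)
    return [accumulator.get_value() for accumulator in accumulators]


# Illustrative data: three aggregates computed from one pass over the counts.
results = run_single_pass(
    [50, 30, 30, 10],
    [SumAccumulator(), CountAccumulator(), MaxAccumulator()],
    batch_size=3,
)
```

Because each accumulator only ever sees a batch, none of them needs the whole 
bag in memory, and adding another aggregate does not add another pass.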

> Acummulator Interface for UDFs
> ------------------------------
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

