[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759804#action_12759804
 ] 

Alan Gates commented on PIG-979:
--------------------------------

Consider a Pig script like the following:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate CUMMULATIVE_SUM(D);
}
{code}

Because the UDF needs to see the data in sorted order, it cannot be implemented
using Pig's Algebraic interface.  But it does not need to see the entire
contents of the bag at once.

One way to address this is to add an Accumulator interface that UDFs could 
implement.

{code}
interface Accumulator<T> {

    /**
     * Pass tuples to the UDF.  The passed in bag will contain only records
     * from one key.  It may not contain all the records for one key.  This
     * function will be called repeatedly until all records from one key are
     * provided to the UDF.
     * @param b 1 or more tuples, all sharing the same key.
     */
    void accumulate(Bag b);

    /**
     * Called when all records from a key have been passed to accumulate.
     * @return the value for the UDF for this key.
     */
    T getValue();
}
{code}
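
As a rough sketch of how a running-sum UDF might implement this interface.
Several things here are assumptions rather than part of the proposal: Bag is
stood in for by List<long[]> so the example compiles on its own, the class
name CumulativeSum is hypothetical, and resetting state inside getValue() is
just one possible convention, since the interface as written has no reset hook.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the proposed interface. "Bag" is modeled as List<long[]>
// (an assumption for illustration only).
interface Accumulator<T> {
    void accumulate(List<long[]> batch);
    T getValue();
}

// Hypothetical UDF: running sum over field $1 of each tuple.
// It relies on tuples arriving in the order produced by "order A by $1".
class CumulativeSum implements Accumulator<List<Long>> {
    private long runningTotal = 0;
    private final List<Long> sums = new ArrayList<>();

    @Override
    public void accumulate(List<long[]> batch) {
        // Only the current batch is visible here, never the full bag.
        for (long[] tuple : batch) {
            runningTotal += tuple[1];
            sums.add(runningTotal);
        }
    }

    @Override
    public List<Long> getValue() {
        List<Long> result = new ArrayList<>(sums);
        // Assumption: reset here so the same instance can serve the next key.
        runningTotal = 0;
        sums.clear();
        return result;
    }
}
```

The UDF only carries its running total between calls, so the input for a key
can be handed to it in pieces.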

In cases where all UDFs in a given foreach implement this Accumulator
interface, Pig could choose to use this method to push records to the UDFs.
It would then not need to read all records from the Reduce iterator and cache
them in memory or on disk.
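
To make the push model concrete, here is a minimal sketch of how records for
one key might be handed to the UDFs in fixed-size batches.  AccumulatorDriver,
runForKey, and batchSize are hypothetical names, and Bag is again stood in for
by List<long[]>; this is not how Pig's reduce side is actually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for the proposed interface ("Bag" modeled as List<long[]>).
interface Accumulator<T> {
    void accumulate(List<long[]> batch);
    T getValue();
}

// Hypothetical driver: pulls records for one key from the reduce-side
// iterator and pushes them to every UDF, one batch at a time.
final class AccumulatorDriver {
    static List<Object> runForKey(Iterator<long[]> records,
                                  List<Accumulator<?>> udfs,
                                  int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<long[]> batch = new ArrayList<>(batchSize);
        while (records.hasNext()) {
            batch.add(records.next());
            if (batch.size() == batchSize) {
                for (Accumulator<?> udf : udfs) {
                    udf.accumulate(batch);
                }
                // Fresh list, so at most one batch is held by the driver.
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Final, possibly short, batch for this key.
            for (Accumulator<?> udf : udfs) {
                udf.accumulate(batch);
            }
        }
        // All records for the key have been delivered; collect the results.
        List<Object> results = new ArrayList<>();
        for (Accumulator<?> udf : udfs) {
            results.add(udf.getValue());
        }
        return results;
    }
}
```

With this shape the driver buffers at most batchSize records at a time, rather
than the whole bag for the key.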

Before we commit to adding this new level of complexity to the language, we
should performance test it.  Given that we recently made a change aimed at
addressing Pig's problem of dying during large non-algebraic group bys (see
PIG-975), this needs to perform significantly better than that change to
justify adding it.


> Acummulator Interface for UDFs
> ------------------------------
>
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

