[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759804#action_12759804 ]
Alan Gates commented on PIG-979:
--------------------------------

Consider a Pig script like the following:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate CUMMULATIVE_SUM(D);
}
{code}

Because the UDF needs to see this data in order, it cannot be implemented with Pig's Algebraic interface. But it does not need to see all the contents of the bag at once.

One way to address this is to add an Accumulator interface that UDFs could implement.

{code}
interface Accumulator<T> {
    /**
     * Pass tuples to the UDF. The passed in bag will contain only records from one
     * key. It may not contain all the records for one key. This function will
     * be called repeatedly until all records from one key are provided
     * to the UDF.
     * @param b 1 or more tuples, all sharing the same key.
     */
    void accumulate(Bag b);

    /**
     * Called when all records from a key have been passed to accumulate.
     * @return the value for the UDF for this key.
     */
    T getValue();
}
{code}

When all UDFs in a given foreach implement this Accumulator interface, Pig could choose to use it to push records to the UDFs. It would then not need to read all records from the Reduce iterator and cache them in memory or on disk.

Before we commit to adding this new level of complexity to the language, we should performance test it. We recently made a change aimed at addressing Pig's problem of dying during large non-algebraic group bys (see PIG-975), so this needs to perform significantly better than that change to justify adding it.

> Acummulator Interface for UDFs
> ------------------------------
>
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag.
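
To illustrate the Accumulator proposal in the comment above, here is a minimal sketch of how a cumulative-sum UDF might implement it. This is not Pig code: List<Long> is a stand-in for Pig's Bag so the example compiles on its own, and the names CumulativeSum and AccumulatorSketch are hypothetical. It assumes the caller delivers the records for one key in sorted order, split across several accumulate calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed interface. List<Long> replaces
// Pig's Bag so the sketch compiles without Pig on the classpath.
interface Accumulator<T> {
    void accumulate(List<Long> batch);
    T getValue();
}

// Running sum over values that arrive already ordered, one batch at a time.
class CumulativeSum implements Accumulator<List<Long>> {
    private long runningTotal = 0L;
    private final List<Long> sums = new ArrayList<>();

    @Override
    public void accumulate(List<Long> batch) {
        // Only the running total carries over between batches, so no single
        // call needs to see every record for the key.
        for (long value : batch) {
            runningTotal += value;
            sums.add(runningTotal);
        }
    }

    @Override
    public List<Long> getValue() {
        return sums;
    }
}

class AccumulatorSketch {
    public static void main(String[] args) {
        CumulativeSum udf = new CumulativeSum();
        // The caller (Pig, in the proposal) pushes one key's records in pieces.
        udf.accumulate(Arrays.asList(1L, 2L));
        udf.accumulate(Arrays.asList(3L));
        System.out.println(udf.getValue()); // prints [1, 3, 6]
    }
}
```

In this sketch the result still grows with the input because a cumulative sum emits one value per record. The input side is where the saving would come from, since each batch can be discarded once accumulate returns.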