[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759813#action_12759813 ]
David Ciemiewicz commented on PIG-979:
--------------------------------------

This JIRA doesn't quite capture why I believe the Accumulator interface is of interest. It isn't just about performance and avoiding repeated passes over the same data. It is also about providing an interface to support CUMULATIVE_SUM, RANK, and other functions of that ilk.

A better code example for justifying this would be:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = order A by count desc parallel 1;
C = foreach B generate
        query,
        count,
        CUMULATIVE_SUM(count) as cumulative_count,
        RANK(count) as rank;
{code}

The RANK and CUMULATIVE_SUM functions would hold persistent state and yet emit one value for each value or tuple passed in. Bags, as currently coded, would not be appropriate for this.

Additionally, the reason for the Accumulator interface is to avoid multiple passes over the same data. For instance, consider the example:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = group A all;
C = foreach B generate
        group,
        SUM(A.count),
        AVG(A.count),
        VAR(A.count),
        STDEV(A.count),
        MIN(A.count),
        MAX(A.count),
        MEDIAN(A.count);
{code}

Repeatedly shuffling the same values just isn't an optimal way to process data.

> Acummulator Interface for UDFs
> ------------------------------
>
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set
> number of records at a time instead of the entire bag.
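
As a rough illustration of the two points in the comment above, here is a minimal Python sketch of the accumulator idea. It is not Pig's API: the class names (CumulativeSum, Rank, RunningStats), the accumulate/get_value methods, and the two helper functions are all hypothetical names chosen for this example. It assumes the input to Rank is already sorted in descending order, as the "order A by count desc" step would provide, and that tied values share a rank.

```python
_UNSET = object()  # sentinel meaning "no value seen yet"


class CumulativeSum:
    """Hypothetical stateful function: emits a running total per input value."""

    def __init__(self):
        self.total = 0

    def accumulate(self, value):
        self.total += value
        return self.total


class Rank:
    """Hypothetical stateful function: emits a rank per input value.

    Assumes values arrive already sorted; equal values share a rank and
    the next distinct value takes its position in the stream (1, 2, 2, 4).
    """

    def __init__(self):
        self.seen = 0
        self.rank = 0
        self.previous = _UNSET

    def accumulate(self, value):
        self.seen += 1
        if self.previous is _UNSET or value != self.previous:
            self.rank = self.seen
            self.previous = value
        return self.rank


class RunningStats:
    """Hypothetical accumulator: several aggregates from one pass."""

    def __init__(self):
        self.n = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def accumulate(self, value):
        self.n += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def get_value(self):
        avg = self.total / self.n if self.n else None
        return {"sum": self.total, "avg": avg,
                "min": self.minimum, "max": self.maximum}


def cumulative_and_rank(rows):
    """Emit (query, count, cumulative_count, rank) for each sorted row."""
    csum, rank = CumulativeSum(), Rank()
    return [(query, count, csum.accumulate(count), rank.accumulate(count))
            for query, count in rows]


def summarize(counts):
    """Compute sum, avg, min and max while reading the values only once."""
    stats = RunningStats()
    for count in counts:
        stats.accumulate(count)
    return stats.get_value()
```

In this sketch each function keeps its own state between calls, so one value goes in and one value comes out without materializing a bag, and the summary aggregates share a single loop over the data instead of one pass per aggregate.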