[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775184#action_12775184
 ] 

Ying He commented on PIG-979:
-----------------------------

Alan, thanks for the feedback.

1. A test case already covers mixing an accumulator UDF with a regular UDF; 
see testAccumBasic().

2. The optimizer can't be applied when inner is set on POPackage. If inner is 
set for an input, POPackage checks whether the bag for that input is NULL and, 
if it is, returns NULL. That check can only be made after all the tuples have 
been retrieved and put into a bag.

3 & 4. Will fix those.

5. This needs performance testing.

6. The reducer gets the result from POPackage and passes it to the root 
operator, which is POForEach, for processing. From POForEach's perspective, it 
receives a tuple containing bags from POPackage. POForEach then pulls tuples 
off the iterator and passes them to the UDFs over multiple cycles. Because only 
POPackage knows how to read tuples from the iterator and put them into the 
proper bags, AccumulativeTupleBuffer and AccumulativeBag were created to 
communicate between POPackage and POForEach. Every time POForEach calls 
getNextBatch() on AccumulativeTupleBuffer, it in effect calls an inner class of 
POPackage to retrieve tuples from the iterator.

POPackage cannot be the one to block the reading of tuples, because the reducer 
calls it only once. I also considered changing the reducer to call POPackage 
multiple times, once per batch of data, but then it becomes tricky to maintain 
correct operator state, and every operator in the reducer plan would have to 
support partial data, which is unnecessary.
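
To make the pull model in point 6 concrete, here is a minimal, self-contained 
sketch. It does not use Pig's classes. BatchBuffer, AccumulatingUdf, SumUdf and 
sumInBatches are hypothetical names invented for illustration, plain Long 
values stand in for tuples, and the batch size is arbitrary. The idea it tries 
to show is only that the consumer (the POForEach role) drives the loop, while 
the buffer (the AccumulativeTupleBuffer role) is the only piece that touches 
the underlying iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical accumulator-style UDF contract; NOT Pig's actual interface.
interface AccumulatingUdf<T> {
    void accumulate(List<Long> batch); // called once per batch
    T getValue();                      // called after the last batch
    void cleanup();                    // reset state before the next key
}

// Hypothetical stand-in for the AccumulativeTupleBuffer role: it alone
// knows how to read from the underlying iterator.
final class BatchBuffer {
    private final Iterator<Long> source;
    private final int batchSize;
    private List<Long> current = new ArrayList<>();

    BatchBuffer(Iterator<Long> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    boolean hasNextBatch() {
        return source.hasNext();
    }

    // Replaces the previous batch with up to batchSize new values.
    void nextBatch() {
        current = new ArrayList<>(batchSize);
        while (current.size() < batchSize && source.hasNext()) {
            current.add(source.next());
        }
    }

    List<Long> getBatch() {
        return current;
    }
}

// A trivial UDF that keeps only a running total, never the whole bag.
final class SumUdf implements AccumulatingUdf<Long> {
    private long total = 0;

    public void accumulate(List<Long> batch) {
        for (long v : batch) {
            total += v;
        }
    }

    public Long getValue() {
        return total;
    }

    public void cleanup() {
        total = 0;
    }
}

final class AccumulatorSketch {
    // Plays the POForEach role: pulls batches until the buffer is drained.
    static long sumInBatches(Iterable<Long> values, int batchSize) {
        BatchBuffer buffer = new BatchBuffer(values.iterator(), batchSize);
        SumUdf sum = new SumUdf();
        while (buffer.hasNextBatch()) {
            buffer.nextBatch();
            sum.accumulate(buffer.getBatch());
        }
        long result = sum.getValue();
        sum.cleanup();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sumInBatches(Arrays.asList(1L, 2L, 3L, 4L, 5L), 2));
    }
}
```

In this sketch the producer side is entered only once, when the buffer is 
created, and every later read is triggered by the consumer. That mirrors why 
the blocking has to sit on the POForEach side rather than in POPackage.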

> Accumulator Interface for UDFs
> ------------------------------
>
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>         Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.
