[
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775158#action_12775158
]
Alan Gates commented on PIG-979:
--------------------------------
A test should be added that checks that when accumulator UDFs are mixed with
non-accumulator UDFs it works properly.
Why is the optimization not applied in the case that inner is set on POPackage?
It seems the accumulator interface should still work in this case.
Some comments on what AccumulatorOptimizer.check() is and what it allows would
be helpful.
The code contains tabs in some spots instead of 4 spaces.
The cases in which the accumulator interface can be used have been greatly
extended by adding support for unary and binary operators. But this comes
at a cost. Every binary and unary comparison now has to make the accumChild
call. 99% of the time this will be false. It's not clear to me how often
users will do things like:
{code}
foreach C generate accumfunc1(A) + accumfunc2(A);
-- or
foreach C generate (accumfunc1(A) > 100 ? 0 : 1);
{code}
which is the only time I can see where this additional functionality is useful,
since we don't currently allow these functions in filters. It's possible that
JIT along with branch prediction will remove this extra cost, since the branch
will always be one way or another for a given query. But I'd like to see this
tested. It would be interesting to compare a query with heavy use of binary
operators (but no accumulator UDFs) with and without this change.
I don't understand why you need the new interface AccumulativeTupleBuffer and
class AccumulativeBag. Why can't the block of tuples read off of the iterator
just be put in a regular bag and then passed to the UDFs?
In all the sum implementations of accumulate you calculate the sum of the block
of tuples twice. It should be done once and cached.
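A minimal sketch of the single-pass version, using a simplified stand-in rather
than the classes in the patch: BlockSum, sumBlock, and the List<Long> block type
are hypothetical names chosen for illustration, and null handling is an
assumption. The point is only that the block's sum is computed once, held in a
local, and then folded into the running total.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for an accumulating SUM; not the actual Pig classes. */
class BlockSum {
    // Running total across all blocks seen so far; null until a non-null value arrives.
    private Long intermediateSum = null;

    /** Folds one block into the running total, summing the block exactly once. */
    void accumulate(List<Long> block) {
        Long blockSum = sumBlock(block); // computed once and cached in a local
        if (blockSum == null) {
            return; // the block held no non-null values, nothing to add
        }
        if (intermediateSum == null) {
            intermediateSum = blockSum;
        } else {
            intermediateSum = intermediateSum + blockSum;
        }
    }

    /** Returns the total so far, or null if no non-null value has been seen. */
    Long getValue() {
        return intermediateSum;
    }

    /** Resets state so the instance can be reused for the next key. */
    void cleanup() {
        intermediateSum = null;
    }

    /** Sums the non-null values of a block; returns null if there are none. */
    private static Long sumBlock(List<Long> block) {
        long total = 0L;
        boolean sawValue = false;
        for (Long v : block) {
            if (v != null) {
                total += v;
                sawValue = true;
            }
        }
        if (!sawValue) {
            return null;
        }
        return total;
    }
}
```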
In COUNT.accumulate, rather than making intermediateCount a Long and then
forcing the creation of a new Long each time you add one, you should instead
keep it as a long and depend on autoboxing to convert it to Long when you
return it in getValue. The same applies to COUNT_STAR.accumulate.
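Roughly what I have in mind, as a sketch with a simplified stand-in rather than
the real COUNT class: BlockCount and the List<?> block type are hypothetical,
and skipping nulls is an assumption about the intended semantics. The counter
stays a primitive long during accumulation and is boxed only once, on return.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for an accumulating COUNT; not the actual Pig class. */
class BlockCount {
    // Primitive counter: incrementing it allocates no Long objects.
    private long intermediateCount = 0L;

    /** Adds the number of non-null entries in this block to the running count. */
    void accumulate(List<?> block) {
        for (Object o : block) {
            if (o != null) {
                intermediateCount++;
            }
        }
    }

    /** Autoboxing converts the primitive to a Long once, at return time. */
    Long getValue() {
        return intermediateCount;
    }

    /** Resets state so the instance can be reused for the next key. */
    void cleanup() {
        intermediateCount = 0L;
    }
}
```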
> Accumulator Interface for UDFs
> ------------------------------
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set
> number of records at a time instead of the entire bag.