[
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168085#comment-13168085
]
Dmitriy V. Ryaboy commented on PIG-2421:
----------------------------------------
Some more thoughts on this:
- We can easily match any static methods through reflection, and perhaps
instance methods for classes with no-arg constructors, invoker-style. This
would allow us to transparently reuse a ton of existing Java code without
forcing people to write annotation-laden scaffolding. That means pushing the
dynamicInvoker logic deeper into Pig (right now it's essentially just a UDF
hack).
- I'm not sure I'm comfortable with mapper / combiner / reducer annotations, or
initial / intermediate / final. Ideally, for COUNT, for example, we want to be
able to say "COUNT is equivalent to a SUM of COUNTs". As is, you don't allow
that to happen -- we have to reimplement the summing for COUNT. Can we allow
UDF authors to return method pointers to us?
- a lot of the pain we have right now is from not providing proper Context
objects. This forces us into a pretty tight space, design-wise. If we make a
strict contract about when Contexts are passed in and available, we can easily
extend what the Context carries. Things like the job conf, counter and logger
helpers, exec mode, requested schema, input schema, etc. should definitely be
in there. The current approach of squirreling things away into the conf on the
front end and unrolling it in the first invocation of exec() on the back end
is error-prone and overly complex.
- it would be awesome if tuples knew their schema, and you could get their
fields by name as well as by index.
- EvalFuncs currently must return exactly 1 row per input row. This leads to
hard-to-explain "filter for nulls" and "return a bag, then flatten" patterns
(the latter is also potentially very expensive memory-wise). Ideally we could
return a plain value, tuple, or bag and have Pig behave as it currently does,
or have EvalFuncs return Iterator<value/tuple/bag> and have Pig understand
that this means 0 to many results are coming out of the UDF.
- while we are experimenting with annotations, perhaps we can add something to
advanced eval funcs that would let them tell the planner how much data they are
producing? It'd be neat to be able to say that ngramming a text will blow up
the number of records, while counting words will shrink it.
- it's currently unclear when it's ok or not ok to reuse tuples. Tuple reuse is
huge for efficiency (and a potential source of many bugs, so it's a tradeoff).
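
To make the Iterator idea above a bit more concrete, here is a minimal sketch
in plain Java. It is not Pig's API: the MultiEvalFunc interface, the Bigrams
function, and the applyAll driver are all hypothetical names invented for
illustration, and plain Strings stand in for Pig values. It only shows how a
function that returns an Iterator can yield zero, one, or many outputs per
input row without the "return a bag, then flatten" step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class IteratorUdfSketch {

    /** Hypothetical interface (not part of Pig): 0..n outputs per input. */
    interface MultiEvalFunc<I, O> {
        Iterator<O> exec(I input);
    }

    /** Example function: emits every adjacent word pair in a line of text. */
    static final class Bigrams implements MultiEvalFunc<String, String> {
        @Override
        public Iterator<String> exec(String text) {
            if (text == null || text.trim().isEmpty()) {
                // No output rows at all; the caller needs no null filtering.
                return Collections.emptyIterator();
            }
            String[] words = text.trim().split("\\s+");
            // Built eagerly for brevity; a real version could generate lazily.
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i + 1 < words.length; i++) {
                pairs.add(words[i] + " " + words[i + 1]);
            }
            return pairs.iterator();
        }
    }

    /** Hypothetical driver: what the engine would do with each input row. */
    static <I, O> List<O> applyAll(MultiEvalFunc<I, O> func, List<I> rows) {
        List<O> out = new ArrayList<>();
        for (I row : rows) {
            Iterator<O> results = func.exec(row);
            while (results.hasNext()) {
                out.add(results.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a b c", "", "x");
        // Three input rows produce two output rows: [a b, b c]
        System.out.println(applyAll(new Bigrams(), rows));
    }
}
```

In this sketch the empty line and the one-word line simply produce nothing, so
neither a null check nor a bag is needed downstream. Whether outputs should be
materialized eagerly, as here, or generated lazily is left open.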
> EvalFuncs need redesigned
> -------------------------
>
> Key: PIG-2421
> URL: https://issues.apache.org/jira/browse/PIG-2421
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.11
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and the associated Algebraic and Accumulator
> interfaces) has grown unwieldy. In particular, people have noted the
> following issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs always pass a tuple, users are required to manage their own type
> checking for input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different parameters (using
> getArgToFuncMapping) is confusing.
> # Using Algebraic and Accumulator interfaces often entails duplicating code
> from the initial implementation.
> # UDF implementors are exposed to the internals of Pig since they have to
> know when to return a tuple (Initial, Intermediate) and when not to (exec,
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes
> forces code duplication and makes it hard for UDFs in other languages to use
> those interfaces.
> # There is unused code in the current interface that occasionally causes
> confusion (e.g. isAsynchronous)
> Any change must be done in a way that allows existing UDFs to continue
> working essentially forever.