[
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171352#comment-13171352
]
Alan Gates commented on PIG-2421:
---------------------------------
Responses to Dmitriy's comments:
bq. We can easily match through reflection any static methods, and perhaps
object methods for classes with no-arg constructors, invoker-style. This would
allow us to transparently reuse a ton of existing java code without forcing
people to write annotation-laden scaffolding. That means pushing the
dynamicInvoker logic deeper into pig (right now it's essentially just a UDF
hack).
I agree we should more deeply integrate the dynamicInvoker logic. We need to
make it easier to declare (more along the lines of defining a Python UDF), we
need to make it so you can use methods on objects, and we need a way to pass
arguments to the constructors of those objects. But I don't see how that
changes this proposal. I see that as a separate track from this.
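For illustration, matching a static method through reflection might look like
the sketch below. This is not Pig's dynamicInvoker code: the class name
StaticMethodInvoker and its methods are hypothetical, and only the JDK's
java.lang.reflect package is used.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of Pig: binds to a public static method by
// class name, method name, and parameter types, then invokes it on demand.
final class StaticMethodInvoker {
    private final Method method;

    StaticMethodInvoker(String className, String methodName, Class<?>... paramTypes) {
        try {
            Method m = Class.forName(className).getMethod(methodName, paramTypes);
            if (!Modifier.isStatic(m.getModifiers())) {
                throw new IllegalArgumentException(methodName + " is not static");
            }
            this.method = m;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot resolve " + className + "." + methodName, e);
        }
    }

    Object invoke(Object... args) {
        try {
            // A null receiver is correct for static methods.
            return method.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Invocation failed: " + method, e);
        }
    }

    public static void main(String[] args) {
        StaticMethodInvoker abs = new StaticMethodInvoker("java.lang.Math", "abs", int.class);
        System.out.println(abs.invoke(-5)); // prints 5
    }
}
```

Supporting object methods would additionally require constructing a receiver,
which is where the question of passing constructor arguments comes in.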
bq. not sure I'm comfortable with mapper / combiner / reducer annotations, or
initial/intermediate/final. Ideally, for COUNT, for example, we want to be able
to say "COUNT is equivalent to a SUM of COUNTs". As is, you don't allow that to
happen – we have to reimplement for count. Can we allow udf authors to return
to us method pointers?
Using annotations it seems difficult to return method pointers, since
annotation element types are limited to primitives, String, Class, enums,
other annotations, and arrays of those. We could define some pidgin
language where we return a string like
@Intermediate("org.apache.pig.newudf.SUM.exec"), I suppose, but that seems
nasty.
Are there that many cases where there will be crossover for method
implementations? You're right that the proposed annotation method only allows
sharing of methods within a particular UDF, not across UDFs. If we think
sharing across UDFs will be that common, we could allow the UDF classname to be
decoupled from the Pig UDF name, and then in the annotations indicate which
UDFs a particular implementation is for. For example:
{code}
@UDFName({"SUM", "COUNT"})
public class SUMandCOUNT extends EvalFunc {
    ...
    // Initial stage for COUNT: each input value counts as 1.
    @Initial("COUNT")
    public long countInitial(int val) {
        return 1;
    }

    // Shared by the remaining stages of COUNT and by every stage of SUM.
    @Initial("SUM")
    @Intermediate({"COUNT", "SUM"})
    @Final({"COUNT", "SUM"})
    public long verySharedCode(DataBag vals) {
        ...
    }
}
{code}
But I'm not sure this is a frequent enough use case to build the interface
around.
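As a rough sketch of how a runtime could resolve such annotations, the
following uses reflection to find which method serves each stage of a named
UDF. Everything here is an assumption made for illustration: the annotation
definitions, the StageLookup class, and the use of long[] in place of a bag are
hypothetical and are not taken from the attached patches.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class StageLookup {
    // Hypothetical stage annotations; each lists the UDF names a method serves.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Initial { String[] value(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Intermediate { String[] value(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Final { String[] value(); }

    // Stand-in for the SUMandCOUNT example; long[] replaces the bag type.
    static final class SumAndCount {
        @Initial("COUNT")
        public long countInitial(int val) { return 1L; }

        @Initial("SUM") @Intermediate({"COUNT", "SUM"}) @Final({"COUNT", "SUM"})
        public long verySharedCode(long[] vals) {
            long total = 0;
            for (long v : vals) total += v;
            return total;
        }
    }

    // Maps stage name -> method name for the given UDF name.
    static Map<String, String> stagesFor(Class<?> udfClass, String udfName) {
        Map<String, String> stages = new TreeMap<>();
        for (Method m : udfClass.getDeclaredMethods()) {
            Initial init = m.getAnnotation(Initial.class);
            if (init != null && Arrays.asList(init.value()).contains(udfName)) {
                stages.put("initial", m.getName());
            }
            Intermediate inter = m.getAnnotation(Intermediate.class);
            if (inter != null && Arrays.asList(inter.value()).contains(udfName)) {
                stages.put("intermediate", m.getName());
            }
            Final fin = m.getAnnotation(Final.class);
            if (fin != null && Arrays.asList(fin.value()).contains(udfName)) {
                stages.put("final", m.getName());
            }
        }
        return stages;
    }

    public static void main(String[] args) {
        // prints {final=verySharedCode, initial=countInitial, intermediate=verySharedCode}
        System.out.println(stagesFor(SumAndCount.class, "COUNT"));
    }
}
```

The array-valued elements are one way to let a single method serve several
UDFs without repeating an annotation on the same method.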
bq. a lot of the pain we have right now is from not providing proper Context
objects. This forces us into a pretty tight space, design-wise. If we make a
strict contract about when Contexts are passed in and available, we can add
what we add to context easily (definitely, things like the job conf, counter
and logger helpers, exec mode, requested schema, input schema, etc should be in
there. The approach of squirreling things away into the conf on the fe and
unrolling it in the first invocation of exec() on the be is error-prone and
overly complex).
Agreed, as with Julien's comment. I want to spend some more time thinking
about Context objects, what should be there, and when we should pass them.
I'll come back with more proposals there.
bq. it would be awesome if tuples knew their schema, and you could get their
fields by name as well as by index.
I think this is a separate topic.
bq. evalFuncs currently must return 1 row per 1 input row. This leads to
hard-to-explain "filter for nulls" and "return a bag, then flatten" patterns
(the latter is also potentially very expensive memory-wise). Ideally we could
return a plain value, tuple, or bag and have Pig behave as it currently does,
or have evalfuncs return Iterator<value/tuple/bag> and have Pig understand that
means 0 to many results are coming out of the udf.
In the case where more than 1 value comes out, what does Pig do? Put them in a
bag? Auto-flatten them with other elements in the generate? This seems closely
related to the OUTER_FLATTEN work Jonathan is proposing.
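To make the two options concrete, here is a small sketch of an iterator-style
function whose results are either collected into one bag per input row or
flattened into separate output rows. The IteratorFunc interface and the helper
names are hypothetical and plain Java collections stand in for Pig's tuple and
bag types; this is not an existing Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class IteratorUdfSketch {
    // Hypothetical UDF shape: zero to many results per input value.
    interface IteratorFunc<I, O> {
        Iterator<O> exec(I input);
    }

    // Example function: splits a line into whitespace-separated tokens.
    static final IteratorFunc<String, String> TOKENIZE = line -> {
        String trimmed = (line == null) ? "" : line.trim();
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyIterator(); // zero results
        }
        return Arrays.asList(trimmed.split("\\s+")).iterator();
    };

    // Option 1: keep one output per input row, holding all results in a "bag".
    static <I, O> List<List<O>> asBags(IteratorFunc<I, O> func, List<I> rows) {
        List<List<O>> out = new ArrayList<>();
        for (I row : rows) {
            List<O> bag = new ArrayList<>();
            func.exec(row).forEachRemaining(bag::add);
            out.add(bag);
        }
        return out;
    }

    // Option 2: auto-flatten, so each result becomes its own output row.
    static <I, O> List<O> flattened(IteratorFunc<I, O> func, List<I> rows) {
        List<O> out = new ArrayList<>();
        for (I row : rows) {
            func.exec(row).forEachRemaining(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a b", "", "c");
        System.out.println(asBags(TOKENIZE, rows));    // [[a, b], [], [c]]
        System.out.println(flattened(TOKENIZE, rows)); // [a, b, c]
    }
}
```

Note that with flattening the empty input row disappears from the output,
while the bag option keeps it as an empty bag.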
bq. while we are experimenting with annotations, perhaps we can add something
to advanced eval funcs that would let them tell the planner how much data they
are producing? It'd be neat to be able to say that ngramming a text will blow
up the number of records, while counting words will shrink it.
I like the idea, but I think we should wait until we have an optimizer that can
make use of these. Otherwise we won't know what we should and shouldn't
annotate for.
bq. it's currently unclear when it's ok or not ok to reuse tuples. Tuple reuse
is huge for efficiency (and a potential source of many bugs, so it's a
tradeoff).
I'm not clear how this relates to the current topic.
> EvalFuncs need redesigned
> -------------------------
>
> Key: PIG-2421
> URL: https://issues.apache.org/jira/browse/PIG-2421
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.11
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and associated Algebraic and Accumulator
> interfaces) has grown unwieldy. In particular, people have noted the
> following issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs always pass a tuple, users are required to manage their own type
> checking for input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different parameters (using
> getArgToFuncMapping) is confusing.
> # Using Algebraic and Accumulator interfaces often entails duplicating code
> from the initial implementation.
> # UDF implementors are exposed to the internals of Pig since they have to
> know when to return a tuple (Initial, Intermediate) and when not to (exec,
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes
> forces code duplication and makes it hard for UDFs in other languages to use
> those interfaces.
> # There is unused code in the current interface that occasionally causes
> confusion (e.g. isAsynchronous)
> Any change must be done in a way that allows existing UDFs to continue
> working essentially forever.