[ 
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171352#comment-13171352
 ] 

Alan Gates commented on PIG-2421:
---------------------------------


Responses to Dmitriy's comments:

bq.  We can easily match through reflection any static methods, and perhaps 
object methods for classes with no-arg constructors, invoker-style. This would 
allow us to transparently reuse a ton of existing java code without forcing 
people to write annotation-laden scaffolding. That means pushing the 
dynamicInvoker logic deeper into pig (right now it's essentially just a UDF 
hack).

I agree we should more deeply integrate the dynamicInvoker logic.  We need to 
make it easier to declare (more along the lines of defining a Python UDF), we 
need to make it so you can use methods on objects, and we need a way to pass 
arguments to the constructors of those objects.  But I don't see how that 
changes this proposal.  I see that as a separate track from this.
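
For reference, the kind of reflective matching being discussed can be sketched 
roughly as below.  This is only an illustration in plain Java; the class and 
method names (StaticInvokerSketch, invokeStatic) are hypothetical and are not 
Pig's actual invoker code.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of Pig: resolves "fully.qualified.Class.method"
// to a public static method and invokes it reflectively.
class StaticInvokerSketch {
    static Object invokeStatic(String qualifiedName, Class<?>[] paramTypes, Object... args) {
        // Split "pkg.Class.method" at the last dot into owner class and method name.
        int dot = qualifiedName.lastIndexOf('.');
        try {
            Class<?> owner = Class.forName(qualifiedName.substring(0, dot));
            Method method = owner.getMethod(qualifiedName.substring(dot + 1), paramTypes);
            if (!Modifier.isStatic(method.getModifiers())) {
                throw new IllegalArgumentException(qualifiedName + " is not static");
            }
            // A null receiver is allowed because the method is static.
            return method.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke " + qualifiedName, e);
        }
    }
}
```

For example, invokeStatic("java.lang.Math.max", new Class<?>[] {int.class, 
int.class}, 3, 7) calls Math.max(3, 7).  Object methods would additionally need 
an instance, which is where the constructor-argument question comes in.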

bq. not sure I'm comfortable with mapper / combiner / reducer annotations, or 
initial/intermediate/final. Ideally, for COUNT, for example, we want to be able 
to say "COUNT is equivalent to a SUM of COUNTs". As is, you don't allow that to 
happen – we have to reimplement for count. Can we allow udf authors to return 
to us method pointers?

Using annotations it seems difficult to return method pointers, since 
annotation element types are limited to primitives, String, Class, enums, other 
annotations, and arrays of those.  We could define some pidgin language where 
we return a string like @Intermediate("org.apache.pig.newudf.SUM.exec"), I 
suppose, but that seems nasty.  
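
To make the string-based option concrete, here is a rough sketch of what it 
would involve.  Everything in it is hypothetical (IntermediateSketch, 
CountSketch, MethodPointerResolver), and java.lang.Math.addExact stands in for 
a SUM implementation so the example is self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical annotation: the element is a String because an annotation
// element cannot hold a method reference.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface IntermediateSketch {
    String value();
}

// Hypothetical UDF that names another class's method as its intermediate step.
@IntermediateSketch("java.lang.Math.addExact")
class CountSketch {
}

class MethodPointerResolver {
    // Parses "pkg.Class.method" and looks up a (long, long) static method.
    static Method resolve(Class<?> udf) {
        String target = udf.getAnnotation(IntermediateSketch.class).value();
        int dot = target.lastIndexOf('.');
        try {
            Class<?> owner = Class.forName(target.substring(0, dot));
            return owner.getMethod(target.substring(dot + 1), long.class, long.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Bad method pointer: " + target, e);
        }
    }

    // Combines two partial results using the method the annotation points at.
    static long combine(Class<?> udf, long a, long b) {
        try {
            return (Long) resolve(udf).invoke(null, a, b);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The nasty part is visible here: the string is not checked by the compiler, so a 
typo or a signature mismatch only shows up when the name is resolved at runtime.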

Are there that many cases where there will be crossover for method 
implementations?  You're right that the proposed annotation method only allows 
sharing of methods within a particular UDF, not across UDFs.  If we think 
across UDFs will be that common, we could allow the UDF classname to be 
decoupled from the Pig UDF name, and then in the annotations indicate which 
UDFs a particular implementation is for.  For example:

{code}
@UDFName({"SUM", "COUNT"})
public class SUMandCOUNT extends EvalFunc {
...
    @Initial("COUNT")
    public long countInitial(int val) {
        return 1;
    }

    @Intermediate("COUNT")
    @Final("COUNT")
    @Initial("SUM")
    @Intermediate("SUM")
    @Final("SUM")
    public long verySharedCode(DataBag vals) {
        ...
    }
}
{code}

But I'm not sure this is a frequent enough use case to build the interface 
around.
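
Stripped of the annotations, the sharing in question ("COUNT is equivalent to a 
SUM of COUNTs") looks roughly like the sketch below.  It is plain Java with 
hypothetical names and no Pig types; countSplit just simulates what one map 
task plus combiner would do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names exist in Pig.
class CountAsSumSketch {
    // Initial stage for COUNT: each input value contributes 1, whatever it is.
    static long countInitial(Object val) {
        return 1L;
    }

    // Shared stage: adds up partial results. It serves as Intermediate and
    // Final for COUNT, and as every stage for SUM.
    static long sumPartials(List<Long> partials) {
        long total = 0L;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Simulates one map task: Initial per value, then one combining pass.
    static long countSplit(List<?> split) {
        List<Long> ones = new ArrayList<>();
        for (Object val : split) {
            ones.add(countInitial(val));
        }
        return sumPartials(ones);
    }
}
```

Only countInitial is specific to COUNT; the rest is SUM's code reused as is.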

bq.  a lot of the pain we have right now is from not providing proper Context 
objects. This forces us into a pretty tight space, design-wise. If we make a 
strict contract about when Contexts are passed in and available, we can add 
what we add to context easily (definitely, things like the job conf, counter 
and logger helpers, exec mode, requested schema, input schema, etc should be in 
there. The approach of squirreling things away into the conf on the fe and 
unrolling it in the first invocation of exec() on the be is error-prone and 
overly complex).

Agreed, as with Julien's comment.  I want to spend some more time thinking 
about Context objects, what should be there, and when we should pass them.  
I'll come back with more proposals there.

bq. it would be awesome if tuples knew their schema, and you could get their 
fields by name as well as by index.

I think this is a separate topic.

bq. evalFuncs currently must return 1 row per 1 input row. This leads to 
hard-to-explain "filter for nulls" and "return a bag, then flatten" patterns 
(the latter is also potentially very expensive memory-wise). Ideally we could 
return a plain value, tuple, or bag and have Pig behave as it currently does, 
or have evalfuncs return Iterator<value/tuple/bag> and have Pig understand that 
means 0 to many results are coming out of the udf.

In the case where more than 1 value comes out, what does Pig do?  Put them in a 
bag?  Auto-flatten them with other elements in the generate?  This seems 
closely related to the OUTER_FLATTEN work Jonathan is proposing.
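
As a point of reference for that question, the auto-flatten reading could be 
sketched as below.  This is only one of the possible semantics, written in 
plain Java with hypothetical names (IteratorUdfSketch, tokenize, 
applyFlattened); it does not show how the emitted values would combine with 
other elements in the generate.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class IteratorUdfSketch {
    // Hypothetical UDF: emits zero or more tokens per input line, so a blank
    // line produces no output instead of a null that must be filtered later.
    static Iterator<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out.iterator();
    }

    // Auto-flatten semantics: every value the iterator yields becomes its own
    // output row, and an empty iterator drops the input row entirely.
    static <I, O> List<O> applyFlattened(List<I> rows, Function<I, Iterator<O>> udf) {
        List<O> result = new ArrayList<>();
        for (I row : rows) {
            Iterator<O> it = udf.apply(row);
            while (it.hasNext()) {
                result.add(it.next());
            }
        }
        return result;
    }
}
```

With inputs "a b", "" and "c", applyFlattened yields three rows (a, b, c).  The 
bag alternative would instead collect each iterator into one bag per input row.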

bq. while we are experimenting with annotations, perhaps we can add something 
to advanced eval funcs that would let them tell the planner how much data they 
are producing? It'd be neat to be able to say that ngramming a text will blow 
up the number of records, while counting words will shrink it.

I like the idea, but I think we should wait until we have an optimizer that can 
make use of these.  Otherwise we won't know what we should and shouldn't 
annotate for.

bq. it's currently unclear when it's ok or not ok to reuse tuples. Tuple reuse 
is huge for efficiency (and a potential source of many bugs, so it's a 
tradeoff).

I'm not clear how this relates to the current topic.

                
> EvalFuncs need redesigned
> -------------------------
>
>                 Key: PIG-2421
>                 URL: https://issues.apache.org/jira/browse/PIG-2421
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and associated Algebraic and Accumulator 
> interfaces) have grown unwieldy.  In particular, people have noted the 
> following issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs always pass a tuple, users are required to manage their own type 
> checking for input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different parameters (using 
> getArgToFuncMapping) is confusing.
> # Using Algebraic and Accumulator interfaces often entails duplicating code 
> from the initial implementation.
> # UDF implementors are exposed to the internals of Pig since they have to 
> know when to return a tuple (Initial, Intermediate) and when not to (exec, 
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes 
> forces code duplication and makes it hard for UDFs in other languages to use 
> those interfaces.
> # There is unused code in the current interface that occasionally causes 
> confusion (e.g. isAsynchronous)
> Any change must be done in a way that allows existing UDFs to continue 
> working essentially forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

