[ 
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171443#comment-13171443
 ] 

Jonathan Coveney commented on PIG-2421:
---------------------------------------

I think this is a super key thing to do as far as making Pig more extensible 
and usable. I think that the annotation stuff is nice, but it seems geared 
towards making it easy to write a UDF with as few lines as possible, which I 
don't necessarily think should be the goal...instead, I think that the goal 
should be a rock solid EvalFunc base which meets a set of goals we lay out, one 
of which should be allowing us to extend it to provide nice annotations and 
whatnot. Here are some worthwhile goals, which relate to what we have above 
(not in any particular order):

1. Maximize code reuse. Alan, you asked if this is a thing, and it absolutely 
is. Right now you have to jump through many hoops in order to reuse code, and 
even in the builtin stuff, it's quite patchy (just look at Dmitriy's recent 
commit of -1500 lines of math function code, and the fact that more could be 
done still). I think the goal should be to make it so that it is really easy to 
build new UDFs using existing functionality in an elegant way. I agree with 
Dmitriy that using annotations could make that difficult.
2. Making it MUCH more explicit what is happening where. Look at Julien's 
example, which checks if it is on the frontend or not...I agree with Alan that 
things like this could be split out, or at least made much clearer. You could 
have a frontendInit, frontendFinalize, backendInit, backendFinalize. Once 
again, we could provide a much simpler "SimpleEvalFunc" that doesn't have all 
this, but I think part of the current problem with pig is that you have to jump 
through a ton of hoops to do somewhat reasonable things, and we should 
facilitate those things, because they will ultimately enable more elegant 
solutions (instead of manually serializing things in crazy places, etc).
3. Directly relating to the above, it should be dead easy to pass information 
between the front end and the back end.
4. Allowing functions to both receive and return iterators seems very usable, 
and would cut down on the same "get first element cast to bag iterate over bag" 
that's in every script, and give us Accumulative UDFs for free. It would be 
nice to have one -> many functions where a given row may result in 0 or more 
results. Alan, your point is a good one, but I think we can present it in such 
a way that it's clear when a bag is being returned and when it isn't. Flattening 
of course gives the same result, but I am under the impression that by 
returning iterators we could be much more efficient about it. Maybe the 
solution is to have an OUTER_FLATTEN, and be smart about how we generate and 
flatten the intermediate data. I don't know much about how pig currently deals 
with that sort of thing.
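
To make points 2 through 4 a bit more concrete, here is a rough sketch of what 
such a base class might look like. Everything in it is hypothetical: 
LifecycleEvalFunc, Tokenize, the four hook names, the plain Map used to carry 
settings, and the "tokenize.delimiter" key are made up for illustration and are 
not part of Pig or of the attached patches. It assumes the front end and back 
end share a string map, and that exec takes and returns iterators so one input 
row can yield zero or more results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.regex.Pattern;

// Hypothetical base class (not a Pig API): explicit hooks for each side,
// plus an iterator-in / iterator-out exec.
abstract class LifecycleEvalFunc<I, O> {
    // Front end: called once at planning time. Anything the back end
    // needs is written into conf here.
    void frontendInit(Map<String, String> conf) {}
    void frontendFinalize() {}

    // Back end: called once per task with the map the front end filled in.
    void backendInit(Map<String, String> conf) {}
    void backendFinalize() {}

    // One -> many: consumes input rows lazily, yields 0 or more results.
    abstract Iterator<O> exec(Iterator<I> input);
}

// Hypothetical UDF: splits each row on a delimiter chosen on the front end.
class Tokenize extends LifecycleEvalFunc<String, String> {
    private String delimiter;

    @Override
    void frontendInit(Map<String, String> conf) {
        conf.put("tokenize.delimiter", ",");
    }

    @Override
    void backendInit(Map<String, String> conf) {
        delimiter = conf.get("tokenize.delimiter");
    }

    @Override
    Iterator<String> exec(final Iterator<String> input) {
        final Pattern splitter = Pattern.compile(Pattern.quote(delimiter));
        return new Iterator<String>() {
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Pull rows until one produces a token or the input runs out.
                while (!current.hasNext() && input.hasNext()) {
                    String row = input.next();
                    if (!row.isEmpty()) {
                        current = Arrays.asList(splitter.split(row)).iterator();
                    }
                }
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }
}
```

A driver would call frontendInit on one instance, ship the map, then call 
backendInit and exec on another instance. No intermediate bag is ever 
materialized in this sketch, which is the efficiency I am hoping for, though I 
have not measured anything.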

Thejas: I think that a Bag of any type would be neat. Basically, other 
spillable data structures would be cool, and potentially a PrimitiveBag could 
see the exact same benefits that Dmitriy's PrimitiveTuples see.
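
As a very rough illustration of what I mean by a primitive bag, here is a 
minimal sketch. LongBag is a hypothetical name, not an existing Pig class, and 
this version is in-memory only: the spilling part, which is the harder piece, 
is left out entirely. The point is just that the values live in a long[] 
rather than being boxed one object per element.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical sketch (not a Pig class): a bag of longs backed by a
// growable long[], so no element is ever boxed. Spilling is omitted.
final class LongBag {
    private long[] values = new long[16];
    private int size = 0;

    void add(long v) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // double when full
        }
        values[size++] = v;
    }

    int size() {
        return size;
    }

    PrimitiveIterator.OfLong iterator() {
        return new PrimitiveIterator.OfLong() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < size;
            }

            @Override
            public long nextLong() {
                if (pos >= size) throw new NoSuchElementException();
                return values[pos++];
            }
        };
    }
}
```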
                
> EvalFuncs need redesigned
> -------------------------
>
>                 Key: PIG-2421
>                 URL: https://issues.apache.org/jira/browse/PIG-2421
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and associated Algebraic and Accumulator 
> interfaces) has grown unwieldy.  In particular, people have noted the 
> following issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs always pass a tuple, users are required to manage their own type 
> checking for input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different parameters (using 
> getArgToFuncMapping) is confusing.
> # Using Algebraic and Accumulator interfaces often entails duplicating code 
> from the initial implementation.
> # UDF implementors are exposed to the internals of Pig since they have to 
> know when to return a tuple (Initial, Intermediate) and when not to (exec, 
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes 
> forces code duplication and makes it hard for UDFs in other languages to use 
> those interfaces.
> # There is unused code in the current interface that occasionally causes 
> confusion (e.g. isAsynchronous)
> Any change must be done in a way that allows existing UDFs to continue 
> working essentially forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
