[
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171443#comment-13171443
]
Jonathan Coveney commented on PIG-2421:
---------------------------------------
I think this is a super key thing to do as far as making pig more extensible
and usable. I think that the annotation stuff is nice, but it seems geared
towards making it easy to write a UDF with as few lines as possible, which I
don't necessarily think should be the goal...instead, I think that the goal
should be a rock solid EvalFunc base which meets a set of goals we lay out, one
of which should be allowing us to extend it to provide nice annotations and
whatnot. Here are some worthwhile goals, which relate to what we have above
(not in any particular order):
1. Maximize code reuse. Alan, you asked if this is a thing, and it absolutely
is. Right now you have to jump through many hoops in order to reuse code, and
even in the builtin stuff, it's quite patchy (just look at Dmitriy's recent
commit of -1500 lines of math function code, and the fact that more could be
done still). I think the goal should be to make it so that it is really easy to
build new UDFs using existing functionality in an elegant way. I agree with
Dmitriy that using annotations could make that difficult.
2. Making it MUCH more explicit what is happening where. Look at Julien's
example, which checks if it is on the frontend or not...I agree with Alan that
things like this could be split out, or at least made much clearer. You could
have a frontendInit, frontendFinalize, backendInit, backendFinalize. Once
again, we could provide a much simpler "SimpleEvalFunc" that doesn't have all
this, but I think part of the current problem with pig is that you have to jump
through a ton of hoops to do somewhat reasonable things, and we should
facilitate those things, because they will ultimately enable more elegant
solutions (instead of manually serializing things in crazy places, etc).
3. Directly relating to the above, it should be dead easy to pass information
between the front end and the back end.
4. Allowing functions to both receive and return iterators seems very usable,
and would cut down on the same "get first element, cast to bag, iterate over
bag"
that's in every script, and give us Accumulative UDFs for free. It would be
nice to have one -> many functions where a given row may result in 0 or more
results. Alan, your point is a good one, but I think we can present it in such
a way that it's clear when a bag is being returned and when it isn't. Flattening
of course gives the same result, but I am under the impression that by
returning iterators we could be much more efficient about it. Maybe the
solution is to have an OUTER_FLATTEN, and be smart about how we generate and
flatten the intermediate data. I don't know much about how pig currently deals
with that sort of thing.
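To make goal 4 a bit more concrete, here is a rough Java sketch of a function
that receives an iterator and returns an iterator, so that one input row can
produce zero or more output rows without building an intermediate bag. This is
only an illustration under assumptions: IteratorFunc, Tokenize and
IteratorDemo are hypothetical names that do not exist in Pig, and plain
Strings stand in for tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical interface (not part of Pig): consumes rows lazily and may
// emit zero or more results per input row.
interface IteratorFunc<I, O> {
    Iterator<O> exec(Iterator<I> input);
}

// Example one -> many function: splits each line into words.
// A blank line contributes no output rows at all.
class Tokenize implements IteratorFunc<String, String> {
    @Override
    public Iterator<String> exec(final Iterator<String> lines) {
        return new Iterator<String>() {
            private Iterator<String> current = Collections.emptyIterator();

            // Pull input lines only when the words of the current line run out.
            private void advance() {
                while (!current.hasNext() && lines.hasNext()) {
                    String line = lines.next().trim();
                    current = line.isEmpty()
                            ? Collections.<String>emptyIterator()
                            : Arrays.asList(line.split("\\s+")).iterator();
                }
            }

            @Override
            public boolean hasNext() {
                advance();
                return current.hasNext();
            }

            @Override
            public String next() {
                advance();
                if (!current.hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}

// Toy caller standing in for the framework: drains the result iterator.
class IteratorDemo {
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Because the result is produced lazily, the caller decides whether to
materialize it as a bag or stream it straight into the next operator, which is
where the hoped-for efficiency over flattening would have to come from.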
Thejas: I think that a Bag of any type would be neat. Basically, other
spillable data structures would be cool, and potentially a PrimitiveBag could
see the exact same benefits that Dmitriy's PrimitiveTuples see.
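Going back to goals 2 and 3, a minimal Java sketch of what explicit lifecycle
hooks plus a simple front end to back end hand-off might look like. Everything
here is an assumption for illustration: LifecycleEvalFunc, Scale,
LifecycleDemo and the "scale.factor" key are made-up names, not Pig APIs, and
a java.util.Properties object stands in for whatever the framework would
actually ship between the two sides.

```java
import java.util.Properties;

// Hypothetical base class (not Pig's EvalFunc): each phase has its own hook,
// so a UDF never has to check whether it is running on the frontend.
abstract class LifecycleEvalFunc<I, O> {
    // Planning time; values written to conf are assumed to reach the backend.
    protected void frontendInit(Properties conf) {}
    protected void frontendFinalize(Properties conf) {}
    // Once per task, before the first row, with the values the frontend wrote.
    protected void backendInit(Properties conf) {}
    protected void backendFinalize() {}

    public abstract O exec(I input);
}

// Example UDF: the frontend picks a factor, the backend reads it back.
class Scale extends LifecycleEvalFunc<Double, Double> {
    private double factor;

    @Override
    protected void frontendInit(Properties conf) {
        conf.setProperty("scale.factor", "2.5");
    }

    @Override
    protected void backendInit(Properties conf) {
        factor = Double.parseDouble(conf.getProperty("scale.factor"));
    }

    @Override
    public Double exec(Double input) {
        return input * factor;
    }
}

// Toy driver standing in for the framework.
class LifecycleDemo {
    static double run(double x) {
        Properties conf = new Properties();
        Scale frontend = new Scale();
        frontend.frontendInit(conf);
        frontend.frontendFinalize(conf);

        // Simulate shipping the properties to a fresh backend instance.
        Properties shipped = new Properties();
        shipped.putAll(conf);

        Scale backend = new Scale();
        backend.backendInit(shipped);
        double result = backend.exec(x);
        backend.backendFinalize();
        return result;
    }
}
```

The point of the sketch is only that the UDF author overrides the hooks they
need and never serializes state by hand; a "SimpleEvalFunc" could hide all
four hooks behind empty defaults.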
> EvalFuncs need redesigned
> -------------------------
>
> Key: PIG-2421
> URL: https://issues.apache.org/jira/browse/PIG-2421
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.11
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and associated Algebraic and Accumulator
> interfaces) have grown unwieldy. In particular, people have noted the
> following issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs always pass a tuple, users are required to manage their own type
> checking for input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different parameters (using
> getArgToFuncMapping) is confusing.
> # Using Algebraic and Accumulator interfaces often entails duplicating code
> from the initial implementation.
> # UDF implementors are exposed to the internals of Pig since they have to
> know when to return a tuple (Initial, Intermediate) and when not to (exec,
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes
> forces code duplication and makes it hard for UDFs in other languages to use
> those interfaces.
> # There is unused code in the current interface that occasionally causes
> confusion (e.g. isAsynchronous)
> Any change must be done in a way that allows existing UDFs to continue
> working essentially forever.