[
https://issues.apache.org/jira/browse/PIG-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171327#comment-13171327
]
Alan Gates commented on PIG-2421:
---------------------------------
Responses to Julien's comments above:
bq. Not to have to extend EvalFunc at all: the udf context is provided through
a @Context annotation and is different per call of the UDF (not per FuncSpec).
The UDF context also specifies if we are in the frontend or backend and
provides methods for optional information passed by the UDF (output schema,
...) and access to the distributed cache (name-spaced by the UDF context).
What's the value of not extending EvalFunc? Your classes still have an init
method, which you'll have to find through reflection. It seems like you're
routing around Java's class hierarchy here.
I also don't understand what the @Context annotation actually does. And I'm not
clear what "a @Context annotation that is different per call of the UDF (not
per FuncSpec)" means.
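To make the question concrete, here is one way the proposal could be read, purely as a
sketch: the @Context annotation, the UdfCallContext type, and the reflective discovery
of init() and call() are all hypothetical names, not existing Pig APIs.
{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation and context type, sketching one reading of the
// proposal; neither exists in Pig today.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Context {}

interface UdfCallContext {
    boolean isFrontend();      // frontend vs. backend, per the proposal
    // plus accessors for the output schema, the distributed cache, etc.
}

// The UDF extends nothing; Pig would have to inject the context into the
// annotated field and locate init() and call() via reflection.
public class UpperCase {

    @Context
    private UdfCallContext context;

    public void init() {
        // one-time setup Pig would also have to discover reflectively
    }

    public String call(String s) {
        return s == null ? null : s.toUpperCase();
    }
}
{code}
If that reading is right, the reflective discovery of init() and call() is exactly what I
mean by routing around the class hierarchy.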
I do like expanding the context object to contain more information we need to
pass, including inbound schema and backend vs. frontend. I'm less sure about
the distributed cache. I think it's much easier to make these explicit methods
in the interface rather than hide everything in a config object. One of the
most confusing things about the Hadoop interface is that lots of things are
just "values" in a config object, and don't show up in the API docs anywhere.
I guess if we expanded the config object to have methods like "addFileToCache"
and "fetchFileFromCache", then it's fine.
bq. I was trying to define schema-aware Tuples but did not get there yet.
I think this is separable from what I'm proposing here.
> EvalFuncs need redesigned
> -------------------------
>
> Key: PIG-2421
> URL: https://issues.apache.org/jira/browse/PIG-2421
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.11
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: PIG-newudf.patch, examples.patch
>
>
> The current EvalFunc interface (and the associated Algebraic and Accumulator
> interfaces) has grown unwieldy. In particular, people have noted the following
> issues:
> # Writing a UDF requires a lot of boilerplate code.
> # Since UDFs are always passed a plain tuple, users are required to manage their
> own type checking for the input.
> # Declaring schemas for output data is confusing.
> # Writing a UDF that accepts multiple different input types (using
> getArgToFuncMapping) is confusing.
> # Using the Algebraic and Accumulator interfaces often entails duplicating code
> from the initial implementation (see the sketch below).
> # UDF implementors are exposed to the internals of Pig since they have to
> know when to return a tuple (Initial, Intermediate) and when not to (exec,
> Final).
> # The separation of Initial, Intermediate, and Final into separate classes
> forces code duplication and makes it hard for UDFs in other languages to use
> those interfaces.
> # There is unused code in the current interface that occasionally causes
> confusion (e.g., isAsynchronous).
> Any change must be done in a way that allows existing UDFs to continue
> working essentially forever.
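For concreteness, here is a sketch of the boilerplate the Algebraic items in the list
above describe, modeled loosely on the built-in COUNT. The class MyCount and its method
bodies are illustrative, but EvalFunc, Algebraic, and the Initial/Intermediate/Final
pattern are the current API.
{code:java}
import java.io.IOException;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// The counting logic ends up written for three phases, and the implementor has
// to know that Initial and Intermediate must wrap their results in a Tuple while
// exec and Final must not, which is the duplication and exposure of internals
// noted in the list above.
public class MyCount extends EvalFunc<Long> implements Algebraic {

    private static final TupleFactory tf = TupleFactory.getInstance();

    @Override
    public Long exec(Tuple input) throws IOException {
        return count(input);                  // non-combined path returns a plain Long
    }

    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermediate.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    public static class Initial extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(count(input)); // must know to wrap in a Tuple here
        }
    }

    public static class Intermediate extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(sum(input));   // and here
        }
    }

    public static class Final extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            return sum(input);                // but not here
        }
    }

    private static long count(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0); // no type checking beyond the cast
        return values.size();
    }

    private static long sum(Tuple input) throws IOException {
        long total = 0;
        for (Tuple t : (DataBag) input.get(0)) {
            total += (Long) t.get(0);
        }
        return total;
    }
}
{code}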