[
https://issues.apache.org/jira/browse/PIG-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606524#action_12606524
]
Alan Gates commented on PIG-276:
--------------------------------
Ideally we would like to support full function overloading for UDFs. In the
meantime, we need a way to allow some highly used UDFs to have separate
implementations based on input types. There are two reasons for this:
# Obeying the principle of least astonishment. Users don't expect SUM(int) to
return a double.
# Performance. Some crude testing showed that summing longs was 10x faster
than summing doubles. As some of these builtin functions are very frequently
used, optimizing them is a worthwhile endeavor.
Based on discussions on PIG-162, I propose the following changes:
It will be possible to specify an implementation of EvalFunc for each type. In
the default implementation (such as SUM) there will be a method:
Class classForType(byte type); // uses DataType types
Given a type, this method will return the appropriate extension of EvalFunc to
be used. This will require the following changes:
# The EvalFunc class will need to have this method added. It should have a
default implementation that returns null.
# The type checker will need to be changed to call classForType as part of
checking LOUserFunc. If classForType returns anything other than null, it will
need to change mFuncSpec in LOUserFunc. Currently, the parser does some checks
on the function when it loads it (makes sure we can load the indicated class,
etc.). This should be factored out and put in LOUserFunc (or a helper class) so
that the type checker can do the same checks after it swaps the function. Also,
LOUserFunc should change to keep a reference to the actual UDF (which the
parser instantiates), so the type checker doesn't have to instantiate it again.
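The lookup and swap described above can be sketched as follows. This is only an illustration under stated assumptions, not Pig code: the nested EvalFunc class is a simplified stand-in for the real one, the byte constants are arbitrary values rather than the actual DataType constants, and the resolve() helper is a hypothetical name for the step the type checker would perform before updating mFuncSpec.

```java
import java.util.Arrays;
import java.util.List;

public class ClassForTypeSketch {
    // Stand-ins for the DataType byte constants; these values are arbitrary.
    static final byte INTEGER = 1, LONG = 2, FLOAT = 3, DOUBLE = 4, CHARARRAY = 5;

    // Simplified stand-in for EvalFunc (not the real Pig class).
    static abstract class EvalFunc<T> {
        abstract T exec(List<? extends Number> input);

        // Default from the proposal: null means "no specialized implementation".
        Class<? extends EvalFunc<?>> classForType(byte type) {
            return null;
        }
    }

    static class LongSum extends EvalFunc<Long> {
        Long exec(List<? extends Number> input) {
            long total = 0;
            for (Number n : input) total += n.longValue();
            return total;
        }
    }

    static class DoubleSum extends EvalFunc<Double> {
        Double exec(List<? extends Number> input) {
            double total = 0;
            for (Number n : input) total += n.doubleValue();
            return total;
        }
    }

    // The default SUM: keeps the existing "always a double" behaviour and
    // advertises the specialized classes through classForType.
    static class Sum extends EvalFunc<Double> {
        Double exec(List<? extends Number> input) {
            double total = 0;
            for (Number n : input) total += n.doubleValue();
            return total;
        }

        @Override
        Class<? extends EvalFunc<?>> classForType(byte type) {
            switch (type) {
                case INTEGER:
                case LONG:
                    return LongSum.class;   // ints are summed as longs
                case FLOAT:
                case DOUBLE:
                    return DoubleSum.class; // floats are summed as doubles
                default:
                    return null;
            }
        }
    }

    // Hypothetical type-checker step: swap in the specialized class if one exists.
    static EvalFunc<?> resolve(EvalFunc<?> declared, byte inputType) {
        Class<? extends EvalFunc<?>> specialized = declared.classForType(inputType);
        if (specialized == null) {
            return declared; // nothing to swap, keep the function as written
        }
        try {
            return specialized.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + specialized, e);
        }
    }

    public static void main(String[] args) {
        EvalFunc<?> f = resolve(new Sum(), INTEGER);
        System.out.println(f.getClass().getSimpleName() + " -> " + f.exec(Arrays.asList(1, 2, 3)));
    }
}
```

In this sketch the swap instantiates the specialized class reflectively; in the proposal the same load checks the parser performs today would run at that point.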
As for builtins, we need to implement the following specialized functions:
|| External name || input type || output type || mapped to || comments ||
| SUM | long | long | longSum | will handle sum of ints too |
| SUM | double | double | doubleSum | will handle sum of floats too |
| MIN | int | int | intMin | |
| MIN | long | long | longMin | |
| MIN | float | float | floatMin | |
| MIN | double | double | doubleMin | |
| MIN | chararray | chararray | charMin | |
| MIN | bytearray | bytearray | byteMin | |
| MAX | int | int | intMax | |
| MAX | long | long | longMax | |
| MAX | float | float | floatMax | |
| MAX | double | double | doubleMax | |
| MAX | chararray | chararray | charMax | |
| MAX | bytearray | bytearray | byteMax | |
| AVG | long | double | longAvg | will handle avg of ints too |
| AVG | double | double | doubleAvg | will handle avg of floats too |
| concat | chararray | chararray | charConcat | new function to concatenate strings |
| concat | bytearray | bytearray | byteConcat | new function to concatenate bytearrays |
| size | bag | long | bagSize | returns number of tuples |
| size | tuple | long | tupleSize | returns number of elements |
| size | map | long | mapSize | returns number of keys |
| size | chararray | long | charSize | returns number of characters in chararray |
| size | bytearray | long | byteSize | returns number of bytes in bytearray |
The existing versions of SUM, MIN, MAX, and AVG will need to implement the
classForType method. Default versions of concat and size will need to be
implemented that also implement the classForType method. The default
implementations of eval for these two new functions should just error out.
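As a rough illustration of such a default for size, the sketch below pairs a classForType dispatch with an exec that simply fails. It rests on assumptions: EvalFunc is again a simplified stand-in, the byte constants are arbitrary, Java's String, byte[] and Map stand in for chararray, bytearray and map, and the bag and tuple variants are left out for brevity.

```java
import java.util.Map;

public class SizeDefaultSketch {
    // Stand-ins for the DataType byte constants; these values are arbitrary.
    static final byte MAP = 1, TUPLE = 2, BAG = 3, CHARARRAY = 4, BYTEARRAY = 5;

    // Simplified stand-in for EvalFunc (not the real Pig class).
    static abstract class EvalFunc<T> {
        abstract T exec(Object input);

        Class<? extends EvalFunc<?>> classForType(byte type) {
            return null;
        }
    }

    static class CharSize extends EvalFunc<Long> {
        Long exec(Object input) {
            // String.length() counts UTF-16 code units.
            return (long) ((String) input).length();
        }
    }

    static class ByteSize extends EvalFunc<Long> {
        Long exec(Object input) {
            return (long) ((byte[]) input).length;
        }
    }

    static class MapSize extends EvalFunc<Long> {
        Long exec(Object input) {
            return (long) ((Map<?, ?>) input).size();
        }
    }

    // Default size: only useful as a dispatch point, so exec errors out.
    static class Size extends EvalFunc<Long> {
        Long exec(Object input) {
            throw new UnsupportedOperationException(
                "size has no untyped implementation; use classForType");
        }

        @Override
        Class<? extends EvalFunc<?>> classForType(byte type) {
            switch (type) {
                case CHARARRAY: return CharSize.class;
                case BYTEARRAY: return ByteSize.class;
                case MAP:       return MapSize.class;
                default:        return null; // bag and tuple omitted in this sketch
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new Size().classForType(CHARARRAY).getSimpleName());
        System.out.println(new CharSize().exec("pig"));
    }
}
```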
> Allow UDFs to have different implementations based on input types
> -----------------------------------------------------------------
>
> Key: PIG-276
> URL: https://issues.apache.org/jira/browse/PIG-276
> Project: Pig
> Issue Type: Sub-task
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
>