[jira] Updated: (PIG-276) Allow UDFs to have different implementations based on input types

Pradeep Kamath (JIRA) Mon, 07 Jul 2008 11:34:58 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pradeep Kamath updated PIG-276:
-------------------------------

    Attachment: udf_funcspec_src.patch

Attached patch contains the modifications in the "src" part of the code to 
support new design for EvalFunc to handle different input types. 
New design - quoting Alan:
{quote}
With the introduction of types (see
http://issues.apache.org/jira/browse/PIG-157) we need to decide how EvalFunc 
will interact with the types.  The original proposal was that the DEFINE 
keyword would be modified to allow specification of types for the UDF.  This 
has a couple of problems.  One, DEFINE is already used to specify constructor 
arguments.  Using it to also specify types will be confusing.  Two, it has been 
pointed out that this type information is a property of the UDF and should 
therefore be declared by the UDF, not in the script.

Separately, as a way to allow simple function overloading, a change had been 
proposed to the EvalFunc interface to allow an EvalFunc to specify that for a 
given type, a different instance of EvalFunc should be used (see 
https://issues.apache.org/jira/browse/PIG-276).

I would like to propose that we expand the changes in PIG-276 to be more 
general.  Rather than adding classForType() as proposed in PIG-276, EvalFunc 
will instead add a function:

public Map<Schema, FuncSpec> getArgToFuncMapping() {
    return null;
}

Where FuncSpec is a new class that contains the name of the class that 
implements the UDF along with any necessary arguments for the constructor.

The type checker will then, as part of type checking LOUserFunc make a call to 
this function.  If it receives a null, it will simply leave the UDF as is, and 
make the assumption that the UDF can handle whatever datatype is being provided 
to it.  This will cover most existing UDFs, which will not override the default 
implementation.

If a UDF wants to override the default, it should return a map that gives a 
FuncSpec for each type of schema that it can support.  For example, for the UDF 
concat, the map would have two entries:
key: schema(chararray, chararray) value: StringConcat
key: schema(bytearray, bytearray) value: ByteConcat

The type checker will then take the schema of what is being passed to it and 
perform a lookup in the map.  If it finds an entry, it will use the associated 
FuncSpec.  If it does not, it will throw an exception saying that that EvalFunc 
cannot be used with those types.

At this point, the type checker will make no effort to find a best fit 
function.  Either the fit is perfect, or it will not be done.  In the future we 
would like to modify the type checker to select a best fit.  
For example, if a UDF says it can handle schema(long) and the type checker 
finds it has schema(int), it can insert a cast to deal with that.  But in the 
first pass we will ignore this and depend on the user to insert the casts.

{quote}

One change to the above proposal is the return type of 
getArgToFuncMapping():

{code}

public List<FuncSpec> getArgToFuncMapping() {
    return null;
}

{code}

The FuncSpec class will also have a schema member to hold the schema of the 
input arguments supported by a given FuncSpec object, so the 
TypeCheckingVisitor will iterate over the List<FuncSpec> to look for a 
FuncSpec whose schema matches the schema of the input args it has.
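
To make the lookup concrete, here is a rough sketch of the exact-match rule 
(null list means "leave the UDF as is", otherwise find a perfect fit or fail). 
It is not the code in the patch: SketchFuncSpec, resolve() and the use of a 
list of type names in place of Schema are all hypothetical stand-ins chosen so 
that the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;

public class FuncSpecMatchSketch {

    // Hypothetical stand-in for FuncSpec: the implementing class name plus the
    // input schema it supports. A list of type names stands in for Schema.
    static final class SketchFuncSpec {
        final String className;
        final List<String> inputSchema;

        SketchFuncSpec(String className, List<String> inputSchema) {
            this.className = className;
            this.inputSchema = inputSchema;
        }
    }

    // Picks the class to run for the given input schema.
    // mapping == null models a UDF that keeps the default getArgToFuncMapping().
    static String resolve(String originalClass,
                          List<SketchFuncSpec> mapping,
                          List<String> actualSchema) {
        if (mapping == null) {
            // No mapping declared: assume the UDF handles any input type.
            return originalClass;
        }
        for (SketchFuncSpec spec : mapping) {
            // Exact match only; no best-fit search and no inserted casts.
            if (spec.inputSchema.equals(actualSchema)) {
                return spec.className;
            }
        }
        throw new IllegalArgumentException(
            originalClass + " cannot be used with input types " + actualSchema);
    }

    public static void main(String[] args) {
        List<SketchFuncSpec> concatMapping = Arrays.asList(
            new SketchFuncSpec("StringConcat", Arrays.asList("chararray", "chararray")),
            new SketchFuncSpec("ByteConcat", Arrays.asList("bytearray", "bytearray")));

        System.out.println(resolve("CONCAT", concatMapping,
            Arrays.asList("bytearray", "bytearray")));
        System.out.println(resolve("MyUdf", null, Arrays.asList("int")));
    }
}
```

With the concat mapping above, a (long, long) input would raise the exception 
rather than be cast, which matches the "perfect fit or nothing" rule in the 
quoted proposal.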

Some other observations:
   * In AVG, if there are some null inputs, these will contribute to the 
"count" in the average but will be treated as 0 in the "sum" needed for the 
average
   * SUM, AVG, MIN and MAX on DataByteArrays (i.e. input with no type 
specified) will compute the function by converting the input to Double (the 
input will not be permanently cast - a Double copy of the input will be used 
for the computations)
   * SIZE and CONCAT will return null if *either* of their inputs is null
Deprecation:
bq.
@Deprecated
    public void registerFunction(String function, String functionSpec) 
in favor of:
    public void registerFunction(String function, FuncSpec funcSpec) 

A patch covering the changes in the tests will be attached separately.


> Allow UDFs to have different implementations based on input types
> -----------------------------------------------------------------
>
>                 Key: PIG-276
>                 URL: https://issues.apache.org/jira/browse/PIG-276
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>         Attachments: EvalFunc.patch, EvalFunc_Combined.patch, 
> EvalFunc_unittestcases.patch, udf_funcspec_src.patch, udf_funcspec_tests.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
