With the introduction of types (see
http://issues.apache.org/jira/browse/PIG-157) we need to decide how
EvalFunc will interact with the types. The original proposal was that
the DEFINE keyword would be modified to allow specification of types for
the UDF. This has a couple of problems. One, DEFINE is already used to
specify constructor arguments; using it to also specify types would be
confusing. Two, it has been pointed out that this type information is a
property of the UDF and should therefore be declared by the UDF itself,
not in the script.
Separately, as a way to allow simple function overloading, a change had
been proposed to the EvalFunc interface to allow an EvalFunc to specify
that for a given type, a different instance of EvalFunc should be used
(see https://issues.apache.org/jira/browse/PIG-276).
I would like to propose that we expand the changes in PIG-276 to be more
general. Rather than adding classForType() as proposed in PIG-276,
EvalFunc will instead add a function:
public Map<Schema, FuncSpec> getArgToFuncMapping() {
    return null;
}
Where FuncSpec is a new class that contains the name of the class that
implements the UDF along with any necessary arguments for the constructor.
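To make that concrete, here is a rough sketch of what FuncSpec could
look like; the field names, accessors, and constructors are purely
illustrative, not a final design:

// Illustrative sketch only; names, fields, and package are not final.
public class FuncSpec {
    private String className;   // fully qualified name of the class implementing the UDF
    private String[] ctorArgs;  // constructor arguments, if any

    public FuncSpec(String className) {
        this(className, null);
    }

    public FuncSpec(String className, String[] ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }

    public String getClassName() { return className; }
    public String[] getCtorArgs() { return ctorArgs; }
}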
The type checker will then, as part of type checking LOUserFunc, call
this function. If it receives null, it will simply leave the UDF as is
and assume that the UDF can handle whatever data type is being provided
to it. This will cover most existing UDFs, which will not override the
default implementation.
If a UDF wants to override the default, it should return a map that
gives a FuncSpec for each type of schema that it can support. For
example, for the UDF concat, the map would have two entries:
key: schema(chararray, chararray) value: StringConcat
key: schema(bytearray, bytearray) value: ByteConcat
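As a sketch, assuming StringConcat and ByteConcat are the type-specific
implementations and that we can build schemas with the existing Schema
and DataType classes (the exact schema-building calls may end up looking
different), the generic UDF might look something like this. Imports of
the Pig classes (EvalFunc, Tuple, Schema, DataType, DataByteArray) are
omitted for brevity:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch only; assumes the FuncSpec class proposed above.
public class Concat extends EvalFunc<DataByteArray> {

    public DataByteArray exec(Tuple input) throws IOException {
        // In this sketch the generic version is never executed; the type
        // checker swaps in StringConcat or ByteConcat via the mapping below.
        throw new IOException("Concat must be resolved to a typed implementation");
    }

    public Map<Schema, FuncSpec> getArgToFuncMapping() {
        Map<Schema, FuncSpec> m = new HashMap<Schema, FuncSpec>();
        m.put(twoArgSchema(DataType.CHARARRAY),
              new FuncSpec(StringConcat.class.getName()));
        m.put(twoArgSchema(DataType.BYTEARRAY),
              new FuncSpec(ByteConcat.class.getName()));
        return m;
    }

    // Hypothetical helper; builds a schema of two fields of the given type.
    private Schema twoArgSchema(byte type) {
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, type));
        s.add(new Schema.FieldSchema(null, type));
        return s;
    }
}

The point is just that the generic class carries the mapping; the actual
concatenation logic lives entirely in the type-specific classes.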
The type checker will then take the schema of the arguments being
passed to the UDF and look it up in the map. If it finds an entry, it
will use the associated FuncSpec. If it does not, it will throw an
exception saying that the EvalFunc cannot be used with those types.
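Roughly, the intended behavior is the following; this is just a sketch,
and the names udf, userFunc, argSchema, TypeCheckerException, and
setFuncSpec are made up for illustration, not existing type-checker
code:

// Sketch of the intended type-checker behavior, not actual code.
// "udf" is the EvalFunc instance attached to the LOUserFunc node being
// checked, and "argSchema" is the schema of its actual arguments.
Map<Schema, FuncSpec> mapping = udf.getArgToFuncMapping();
if (mapping == null) {
    // Leave the UDF as is; assume it can handle whatever type it gets.
} else {
    FuncSpec match = mapping.get(argSchema);
    if (match == null) {
        // Hypothetical exception type; use whatever the type checker throws.
        throw new TypeCheckerException(udf.getClass().getName()
            + " cannot be used with arguments of schema " + argSchema);
    }
    // Hypothetical setter: swap in the type-specific implementation.
    userFunc.setFuncSpec(match);
}

Note that looking the schema up directly in the map assumes Schema has
equals() and hashCode() implementations suitable for use as a map key;
if it does not, the type checker would have to walk the entries and
compare schemas explicitly.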
At this point, the type checker will make no effort to find a best-fit
function: either the fit is exact, or the lookup fails. In the future
we would like to modify the type checker to select a best fit.
For example, if a UDF says it can handle schema(long) and the type
checker finds it has schema(int), it can insert a cast to deal with
that. But in the first pass we will ignore this and depend on the user
to insert the casts.
Thoughts?
Alan.