Paolo [CC'ed] observed that currently, if the return type of a UDF is a bag or a tuple, the contents of the bag/tuple are not known at type-checking time. In addition to the input parameter types, the return type of the UDF should therefore also be expressed as a schema. This will make the inputs and outputs well defined and help the type checker enforce type checking and promotion.
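To make Paolo's point concrete, the idea is that a UDF returning a tuple or bag would declare the schema of that tuple or bag, so the type checker can see inside it. The sketch below is purely illustrative: the `Schema` stand-in, the `outputSchema()` method name, and the `SplitName` UDF are hypothetical, not Pig's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: if a UDF's return type is itself a schema, the
// type checker can see inside the tuple/bag it returns. The names
// below (Schema, outputSchema, SplitName) are stand-ins, not Pig's API.
public class OutputSchemaSketch {

    // Minimal schema stand-in: an ordered list of field type names.
    public static final class Schema {
        public final List<String> fieldTypes;
        public Schema(String... fieldTypes) {
            this.fieldTypes = Arrays.asList(fieldTypes);
        }
        @Override public String toString() { return fieldTypes.toString(); }
    }

    // A UDF that returns a tuple declares what the tuple contains,
    // rather than an opaque "tuple" whose contents are unknown.
    public static class SplitName {
        public Schema outputSchema() {
            // hypothetical UDF: splits a full name into (first, last)
            return new Schema("chararray", "chararray");
        }
    }

    public static void main(String[] args) {
        System.out.println(new SplitName().outputSchema()); // prints [chararray, chararray]
    }
}
```

With the output declared this way, the type checker could validate downstream field accesses and type promotions instead of giving up at the opaque bag/tuple boundary.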
I found a paper that describes algorithms to do fast type inclusion tests (i.e. whether a type is a subtype of another type): http://www.cs.purdue.edu/homes/jv/pubs/oopsla97.pdf

Santhosh

-----Original Message-----
From: pi song [mailto:[EMAIL PROTECTED]
Sent: Monday, July 07, 2008 5:58 AM
To: [email protected]
Subject: Re: UDFs and types

You're right. The real problem will be defining the rules. How about:

0) We do only non-nested types first.
1) All numeric types can be cast to bigger types: int -> long -> float -> double.
2) bytearray can be cast to chararray or double (chararray takes precedence).
3) Matches on the left are more important than matches on the right.

For example:
Input: (int, long)
Candidates: (int, float), (float, long)
will match (int, float).

On Fri, Jul 4, 2008 at 1:42 AM, Benjamin Reed <[EMAIL PROTECTED]> wrote:
> You rock Pi!
>
> It might be good to agree on best-fit rules. There are obvious ones: int
> -> long, float -> double, but what about long -> int, long -> float, and
> string -> float?
>
> There are also recursive fits, which might be purely theoretical:
> tuples of the form (long, {float}) fit to (double, {long}) or (int,
> {long}). (That example might be invalid depending on the first answer,
> but hopefully you get the idea.)
>
> ben
>
> pi song wrote:
> > +1 Agree.
> >
> > I will try to make "best fit" happen in 24 hours after you commit the new
> > UDF design.
> >
> > On Thu, Jul 3, 2008 at 6:55 AM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> >
> >> Sounds good to me.
> >>
> >> Olga
> >>
> >>> -----Original Message-----
> >>> From: Alan Gates [mailto:[EMAIL PROTECTED]
> >>> Sent: Wednesday, July 02, 2008 1:44 PM
> >>> To: [email protected]
> >>> Subject: UDFs and types
> >>>
> >>> With the introduction of types (see
> >>> http://issues.apache.org/jira/browse/PIG-157) we need to
> >>> decide how EvalFunc will interact with the types.
> >>> The original proposal was that the DEFINE keyword would be
> >>> modified to allow specification of types for the UDF. This
> >>> has a couple of problems. One, DEFINE is already used to
> >>> specify constructor arguments; using it to also specify
> >>> types will be confusing. Two, it has been pointed out that
> >>> this type information is a property of the UDF and should
> >>> therefore be declared by the UDF, not in the script.
> >>>
> >>> Separately, as a way to allow simple function overloading, a
> >>> change had been proposed to the EvalFunc interface to allow
> >>> an EvalFunc to specify that for a given type, a different
> >>> instance of EvalFunc should be used (see
> >>> https://issues.apache.org/jira/browse/PIG-276).
> >>>
> >>> I would like to propose that we expand the changes in PIG-276
> >>> to be more general. Rather than adding classForType() as
> >>> proposed in PIG-276, EvalFunc will instead add a function:
> >>>
> >>>     public Map<Schema, FuncSpec> getArgToFuncMapping() {
> >>>         return null;
> >>>     }
> >>>
> >>> where FuncSpec is a new class that contains the name of the
> >>> class that implements the UDF, along with any necessary
> >>> arguments for the constructor.
> >>>
> >>> The type checker will then, as part of type checking
> >>> LOUserFunc, make a call to this function. If it receives
> >>> null, it will simply leave the UDF as is, and make the
> >>> assumption that the UDF can handle whatever datatype is being
> >>> provided to it. This will cover most existing UDFs, which
> >>> will not override the default implementation.
> >>>
> >>> If a UDF wants to override the default, it should return a
> >>> map that gives a FuncSpec for each type of schema that it can
> >>> support.
> >>> For example, for the UDF concat, the map would have
> >>> two entries:
> >>>
> >>>     key: schema(chararray, chararray)  value: StringConcat
> >>>     key: schema(bytearray, bytearray)  value: ByteConcat
> >>>
> >>> The type checker will then take the schema of what is being
> >>> passed to it and perform a lookup in the map. If it finds an
> >>> entry, it will use the associated FuncSpec. If it does not,
> >>> it will throw an exception saying that that EvalFunc cannot
> >>> be used with those types.
> >>>
> >>> At this point, the type checker will make no effort to find a
> >>> best-fit function. Either the fit is perfect, or it will not
> >>> be done. In the future we would like to modify the type
> >>> checker to select a best fit. For example, if a UDF says it
> >>> can handle schema(long) and the type checker finds it has
> >>> schema(int), it can insert a cast to deal with that. But in
> >>> the first pass we will ignore this and depend on the user to
> >>> insert the casts.
> >>>
> >>> Thoughts?
> >>>
> >>> Alan.
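The mechanism Alan proposes in the quoted mail can be sketched with toy stand-ins. The `Schema`, `FuncSpec`, and `EvalFunc` classes below are simplified models for illustration, not Pig's real implementation: the UDF publishes a schema-to-FuncSpec map, and the type checker does an exact-match lookup, treating a null map as "accepts anything" and a missing entry as an error.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposal, NOT Pig's real classes: a UDF publishes a
// Schema -> FuncSpec map, and the type checker resolves the call with
// an exact-match lookup on the actual input schema.
public class TypeCheckSketch {

    // Minimal schema stand-in: an ordered list of field type names.
    public static final class Schema {
        public final List<String> types;
        public Schema(String... types) { this.types = Arrays.asList(types); }
        @Override public boolean equals(Object o) {
            return o instanceof Schema && ((Schema) o).types.equals(types);
        }
        @Override public int hashCode() { return types.hashCode(); }
        @Override public String toString() { return types.toString(); }
    }

    // Stand-in for the proposed FuncSpec: the implementing class name
    // (constructor arguments omitted for brevity).
    public static final class FuncSpec {
        public final String className;
        public FuncSpec(String className) { this.className = className; }
    }

    public static abstract class EvalFunc {
        // Proposed default: null means "handles whatever is provided".
        public Map<Schema, FuncSpec> getArgToFuncMapping() { return null; }
    }

    // The concat example from the mail: two supported schemas.
    public static final class Concat extends EvalFunc {
        @Override public Map<Schema, FuncSpec> getArgToFuncMapping() {
            Map<Schema, FuncSpec> m = new HashMap<>();
            m.put(new Schema("chararray", "chararray"), new FuncSpec("StringConcat"));
            m.put(new Schema("bytearray", "bytearray"), new FuncSpec("ByteConcat"));
            return m;
        }
    }

    // First-pass type checking: the fit is exact or the call is rejected.
    public static String resolve(EvalFunc udf, Schema actual) {
        Map<Schema, FuncSpec> mapping = udf.getArgToFuncMapping();
        if (mapping == null)                         // leave the UDF as is
            return udf.getClass().getSimpleName();
        FuncSpec spec = mapping.get(actual);
        if (spec == null)
            throw new IllegalArgumentException(
                    "EvalFunc cannot be used with " + actual);
        return spec.className;
    }

    public static void main(String[] args) {
        System.out.println(resolve(new Concat(), new Schema("chararray", "chararray")));
        // prints StringConcat; a schema with no entry, e.g. (int, int),
        // raises IllegalArgumentException instead.
    }
}
```

Note that the map key must have value semantics (`equals`/`hashCode` over the field types) for the lookup to work, which is why the proposal makes the key a schema rather than a class object.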

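The best-fit rules pi song proposes earlier in the thread (widening int -> long -> float -> double, bytearray castable to chararray or double with chararray preferred, and leftmost matches weighing most) could be sketched as follows. Type names are plain strings here; this illustrates the proposed rules only and is not Pig code.

```java
import java.util.Arrays;
import java.util.List;

// A sketch of the casting/best-fit rules proposed in the thread,
// with type names as plain strings. Illustration only, not Pig code.
public class BestFit {

    // Rule 1: numeric widening order int -> long -> float -> double.
    private static final List<String> NUMERIC =
            Arrays.asList("int", "long", "float", "double");

    // Can a value of type 'from' be implicitly cast to 'to'?
    public static boolean castable(String from, String to) {
        if (from.equals(to)) return true;
        int i = NUMERIC.indexOf(from), j = NUMERIC.indexOf(to);
        if (i >= 0 && j >= 0) return i < j;                   // rule 1
        // Rule 2: bytearray can be cast to chararray or double.
        return from.equals("bytearray")
                && (to.equals("chararray") || to.equals("double"));
    }

    // Returns the best-fitting candidate signature, or null if none fits.
    public static List<String> bestFit(List<String> input,
                                       List<List<String>> candidates) {
        List<String> best = null;
        for (List<String> cand : candidates) {
            if (cand.size() != input.size()) continue;
            boolean fits = true;
            for (int k = 0; k < input.size() && fits; k++)
                fits = castable(input.get(k), cand.get(k));
            if (fits && (best == null || better(input, cand, best)))
                best = cand;
        }
        return best;
    }

    // Rule 3: matches on the left are more important than on the right,
    // so compare exactness position by position from the left.
    private static boolean better(List<String> input,
                                  List<String> a, List<String> b) {
        for (int k = 0; k < input.size(); k++) {
            boolean ea = input.get(k).equals(a.get(k));
            boolean eb = input.get(k).equals(b.get(k));
            if (ea != eb) return ea;
            // Rule 2 tie-break: for a bytearray input, a chararray
            // target takes precedence over double.
            if (!ea && input.get(k).equals("bytearray")) {
                boolean ca = a.get(k).equals("chararray");
                boolean cb = b.get(k).equals("chararray");
                if (ca != cb) return ca;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The example from the thread: input (int, long) against
        // candidates (int, float) and (float, long).
        List<String> input = Arrays.asList("int", "long");
        List<List<String>> candidates = Arrays.asList(
                Arrays.asList("int", "float"),
                Arrays.asList("float", "long"));
        System.out.println(bestFit(input, candidates)); // prints [int, float]
    }
}
```

Both candidates fit the example input, but (int, float) wins because its leftmost position is an exact match, exactly as rule 3 prescribes; Ben's recursive (nested tuple/bag) fits are deliberately out of scope, per rule 0.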