Paolo [CC'ed] observed that currently, if the return type of a UDF is a bag or a tuple, the contents of the bag/tuple are not known at type-checking time. In addition to the input parameter types, the return type of the UDF should therefore also be expressed as a schema. This will make the inputs and outputs well defined and help the type checker enforce type checking and promotion.
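To make Paolo's point concrete, the idea is that a UDF returning a tuple or bag would declare the schema of that tuple or bag, so the type checker can see inside it. The sketch below is purely illustrative: the `Schema` stand-in, the `outputSchema()` method name, and the `SplitName` UDF are hypothetical, not Pig's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: if a UDF's return type is itself a schema, the
// type checker can see inside the tuple/bag it returns. The names
// below (Schema, outputSchema, SplitName) are stand-ins, not Pig's API.
public class OutputSchemaSketch {

    // Minimal schema stand-in: an ordered list of field type names.
    public static final class Schema {
        public final List<String> fieldTypes;
        public Schema(String... fieldTypes) {
            this.fieldTypes = Arrays.asList(fieldTypes);
        }
        @Override public String toString() { return fieldTypes.toString(); }
    }

    // A UDF that returns a tuple declares what the tuple contains,
    // rather than an opaque "tuple" whose contents are unknown.
    public static class SplitName {
        public Schema outputSchema() {
            // hypothetical UDF: splits a full name into (first, last)
            return new Schema("chararray", "chararray");
        }
    }

    public static void main(String[] args) {
        System.out.println(new SplitName().outputSchema()); // prints [chararray, chararray]
    }
}
```

With the output declared this way, the type checker could validate downstream field accesses and type promotions instead of giving up at the opaque bag/tuple boundary.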
I found a paper that describes algorithms to do fast type inclusion tests (i.e. whether a type is a subtype of another type): http://www.cs.purdue.edu/homes/jv/pubs/oopsla97.pdf

Santhosh

-----Original Message-----
From: pi song [mailto:[EMAIL PROTECTED]
Sent: Monday, July 07, 2008 5:58 AM
To: [email protected]
Subject: Re: UDFs and types

You're right. The real problem will be defining the rules. How about:

0) We do only non-nested types first.
1) All numeric types can be cast to bigger types: int -> long -> float -> double.
2) bytearray can be cast to chararray or double (chararray takes precedence).
3) Matches on the left are more important than matches on the right.

For example:
Input: (int, long)
Candidates: (int, float), (float, long)
will match (int, float).

On Fri, Jul 4, 2008 at 1:42 AM, Benjamin Reed <[EMAIL PROTECTED]> wrote:
> You rock Pi!
>
> It might be good to agree on best-fit rules. There are obvious ones: int
> -> long, float -> double, but what about long -> int, long -> float, and
> string -> float?
>
> There are also recursive fits, which might be purely theoretical:
> tuples of the form (long, {float}) fit to (double, {long}) or (int,
> {long}). (That example might be invalid depending on the first answer,
> but hopefully you get the idea.)
>
> ben
>
> pi song wrote:
> > +1 Agree.
> >
> > I will try to make "best fit" happen in 24 hours after you commit the new
> > UDF design.
> >
> > On Thu, Jul 3, 2008 at 6:55 AM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> >
> >> Sounds good to me.
> >>
> >> Olga
> >>
> >>> -----Original Message-----
> >>> From: Alan Gates [mailto:[EMAIL PROTECTED]
> >>> Sent: Wednesday, July 02, 2008 1:44 PM
> >>> To: [email protected]
> >>> Subject: UDFs and types
> >>>
> >>> With the introduction of types (see
> >>> http://issues.apache.org/jira/browse/PIG-157) we need to
> >>> decide how EvalFunc will interact with the types.
> >>> The original proposal was that the DEFINE keyword would be
> >>> modified to allow specification of types for the UDF. This
> >>> has a couple of problems. One, DEFINE is already used to
> >>> specify constructor arguments; using it to also specify
> >>> types will be confusing. Two, it has been pointed out that
> >>> this type information is a property of the UDF and should
> >>> therefore be declared by the UDF, not in the script.
> >>>
> >>> Separately, as a way to allow simple function overloading, a
> >>> change had been proposed to the EvalFunc interface to allow
> >>> an EvalFunc to specify that for a given type, a different
> >>> instance of EvalFunc should be used (see
> >>> https://issues.apache.org/jira/browse/PIG-276).
> >>>
> >>> I would like to propose that we expand the changes in PIG-276
> >>> to be more general. Rather than adding classForType() as
> >>> proposed in PIG-276, EvalFunc will instead add a function:
> >>>
> >>>     public Map<Schema, FuncSpec> getArgToFuncMapping() {
> >>>         return null;
> >>>     }
> >>>
> >>> where FuncSpec is a new class that contains the name of the
> >>> class that implements the UDF, along with any necessary
> >>> arguments for the constructor.
> >>>
> >>> The type checker will then, as part of type checking
> >>> LOUserFunc, make a call to this function. If it receives
> >>> null, it will simply leave the UDF as is, and make the
> >>> assumption that the UDF can handle whatever datatype is being
> >>> provided to it. This will cover most existing UDFs, which
> >>> will not override the default implementation.
> >>>
> >>> If a UDF wants to override the default, it should return a
> >>> map that gives a FuncSpec for each type of schema that it can
> >>> support.
> >>> For example, for the UDF concat, the map would have
> >>> two entries:
> >>>
> >>>     key: schema(chararray, chararray)  value: StringConcat
> >>>     key: schema(bytearray, bytearray)  value: ByteConcat
> >>>
> >>> The type checker will then take the schema of what is being
> >>> passed to it and perform a lookup in the map. If it finds an
> >>> entry, it will use the associated FuncSpec. If it does not,
> >>> it will throw an exception saying that that EvalFunc cannot
> >>> be used with those types.
> >>>
> >>> At this point, the type checker will make no effort to find a
> >>> best-fit function. Either the fit is perfect, or it will not
> >>> be done. In the future we would like to modify the type
> >>> checker to select a best fit. For example, if a UDF says it
> >>> can handle schema(long) and the type checker finds it has
> >>> schema(int), it can insert a cast to deal with that. But in
> >>> the first pass we will ignore this and depend on the user to
> >>> insert the casts.
> >>>
> >>> Thoughts?
> >>>
> >>> Alan.
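The mechanism Alan proposes in the quoted mail can be sketched with toy stand-ins. The `Schema`, `FuncSpec`, and `EvalFunc` classes below are simplified models for illustration, not Pig's real implementation: the UDF publishes a schema-to-FuncSpec map, and the type checker does an exact-match lookup, treating a null map as "accepts anything" and a missing entry as an error.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposal, NOT Pig's real classes: a UDF publishes a
// Schema -> FuncSpec map, and the type checker resolves the call with
// an exact-match lookup on the actual input schema.
public class TypeCheckSketch {

    // Minimal schema stand-in: an ordered list of field type names.
    public static final class Schema {
        public final List<String> types;
        public Schema(String... types) { this.types = Arrays.asList(types); }
        @Override public boolean equals(Object o) {
            return o instanceof Schema && ((Schema) o).types.equals(types);
        }
        @Override public int hashCode() { return types.hashCode(); }
        @Override public String toString() { return types.toString(); }
    }

    // Stand-in for the proposed FuncSpec: the implementing class name
    // (constructor arguments omitted for brevity).
    public static final class FuncSpec {
        public final String className;
        public FuncSpec(String className) { this.className = className; }
    }

    public static abstract class EvalFunc {
        // Proposed default: null means "handles whatever is provided".
        public Map<Schema, FuncSpec> getArgToFuncMapping() { return null; }
    }

    // The concat example from the mail: two supported schemas.
    public static final class Concat extends EvalFunc {
        @Override public Map<Schema, FuncSpec> getArgToFuncMapping() {
            Map<Schema, FuncSpec> m = new HashMap<>();
            m.put(new Schema("chararray", "chararray"), new FuncSpec("StringConcat"));
            m.put(new Schema("bytearray", "bytearray"), new FuncSpec("ByteConcat"));
            return m;
        }
    }

    // First-pass type checking: the fit is exact or the call is rejected.
    public static String resolve(EvalFunc udf, Schema actual) {
        Map<Schema, FuncSpec> mapping = udf.getArgToFuncMapping();
        if (mapping == null)                         // leave the UDF as is
            return udf.getClass().getSimpleName();
        FuncSpec spec = mapping.get(actual);
        if (spec == null)
            throw new IllegalArgumentException(
                    "EvalFunc cannot be used with " + actual);
        return spec.className;
    }

    public static void main(String[] args) {
        System.out.println(resolve(new Concat(), new Schema("chararray", "chararray")));
        // prints StringConcat; a schema with no entry, e.g. (int, int),
        // raises IllegalArgumentException instead.
    }
}
```

Note that the map key must have value semantics (`equals`/`hashCode` over the field types) for the lookup to work, which is why the proposal makes the key a schema rather than a class object.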

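The best-fit rules pi song proposes earlier in the thread (widening int -> long -> float -> double, bytearray castable to chararray or double with chararray preferred, and leftmost matches weighing most) could be sketched as follows. Type names are plain strings here; this illustrates the proposed rules only and is not Pig code.

```java
import java.util.Arrays;
import java.util.List;

// A sketch of the casting/best-fit rules proposed in the thread,
// with type names as plain strings. Illustration only, not Pig code.
public class BestFit {

    // Rule 1: numeric widening order int -> long -> float -> double.
    private static final List<String> NUMERIC =
            Arrays.asList("int", "long", "float", "double");

    // Can a value of type 'from' be implicitly cast to 'to'?
    public static boolean castable(String from, String to) {
        if (from.equals(to)) return true;
        int i = NUMERIC.indexOf(from), j = NUMERIC.indexOf(to);
        if (i >= 0 && j >= 0) return i < j;                   // rule 1
        // Rule 2: bytearray can be cast to chararray or double.
        return from.equals("bytearray")
                && (to.equals("chararray") || to.equals("double"));
    }

    // Returns the best-fitting candidate signature, or null if none fits.
    public static List<String> bestFit(List<String> input,
                                       List<List<String>> candidates) {
        List<String> best = null;
        for (List<String> cand : candidates) {
            if (cand.size() != input.size()) continue;
            boolean fits = true;
            for (int k = 0; k < input.size() && fits; k++)
                fits = castable(input.get(k), cand.get(k));
            if (fits && (best == null || better(input, cand, best)))
                best = cand;
        }
        return best;
    }

    // Rule 3: matches on the left are more important than on the right,
    // so compare exactness position by position from the left.
    private static boolean better(List<String> input,
                                  List<String> a, List<String> b) {
        for (int k = 0; k < input.size(); k++) {
            boolean ea = input.get(k).equals(a.get(k));
            boolean eb = input.get(k).equals(b.get(k));
            if (ea != eb) return ea;
            // Rule 2 tie-break: for a bytearray input, a chararray
            // target takes precedence over double.
            if (!ea && input.get(k).equals("bytearray")) {
                boolean ca = a.get(k).equals("chararray");
                boolean cb = b.get(k).equals("chararray");
                if (ca != cb) return ca;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The example from the thread: input (int, long) against
        // candidates (int, float) and (float, long).
        List<String> input = Arrays.asList("int", "long");
        List<List<String>> candidates = Arrays.asList(
                Arrays.asList("int", "float"),
                Arrays.asList("float", "long"));
        System.out.println(bestFit(input, candidates)); // prints [int, float]
    }
}
```

Both candidates fit the example input, but (int, float) wins because its leftmost position is an exact match, exactly as rule 3 prescribes; Ben's recursive (nested tuple/bag) fits are deliberately out of scope, per rule 0.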