The idea of asking for both the argument and return class is interesting. I
don't think we do that for the Scala APIs currently, right? In
functions.scala, we only use the TypeTag for the return type RT:

  def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
    UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType)
  }
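
If we also used the argument TypeTag, it could look roughly like this (a
sketch only; the inputTypes value is hypothetical and not something
UserDefinedFunction accepts today):

  def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
    // hypothetical: capture the argument type so catalyst could add casts
    val inputTypes = Seq(ScalaReflection.schemaFor(typeTag[A1]).dataType)
    UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType /*, inputTypes */)
  }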

There would only be a small subset of conversions that would make sense
implicitly (e.g. int to double, the typical widening conversions in
programming languages), but something like (double => int) might be
dangerous, and (timestamp => double) wouldn't really make sense. Perhaps
it's better to be explicit about casts?
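
For example, being explicit at the call site would look something like this
(a sketch; assume myUdf expects a double):

    myDf.select(myUdf(myDf("age").cast(DoubleType)))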

If we don't care about declaring the types of the arguments, perhaps we can
have all of the Java UDF interfaces (UDF1, UDF2, etc.) extend a generic
interface called UDF, then have

    def define(f: UDF, returnType: Class[_])

to simplify the APIs.
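
A rough standalone sketch of that shape (all of these classes are
hypothetical stand-ins, not the real Spark ones):

    // hypothetical marker interface; UDF1, UDF2, ... would all extend it
    trait UDF extends java.io.Serializable
    trait UDF1[T1, R] extends UDF { def call(t1: T1): R }
    trait UDF2[T1, T2, R] extends UDF { def call(t1: T1, t2: T2): R }

    // stand-in for the real class, which would also map returnType
    // to a catalyst DataType
    case class UserDefinedFunction(f: AnyRef, returnType: Class[_])

    object UserDefinedFunction {
      // single entry point for all arities: dispatch on the concrete
      // interface and wrap the java-style UDF in a plain scala function
      def define(f: UDF, returnType: Class[_]): UserDefinedFunction = {
        val scalaF: AnyRef = f match {
          case f1: UDF1[_, _] =>
            (x: Any) => f1.asInstanceOf[UDF1[Any, Any]].call(x)
          case f2: UDF2[_, _, _] =>
            (x: Any, y: Any) => f2.asInstanceOf[UDF2[Any, Any, Any]].call(x, y)
          // ... and so on up to the highest arity we support
        }
        UserDefinedFunction(scalaF, returnType)
      }
    }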


On Sat, May 30, 2015 at 3:43 AM Reynold Xin <r...@databricks.com> wrote:

> I think you are right that there is no way to call Java UDF without
> registration right now. Adding another 20 methods to functions would be
> scary. Maybe the best way is to have a companion object
> for UserDefinedFunction, and define UDF there?
>
> e.g.
>
> object UserDefinedFunction {
>
>   def define(f: org.apache.spark.api.java.function.Function0[_],
>       returnType: Class[_]): UserDefinedFunction
>
>   // ... define a few more - maybe up to 5 arguments?
> }
>
> Ideally, we should ask for both argument class and return class, so we can
> do the proper type conversion (e.g. if the UDF expects a string, but the
> input expression is an int, Catalyst can automatically add a cast).
> However, we haven't implemented those in UserDefinedFunction yet.
>
> On Fri, May 29, 2015 at 12:54 PM, Justin Uang <justin.u...@gmail.com>
> wrote:
>
>> I would like to define a UDF in Java via a closure and then use it
>> without registration. In Scala, I believe there are two ways to do this:
>>
>>     val myUdf = functions.udf((x: Int) => x + 5)
>>     myDf.select(myUdf(myDf("age")))
>>
>> or
>>
>>     myDf.select(functions.callUDF((x: Int) => x + 5,
>>       DataTypes.IntegerType, myDf("age")))
>>
>> However, neither of these works for a Java UDF. The first one requires
>> TypeTags. For the second one, I was able to hack it by creating a scala
>> AbstractFunction1 and using callUDF, which requires declaring the catalyst
>> DataType instead of using TypeTags. However, it was still nasty because I
>> had to return a scala map instead of a java map.
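>>
>> (Roughly, the hack amounts to something like this sketch, with the types
>> simplified:)
>>
>>     val f = new scala.runtime.AbstractFunction1[Any, Any] {
>>       def apply(x: Any): Any = x.asInstanceOf[Int] + 5
>>     }
>>     myDf.select(functions.callUDF(f, DataTypes.IntegerType, myDf("age")))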
>>
>> Is there first-class support for creating
>> an org.apache.spark.sql.UserDefinedFunction that works with
>> the org.apache.spark.sql.api.java.UDF1<T1, R>? I'm fine with having to
>> declare the catalyst type when creating it.
>>
>> If it doesn't exist, I would be happy to work on it =)
>>
>> Justin
>>
>
>
