Re: Using UDFs in Java without registration

2017-07-26 Thread Justin Uang
I'd like to bring this back up for consideration. I'm open to adding types
for all the parameters, but it does seem onerous, and in the case of Python
we don't do that. Do you feel strongly about adding them?

On Sat, May 30, 2015 at 8:04 PM Reynold Xin <r...@databricks.com> wrote:

> We added all the TypeTags for the arguments but haven't gotten around to
> using them yet. I think it'd make sense to have them and do the auto cast,
> but we can have rules in analysis to forbid certain casts (e.g. don't auto
> cast double to int).
>
>
> On Sat, May 30, 2015 at 7:12 AM, Justin Uang <justin.u...@gmail.com> wrote:
>
>> The idea of asking for both the argument and return class is interesting.
>> I don't think we do that for the Scala APIs currently, right? In
>> functions.scala, we only use the TypeTag for RT.
>>
>>   def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
>>     UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType)
>>   }
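>>
>> For comparison, the Scala path already works without registration because of
>> that TypeTag. A minimal sketch, assuming a DataFrame myDf with an integer
>> "age" column:
>>
>> import org.apache.spark.sql.functions
>>
>> val plusFive = functions.udf((age: Int) => age + 5)
>> val result = myDf.select(plusFive(myDf("age")))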
>>
>> There would only be a small subset of conversions that would make sense
>> implicitly (e.g. int to double, the typical conversions in programming
>> languages), but something like (double => int) might be dangerous and
>> (timestamp => double) wouldn't really make sense. Perhaps it's better to be
>> explicit about casts?
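>>
>> To make that concrete, here is a rough sketch (illustrative only, not an
>> actual Spark analysis rule) of the kind of allow/forbid table such a check
>> could encode:
>>
>> import org.apache.spark.sql.types._
>>
>> // hypothetical helper: which automatic casts to a UDF argument are acceptable
>> def allowImplicitCast(from: DataType, to: DataType): Boolean = (from, to) match {
>>   case (f, t) if f == t            => true
>>   case (IntegerType, LongType)     => true   // widening: fine
>>   case (IntegerType, DoubleType)   => true   // widening: fine
>>   case (DoubleType, IntegerType)   => false  // narrowing: dangerous
>>   case (TimestampType, DoubleType) => false  // makes no sense
>>   case _                           => false
>> }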
>>
>> If we don't care about declaring the types of the arguments, perhaps we
>> can have all of the Java UDF interfaces (UDF1, UDF2, etc.) extend a generic
>> interface called UDF, then have
>>
>> def define(f: UDF, returnType: Class[_])
>>
>> to simplify the APIs.
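>>
>> A rough sketch of that idea (all names here are hypothetical; the existing
>> Java UDF interfaces do not share such a parent):
>>
>> // hypothetical marker parent for the Java UDF interfaces
>> trait UDF extends java.io.Serializable
>> trait UDF1[T1, R] extends UDF { def call(t1: T1): R }
>> trait UDF2[T1, T2, R] extends UDF { def call(t1: T1, t2: T2): R }
>>
>> object UDFs {
>>   // single untyped entry point; the return type is passed explicitly
>>   // because Java erases the generic parameters at runtime
>>   def define(f: UDF, returnType: Class[_]): org.apache.spark.sql.UserDefinedFunction = ???
>> }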
>>
>>
>> On Sat, May 30, 2015 at 3:43 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> I think you are right that there is no way to call a Java UDF without
>>> registration right now. Adding another 20 methods to functions would be
>>> scary. Maybe the best way is to have a companion object
>>> for UserDefinedFunction and define the UDFs there?
>>>
>>> e.g.
>>>
>>> object UserDefinedFunction {
>>>
>>>   def define(f: org.apache.spark.api.java.function.Function0,
>>>              returnType: Class[_]): UserDefinedFunction
>>>
>>>   // ... define a few more - maybe up to 5 arguments?
>>> }
>>>
>>> Ideally, we should ask for both the argument class and the return class, so
>>> we can do the proper type conversion (e.g. if the UDF expects a string, but
>>> the input expression is an int, Catalyst can automatically add a cast).
>>> However, we haven't implemented that in UserDefinedFunction yet.
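>>>
>>> For illustration, a sketch of what the typed variant might look like (a
>>> hypothetical signature, not the current API), reusing the existing Java
>>> UDF1 interface:
>>>
>>> import org.apache.spark.sql.UserDefinedFunction
>>> import org.apache.spark.sql.api.java.UDF1
>>>
>>> object UserDefinedFunctions {
>>>   // argumentType tells the analyzer what to cast the input expression to
>>>   // (e.g. an int column feeding a String parameter) before calling the UDF
>>>   def define[T1, R](f: UDF1[T1, R],
>>>                     argumentType: Class[T1],
>>>                     returnType: Class[R]): UserDefinedFunction = ???
>>> }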
>>>
>>>
>>>
>>>
>>> On Fri, May 29, 2015 at 12:54 PM, Justin Uang <justin.u...@gmail.com> wrote:
>>>
 I would like to define a UDF in Java via a closure and then use it
 without registration. In Scala, I believe there are two ways to do this:

 val myUdf = functions.udf((age: Int) => age + 5)
 myDf.select(myUdf(myDf("age")))

 or

 myDf.select(functions.callUDF((age: Int) => age + 5, DataTypes.IntegerType, myDf("age")))

 However, neither of these works for a Java UDF. The first one requires
 TypeTags. For the second one, I was able to hack it by creating a Scala
 AbstractFunction1 and using callUDF, which requires declaring the Catalyst
 DataType instead of using TypeTags. However, it was still nasty because I
 had to return a Scala map instead of a Java map.
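
 For reference, the workaround looks roughly like this (a sketch only; myDf is
 the DataFrame from the example above, and callUDF is the Spark 1.3-era
 overload that takes a Scala function plus a DataType):

 import org.apache.spark.sql.api.java.UDF1
 import org.apache.spark.sql.functions
 import org.apache.spark.sql.types.DataTypes

 // the Java-style UDF we actually want to use
 val javaUdf = new UDF1[Integer, Integer] {
   override def call(age: Integer): Integer = age + 5
 }

 // wrap it in a Scala Function1 so callUDF will accept it
 val asScala = new scala.runtime.AbstractFunction1[Integer, Integer] {
   override def apply(age: Integer): Integer = javaUdf.call(age)
 }

 myDf.select(functions.callUDF(asScala, DataTypes.IntegerType, myDf("age")))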

 Is there first-class support for creating
 an org.apache.spark.sql.UserDefinedFunction that works with
 the org.apache.spark.sql.api.java.UDF1<T1, R> interface? I'm fine with having
 to declare the Catalyst type when creating it.

 If it doesn't exist, I would be happy to work on it =)

 Justin

>>>
>>>
>

