cloud-fan commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774945050
@rdblue Thanks for writing up the design doc! This is a very important and
useful feature, and `UnboundFunction` seems like a very interesting idea.
It allows function overloading (for different input schemas, people can return
different `BoundFunction`s), but I'm wondering how it can tell Spark to add a
Cast. For example, what if a function accepts int input but the actual input
is of byte type?
Another point is that we should think about the final generated Java code when
invoking a UDF. With whole-stage codegen (the default case), the input values are
actually Java variables in the generated Java code. This means we need to build
an `InternalRow` before invoking the new UDF, which is very inefficient and is
even worse than the current Spark Scala/Java UDF. Also, the type parameter for
the return type has perf issues because of primitive type boxing.
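To make the overhead concern concrete, here is a minimal, self-contained sketch. It uses no Spark classes: `Object[]` stands in for `InternalRow`, and `RowFunction`, `RowLength`, and `DirectLength` are hypothetical names invented for this illustration. It only contrasts the two call shapes and is not a benchmark.

```java
public class BoxingSketch {
    // Hypothetical row-based, generic shape: arguments arrive packed in a row
    // (Object[] stands in for InternalRow) and the result type is a type parameter.
    public interface RowFunction<R> {
        R produce(Object[] row);
    }

    public static class RowLength implements RowFunction<Integer> {
        @Override
        public Integer produce(Object[] row) {
            // The int argument was boxed into the row; the int result is boxed on return.
            return String.valueOf((Integer) row[0]).length();
        }
    }

    // Hypothetical direct shape: primitive in, primitive out.
    public static class DirectLength {
        public int call(int arg) {
            return String.valueOf(arg).length();
        }
    }

    // What the caller must do for the row-based shape.
    public static int viaRow(int value) {
        Object[] row = new Object[] { value };  // allocates a row and boxes the argument
        return new RowLength().produce(row);    // unboxes the Integer result
    }

    // What the caller can do for the direct shape.
    public static int direct(int value) {
        return new DirectLength().call(value);  // no row, no boxing
    }

    public static void main(String[] args) {
        System.out.println(viaRow(12345));
        System.out.println(direct(12345));
    }
}
```

Both paths compute the same value; the difference is the per-call row allocation and the boxing/unboxing that the first shape forces on the caller.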
My rough idea is:
```
interface ScalarFunction {
  StructType[] expectedInputTypes();
  DataType returnType();
}

class MyScalarFunction implements ScalarFunction {
  StructType[] expectedInputTypes() {
    // ... allows int and string
  }
  DataType returnType() { return IntegerType; }

  // One `call` overload per accepted input type.
  int call(int arg) { return String.valueOf(arg).length(); }
  int call(UTF8String arg) { return arg.length(); }
}
```
The analyzer will bind the UDF to the actual input types (adding implicit casts if
needed), and check via reflection that a `call` method exists for the given
input/return types. Then in whole-stage codegen, we just call the `call` method
with inputs of those types and assign the result to a Java variable. No need to
build an `InternalRow` and no boxing overhead, but also no compile-time type
safety (the analyzer can still catch errors).
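The reflection check could look roughly like the sketch below. It is plain Java with no Spark dependency: `CallBindingSketch`, `StringLength`, and `bind` are hypothetical names, and `String` stands in for `UTF8String`. It assumes the analyzer has already mapped Spark data types to Java classes and inserted any casts, so only an exact signature match is attempted.

```java
import java.lang.reflect.Method;

public class CallBindingSketch {
    // Hypothetical user function with one `call` overload per accepted input type.
    public static class StringLength {
        public int call(int arg) { return String.valueOf(arg).length(); }
        public int call(String arg) { return arg.length(); }
    }

    // Analysis-time check: look up a public `call` method with exactly these
    // parameter types and verify its return type. Fails fast if none exists.
    public static Method bind(Class<?> udf, Class<?> returnType, Class<?>... inputTypes) {
        try {
            Method m = udf.getMethod("call", inputTypes);
            if (m.getReturnType() != returnType) {
                throw new IllegalArgumentException(
                    "call returns " + m.getReturnType() + ", expected " + returnType);
            }
            return m;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no matching call method on " + udf, e);
        }
    }

    public static void main(String[] args) throws Exception {
        Method m = bind(StringLength.class, int.class, int.class);
        // Reflective invocation is used here only to demonstrate the lookup;
        // generated code would call `call` directly on primitive variables.
        System.out.println(m.invoke(new StringLength(), 12345));

        try {
            bind(StringLength.class, int.class, double.class);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected at analysis time: " + e.getMessage());
        }
    }
}
```

Note that `getMethod` matches parameter types exactly, so a byte input would not resolve to `call(int)` on its own; that is where the analyzer's implicit cast would have to come in before the lookup.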
cc @viirya @maropu @kiszk @rednaxelafx