Yes, GenericInternalRow is safe when types mismatch, but at the cost of
using Object[], which forces primitive values to be boxed. And the mismatch
still surfaces as a runtime failure, which is strictly worse than a
query-compile-time check. Also, don't forget my previous point: users need
to specify the type and index manually, such as row.getLong(0), which is
error-prone.
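To make the failure mode concrete, here is a minimal sketch. The BoxedRow
class below is NOT Spark's GenericInternalRow, just a hypothetical mimic of
its boxed Object[] storage, showing why a wrong getLong(0) call compiles
fine and only blows up at runtime:

```java
// Hypothetical mimic of boxed row storage -- not Spark's actual class.
class BoxedRow {
    private final Object[] values;
    BoxedRow(Object... values) { this.values = values; }
    long getLong(int i) { return (Long) values[i]; }  // unboxes; throws on mismatch
}

public class BoxedRowDemo {
    // Returns the value, or the exception's simple name if the cast fails.
    static String readAsLong(BoxedRow row) {
        try {
            return Long.toString(row.getLong(0));
        } catch (ClassCastException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAsLong(new BoxedRow(42L)));      // 42
        System.out.println(readAsLong(new BoxedRow("hello")));  // ClassCastException
    }
}
```

Note that the call site looks identical in both cases; nothing warns the
UDF author until the query actually runs on string input.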

> But we don’t do that for any of the similar UDFs today so I’m skeptical
that this would actually be a high enough priority to implement.

I'd say this is a must-have if we go with the individual-parameters
approach. The Scala UDF today checks the method signature at compile time,
thanks to the Scala type tag. The Java UDF today doesn't check it and is
hard to use.
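For readers less familiar with what the type-tag check buys you: the sketch
below is a hedged illustration, not Spark's API. A registration method whose
signature captures the input and output types lets the compiler itself
reject a mismatched UDF, which is roughly the guarantee the Scala udf(...)
helper gets from type tags:

```java
import java.util.function.Function;

// Hypothetical typed registration -- illustrative only, not Spark's API.
public class TypedRegistration {
    static <A, B> Function<A, B> register(String name, Function<A, B> f) {
        // Real code would record the name somewhere; the point is that the
        // parameter and return types are checked by the compiler.
        return f;
    }

    public static void main(String[] args) {
        Function<String, Integer> len = register("len", String::length);
        // Function<String, Long> bad = register("bad", String::length);
        //   ^ would not compile: Integer is not Long
        System.out.println(len.apply("spark"));  // 5
    }
}
```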

> You can’t implement ScalarFunction2<Integer, Integer> and
ScalarFunction2<Long, Long>.

Can you elaborate? We have function binding and we can use different UDFs
for different input types. If we do want to reuse one UDF
for different types, using "magical methods" solves the problem:
class MyUDF {
  def call(i: Int): Int = ...
  def call(l: Long): Long = ...
}
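A runnable Java version of the overloaded "magic methods" idea above (class
and method names are illustrative, not Spark's ScalarFunction API): the
compiler picks the right primitive signature per call site, so there is no
boxing and no runtime type check.

```java
// Hypothetical UDF with overloaded primitive "magic methods".
public class MagicMethodUDF {
    int call(int i) { return i + 1; }
    long call(long l) { return l + 1L; }

    public static void main(String[] args) {
        MagicMethodUDF udf = new MagicMethodUDF();
        System.out.println(udf.call(3));   // int overload -> 4
        System.out.println(udf.call(3L));  // long overload -> 4
    }
}
```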

On the other hand, I don't think the row-parameter approach can solve this
problem. The best I can think of is:
class MyUDF extends ScalarFunction[Object] {
  def call(row: InternalRow): Object = {
    if (int input) row.getInt(0) ... else row.getLong(0) ...
  }
}

This is worse because: 1) it needs an if-else chain to dispatch on the
input type; 2) the return type can only be Object, which causes boxing
issues.
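A runnable Java version of that row-parameter dispatch, using a hypothetical
Row mimic rather than Spark's InternalRow, makes both costs visible: the
instanceof branching and the forced boxing of the result.

```java
// Hypothetical row-parameter UDF -- Row is a mimic, not Spark's InternalRow.
public class RowParamUDF {
    static class Row {
        private final Object[] values;
        Row(Object... values) { this.values = values; }
        Object get(int i) { return values[i]; }
    }

    // "Add one" over either int or long input; the result is always boxed.
    static Object call(Row row) {
        Object v = row.get(0);
        if (v instanceof Integer) {
            return (Integer) v + 1;  // unbox, add, re-box
        } else if (v instanceof Long) {
            return (Long) v + 1L;
        }
        throw new IllegalArgumentException("unsupported type: " + v);
    }

    public static void main(String[] args) {
        System.out.println(call(new Row(3)));   // 4
        System.out.println(call(new Row(3L)));  // 4
    }
}
```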

I agree that Object[] is worse than InternalRow. But I can't think of real
use cases that would force the individual-parameters approach to use Object
instead of concrete types.


On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:

> Thanks for adding your perspective, Erik!
>
> If the input is string type but the UDF implementation calls
> row.getLong(0), it returns wrong data
>
> I think this is misleading. It is true for UnsafeRow, but there is no
> reason why InternalRow should return incorrect values.
>
> The implementation in GenericInternalRow
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
> would throw a ClassCastException. I don’t think that using a row is a bad
> option simply because UnsafeRow is unsafe.
>
> It’s unlikely that UnsafeRow would be used to pass the data. The
> implementation would evaluate each argument expression and set the result
> in a generic row, then pass that row to the UDF. We can use whatever
> implementation we choose to provide better guarantees than unsafe.
>
> I think we should consider query-compile-time checks as nearly-as-good as
> Java-compile-time checks for the purposes of safety.
>
> I don’t think I agree with this. A failure at query analysis time vs
> runtime still requires going back to a separate project, fixing something,
> and rebuilding. The time needed to fix a problem goes up significantly vs.
> compile-time checks. And that is even worse if the UDF is maintained by
> someone else.
>
> I think we also need to consider how common it would be that a use case
> can have the query-compile-time checks. Going through this in more detail
> below makes me think that it is unlikely that these checks would be used
> often because of the limitations of using an interface with type erasure.
>
> I believe that Wenchen’s proposal will provide stronger query-compile-time
> safety
>
> The proposal could have better safety for each argument, assuming that we
> detect failures by looking at the parameter types using reflection in the
> analyzer. But we don’t do that for any of the similar UDFs today so I’m
> skeptical that this would actually be a high enough priority to implement.
>
> As Erik pointed out, type erasure also limits the effectiveness. You can’t
> implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long,
> Long>. You can handle those cases using InternalRow or you can handle
> them using VarargScalarFunction<Object>. That forces many use cases into
> varargs with Object, where you don’t get any of the proposed analyzer
> benefits and lose compile-time checks. The only time the additional checks
> (if implemented) would help is when only one set of argument types is
> needed because implementing ScalarFunction<Object, Object> defeats the
> purpose.
>
> It’s worth noting that safety for the magic methods would be identical
> between the two options, so the trade-off to consider is for varargs and
> non-codegen cases. Combining the limitations discussed, this has better
> safety guarantees only if you need just one set of types for each number of
> arguments and are using the non-codegen path. Since varargs is one of the
> primary reasons to use this API, then I don’t think that it is a good idea
> to use Object[] instead of InternalRow.
> --
> Ryan Blue
> Software Engineer
> Netflix
>
