Hi, all.

We have shared many opinions from different perspectives. However, we didn't reach a consensus even on a partial merge that excludes something (on my PR, and on this mailing thread by Wenchen).
For the following claim, we have another alternative to mitigate it:

> I don't like it because it promotes the row-parameter API and forces users
> to implement it, even if the users want to only use the individual-parameters
> API.

Why don't we merge the AS-IS PR by adding something instead of excluding
something?

- R produceResult(InternalRow input);
+ default R produceResult(InternalRow input) throws Exception {
+   throw new UnsupportedOperationException();
+ }

By providing the default implementation, we technically no longer *force users
to implement it*, and we can properly document the expected usage.

What do you think?

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:

> Yes, GenericInternalRow is safe when types mismatch, with the cost of using
> Object[], and primitive types need to do boxing
>
> The question is not whether to use the magic functions, which would not need
> boxing. The question here is whether to use multiple ScalarFunction
> interfaces. Those interfaces would require boxing or using Object[], so
> there isn't a benefit.
>
> If we do want to reuse one UDF for different types, using "magical methods"
> solves the problem
>
> Yes, that's correct. We agree that magic methods are a good option for this.
>
> Again, the question we need to decide is whether to use InternalRow or
> interfaces like ScalarFunction2 for non-codegen. The option to use multiple
> interfaces is limited by type erasure because you can only have one set of
> type parameters. If you wanted to support both
> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>, you'd
> have to fall back to ScalarFunction2<Object, Object> and cast.
>
> The point is that type erasure will commonly lead either to many different
> implementation classes (one for each type combination) or to parameterizing
> by Object, which defeats the purpose.
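The type-erasure limitation described above can be sketched in a few lines of Java. `ScalarFunction2` here is a simplified stand-in for illustration, not Spark's actual interface:

```java
// Simplified stand-in for illustration only; not Spark's actual interface.
interface ScalarFunction2<T, R> {
    R call(T left, T right);
}

// This does NOT compile: both parameterizations erase to the same
// call(Object, Object) method, so one class cannot implement both.
//
//   class Plus implements ScalarFunction2<Integer, Integer>,
//                         ScalarFunction2<Long, Long> { ... }
//
// The single-class fallback is to parameterize by Object and cast, which
// gives up compile-time type checks and boxes every primitive argument.
class ObjectPlus implements ScalarFunction2<Object, Object> {
    public Object call(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;
        } else if (left instanceof Long && right instanceof Long) {
            return (Long) left + (Long) right;
        }
        throw new IllegalArgumentException("unsupported argument types");
    }
}

public class ErasureDemo {
    public static void main(String[] args) {
        ObjectPlus plus = new ObjectPlus();
        System.out.println(plus.call(1, 2));   // arguments boxed to Integer
        System.out.println(plus.call(1L, 2L)); // arguments boxed to Long
    }
}
```

This is only a sketch of the trade-off under discussion: either one implementation class per type combination, or a single Object-based class with runtime casts.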
>
> The alternative adds safety because correct types are produced by calls
> like getLong(0). Yes, this depends on the implementation making the correct
> calls, but it is better than using Object and casting.
>
> I can't think of real use cases that will force the individual-parameters
> approach to use Object instead of concrete types.
>
> I think this is addressed by the type erasure discussion above. A simple
> Plus method would require Object or 12 implementations for 2 arguments and
> 4 numeric types.
>
> And basically all varargs cases would need to use Object[]. Consider a UDF
> to create a map that requires string keys and some consistent type for
> values. This would be easy with the InternalRow API because you can use
> getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use
> of UTF8String vs. String will be checked at compile time.
>
> I agree that Object[] is worse than InternalRow
>
> Yes, and if we are always using Object because of type erasure, or using
> magic methods to get better performance, the utility of the parameterized
> interfaces is very limited.
>
> Because we want to expose the magic functions, the use of ScalarFunction2
> and similar is extremely limited because it is only for non-codegen.
> Varargs is by far the more common case. The InternalRow interface is also a
> very simple way to get started and ensures that Spark can always find the
> right method after the function is bound to input types.
>
> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Yes, GenericInternalRow is safe when types mismatch, with the cost of
>> using Object[], and primitive types need to do boxing. And this is a
>> runtime failure, which is absolutely worse than query-compile-time checks.
>> Also, don't forget my previous point: users need to specify the type and
>> index, such as row.getLong(0), which is error-prone.
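The behavior both sides refer to, a typed getter on a generic row failing fast with a ClassCastException rather than returning wrong data, can be sketched with a toy row class. `GenericRow` below is an illustrative stand-in, not Spark's `InternalRow` or `GenericInternalRow`:

```java
// Toy stand-in for a generic row (not Spark's class): values live in an
// Object[], and typed getters cast, so a mismatched call fails at runtime
// with a ClassCastException instead of silently returning wrong data.
class GenericRow {
    private final Object[] values;

    GenericRow(Object... values) {
        this.values = values;
    }

    long getLong(int pos) {
        return (Long) values[pos]; // throws ClassCastException on mismatch
    }
}

public class RowSafetyDemo {
    public static void main(String[] args) {
        GenericRow row = new GenericRow("not a long", 42L);

        System.out.println(row.getLong(1)); // correct position: prints 42

        try {
            row.getLong(0); // wrong position: String is not a Long
        } catch (ClassCastException e) {
            System.out.println("runtime failure: " + e);
        }
    }
}
```

This illustrates both points at once: the failure is detected (Ryan's point), but only at runtime and only if the implementation picks the right position and type (Wenchen's point).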
>>
>> > But we don't do that for any of the similar UDFs today so I'm skeptical
>> > that this would actually be a high enough priority to implement.
>>
>> I'd say this is a must-have if we go with the individual-parameters
>> approach. The Scala UDF today checks the method signature at compile time,
>> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
>> to use.
>>
>> > You can't implement ScalarFunction2<Integer, Integer> and
>> > ScalarFunction2<Long, Long>.
>>
>> Can you elaborate? We have function binding, and we can use different UDFs
>> for different input types. If we do want to reuse one UDF for different
>> types, using "magical methods" solves the problem:
>>
>> class MyUDF {
>>   def call(i: Int): Int = ...
>>   def call(l: Long): Long = ...
>> }
>>
>> On the other side, I don't think the row-parameter approach can solve this
>> problem. The best I can think of is:
>>
>> class MyUDF extends ScalarFunction[Object] {
>>   def call(row: InternalRow): Object = {
>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>   }
>> }
>>
>> This is worse because: 1) it needs if-else checks for the different input
>> types, and 2) the return type can only be Object, which causes boxing
>> issues.
>>
>> I agree that Object[] is worse than InternalRow. But I can't think of real
>> use cases that will force the individual-parameters approach to use Object
>> instead of concrete types.
>>
>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Thanks for adding your perspective, Erik!
>>>
>>> If the input is string type but the UDF implementation calls
>>> row.getLong(0), it returns wrong data
>>>
>>> I think this is misleading. It is true for UnsafeRow, but there is no
>>> reason why InternalRow should return incorrect values.
>>>
>>> The implementation in GenericInternalRow
>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>>> would throw a ClassCastException. I don't think that using a row is a bad
>>> option simply because UnsafeRow is unsafe.
>>>
>>> It's unlikely that UnsafeRow would be used to pass the data. The
>>> implementation would evaluate each argument expression and set the result
>>> in a generic row, then pass that row to the UDF. We can use whatever
>>> implementation we choose to provide better guarantees than unsafe.
>>>
>>> I think we should consider query-compile-time checks as nearly as good as
>>> Java-compile-time checks for the purposes of safety.
>>>
>>> I don't think I agree with this. A failure at query analysis time, just
>>> like one at runtime, still requires going back to a separate project,
>>> fixing something, and rebuilding. The time needed to fix a problem goes
>>> up significantly vs. compile-time checks. And that is even worse if the
>>> UDF is maintained by someone else.
>>>
>>> I think we also need to consider how common it would be that a use case
>>> can have the query-compile-time checks. Going through this in more detail
>>> below makes me think that it is unlikely that these checks would be used
>>> often because of the limitations of using an interface with type erasure.
>>>
>>> I believe that Wenchen's proposal will provide stronger
>>> query-compile-time safety
>>>
>>> The proposal could have better safety for each argument, assuming that we
>>> detect failures by looking at the parameter types using reflection in the
>>> analyzer. But we don't do that for any of the similar UDFs today, so I'm
>>> skeptical that this would actually be a high enough priority to implement.
>>>
>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>> can't implement ScalarFunction2<Integer, Integer> and
>>> ScalarFunction2<Long, Long>.
>>> You can handle those cases using InternalRow, or you can handle them
>>> using VarargScalarFunction<Object>. That forces many use cases into
>>> varargs with Object, where you don't get any of the proposed analyzer
>>> benefits and lose compile-time checks. The only time the additional
>>> checks (if implemented) would help is when only one set of argument types
>>> is needed, because implementing ScalarFunction<Object, Object> defeats
>>> the purpose.
>>>
>>> It's worth noting that safety for the magic methods would be identical
>>> between the two options, so the trade-off to consider is for varargs and
>>> non-codegen cases. Combining the limitations discussed, this has better
>>> safety guarantees only if you need just one set of types for each number
>>> of arguments and are using the non-codegen path. Since varargs is one of
>>> the primary reasons to use this API, I don't think that it is a good idea
>>> to use Object[] instead of InternalRow.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
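The map-building varargs UDF mentioned in the thread can be sketched with a toy row class. Both `GenericRow` and `MakeMapUDF` below, and the `produceResult` shape, are illustrative assumptions for this sketch, not the proposed Spark API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for InternalRow (not Spark's class).
class GenericRow {
    private final Object[] values;

    GenericRow(Object... values) {
        this.values = values;
    }

    int numFields() {
        return values.length;
    }

    String getString(int pos) {
        return (String) values[pos]; // fails fast if the key is not a String
    }

    Object get(int pos) {
        return values[pos];
    }
}

// Sketch of the varargs map-building UDF discussed above: string keys at even
// positions, values of one consistent type at odd positions. With a row
// parameter, one implementation covers any number of key/value pairs.
class MakeMapUDF {
    Map<String, Object> produceResult(GenericRow input) {
        Map<String, Object> result = new HashMap<>();
        for (int pos = 0; pos + 1 < input.numFields(); pos += 2) {
            result.put(input.getString(pos), input.get(pos + 1));
        }
        return result;
    }
}

public class MakeMapDemo {
    public static void main(String[] args) {
        Map<String, Object> m =
            new MakeMapUDF().produceResult(new GenericRow("a", 1L, "b", 2L));
        System.out.println(m.get("a") + " " + m.get("b")); // prints 1 2
    }
}
```

The equivalent individual-parameters version would need `Object[]` varargs and casts, which is the comparison the thread is weighing.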