Good point, Dongjoon. I think we can probably come to some compromise here:

   - Remove SupportsInvoke since it isn’t really needed. We should always
   try to find the right method to invoke in the codegen path.
   - Add a default implementation of produceResult so that implementations
   don’t have to override it. If a function doesn’t implement it and a magic
   method can’t be found, the default will throw UnsupportedOperationException.
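Concretely, a rough sketch of what that default could look like (simplified
for illustration; InternalRow here is a minimal stand-in for Spark's actual
interface, and LongSum is just a made-up example function):

```java
// Sketch only: InternalRow is a cut-down stand-in for Spark's
// org.apache.spark.sql.catalyst.InternalRow, just enough for the example.
import java.io.Serializable;

interface InternalRow {
    long getLong(int ordinal);
}

interface ScalarFunction<R> extends Serializable {
    // Default body: implementations that rely on a magic method found in the
    // codegen path don't need to override produceResult at all.
    default R produceResult(InternalRow input) {
        throw new UnsupportedOperationException(
            getClass().getName() + " does not implement produceResult");
    }
}

// A function that opts into the row-based API instead:
class LongSum implements ScalarFunction<Long> {
    @Override
    public Long produceResult(InternalRow input) {
        return input.getLong(0) + input.getLong(1);
    }
}
```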

This is assuming that we can agree not to introduce all of the
ScalarFunction interface variations, which would have limited utility
because of type erasure.
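To make the type-erasure point concrete (ScalarFunction2 here is a
hypothetical shape for one of those interface variations, not a committed
API, and Plus is an invented example):

```java
// Hypothetical two-argument variation. After erasure, every parameterization
// compiles down to the same call(Object, Object) method, so a single class
// cannot implement the interface twice with different type arguments.
interface ScalarFunction2<T1, T2> {
    Object call(T1 x, T2 y);
}

// Won't compile: "ScalarFunction2 cannot be inherited with different type
// arguments: <Integer, Integer> and <Long, Long>"
// class Plus implements ScalarFunction2<Integer, Integer>,
//                       ScalarFunction2<Long, Long> { ... }

// The only single-class fallback parameterizes by Object and casts, which
// gives up exactly the compile-time checks the variations were meant to add:
class Plus implements ScalarFunction2<Object, Object> {
    @Override
    public Object call(Object x, Object y) {
        if (x instanceof Integer) {
            return (Integer) x + (Integer) y;
        }
        return (Long) x + (Long) y;
    }
}
```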

Does that sound like a good plan to everyone? If so, I’ll update the SPIP
doc so we can move forward.

On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, All.
>
> We shared many opinions in different perspectives.
> However, we didn't reach a consensus even on a partial merge that excludes
> something (proposed on the PR by me, and on this mailing thread by Wenchen).
>
> For the following claims, we have another alternative to mitigate it.
>
>     > I don't like it because it promotes the row-parameter API and forces
> users to implement it, even if the users want to only use the
> individual-parameters API.
>
> Why don't we merge the AS-IS PR by adding something instead of excluding
> something?
>
>     - R produceResult(InternalRow input);
>     + default R produceResult(InternalRow input) throws Exception {
>     +   throw new UnsupportedOperationException();
>     + }
>
> By providing the default implementation, we will not technically be *forcing
> users to implement it*.
> And, we can properly document the expected usage.
> What do you think?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Yes, GenericInternalRow is safe when types mismatch, at the cost of
>> using Object[], and primitive types need boxing
>>
>> The question is not whether to use the magic functions, which would not
>> need boxing. The question here is whether to use multiple ScalarFunction
>> interfaces. Those interfaces would require boxing or using Object[] so
>> there isn’t a benefit.
>>
>> If we do want to reuse one UDF for different types, using “magical
>> methods” solves the problem
>>
>> Yes, that’s correct. We agree that magic methods are a good option for
>> this.
>>
>> Again, the question we need to decide is whether to use InternalRow or
>> interfaces like ScalarFunction2 for non-codegen. The option to use
>> multiple interfaces is limited by type erasure because you can only have
>> one set of type parameters. If you wanted to support both
>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
>> have to fall back to ScalarFunction2<Object, Object> and cast.
>>
>> The point is that type erasure will commonly lead either to many
>> different implementation classes (one for each type combination) or will
>> lead to parameterizing by Object, which defeats the purpose.
>>
>> The alternative adds safety because correct types are produced by calls
>> like getLong(0). Yes, this depends on the implementation making the
>> correct calls, but it is better than using Object and casting.
>>
>> I can’t think of real use cases that will force the individual-parameters
>> approach to use Object instead of concrete types.
>>
>> I think this is addressed by the type erasure discussion above. A simple
>> Plus method would require Object or 12 implementations for 2 arguments
>> and 4 numeric types.
>>
>> And basically all varargs cases would need to use Object[]. Consider a
>> UDF to create a map that requires string keys and some consistent type for
>> values. This would be easy with the InternalRow API because you can use
>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>> Use of UTF8String vs String will be checked at compile time.
>>
>> I agree that Object[] is worse than InternalRow
>>
>> Yes, and if we are always using Object because of type erasure or using
>> magic methods to get better performance, the utility of the parameterized
>> interfaces is very limited.
>>
>> Because we want to expose the magic functions, the use of ScalarFunction2
>> and similar is extremely limited because it is only for non-codegen.
>> Varargs is by far the more common case. The InternalRow interface is
>> also a very simple way to get started and ensures that Spark can always
>> find the right method after the function is bound to input types.
>>
>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Yes, GenericInternalRow is safe when types mismatch, at the cost of
>>> using Object[], and primitive types need boxing. And this is a
>>> runtime failure, which is strictly worse than query-compile-time checks.
>>> Also, don't forget my previous point: users need to specify the type and
>>> index such as row.getLong(0), which is error-prone.
>>>
>>> > But we don’t do that for any of the similar UDFs today so I’m
>>> skeptical that this would actually be a high enough priority to implement.
>>>
>>> I'd say this is a must-have if we go with the individual-parameters
>>> approach. The Scala UDF today checks the method signature at compile-time,
>>> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
>>> to use.
>>>
>>> > You can’t implement ScalarFunction2<Integer, Integer> and
>>> ScalarFunction2<Long, Long>.
>>>
>>> Can you elaborate? We have function binding and we can use
>>> different UDFs for different input types. If we do want to reuse one UDF
>>> for different types, using "magical methods" solves the problem:
>>> class MyUDF {
>>>   def call(i: Int): Int = ...
>>>   def call(l: Long): Long = ...
>>> }
>>>
>>> On the other side, I don't think the row-parameter approach can solve
>>> this problem. The best I can think of is:
>>> class MyUDF extends ScalarFunction[Object] {
>>>   def call(row: InternalRow): Object = {
>>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>   }
>>> }
>>>
>>> This is worse because: 1) it needs if-else checks on the different input
>>> types, and 2) the return type can only be Object, which causes boxing issues.
>>>
>>> I agree that Object[] is worse than InternalRow. But I can't think of
>>> real use cases that will force the individual-parameters approach to use
>>> Object instead of concrete types.
>>>
>>>
>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Thanks for adding your perspective, Erik!
>>>>
>>>> If the input is string type but the UDF implementation calls
>>>> row.getLong(0), it returns wrong data
>>>>
>>>> I think this is misleading. It is true for UnsafeRow, but there is no
>>>> reason why InternalRow should return incorrect values.
>>>>
>>>> The implementation in GenericInternalRow
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>>>> would throw a ClassCastException. I don’t think that using a row is a
>>>> bad option simply because UnsafeRow is unsafe.
>>>>
>>>> It’s unlikely that UnsafeRow would be used to pass the data. The
>>>> implementation would evaluate each argument expression and set the result
>>>> in a generic row, then pass that row to the UDF. We can use whatever
>>>> implementation we choose to provide better guarantees than unsafe.
>>>>
>>>> I think we should consider query-compile-time checks as nearly-as-good
>>>> as Java-compile-time checks for the purposes of safety.
>>>>
>>>> I don’t think I agree with this. A failure at query analysis time vs
>>>> runtime still requires going back to a separate project, fixing something,
>>>> and rebuilding. The time needed to fix a problem goes up significantly vs.
>>>> compile-time checks. And that is even worse if the UDF is maintained by
>>>> someone else.
>>>>
>>>> I think we also need to consider how common it would be that a use case
>>>> can have the query-compile-time checks. Going through this in more detail
>>>> below makes me think that it is unlikely that these checks would be used
>>>> often because of the limitations of using an interface with type erasure.
>>>>
>>>> I believe that Wenchen’s proposal will provide stronger
>>>> query-compile-time safety
>>>>
>>>> The proposal could have better safety for each argument, assuming that
>>>> we detect failures by looking at the parameter types using reflection in
>>>> the analyzer. But we don’t do that for any of the similar UDFs today so I’m
>>>> skeptical that this would actually be a high enough priority to implement.
>>>>
>>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>>> can’t implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long,
>>>> Long>. You can handle those cases using InternalRow or you can handle
>>>> them using VarargScalarFunction<Object>. That forces many use cases
>>>> into varargs with Object, where you don’t get any of the proposed
>>>> analyzer benefits and lose compile-time checks. The only time the
>>>> additional checks (if implemented) would help is when only one set of
>>>> argument types is needed, because implementing ScalarFunction2<Object,
>>>> Object> defeats the purpose.
>>>>
>>>> It’s worth noting that safety for the magic methods would be identical
>>>> between the two options, so the trade-off to consider is for varargs and
>>>> non-codegen cases. Combining the limitations discussed, this has better
>>>> safety guarantees only if you need just one set of types for each number of
>>>> arguments and are using the non-codegen path. Since varargs is one of the
>>>> primary reasons to use this API, then I don’t think that it is a good idea
>>>> to use Object[] instead of InternalRow.
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
