Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Liang-Chi Hsieh


Yeah, in short this is a great compromise approach and I'd like to see this
proposal move forward to the next step. This discussion is valuable.


Chao Sun wrote
> +1 on Dongjoon's proposal. Great to see this is getting moved forward and
> thanks everyone for the insightful discussion!
> 
> 
> On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rblue@...> wrote:
> 
>> Okay, great. I'll update the SPIP doc and call a vote in the next day or
>> two.








Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Chao Sun
+1 on Dongjoon's proposal. Great to see this is getting moved forward and
thanks everyone for the insightful discussion!



On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue  wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or
> two.
>
> On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:
>
>> +1 on Dongjoon's proposal. This is a very nice compromise between the
>> reflective/magic-method approach and the InternalRow approach, providing
>> a lot of flexibility for our users, and allowing for the more complicated
>> reflection-based approach to evolve at its own pace, since you can always
>> fall back to InternalRow for situations which aren't yet supported by
>> reflection.
>>
>> We can even consider having Spark code detect that you haven't overridden
>> the default produceResult (IIRC this is discoverable via reflection),
>> and raise an error at query analysis time instead of at runtime when it
>> can't find a reflective method or an overridden produceResult.
>>
>> I'm very pleased we have found a compromise that everyone seems happy
>> with! Big thanks to everyone who participated.
>>
>> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>>
>>> +1 Good plan to move forward.
>>>
>>> Thank you all for the constructive and comprehensive discussions in this
>>> thread! Decisions on this important feature will have ramifications for
>>> years to come.
>>>
>>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>>
 +1 to this proposal. If people don't like the ScalarFunction0,1, ...
 variants and prefer the "magical methods", then we can have a single
 ScalarFunction interface which has the row-parameter API (with a
 default implementation to fail) and documents to describe the "magical
 methods" (which can be done later).

 I'll start the PR review this week to check the naming, doc, etc.

 Thanks all for the discussion here and let's move forward!

 On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:

> Good point, Dongjoon. I think we can probably come to some compromise
> here:
>
>- Remove SupportsInvoke since it isn’t really needed. We should
>always try to find the right method to invoke in the codegen path.
>- Add a default implementation of produceResult so that
>implementations don’t have to use it. If they don’t implement it and a
>magic function can’t be found, then it will throw
>UnsupportedOperationException
>
> This is assuming that we can agree not to introduce all of the
> ScalarFunction interface variations, which would have limited utility
> because of type erasure.
>
> Does that sound like a good plan to everyone? If so, I’ll update the
> SPIP doc so we can move forward.
>
> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We shared many opinions from different perspectives.
>> However, we didn't reach a consensus even on a partial merge by
>> excluding something
>> (on the PR by me, on this mailing thread by Wenchen).
>>
>> For the following claim, we have an alternative to mitigate it.
>>
>> > I don't like it because it promotes the row-parameter API and
>> forces users to implement it, even if the users want to only use the
>> individual-parameters API.
>>
>> Why don't we merge the AS-IS PR by adding something instead of
>> excluding something?
>>
>> - R produceResult(InternalRow input);
>> + default R produceResult(InternalRow input) throws Exception {
>> +   throw new UnsupportedOperationException();
>> + }
>>
>> By providing the default implementation, it will not be *forcing users
>> to implement it* technically.
>> And we can properly document our expected usage.
>> What do you think?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>
>>> Yes, GenericInternalRow is safe when there is a type mismatch, with the
>>> cost of using Object[], and primitive types need boxing
>>>
>>> The question is not whether to use the magic functions, which would
>>> not need boxing. The question here is whether to use multiple
>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>> using Object[] so there isn’t a benefit.
>>>
>>> If we do want to reuse one UDF for different types, using “magical
>>> methods” solves the problem
>>>
>>> Yes, that’s correct. We agree that magic methods are a good option
>>> for this.
>>>
>>> Again, the question we need to decide is whether to use InternalRow
>>> or interfaces like ScalarFunction2 for non-codegen. The option to
>>> use multiple interfaces is limited by type erasure because you can only
>>> have one set of type parameters. If you wanted to support both
>>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
>>> have to fall back to ScalarFunction2<Object, Object> and cast.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Ryan Blue
Okay, great. I'll update the SPIP doc and call a vote in the next day or
two.

On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the
> reflective/magic-method approach and the InternalRow approach, providing
> a lot of flexibility for our users, and allowing for the more complicated
> reflection-based approach to evolve at its own pace, since you can always
> fall back to InternalRow for situations which aren't yet supported by
> reflection.
>
> We can even consider having Spark code detect that you haven't overridden
> the default produceResult (IIRC this is discoverable via reflection), and
> raise an error at query analysis time instead of at runtime when it can't
> find a reflective method or an overridden produceResult.
>
> I'm very pleased we have found a compromise that everyone seems happy
> with! Big thanks to everyone who participated.
>
> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>
>> +1 Good plan to move forward.
>>
>> Thank you all for the constructive and comprehensive discussions in this
>> thread! Decisions on this important feature will have ramifications for
>> years to come.
>>
>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>
>>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>>> variants and prefer the "magical methods", then we can have a single
>>> ScalarFunction interface which has the row-parameter API (with a
>>> default implementation to fail) and documents to describe the "magical
>>> methods" (which can be done later).
>>>
>>> I'll start the PR review this week to check the naming, doc, etc.
>>>
>>> Thanks all for the discussion here and let's move forward!
>>>
>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>>>
 Good point, Dongjoon. I think we can probably come to some compromise
 here:

- Remove SupportsInvoke since it isn’t really needed. We should
always try to find the right method to invoke in the codegen path.
- Add a default implementation of produceResult so that
implementations don’t have to use it. If they don’t implement it and a
magic function can’t be found, then it will throw
UnsupportedOperationException

 This is assuming that we can agree not to introduce all of the
 ScalarFunction interface variations, which would have limited utility
 because of type erasure.

 Does that sound like a good plan to everyone? If so, I’ll update the
 SPIP doc so we can move forward.

 On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> We shared many opinions from different perspectives.
> However, we didn't reach a consensus even on a partial merge by
> excluding something
> (on the PR by me, on this mailing thread by Wenchen).
>
> For the following claim, we have an alternative to mitigate it.
>
> > I don't like it because it promotes the row-parameter API and
> forces users to implement it, even if the users want to only use the
> individual-parameters API.
>
> Why don't we merge the AS-IS PR by adding something instead of
> excluding something?
>
> - R produceResult(InternalRow input);
> + default R produceResult(InternalRow input) throws Exception {
> +   throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not be *forcing users to
> implement it* technically.
> And we can properly document our expected usage.
> What do you think?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>
>> Yes, GenericInternalRow is safe when there is a type mismatch, with the
>> cost of using Object[], and primitive types need boxing
>>
>> The question is not whether to use the magic functions, which would
>> not need boxing. The question here is whether to use multiple
>> ScalarFunction interfaces. Those interfaces would require boxing or
>> using Object[] so there isn’t a benefit.
>>
>> If we do want to reuse one UDF for different types, using “magical
>> methods” solves the problem
>>
>> Yes, that’s correct. We agree that magic methods are a good option
>> for this.
>>
>> Again, the question we need to decide is whether to use InternalRow
>> or interfaces like ScalarFunction2 for non-codegen. The option to
>> use multiple interfaces is limited by type erasure because you can only
>> have one set of type parameters. If you wanted to support both
>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
>> have to fall back to ScalarFunction2<Object, Object> and cast.
>>
>> The point is that type erasure will commonly lead either to many
>> different implementation classes (one for each type combination) or will
>> lead to parameterizing by Object, which defeats the purpose.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Erik Krogen
+1 on Dongjoon's proposal. This is a very nice compromise between the
reflective/magic-method approach and the InternalRow approach, providing a
lot of flexibility for our users, and allowing for the more complicated
reflection-based approach to evolve at its own pace, since you can always
fall back to InternalRow for situations which aren't yet supported by
reflection.

We can even consider having Spark code detect that you haven't overridden
the default produceResult (IIRC this is discoverable via reflection), and
raise an error at query analysis time instead of at runtime when it can't
find a reflective method or an overridden produceResult.
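
A minimal sketch of that detection, assuming the proposed ScalarFunction
interface keeps a default produceResult(InternalRow); this is illustrative
only, not Spark source:

import java.lang.reflect.Method;

import org.apache.spark.sql.catalyst.InternalRow;

final class ProduceResultCheck {
  // If the UDF class never overrides produceResult, getMethod resolves to
  // the interface's default method and isDefault() is true; the analyzer
  // could then require a matching magic method up front instead of failing
  // at runtime.
  static boolean overridesProduceResult(Class<?> udfClass) throws NoSuchMethodException {
    Method m = udfClass.getMethod("produceResult", InternalRow.class);
    return !m.isDefault();
  }
}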

I'm very pleased we have found a compromise that everyone seems happy with!
Big thanks to everyone who participated.

On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:

> +1 Good plan to move forward.
>
> Thank you all for the constructive and comprehensive discussions in this
> thread! Decisions on this important feature will have ramifications for
> years to come.
>
> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>
>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>> variants and prefer the "magical methods", then we can have a single
>> ScalarFunction interface which has the row-parameter API (with a default
>> implementation to fail) and documents to describe the "magical methods"
>> (which can be done later).
>>
>> I'll start the PR review this week to check the naming, doc, etc.
>>
>> Thanks all for the discussion here and let's move forward!
>>
>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>>
>>> Good point, Dongjoon. I think we can probably come to some compromise
>>> here:
>>>
>>>- Remove SupportsInvoke since it isn’t really needed. We should
>>>always try to find the right method to invoke in the codegen path.
>>>- Add a default implementation of produceResult so that
>>>implementations don’t have to use it. If they don’t implement it and a
>>>magic function can’t be found, then it will throw
>>>UnsupportedOperationException
>>>
>>> This is assuming that we can agree not to introduce all of the
>>> ScalarFunction interface variations, which would have limited utility
>>> because of type erasure.
>>>
>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>> SPIP doc so we can move forward.
>>>
>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 We shared many opinions from different perspectives.
 However, we didn't reach a consensus even on a partial merge by
 excluding something
 (on the PR by me, on this mailing thread by Wenchen).

 For the following claim, we have an alternative to mitigate it.

 > I don't like it because it promotes the row-parameter API and
 forces users to implement it, even if the users want to only use the
 individual-parameters API.

 Why don't we merge the AS-IS PR by adding something instead of
 excluding something?

 - R produceResult(InternalRow input);
 + default R produceResult(InternalRow input) throws Exception {
 +   throw new UnsupportedOperationException();
 + }

 By providing the default implementation, it will not be *forcing users to
 implement it* technically.
 And we can properly document our expected usage.
 What do you think?

 Bests,
 Dongjoon.



 On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:

> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost
> of using Object[], and primitive types need boxing
>
> The question is not whether to use the magic functions, which would
> not need boxing. The question here is whether to use multiple
> ScalarFunction interfaces. Those interfaces would require boxing or
> using Object[] so there isn’t a benefit.
>
> If we do want to reuse one UDF for different types, using “magical
> methods” solves the problem
>
> Yes, that’s correct. We agree that magic methods are a good option for
> this.
>
> Again, the question we need to decide is whether to use InternalRow
> or interfaces like ScalarFunction2 for non-codegen. The option to use
> multiple interfaces is limited by type erasure because you can only have
> one set of type parameters. If you wanted to support both
> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
> have to fall back to ScalarFunction2<Object, Object> and cast.
>
> The point is that type erasure will commonly lead either to many
> different implementation classes (one for each type combination) or will
> lead to parameterizing by Object, which defeats the purpose.
>
> The alternative adds safety because correct types are produced by
> calls like getLong(0). Yes, this depends on the implementation making
> the correct calls, but it is better than using Object and casting.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread John Zhuge
+1 Good plan to move forward.

Thank you all for the constructive and comprehensive discussions in this
thread! Decisions on this important feature will have ramifications for
years to come.

On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:

> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
> variants and prefer the "magical methods", then we can have a single
> ScalarFunction interface which has the row-parameter API (with a default
> implementation to fail) and documents to describe the "magical methods"
> (which can be done later).
>
> I'll start the PR review this week to check the naming, doc, etc.
>
> Thanks all for the discussion here and let's move forward!
>
> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>
>> Good point, Dongjoon. I think we can probably come to some compromise
>> here:
>>
>>- Remove SupportsInvoke since it isn’t really needed. We should
>>always try to find the right method to invoke in the codegen path.
>>- Add a default implementation of produceResult so that
>>implementations don’t have to use it. If they don’t implement it and a
>>magic function can’t be found, then it will throw
>>UnsupportedOperationException
>>
>> This is assuming that we can agree not to introduce all of the
>> ScalarFunction interface variations, which would have limited utility
>> because of type erasure.
>>
>> Does that sound like a good plan to everyone? If so, I’ll update the SPIP
>> doc so we can move forward.
>>
>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> We shared many opinions from different perspectives.
>>> However, we didn't reach a consensus even on a partial merge by
>>> excluding something
>>> (on the PR by me, on this mailing thread by Wenchen).
>>>
>>> For the following claim, we have an alternative to mitigate it.
>>>
>>> > I don't like it because it promotes the row-parameter API and
>>> forces users to implement it, even if the users want to only use the
>>> individual-parameters API.
>>>
>>> Why don't we merge the AS-IS PR by adding something instead of excluding
>>> something?
>>>
>>> - R produceResult(InternalRow input);
>>> + default R produceResult(InternalRow input) throws Exception {
>>> +   throw new UnsupportedOperationException();
>>> + }
>>>
>>> By providing the default implementation, it will not be *forcing users to
>>> implement it* technically.
>>> And we can properly document our expected usage.
>>> What do you think?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>>
 Yes, GenericInternalRow is safe when there is a type mismatch, with the cost
 of using Object[], and primitive types need boxing

 The question is not whether to use the magic functions, which would not
 need boxing. The question here is whether to use multiple
 ScalarFunction interfaces. Those interfaces would require boxing or
 using Object[] so there isn’t a benefit.

 If we do want to reuse one UDF for different types, using “magical
 methods” solves the problem

 Yes, that’s correct. We agree that magic methods are a good option for
 this.

 Again, the question we need to decide is whether to use InternalRow or
 interfaces like ScalarFunction2 for non-codegen. The option to use
 multiple interfaces is limited by type erasure because you can only have
 one set of type parameters. If you wanted to support both
 ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
 have to fall back to ScalarFunction2<Object, Object> and cast.

 The point is that type erasure will commonly lead either to many
 different implementation classes (one for each type combination) or will
 lead to parameterizing by Object, which defeats the purpose.

 The alternative adds safety because correct types are produced by calls
 like getLong(0). Yes, this depends on the implementation making the
 correct calls, but it is better than using Object and casting.

 I can’t think of real use cases that will force the
 individual-parameters approach to use Object instead of concrete types.

 I think this is addressed by the type erasure discussion above. A
 simple Plus method would require Object or 12 implementations for 2
 arguments and 4 numeric types.

 And basically all varargs cases would need to use Object[]. Consider a
 UDF to create a map that requires string keys and some consistent type for
 values. This would be easy with the InternalRow API because you can
 use getString(pos) and get(pos + 1, valueType) to get the key/value
 pairs. Use of UTF8String vs String will be checked at compile time.

 I agree that Object[] is worse than InternalRow

 Yes, and if we are always using Object because of type erasure or
 using magic methods to get better performance, the utility of the
 parameterized interfaces is very limited.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Wenchen Fan
+1 to this proposal. If people don't like the ScalarFunction0,1, ...
variants and prefer the "magical methods", then we can have a single
ScalarFunction interface which has the row-parameter API (with a default
implementation to fail) and documents to describe the "magical methods"
(which can be done later).

I'll start the PR review this week to check the naming, doc, etc.

Thanks all for the discussion here and let's move forward!

On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:

> Good point, Dongjoon. I think we can probably come to some compromise here:
>
>- Remove SupportsInvoke since it isn’t really needed. We should always
>try to find the right method to invoke in the codegen path.
>- Add a default implementation of produceResult so that
>implementations don’t have to use it. If they don’t implement it and a
>magic function can’t be found, then it will throw
>UnsupportedOperationException
>
> This is assuming that we can agree not to introduce all of the
> ScalarFunction interface variations, which would have limited utility
> because of type erasure.
>
> Does that sound like a good plan to everyone? If so, I’ll update the SPIP
> doc so we can move forward.
>
> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We shared many opinions from different perspectives.
>> However, we didn't reach a consensus even on a partial merge by excluding
>> something
>> (on the PR by me, on this mailing thread by Wenchen).
>>
>> For the following claim, we have an alternative to mitigate it.
>>
>> > I don't like it because it promotes the row-parameter API and
>> forces users to implement it, even if the users want to only use the
>> individual-parameters API.
>>
>> Why don't we merge the AS-IS PR by adding something instead of excluding
>> something?
>>
>> - R produceResult(InternalRow input);
>> + default R produceResult(InternalRow input) throws Exception {
>> +   throw new UnsupportedOperationException();
>> + }
>>
>> By providing the default implementation, it will not be *forcing users to
>> implement it* technically.
>> And we can properly document our expected usage.
>> What do you think?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>
>>> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost
>>> of using Object[], and primitive types need boxing
>>>
>>> The question is not whether to use the magic functions, which would not
>>> need boxing. The question here is whether to use multiple ScalarFunction
>>> interfaces. Those interfaces would require boxing or using Object[] so
>>> there isn’t a benefit.
>>>
>>> If we do want to reuse one UDF for different types, using “magical
>>> methods” solves the problem
>>>
>>> Yes, that’s correct. We agree that magic methods are a good option for
>>> this.
>>>
>>> Again, the question we need to decide is whether to use InternalRow or
>>> interfaces like ScalarFunction2 for non-codegen. The option to use
>>> multiple interfaces is limited by type erasure because you can only have
>>> one set of type parameters. If you wanted to support both
>>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
>>> have to fall back to ScalarFunction2<Object, Object> and cast.
>>>
>>> The point is that type erasure will commonly lead either to many
>>> different implementation classes (one for each type combination) or will
>>> lead to parameterizing by Object, which defeats the purpose.
>>>
>>> The alternative adds safety because correct types are produced by calls
>>> like getLong(0). Yes, this depends on the implementation making the
>>> correct calls, but it is better than using Object and casting.
>>>
>>> I can’t think of real use cases that will force the
>>> individual-parameters approach to use Object instead of concrete types.
>>>
>>> I think this is addressed by the type erasure discussion above. A simple
>>> Plus method would require Object or 12 implementations for 2 arguments
>>> and 4 numeric types.
>>>
>>> And basically all varargs cases would need to use Object[]. Consider a
>>> UDF to create a map that requires string keys and some consistent type for
>>> values. This would be easy with the InternalRow API because you can use
>>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>>> Use of UTF8String vs String will be checked at compile time.
>>>
>>> I agree that Object[] is worse than InternalRow
>>>
>>> Yes, and if we are always using Object because of type erasure or using
>>> magic methods to get better performance, the utility of the parameterized
>>> interfaces is very limited.
>>>
>>> Because we want to expose the magic functions, the use of
>>> ScalarFunction2 and similar is extremely limited because it is only for
>>> non-codegen. Varargs is by far the more common case. The InternalRow
>>> interface is also a very simple way to get started and ensures that Spark
>>> can always find the right method after the function is bound to input types.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Good point, Dongjoon. I think we can probably come to some compromise here:

   - Remove SupportsInvoke since it isn’t really needed. We should always
   try to find the right method to invoke in the codegen path.
   - Add a default implementation of produceResult so that implementations
   don’t have to use it. If they don’t implement it and a magic function can’t
   be found, then it will throw UnsupportedOperationException

This is assuming that we can agree not to introduce all of the
ScalarFunction interface variations, which would have limited utility
because of type erasure.

Does that sound like a good plan to everyone? If so, I’ll update the SPIP
doc so we can move forward.
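
To make the compromise concrete, a hypothetical UDF under this plan could
supply only a typed magic method and silently inherit the failing default,
so the row-parameter path is a fallback rather than an obligation. The
interface and method names below follow the sketches in this thread, not a
released API:

import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical: ScalarFunction<R> is the proposed interface with a default
// produceResult(InternalRow) that throws UnsupportedOperationException.
public class StrLen implements ScalarFunction<Integer> {
  // Typed "magic" method found in the codegen/invoke path; no boxing of the
  // input. produceResult is never overridden here, so if Spark cannot find
  // this method it surfaces UnsupportedOperationException via the default.
  public int invoke(UTF8String str) {
    return str.numChars();
  }
}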

On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> We shared many opinions from different perspectives.
> However, we didn't reach a consensus even on a partial merge by excluding
> something
> (on the PR by me, on this mailing thread by Wenchen).
>
> For the following claim, we have an alternative to mitigate it.
>
> > I don't like it because it promotes the row-parameter API and forces
> users to implement it, even if the users want to only use the
> individual-parameters API.
>
> Why don't we merge the AS-IS PR by adding something instead of excluding
> something?
>
> - R produceResult(InternalRow input);
> + default R produceResult(InternalRow input) throws Exception {
> +   throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not be *forcing users to
> implement it* technically.
> And we can properly document our expected usage.
> What do you think?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>
>> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost of
>> using Object[], and primitive types need boxing
>>
>> The question is not whether to use the magic functions, which would not
>> need boxing. The question here is whether to use multiple ScalarFunction
>> interfaces. Those interfaces would require boxing or using Object[] so
>> there isn’t a benefit.
>>
>> If we do want to reuse one UDF for different types, using “magical
>> methods” solves the problem
>>
>> Yes, that’s correct. We agree that magic methods are a good option for
>> this.
>>
>> Again, the question we need to decide is whether to use InternalRow or
>> interfaces like ScalarFunction2 for non-codegen. The option to use
>> multiple interfaces is limited by type erasure because you can only have
>> one set of type parameters. If you wanted to support both
>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
>> have to fall back to ScalarFunction2<Object, Object> and cast.
>>
>> The point is that type erasure will commonly lead either to many
>> different implementation classes (one for each type combination) or will
>> lead to parameterizing by Object, which defeats the purpose.
>>
>> The alternative adds safety because correct types are produced by calls
>> like getLong(0). Yes, this depends on the implementation making the
>> correct calls, but it is better than using Object and casting.
>>
>> I can’t think of real use cases that will force the individual-parameters
>> approach to use Object instead of concrete types.
>>
>> I think this is addressed by the type erasure discussion above. A simple
>> Plus method would require Object or 12 implementations for 2 arguments
>> and 4 numeric types.
>>
>> And basically all varargs cases would need to use Object[]. Consider a
>> UDF to create a map that requires string keys and some consistent type for
>> values. This would be easy with the InternalRow API because you can use
>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>> Use of UTF8String vs String will be checked at compile time.
>>
>> I agree that Object[] is worse than InternalRow
>>
>> Yes, and if we are always using Object because of type erasure or using
>> magic methods to get better performance, the utility of the parameterized
>> interfaces is very limited.
>>
>> Because we want to expose the magic functions, the use of ScalarFunction2
>> and similar is extremely limited because it is only for non-codegen.
>> Varargs is by far the more common case. The InternalRow interface is
>> also a very simple way to get started and ensures that Spark can always
>> find the right method after the function is bound to input types.
>>
>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:
>>
>>> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost
>>> of using Object[], and primitive types need boxing. And this is a
>>> runtime failure, which is absolutely worse than query-compile-time checks.
>>> Also, don't forget my previous point: users need to specify the type and
>>> index such as row.getLong(0), which is error-prone.
>>>
>>> > But we don’t do that for any of the similar UDFs today so I’m
>>> skeptical that this would actually be a high enough priority to implement.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Dongjoon Hyun
Hi, All.

We shared many opinions from different perspectives.
However, we didn't reach a consensus even on a partial merge by excluding
something
(on the PR by me, on this mailing thread by Wenchen).

For the following claim, we have an alternative to mitigate it.

> I don't like it because it promotes the row-parameter API and forces
users to implement it, even if the users want to only use the
individual-parameters API.

Why don't we merge the AS-IS PR by adding something instead of excluding
something?

- R produceResult(InternalRow input);
+ default R produceResult(InternalRow input) throws Exception {
+   throw new UnsupportedOperationException();
+ }
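
Put together, the single interface would then look roughly like this (a
sketch matching the diff above and the names in the PR under discussion,
with the binding/metadata methods omitted; not the final API):

import org.apache.spark.sql.catalyst.InternalRow;

// Sketch of the proposed change: keep the row-parameter API on the single
// ScalarFunction interface, but give it a failing default body so that
// implementations providing only "magic" methods compile unchanged.
public interface ScalarFunction<R> {
  default R produceResult(InternalRow input) throws Exception {
    throw new UnsupportedOperationException(
        "Implement produceResult or provide a magic method");
  }
}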

By providing the default implementation, it will not be *forcing users to
implement it* technically.
And we can properly document our expected usage.
What do you think?

Bests,
Dongjoon.



On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:

> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost of
> using Object[], and primitive types need boxing
>
> The question is not whether to use the magic functions, which would not
> need boxing. The question here is whether to use multiple ScalarFunction
> interfaces. Those interfaces would require boxing or using Object[] so
> there isn’t a benefit.
>
> If we do want to reuse one UDF for different types, using “magical
> methods” solves the problem
>
> Yes, that’s correct. We agree that magic methods are a good option for
> this.
>
> Again, the question we need to decide is whether to use InternalRow or
> interfaces like ScalarFunction2 for non-codegen. The option to use
> multiple interfaces is limited by type erasure because you can only have
> one set of type parameters. If you wanted to support both
> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d
> have to fall back to ScalarFunction2<Object, Object> and cast.
>
> The point is that type erasure will commonly lead either to many different
> implementation classes (one for each type combination) or will lead to
> parameterizing by Object, which defeats the purpose.
>
> The alternative adds safety because correct types are produced by calls
> like getLong(0). Yes, this depends on the implementation making the
> correct calls, but it is better than using Object and casting.
>
> I can’t think of real use cases that will force the individual-parameters
> approach to use Object instead of concrete types.
>
> I think this is addressed by the type erasure discussion above. A simple
> Plus method would require Object or 12 implementations for 2 arguments
> and 4 numeric types.
>
> And basically all varargs cases would need to use Object[]. Consider a
> UDF to create a map that requires string keys and some consistent type for
> values. This would be easy with the InternalRow API because you can use
> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
> Use of UTF8String vs String will be checked at compile time.
>
> I agree that Object[] is worse than InternalRow
>
> Yes, and if we are always using Object because of type erasure or using
> magic methods to get better performance, the utility of the parameterized
> interfaces is very limited.
>
> Because we want to expose the magic functions, the use of ScalarFunction2
> and similar is extremely limited because it is only for non-codegen.
> Varargs is by far the more common case. The InternalRow interface is also
> a very simple way to get started and ensures that Spark can always find the
> right method after the function is bound to input types.
>
> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:
>
>> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost
>> of using Object[], and primitive types need boxing. And this is a
>> runtime failure, which is absolutely worse than query-compile-time checks.
>> Also, don't forget my previous point: users need to specify the type and
>> index such as row.getLong(0), which is error-prone.
>>
>> > But we don’t do that for any of the similar UDFs today so I’m skeptical
>> that this would actually be a high enough priority to implement.
>>
>> I'd say this is a must-have if we go with the individual-parameters
>> approach. The Scala UDF today checks the method signature at compile-time,
>> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
>> to use.
>>
>> > You can’t implement ScalarFunction2<Integer, Integer> and
>> ScalarFunction2<Long, Long>.
>>
>> Can you elaborate? We have function binding and we can use different UDFs
>> for different input types. If we do want to reuse one UDF
>> for different types, using "magical methods" solves the problem:
>> class MyUDF {
>>   def call(i: Int): Int = ...
>>   def call(l: Long): Long = ...
>> }
>>
>> On the other side, I don't think the row-parameter approach can solve
>> this problem. The best I can think of is:
>> class MyUDF implements ScalarFunction[Object] {
>>   def call(row: InternalRow): Object = {
>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>   }
>> }

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Yes, GenericInternalRow is safe when there is a type mismatch, with the cost of
using Object[], and primitive types need boxing

The question is not whether to use the magic functions, which would not
need boxing. The question here is whether to use multiple ScalarFunction
interfaces. Those interfaces would require boxing or using Object[] so
there isn’t a benefit.

If we do want to reuse one UDF for different types, using “magical methods”
solves the problem

Yes, that’s correct. We agree that magic methods are a good option for this.

Again, the question we need to decide is whether to use InternalRow or
interfaces like ScalarFunction2 for non-codegen. The option to use multiple
interfaces is limited by type erasure because you can only have one set of
type parameters. If you wanted to support both ScalarFunction2<Integer,
Integer> and ScalarFunction2<Long, Long> you’d have to fall back to
ScalarFunction2<Object, Object> and cast.

The point is that type erasure will commonly lead either to many different
implementation classes (one for each type combination) or will lead to
parameterizing by Object, which defeats the purpose.
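
A hedged illustration of that erasure limit, using a hypothetical
two-argument variant as named in this thread (the interface shape and the
method name call are assumptions, not the proposal's exact API):

// Hypothetical variant interface, roughly as discussed in this thread:
interface ScalarFunction2<T1, T2> {
  Object call(T1 left, T2 right);
}

// Won't compile -- both parameterizations erase to the same raw type:
//   class Plus implements ScalarFunction2<Integer, Integer>,
//                         ScalarFunction2<Long, Long> { ... }

// The single-class fallback parameterizes by Object and casts, which gives
// up the compile-time checks and boxes every value:
class Plus implements ScalarFunction2<Object, Object> {
  @Override
  public Object call(Object left, Object right) {
    if (left instanceof Integer) {
      return (Integer) left + (Integer) right;
    }
    return (Long) left + (Long) right;
  }
}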

The alternative adds safety because correct types are produced by calls
like getLong(0). Yes, this depends on the implementation making the correct
calls, but it is better than using Object and casting.

I can’t think of real use cases that will force the individual-parameters
approach to use Object instead of concrete types.

I think this is addressed by the type erasure discussion above. A simple
Plus method would require Object or 12 implementations for 2 arguments and
4 numeric types.

And basically all varargs cases would need to use Object[]. Consider a UDF
to create a map that requires string keys and some consistent type for
values. This would be easy with the InternalRow API because you can use
getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use
of UTF8String vs String will be checked at compile time.
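
A rough sketch of that vararg map UDF against the InternalRow API. The
produceResult signature and the bind-time valueType follow the proposal in
this thread and are assumptions; only the InternalRow and catalyst-util
calls are existing Spark classes:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
import org.apache.spark.sql.catalyst.util.GenericArrayData;
import org.apache.spark.sql.catalyst.util.MapData;
import org.apache.spark.sql.types.DataType;

// Hypothetical vararg map-building UDF; input alternates key, value, key, ...
public class MakeMap {
  private final DataType valueType; // assumed captured when the function was bound

  MakeMap(DataType valueType) {
    this.valueType = valueType;
  }

  public MapData produceResult(InternalRow input) {
    int pairs = input.numFields() / 2;
    Object[] keys = new Object[pairs];
    Object[] values = new Object[pairs];
    for (int i = 0; i < pairs; i++) {
      keys[i] = input.getUTF8String(2 * i);        // keys must be strings; checked at compile time
      values[i] = input.get(2 * i + 1, valueType); // one consistent value type
    }
    return new ArrayBasedMapData(new GenericArrayData(keys), new GenericArrayData(values));
  }
}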

I agree that Object[] is worse than InternalRow

Yes, and if we are always using Object because of type erasure or using
magic methods to get better performance, the utility of the parameterized
interfaces is very limited.

Because we want to expose the magic functions, the use of ScalarFunction2
and similar is extremely limited because it is only for non-codegen.
Varargs is by far the more common case. The InternalRow interface is also a
very simple way to get started and ensures that Spark can always find the
right method after the function is bound to input types.
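
That lookup order might look roughly like this (an illustrative sketch, not
Spark source; the magic-method name invoke is assumed from the SPIP
discussion):

import java.lang.reflect.Method;

import org.apache.spark.sql.catalyst.InternalRow;

// Illustrative dispatch: prefer a typed magic method that matches the bound
// input classes, else fall back to the always-present row-parameter API.
final class UdfDispatch {
  static Object call(Object udf, InternalRow row, Object[] args, Class<?>[] argClasses)
      throws Exception {
    try {
      Method magic = udf.getClass().getMethod("invoke", argClasses);
      return magic.invoke(udf, args);
    } catch (NoSuchMethodException e) {
      // May throw UnsupportedOperationException if only the default exists.
      return udf.getClass().getMethod("produceResult", InternalRow.class).invoke(udf, row);
    }
  }
}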

On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:

> Yes, GenericInternalRow is safe when there is a type mismatch, with the cost of
> using Object[], and primitive types need boxing. And this is a
> runtime failure, which is absolutely worse than query-compile-time checks.
> Also, don't forget my previous point: users need to specify the type and
> index such as row.getLong(0), which is error-prone.
>
> > But we don’t do that for any of the similar UDFs today so I’m skeptical
> that this would actually be a high enough priority to implement.
>
> I'd say this is a must-have if we go with the individual-parameters
> approach. The Scala UDF today checks the method signature at compile-time,
> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
> to use.
>
> > You can’t implement ScalarFunction2<Integer, Integer> and
> ScalarFunction2<Long, Long>.
>
> Can you elaborate? We have function binding and we can use different UDFs
> for different input types. If we do want to reuse one UDF
> for different types, using "magical methods" solves the problem:
> class MyUDF {
>   def call(i: Int): Int = ...
>   def call(l: Long): Long = ...
> }
>
> On the other side, I don't think the row-parameter approach can solve this
> problem. The best I can think of is:
> class MyUDF implements ScalarFunction[Object] {
>   def call(row: InternalRow): Object = {
>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>   }
> }
>
> This is worse because: 1) it needs to do if-else to check different input
> types. 2) the return type can only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue  wrote:
>
>> Thanks for adding your perspective, Erik!
>>
>> If the input is string type but the UDF implementation calls
>> row.getLong(0), it returns wrong data
>>
>> I think this is misleading. It is true for UnsafeRow, but there is no
>> reason why InternalRow should return incorrect values.
>>
>> The implementation in GenericInternalRow
>> 
>> would throw a ClassCastException. I don’t think that using a row is a
>> bad option simply because UnsafeRow is unsafe.
>>
>> It’s unlikely that UnsafeRow would be used to pass the data. 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-02 Thread Wenchen Fan
Yes, GenericInternalRow is safe when there is a type mismatch, with the cost of
using Object[], and primitive types need boxing. And this is a
runtime failure, which is absolutely worse than query-compile-time checks.
Also, don't forget my previous point: users need to specify the type and
index such as row.getLong(0), which is error-prone.

> But we don’t do that for any of the similar UDFs today so I’m skeptical
that this would actually be a high enough priority to implement.

I'd say this is a must-have if we go with the individual-parameters
approach. The Scala UDF today checks the method signature at compile-time,
thanks to the Scala type tag. The Java UDF today doesn't check and is hard
to use.
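
For what it's worth, such a query-compile-time check could be a simple
reflective lookup during analysis (a sketch assuming the magic method is
named call, as in the MyUDF sketch below; not an existing Spark API):

import java.lang.reflect.Method;
import java.util.Arrays;

final class MagicSignatureCheck {
  // Resolve the typed method against the classes the function was bound to,
  // so a signature mismatch fails during analysis rather than mid-query.
  static Method validate(Class<?> udfClass, Class<?>... boundInputClasses) {
    try {
      return udfClass.getMethod("call", boundInputClasses); // e.g. call(long) for a BIGINT input
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(udfClass.getName()
          + " has no call method matching " + Arrays.toString(boundInputClasses), e);
    }
  }
}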

> You can’t implement ScalarFunction2<Integer, Integer> and
ScalarFunction2<Long, Long>.

Can you elaborate? We have function binding and we can use different UDFs
for different input types. If we do want to reuse one UDF
for different types, using "magical methods" solves the problem:
class MyUDF {
  def call(i: Int): Int = ...
  def call(l: Long): Long = ...
}

On the other side, I don't think the row-parameter approach can solve this
problem. The best I can think of is:
class MyUDF implements ScalarFunction[Object] {
  def call(row: InternalRow): Object = {
    if (int input) row.getInt(0) ... else row.getLong(0) ...
  }
}

This is worse because: 1) it needs to do if-else to check different input
types. 2) the return type can only be Object and cause boxing issues.

I agree that Object[] is worse than InternalRow. But I can't think of real
use cases that will force the individual-parameters approach to use Object
instead of concrete types.


On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue  wrote:

> Thanks for adding your perspective, Erik!
>
> If the input is string type but the UDF implementation calls
> row.getLong(0), it returns wrong data
>
> I think this is misleading. It is true for UnsafeRow, but there is no
> reason why InternalRow should return incorrect values.
>
> The implementation in GenericInternalRow
> 
> would throw a ClassCastException. I don’t think that using a row is a bad
> option simply because UnsafeRow is unsafe.
>
> It’s unlikely that UnsafeRow would be used to pass the data. The
> implementation would evaluate each argument expression and set the result
> in a generic row, then pass that row to the UDF. We can use whatever
> implementation we choose to provide better guarantees than unsafe.
>
> I think we should consider query-compile-time checks as nearly-as-good as
> Java-compile-time checks for the purposes of safety.
>
> I don’t think I agree with this. A failure at query analysis time vs
> runtime still requires going back to a separate project, fixing something,
> and rebuilding. The time needed to fix a problem goes up significantly vs.
> compile-time checks. And that is even worse if the UDF is maintained by
> someone else.
>
> I think we also need to consider how common it would be that a use case
> can have the query-compile-time checks. Going through this in more detail
> below makes me think that it is unlikely that these checks would be used
> often because of the limitations of using an interface with type erasure.
>
> I believe that Wenchen’s proposal will provide stronger query-compile-time
> safety
>
> The proposal could have better safety for each argument, assuming that we
> detect failures by looking at the parameter types using reflection in the
> analyzer. But we don’t do that for any of the similar UDFs today so I’m
> skeptical that this would actually be a high enough priority to implement.
>
> As Erik pointed out, type erasure also limits the effectiveness. You can’t
> implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long,
> Long>. You can handle those cases using InternalRow or you can handle
> them using VarargScalarFunction. That forces many use cases into
> varargs with Object, where you don’t get any of the proposed analyzer
> benefits and lose compile-time checks. The only time the additional checks
> (if implemented) would help is when only one set of argument types is
> needed because implementing ScalarFunction defeats the
> purpose.
>
> It’s worth noting that safety for the magic methods would be identical
> between the two options, so the trade-off to consider is for varargs and
> non-codegen cases. Combining the limitations discussed, this has better
> safety guarantees only if you need just one set of types for each number of
> arguments and are using the non-codegen path. Since varargs is one of the
> primary reasons to use this API, then I don’t think that it is a good idea
> to use Object[] instead of InternalRow.
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-01 Thread Ryan Blue
Thanks for adding your perspective, Erik!

If the input is string type but the UDF implementation calls row.getLong(0),
it returns wrong data

I think this is misleading. It is true for UnsafeRow, but there is no
reason why InternalRow should return incorrect values.

The implementation in GenericInternalRow

would throw a ClassCastException. I don’t think that using a row is a bad
option simply because UnsafeRow is unsafe.

It’s unlikely that UnsafeRow would be used to pass the data. The
implementation would evaluate each argument expression and set the result
in a generic row, then pass that row to the UDF. We can use whatever
implementation we choose to provide better guarantees than unsafe.
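
As a concrete illustration of that difference (a minimal sketch against
current catalyst classes; not part of the proposal itself):

import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

// GenericInternalRow stores boxed values in an Object[]; a mismatched
// accessor fails fast with ClassCastException instead of reinterpreting
// bytes the way UnsafeRow.getLong would.
public class RowSafetyDemo {
  public static void main(String[] args) {
    GenericInternalRow row =
        new GenericInternalRow(new Object[] { UTF8String.fromString("a") });
    row.getLong(0); // ClassCastException: UTF8String cannot be cast to java.lang.Long
  }
}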

I think we should consider query-compile-time checks as nearly-as-good as
Java-compile-time checks for the purposes of safety.

I don’t think I agree with this. A failure at query analysis time vs
runtime still requires going back to a separate project, fixing something,
and rebuilding. The time needed to fix a problem goes up significantly vs.
compile-time checks. And that is even worse if the UDF is maintained by
someone else.

I think we also need to consider how common it would be that a use case can
have the query-compile-time checks. Going through this in more detail below
makes me think that it is unlikely that these checks would be used often
because of the limitations of using an interface with type erasure.

I believe that Wenchen’s proposal will provide stronger query-compile-time
safety

The proposal could have better safety for each argument, assuming that we
detect failures by looking at the parameter types using reflection in the
analyzer. But we don’t do that for any of the similar UDFs today so I’m
skeptical that this would actually be a high enough priority to implement.

As Erik pointed out, type erasure also limits the effectiveness. You can’t
implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
You can handle those cases using InternalRow or you can handle them using
VarargScalarFunction. That forces many use cases into varargs with
Object, where you don’t get any of the proposed analyzer benefits and lose
compile-time checks. The only time the additional checks (if implemented)
would help is when only one set of argument types is needed because
implementing ScalarFunction defeats the purpose.

It’s worth noting that safety for the magic methods would be identical
between the two options, so the trade-off to consider is for varargs and
non-codegen cases. Combining the limitations discussed, this has better
safety guarantees only if you need just one set of types for each number of
arguments and are using the non-codegen path. Since varargs is one of the
primary reasons to use this API, then I don’t think that it is a good idea
to use Object[] instead of InternalRow.
-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-23 Thread Wenchen Fan
>>>  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The proposal already includes an API for aggregate functions
>>>>>>>>>>>>> and I think we would want to implement those right away.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Processing ColumnBatch is something we can easily extend the
>>>>>>>>>>>>> interfaces to support, similar to Wenchen's suggestion. The 
>>>>>>>>>>>>> important thing
>>>>>>>>>>>>> right now is to agree on some basic functionality: how to look up 
>>>>>>>>>>>>> functions
>>>>>>>>>>>>> and what the simple API should be. Like the TableCatalog 
>>>>>>>>>>>>> interfaces, we
>>>>>>>>>>>>> will layer on more support through optional interfaces like
>>>>>>>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <
>>>>>>>>>>>>> andrew.m...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This proposal looks very interesting. Would future goals for
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> functionality include both support for aggregation functions,
>>>>>>>>>>>>>> as well
>>>>>>>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>>>>>>>> Row/InternalRow)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks for the positive feedback, everyone. It sounds like
>>>>>>>>>>>>>> there is a clear path forward for calling functions. Even 
>>>>>>>>>>>>>> without a
>>>>>>>>>>>>>> prototype, the `invoke` plans show that Wenchen's suggested 
>>>>>>>>>>>>>> optimization
>>>>>>>>>>>>>> can be done, and incorporating it as an optional extension to 
>>>>>>>>>>>>>> this proposal
>>>>>>>>>>>>>> solves many of the unknowns.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > With that area now understood, is there any discussion
>>>>>>>>>>>>>> about other parts of the proposal, besides the function call 
>>>>>>>>>>>>>> interface?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <
>>>>>>>>>>>>>> sunc...@apache.org> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> This is an important feature which can unblock several
>>>>>>>>>>>>>> other projects including bucket join support for DataSource v2, 
>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>> support for enforcing DataSource v2 distribution requirements on 
>>>>>>>>>>>>>> the write
>>>>>>>>>>>>>> path, etc. I like Ryan's proposals which look simple and 
>>>>>>>>>>>>>> elegant, with nice
>>>>>>>>>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-23 Thread Dongjoon Hyun
...attention. Also, I'm expecting
>>>>>>>>>>> more and more if we have this on the `master` branch because we are
>>>>>>>>>>> developing together.
>>>>>>>>>>>
>>>>>>>>>>> > Spark SQL has many active contributors/committers and this
>>>>>>>>>>> thread doesn't get much attention yet.
>>>>>>>>>>>
>>>>>>>>>>> So, what's your ETA from now?
>>>>>>>>>>>
>>>>>>>>>>> > I think the problem here is we were discussing some very
>>>>>>>>>>> detailed things without actual code.
>>>>>>>>>>> > I'll implement my idea after the holiday and then we can
>>>>>>>>>>> have more effective discussions.
>>>>>>>>>>> > We can also do benchmarks and get some real numbers.
>>>>>>>>>>> > In the meantime, we can continue to discuss other parts of
>>>>>>>>>>> this proposal, and make a prototype if possible.
>>>>>>>>>>>
>>>>>>>>>>> I'm looking forward to seeing your PR. I hope we can conclude
>>>>>>>>>>> this thread and have at least one implementation in the `master` 
>>>>>>>>>>> branch
>>>>>>>>>>> this month (February).
>>>>>>>>>>> If you need more time (one month or longer), why don't we have
>>>>>>>>>>> Ryan's suggestion in the `master` branch first and benchmark with 
>>>>>>>>>>> your PR
>>>>>>>>>>> later during Apache Spark 3.2 timeframe.
>>>>>>>>>>>
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue
>>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> The proposal already includes an API for aggregate functions
>>>>>>>>>>>> and I think we would want to implement those right away.
>>>>>>>>>>>>
>>>>>>>>>>>> Processing ColumnBatch is something we can easily extend the
>>>>>>>>>>>> interfaces to support, similar to Wenchen's suggestion. The 
>>>>>>>>>>>> important thing
>>>>>>>>>>>> right now is to agree on some basic functionality: how to look up 
>>>>>>>>>>>> functions
>>>>>>>>>>>> and what the simple API should be. Like the TableCatalog 
>>>>>>>>>>>> interfaces, we
>>>>>>>>>>>> will layer on more support through optional interfaces like
>>>>>>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <
>>>>>>>>>>>> andrew.m...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This proposal looks very interesting. Would future goals for
>>>>>>>>>>>>> this
>>>>>>>>>>>>> functionality include both support for aggregation functions,
>>>>>>>>>>>>> as well
>>>>>>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>>>>>>> Row/InternalRow)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>>>>>>>  wrote:
>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-22 Thread Wenchen Fan
>> I think we would want to implement those right away.
>>>>>>>>>>>
>>>>>>>>>>> Processing ColumnBatch is something we can easily extend the
>>>>>>>>>>> interfaces to support, similar to Wenchen's suggestion. The 
>>>>>>>>>>> important thing
>>>>>>>>>>> right now is to agree on some basic functionality: how to look up 
>>>>>>>>>>> functions
>>>>>>>>>>> and what the simple API should be. Like the TableCatalog 
>>>>>>>>>>> interfaces, we
>>>>>>>>>>> will layer on more support through optional interfaces like
>>>>>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <
>>>>>>>>>>> andrew.m...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>> This proposal looks very interesting. Would future goals for
>>>>>>>>>>>> this
>>>>>>>>>>>> functionality include both support for aggregation functions,
>>>>>>>>>>>> as well
>>>>>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>>>>>> Row/InternalRow)?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>>>>>>  wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks for the positive feedback, everyone. It sounds like
>>>>>>>>>>>> there is a clear path forward for calling functions. Even without a
>>>>>>>>>>>> prototype, the `invoke` plans show that Wenchen's suggested 
>>>>>>>>>>>> optimization
>>>>>>>>>>>> can be done, and incorporating it as an optional extension to this 
>>>>>>>>>>>> proposal
>>>>>>>>>>>> solves many of the unknowns.
>>>>>>>>>>>> >
>>>>>>>>>>>> > With that area now understood, is there any discussion about
>>>>>>>>>>>> other parts of the proposal, besides the function call interface?
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> This is an important feature which can unblock several other
>>>>>>>>>>>> projects including bucket join support for DataSource v2, complete 
>>>>>>>>>>>> support
>>>>>>>>>>>> for enforcing DataSource v2 distribution requirements on the write 
>>>>>>>>>>>> path,
>>>>>>>>>>>> etc. I like Ryan's proposals which look simple and elegant, with 
>>>>>>>>>>>> nice
>>>>>>>>>>>> support on function overloading and variadic arguments. On the 
>>>>>>>>>>>> other hand,
>>>>>>>>>>>> I think Wenchen made a very good point about performance. Overall, 
>>>>>>>>>>>> I'm
>>>>>>>>>>>> excited to see active discussions on this topic and believe the 
>>>>>>>>>>>> community
>>>>>>>>>>>> will come to a proposal with the best of both sides.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Chao
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <
>>>>>>>>>>>> gurwls...@gma

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-21 Thread Walaa Eldin Moustafa
>>>>>>>>>> >> This is an important feature which can unblock several other
>>>>>>>>>>> projects including bucket join support for DataSource v2, complete 
>>>>>>>>>>> support
>>>>>>>>>>> for enforcing DataSource v2 distribution requirements on the write 
>>>>>>>>>>> path,
>>>>>>>>>>> etc. I like Ryan's proposals which look simple and elegant, with 
>>>>>>>>>>> nice
>>>>>>>>>>> support on function overloading and variadic arguments. On the 
>>>>>>>>>>> other hand,
>>>>>>>>>>> I think Wenchen made a very good point about performance. Overall, 
>>>>>>>>>>> I'm
>>>>>>>>>>> excited to see active discussions on this topic and believe the 
>>>>>>>>>>> community
>>>>>>>>>>> will come to a proposal with the best of both sides.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Chao
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <
>>>>>>>>>>> gurwls...@gmail.com> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> +1 for Liang-chi's.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM Liang-Chi Hsieh
>>>>>>>>>>> wrote:
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Basically I think the proposal makes sense to me and I'd
>>>>>>>>>>> like to support the
>>>>>>>>>>> >>>> SPIP as it looks like we have strong need for the important
>>>>>>>>>>> feature.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Thanks Ryan for working on this and I do also look forward
>>>>>>>>>>> to Wenchen's
>>>>>>>>>>> >>>> implementation. Thanks for the discussion too.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Actually I think the SupportsInvoke proposed by Ryan looks
>>>>>>>>>>> a good
>>>>>>>>>>> >>>> alternative to me. Besides Wenchen's alternative
>>>>>>>>>>> implementation, is there a
>>>>>>>>>>> >>>> chance we also have the SupportsInvoke for comparison?
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> John Zhuge wrote
>>>>>>>>>>> >>>> > Excited to see our Spark community rallying behind this
>>>>>>>>>>> important feature!
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > The proposal lays a solid foundation of minimal feature
>>>>>>>>>>> set with careful
>>>>>>>>>>> >>>> > considerations for future optimizations and extensions.
>>>>>>>>>>> Can't wait to see
>>>>>>>>>>> >>>> > it leading to more advanced functionalities like views
>>>>>>>>>>> with shared custom
>>>>>>>>>>> >>>> > functions, function pushdown, lambda, etc. It has already
>>>>>>>>>>> borne fruit from
>>>>>>>>>>> >>>> > the constructive collaborations in this thread. Looking
>>>>>>>>>>> forward to
>>>>>>>>>>> >>>> > Wenchen's prototype and further discussions including the
>>>>>>>>>>> SupportsInvoke
>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-21 Thread Wenchen Fan

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-19 Thread Ryan Blue

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Wenchen Fan

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Walaa Eldin Moustafa

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Ryan Blue

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Ryan Blue

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Dongjoon Hyun

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Hyukjin Kwon

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Dongjoon Hyun

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Ryan Blue

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Andrew Melo
>>>> >>> --
>>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@>
>>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>> >>> *To:* Ryan Blue <blue@>
>>>> >>> *Cc:* Holden Karau <holden@>; Hyukjin Kwon <gurwls223@>; Spark Dev
>>>> >>> List <dev@.apache>; Wenchen Fan <cloud0fan@>
>>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>> >>>
>>>> >>> BTW, I forgot to add my opinion explicitly in this thread because I was
>>>> >>> on the PR before this thread.
>>>> >>>
>>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>>>> >>> there for almost two years.
>>>> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>>>> >>> the latest updated design docs and AS-IS PR.
>>>> >>>
>>>> >>> And, the rest of the progress in this thread is also very satisfying to
>>>> >>> me (e.g. Ryan's extension suggestion and Wenchen's alternative).
>>>> >>>
>>>> >>> To All:
>>>> >>> Please take a look at the design doc and the PR, and give us some
>>>> >>> opinions.
>>>> >>> We really need your participation in order to make DSv2 more complete.
>>>> >>> This will unblock other DSv2 features, too.
>>>> >>>
>>>> >>> Bests,
>>>> >>> Dongjoon.
>>>> >>>
>>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.hyun@> wrote:
>>>> >>>
>>>> >>> Hi, Ryan.
>>>> >>>
>>>> >>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>>>> >>> suggested is double-checking the alternatives with the implementation
>>>> >>> to give more momentum to our discussion.
>>>> >>>
>>>> >>> Your new suggestion about an optional extension also sounds like a
>>>> >>> reasonable new alternative to me.
>>>> >>>
>>>> >>> We are still discussing this topic together, and I hope we can come to
>>>> >>> a conclusion this time (for Apache Spark 3.2) without getting stuck
>>>> >>> like last time.
>>>> >>>
>>>> >>> I really appreciate your leadership in this discussion, and the
>>>> >>> direction it is moving in looks constructive to me. Let's give some

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Walaa Eldin Moustafa

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ye Xianjin

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ryan Blue
Thanks for the positive feedback, everyone. It sounds like there is a clear
path forward for calling functions. Even without a prototype, the `invoke`
plans show that Wenchen's suggested optimization can be done, and
incorporating it as an optional extension to this proposal solves many of
the unknowns.

With that area now understood, is there any discussion about other parts of
the proposal, besides the function call interface?
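
To make the trade-off concrete, here is a minimal sketch of a UDF that
offers both call paths discussed above. The *Sketch interfaces are
stand-ins for the SPIP's proposed ScalarFunction/BoundFunction, trimmed
to what the example needs, and the magic-method resolution rules were
still being designed at this point, so treat the invoke signature as
illustrative only.

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;

    // Stand-ins for the SPIP's proposed interfaces, reduced to this example.
    interface BoundFunctionSketch {
      DataType[] inputTypes();
      DataType resultType();
    }

    interface ScalarFunctionSketch<R> extends BoundFunctionSketch {
      R produceResult(InternalRow input);
    }

    class LongAdd implements ScalarFunctionSketch<Long> {
      // "Magic method": if the planner finds this signature via reflection
      // (Wenchen's suggested optimization), generated code can call it
      // directly with primitive longs and skip InternalRow boxing entirely.
      public long invoke(long left, long right) {
        return left + right;
      }

      // Row-based path: always available as the fallback, e.g. when no
      // matching magic method exists or in the interpreted path.
      @Override
      public Long produceResult(InternalRow input) {
        return input.getLong(0) + input.getLong(1);
      }

      @Override
      public DataType[] inputTypes() {
        return new DataType[] {DataTypes.LongType, DataTypes.LongType};
      }

      @Override
      public DataType resultType() {
        return DataTypes.LongType;
      }
    }

Implementations that care about per-row overhead add the primitive-typed
method; everyone else implements only produceResult.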


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Chao Sun
This is an important feature which can unblock several other projects
including bucket join support for DataSource v2, complete support for
enforcing DataSource v2 distribution requirements on the write path, etc. I
like Ryan's proposals, which look simple and elegant, with nice support for
function overloading and variadic arguments. On the other hand, I think
Wenchen made a very good point about performance. Overall, I'm excited to
see active discussions on this topic and believe the community will come to
a proposal with the best of both sides.

Chao


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Hyukjin Kwon
+1 for Liang-chi's.

Thanks Ryan and Wenchen for leading this.


2021년 2월 13일 (토) 오후 12:18, Liang-Chi Hsieh 님이 작성:

> Basically I think the proposal makes sense to me and I'd like to support
> the
> SPIP as it looks like we have strong need for the important feature.
>
> Thanks Ryan for working on this and I do also look forward to Wenchen's
> implementation. Thanks for the discussion too.
>
> Actually I think the SupportsInvoke proposed by Ryan looks a good
> alternative to me. Besides Wenchen's alternative implementation, is there a
> chance we also have the SupportsInvoke for comparison?
>
>
> John Zhuge wrote
> > Excited to see our Spark community rallying behind this important
> feature!
> >
> > The proposal lays a solid foundation of minimal feature set with careful
> > considerations for future optimizations and extensions. Can't wait to see
> > it leading to more advanced functionalities like views with shared custom
> > functions, function pushdown, lambda, etc. It has already borne fruit
> from
> > the constructive collaborations in this thread. Looking forward to
> > Wenchen's prototype and further discussions including the SupportsInvoke
> > extension proposed by Ryan.
> >
> >
> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
>
> > owen.omalley@
>
> > 
> > wrote:
> >
> >> I think this proposal is a very good thing giving Spark a standard way
> of
> >> getting to and calling UDFs.
> >>
> >> I like having the ScalarFunction as the API to call the UDFs. It is
> >> simple, yet covers all of the polymorphic type cases well. I think it
> >> would
> >> also simplify using the functions in other contexts like pushing down
> >> filters into the ORC & Parquet readers although there are a lot of
> >> details
> >> that would need to be considered there.
> >>
> >> .. Owen
> >>
> >>
> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
>
> > ekrogen@.com
>
> > 
> >> wrote:
> >>
> >>> I agree that there is a strong need for a FunctionCatalog within Spark
> >>> to
> >>> provide support for shareable UDFs, as well as make movement towards
> >>> more
> >>> advanced functionality like views which themselves depend on UDFs, so I
> >>> support this SPIP wholeheartedly.
> >>>
> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
> >>> and
> >>> extensible. I generally think Wenchen's proposal is easier for a user
> to
> >>> work with in the common case, but has greater potential for confusing
> >>> and
> >>> hard-to-debug behavior due to use of reflective method signature
> >>> searches.
> >>> The merits on both sides can hopefully be more properly examined with
> >>> code,
> >>> so I look forward to seeing an implementation of Wenchen's ideas to
> >>> provide
> >>> a more concrete comparison. I am optimistic that we will not let the
> >>> debate
> >>> over this point unreasonably stall the SPIP from making progress.
> >>>
> >>> Thank you to both Wenchen and Ryan for your detailed consideration and
> >>> evaluation of these ideas!
> >>> --
> >>> *From:* Dongjoon Hyun
> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
> >>> *To:* Ryan Blue
> >>> *Cc:* Holden Karau; Hyukjin Kwon; Spark Dev List; Wenchen Fan
> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
> >>>
> >>> BTW, I forgot to add my opinion explicitly in this thread because I was
> >>> on the PR before this thread.
> >>>
> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
> >>> there for almost two years.
> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
> >>> the latest updated design docs and AS-IS PR.
> >>>
> >>> And, the rest of the progress in this thread is also very satisfying to
> >>> me.
> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
> >>>
> >>> To All:
> >>> Please take a look at the

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Liang-Chi Hsieh
Basically I think the proposal makes sense to me and I'd like to support the
SPIP as it looks like we have strong need for the important feature.

Thanks Ryan for working on this and I do also look forward to Wenchen's
implementation. Thanks for the discussion too.

Actually I think the SupportsInvoke proposed by Ryan looks a good
alternative to me. Besides Wenchen's alternative implementation, is there a
chance we also have the SupportsInvoke for comparison?


John Zhuge wrote
> Excited to see our Spark community rallying behind this important feature!
> 
> The proposal lays a solid foundation of minimal feature set with careful
> considerations for future optimizations and extensions. Can't wait to see
> it leading to more advanced functionalities like views with shared custom
> functions, function pushdown, lambda, etc. It has already borne fruit from
> the constructive collaborations in this thread. Looking forward to
> Wenchen's prototype and further discussions including the SupportsInvoke
> extension proposed by Ryan.
> 
> 
> On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley wrote:
> 
>> I think this proposal is a very good thing giving Spark a standard way of
>> getting to and calling UDFs.
>>
>> I like having the ScalarFunction as the API to call the UDFs. It is
>> simple, yet covers all of the polymorphic type cases well. I think it
>> would
>> also simplify using the functions in other contexts like pushing down
>> filters into the ORC & Parquet readers although there are a lot of
>> details
>> that would need to be considered there.
>>
>> .. Owen
>>
>>
>> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen wrote:
>>
>>> I agree that there is a strong need for a FunctionCatalog within Spark
>>> to
>>> provide support for shareable UDFs, as well as make movement towards
>>> more
>>> advanced functionality like views which themselves depend on UDFs, so I
>>> support this SPIP wholeheartedly.
>>>
>>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> and
>>> extensible. I generally think Wenchen's proposal is easier for a user to
>>> work with in the common case, but has greater potential for confusing
>>> and
>>> hard-to-debug behavior due to use of reflective method signature
>>> searches.
>>> The merits on both sides can hopefully be more properly examined with
>>> code,
>>> so I look forward to seeing an implementation of Wenchen's ideas to
>>> provide
>>> a more concrete comparison. I am optimistic that we will not let the
>>> debate
>>> over this point unreasonably stall the SPIP from making progress.
>>>
>>> Thank you to both Wenchen and Ryan for your detailed consideration and
>>> evaluation of these ideas!
>>> --
>>> *From:* Dongjoon Hyun
>>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>> *To:* Ryan Blue
>>> *Cc:* Holden Karau; Hyukjin Kwon; Spark Dev List; Wenchen Fan
>>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>
>>> BTW, I forgot to add my opinion explicitly in this thread because I was
>>> on the PR before this thread.
>>>
>>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>>> there for almost two years.
>>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>>> the latest updated design docs and AS-IS PR.
>>>
>>> And, the rest of the progress in this thread is also very satisfying to
>>> me.
>>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>>
>>> To All:
>>> Please take a look at the design doc and the PR, and give us some
>>> opinions.
>>> We really need your participation in order to make DSv2 more complete.
>>> This will unblock other DSv2 features, too.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun wrote:
>>>
>>> Hi, Ryan.
>>>
>>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>>> suggested is double-checking the alternatives with the implementation to
>>> give more momentum to our discussion.
>>>
>>> Your new suggestion

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread John Zhuge
Excited to see our Spark community rallying behind this important feature!

The proposal lays a solid foundation of minimal feature set with careful
considerations for future optimizations and extensions. Can't wait to see
it leading to more advanced functionalities like views with shared custom
functions, function pushdown, lambda, etc. It has already borne fruit from
the constructive collaborations in this thread. Looking forward to
Wenchen's prototype and further discussions including the SupportsInvoke
extension proposed by Ryan.


On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
wrote:

> I think this proposal is a very good thing giving Spark a standard way of
> getting to and calling UDFs.
>
> I like having the ScalarFunction as the API to call the UDFs. It is
> simple, yet covers all of the polymorphic type cases well. I think it would
> also simplify using the functions in other contexts like pushing down
> filters into the ORC & Parquet readers although there are a lot of details
> that would need to be considered there.
>
> .. Owen
>
>
> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
> wrote:
>
>> I agree that there is a strong need for a FunctionCatalog within Spark to
>> provide support for shareable UDFs, as well as make movement towards more
>> advanced functionality like views which themselves depend on UDFs, so I
>> support this SPIP wholeheartedly.
>>
>> I find both of the proposed UDF APIs to be sufficiently user-friendly and
>> extensible. I generally think Wenchen's proposal is easier for a user to
>> work with in the common case, but has greater potential for confusing and
>> hard-to-debug behavior due to use of reflective method signature searches.
>> The merits on both sides can hopefully be more properly examined with code,
>> so I look forward to seeing an implementation of Wenchen's ideas to provide
>> a more concrete comparison. I am optimistic that we will not let the debate
>> over this point unreasonably stall the SPIP from making progress.
>>
>> Thank you to both Wenchen and Ryan for your detailed consideration and
>> evaluation of these ideas!
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>> *To:* Ryan Blue 
>> *Cc:* Holden Karau ; Hyukjin Kwon <
>> gurwls...@gmail.com>; Spark Dev List ; Wenchen Fan
>> 
>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>
>> BTW, I forgot to add my opinion explicitly in this thread because I was
>> on the PR before this thread.
>>
>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>> there for almost two years.
>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>> the latest updated design docs and AS-IS PR.
>>
>> And, the rest of the progress in this thread is also very satisfying to
>> me.
>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>
>> To All:
>> Please take a look at the design doc and the PR, and give us some
>> opinions.
>> We really need your participation in order to make DSv2 more complete.
>> This will unblock other DSv2 features, too.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
>> wrote:
>>
>> Hi, Ryan.
>>
>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>> suggested is double-checking the alternatives with the implementation to
>> give more momentum to our discussion.
>>
>> Your new suggestion about an optional extension also sounds like a new
>> reasonable alternative to me.
>>
>> We are still discussing this topic together and I hope we can reach a
>> conclusion this time (for Apache Spark 3.2) without getting stuck like
>> last time.
>>
>> I really appreciate your leadership in this discussion, and the direction
>> it is moving looks constructive to me. Let's give the alternatives some
>> time.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:
>>
>> I don’t think we should so quickly move past the drawbacks of this
>> approach. The problems are significant enough that using invoke is not
>> sufficient on its own. But, I think we can add it as an optional extension
>> to shore up the weaknesses.
>>
>> Here’s a summary of the drawbacks:
>>
>>- Magic function signatures are error-prone
>>- Spark would need considerable code to help users find what went
>>wrong
>>- Spark would likely need to coerce arguments (e.g., String,
>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Owen O'Malley
I think this proposal is a very good thing giving Spark a standard way of
getting to and calling UDFs.

I like having the ScalarFunction as the API to call the UDFs. It is simple,
yet covers all of the polymorphic type cases well. I think it would also
simplify using the functions in other contexts like pushing down filters
into the ORC & Parquet readers although there are a lot of details that
would need to be considered there.

.. Owen


On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
wrote:

> I agree that there is a strong need for a FunctionCatalog within Spark to
> provide support for shareable UDFs, as well as make movement towards more
> advanced functionality like views which themselves depend on UDFs, so I
> support this SPIP wholeheartedly.
>
> I find both of the proposed UDF APIs to be sufficiently user-friendly and
> extensible. I generally think Wenchen's proposal is easier for a user to
> work with in the common case, but has greater potential for confusing and
> hard-to-debug behavior due to use of reflective method signature searches.
> The merits on both sides can hopefully be more properly examined with code,
> so I look forward to seeing an implementation of Wenchen's ideas to provide
> a more concrete comparison. I am optimistic that we will not let the debate
> over this point unreasonably stall the SPIP from making progress.
>
> Thank you to both Wenchen and Ryan for your detailed consideration and
> evaluation of these ideas!
> --
> *From:* Dongjoon Hyun 
> *Sent:* Wednesday, February 10, 2021 6:06 PM
> *To:* Ryan Blue 
> *Cc:* Holden Karau ; Hyukjin Kwon <
> gurwls...@gmail.com>; Spark Dev List ; Wenchen Fan <
> cloud0...@gmail.com>
> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>
> BTW, I forgot to add my opinion explicitly in this thread because I was on
> the PR before this thread.
>
> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there
> for almost two years.
> 2. I already gave my +1 on that PR last Saturday because I agreed with the
> latest updated design docs and AS-IS PR.
>
> And, the rest of the progress in this thread is also very satisfying to me.
> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>
> To All:
> Please take a look at the design doc and the PR, and give us some opinions.
> We really need your participation in order to make DSv2 more complete.
> This will unblock other DSv2 features, too.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
> wrote:
>
> Hi, Ryan.
>
> We didn't move past anything (both yours and Wenchen's). What Wenchen
> suggested is double-checking the alternatives with the implementation to
> give more momentum to our discussion.
>
> Your new suggestion about an optional extension also sounds like a new
> reasonable alternative to me.
>
> We are still discussing this topic together and I hope we can reach a
> conclusion this time (for Apache Spark 3.2) without getting stuck like
> last time.
>
> I really appreciate your leadership in this discussion, and the direction
> it is moving looks constructive to me. Let's give the alternatives some
> time.
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:
>
> I don’t think we should so quickly move past the drawbacks of this
> approach. The problems are significant enough that using invoke is not
> sufficient on its own. But, I think we can add it as an optional extension
> to shore up the weaknesses.
>
> Here’s a summary of the drawbacks:
>
>- Magic function signatures are error-prone
>- Spark would need considerable code to help users find what went wrong
>- Spark would likely need to coerce arguments (e.g., String,
>Option[Int]) for usability
>- It is unclear how Spark will find the Java Method to call
>- Use cases that require varargs fall back to casting; users will also
>get this wrong (cast to String instead of UTF8String)
>- The non-codegen path is significantly slower
>
> The benefit of invoke is to avoid moving data into a row, like this:
>
> -- using invoke
> int result = udfFunction(x, y);
>
> -- using row
> udfRow.update(0, x); -- actual: values[0] = x;
> udfRow.update(1, y);
> int result = udfFunction(udfRow);
>
> And, again, that won’t actually help much in cases that require varargs.
>
> I suggest we add a new marker trait for BoundMethod called SupportsInvoke.
> If that interface is implemented, then Spark will look for a method that
> matches the expected signature based on the bound input type. If it isn’t
> found, Spark can print a warning and fall back to the InternalRow call:
> “Cannot find

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Erik Krogen
I agree that there is a strong need for a FunctionCatalog within Spark to 
provide support for shareable UDFs, as well as make movement towards more 
advanced functionality like views which themselves depend on UDFs, so I support 
this SPIP wholeheartedly.

I find both of the proposed UDF APIs to be sufficiently user-friendly and 
extensible. I generally think Wenchen's proposal is easier for a user to work 
with in the common case, but has greater potential for confusing and 
hard-to-debug behavior due to use of reflective method signature searches. The 
merits on both sides can hopefully be more properly examined with code, so I 
look forward to seeing an implementation of Wenchen's ideas to provide a more 
concrete comparison. I am optimistic that we will not let the debate over this 
point unreasonably stall the SPIP from making progress.

Thank you to both Wenchen and Ryan for your detailed consideration and 
evaluation of these ideas!

From: Dongjoon Hyun 
Sent: Wednesday, February 10, 2021 6:06 PM
To: Ryan Blue 
Cc: Holden Karau ; Hyukjin Kwon ; 
Spark Dev List ; Wenchen Fan 
Subject: Re: [DISCUSS] SPIP: FunctionCatalog

BTW, I forgot to add my opinion explicitly in this thread because I was on the 
PR before this thread.

1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there for 
almost two years.
2. I already gave my +1 on that PR last Saturday because I agreed with the 
latest updated design docs and AS-IS PR.

And, the rest of the progress in this thread is also very satisfying to me.
(e.g. Ryan's extension suggestion and Wenchen's alternative)

To All:
Please take a look at the design doc and the PR, and give us some opinions.
We really need your participation in order to make DSv2 more complete.
This will unblock other DSv2 features, too.

Bests,
Dongjoon.



On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun wrote:
Hi, Ryan.

We didn't move past anything (both yours and Wenchen's). What Wenchen suggested 
is double-checking the alternatives with the implementation to give more 
momentum to our discussion.

Your new suggestion about an optional extension also sounds like a new
reasonable alternative to me.

We are still discussing this topic together and I hope we can reach a
conclusion this time (for Apache Spark 3.2) without getting stuck like last time.

I really appreciate your leadership in this discussion, and the direction it
is moving looks constructive to me. Let's give the alternatives some time.

Bests,
Dongjoon.

On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue wrote:

I don’t think we should so quickly move past the drawbacks of this approach. 
The problems are significant enough that using invoke is not sufficient on its 
own. But, I think we can add it as an optional extension to shore up the 
weaknesses.

Here’s a summary of the drawbacks:

  *   Magic function signatures are error-prone
  *   Spark would need considerable code to help users find what went wrong
  *   Spark would likely need to coerce arguments (e.g., String, Option[Int]) 
for usability
  *   It is unclear how Spark will find the Java Method to call
  *   Use cases that require varargs fall back to casting; users will also get 
this wrong (cast to String instead of UTF8String)
  *   The non-codegen path is significantly slower

The benefit of invoke is to avoid moving data into a row, like this:

-- using invoke
int result = udfFunction(x, y);

-- using row
udfRow.update(0, x); -- actual: values[0] = x;
udfRow.update(1, y);
int result = udfFunction(udfRow);


And, again, that won’t actually help much in cases that require varargs.

I suggest we add a new marker trait for BoundMethod called SupportsInvoke. If 
that interface is implemented, then Spark will look for a method that matches 
the expected signature based on the bound input type. If it isn’t found, Spark 
can print a warning and fall back to the InternalRow call: “Cannot find 
udfFunction(int, int)”.

This approach allows the invoke optimization, but solves many of the problems:

  *   The method to invoke is found using the proposed load and bind approach
  *   Magic function signatures are optional and do not cause runtime failures
  *   Because this is an optional optimization, Spark can be more strict about 
types
  *   Varargs cases can still use rows
  *   Non-codegen can use an evaluation method rather than falling back to slow 
Java reflection

This seems like a good extension to me; this provides a plan for optimizing the 
UDF call to avoid building a row, while the existing proposal covers the other 
cases well and addresses how to locate these function calls.

This also highlights that the approach used in DSv2 and this proposal is 
working: start small and use extensions to layer on more complex support.

On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun wrote:

Thank 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Dongjoon Hyun
BTW, I forgot to add my opinion explicitly in this thread because I was on
the PR before this thread.

1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there
for almost two years.
2. I already gave my +1 on that PR last Saturday because I agreed with the
latest updated design docs and AS-IS PR.

And, the rest of the progress in this thread is also very satisfying to me.
(e.g. Ryan's extension suggestion and Wenchen's alternative)

To All:
Please take a look at the design doc and the PR, and give us some opinions.
We really need your participation in order to make DSv2 more complete.
This will unblock other DSv2 features, too.

Bests,
Dongjoon.



On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
wrote:

> Hi, Ryan.
>
> We didn't move past anything (both yours and Wenchen's). What Wenchen
> suggested is double-checking the alternatives with the implementation to
> give more momentum to our discussion.
>
> Your new suggestion about an optional extension also sounds like a new
> reasonable alternative to me.
>
> We are still discussing this topic together and I hope we can reach a
> conclusion this time (for Apache Spark 3.2) without getting stuck like
> last time.
>
> I really appreciate your leadership in this discussion, and the direction
> it is moving looks constructive to me. Let's give the alternatives some
> time.
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:
>
>> I don’t think we should so quickly move past the drawbacks of this
>> approach. The problems are significant enough that using invoke is not
>> sufficient on its own. But, I think we can add it as an optional extension
>> to shore up the weaknesses.
>>
>> Here’s a summary of the drawbacks:
>>
>>- Magic function signatures are error-prone
>>- Spark would need considerable code to help users find what went
>>wrong
>>- Spark would likely need to coerce arguments (e.g., String,
>>Option[Int]) for usability
>>- It is unclear how Spark will find the Java Method to call
>>- Use cases that require varargs fall back to casting; users will
>>also get this wrong (cast to String instead of UTF8String)
>>- The non-codegen path is significantly slower
>>
>> The benefit of invoke is to avoid moving data into a row, like this:
>>
>> -- using invoke
>> int result = udfFunction(x, y);
>>
>> -- using row
>> udfRow.update(0, x); -- actual: values[0] = x;
>> udfRow.update(1, y);
>> int result = udfFunction(udfRow);
>>
>> And, again, that won’t actually help much in cases that require varargs.
>>
>> I suggest we add a new marker trait for BoundMethod called SupportsInvoke.
>> If that interface is implemented, then Spark will look for a method that
>> matches the expected signature based on the bound input type. If it isn’t
>> found, Spark can print a warning and fall back to the InternalRow call:
>> “Cannot find udfFunction(int, int)”.
>>
>> This approach allows the invoke optimization, but solves many of the
>> problems:
>>
>>- The method to invoke is found using the proposed load and bind
>>approach
>>- Magic function signatures are optional and do not cause runtime
>>failures
>>- Because this is an optional optimization, Spark can be more strict
>>about types
>>- Varargs cases can still use rows
>>- Non-codegen can use an evaluation method rather than falling back
>>to slow Java reflection
>>
>> This seems like a good extension to me; this provides a plan for
>> optimizing the UDF call to avoid building a row, while the existing
>> proposal covers the other cases well and addresses how to locate these
>> function calls.
>>
>> This also highlights that the approach used in DSv2 and this proposal is
>> working: start small and use extensions to layer on more complex support.
>>
>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun 
>> wrote:
>>
>> Thank you all for making a giant move forward for Apache Spark 3.2.0.
>>> I'm really looking forward to seeing Wenchen's implementation.
>>> That would be greatly helpful to make a decision!
>>>
>>> > I'll implement my idea after the holiday and then we can have
>>> more effective discussions. We can also do benchmarks and get some real
>>> numbers.
>>> > FYI: the Presto UDF API also
>>> takes individual parameters instead of the row parameter. I think this
>>> direction is at least worth a try so that we can see the performance
>>> difference. It's also mentioned in the design doc as an alternative (Trino).
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan  wrote:
>>>
 FYI: the Presto UDF API also
 takes individual parameters instead of the row parameter. I think this
 direction is at least worth a try so that we can see the performance
 difference. It's also mentioned in the design doc as an alternative
 (Trino).


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Dongjoon Hyun
Hi, Ryan.

We didn't move past anything (both yours and Wenchen's). What Wenchen
suggested is double-checking the alternatives with the implementation to
give more momentum to our discussion.

Your new suggestion about an optional extension also sounds like a new
reasonable alternative to me.

We are still discussing this topic together and I hope we can reach a
conclusion this time (for Apache Spark 3.2) without getting stuck like
last time.

I really appreciate your leadership in this discussion, and the direction
it is moving looks constructive to me. Let's give the alternatives some
time.

Bests,
Dongjoon.

On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:

> I don’t think we should so quickly move past the drawbacks of this
> approach. The problems are significant enough that using invoke is not
> sufficient on its own. But, I think we can add it as an optional extension
> to shore up the weaknesses.
>
> Here’s a summary of the drawbacks:
>
>- Magic function signatures are error-prone
>- Spark would need considerable code to help users find what went wrong
>- Spark would likely need to coerce arguments (e.g., String,
>Option[Int]) for usability
>- It is unclear how Spark will find the Java Method to call
>- Use cases that require varargs fall back to casting; users will also
>get this wrong (cast to String instead of UTF8String)
>- The non-codegen path is significantly slower
>
> The benefit of invoke is to avoid moving data into a row, like this:
>
> -- using invoke
> int result = udfFunction(x, y);
>
> -- using row
> udfRow.update(0, x); -- actual: values[0] = x;
> udfRow.update(1, y);
> int result = udfFunction(udfRow);
>
> And, again, that won’t actually help much in cases that require varargs.
>
> I suggest we add a new marker trait for BoundMethod called SupportsInvoke.
> If that interface is implemented, then Spark will look for a method that
> matches the expected signature based on the bound input type. If it isn’t
> found, Spark can print a warning and fall back to the InternalRow call:
> “Cannot find udfFunction(int, int)”.
>
> This approach allows the invoke optimization, but solves many of the
> problems:
>
>- The method to invoke is found using the proposed load and bind
>approach
>- Magic function signatures are optional and do not cause runtime
>failures
>- Because this is an optional optimization, Spark can be more strict
>about types
>- Varargs cases can still use rows
>- Non-codegen can use an evaluation method rather than falling back to
>slow Java reflection
>
> This seems like a good extension to me; this provides a plan for
> optimizing the UDF call to avoid building a row, while the existing
> proposal covers the other cases well and addresses how to locate these
> function calls.
>
> This also highlights that the approach used in DSv2 and this proposal is
> working: start small and use extensions to layer on more complex support.
>
> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun 
> wrote:
>
> Thank you all for making a giant move forward for Apache Spark 3.2.0.
>> I'm really looking forward to seeing Wenchen's implementation.
>> That would be greatly helpful to make a decision!
>>
>> > I'll implement my idea after the holiday and then we can have
>> more effective discussions. We can also do benchmarks and get some real
>> numbers.
>> > FYI: the Presto UDF API also
>> takes individual parameters instead of the row parameter. I think this
>> direction is at least worth a try so that we can see the performance
>> difference. It's also mentioned in the design doc as an alternative (Trino).
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan  wrote:
>>
>>> FYI: the Presto UDF API also
>>> takes individual parameters instead of the row parameter. I think this
>>> direction is at least worth a try so that we can see the performance
>>> difference. It's also mentioned in the design doc as an alternative (Trino).
>>>
>>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan 
>>> wrote:
>>>
 Hi Holden,

 As Hyukjin said, following existing designs is not the principle of DS
 v2 API design. We should make sure the DS v2 API makes sense. AFAIK we
 didn't fully follow the catalog API design from Hive and I believe Ryan
 also agrees with it.

 I think the problem here is we were discussing some very detailed
 things without actual code. I'll implement my idea after the holiday and
 then we can have more effective discussions. We can also do benchmarks and
 get some real numbers.

 In the meantime, we can continue to discuss other parts of this
 proposal, and make a prototype if possible. Spark SQL has many active
 contributors/committers and this thread hasn't gotten much attention yet.

 On Wed, Feb 10, 2021 at 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Ryan Blue
I don’t think we should so quickly move past the drawbacks of this
approach. The problems are significant enough that using invoke is not
sufficient on its own. But, I think we can add it as an optional extension
to shore up the weaknesses.

Here’s a summary of the drawbacks:

   - Magic function signatures are error-prone
   - Spark would need considerable code to help users find what went wrong
   - Spark would likely need to coerce arguments (e.g., String,
   Option[Int]) for usability
   - It is unclear how Spark will find the Java Method to call
   - Use cases that require varargs fall back to casting; users will also
   get this wrong (cast to String instead of UTF8String)
   - The non-codegen path is significantly slower

The benefit of invoke is to avoid moving data into a row, like this:

-- using invoke
int result = udfFunction(x, y);

-- using row
udfRow.update(0, x); -- actual: values[0] = x;
udfRow.update(1, y);
int result = udfFunction(udfRow);

And, again, that won’t actually help much in cases that require varargs.

I suggest we add a new marker trait for BoundMethod called SupportsInvoke.
If that interface is implemented, then Spark will look for a method that
matches the expected signature based on the bound input type. If it isn’t
found, Spark can print a warning and fall back to the InternalRow call:
“Cannot find udfFunction(int, int)”.

This approach allows the invoke optimization, but solves many of the
problems:

   - The method to invoke is found using the proposed load and bind approach
   - Magic function signatures are optional and do not cause runtime
   failures
   - Because this is an optional optimization, Spark can be more strict
   about types
   - Varargs cases can still use rows
   - Non-codegen can use an evaluation method rather than falling back to
   slow Java reflection

This seems like a good extension to me; this provides a plan for optimizing
the UDF call to avoid building a row, while the existing proposal covers
the other cases well and addresses how to locate these function calls.

This also highlights that the approach used in DSv2 and this proposal is
working: start small and use extensions to layer on more complex support.
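
A minimal sketch of what this marker-trait extension might look like, assuming
the SupportsInvoke and produceResult names used in this thread; the exact
interfaces in the final API may differ.

import org.apache.spark.sql.catalyst.InternalRow;

// Marker trait: implementing it tells Spark to search for a direct
// "magic" method matching the bound input types.
interface SupportsInvoke {}

// Row-based UDF interface, sketched from the proposal.
interface ScalarFunction<R> {
    R produceResult(InternalRow input);
}

// A UDF offering both paths: the magic method for the invoke
// optimization, plus the InternalRow fallback Spark uses when the
// expected signature cannot be found.
class IntAdd implements ScalarFunction<Integer>, SupportsInvoke {
    public int invoke(int x, int y) {
        return x + y;
    }

    @Override
    public Integer produceResult(InternalRow input) {
        return invoke(input.getInt(0), input.getInt(1));
    }
}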

On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun 
wrote:

Thank you all for making a giant move forward for Apache Spark 3.2.0.
> I'm really looking forward to seeing Wenchen's implementation.
> That would be greatly helpful to make a decision!
>
> > I'll implement my idea after the holiday and then we can have
> more effective discussions. We can also do benchmarks and get some real
> numbers.
> > FYI: the Presto UDF API also
> takes individual parameters instead of the row parameter. I think this
> direction is at least worth a try so that we can see the performance
> difference. It's also mentioned in the design doc as an alternative (Trino).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan  wrote:
>
>> FYI: the Presto UDF API also
>> takes individual parameters instead of the row parameter. I think this
>> direction is at least worth a try so that we can see the performance
>> difference. It's also mentioned in the design doc as an alternative (Trino).
>>
>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan  wrote:
>>
>>> Hi Holden,
>>>
>>> As Hyukjin said, following existing designs is not the principle of DS
>>> v2 API design. We should make sure the DS v2 API makes sense. AFAIK we
>>> didn't fully follow the catalog API design from Hive and I believe Ryan
>>> also agrees with it.
>>>
>>> I think the problem here is we were discussing some very detailed things
>>> without actual code. I'll implement my idea after the holiday and then we
>>> can have more effective discussions. We can also do benchmarks and get some
>>> real numbers.
>>>
>>> In the meantime, we can continue to discuss other parts of this
>>> proposal, and make a prototype if possible. Spark SQL has many active
>>> contributors/committers and this thread hasn't gotten much attention yet.
>>>
>>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon 
>>> wrote:
>>>
 Just dropping a few lines. I remember that one of the goals in DSv2 is
 to correct the mistakes we made in the current Spark code.
 There would not be much point if we just follow and mimic what Spark
 currently does. It might just end up with another copy of Spark APIs,
 e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
 I do believe we have been stuck mainly due to trying to come up with a
 better design. We already have an ugly picture of the current Spark APIs
 from which to draw a better, bigger picture.


 On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:

> I think this proposal is a good set of trade-offs and has existed in
> the community for a long period of time. I especially appreciate how the
> 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Dongjoon Hyun
Thank you all for making a giant move forward for Apache Spark 3.2.0.
I'm really looking forward to seeing Wenchen's implementation.
That would be greatly helpful to make a decision!

> I'll implement my idea after the holiday and then we can have
more effective discussions. We can also do benchmarks and get some real
numbers.
> FYI: the Presto UDF API also
takes individual parameters instead of the row parameter. I think this
direction is at least worth a try so that we can see the performance
difference. It's also mentioned in the design doc as an alternative (Trino).

Bests,
Dongjoon.


On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan  wrote:

> FYI: the Presto UDF API also
> takes individual parameters instead of the row parameter. I think this
> direction is at least worth a try so that we can see the performance
> difference. It's also mentioned in the design doc as an alternative (Trino).
>
> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan  wrote:
>
>> Hi Holden,
>>
>> As Hyukjin said, following existing designs is not the principle of DS v2
>> API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
>> fully follow the catalog API design from Hive and I believe Ryan also
>> agrees with it.
>>
>> I think the problem here is we were discussing some very detailed things
>> without actual code. I'll implement my idea after the holiday and then we
>> can have more effective discussions. We can also do benchmarks and get some
>> real numbers.
>>
>> In the meantime, we can continue to discuss other parts of this proposal,
>> and make a prototype if possible. Spark SQL has many active
>> contributors/committers and this thread hasn't gotten much attention yet.
>>
>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon  wrote:
>>
>>> Just dropping a few lines. I remember that one of the goals in DSv2 is
>>> to correct the mistakes we made in the current Spark code.
>>> There would not be much point if we just follow and mimic what Spark
>>> currently does. It might just end up with another copy of Spark APIs,
>>> e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
>>> I do believe we have been stuck mainly due to trying to come up with a
>>> better design. We already have an ugly picture of the current Spark APIs
>>> from which to draw a better, bigger picture.
>>>
>>>
>>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:
>>>
 I think this proposal is a good set of trade-offs and has existed in
 the community for a long period of time. I especially appreciate how the
 design is focused on a minimal useful component, with future optimizations
 considered from a point of view of making sure it's flexible, but actual
 concrete decisions left for the future once we see how this API is used. I
 think if we try and optimize everything right out of the gate, we'll
 quickly get stuck (again) and not make any progress.

 On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to start a discussion for adding a FunctionCatalog interface
> to catalog plugins. This will allow catalogs to expose functions to Spark,
> similar to how the TableCatalog interface allows a catalog to expose
> tables. The proposal doc is available here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>
> Here's a high-level summary of some of the main design choices:
> * Adds the ability to list and load functions, not to create or modify
> them in an external catalog
> * Supports scalar, aggregate, and partial aggregate functions
> * Uses load and bind steps for better error messages and simpler
> implementations
> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
> data
> * Can be extended using mix-in interfaces to add vectorization,
> codegen, and other future features
>
> There is also a PR with the proposed API:
> https://github.com/apache/spark/pull/24559/files
>
> Let's discuss the proposal here rather than on that PR, to get better
> visibility. Also, please take the time to read the proposal first. That
> really helps clear up misconceptions.
>
>
>
> --
> Ryan Blue
>


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
FYI: the Presto UDF API also
takes individual parameters instead of the row parameter. I think this
direction is at least worth a try so that we can see the performance
difference. It's also mentioned in the design doc as an alternative (Trino).

On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan  wrote:

> Hi Holden,
>
> As Hyukjin said, following existing designs is not the principle of DS v2
> API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
> fully follow the catalog API design from Hive and I believe Ryan also
> agrees with it.
>
> I think the problem here is we were discussing some very detailed things
> without actual code. I'll implement my idea after the holiday and then we
> can have more effective discussions. We can also do benchmarks and get some
> real numbers.
>
> In the meantime, we can continue to discuss other parts of this proposal,
> and make a prototype if possible. Spark SQL has many active
> contributors/committers and this thread hasn't gotten much attention yet.
>
> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon  wrote:
>
>> Just dropping a few lines. I remember that one of the goals in DSv2 is to
>> correct the mistakes we made in the current Spark code.
>> There would not be much point if we just follow and mimic what Spark
>> currently does. It might just end up with another copy of Spark APIs,
>> e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
>> I do believe we have been stuck mainly due to trying to come up with a
>> better design. We already have an ugly picture of the current Spark APIs
>> from which to draw a better, bigger picture.
>>
>>
>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:
>>
>>> I think this proposal is a good set of trade-offs and has existed in the
>>> community for a long period of time. I especially appreciate how the design
>>> is focused on a minimal useful component, with future optimizations
>>> considered from a point of view of making sure it's flexible, but actual
>>> concrete decisions left for the future once we see how this API is used. I
>>> think if we try and optimize everything right out of the gate, we'll
>>> quickly get stuck (again) and not make any progress.
>>>
>>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:
>>>
 Hi everyone,

 I'd like to start a discussion for adding a FunctionCatalog interface
 to catalog plugins. This will allow catalogs to expose functions to Spark,
 similar to how the TableCatalog interface allows a catalog to expose
 tables. The proposal doc is available here:
 https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit

 Here's a high-level summary of some of the main design choices:
 * Adds the ability to list and load functions, not to create or modify
 them in an external catalog
 * Supports scalar, aggregate, and partial aggregate functions
 * Uses load and bind steps for better error messages and simpler
 implementations
 * Like the DSv2 table read and write APIs, it uses InternalRow to pass
 data
 * Can be extended using mix-in interfaces to add vectorization,
 codegen, and other future features

 There is also a PR with the proposed API:
 https://github.com/apache/spark/pull/24559/files

 Let's discuss the proposal here rather than on that PR, to get better
 visibility. Also, please take the time to read the proposal first. That
 really helps clear up misconceptions.



 --
 Ryan Blue

>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
Hi Holden,

As Hyukjin said, following existing designs is not the principle of DS v2
API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
fully follow the catalog API design from Hive and I believe Ryan also
agrees with it.

I think the problem here is we were discussing some very detailed things
without actual code. I'll implement my idea after the holiday and then we
can have more effective discussions. We can also do benchmarks and get some
real numbers.

In the meantime, we can continue to discuss other parts of this proposal,
and make a prototype if possible. Spark SQL has many active
contributors/committers and this thread hasn't gotten much attention yet.

On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon  wrote:

> Just dropping a few lines. I remember that one of the goals in DSv2 is to
> correct the mistakes we made in the current Spark code.
> There would not be much point if we just follow and mimic what Spark
> currently does. It might just end up with another copy of Spark APIs,
> e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
> I do believe we have been stuck mainly due to trying to come up with a
> better design. We already have an ugly picture of the current Spark APIs
> from which to draw a better, bigger picture.
>
>
> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:
>
>> I think this proposal is a good set of trade-offs and has existed in the
>> community for a long period of time. I especially appreciate how the design
>> is focused on a minimal useful component, with future optimizations
>> considered from a point of view of making sure it's flexible, but actual
>> concrete decisions left for the future once we see how this API is used. I
>> think if we try and optimize everything right out of the gate, we'll
>> quickly get stuck (again) and not make any progress.
>>
>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to start a discussion for adding a FunctionCatalog interface to
>>> catalog plugins. This will allow catalogs to expose functions to Spark,
>>> similar to how the TableCatalog interface allows a catalog to expose
>>> tables. The proposal doc is available here:
>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>
>>> Here's a high-level summary of some of the main design choices:
>>> * Adds the ability to list and load functions, not to create or modify
>>> them in an external catalog
>>> * Supports scalar, aggregate, and partial aggregate functions
>>> * Uses load and bind steps for better error messages and simpler
>>> implementations
>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>> data
>>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>>> and other future features
>>>
>>> There is also a PR with the proposed API:
>>> https://github.com/apache/spark/pull/24559/files
>>>
>>> Let's discuss the proposal here rather than on that PR, to get better
>>> visibility. Also, please take the time to read the proposal first. That
>>> really helps clear up misconceptions.
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Hyukjin Kwon
Just dropping a few lines. I remember that one of the goals in DSv2 is to
correct the mistakes we made in the current Spark code.
There would not be much point if we just follow and mimic what Spark
currently does. It might just end up with another copy of Spark APIs,
e.g. the Expression (internal) APIs. I sincerely would like to avoid this.
I do believe we have been stuck mainly due to trying to come up with a
better design. We already have an ugly picture of the current Spark APIs
from which to draw a better, bigger picture.


On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau wrote:

> I think this proposal is a good set of trade-offs and has existed in the
> community for a long period of time. I especially appreciate how the design
> is focused on a minimal useful component, with future optimizations
> considered from a point of view of making sure it's flexible, but actual
> concrete decisions left for the future once we see how this API is used. I
> think if we try and optimize everything right out of the gate, we'll
> quickly get stuck (again) and not make any progress.
>
> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion for adding a FunctionCatalog interface to
>> catalog plugins. This will allow catalogs to expose functions to Spark,
>> similar to how the TableCatalog interface allows a catalog to expose
>> tables. The proposal doc is available here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>
>> Here's a high-level summary of some of the main design choices:
>> * Adds the ability to list and load functions, not to create or modify
>> them in an external catalog
>> * Supports scalar, aggregate, and partial aggregate functions
>> * Uses load and bind steps for better error messages and simpler
>> implementations
>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>> data
>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>> and other future features
>>
>> There is also a PR with the proposed API:
>> https://github.com/apache/spark/pull/24559/files
>>
>> Let's discuss the proposal here rather than on that PR, to get better
>> visibility. Also, please take the time to read the proposal first. That
>> really helps clear up misconceptions.
>>
>>
>>
>> --
>> Ryan Blue
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Holden Karau
I think this proposal is a good set of trade-offs and has existed in the
community for a long period of time. I especially appreciate how the design
is focused on a minimal useful component, with future optimizations
considered from a point of view of making sure it's flexible, but actual
concrete decisions left for the future once we see how this API is used. I
think if we try and optimize everything right out of the gate, we'll
quickly get stuck (again) and not make any progress.

On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to start a discussion for adding a FunctionCatalog interface to
> catalog plugins. This will allow catalogs to expose functions to Spark,
> similar to how the TableCatalog interface allows a catalog to expose
> tables. The proposal doc is available here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>
> Here's a high-level summary of some of the main design choices:
> * Adds the ability to list and load functions, not to create or modify
> them in an external catalog
> * Supports scalar, aggregate, and partial aggregate functions
> * Uses load and bind steps for better error messages and simpler
> implementations
> * Like the DSv2 table read and write APIs, it uses InternalRow to pass data
> * Can be extended using mix-in interfaces to add vectorization, codegen,
> and other future features
>
> There is also a PR with the proposed API:
> https://github.com/apache/spark/pull/24559/files
>
> Let's discuss the proposal here rather than on that PR, to get better
> visibility. Also, please take the time to read the proposal first. That
> really helps clear up misconceptions.
>
>
>
> --
> Ryan Blue
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
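
A rough sketch of the catalog surface implied by the summary quoted above:
load returns an unbound function, and a separate bind step validates input
types early. Names and signatures here are illustrative; the PR linked above
has the authoritative API.

import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// Illustrative only; see the proposal PR for the actual interfaces.
interface FunctionCatalog {
    Identifier[] listFunctions(String[] namespace);
    UnboundFunction loadFunction(Identifier ident);
}

interface UnboundFunction {
    // Binding against the input schema enables early, descriptive errors
    // instead of runtime failures.
    BoundFunction bind(StructType inputType);
    String name();
}

interface BoundFunction {
    DataType resultType();
    String name();
}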


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Ryan Blue
I don’t think that using Invoke really works. The usability is poor if
something goes wrong and it can’t handle varargs or parameterized use cases
well. There isn’t a significant enough performance penalty for passing data
as a row to justify making the API much more difficult and less expressive.
I don’t think that it makes much sense to move forward with the idea.

More replies inline.

On Tue, Feb 9, 2021 at 2:37 AM Wenchen Fan  wrote:

> There’s also a massive performance penalty for the Invoke approach when
> falling back to non-codegen because the function is loaded and invoked each
> time eval is called. It is much cheaper to use a method in an interface.
>
> Can you elaborate? Using the row parameter or individual parameters
> shouldn't change the life cycle of the UDF instance.
>
The eval method looks up the method and invokes it every time using
reflection. That’s quite a bit slower than calling an interface method on
a UDF instance.
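
The cost gap being described is roughly the following; a self-contained
sketch with illustrative names, not the proposal's actual code.

import java.lang.reflect.Method;

class EvalPathComparison {
    interface Udf { int invoke(int x, int y); }

    // Non-codegen fallback: reflective lookup plus boxed invocation on
    // every eval() call.
    static Object viaReflection(Object udf, int x, int y) throws Exception {
        Method m = udf.getClass().getMethod("invoke", int.class, int.class);
        return m.invoke(udf, x, y);
    }

    // Interface method: a direct virtual call the JIT can inline.
    static int viaInterface(Udf udf, int x, int y) {
        return udf.invoke(x, y);
    }
}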

> Should they use String or UTF8String? What representations are supported
> and how will Spark detect and produce those representations?
>
> It's the same as InternalRow. We can just copy-paste the document of
> InternalRow to explain the corresponding java type for each data type.
>
My point is that having a single method signature that uses InternalRow and
is inherited from an interface is much easier for users and Spark. If a
user forgets to use UTF8String and writes a method with String instead,
then the UDF wouldn’t work. What then? Does Spark detect that the wrong
type was used? It would need to or else it would be difficult for a UDF
developer to tell what is wrong. And this is a runtime issue so it is
caught late.
-- 
Ryan Blue


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
Hi Ryan,

Sorry if I didn't make it clear. I was referring to implementing UDF using
codegen, not calling the UDF with codegen or not. Calling UDF is Spark's
job and it doesn't matter if the UDF API uses row or individual parameters,
as you said. My point is, it's a bad idea to follow the Expression API to
separate the interpreted and codegen code paths in the UDF API, as it adds
implementation complexity and is error-prone (people need to keep
interpreted and codegen in sync).
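
For readers unfamiliar with that pain point, the dual code path looks roughly
like this hedged sketch (Catalyst's real Expression API differs in detail):
the same logic is written twice and must be kept in sync by hand.

class AddOneExpr {
    // Interpreted path: evaluated directly on the JVM.
    int eval(int input) {
        return input + 1;
    }

    // Codegen path: emits Java source that must mirror eval() exactly;
    // any drift produces codegen-only bugs.
    String doGenCode(String inputVar) {
        return inputVar + " + 1";
    }
}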

You made a good point that the Invoke approach can lead to method
expansion. It's a common problem in Java if you want to specialize the
generic type, and Scala can mitigate this a bit using syntax sugar (type
specification).
Varargs is a problem and in that case I agree the row parameter is
better. The current Scala/Java UDF doesn't support varargs and I'm not sure
how common it is. The UDF can take struct type inputs, which kind of
supports varargs as people can do SELECT my_udf(struct(1, 'abc', ...)).

> And, the Invoke approach has a performance penalty when existing rows
could be simply projected using a wrapper.

This only happens in the interpreted code path. When the query falls back
to interpreted execution, it'll be very slow anyway, and a small performance
penalty in the UDF doesn't really matter.

> There’s also a massive performance penalty for the Invoke approach when
falling back to non-codegen because the function is loaded and invoked each
time eval is called. It is much cheaper to use a method in an interface.

Can you elaborate? Using the row parameter or individual parameters
shouldn't change the life cycle of the UDF instance.

> Should they use String or UTF8String? What representations are supported
and how will Spark detect and produce those representations?

It's the same as InternalRow. We can just copy-paste the document of
InternalRow to explain the corresponding java type for each data type.

On Tue, Feb 9, 2021 at 6:28 AM Ryan Blue  wrote:

> Wenchen,
>
> There are a few issues with the Invoke approach, and I don’t think that
> it is really much better for the additional complexity of the API.
>
> First I think that you’re confusing codegen to call a function with
> codegen to implement a function. The non-goal refers to supporting codegen
> to *implement* a UDF. That’s what could have differences between the
> called version and generated version. But the Invoke option isn’t any
> different in that case because Invoke codegen is only used to call a
> method, and we can codegen int result = udfName(x, y) just like we can
> codegen int result = udfName(row).
>
> The Invoke approach also has a problem with expressiveness. Consider a map
> function that builds a map from its inputs as key/value pairs: map("x", r
> * cos(theta), "y", r * sin(theta)). If this requires a defined Java
> function, then there must be lots of implementations for different numbers
> of pairs, for different types, etc:
>
> public MapData buildMap(String k1, int v1);
> ...
> public MapData buildMap(String k1, long v1);
> ...
> public MapData buildMap(String k1, float v1);
> ...
> public MapData buildMap(String k1, double v1);
> public MapData buildMap(String k1, double v1, String k2, double v2);
> public MapData buildMap(String k1, double v1, String k2, double v2, String 
> k3, double v3);
> ...
>
> Clearly, this and many other use cases would fall back to varargs instead.
> In that case, there is little benefit to using invoke because all of the
> arguments will get collected into an Object[] anyway. That’s basically
> the same thing as using a row object, just without convenience functions
> that return specific types like getString, forcing implementations to
> cast instead. And, the Invoke approach has a performance *penalty* when
> existing rows could be simply projected using a wrapper.
>
> There’s also a massive performance penalty for the Invoke approach when
> falling back to non-codegen because the function is loaded and invoked each
> time eval is called. It is much cheaper to use a method in an interface.
>
> Next, the Invoke approach is much more complicated for implementers to
> use. Should they use String or UTF8String? What representations are
> supported and how will Spark detect and produce those representations? What
> if a function uses both String *and* UTF8String? Will Spark detect this
> for each parameter? Having one or two functions called by Spark is much
> easier to maintain in Spark and avoid a lot of debugging headaches when
> something goes wrong.
>
> On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan  wrote:
>
> This is a very important feature, thanks for working on it!
>>
>> Spark uses codegen by default, and it's a bit unfortunate to see that
>> codegen support is treated as a non-goal. I think it's better to not ask
>> the UDF implementations to provide two different code paths for interpreted
>> evaluation and 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Ryan Blue
Wenchen,

There are a few issues with the Invoke approach, and I don’t think it is
enough of an improvement to justify the additional complexity of the API.

First I think that you’re confusing codegen to call a function with codegen
to implement a function. The non-goal refers to supporting codegen to
*implement* a UDF. That’s what could have differences between the called
version and generated version. But the Invoke option isn’t any different in
that case because Invoke codegen is only used to call a method, and we can
codegen int result = udfName(x, y) just like we can codegen int result =
udfName(row).
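
To make that concrete, here is a minimal sketch (the interface and method
names are illustrative only) showing that either shape is just a call site
for codegen to emit:

import org.apache.spark.sql.catalyst.InternalRow;

interface ArgsUdf { int call(long x, long y); }
interface RowUdf { int produceResult(InternalRow row); }

class CallSites {
  // codegen emits: int result = udfName(x, y)
  static int viaArgs(ArgsUdf udf, long x, long y) {
    return udf.call(x, y);
  }
  // codegen emits: int result = udfName(row)
  static int viaRow(RowUdf udf, InternalRow row) {
    return udf.produceResult(row);
  }
}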

The Invoke approach also has a problem with expressiveness. Consider a map
function that builds a map from its inputs as key/value pairs: map("x", r *
cos(theta), "y", r * sin(theta)). If this requires a defined Java function,
then there must be lots of implementations for different numbers of pairs,
for different types, etc:

public MapData buildMap(String k1, int v1);
...
public MapData buildMap(String k1, long v1);
...
public MapData buildMap(String k1, float v1);
...
public MapData buildMap(String k1, double v1);
public MapData buildMap(String k1, double v1, String k2, double v2);
public MapData buildMap(String k1, double v1, String k2, double v2, String k3, double v3);
...

Clearly, this and many other use cases would fall back to varargs instead.
In that case, there is little benefit to using invoke because all of the
arguments will get collected into an Object[] anyway. That’s basically the
same thing as using a row object, just without convenience functions that
return specific types like getString, forcing implementations to cast
instead. And, the Invoke approach has a performance *penalty* when existing
rows could be simply projected using a wrapper.
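
Here is a hedged sketch of that equivalence (names are illustrative, and the
row variant assumes string keys with double values just to show the typed
accessors):

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
import org.apache.spark.sql.catalyst.util.GenericArrayData;
import org.apache.spark.sql.catalyst.util.MapData;

class BuildMapUdf {
  // Invoke-style varargs fallback: every argument arrives boxed in Object[],
  // and each use as a specific type requires a cast.
  public MapData buildMap(Object... kv) {
    int n = kv.length / 2;
    Object[] keys = new Object[n];
    Object[] values = new Object[n];
    for (int i = 0; i < n; i++) {
      keys[i] = kv[2 * i];
      values[i] = kv[2 * i + 1];
    }
    return new ArrayBasedMapData(new GenericArrayData(keys),
        new GenericArrayData(values));
  }

  // Row-based equivalent: the same access pattern, but with typed getters
  // when the bound input schema is known.
  public MapData produceResult(InternalRow args) {
    int n = args.numFields() / 2;
    Object[] keys = new Object[n];
    Object[] values = new Object[n];
    for (int i = 0; i < n; i++) {
      keys[i] = args.getUTF8String(2 * i);   // no cast needed
      values[i] = args.getDouble(2 * i + 1);
    }
    return new ArrayBasedMapData(new GenericArrayData(keys),
        new GenericArrayData(values));
  }
}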

There’s also a massive performance penalty for the Invoke approach when
falling back to non-codegen because the function is loaded and invoked each
time eval is called. It is much cheaper to use a method in an interface.
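
A hedged illustration of the two paths (this is just the shape of the cost,
not Spark's actual internals):

import java.lang.reflect.Method;

class EvalPaths {
  interface ScalarUdf { int call(long x); }

  // Interface path: one virtual call per row.
  static int evalDirect(ScalarUdf udf, long x) {
    return udf.call(x);
  }

  // Reflective fallback: method lookup plus argument boxing on every row,
  // unless the Method object is carefully cached up front.
  static int evalReflective(Object udf, long x) throws Exception {
    Method m = udf.getClass().getMethod("call", long.class);
    return (Integer) m.invoke(udf, x);
  }
}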

Next, the Invoke approach is much more complicated for implementers to use.
Should they use String or UTF8String? What representations are supported
and how will Spark detect and produce those representations? What if a
function uses both String *and* UTF8String? Will Spark detect this for each
parameter? Having one or two functions called by Spark is much easier to
maintain in Spark and avoids a lot of debugging headaches when something
goes wrong.
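
For instance, nothing stops an implementer from writing all of the following
(a hypothetical class, purely to show the ambiguity Spark would have to
resolve for each parameter position):

import org.apache.spark.unsafe.types.UTF8String;

class AmbiguousUdf {
  int call(String s) { return s.length(); }
  int call(UTF8String s) { return s.numChars(); }
  // Mixed representations in a single signature:
  int call(String s, UTF8String t) { return s.length() + t.numChars(); }
}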

On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan  wrote:

> This is a very important feature, thanks for working on it!
>
> Spark uses codegen by default, and it's a bit unfortunate to see that
> codegen support is treated as a non-goal. I think it's better to not ask
> the UDF implementations to provide two different code paths for interpreted
> evaluation and codegen evaluation. The Expression API does so and it's very
> painful. Many bugs were found due to inconsistencies between
> the interpreted and codegen code paths.
>
> Now, Spark has the infra to call arbitrary Java functions in
> both interpreted and codegen code paths, see StaticInvoke and Invoke. I
> think we are able to define the UDF API in the most efficient way.
> For example, a UDF that takes long and string, and returns int:
>
> class MyUDF implements ... {
>   int call(long arg1, UTF8String arg2) { ... }
> }
>
> There is no compile-time type-safety. But there is also no boxing, no
> extra InternalRow building, no separate interpreted and codegen code
> paths. The UDF will report input data types and result data type, so the
> analyzer can check if the call method is valid via reflection, and we
> still have query-compile-time type safety. It also simplifies development
> as we can just use the Invoke expression to invoke UDFs.
>
> On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion for adding a FunctionCatalog interface to
>> catalog plugins. This will allow catalogs to expose functions to Spark,
>> similar to how the TableCatalog interface allows a catalog to expose
>> tables. The proposal doc is available here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>
>> Here's a high-level summary of some of the main design choices:
>> * Adds the ability to list and load functions, not to create or modify
>> them in an external catalog
>> * Supports scalar, aggregate, and partial aggregate functions
>> * Uses load and bind steps for better error messages and simpler
>> implementations
>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>> data
>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>> and other future features
>>
>> There is also a PR with the proposed API:
>> https://github.com/apache/spark/pull/24559/files
>>
>> Let's discuss the proposal here rather than on that PR, to get better
>> visibility. Also, please take the time to read the proposal first. That
>> really helps clear up 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Wenchen Fan
This is a very important feature, thanks for working on it!

Spark uses codegen by default, and it's a bit unfortunate to see that
codegen support is treated as a non-goal. I think it's better to not ask
the UDF implementations to provide two different code paths for interpreted
evaluation and codegen evaluation. The Expression API does so and it's very
painful. Many bugs were found due to inconsistencies between
the interpreted and codegen code paths.

Now, Spark has the infra to call arbitrary Java functions in
both interpreted and codegen code paths, see StaticInvoke and Invoke. I
think we are able to define the UDF API in the most efficient way.
For example, a UDF that takes long and string, and returns int:

class MyUDF implements ... {
  int call(long arg1, UTF8String arg2) { ... }
}

There is no compile-time type-safety. But there is also no boxing, no extra
InternalRow building, no separate interpreted and codegen code paths. The
UDF will report input data types and result data type, so the analyzer can
check if the call method is valid via reflection, and we still have
query-compile-time type safety. It also simplifies development as we can
just use the Invoke expression to invoke UDFs.
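
A rough sketch of that analysis-time check (the helper and error handling
here are assumptions; the real lookup would be driven by the reported
DataTypes):

import java.lang.reflect.Method;
import org.apache.spark.unsafe.types.UTF8String;

class BindCheck {
  // For a UDF reporting inputs (LongType, StringType) and result
  // IntegerType, verify a matching call method exists before the query runs.
  static void validate(Class<?> udfClass) {
    try {
      Method m = udfClass.getMethod("call", long.class, UTF8String.class);
      if (m.getReturnType() != int.class) {
        throw new IllegalArgumentException("call must return int for IntegerType");
      }
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("no call(long, UTF8String) method found", e);
    }
  }
}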

On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to start a discussion for adding a FunctionCatalog interface to
> catalog plugins. This will allow catalogs to expose functions to Spark,
> similar to how the TableCatalog interface allows a catalog to expose
> tables. The proposal doc is available here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>
> Here's a high-level summary of some of the main design choices:
> * Adds the ability to list and load functions, not to create or modify
> them in an external catalog
> * Supports scalar, aggregate, and partial aggregate functions
> * Uses load and bind steps for better error messages and simpler
> implementations
> * Like the DSv2 table read and write APIs, it uses InternalRow to pass data
> * Can be extended using mix-in interfaces to add vectorization, codegen,
> and other future features
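>
> A sketch of the shape these bullets imply (the interface and method names
> below are illustrative guesses, not the final API in the doc or PR):
>
> import org.apache.spark.sql.catalyst.InternalRow;
> import org.apache.spark.sql.connector.catalog.Identifier;
> import org.apache.spark.sql.types.StructType;
>
> interface FunctionCatalog {
>   Identifier[] listFunctions(String[] namespace); // list and load only
>   UnboundFunction loadFunction(Identifier ident); // load step
> }
>
> interface UnboundFunction {
>   BoundFunction bind(StructType inputType);       // bind step
> }
>
> interface BoundFunction {}
>
> interface ScalarFunction<R> extends BoundFunction {
>   R produceResult(InternalRow input);             // data via InternalRow
> }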
>
> There is also a PR with the proposed API:
> https://github.com/apache/spark/pull/24559/files
>
> Let's discuss the proposal here rather than on that PR, to get better
> visibility. Also, please take the time to read the proposal first. That
> really helps clear up misconceptions.
>
>
>
> --
> Ryan Blue
>