FYI: the Presto UDF API <https://prestodb.io/docs/current/develop/functions.html> also takes individual parameters instead of a single row parameter. I think this direction is at least worth a try so that we can see the performance difference. It's also mentioned in the design doc as an alternative (Trino).
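For concreteness, here is a rough sketch of the two calling conventions. The class and method names are purely illustrative (this is not the proposed API and not Presto's actual annotation-based UDF mechanism); the only point is the difference between receiving arguments packed in an InternalRow versus as individual, typed parameters:

    import org.apache.spark.sql.catalyst.InternalRow;

    // Style A: row-based, roughly what the proposal describes. Arguments arrive
    // packed in an InternalRow and the function pulls typed values out by ordinal.
    class RowBasedAdd {
      Long produceResult(InternalRow args) {
        // per-call row access; the result comes back boxed
        return args.getLong(0) + args.getLong(1);
      }
    }

    // Style B: Presto/Trino-like, individual typed parameters. The engine binds
    // directly to a method like this (e.g. via codegen or MethodHandles), so no
    // row needs to be materialized per call and primitives stay unboxed.
    class TypedAdd {
      long invoke(long left, long right) {
        return left + right;
      }
    }

If both styles were wired into the same micro-benchmark (say, a simple addition over a few hundred million rows), the difference would mostly come from per-call row access and boxing, which is exactly the number it would be nice to see.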
On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Holden,
>
> As Hyukjin said, following existing designs is not the principle of DS v2
> API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
> fully follow the catalog API design from Hive, and I believe Ryan agrees
> with that as well.
>
> I think the problem here is that we were discussing some very detailed
> things without actual code. I'll implement my idea after the holiday and
> then we can have more effective discussions. We can also do benchmarks and
> get some real numbers.
>
> In the meantime, we can continue to discuss other parts of this proposal,
> and make a prototype if possible. Spark SQL has many active
> contributors/committers, and this thread hasn't gotten much attention yet.
>
> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Just dropping a few lines. I remember that one of the goals of DSv2 is to
>> correct the mistakes we made in the current Spark code.
>> There would not be much point if we just end up following and mimicking
>> what Spark currently does. It might just become another copy of the Spark
>> APIs, e.g. the (internal) Expression APIs. I sincerely would like to
>> avoid that.
>> I do believe we have been stuck mainly because we are trying to come up
>> with a better design. We already have the ugly picture of the current
>> Spark APIs from which to draw a better, bigger picture.
>>
>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> I think this proposal is a good set of trade-offs and has existed in the
>>> community for a long period of time. I especially appreciate how the
>>> design is focused on a minimal useful component, with future
>>> optimizations considered from the point of view of making sure it's
>>> flexible, but actual concrete decisions left for the future once we see
>>> how this API is used. I think if we try to optimize everything right out
>>> of the gate, we'll quickly get stuck (again) and not make any progress.
>>>
>>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <b...@apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'd like to start a discussion about adding a FunctionCatalog interface
>>>> to catalog plugins. This will allow catalogs to expose functions to
>>>> Spark, similar to how the TableCatalog interface allows a catalog to
>>>> expose tables. The proposal doc is available here:
>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>
>>>> Here's a high-level summary of some of the main design choices:
>>>> * Adds the ability to list and load functions, not to create or modify
>>>> them in an external catalog
>>>> * Supports scalar, aggregate, and partial aggregate functions
>>>> * Uses load and bind steps for better error messages and simpler
>>>> implementations
>>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>>> data
>>>> * Can be extended using mix-in interfaces to add vectorization,
>>>> codegen, and other future features
>>>>
>>>> There is also a PR with the proposed API:
>>>> https://github.com/apache/spark/pull/24559/files
>>>>
>>>> Let's discuss the proposal here rather than on that PR, to get better
>>>> visibility. Also, please take the time to read the proposal first. That
>>>> really helps clear up misconceptions.
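To make the design choices summarized in Ryan's message more concrete, here is a rough, hypothetical sketch of how the pieces could fit together. All interface and method names below are illustrative only, not the actual proposed API; see the linked PR (https://github.com/apache/spark/pull/24559/files) for the real interfaces.

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.StructType;

    // Catalog side: only listing and loading functions; creating or modifying
    // them stays the responsibility of the external catalog.
    interface FunctionCatalog {
      String[] listFunctions(String[] namespace);
      UnboundFunction loadFunction(String[] namespace, String name);
    }

    // The load step returns an unbound function. bind() sees the actual
    // argument types, so "no such function" and "wrong argument types"
    // surface as distinct, early errors instead of failures at execution time.
    interface UnboundFunction {
      String name();
      BoundFunction bind(StructType inputType);
    }

    interface BoundFunction {
      DataType resultType();
    }

    // Scalar evaluation passes arguments as an InternalRow, mirroring the DSv2
    // table read/write APIs. Aggregate and partial aggregate functions would
    // follow the same load/bind pattern with update/merge/finish-style methods,
    // and vectorized or codegen'd evaluation could be layered on later through
    // mix-in interfaces.
    interface ScalarFunction<R> extends BoundFunction {
      R produceResult(InternalRow input);
    }

Keeping vectorization and codegen as optional mix-ins keeps the initial surface small while leaving room for the performance work discussed earlier in the thread.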