FYI: the Presto UDF API <https://prestodb.io/docs/current/develop/functions.html> also takes individual parameters instead of a single row parameter. I think this direction is at least worth a try so that we can see the performance difference. It's also mentioned in the design doc as an alternative (Trino).
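For concreteness, here is a rough sketch of the two calling conventions. The class and method names are purely illustrative (this is not the proposed API and not Presto's actual annotation-based UDF mechanism); the only point is the difference between receiving arguments packed in an InternalRow versus as individual, typed parameters:

    import org.apache.spark.sql.catalyst.InternalRow;

    // Style A: row-based, roughly what the proposal describes. Arguments arrive
    // packed in an InternalRow and the function pulls typed values out by ordinal.
    class RowBasedAdd {
      Long produceResult(InternalRow args) {
        // per-call row access; the result comes back boxed
        return args.getLong(0) + args.getLong(1);
      }
    }

    // Style B: Presto/Trino-like, individual typed parameters. The engine binds
    // directly to a method like this (e.g. via codegen or MethodHandles), so no
    // row needs to be materialized per call and primitives stay unboxed.
    class TypedAdd {
      long invoke(long left, long right) {
        return left + right;
      }
    }

If both styles were wired into the same micro-benchmark (say, a simple addition over a few hundred million rows), the difference would mostly come from per-call row access and boxing, which is exactly the number it would be nice to see.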
On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Holden,
>
> As Hyukjin said, following existing designs is not the principle of DS v2
> API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
> fully follow the catalog API design from Hive, and I believe Ryan agrees
> with that as well.
>
> I think the problem here is that we were discussing some very detailed
> things without actual code. I'll implement my idea after the holiday and
> then we can have more effective discussions. We can also do benchmarks and
> get some real numbers.
>
> In the meantime, we can continue to discuss other parts of this proposal,
> and make a prototype if possible. Spark SQL has many active
> contributors/committers, and this thread hasn't gotten much attention yet.
>
> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Just dropping a few lines. I remember that one of the goals of DSv2 is to
>> correct the mistakes we made in the current Spark code.
>> There would not be much point if we just end up following and mimicking
>> what Spark currently does. It might just become another copy of the Spark
>> APIs, e.g. the (internal) Expression APIs. I sincerely would like to
>> avoid that.
>> I do believe we have been stuck mainly because we are trying to come up
>> with a better design. We already have the ugly picture of the current
>> Spark APIs from which to draw a better, bigger picture.
>>
>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> I think this proposal is a good set of trade-offs and has existed in the
>>> community for a long period of time. I especially appreciate how the
>>> design is focused on a minimal useful component, with future
>>> optimizations considered from the point of view of making sure it's
>>> flexible, but actual concrete decisions left for the future once we see
>>> how this API is used. I think if we try to optimize everything right out
>>> of the gate, we'll quickly get stuck (again) and not make any progress.
>>>
>>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <b...@apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'd like to start a discussion about adding a FunctionCatalog interface
>>>> to catalog plugins. This will allow catalogs to expose functions to
>>>> Spark, similar to how the TableCatalog interface allows a catalog to
>>>> expose tables. The proposal doc is available here:
>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>
>>>> Here's a high-level summary of some of the main design choices:
>>>> * Adds the ability to list and load functions, not to create or modify
>>>> them in an external catalog
>>>> * Supports scalar, aggregate, and partial aggregate functions
>>>> * Uses load and bind steps for better error messages and simpler
>>>> implementations
>>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>>> data
>>>> * Can be extended using mix-in interfaces to add vectorization,
>>>> codegen, and other future features
>>>>
>>>> There is also a PR with the proposed API:
>>>> https://github.com/apache/spark/pull/24559/files
>>>>
>>>> Let's discuss the proposal here rather than on that PR, to get better
>>>> visibility. Also, please take the time to read the proposal first. That
>>>> really helps clear up misconceptions.
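To make the design choices summarized in Ryan's message more concrete, here is a rough, hypothetical sketch of how the pieces could fit together. All interface and method names below are illustrative only, not the actual proposed API; see the linked PR (https://github.com/apache/spark/pull/24559/files) for the real interfaces.

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.StructType;

    // Catalog side: only listing and loading functions; creating or modifying
    // them stays the responsibility of the external catalog.
    interface FunctionCatalog {
      String[] listFunctions(String[] namespace);
      UnboundFunction loadFunction(String[] namespace, String name);
    }

    // The load step returns an unbound function. bind() sees the actual
    // argument types, so "no such function" and "wrong argument types"
    // surface as distinct, early errors instead of failures at execution time.
    interface UnboundFunction {
      String name();
      BoundFunction bind(StructType inputType);
    }

    interface BoundFunction {
      DataType resultType();
    }

    // Scalar evaluation passes arguments as an InternalRow, mirroring the DSv2
    // table read/write APIs. Aggregate and partial aggregate functions would
    // follow the same load/bind pattern with update/merge/finish-style methods,
    // and vectorized or codegen'd evaluation could be layered on later through
    // mix-in interfaces.
    interface ScalarFunction<R> extends BoundFunction {
      R produceResult(InternalRow input);
    }

Keeping vectorization and codegen as optional mix-ins keeps the initial surface small while leaving room for the performance work discussed earlier in the thread.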