Hi Y’all,

Discussion seems to have settled, so I plan to bring this to a vote at the end of the week, but I wanted to make sure everyone had a chance to comment first, given the holidays.

Cheers & Happy New Year,

Holden :)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, Dec 29, 2025 at 1:15 PM Holden Karau <[email protected]> wrote:

> I've updated the SPIP doc a bit with the feedback in this thread. The tl;dr is:
>
> 1) Error message differences would be "acceptable" (put under "What problem is this proposal NOT designed to solve")
> 2) Clarified that type promotion differences and None/Null handling would be in-scope
> 3) Property-based testing to ensure the above
> 4) Added "A user should be able to tell when a UDF was transpiled by looking at the final query plan."
> 5) Added future work with Wenchen, Serge, and Jungtaek's suggestions, including a way for users to indicate that a UDF should be transpiled and to log when it's not, and exploring simple library usage (e.g., some NLTK functions have equivalent Catalyst operations, and then we could accelerate those vectorized UDFs).
> 6) Added a note around the optimizer only using transpiling when it won't break in-language pipelining (no one asked for this, but I thought it would be good to clarify as explicitly in-scope anyway).
>
> Thanks everyone for their suggestions!
>
> On Mon, Dec 29, 2025 at 12:36 PM Holden Karau <[email protected]> wrote:
>
>> Some in-line responses.
>>
>> On Mon, Dec 29, 2025 at 12:09 PM serge rielau.com <[email protected]> wrote:
>>
>>> I was expecting the optimizer pushdown argument. :-)
>>>
>> I mean, it's true for every rule in optimizer.scala that I've looked at. Even for join hints, when the optimizer overrides them, we don't throw an error; it just shows up in the log statements.
>>
>>> Exposing what happened in e.g. EXPLAIN would at least mitigate the issue. Although EXPLAIN won’t tell WHY it couldn’t transpile (at least not ordinarily…).
>>>
>> We could add a debug log level but I worry our logs are already so busy at that level it might not be actually useful, but we could put it in for sure.
>>
>>> I think my concern is mostly one of being able to reason on why things happen (or not). I.e. docs.
>>> The rules for predicate pushdown are well understood.
>>> This new-fangled transpilation smells like it’s going to be finicky….
>>>
>> It will start by supporting a small number of cases and then work its way up. It's something where folks should be excited when it works, and their code goes faster unexpectedly, but when it doesn't, things behave as they do today.
>>
>>> On Dec 29, 2025, at 11:46 AM, Holden Karau <[email protected]> wrote:
>>>
>>> Oooh what about another way: if we expose in either the logs or the query plan if a UDF has been transpiled? That way a user investigating a regression can see?
>>>
>>> On Mon, Dec 29, 2025 at 11:45 AM Holden Karau <[email protected]> wrote:
>>>
>>>> So most of our optimizer rules fall back gracefully when they can’t be applied, for example filter push down if it can’t push a filter through doesn’t raise an error. I’m thinking of this more like an optimizer rule personally.
>>>>
>>>> That’s why I don’t think we should try to expose transpilation to the user level like that, especially given we want to accelerate pandas on spark where we don’t really control the API fully. Do you have an idea of what you’d want that to look like though?
>>>>
>>>> On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]> wrote:
>>>>
>>>>> How about a compromise? If the user expects transpilation, via a syntax clause we raise an error.
>>>>> If the user says nothing then it’s best effort.
>>>>> That’s also an easy way for a user to verify whether their code applies.
>>>>> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>, wrote:
>>>>>
>>>>> I don’t think raising an error makes sense, we only expect to cover some simple UDFs and when not supported we execute them as normal.
>>>>>
>>>>> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]> wrote:
>>>>>
>>>>>> One important aspect of coverage is to draw a clear line on what is, and what is not covered.
>>>>>> I may go as far as to propose using explicit syntax to denote the intent to transpile. Then, if Spark cannot do it, we can raise an error at DDL and the user is not at a loss why their function is slower than expected.
>>>>>> Or why a small bugfix in its body suddenly regresses performance.
>>>>>>
>>>>>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]> wrote:
>>>>>>
>>>>>> So for vectorized UDF if it's still a simple mathematical expression we could transpile it. Error message equality I think is out of scope, that's a good call out.
>>>>>>
>>>>>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> The idea sounds good but I'm also worried about the coverage. In the recent Spark releases, pandas/arrow UDFs get more support than the classic Python UDFs, but I don't think we can translate pandas/arrow UDFs as we don't have vectorized operators in Spark out of the box.
>>>>>>>
>>>>>>> It's also hard to simulate the behaviors exactly, such as overflow behavior, NULL behavior, error message, etc. Is 100% same behavior the goal of transpilation?
>>>>>>>
>>>>>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>>> Responses in line, thanks for the questions :)
>>>>>>>>
>>>>>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the proposal. UDF has been known to be noticeably slow, especially for the language where we run the external process and do intercommunication, so this is an interesting topic.
>>>>>>>>>
>>>>>>>>> The starting question for this proposal would be the coverage. The proposal says we create an AST and try to convert it to a catalyst plan. Since this does not sound like we are generating Java/bytecode, I assume this only leverages built-in operators/expressions.
>>>>>>>>>
>>>>>>>> Initially yes. Longer term I think it’s possible we explore transpiling to other languages (especially accelerator languages as called out in the docs), but that’s fuzzy.
>>>>>>>>
>>>>>>>>> That said, when we say "simple" UDF, what is exactly the scope of "simple" here? To me, it sounds like if the UDF can be translated to a catalyst plan (without UDF), the UDF has actually been something users could have written via the DataFrame API without UDF, unless we have non-user-facing expressions where users are needed. Same with Pandas on Spark for covering Pandas UDF. Do we see such a case e.g. users fail to write logic based on built-in SQL expressions while they can, and end up with choosing UDF? I think this needs more clarification given that's really a user facing contract and the factor of evaluating this project as a successful one.
>>>>>>>>>
>>>>>>>> Given the transpilation target is Catalyst, yes these would mostly be things someone could express with SQL but expressed in another way.
>>>>>>>>
>>>>>>>> We do have some Catalyst expressions which aren’t directly SQL expressions so not always, but generally.
>>>>>>>>
>>>>>>>> To be clear: I don’t think we should expect users, especially Pandas on Spark users, to rewrite their data frame UDFs to SQL and that’s why this project makes sense.
>>>>>>>>
>>>>>>>>> Once that is clarified, we may have follow-up questions/voices with the answer, something along the line:
>>>>>>>>>
>>>>>>>>> 1. It might be the case we may just want this proposal to be direct to the "future success", translating Python UDF to Java code (codegen) to cover arbitrary logic (unless it's not involving python library, which we had to find alternatives).
>>>>>>>>>
>>>>>>>> I think this can be a reasonable follow-on to this project if this project is successful.
>>>>>>>>
>>>>>>>>> 2. We might want to make sure this proposal is addressing major use cases and not just niche cases. e.g. it might be the case the majority of Python UDF usage is to pull other Python dependencies, then we lose the majority of cases.
>>>>>>>>>
>>>>>>>> I think we don’t expect to cover the majority of UDFs. Even while covering only the simple cases initially it would have a real performance improvement, especially for Pandas on Spark where people can’t express many of these things easily.
>>>>>>>>
>>>>>>>>> Hope I understand the proposal well and ask valid questions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Folks,
>>>>>>>>>>
>>>>>>>>>> It's been a few years since we last looked at transpilation, and with the growth of Pandas on Spark I think it's time we revisit it. I've got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>, some rough proof of concept code <https://github.com/apache/spark/pull/53547> (I think doing the transpilation Python side instead of Scala side makes more sense, but was interesting to play with), and of course everyone's favourite, a design doc <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>. (I also have a collection of YouTube streams playing with the idea <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow along on that journey).
>>>>>>>>>>
>>>>>>>>>> Wishing everyone a happy holidays :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Holden :)
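
For readers skimming the thread, the "best effort, fall back when unsupported" behaviour discussed above can be illustrated with a small sketch. This is not the SPIP's or the proof-of-concept PR's implementation: the names `try_transpile`, `_to_sql`, `Unsupported`, and `_BIN_OPS` are hypothetical, the sketch emits SQL expression text rather than Catalyst expressions, and it deliberately ignores the overflow, type promotion, and NULL-handling differences that the thread flags as the hard part.

```python
import ast
import math
import textwrap

# Hypothetical mapping from Python AST operators to SQL operator text.
# Division is left out on purpose: its result type differs between languages.
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


class Unsupported(Exception):
    """Raised internally when a node falls outside the tiny supported subset."""


def _to_sql(node, params):
    # Binary arithmetic on supported operators, translated recursively.
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    # A reference to one of the UDF's own parameters becomes a column name.
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    # Plain finite int/float literals only (bool is excluded by the exact type check).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        if math.isfinite(node.value):
            return repr(node.value)
    raise Unsupported(type(node).__name__)


def try_transpile(src):
    """Return SQL expression text for a single-`return` function, else None.

    None means "not transpiled": the caller would run the UDF as it does today.
    """
    try:
        tree = ast.parse(textwrap.dedent(src))
    except SyntaxError:
        return None
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    fn = tree.body[0]
    if len(fn.body) != 1 or not isinstance(fn.body[0], ast.Return):
        return None
    if fn.body[0].value is None:
        return None
    params = {a.arg for a in fn.args.args}
    try:
        return _to_sql(fn.body[0].value, params)
    except Unsupported:
        return None
```

Under these assumptions, a function whose body is `return x * 2 + 1` yields expression text, while one that calls `len(x)` yields `None` and would simply run as a normal UDF, with no error raised. Whether the UDF was transpiled is then just a matter of checking the return value, which is the kind of signal the thread suggests surfacing in the logs or the query plan.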
