Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Holden Karau Wed, 07 Jan 2026 10:58:10 -0800

Hi Y’all,

Discussion seems to have settled, so I plan to bring this to a vote at the
end of the week, but I wanted to make sure everyone had a chance to comment
first, given the holidays.


Cheers & Happy New Year,

Holden :)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Mon, Dec 29, 2025 at 1:15 PM Holden Karau <[email protected]> wrote:

> I've updated the SPIP doc a bit with the feedback in this thread. The tl;dr
> is:
>
> 1) Error message differences would be "acceptable" (put under What problem
> is this proposal NOT designed to solve)
> 2) Clarified that type promotion differences, None/Null handling would be
> in-scope
> 3) Property-based testing to ensure the above
> 4) Added "A user should be able to tell when a UDF was transpiled by
> looking at the final query plan."
> 5) Added future work with Wenchen, Serge, and Jungtaek's suggestions,
> including a way for users to indicate that a UDF should be transpiled and
> to log when it's not, and exploring simple library usage (e.g., some NLTK
> functions have equivalent Catalyst operations, so we could accelerate
> those vectorized UDFs).
> 6) Added a note around the optimizer only using transpiling when it won't
> break in-language pipelining (no one asked for this but I thought it would
> be good to clarify as explicitly in-scope anyways).
>
> Thanks everyone for their suggestions!
>
> On Mon, Dec 29, 2025 at 12:36 PM Holden Karau <[email protected]>
> wrote:
>
>> Some in-line responses.
>>
>>
>> On Mon, Dec 29, 2025 at 12:09 PM serge rielau.com <[email protected]>
>> wrote:
>>
>>> I was expecting the optimizer pushdown argument. :-)
>>>
>> I mean, it's true for every rule in optimizer.scala that I've looked at.
>> Even for join hints, when the optimizer overrides them, we don't throw an
>> error; it just shows up in the log statements.
>>
>>> Exposing what happened in e.g. EXPLAIN would at least mitigate the
>>> issue. Although EXPLAIN won’t tell WHY it couldn’t transpile (at least not
>>> ordinarily…).
>>>
>> We could add a debug log level, but I worry our logs are already so busy
>> at that level that it might not actually be useful. We could put it in
>> for sure, though.
>>
>>>
>>> I think my concern is mostly one of being able to reason on why things
>>> happen (or not). I.e. docs.
>>> The rules for predicate pushdown are well understood.
>>> This new-fangled transpilation smells like it’s going to be finicky….
>>>
>> It will start by supporting a small number of cases and then work its way
>> up. It's something where folks should be excited when it works, and their
>> code goes faster unexpectedly, but when it doesn't, things behave as they
>> do today.
>>
>>>
>>>
>>>
>>> On Dec 29, 2025, at 11:46 AM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>> Oooh, what about another way: what if we expose in either the logs or
>>> the query plan whether a UDF has been transpiled? That way a user
>>> investigating a regression can see.
>>>
>>>
>>> On Mon, Dec 29, 2025 at 11:45 AM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> So most of our optimizer rules fall back gracefully when they can’t be
>>>> applied; for example, filter pushdown doesn’t raise an error if it can’t
>>>> push a filter through. I’m thinking of this more like an optimizer rule
>>>> personally.
>>>>
>>>> That’s why I don’t think we should try to expose transpilation to the
>>>> user level like that, especially given we want to accelerate pandas on
>>>> spark where we don’t really control the API fully. Do you have an idea of
>>>> what you’d want that to look like though?
>>>>
>>>>
>>>>
>>>> On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]>
>>>> wrote:
>>>>
>>>>> How about a compromise? If the user expects transpilation (via a
>>>>> syntax clause) and it cannot be done, we raise an error.
>>>>> If the user says nothing then it’s best effort.
>>>>> That’s also an easy way for a user to verify whether their code
>>>>> applies.
>>>>> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>,
>>>>> wrote:
>>>>>
>>>>> I don’t think raising an error makes sense; we only expect to cover
>>>>> some simple UDFs, and when a UDF is not supported we execute it as normal.
>>>>>
>>>>>
>>>>> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> One important aspect of coverage is to draw a clear line on what is,
>>>>>> and what is not covered.
>>>>>> I may go as far as to propose using explicit syntax to denote the
>>>>>> intent to transpile. Then, if Spark cannot do it, we can raise an error
>>>>>> at DDL and the user is not at a loss as to why their function is slower
>>>>>> than expected, or why a small bugfix in its body suddenly regresses
>>>>>> performance.
>>>>>>
>>>>>>
>>>>>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> So for vectorized UDF if it's still a simple mathematical expression
>>>>>> we could transpile it. Error message equality I think is out of scope,
>>>>>> that's a good call out.
>>>>>>
>>>>>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> The idea sounds good but I'm also worried about the coverage. In the
>>>>>>> recent Spark releases, pandas/arrow UDFs get more support than the
>>>>>>> classic Python UDFs, but I don't think we can translate pandas/arrow
>>>>>>> UDFs as we don't have vectorized operators in Spark out of the box.
>>>>>>>
>>>>>>> It's also hard to simulate the behaviors exactly, such as overflow
>>>>>>> behavior, NULL behavior, error message, etc. Is 100% same behavior the
>>>>>>> goal of transpilation?
>>>>>>>
>>>>>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Responses in line, thanks for the questions :)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the proposal. UDF has been known to be noticeably slow,
>>>>>>>>> especially for the language where we run the external process and do
>>>>>>>>> intercommunication, so this is an interesting topic.
>>>>>>>>>
>>>>>>>>> The starting question for this proposal would be the coverage. The
>>>>>>>>> proposal says we create an AST and try to convert it to a catalyst
>>>>>>>>> plan. Since this does not sound like we are generating Java/bytecode,
>>>>>>>>> I assume this only leverages built-in operators/expressions.
>>>>>>>>>
>>>>>>>> Initially yes. Longer term I think it’s possible we explore
>>>>>>>> transpiling to other languages (especially accelerator languages as
>>>>>>>> called out in the docs), but that’s fuzzy.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> That said, when we say "simple" UDF, what exactly is the scope of
>>>>>>>>> "simple" here? To me, it sounds like if the UDF can be translated to
>>>>>>>>> a catalyst plan (without UDF), the UDF is actually something users
>>>>>>>>> could have written via the DataFrame API without UDF, unless we have
>>>>>>>>> non-user-facing expressions where users are needed. Same with Pandas
>>>>>>>>> on Spark for covering Pandas UDF. Do we see such a case, e.g. users
>>>>>>>>> fail to write logic based on built-in SQL expressions while they can,
>>>>>>>>> and end up choosing UDF? I think this needs more clarification given
>>>>>>>>> that's really a user-facing contract and a factor in evaluating this
>>>>>>>>> project as a successful one.
>>>>>>>>>
>>>>>>>> Given the transpilation target is Catalyst, yes these would mostly
>>>>>>>> be things someone could express with SQL but expressed in another way.
>>>>>>>>
>>>>>>>> We do have some Catalyst expressions which aren’t directly SQL
>>>>>>>> expressions so not always, but generally.
>>>>>>>>
>>>>>>>> To be clear: I don’t think we should expect users, especially
>>>>>>>> Pandas on Spark users, to rewrite their data frame UDFs to SQL and
>>>>>>>> that’s why this project makes sense.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Once that is clarified, we may have follow-up questions/voices
>>>>>>>>> with the answer, something along the line:
>>>>>>>>>
>>>>>>>>> 1. It might be the case we may just want this proposal to be
>>>>>>>>> direct to the "future success", translating Python UDF to Java code
>>>>>>>>> (codegen) to cover arbitrary logic (unless it's not involving python
>>>>>>>>> library, which we had to find alternatives).
>>>>>>>>>
>>>>>>>> I think this can be a reasonable follow on this project if this
>>>>>>>> project is successful.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. We might want to make sure this proposal is addressing
>>>>>>>>> major use cases and not just niche cases. e.g. it might be the case
>>>>>>>>> the majority of Python UDF usage is to pull other Python
>>>>>>>>> dependencies, then we lose the majority of cases.
>>>>>>>>>
>>>>>>>> I think we don’t expect to cover the majority of UDFs. Even while
>>>>>>>> covering only the simple cases initially it would have a real
>>>>>>>> performance improvement, especially for Pandas on Spark where people
>>>>>>>> can’t express many of these things easily.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hope I understand the proposal well and ask valid questions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Folks,
>>>>>>>>>>
>>>>>>>>>> It's been a few years since we last looked at transpilation, and
>>>>>>>>>> with the growth of Pandas on Spark I think it's time we revisit it.
>>>>>>>>>> I've got a JIRA filed
>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-54783>, some rough
>>>>>>>>>> proof of concept code
>>>>>>>>>> <https://github.com/apache/spark/pull/53547> (I think doing the
>>>>>>>>>> transpilation Python side instead of Scala side makes more sense,
>>>>>>>>>> but it was interesting to play with), and of course everyone's
>>>>>>>>>> favourite, a design doc
>>>>>>>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>>>>>>>>> (I also have a collection of YouTube streams playing with the idea
>>>>>>>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants
>>>>>>>>>> to follow along on that journey.)
>>>>>>>>>>
>>>>>>>>>> Wishing everyone a happy holidays :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Holden :)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>
>>
>
>
>
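
For anyone skimming the thread, here is a minimal sketch of the best-effort
behaviour discussed above: parse a simple Python function into an AST, map
the supported nodes to a SQL expression string, and return None so the caller
can run the UDF as normal when anything is unsupported. It is illustrative
only and is not taken from the SPIP doc or the proof-of-concept PR. The names
try_transpile, _to_sql and _BIN_OPS are hypothetical, and the sketch emits SQL
text rather than building Catalyst expressions.

```python
import ast
import textwrap

# Hypothetical operator table for this sketch. Division is left out on
# purpose because its semantics can differ between Python and SQL
# (for example divide-by-zero handling).
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def _to_sql(node, params):
    """Return SQL text for an expression node, or None if unsupported."""
    # A reference to one of the function's own parameters becomes a column.
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    # Plain int/float literals only; bool, str, None and others fall back.
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    # Supported binary operators recurse into both operands.
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        if left is None or right is None:
            return None
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    # Anything else (calls, attribute access, comparisons, ...) is unsupported.
    return None


def try_transpile(source):
    """Best effort: SQL text for a one-line `return` function, else None."""
    try:
        tree = ast.parse(textwrap.dedent(source))
    except SyntaxError:
        return None
    # Accept only a module holding exactly one function definition.
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    func = tree.body[0]
    # Accept only a body that is a single `return <expr>` statement.
    if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
        return None
    if func.body[0].value is None:
        return None
    params = {arg.arg for arg in func.args.args}
    return _to_sql(func.body[0].value, params)
```

With this sketch, try_transpile("def f(x):\n    return x * 2 + 1") gives
"((x * 2) + 1)", while a body such as "return len(str(x))" gives None, which
is the point where the UDF would simply execute as it does today.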
