Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Holden Karau Mon, 29 Dec 2025 12:37:59 -0800

Some in-line responses.


On Mon, Dec 29, 2025 at 12:09 PM serge rielau.com <[email protected]> wrote:

> I was expecting the optimizer pushdown argument. :-)
>
I mean, it's true for every rule in optimizer.scala that I've looked at.
Even for join hints, when the optimizer overrides them, we don't throw an
error; it just shows up in the log statements.

> Exposing what happened in e.g. EXPLAIN would at least mitigate the issue.
> Although EXPLAIN won’t tell WHY it couldn’t transpile (at least not
> ordinarily…).
>
We could add a debug-level log message, but I worry our logs are already so
busy at that level that it might not actually be useful. We could put it in
for sure, though.
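
As a rough illustration of reporting WHY a UDF was not transpiled, here is a
minimal Python sketch. It is not the proof-of-concept code: the logger name,
the `first_unsupported` and `log_decision` helpers, and the set of allowed AST
nodes are all made up for illustration. It parses an expression with the
standard `ast` module and logs the first construct outside the allowed set at
DEBUG level.

```python
import ast
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("udf_transpile_sketch")

# Assumed whitelist of AST node types a first version might accept.
ALLOWED = (ast.Expression, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
           ast.Name, ast.Load, ast.Constant)


def first_unsupported(expr_src):
    """Return a reason string for the first unsupported construct, or None."""
    try:
        tree = ast.parse(expr_src, mode="eval")
    except SyntaxError:
        return "not a single Python expression"
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            return f"unsupported construct {type(node).__name__}"
    return None


def log_decision(udf_name, expr_src):
    """Log the outcome at DEBUG level; True means the sketch would transpile."""
    reason = first_unsupported(expr_src)
    if reason is None:
        logger.debug("UDF %s: transpiled", udf_name)
        return True
    logger.debug("UDF %s: not transpiled, running as a normal UDF (%s)",
                 udf_name, reason)
    return False
```

With something like this, `first_unsupported("math.sqrt(x) + 1")` names the
`Call` node as the blocker, which is the kind of detail EXPLAIN alone would
not ordinarily show.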

>
> I think my concern is mostly one of being able to reason on why things
> happen (or not). I.e. docs.
> The rules for predicate pushdown are well understood.
> This new-fangled transpilation smells like it’s going to be finicky….
>
It will start by supporting a small number of cases and then work its way
up. It's something where folks should be excited when it works, and their
code goes faster unexpectedly, but when it doesn't, things behave as they
do today.
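
To make the best-effort behaviour concrete, here is a small Python sketch,
assuming a transpiler that only understands lambdas built from `+`, `-`, `*`,
parameter names, and numeric literals. The `try_transpile` name and the
operator table are hypothetical and not taken from the proof-of-concept PR.
It returns a SQL-like expression string when the lambda is simple enough and
`None` otherwise, in which case the caller would run the UDF as it does today.
It deliberately ignores the overflow and NULL differences raised elsewhere in
this thread.

```python
import ast

# Hypothetical whitelist: Python binary operators mapped to SQL operators.
# Division is left out on purpose, since its semantics may differ.
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def _to_sql(node, params):
    """Return a SQL-like string for a supported node, else None."""
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        if left is None or right is None:
            return None
        return f"({left} {_BINOPS[type(node.op)]} {right})"
    return None  # anything else is unsupported in this sketch


def try_transpile(lambda_src):
    """Best effort: a SQL-like string for a simple lambda, else None."""
    try:
        tree = ast.parse(lambda_src.strip(), mode="eval")
    except SyntaxError:
        return None
    fn = tree.body
    if not isinstance(fn, ast.Lambda):
        return None
    a = fn.args
    if a.vararg or a.kwarg or a.kwonlyargs or a.defaults or a.posonlyargs:
        return None
    params = {arg.arg for arg in a.args}
    return _to_sql(fn.body, params)


print(try_transpile("lambda x: x * 2 + 1"))     # ((x * 2) + 1)
print(try_transpile("lambda x: math.sqrt(x)"))  # None
```

The point of returning `None` instead of raising is that an unsupported UDF
is never an error; it simply does not get the speedup.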

>
>
>
> On Dec 29, 2025, at 11:46 AM, Holden Karau <[email protected]> wrote:
>
> Oooh what about another way: if we expose in either the logs or the query
> plan if a UDF has been transpiled? That way a user investigating a
> regression can see?
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Mon, Dec 29, 2025 at 11:45 AM Holden Karau <[email protected]>
> wrote:
>
>> So most of our optimizer rules fall back gracefully when they can’t be
>> applied; for example, filter pushdown doesn’t raise an error if it can’t
>> push a filter through. I’m thinking of this more like an optimizer rule
>> personally.
>>
>> That’s why I don’t think we should try to expose transpilation to the
>> user level like that, especially given we want to accelerate pandas on
>> spark where we don’t really control the API fully. Do you have an idea of
>> what you’d want that to look like though?
>>
>>
>> On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]>
>> wrote:
>>
>>> How about a compromise? If the user declares, via a syntax clause, that
>>> they expect transpilation, we raise an error when it can’t be done.
>>> If the user says nothing then it’s best effort.
>>> That’s also an easy way for a user to verify whether their code applies.
>>>
>>> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>,
>>> wrote:
>>>
>>> I don’t think raising an error makes sense; we only expect to cover some
>>> simple UDFs, and when a UDF is not supported we execute it as normal.
>>>
>>>
>>> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]>
>>> wrote:
>>>
>>>> One important aspect of coverage is to draw a clear line on what is,
>>>> and what is not covered.
>>>> I may go as far as to propose explicit syntax to denote the intent to
>>>> transpile. Then, if Spark cannot do it, we can raise an error at DDL and
>>>> the user is not at a loss why their function is slower than expected,
>>>> or why a small bugfix in its body suddenly regresses performance.
>>>>
>>>>
>>>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>> So for vectorized UDF if it's still a simple mathematical expression we
>>>> could transpile it. Error message equality I think is out of scope, that's
>>>> a good call out.
>>>>
>>>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]>
>>>> wrote:
>>>>
>>>>> The idea sounds good but I'm also worried about the coverage. In the
>>>>> recent Spark releases, pandas/arrow UDFs get more support than the classic
>>>>> Python UDFs, but I don't think we can translate pandas/arrow UDFs as we
>>>>> don't have vectorized operators in Spark out of the box.
>>>>>
>>>>> It's also hard to simulate the behaviors exactly, such as overflow
>>>>> behavior, NULL behavior, error message, etc. Is 100% same behavior the
>>>>> goal of transpilation?
>>>>>
>>>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Responses in line, thanks for the questions :)
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the proposal. UDF has been known to be noticeably slow,
>>>>>>> especially for the language where we run the external process and do
>>>>>>> intercommunication, so this is an interesting topic.
>>>>>>>
>>>>>>> The starting question for this proposal would be the coverage. The
>>>>>>> proposal says we create an AST and try to convert it to a catalyst plan.
>>>>>>> Since this does not sound like we are generating Java/bytecode, I
>>>>>>> assume this only leverages built-in operators/expressions.
>>>>>>>
>>>>>> Initially yes. Longer term I think it’s possible we explore
>>>>>> transpiling to other languages (especially accelerator languages as
>>>>>> called out in the docs), but that’s fuzzy.
>>>>>>
>>>>>>>
>>>>>>> That said, when we say "simple" UDF, what is exactly the scope of
>>>>>>> "simple" here? To me, it sounds like if the UDF can be translated to
>>>>>>> a catalyst plan (without UDF), the UDF has actually been something users
>>>>>>> could have written via the DataFrame API without UDF, unless we have
>>>>>>> non-user-facing expressions where users are needed. Same with Pandas on
>>>>>>> Spark for covering Pandas UDF. Do we see such a case e.g. users fail to
>>>>>>> write logic based on built-in SQL expressions while they can, and end up
>>>>>>> with choosing UDF? I think this needs more clarification given that's
>>>>>>> really a user facing contract and the factor of evaluating this project
>>>>>>> as a successful one.
>>>>>>>
>>>>>> Given the transpilation target is Catalyst, yes these would mostly be
>>>>>> things someone could express with SQL but expressed in another way.
>>>>>>
>>>>>> We do have some Catalyst expressions which aren’t directly SQL
>>>>>> expressions so not always, but generally.
>>>>>>
>>>>>> To be clear: I don’t think we should expect users, especially Pandas
>>>>>> on Spark users, to rewrite their data frame UDFs to SQL and that’s why
>>>>>> this project makes sense.
>>>>>>
>>>>>>>
>>>>>>> Once that is clarified, we may have follow-up questions/voices with
>>>>>>> the answer, something along the line:
>>>>>>>
>>>>>>> 1. It might be the case we may just want this proposal to be direct
>>>>>>> to the "future success", translating Python UDF to Java code (codegen)
>>>>>>> to cover arbitrary logic (unless it's not involving python library,
>>>>>>> which we had to find alternatives).
>>>>>>>
>>>>>> I think this can be a reasonable follow-on to this project if this
>>>>>> project is successful.
>>>>>>
>>>>>>>
>>>>>>> 2. We might want to make sure this proposal is addressing major use
>>>>>>> cases and not just niche cases. e.g. it might be the case the majority
>>>>>>> of Python UDF usage is to pull other Python dependencies, then we lose
>>>>>>> the majority of cases.
>>>>>>>
>>>>>> I think we don’t expect to cover the majority of UDFs. Even while
>>>>>> covering only the simple cases initially it would have a real performance
>>>>>> improvement, especially for Pandas on Spark where people can’t express
>>>>>> many of these things easily.
>>>>>>
>>>>>>>
>>>>>>> Hope I understand the proposal well and ask valid questions.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> It's been a few years since we last looked at transpilation, and
>>>>>>>> with the growth of Pandas on Spark I think it's time we revisit it.
>>>>>>>> I've got a JIRA filed
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-54783>, some rough
>>>>>>>> proof of concept code <https://github.com/apache/spark/pull/53547> (I
>>>>>>>> think doing the transpilation Python side instead of Scala side makes
>>>>>>>> more sense, but it was interesting to play with), and of course
>>>>>>>> everyone's favourite, a design doc.
>>>>>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>
>>>>>>>> (I also have a collection of YouTube streams playing with the idea
>>>>>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>>>>>> follow along on that journey).
>>>>>>>>
>>>>>>>> Wishing everyone a happy holidays :)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>

