Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Holden Karau Mon, 29 Dec 2025 11:46:28 -0800

Most of our optimizer rules fall back gracefully when they can’t be
applied. For example, filter pushdown doesn’t raise an error when it can’t
push a filter through. Personally, I’m thinking of this more like an
optimizer rule.


That’s why I don’t think we should try to expose transpilation at the user
level like that, especially given we want to accelerate Pandas on Spark,
where we don’t fully control the API. Do you have an idea of what
you’d want that to look like though?
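
A minimal sketch of the optimizer-rule-style fallback described above,
assuming a hypothetical try_transpile helper that takes a UDF's source text
and emits a SQL expression string. None of these names come from the SPIP
or the proof-of-concept PR, and the sketch ignores the overflow and NULL
differences raised elsewhere in this thread. It only illustrates that an
unsupported construct yields None (keep running the UDF as usual) rather
than an error:

```python
import ast
from typing import Optional, Set

# Hypothetical mapping from Python AST operators to SQL operator text.
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def _to_sql(node: ast.AST, params: Set[str]) -> str:
    """Translate a tiny arithmetic subset; raise ValueError for anything else."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    # Only plain integer literals; bool is excluded by the exact type check.
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return repr(node.value)
    raise ValueError(f"unsupported construct: {type(node).__name__}")


def try_transpile(source: str) -> Optional[str]:
    """Return a SQL expression string, or None meaning 'run as a normal UDF'."""
    try:
        module = ast.parse(source)
    except SyntaxError:
        return None
    if len(module.body) != 1 or not isinstance(module.body[0], ast.FunctionDef):
        return None
    fn = module.body[0]
    # Only single-statement bodies of the form `return <expr>` are considered.
    if len(fn.body) != 1 or not isinstance(fn.body[0], ast.Return):
        return None
    if fn.body[0].value is None:
        return None
    params = {a.arg for a in fn.args.args}
    try:
        return _to_sql(fn.body[0].value, params)
    except ValueError:
        # Graceful fallback: no error is surfaced to the user.
        return None
```

With this shape, try_transpile("def f(x):\n    return x * 2 + 1\n") gives
"((x * 2) + 1)", while a body that calls into a Python library gives None
and the caller keeps the original UDF.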

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]> wrote:

> How about a compromise? If the user explicitly requests transpilation via a
> syntax clause, we raise an error when it cannot be applied.
> If the user says nothing, then it’s best effort.
> That’s also an easy way for a user to verify whether their code qualifies.
> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>,
> wrote:
>
> I don’t think raising an error makes sense. We only expect to cover some
> simple UDFs, and when one is not supported we execute it as normal.
>
>
> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]> wrote:
>
>> One important aspect of coverage is to draw a clear line on what is, and
>> what is not, covered.
>> I may go as far as to propose explicit syntax to denote the intent to
>> transpile. Then, if Spark cannot do it, we can raise an error at DDL time,
>> and the user is not at a loss as to why their function is slower than
>> expected, or why a small bugfix in its body suddenly regresses performance.
>>
>>
>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]>
>> wrote:
>>
>> So for vectorized UDFs, if it's still a simple mathematical expression we
>> could transpile it. Error message equality I think is out of scope; that's
>> a good callout.
>>
>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>>
>>> The idea sounds good but I'm also worried about the coverage. In the
>>> recent Spark releases, pandas/arrow UDFs get more support than the classic
>>> Python UDFs, but I don't think we can translate pandas/arrow UDFs as we
>>> don't have vectorized operators in Spark out of the box.
>>>
>>> It's also hard to simulate the behaviors exactly, such as overflow
>>> behavior, NULL behavior, error messages, etc. Is 100% identical behavior
>>> the goal of transpilation?
>>>
>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> Responses in line, thanks for the questions :)
>>>>
>>>>
>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks for the proposal. UDFs are known to be noticeably slow,
>>>>> especially for languages where we run an external process and
>>>>> communicate with it, so this is an interesting topic.
>>>>>
>>>>> The starting question for this proposal would be the coverage. The
>>>>> proposal says we create an AST and try to convert it to a Catalyst plan.
>>>>> Since this does not sound like we are generating Java/bytecode, I assume
>>>>> this only leverages built-in operators/expressions.
>>>>>
>>>> Initially yes. Longer term I think it’s possible we explore transpiling
>>>> to other languages (especially accelerator languages as called out in the
>>>> docs), but that’s fuzzy.
>>>>
>>>>>
>>>>> That said, when we say "simple" UDF, what exactly is the scope of
>>>>> "simple" here? It sounds to me like if the UDF can be translated to
>>>>> a Catalyst plan (without a UDF), then it is something users could have
>>>>> written via the DataFrame API without a UDF, unless non-user-facing
>>>>> expressions are needed. The same goes for Pandas on Spark for covering
>>>>> Pandas UDFs. Do we see such cases, e.g. users fail to write logic with
>>>>> built-in SQL expressions even though they could, and end up choosing a
>>>>> UDF? I think this needs more clarification, given that it's really a
>>>>> user-facing contract and a factor in evaluating whether this project is
>>>>> a success.
>>>>>
>>>> Given the transpilation target is Catalyst, yes, these would mostly be
>>>> things someone could express with SQL but expressed in another way.
>>>>
>>>> We do have some Catalyst expressions which aren’t directly SQL
>>>> expressions so not always, but generally.
>>>>
>>>> To be clear: I don’t think we should expect users, especially Pandas on
>>>> Spark users, to rewrite their DataFrame UDFs in SQL, and that’s why this
>>>> project makes sense.
>>>>
>>>>>
>>>>> Once that is clarified, we may have follow-up questions/voices
>>>>> depending on the answer, something along these lines:
>>>>>
>>>>> 1. We might want this proposal to aim directly at the "future
>>>>> success": translating Python UDFs to Java code (codegen) to cover
>>>>> arbitrary logic (as long as it doesn't involve Python libraries, for
>>>>> which we would have to find alternatives).
>>>>>
>>>> I think this can be a reasonable follow-on to this project if this
>>>> project is successful.
>>>>
>>>>>
>>>>> 2. We might want to make sure this proposal addresses major use
>>>>> cases and not just niche cases. E.g. if the majority of Python UDF
>>>>> usage is to pull in other Python dependencies, then we lose the
>>>>> majority of cases.
>>>>>
>>>> I don’t think we expect to cover the majority of UDFs. Even covering
>>>> only the simple cases initially would yield a real performance
>>>> improvement, especially for Pandas on Spark, where people can’t express
>>>> many of these things easily.
>>>>
>>>>>
>>>>> Hope I understand the proposal well and ask valid questions.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> It's been a few years since we last looked at transpilation, and with
>>>>>> the growth of Pandas on Spark I think it's time we revisit it. I've got
>>>>>> a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>, some
>>>>>> rough proof of concept code <https://github.com/apache/spark/pull/53547>
>>>>>> (I think doing the transpilation Python side instead of Scala side makes
>>>>>> more sense, but it was interesting to play with), and of course
>>>>>> everyone's favourite, a design doc
>>>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>>>>> (I also have a collection of YouTube streams playing with the idea
>>>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>>>> follow along on that journey.)
>>>>>>
>>>>>> Wishing everyone a happy holidays :)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>>
>>>>>
>>
>>
