How about a compromise? If the user explicitly requests transpilation via a syntax clause and we can't do it, we raise an error.
If the user says nothing, then it's best effort.
That's also an easy way for a user to verify whether their code qualifies.
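A minimal stdlib-only sketch of that contract (the names `try_transpile`, `register_udf`, and the `transpile` flag are all hypothetical, not an actual Spark API):

```python
# Hypothetical sketch of the proposed contract: explicit opt-in fails loudly,
# unspecified means best effort with silent fallback. Not real Spark code.

def try_transpile(udf_source):
    """Return a Catalyst-expressible form of the UDF, or None if unsupported."""
    # Placeholder: a real implementation would walk the Python AST.
    return None

def register_udf(udf_source, transpile=None):
    plan = try_transpile(udf_source)
    if transpile is True and plan is None:
        # User explicitly asked for transpilation: fail at DDL time.
        raise ValueError("UDF cannot be transpiled to a Catalyst expression")
    # transpile=None (unspecified): best effort, fall back to a regular UDF.
    return plan if plan is not None else "run-as-regular-udf"
```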
On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>, wrote:
I don’t think raising an error makes sense; we only expect to cover some simple 
UDFs, and when one isn’t supported we execute it as normal.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]> wrote:
One important aspect of coverage is drawing a clear line on what is, and what 
is not, covered.
I may go as far as proposing explicit syntax to denote the intent to 
transpile. Then, if Spark cannot do it, we can raise an error at DDL time, and 
the user is not left wondering why their function is slower than expected, or 
why a small bugfix in its body suddenly regresses performance.


On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]> wrote:

So for vectorized UDFs, if the body is still a simple mathematical expression 
we could transpile it. Error message equality I think is out of scope; that's a 
good call-out.

On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
The idea sounds good but I'm also worried about the coverage. In the recent 
Spark releases, pandas/arrow UDFs get more support than the classic Python 
UDFs, but I don't think we can translate pandas/arrow UDFs as we don't have 
vectorized operators in Spark out of the box.

It's also hard to simulate the behaviors exactly: overflow behavior, NULL 
behavior, error messages, etc. Is 100% identical behavior the goal of 
transpilation?
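To make the overflow concern concrete, here is a small stdlib-only illustration (not Spark code) of one such gap: Python ints are arbitrary precision, while a Catalyst LongType add wraps at 64 bits in non-ANSI mode, so a naively transpiled `a + b` can diverge from the Python UDF at the int64 boundary:

```python
# Illustrative only: simulate 64-bit two's-complement wraparound to show how
# Python-int arithmetic and LongType arithmetic can disagree.

def to_int64(n):
    """Wrap an arbitrary-precision int into a signed 64-bit value."""
    n &= (1 << 64) - 1
    return n - (1 << 64) if n >= (1 << 63) else n

a, b = 2**63 - 1, 1
python_udf_result = a + b            # 9223372036854775808: no overflow in Python
long_type_result = to_int64(a + b)   # -9223372036854775808: 64-bit wraparound
```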

On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]> wrote:
Responses inline, thanks for the questions :)



On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:
Thanks for the proposal. UDFs have been known to be noticeably slow, especially 
for languages where we run an external process and communicate with it, so this 
is an interesting topic.

The starting question for this proposal would be the coverage. The proposal 
says we create an AST and try to convert it to a Catalyst plan. Since it does 
not sound like we are generating Java bytecode, I assume this only leverages 
built-in operators/expressions.
Initially yes. Longer term I think it’s possible we explore transpiling to 
other languages (especially accelerator languages as called out in the docs), 
but that’s fuzzy.
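To illustrate the AST approach in miniature, here is a hedged stdlib-only sketch (not the actual PoC code; the helper names are made up) of turning a simple Python lambda into a SQL-style expression string, the kind of mapping a Catalyst-targeting transpiler would perform:

```python
import ast

# Only +, -, *, names, and numeric constants are handled; anything else is
# treated as "not transpilable" and would fall back to a regular UDF.
_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def _to_sql_expr(node):
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        op = _OPS[type(node.op)]
        return f"({_to_sql_expr(node.left)} {op} {_to_sql_expr(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return repr(node.value)
    raise ValueError("not transpilable")

def transpile_lambda(source):
    """Return a SQL expression for a simple lambda, or None if unsupported."""
    fn = ast.parse(source, mode="eval").body
    try:
        return _to_sql_expr(fn.body)
    except ValueError:
        return None
```

For example, `transpile_lambda("lambda x: x * 2 + 1")` yields `"((x * 2) + 1)"`, while a lambda calling a method (e.g. `x.upper()`) returns `None` and stays a regular UDF.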

That said, when we say "simple" UDF, what exactly is the scope of "simple" 
here? It sounds to me like, if a UDF can be translated to a Catalyst plan 
(without the UDF), that UDF was actually something the user could have written 
via the DataFrame API without a UDF, unless non-user-facing expressions are 
needed. The same applies to Pandas on Spark for covering pandas UDFs. Do we 
see such cases, e.g. users who could have written their logic with built-in 
SQL expressions but ended up choosing a UDF? I think this needs more 
clarification, since it is really a user-facing contract and a key factor in 
evaluating whether this project succeeds.
Given that the transpilation target is Catalyst: yes, these would mostly be 
things someone could express in SQL, just written another way.

We do have some Catalyst expressions which aren’t directly SQL expressions, so 
not always, but generally yes.

To be clear: I don’t think we should expect users, especially Pandas on Spark 
users, to rewrite their DataFrame UDFs in SQL, and that’s why this project 
makes sense.

Once that is clarified, we may have follow-up questions depending on the 
answer, something along these lines:

1. It might be the case that we want this proposal to aim directly at the 
"future success": translating Python UDFs to Java code (codegen) to cover 
arbitrary logic (unless it involves a Python library, for which we'd have to 
find alternatives).
I think this could be a reasonable follow-on if this project is successful.

2. We might want to make sure this proposal addresses major use cases and not 
just niche ones; e.g., if the majority of Python UDF usage is to pull in other 
Python dependencies, then we lose the majority of cases.
I don’t think we expect to cover the majority of UDFs. Even covering only the 
simple cases initially would be a real performance improvement, especially for 
Pandas on Spark, where people can’t express many of these things easily.

Hope I've understood the proposal well and am asking valid questions.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]> wrote:
Hi Folks,

It's been a few years since we last looked at transpilation, and with the 
growth of Pandas on Spark I think it's time we revisit it. I've got a JIRA 
filed (https://issues.apache.org/jira/browse/SPARK-54783), some rough 
proof-of-concept code (https://github.com/apache/spark/pull/53547; I think 
doing the transpilation on the Python side instead of the Scala side makes 
more sense, but it was interesting to play with), and of course everyone's 
favourite, a design doc 
(https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing).
(I also have a collection of YouTube streams playing with the idea, 
https://www.youtube.com/@HoldenKarau/streams, if anyone wants to follow along 
on that journey.)

Wishing everyone happy holidays :)

Cheers,

Holden :)



