Responses inline, thanks for the questions :)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:

> Thanks for the proposal. UDF has been known to be noticeably slow,
> especially for the language where we run the external process and do
> intercommunication, so this is an interesting topic.
>
> The starting question for this proposal would be the coverage. The
> proposal says we create an AST and try to convert it to a catalyst plan.
> Since this does not sound like we are generating Java/bytecode so I assume
> this only leverages built-in operators/expressions.
>

Initially yes. Longer term I think it’s possible we explore transpiling to other languages (especially accelerator languages as called out in the docs), but that’s fuzzy.

>
> That said, when we say "simple" UDF, what is exactly the scope of "simple"
> here? For me, it sounds to me like if the UDF can be translated to a
> catalyst plan (without UDF), the UDF has actually been something users
> could have written via the DataFrame API without UDF, unless we have
> non-user-facing expressions where users are needed. Same with Pandas on
> Spark for covering Pandas UDF. Do we see such a case e.g. users fail to
> write logic based on built-in SQL expressions while they can, and end up
> with choosing UDF? I think this needs more clarification given that's
> really a user facing contract and the factor of evaluating this project as
> a successful one.
>

Given the transpilation target is Catalyst, yes, these would mostly be things someone could express in SQL, just written another way. We do have some Catalyst expressions which aren’t directly SQL expressions, so not always, but generally. To be clear: I don’t think we should expect users, especially Pandas on Spark users, to rewrite their DataFrame UDFs in SQL, and that’s why this project makes sense.

>
> Once that is clarified, we may have follow-up questions/voices with the
> answer, something along the line:
>
> 1. It might be the case we may just want this proposal to be direct to the
> "future success", translating Python UDF to Java code (codegen) to cover
> arbitrary logic (unless it's not involving python library, which we had to
> find alternatives).
>

I think this can be a reasonable follow-on to this project if this project is successful.

>
> 2. We might want to make sure this proposal is addressing major use cases
> and not just niche cases. e.g. it might be the case the majority of Python
> UDF usage is to pull other Python dependencies, then we lose the majority
> of cases.
>

I don’t think we expect to cover the majority of UDFs. Even covering only the simple cases initially would deliver a real performance improvement, especially for Pandas on Spark, where people can’t express many of these things easily.

>
> Hope I understand the proposal well and ask valid questions.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
> wrote:
>
>> Hi Folks,
>>
>> It's been a few years since we last looked at transpilation, and with the
>> growth of Pandas on Spark I think it's time we revisit it. I've got a JIRA
>> filed <https://issues.apache.org/jira/browse/SPARK-54783> some rough
>> proof of concept code <https://github.com/apache/spark/pull/53547> (I
>> think doing the transpilation Python side instead of Scala side makes more
>> sense, but was interesting to play with), and of course everyones favourite
>> a design doc.
>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>
>> (I
>> also have a collection of YouTube streams playing with the idea
>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow
>> along on that journey).
>>
>> Wishing everyone a happy holidays :)
>>
>> Cheers,
>>
>> Holden :)
>>
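
To make the "simple UDF" scope discussed in this thread a bit more concrete, here is a minimal sketch of the general idea: parse a Python function into an AST and translate a small subset of it into an expression built only from built-in operators. This is not the proof-of-concept code from the PR. The names `transpile_udf` and `NotTranspilable`, the supported subset, and the fall-back-to-`None` behaviour are all assumptions made for illustration, and the sketch emits a SQL expression string instead of a real Catalyst plan. It also ignores the null-handling and type-semantics differences between Python and Spark SQL, which a real implementation would have to address.

```python
import ast

# Hypothetical operator tables: the subset this sketch treats as "simple".
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}
_CMP_OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
            ast.Eq: "=", ast.NotEq: "<>"}


class NotTranspilable(Exception):
    """Raised when a UDF uses anything outside the supported subset."""


def _to_sql(node, columns):
    """Recursively turn one Python expression node into SQL text."""
    if isinstance(node, ast.Name) and node.id in columns:
        return columns[node.id]
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _to_sql(node.left, columns)
        right = _to_sql(node.right, columns)
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP_OPS):
        left = _to_sql(node.left, columns)
        right = _to_sql(node.comparators[0], columns)
        return f"({left} {_CMP_OPS[type(node.ops[0])]} {right})"
    if isinstance(node, ast.IfExp):
        test = _to_sql(node.test, columns)
        body = _to_sql(node.body, columns)
        orelse = _to_sql(node.orelse, columns)
        return f"CASE WHEN {test} THEN {body} ELSE {orelse} END"
    # Calls, attribute access, imports, etc. are out of scope here.
    raise NotTranspilable(type(node).__name__)


def transpile_udf(source, column_names):
    """Return a SQL expression for a single-return UDF, or None to fall back."""
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    func = tree.body[0]
    # Only functions whose whole body is `return <expression>` are handled.
    if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
        return None
    if func.body[0].value is None:
        return None
    arg_names = [a.arg for a in func.args.args]
    if len(arg_names) != len(column_names):
        return None
    # Bind each positional argument to the column it would be applied to.
    columns = dict(zip(arg_names, column_names))
    try:
        return _to_sql(func.body[0].value, columns)
    except NotTranspilable:
        return None


src = '''
def discount(price, qty):
    return price * qty * 0.9 if qty > 10 else price * qty
'''
print(transpile_udf(src, ["price", "qty"]))
# prints: CASE WHEN (qty > 10) THEN ((price * qty) * 0.9) ELSE (price * qty) END
```

In this sketch, anything outside the subset (a function call, an import, multiple statements) returns `None`, which stands in for "keep running it as a regular Python UDF". That mirrors the point above that only the simple cases would be covered initially.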
