I think deprecating Py4J is what we need to do eventually. A question: does that mean that in the future, all Spark Connect dependencies will be required for PySpark Classic?

Tian

On Thu, Jun 4, 2026 at 9:59 AM Holden Karau <[email protected]> wrote:

> Spark 5 sounds like a good time to target these changes, I think we're
> going to also want to audit and look for OSS code which uses Py4J to access
> direct JVM objects and see which additional APIs we might want to / need to
> expose for power users (the answer might just be write some scala code and
> add it as a plugin).
>
> On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]> wrote:
>
>> Thank you so much for sharing the progress, Hyukjin.
>>
>> This sounds reasonable to me. This will reduce the gap between classic
>> and connect mode greatly.
>>
>> Given that Apache Spark Release Cadence already defines the path to
>> introduce a new feature with its feature flag, I hope we are able to start
>> to test this in Apache Spark 4.3.0.
>>
>> After stabilizing this new architecture via Spark 4.3.0, we may choose
>> more migration steps in Spark 5 in 2027 or Spark 6 in 2028.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
>> > Hi all,
>> >
>> > Firstly, I wanted to check other opinions (rather saying this to push my
>> > opinion hard in this way).
>> >
>> > Lately, I have been looking through the feasibility of replacing Py4J to
>> > Spark Connect in PySpark.
>> > More specifically I mean that use Spark Connect server, and leave Py4J
>> > server as a fallback option to avoid breakage.
>> >
>> > As we know, Py4J gateway server itself already works similarly with Spark
>> > Connect server.
>> > It is a bit of a difficult story in Scala because there wasn't an
>> > intermediate server in Classic but in PySpark there is already a Py4J
>> > server running.
>> >
>> > There are few downsides of using Py4J:
>> > - Py4J server itself exposes arbitrary access to the JVM machine which
>> > actually is risky.
>> > - Performance is quite slow when sending large data. Note that we are
>> > switching to the raw sockets when we send large binaries
>> > - Difficult to debug the errors
>> >
>> > With Spark Connect
>> > - We could limit those accesses.
>> > - We use Arrow batches to send the data should be more efficient
>> > - Error handling are quite implemented well in Python Spark Connect
>> >
>> > Lastly, structurally they are quite the same. It is not like introducing a
>> > new layer.
>> >
>> > Last time when we chatted about enabling Spark Connect on (for both Scala,
>> > Python etc.), the biggest pushback was RDD API missing. I recently
>> > prototyped, and concluded that, for Python specifically, we can add the
>> > full support of RDD API, and most of SparkContext API (
>> > https://github.com/apache/spark/pull/55888).
>> >
>> > I also prototyped Python client side UI (
>> > https://github.com/apache/spark/pull/56053) here too, so for Python
>> > specifically, I believe there is not quite much gap between classic and
>> > connect. API-wise, structure-wise, etc.
>> >
>> > I have been thinking about this idea for quite a long time, and I concluded
>> > that it is feasible. So I would like to know what others think about this.
>> >
>> > Thanks!
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: [email protected]
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
