Spark 5 sounds like a good time to target these changes, I think we're going to also want to audit and look for OSS code which uses Py4J to access direct JVM objects and see which additional APIs we might want to / need to expose for power users (the answer might just be write some scala code and add it as a plugin).
On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]> wrote: > Thank you so much for sharing the progress, Hyukjin. > > This sounds reasonable to me. This will reduce the gap between classic and > connect mode greatly. > > Given that Apache Spark Release Cadence already defines the path to > introduce a new feature with its feature flag, I hope we are able to start > to test this in Apache Spark 4.3.0. > > After stabilizing this new architecture via Spark 4.3.0, we may choose > more migration steps in Spark 5 in 2027 or Spark 6 in 2028. > > Thanks, > Dongjoon. > > On 2026/06/04 06:12:16 Hyukjin Kwon wrote: > > Hi all, > > > > Firstly, I wanted to check other opinions (rather saying this to push my > > opinion hard in this way). > > > > Lately, I have been looking through the feasibility of replacing Py4J to > > Spark Connect in PySpark. > > More specifically I mean that use Spark Connect server, and leave Py4J > > server as a fallback option to avoid breakage. > > > > As we know, Py4J gateway server itself already works similarly with Spark > > Connect server. > > It is a bit of a difficult story in Scala because there wasn't an > > intermediate server in Classic but in PySpark there is already a Py4J > > server running. > > > > There are few downsides of using Py4J: > > - Py4J server itself exposes arbitrary access to the JVM machine which > > actually is risky. > > - Performance is quite slow when sending large data. Note that we are > > switching to the raw sockets when we send large binaries > > - Difficult to debug the errors > > > > With Spark Connect > > - We could limit those accesses. > > - We use Arrow batches to send the data should be more efficient > > - Error handling are quite implemented well in Python Spark Connect > > > > Lastly, structurally they are quite the same. It is not like introducing > a > > new layer. > > > > Last time when we chatted about enabling Spark Connect on (for both > Scala, > > Python etc.), the biggest pushback was RDD API missing. I recently > > prototyped, and concluded that, for Python specifically, we can add the > > full support of RDD API, and most of SparkContext API ( > > https://github.com/apache/spark/pull/55888). > > > > I also prototyped Python client side UI ( > > https://github.com/apache/spark/pull/56053) here too, so for Python > > specifically, I believe there is not quite much gap between classic and > > connect. API-wise, structure-wise, etc. > > > > I have been thinking about this idea for quite a long time, and I > concluded > > that it is feasible. So I would like to know what others think about > this. > > > > Thanks! > > > > --------------------------------------------------------------------- > To unsubscribe e-mail: [email protected] > > -- Twitter: https://twitter.com/holdenkarau Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> YouTube Live Streams: https://www.youtube.com/user/holdenkarau Pronouns: she/her
