Thank you so much for sharing the progress, Hyukjin. This sounds reasonable to me. This will reduce the gap between classic and connect mode greatly.
Given that Apache Spark Release Cadence already defines the path to introduce a new feature with its feature flag, I hope we are able to start to test this in Apache Spark 4.3.0. After stabilizing this new architecture via Spark 4.3.0, we may choose more migration steps in Spark 5 in 2027 or Spark 6 in 2028.

Thanks,
Dongjoon.

On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
> Hi all,
>
> Firstly, I wanted to check other opinions (rather saying this to push my
> opinion hard in this way).
>
> Lately, I have been looking through the feasibility of replacing Py4J to
> Spark Connect in PySpark.
> More specifically I mean that use Spark Connect server, and leave Py4J
> server as a fallback option to avoid breakage.
>
> As we know, Py4J gateway server itself already works similarly with Spark
> Connect server.
> It is a bit of a difficult story in Scala because there wasn't an
> intermediate server in Classic but in PySpark there is already a Py4J
> server running.
>
> There are few downsides of using Py4J:
> - Py4J server itself exposes arbitrary access to the JVM machine which
> actually is risky.
> - Performance is quite slow when sending large data. Note that we are
> switching to the raw sockets when we send large binaries
> - Difficult to debug the errors
>
> With Spark Connect
> - We could limit those accesses.
> - We use Arrow batches to send the data should be more efficient
> - Error handling are quite implemented well in Python Spark Connect
>
> Lastly, structurally they are quite the same. It is not like introducing a
> new layer.
>
> Last time when we chatted about enabling Spark Connect on (for both Scala,
> Python etc.), the biggest pushback was RDD API missing. I recently
> prototyped, and concluded that, for Python specifically, we can add the
> full support of RDD API, and most of SparkContext API (
> https://github.com/apache/spark/pull/55888).
>
> I also prototyped Python client side UI (
> https://github.com/apache/spark/pull/56053) here too, so for Python
> specifically, I believe there is not quite much gap between classic and
> connect. API-wise, structure-wise, etc.
>
> I have been thinking about this idea for quite a long time, and I concluded
> that it is feasible. So I would like to know what others think about this.
>
> Thanks!
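
P.S. To make the "feature flag plus Py4J fallback" idea above concrete, here is a rough sketch in plain Python. It is only an illustration of the control flow, not how PySpark does or would implement it: the flag name PYSPARK_BACKEND, the exception class, and the start_* functions are all hypothetical stand-ins, and no real Spark API is used.

```python
from typing import Callable, Dict

# Hypothetical flag name, for illustration only; not an actual Spark configuration.
FLAG = "PYSPARK_BACKEND"


class BackendUnavailable(Exception):
    """Raised by a starter when its backend cannot be brought up."""


def start_connect() -> str:
    # Stand-in for launching a Spark Connect server.
    # It always fails here so that the fallback path is visible.
    raise BackendUnavailable("connect server not reachable")


def start_py4j() -> str:
    # Stand-in for launching the existing Py4J gateway server.
    return "py4j"


STARTERS: Dict[str, Callable[[], str]] = {
    "connect": start_connect,
    "py4j": start_py4j,
}


def choose_backend(env: Dict[str, str],
                   starters: Dict[str, Callable[[], str]]) -> str:
    """Start and return a backend name.

    The default stays on Py4J; Spark Connect is used only when the flag
    opts in, and Py4J remains the fallback if Connect cannot start.
    """
    requested = env.get(FLAG, "py4j")
    if requested == "connect":
        try:
            return starters["connect"]()
        except BackendUnavailable:
            # Fall back to Py4J to avoid breaking existing users.
            return starters["py4j"]()
    return starters["py4j"]()
```

With the flag unset nothing changes for existing users, which is the property I would like us to keep while testing this in 4.3.0; flipping the default would be a later migration step.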
