Hi all,

First, I want to hear other opinions (rather than push my own opinion hard).

Lately, I have been looking into the feasibility of replacing Py4J with Spark Connect in PySpark. More specifically, I mean using the Spark Connect server and leaving the Py4J server as a fallback option to avoid breakage.

As we know, the Py4J gateway server already works similarly to the Spark Connect server. It is a more difficult story in Scala because there is no intermediate server in Classic, but in PySpark there is already a Py4J server running.

There are a few downsides to using Py4J:
- The Py4J server exposes arbitrary access to the JVM, which is risky.
- Performance is quite slow when sending large data. Note that we already switch to raw sockets when we send large binaries.
- Errors are difficult to debug.

With Spark Connect:
- We could limit that access.
- We use Arrow batches to send the data, which should be more efficient.
- Error handling is already implemented quite well in Python Spark Connect.

Lastly, structurally they are quite the same. It is not like introducing a new layer.

Last time we chatted about turning Spark Connect on (for Scala, Python, etc.), the biggest pushback was the missing RDD API. I recently prototyped this and concluded that, for Python specifically, we can add full support for the RDD API and most of the SparkContext API (https://github.com/apache/spark/pull/55888). I also prototyped a Python client-side UI (https://github.com/apache/spark/pull/56053), so for Python specifically, I believe there is not much of a gap between Classic and Connect, API-wise, structure-wise, etc.

I have been thinking about this idea for quite a long time, and I concluded that it is feasible. So I would like to know what others think about this.

Thanks!
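
P.S. To make the access-control point a bit more concrete, here is a toy sketch in plain Python. It is not Py4J or Spark Connect code, and it does not reflect how either is implemented. FakeJvm, open_gateway, closed_protocol, and HANDLERS are all hypothetical names made up for illustration. The only thing it tries to show is the difference between a server that reflects on whatever method name the client sends and a server that only understands a fixed set of request types.

```python
# Toy model only: none of these names exist in Py4J, PySpark, or Spark Connect.


class FakeJvm:
    """Hypothetical stand-in for server-side state reachable from the client."""

    def __init__(self):
        self._secrets = {"db.password": "hunter2"}
        self._tables = {"events": 3}

    def count_rows(self, table):
        return self._tables.get(table, 0)

    def read_secret(self, key):
        return self._secrets[key]


def open_gateway(jvm, method_name, *args):
    """Gateway-style access: the client names any method and the server calls it."""
    return getattr(jvm, method_name)(*args)


# Protocol-style access: only request types listed here can be served.
HANDLERS = {
    "count_rows": lambda jvm, table: jvm.count_rows(table),
}


def closed_protocol(jvm, request):
    """Dispatch a request dict; anything outside HANDLERS is rejected."""
    kind = request["type"]
    handler = HANDLERS.get(kind)
    if handler is None:
        raise PermissionError(f"unsupported request type: {kind}")
    return handler(jvm, *request.get("args", ()))


if __name__ == "__main__":
    jvm = FakeJvm()
    # The open gateway happily reaches the sensitive method.
    print(open_gateway(jvm, "read_secret", "db.password"))
    # The closed protocol serves the allowed request...
    print(closed_protocol(jvm, {"type": "count_rows", "args": ["events"]}))
    # ...and refuses one that is not part of the protocol.
    try:
        closed_protocol(jvm, {"type": "read_secret", "args": ["db.password"]})
    except PermissionError as exc:
        print("rejected:", exc)
```

In the second style, the server owner decides what is reachable by choosing which request types exist, which is the kind of limit I have in mind when I say we could restrict access with Spark Connect.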
