Hi all,

Firstly, I wanted to check other opinions (rather saying this to push my
opinion hard in this way).

Lately, I have been looking through the feasibility of replacing Py4J to
Spark Connect in PySpark.
More specifically I mean that use Spark Connect server, and leave Py4J
server as a fallback option to avoid breakage.

As we know, Py4J gateway server itself already works similarly with Spark
Connect server.
It is a bit of a difficult story in Scala because there wasn't an
intermediate server in Classic but in PySpark there is already a Py4J
server running.

There are few downsides of using Py4J:
- Py4J server itself exposes arbitrary access to the JVM machine which
actually is risky.
- Performance is quite slow when sending large data. Note that we are
switching to the raw sockets when we send large binaries
- Difficult to debug the errors

With Spark Connect
- We could limit those accesses.
- We use Arrow batches to send the data should be more efficient
- Error handling are quite implemented well in Python Spark Connect

Lastly, structurally they are quite the same. It is not like introducing a
new layer.

Last time when we chatted about enabling Spark Connect on (for both Scala,
Python etc.), the biggest pushback was RDD API missing. I recently
prototyped, and concluded that, for Python specifically, we can add the
full support of RDD API, and most of SparkContext API (
https://github.com/apache/spark/pull/55888).

I also prototyped Python client side UI (
https://github.com/apache/spark/pull/56053) here too, so for Python
specifically, I believe there is not quite much gap between classic and
connect. API-wise, structure-wise, etc.

I have been thinking about this idea for quite a long time, and I concluded
that it is feasible. So I would like to know what others think about this.

Thanks!

Reply via email to