Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Ángel Álvarez Pascua Thu, 04 Jun 2026 01:27:41 -0700

I’m +0.5 / cautiously supportive.

For PySpark specifically, I agree this is feasible and probably
directionally good: Py4J is leaky, unsafe, hard to debug, and not a great
long-term protocol boundary. Spark Connect gives us a cleaner, constrained,
language-neutral boundary.


But I would be careful not to describe this as “structurally the same” or
just replacing one intermediate server with another. Spark Connect changes
the contract from “Python can reach into the Spark driver JVM” to “Python
describes operations through a protocol.” That is a major semantic change,
even if we can paper over much of the public API.
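
To make that contract change concrete, here is a toy sketch in plain Python. It is not PySpark or Py4J code: ToyDriver, GatewayStyleClient, and ProtocolStyleClient are hypothetical names invented for illustration, and the only point is the difference between "hold a handle and reach anything" and "send a named request the server may refuse".

```python
# Toy model only: none of these classes are PySpark or Py4J APIs.
from dataclasses import dataclass


class ToyDriver:
    """Stands in for driver-side state that a client might touch."""

    def __init__(self):
        self.conf = {"spark.app.name": "demo"}
        self.secret = "driver-internal"


class GatewayStyleClient:
    """Gateway-like contract: the client holds a handle to the driver
    object and can reach any attribute on it, intended or not."""

    def __init__(self, driver):
        self._driver = driver

    def __getattr__(self, name):
        # Called only for names not found on the client itself, so every
        # attribute the driver exposes is reachable through the client.
        return getattr(self._driver, name)


@dataclass
class ProtocolStyleClient:
    """Protocol-like contract: the client sends named requests, and only
    operations listed in `allowed` are answered."""

    driver: ToyDriver
    allowed: frozenset = frozenset({"get_conf"})

    def request(self, op, **args):
        if op not in self.allowed:
            raise PermissionError(f"operation not in protocol: {op}")
        if op == "get_conf":
            return self.driver.conf.get(args["key"])
        raise NotImplementedError(op)


driver = ToyDriver()
gateway = GatewayStyleClient(driver)
protocol = ProtocolStyleClient(driver)

print(gateway.secret)  # internal state leaks through the handle
print(protocol.request("get_conf", key="spark.app.name"))
```

In the second client, anything that used to work by "reaching in" has to become an explicit operation or be rejected, which is why I see this as a semantic change and not just a transport swap.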

The hard part is not only API parity. It is preserving, rejecting, or
replacing the ecosystem behaviors that grew around Py4J: _jvm, _jsc, _jdf,
custom Java integrations, RDD closure semantics, SparkContext-side effects,
debugging habits, and local-mode expectations.
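
As one example of what "preserving, rejecting, or replacing" could look like for third-party code that touches _jvm, here is a minimal sketch of a guard that fails with a clear message instead of an obscure error. It assumes that a session without JVM access either lacks the _jvm attribute or raises when it is read; I have not verified that across versions. jvm_handle_or_none, require_jvm, JvmUnavailableError, and the two Fake* classes are hypothetical names used only for this illustration.

```python
class JvmUnavailableError(RuntimeError):
    """Raised when a feature needs driver-JVM access the session lacks."""


def jvm_handle_or_none(session):
    """Return session._jvm if it can be read and is not None, else None.

    Assumption: sessions without JVM access either have no `_jvm`
    attribute or raise some exception when it is accessed.
    """
    try:
        return getattr(session, "_jvm")
    except Exception:
        return None


def require_jvm(session, feature):
    """Return the JVM handle, or fail early with an actionable message."""
    jvm = jvm_handle_or_none(session)
    if jvm is None:
        raise JvmUnavailableError(
            f"{feature} needs direct JVM access, which this session "
            "does not provide; see the migration guidance for alternatives."
        )
    return jvm


# Stand-ins for real sessions so the sketch runs without Spark installed.
class FakeClassicSession:
    _jvm = object()


class FakeConnectSession:
    @property
    def _jvm(self):
        raise AttributeError("JVM access is not available on this session")


print(jvm_handle_or_none(FakeClassicSession()) is not None)
print(jvm_handle_or_none(FakeConnectSession()) is None)
```

A library could call require_jvm(spark, "my custom Java integration") at its entry points, so users on a JVM-less session get a pointer to the migration story rather than a stack trace from deep inside the integration.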

I’d support an experimental/default-off path with Py4J fallback, a public
compatibility matrix, benchmarks, and explicit migration guidance. If we
can show that RDD/SparkContext support is real rather than a compatibility
façade, and that common JVM-dependent extension patterns have a story, then
this could be a very good direction.

On Thu, 4 Jun 2026 at 8:13, Hyukjin Kwon (<[email protected]>)
wrote:

> Hi all,
>
> First, I want to hear other opinions (rather than push my own opinion
> hard).
>
> Lately, I have been looking into the feasibility of replacing Py4J with
> Spark Connect in PySpark.
> More specifically, I mean using the Spark Connect server and leaving the
> Py4J server as a fallback option to avoid breakage.
>
> As we know, the Py4J gateway server already works similarly to the Spark
> Connect server.
> The story is more difficult in Scala because there was no intermediate
> server in Classic, but in PySpark there is already a Py4J server running.
>
> There are a few downsides to using Py4J:
> - The Py4J server exposes arbitrary access to the JVM, which is risky.
> - Performance is quite slow when sending large data. Note that we switch
> to raw sockets when we send large binaries.
> - Errors are difficult to debug.
>
> With Spark Connect:
> - We could limit that access.
> - We use Arrow batches to send the data, which should be more efficient.
> - Error handling is implemented quite well in Python Spark Connect.
>
> Lastly, structurally they are quite the same. It is not like introducing a
> new layer.
>
> The last time we discussed turning Spark Connect on (for Scala, Python,
> etc.), the biggest pushback was the missing RDD API. I recently prototyped
> this and concluded that, for Python specifically, we can add full support
> for the RDD API and most of the SparkContext API (
> https://github.com/apache/spark/pull/55888).
>
> I also prototyped a Python client-side UI (
> https://github.com/apache/spark/pull/56053), so for Python
> specifically, I believe there is not much of a gap between Classic and
> Connect, API-wise or structure-wise.
>
> I have been thinking about this idea for quite a long time, and I
> concluded that it is feasible. So I would like to know what others think
> about this.
>
> Thanks!
>
