Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Sem Thu, 04 Jun 2026 12:10:50 -0700

Hello!

Correct me if I'm wrong, but it looks to me like py4j is currently the
main escape hatch for Python users to access JVM-side APIs from an
already running PySpark Classic session.


Let me explain using the GraphFrames project, where I'm a maintainer,
as an example. At the moment the project provides ~99% API parity
between classic (via py4j) and connect (via
"o.a.s.sql.connect.plugin.RelationPlugin").
But for users the experience is very different. If we imagine a running
cluster, for example on YARN, users can just drop a JAR onto the
classpath and access it via the GF py4j API from their session. With
connect it is different, because the conf
"spark.connect.extensions.relation.classes" cannot be changed on a
running cluster: it is a "static conf", isn't it? So all the
implementations of "RelationPlugin" have to be known before the
cluster is started. From the user's perspective, this means it is not
enough to add a dependency via
"SparkSession.builder.config("spark.jars", ...)"; they need to start a
dedicated Spark cluster with the correct list of all the Connect
plugins. At least that is what I know based on my experiments with
Spark Connect at version ~3.5.x and early 4.x RC builds.
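
To make the difference concrete, here is a minimal sketch in Python of
what the operator of a Connect cluster has to do up front. It only
builds the command line and does not start anything. I assume the stock
"sbin/start-connect-server.sh" launcher with the usual "--jars" and
"--conf" options, and that the conf takes a comma-separated list of
class names; the plugin class and the JAR path are hypothetical
placeholders, not real GraphFrames names.

```python
# Sketch only: builds the argv for launching a Spark Connect server with
# relation plugins pre-registered. Nothing is executed here.
from typing import List

# Conf key mentioned above; assumed to accept a comma-separated class list.
RELATION_PLUGINS_CONF = "spark.connect.extensions.relation.classes"


def connect_server_command(plugin_classes: List[str], jars: List[str]) -> List[str]:
    """Return argv for starting a Connect server that knows the given plugins."""
    if not plugin_classes:
        raise ValueError("at least one RelationPlugin class is required")
    return [
        "sbin/start-connect-server.sh",  # launcher script from the Spark distribution
        "--jars", ",".join(jars),
        "--conf", f"{RELATION_PLUGINS_CONF}={','.join(plugin_classes)}",
    ]


cmd = connect_server_command(
    plugin_classes=["org.example.MyRelationPlugin"],  # hypothetical class name
    jars=["/path/to/my-extension.jar"],               # hypothetical path
)
print(" ".join(cmd))
```

Adding one more plugin later means changing this list and restarting
the server, which is what a user of a shared cluster usually cannot do.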

On top of this, I would like to highlight that at the moment none of
the major Spark vendors supports changing
"spark.connect.extensions.relation.classes"; at least, I did not find
anything about it in their docs. And the py4j API is the only way for
users to access third-party packages.

I'm not a Spark contributor, so feel free to ignore this. I just wanted
to highlight it, because for me Apache Spark has always been not only a
framework but also a broad ecosystem of third-party packages and
extensions.

Best regards,
Sem

On Thu, 2026-06-04 at 11:04 -0700, Tian Gao via dev wrote:
> I think deprecating py4j is what we need to do eventually. A
> question: does that mean that in the future, all Connect dependencies
> will be required for PySpark Classic?
> 
> Tian
> 
> On Thu, Jun 4, 2026 at 9:59 AM Holden Karau <[email protected]>
> wrote:
> > Spark 5 sounds like a good time to target these changes. I think
> > we're also going to want to audit OSS code which uses Py4J to
> > access JVM objects directly and see which additional APIs we might
> > want to / need to expose for power users (the answer might just be
> > "write some Scala code and add it as a plugin").
> > 
> > On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]>
> > wrote:
> > > Thank you so much for sharing the progress, Hyukjin.
> > > 
> > > This sounds reasonable to me. This will reduce the gap between
> > > classic and connect mode greatly.
> > > 
> > > Given that Apache Spark Release Cadence already defines the path
> > > to introduce a new feature with its feature flag, I hope we are
> > > able to start to test this in Apache Spark 4.3.0.
> > > 
> > > After stabilizing this new architecture via Spark 4.3.0, we may
> > > choose more migration steps in Spark 5 in 2027 or Spark 6 in
> > > 2028.
> > > 
> > > Thanks,
> > > Dongjoon.
> > > 
> > > On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
> > > > Hi all,
> > > > 
> > > > Firstly, I wanted to check other opinions (rather than saying
> > > > this to push my opinion hard in this way).
> > > > 
> > > > Lately, I have been looking into the feasibility of replacing
> > > > Py4J with Spark Connect in PySpark.
> > > > More specifically, I mean using the Spark Connect server and
> > > > leaving the Py4J server as a fallback option to avoid breakage.
> > > > 
> > > > As we know, the Py4J gateway server itself already works
> > > > similarly to the Spark Connect server.
> > > > It is a bit of a difficult story in Scala because there wasn't
> > > > an intermediate server in Classic, but in PySpark there is
> > > > already a Py4J server running.
> > > > 
> > > > There are a few downsides of using Py4J:
> > > > - The Py4J server itself exposes arbitrary access to the JVM,
> > > >   which is actually risky.
> > > > - Performance is quite slow when sending large data. Note that
> > > >   we switch to raw sockets when we send large binaries.
> > > > - Errors are difficult to debug.
> > > > 
> > > > With Spark Connect:
> > > > - We could limit that access.
> > > > - We use Arrow batches to send the data, which should be more
> > > >   efficient.
> > > > - Error handling is implemented quite well in Python Spark
> > > >   Connect.
> > > > 
> > > > Lastly, structurally they are quite the same. It is not like
> > > > introducing a new layer.
> > > > 
> > > > Last time we chatted about turning Spark Connect on (for both
> > > > Scala, Python, etc.), the biggest pushback was the missing RDD
> > > > API. I recently prototyped this and concluded that, for Python
> > > > specifically, we can add full support for the RDD API and most
> > > > of the SparkContext API (
> > > > https://github.com/apache/spark/pull/55888).
> > > > 
> > > > I also prototyped a Python client-side UI (
> > > > https://github.com/apache/spark/pull/56053), so for Python
> > > > specifically, I believe there is not much of a gap between
> > > > classic and connect, API-wise, structure-wise, etc.
> > > > 
> > > > I have been thinking about this idea for quite a long time, and
> > > > I concluded that it is feasible. So I would like to know what
> > > > others think about this.
> > > > 
> > > > Thanks!
> > > > 
> > > 
> > > 
> > 
> > 
> > -- 
> > Twitter: https://twitter.com/holdenkarau
> > Fight Health Insurance: https://www.fighthealthinsurance.com/ [1]
> > Books (Learning Spark, High Performance Spark,
> > etc.): https://amzn.to/2MaRAG9  [2]
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> > Pronouns: she/her


[1] https://www.fighthealthinsurance.com/
    https://www.fighthealthinsurance.com/?q=hk_email
[2] https://amzn.to/2MaRAG9
