Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Sem Thu, 04 Jun 2026 14:45:24 -0700

> To Sem, I fully agree that there are existing use cases of Py4J
> through PySpark. Note that the original intention was not to support
> JVM access through Py4J, and it is not an official API in PySpark.
> Nevertheless, I do understand those existing use cases, and I am
> thinking about reducing the breakage by keeping Py4J as an option for
> the time being.


Thanks for the explanation! Is there any chance that Spark Connect will
support dynamic resolution of plugin classes
("o.a.s.sql.connect.plugin.RelationPlugin" and friends) on running
clusters? From my point of view, that would give parity with plugging in
via py4j: users could specify `--packages` or `spark.jars` and get the
same experience as with classic/py4j -- the plugin's JAR is attached to
the running Spark cluster and resolved at runtime, with no code change
required from the user.
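
To make the difference concrete, here is a rough sketch in plain Python
(no Spark needed) of who has to supply which setting today, as far as I
understand it. The Maven coordinate and the plugin class name are
hypothetical placeholders, and the two helper functions exist only for
illustration; "spark.jars.packages", `--packages` and
`sbin/start-connect-server.sh` are the existing Spark names.

```python
# Illustration only: the coordinate and class name below are hypothetical.
PLUGIN_CONF = "spark.connect.extensions.relation.classes"  # conf discussed in this thread


def classic_session_conf(package: str) -> dict:
    """Settings a user can supply from their own Classic (py4j) session."""
    return {"spark.jars.packages": package}


def connect_server_args(package: str, plugin_classes: list) -> list:
    """Arguments that must be passed when the Connect server is started."""
    return [
        "--packages", package,
        # The plugin list is a comma-separated value of a server-side conf.
        "--conf", f"{PLUGIN_CONF}={','.join(plugin_classes)}",
    ]


package = "org.example:my-plugin_2.13:0.1.0"        # hypothetical coordinate
plugins = ["org.example.connect.MyRelationPlugin"]  # hypothetical class

# Classic: the user adds the package when building the session.
print(classic_session_conf(package))

# Connect: whoever starts the server must already know the plugin classes.
print(" ".join(["./sbin/start-connect-server.sh",
                *connect_server_args(package, plugins)]))
```

With dynamic resolution, the second command would not need the `--conf`
part, and the user-side setting alone would be enough.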

Best regards,
Sem

On Fri, 2026-06-05 at 06:31 +0900, Hyukjin Kwon wrote:
> > A question: does that mean that in the future, all Connect
> > dependencies will be required for PySpark Classic?
> 
> For now, I am still thinking about the Connect dependencies as an
> option to reduce the breakage. Maybe yes in Spark 5, but nothing is
> decided. I mostly wanted to hear what you all think. I don't intend to
> open the vote for this very soon.
> 
> To Sem, I fully agree that there are existing use cases of Py4J
> through PySpark. Note that the original intention was not to support
> JVM access through Py4J, and it is not an official API in
> PySpark. Nevertheless, I do understand those existing use cases, and
> I am thinking about reducing the breakage by keeping Py4J as an
> option for the time being.
> 
> 
> On Fri, 5 Jun 2026 at 04:11, Sem <[email protected]> wrote:
> > Hello!
> > 
> > Correct me if I'm wrong, but it looks to me like py4j is currently
> > the main escape hatch for Python users to access JVM-side APIs from
> > an already running PySpark Classic session.
> > 
> > Let me explain using the GraphFrames project, where I'm a
> > maintainer, as an example. At the moment the project provides ~99%
> > parity between classic (via py4j) and connect (via
> > "o.a.s.sql.connect.plugin.RelationPlugin") from the API point of
> > view.
> > But for users it is very different. If we imagine a running cluster
> > such as YARN, users can just drop the JAR onto the classpath and
> > access it via the GF py4j API from their session. With connect it is
> > different because the conf
> > "spark.connect.extensions.relation.classes" cannot be changed for a
> > running cluster: it is a "static conf", isn't it? So all
> > implementations of "RelationPlugin" must be known before the
> > cluster starts. From the user's perspective, this means they need
> > not just to add a dependency via
> > "SparkSession.builder.config("spark.jars", ...)" but to start a
> > dedicated Spark cluster with the correct list of all the Connect
> > plugins. At least that is what I know based on my experiments with
> > Spark Connect at version ~3.5.x and early 4.x RC builds.
> > 
> > On top of this, I would like to highlight that at the moment none of
> > the major Spark vendors supports changing
> > "spark.connect.extensions.relation.classes"; at least I did not find
> > anything about it in their docs. And the py4j API is the only way
> > for users to access third-party packages.
> > 
> > I'm not a contributor to Spark, so feel free to ignore this. I just
> > wanted to highlight it, because for me Apache Spark has always been
> > not only a framework but a broad ecosystem of third-party packages
> > and extensions.
> > 
> > Best regards,
> > Sem
> > 
> > On Thu, 2026-06-04 at 11:04 -0700, Tian Gao via dev wrote:
> > > I think deprecating py4j is what we need to do eventually. A
> > > question: does that mean that in the future, all Connect
> > > dependencies will be required for PySpark Classic?
> > > 
> > > Tian
> > > 
> > > On Thu, Jun 4, 2026 at 9:59 AM Holden Karau
> > > <[email protected]> wrote:
> > > > Spark 5 sounds like a good time to target these changes. I
> > > > think we're also going to want to audit and look for OSS code
> > > > which uses Py4J to access JVM objects directly and see which
> > > > additional APIs we might want to / need to expose for power
> > > > users (the answer might just be to write some Scala code and
> > > > add it as a plugin).
> > > > 
> > > > On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun
> > > > <[email protected]> wrote:
> > > > > Thank you so much for sharing the progress, Hyukjin.
> > > > > 
> > > > > This sounds reasonable to me. This will reduce the gap
> > > > > between classic and connect mode greatly.
> > > > > 
> > > > > Given that Apache Spark Release Cadence already defines the
> > > > > path to introduce a new feature with its feature flag, I hope
> > > > > we are able to start to test this in Apache Spark 4.3.0.
> > > > > 
> > > > > After stabilizing this new architecture via Spark 4.3.0, we
> > > > > may choose more migration steps in Spark 5 in 2027 or Spark 6
> > > > > in 2028.
> > > > > 
> > > > > Thanks,
> > > > > Dongjoon.
> > > > > 
> > > > > On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
> > > > > > Hi all,
> > > > > > 
> > > > > > > Firstly, I wanted to check other opinions (rather than
> > > > > > > saying this to push my opinion hard).
> > > > > > 
> > > > > > > Lately, I have been looking into the feasibility of
> > > > > > > replacing Py4J with Spark Connect in PySpark.
> > > > > > > More specifically, I mean using the Spark Connect server
> > > > > > > and leaving the Py4J server as a fallback option to avoid
> > > > > > > breakage.
> > > > > > 
> > > > > > > As we know, the Py4J gateway server itself already works
> > > > > > > similarly to the Spark Connect server.
> > > > > > > It is a bit of a difficult story in Scala because there
> > > > > > > wasn't an intermediate server in Classic, but in PySpark
> > > > > > > there is already a Py4J server running.
> > > > > > 
> > > > > > > There are a few downsides of using Py4J:
> > > > > > > - The Py4J server itself exposes arbitrary access to the
> > > > > > >   JVM, which is actually risky.
> > > > > > > - Performance is quite slow when sending large data. Note
> > > > > > >   that we switch to raw sockets when we send large
> > > > > > >   binaries.
> > > > > > > - Errors are difficult to debug.
> > > > > > 
> > > > > > > With Spark Connect:
> > > > > > > - We could limit those accesses.
> > > > > > > - We use Arrow batches to send the data, which should be
> > > > > > >   more efficient.
> > > > > > > - Error handling is implemented quite well in Python Spark
> > > > > > >   Connect.
> > > > > > 
> > > > > > > Lastly, structurally they are quite the same. It is not
> > > > > > > like introducing a new layer.
> > > > > > 
> > > > > > > The last time we chatted about enabling Spark Connect
> > > > > > > (for Scala, Python, etc.), the biggest pushback was the
> > > > > > > missing RDD API. I recently prototyped this and concluded
> > > > > > > that, for Python specifically, we can add full support
> > > > > > > for the RDD API and most of the SparkContext API
> > > > > > > (https://github.com/apache/spark/pull/55888).
> > > > > > 
> > > > > > > I also prototyped a Python client-side UI
> > > > > > > (https://github.com/apache/spark/pull/56053), so for
> > > > > > > Python specifically, I believe there is not much of a gap
> > > > > > > between classic and connect, API-wise, structure-wise,
> > > > > > > etc.
> > > > > > 
> > > > > > > I have been thinking about this idea for quite a long
> > > > > > > time, and I concluded that it is feasible. So I would like
> > > > > > > to know what others think about this.
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > 
> > > > > -------------------------------------------------------------
> > > > > --------
> > > > > To unsubscribe e-mail: [email protected]
> > > > > 
> > > > 
> > > > 
> > > > -- 
> > > > Twitter: https://twitter.com/holdenkarau
> > > > Fight Health Insurance: https://www.fighthealthinsurance.com/
> > > > [1]
> > > > Books (Learning Spark, High Performance Spark,
> > > > etc.): https://amzn.to/2MaRAG9  [2]
> > > > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> > > > Pronouns: she/her
> > 
> > 


[1] https://www.fighthealthinsurance.com/?q=hk_email
[2] https://amzn.to/2MaRAG9
