Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Hyukjin Kwon Thu, 04 Jun 2026 14:32:42 -0700

> A question: does that mean that in the future, all Connect dependencies
> will be required for PySpark Classic?


For now, I am still thinking of the Connect dependencies as optional, to
reduce breakage. Maybe they become required in Spark 5, but nothing is
decided. I mostly wanted to hear what you all think. I don't intend to open
a vote on this very soon.
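
As a rough illustration of treating the Connect dependencies as optional, a
launcher could probe for them and fall back to Py4J when any are missing.
This is only a sketch, not how PySpark decides anything today: the module
list and the names `ASSUMED_CONNECT_MODULES`, `missing_modules`, and
`choose_backend` are hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Assumption: import names of the optional Connect extras. The thread does
# not list them, so treat this tuple as a placeholder.
ASSUMED_CONNECT_MODULES = ("grpc", "grpc_status", "pandas", "pyarrow")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


def choose_backend(names: Iterable[str] = ASSUMED_CONNECT_MODULES) -> str:
    """Pick "connect" only when every optional dependency is available."""
    return "connect" if not missing_modules(names) else "py4j"
```

Calling `choose_backend()` in an environment without the extras would return
"py4j", which matches the idea of keeping Py4J as the fallback.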

To Sem, I fully agree that there are existing use cases of Py4J through
PySpark. Note that the original intention was not to support JVM access
through Py4J, and it is not an official API in PySpark. Nevertheless, I
do understand those existing use cases, and I am thinking about reducing
the breakage by keeping Py4J as an option for the time being.
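
For context, the unofficial JVM access discussed here usually goes through
the private `_jvm` attribute of a Classic session. Below is a minimal sketch
of guarding such code so it degrades cleanly when Py4J is not available. It
assumes that a Connect session either lacks `_jvm` or raises an
AttributeError subclass on access; `get_jvm_view` and `backend_kind` are
hypothetical helper names, not PySpark APIs.

```python
from typing import Any, Optional


def get_jvm_view(session: Any) -> Optional[Any]:
    """Return the Py4J JVM view of a session, or None if there is none.

    `_jvm` is a private attribute, not an official PySpark API, so it may be
    missing or change between releases.
    """
    # getattr with a default also covers lookups that raise AttributeError.
    return getattr(session, "_jvm", None)


def backend_kind(session: Any) -> str:
    """Classify a session by whether the Py4J escape hatch is reachable."""
    return "classic-py4j" if get_jvm_view(session) is not None else "no-py4j"
```

Third-party code could branch on `backend_kind(spark)` and use a Connect
plugin path when it returns "no-py4j".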


On Fri, 5 Jun 2026 at 04:11, Sem <[email protected]> wrote:

> Hello!
>
> Correct me if I'm wrong, but to me it looks like py4j is currently the
> main escape hatch for Python users to access JVM-side APIs from an already
> running PySpark Classic session.
>
> Let me explain using the example of the GraphFrames project, where I'm a
> maintainer. At the moment the project provides ~99% API parity between
> classic (via py4j) and connect (via
> "o.a.s.sql.connect.plugin.RelationPlugin").
> But for users it is very different. If we imagine a running cluster, e.g.
> on YARN, users can just drop a JAR onto the classpath and access it via the
> GF py4j API from their session. With connect it is different because the
> conf "spark.connect.extensions.relation.classes" cannot be changed for a
> running cluster: it is a "static conf", isn't it? So all the
> implementations of "RelationPlugin" must be known before the cluster is
> started. From the user's perspective, this means they cannot just add a
> dependency via "SparkSession.builder.config("spark.jars", ...)"; they need
> to start a dedicated Spark cluster with the correct list of all the Connect
> plugins. At least that is what I know based on my experiments with Spark
> Connect at version ~3.5.x and early 4.x RC builds.
>
> On top of this, I would like to highlight that at the moment none of the
> major Spark vendors supports changing
> "spark.connect.extensions.relation.classes"; at least I did not find
> anything about it in their docs. So the py4j API is the only way for users
> to access 3rd-party packages.
>
> I'm not a Spark contributor, so feel free to ignore this. I just wanted to
> highlight it, because for me Apache Spark has always been not only a
> framework but a broad ecosystem of 3rd-party packages and extensions.
>
> Best regards,
> Sem
>
> On Thu, 2026-06-04 at 11:04 -0700, Tian Gao via dev wrote:
>
> I think deprecating py4j is what we need to do eventually. A question:
> does that mean that in the future, all Connect dependencies will be
> required for PySpark Classic?
>
> Tian
>
> On Thu, Jun 4, 2026 at 9:59 AM Holden Karau <[email protected]>
> wrote:
>
> Spark 5 sounds like a good time to target these changes. I think we're
> also going to want to audit OSS code that uses Py4J to access JVM objects
> directly and see which additional APIs we might want or need to expose for
> power users (the answer might just be to write some Scala code and add it
> as a plugin).
>
> On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]> wrote:
>
> Thank you so much for sharing the progress, Hyukjin.
>
> This sounds reasonable to me. This will reduce the gap between classic and
> connect mode greatly.
>
> Given that Apache Spark Release Cadence already defines the path to
> introduce a new feature with its feature flag, I hope we are able to start
> to test this in Apache Spark 4.3.0.
>
> After stabilizing this new architecture via Spark 4.3.0, we may choose
> more migration steps in Spark 5 in 2027 or Spark 6 in 2028.
>
> Thanks,
> Dongjoon.
>
> On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
> > Hi all,
> >
> > First, I wanted to hear other opinions (rather than say this to push my
> > own opinion hard).
> >
> > Lately, I have been looking into the feasibility of replacing Py4J with
> > Spark Connect in PySpark.
> > More specifically, I mean using the Spark Connect server and leaving the
> > Py4J server as a fallback option to avoid breakage.
> >
> > As we know, the Py4J gateway server already works similarly to the Spark
> > Connect server.
> > It is a more difficult story in Scala because there is no intermediate
> > server in Classic, but in PySpark there is already a Py4J server running.
> >
> > There are a few downsides of using Py4J:
> > - The Py4J server exposes arbitrary access to the JVM, which is risky.
> > - Performance is quite slow when sending large data. Note that we already
> > switch to raw sockets when we send large binaries.
> > - Errors are difficult to debug.
> >
> > With Spark Connect:
> > - We could limit that access.
> > - We use Arrow batches to send the data, which should be more efficient.
> > - Error handling is implemented quite well in Python Spark Connect.
> >
> > Lastly, structurally they are quite the same. It is not like introducing
> > a new layer.
> >
> > Last time we chatted about enabling Spark Connect (for both Scala and
> > Python, etc.), the biggest pushback was the missing RDD API. I recently
> > prototyped this and concluded that, for Python specifically, we can add
> > full support for the RDD API and most of the SparkContext API (
> > https://github.com/apache/spark/pull/55888).
> >
> > I also prototyped a Python client-side UI (
> > https://github.com/apache/spark/pull/56053), so for Python
> > specifically, I believe there is not much of a gap between classic and
> > connect, API-wise, structure-wise, etc.
> >
> > I have been thinking about this idea for quite a long time, and I
> > concluded that it is feasible. So I would like to know what others think
> > about this.
> >
> > Thanks!
> >
>
