Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Hyukjin Kwon Thu, 04 Jun 2026 15:57:13 -0700

Yes, from my understanding they will be supported all the same.

On Fri, 5 Jun 2026 at 06:44, Sem <[email protected]> wrote:


> > To Sem, I fully agree that there are existing use cases of Py4J through
> > PySpark. Note that the original intention was not to support JVM access
> > through Py4J, and it is not an official API in PySpark. Nevertheless, I
> > do understand those existing use cases, and I am thinking about reducing
> > the breakage by keeping Py4J as an option for the time being.
>
> Thanks for the explanation! Is there any chance that Spark Connect will
> support dynamic resolution of the plugin classes
> ("o.a.s.sql.connect.plugin.RelationPlugin" and friends) for running
> clusters? From my point of view, that would give parity with plugging in via
> py4j: users could specify `--packages` or `spark.jars` and get the same
> experience as with classic/py4j. The plugin's JAR is attached to the running
> Spark cluster and resolved at runtime, with no code change required from the
> user.
>
> Best regards,
> Sem
>
> On Fri, 2026-06-05 at 06:31 +0900, Hyukjin Kwon wrote:
>
> > A question: does that mean that in the future, all Connect dependencies
> > will be required for PySpark Classic?
>
> For now, I am still thinking about the Connect dependencies as an option
> to reduce the breakage. Maybe yes in Spark 5, but nothing is decided. This was
> more to hear what you guys think. I don't intend to open the vote for this
> very soon.
>
> To Sem, I fully agree that there are existing use cases of Py4J through
> PySpark. Note that the original intention was not to support JVM access
> through Py4J, and it is not an official API in PySpark. Nevertheless, I
> do understand those existing use cases, and I am thinking about reducing
> the breakage by keeping Py4J as an option for the time being.
>
>
> On Fri, 5 Jun 2026 at 04:11, Sem <[email protected]> wrote:
>
> Hello!
>
> Correct me if I'm wrong, but it looks to me like py4j is currently the
> main escape hatch for Python users to access JVM-side APIs from an already
> running PySpark Classic session.
>
> Let me explain using the example of the GraphFrames project, where I'm a
> maintainer. At the moment the project provides ~99% parity between classic
> (via py4j) and connect (via "o.a.s.sql.connect.plugin.RelationPlugin") from
> the API point of view.
> But for users it is very different. If we imagine a running cluster, e.g. on
> YARN, users can just drop a JAR onto the classpath and access it via the GF
> py4j API from their session. With connect it is different because the conf
> "spark.connect.extensions.relation.classes" cannot be changed for a
> running cluster: it is a "static conf", isn't it? So all the
> implementations of "RelationPlugin" have to be known before the cluster
> starts. From the user's perspective, this means they need not just to add a
> dependency via "SparkSession.builder.config("spark.jars", ...)" but to start
> a dedicated Spark cluster with the correct list of all the Connect
> plugins. At least that is what I know based on my experiments with Spark
> Connect at versions ~3.5.x and early 4.x RC builds.
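
A minimal sketch of the contrast described above, in plain Python with no Spark
dependency. It only assembles the settings a user would supply in each mode; it
does not start anything. The helper names are made up for illustration, and
passing `--jars` / `--conf` to `sbin/start-connect-server.sh` is an assumption
based on that script forwarding its arguments to spark-submit.

```python
# Conf key named in the thread; per the thread it has to be set at server start.
PLUGIN_CONF = "spark.connect.extensions.relation.classes"


def classic_session_conf(jar):
    """Classic/py4j: the user only attaches the JAR when building a session."""
    return {"spark.jars": jar}


def connect_server_command(jar, plugin_classes):
    """Connect: the plugin classes are listed when the server is launched.

    Assumption: the launch script accepts spark-submit style options.
    """
    return [
        "./sbin/start-connect-server.sh",
        "--jars", jar,
        "--conf", f"{PLUGIN_CONF}={','.join(plugin_classes)}",
    ]


if __name__ == "__main__":
    # Hypothetical JAR path and plugin class, used only as placeholders.
    jar = "/opt/jars/my-extension.jar"
    print(classic_session_conf(jar))
    print(" ".join(connect_server_command(jar, ["com.example.MyRelationPlugin"])))
```

The asymmetry is visible in who runs each half: the first dictionary is something
a user can pass from their own session, while the second command has to be run by
whoever starts the cluster.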
>
> On top of this, I would like to highlight that at the moment none of the
> major Spark vendors supports changing
> "spark.connect.extensions.relation.classes"; at least I did not find
> anything about it in their docs. And the py4j API is the only way for users to
> access 3rd-party packages.
>
> I'm not a Spark contributor, so feel free to ignore this. I just wanted to
> highlight it, because for me Apache Spark has always been not only a
> framework but also a broad ecosystem of 3rd-party packages and extensions.
>
> Best regards,
> Sem
>
> On Thu, 2026-06-04 at 11:04 -0700, Tian Gao via dev wrote:
>
> I think deprecating py4j is what we need to do eventually. A question:
> does that mean that in the future, all Connect dependencies will be
> required for PySpark Classic?
>
> Tian
>
> On Thu, Jun 4, 2026 at 9:59 AM Holden Karau <[email protected]>
> wrote:
>
> Spark 5 sounds like a good time to target these changes. I think we're
> also going to want to audit OSS code that uses Py4J to access JVM objects
> directly and see which additional APIs we might want to / need to
> expose for power users (the answer might just be to write some Scala code and
> add it as a plugin).
>
> On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]> wrote:
>
> Thank you so much for sharing the progress, Hyukjin.
>
> This sounds reasonable to me. This will reduce the gap between classic and
> connect mode greatly.
>
> Given that Apache Spark Release Cadence already defines the path to
> introduce a new feature with its feature flag, I hope we are able to start
> to test this in Apache Spark 4.3.0.
>
> After stabilizing this new architecture via Spark 4.3.0, we may choose
> more migration steps in Spark 5 in 2027 or Spark 6 in 2028.
>
> Thanks,
> Dongjoon.
>
> On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
> > Hi all,
> >
> > Firstly, I wanted to check other opinions (rather than saying this to push
> > my opinion hard).
> >
> > Lately, I have been looking into the feasibility of replacing Py4J with
> > Spark Connect in PySpark.
> > More specifically, I mean using the Spark Connect server and leaving the Py4J
> > server as a fallback option to avoid breakage.
> >
> > As we know, the Py4J gateway server itself already works similarly to the
> > Spark Connect server.
> > It is a somewhat more difficult story in Scala because there wasn't an
> > intermediate server in Classic, but in PySpark there is already a Py4J
> > server running.
> >
> > There are a few downsides of using Py4J:
> > - The Py4J server itself exposes arbitrary access to the JVM, which
> > is actually risky.
> > - Performance is quite slow when sending large data. Note that we
> > switch to raw sockets when we send large binaries.
> > - Errors are difficult to debug.
> >
> > With Spark Connect:
> > - We could limit that access.
> > - We use Arrow batches to send the data, which should be more efficient.
> > - Error handling is implemented quite well in Python Spark Connect.
> >
> > Lastly, structurally they are quite the same. It is not like introducing
> > a new layer.
> >
> > Last time when we chatted about enabling Spark Connect (for both Scala,
> > Python, etc.), the biggest pushback was the missing RDD API. I recently
> > prototyped this and concluded that, for Python specifically, we can add
> > full support for the RDD API and most of the SparkContext API (
> > https://github.com/apache/spark/pull/55888).
> >
> > I also prototyped a Python client-side UI (
> > https://github.com/apache/spark/pull/56053), so for Python
> > specifically, I believe there is not much of a gap between classic and
> > connect, API-wise, structure-wise, etc.
> >
> > I have been thinking about this idea for quite a long time, and I
> > concluded that it is feasible. So I would like to know what others think
> > about this.
> >
> > Thanks!
> >
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
>
>
