Re: [DISCUSS] Replacing Py4J to Spark Connect in PySpark (possibly with RDD API and UI)

Tian Gao via dev Thu, 04 Jun 2026 11:05:16 -0700

I think deprecating Py4J is what we need to do eventually. A question: does
that mean that in the future, all Connect dependencies will be required for
PySpark Classic?


Tian

On Thu, Jun 4, 2026 at 9:59 AM Holden Karau <[email protected]> wrote:

> Spark 5 sounds like a good time to target these changes. I think we're
> also going to want to audit OSS code that uses Py4J to access JVM objects
> directly and see which additional APIs we might want or need to expose for
> power users (the answer might just be to write some Scala code and add it
> as a plugin).
>
> On Thu, Jun 4, 2026 at 9:50 AM Dongjoon Hyun <[email protected]> wrote:
>
>> Thank you so much for sharing the progress, Hyukjin.
>>
>> This sounds reasonable to me. This will reduce the gap between classic
>> and connect mode greatly.
>>
>> Given that Apache Spark Release Cadence already defines the path to
>> introduce a new feature with its feature flag, I hope we are able to start
>> to test this in Apache Spark 4.3.0.
>>
>> After stabilizing this new architecture via Spark 4.3.0, we may choose
>> more migration steps in Spark 5 in 2027 or Spark 6 in 2028.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2026/06/04 06:12:16 Hyukjin Kwon wrote:
>> > Hi all,
>> >
>> > First, I want to hear other opinions (rather than push my own opinion
>> > hard).
>> >
>> > Lately, I have been looking into the feasibility of replacing Py4J with
>> > Spark Connect in PySpark.
>> > More specifically, I mean using the Spark Connect server and leaving the
>> > Py4J server as a fallback option to avoid breakage.
>> >
>> > As we know, the Py4J gateway server itself already works similarly to
>> > the Spark Connect server.
>> > The story is a bit more difficult in Scala because there was no
>> > intermediate server in Classic, but in PySpark there is already a Py4J
>> > server running.
>> >
>> > There are a few downsides of using Py4J:
>> > - The Py4J server itself exposes arbitrary access to the JVM, which is
>> > risky.
>> > - Performance is quite slow when sending large data. Note that we
>> > already switch to raw sockets when we send large binaries.
>> > - Errors are difficult to debug.
>> >
>> > With Spark Connect:
>> > - We could limit that access.
>> > - We use Arrow batches to send the data, which should be more efficient.
>> > - Error handling is implemented quite well in Python Spark Connect.
>> >
>> > Lastly, structurally they are quite the same. It is not like
>> > introducing a new layer.
>> >
>> > Last time we chatted about turning Spark Connect on (for Scala, Python,
>> > etc.), the biggest pushback was the missing RDD API. I recently
>> > prototyped this and concluded that, for Python specifically, we can add
>> > full support for the RDD API and most of the SparkContext API (
>> > https://github.com/apache/spark/pull/55888).
>> >
>> > I also prototyped a Python client-side UI (
>> > https://github.com/apache/spark/pull/56053), so for Python
>> > specifically, I believe there is not much of a gap between classic and
>> > connect, API-wise, structure-wise, etc.
>> >
>> > I have been thinking about this idea for quite a long time, and I
>> > concluded that it is feasible. So I would like to know what others think
>> > about this.
>> >
>> > Thanks!
>> >
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
