It seems that there is misinformation about the stability of Spark Connect
in Spark 4. I would like to close that gap on our dev mailing list.

Some people frequently claim that `Spark Connect` is stable because it uses
Protobuf. Yes, we have standardized the interface layer. However, does that
imply that the implementation is stable?

Since Apache Spark is an open-source community, anyone can see the
stability of the implementation in our public CI. There, the PySpark
Connect client has been technically broken most of the time.

1.
https://github.com/apache/spark/actions/workflows/build_python_connect.yml
(Spark Connect Python-only in master)

In addition, the Spark 3.5 client seems to face further difficulties
talking to a Spark 4 server.

2.
https://github.com/apache/spark/actions/workflows/build_python_connect35.yml
(Spark Connect Python-only:master-server, 35-client)

3. What about the stability and feature parity of the clients in other
languages? Do they work well with Apache Spark 4? Is there any evidence the
Apache Spark community can use to assess this?

Given (1), (2), and (3), how can we be sure that `Spark Connect` is stable
or ready in Spark 4? From my perspective, it is still under active
development with an open end.

The bottom line is that `Spark Connect` needs more community love before it
can be claimed as Stable in Apache Spark 4. I'm looking forward to seeing a
healthy Spark Connect CI in Spark 4. Until then, let's clarify what is
stable in `Spark Connect` and what is not yet.

Best Regards,
Dongjoon.

PS.
This is a separate thread from the previous flakiness issues.
https://lists.apache.org/thread/r5dzdr3w4ly0dr99k24mqvld06r4mzmq
([FYI] Known `Spark Connect` Test Suite Flakiness)
