Hi all,

There is partial consensus that Spark Connect is crucial for
the future stability of Spark APIs for both end users and developers. At
the same time, a couple of PMC members raised concerns about making Spark
Connect the default in the upcoming Spark 4.0 release. I’m proposing an
alternative approach here: publish an additional Spark distribution with
Spark Connect enabled by default. This approach will help promote the
adoption of Spark Connect among new users while allowing us to gather
valuable feedback. It can also encourage future adoption of Spark Connect
in other languages such as Rust, Go, or Scala 3.

Here are the details of the proposal:

   - Spark 4.0 will include three PyPI packages:
      - pyspark: The classic package.
      - pyspark-client: The thin Spark Connect Python client. Note that in
      the Spark 4.0 preview releases we published the thin client as
      pyspark-connect, so we will need to rename it for the official 4.0
      release.
      - pyspark-connect: Spark Connect enabled by default.
   - An additional tarball will be added to the Spark 4.0 download page
   with updated scripts (spark-submit, spark-shell, etc.) to enable Spark
   Connect by default.
   - A new Docker image will be provided with Spark Connect enabled by
   default.

By taking this approach, we can make Spark Connect more visible and
accessible to users, which is more effective than simply asking them to
configure it manually.

Looking forward to hearing your thoughts!

Thanks,
Wenchen
