Hi all,

There is partial agreement and consensus that Spark Connect is crucial for the future stability of Spark APIs for both end users and developers. At the same time, a couple of PMC members raised concerns about making Spark Connect the default in the upcoming Spark 4.0 release.

I’m proposing an alternative approach here: publish an additional Spark distribution with Spark Connect enabled by default. This approach will help promote the adoption of Spark Connect among new users while allowing us to gather valuable feedback. A separate distribution with Spark Connect enabled by default can promote future adoption of Spark Connect for languages like Rust, Go, or Scala 3.

Here are the details of the proposal:

- Spark 4.0 will include three PyPI packages:
  - pyspark: The classic package.
  - pyspark-client: The thin Spark Connect Python client. Note: in the Spark 4.0 preview releases, we published the thin client under the name pyspark-connect, so we will need to rename it in the official 4.0 release.
  - pyspark-connect: Spark Connect enabled by default.
- An additional tarball will be added to the Spark 4.0 download page with updated scripts (spark-submit, spark-shell, etc.) that enable Spark Connect by default.
- A new Docker image will be provided with Spark Connect enabled by default.

By taking this approach, we can make Spark Connect more visible and accessible to users, which is more effective than simply asking them to configure it manually.

Looking forward to hearing your thoughts!

Thanks,
Wenchen