Many of the remaining PRs relate to Spark ML Connect support, but they are not critical blockers for offering an additional Spark distribution with Spark Connect enabled by default in Spark 4.0, which would let users try it out and provide more feedback.
I agree that we should not postpone the Spark 4.0 release. If these PRs do
not land before the RC cut, we should ensure they are properly documented.

Thanks,
DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1

> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> Many new feature `Connect` patches are still landing in `branch-4.0`
> during the QA period after February 1st.
>
> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
> SPARK-50104 Support SparkSession.executeCommand in Connect
> SPARK-50943 Support `Correlation` on Connect
> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
> SPARK-50942 Support `ChiSquareTest` on Connect
> SPARK-50899 Support PrefixSpan on Connect
> SPARK-51060 Support `QuantileDiscretizer` on Connect
> SPARK-50974 Add support for foldCol for CrossValidator on Connect
> SPARK-51015 Support RFormulaModel.toString on Connect
> SPARK-50843 Support returning a new model from an existing one
>
> AFAIK, the only thing the community can agree on is that `Connect`
> development is not finished yet:
> - Since `Connect` development is unfinished, more patches will land if
>   we want it to be complete.
> - Since `Connect` development is unfinished, there are more concerns
>   about adding it as a new distribution.
>
> That's the reason why I asked about the release schedule only.
> We need to consider not only your new patch but also the remaining
> `Connect` PRs in order to deliver the newly proposed distribution
> meaningfully and completely in Spark 4.0.
>
> So, let me ask you again: are you sure that there will be no delay?
> Judging from the commit history, I'm wondering whether both Herman and
> Ruifeng agree with you.
>
> To be clear, if there is no harm to the Apache Spark community,
> I'll give +1, of course. Why not?
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>> Hi Dongjoon,
>>
>> This is a big decision but not a big project. We just need to update
>> the release scripts to produce the additional Spark distribution. If
>> people are positive about this, I can start implementing the script
>> changes now and merge them after this proposal has been voted on and
>> approved.
>>
>> Thanks,
>> Wenchen
>>
>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>> Hi, Wenchen.
>>>
>>> I'm wondering whether this implies any delay to the existing QA and
>>> RC1 schedule.
>>>
>>> If it does, why don't we schedule this new alternative proposal for
>>> Spark 4.1 instead?
>>>
>>> Best regards,
>>> Dongjoon
>>>
>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> There is partial consensus that Spark Connect is crucial for the
>>>> future stability of Spark APIs, for both end users and developers.
>>>> At the same time, a couple of PMC members raised concerns about
>>>> making Spark Connect the default in the upcoming Spark 4.0 release.
>>>> I'm proposing an alternative approach here: publish an additional
>>>> Spark distribution with Spark Connect enabled by default. This
>>>> approach will help promote the adoption of Spark Connect among new
>>>> users while allowing us to gather valuable feedback.
>>>> A separate distribution with Spark Connect enabled by default can
>>>> also promote future adoption of Spark Connect clients for languages
>>>> like Rust, Go, or Scala 3.
>>>>
>>>> Here are the details of the proposal:
>>>>
>>>> 1. Spark 4.0 will include three PyPI packages:
>>>>    - pyspark: the classic package.
>>>>    - pyspark-client: the thin Spark Connect Python client. Note that
>>>>      in the Spark 4.0 preview releases we published the
>>>>      pyspark-connect package for the thin client; we will need to
>>>>      rename it in the official 4.0 release.
>>>>    - pyspark-connect: the pyspark package with Spark Connect enabled
>>>>      by default.
>>>> 2. An additional tarball will be added to the Spark 4.0 download
>>>>    page, with updated scripts (spark-submit, spark-shell, etc.) that
>>>>    enable Spark Connect by default.
>>>> 3. A new Docker image will be provided with Spark Connect enabled by
>>>>    default.
>>>>
>>>> By taking this approach, we can make Spark Connect more visible and
>>>> accessible to users, which is more effective than simply asking them
>>>> to configure it manually.
>>>>
>>>> Looking forward to hearing your thoughts!
>>>>
>>>> Thanks,
>>>> Wenchen
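
[Editorial note: for reference, a minimal sketch of what the thin-client
workflow described in the proposal could look like. It assumes a Spark
Connect server is already running locally (e.g., started via
sbin/start-connect-server.sh) on the default port 15002; the package name
pyspark-client is taken from the proposal above, and nothing here is
specified by this thread beyond that.]

    # Install the thin client (package name per the proposal):
    #   pip install pyspark-client
    from pyspark.sql import SparkSession

    # "sc://" is the Spark Connect URI scheme. The session lives on the
    # server; the client only ships query plans and receives results
    # over gRPC.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    df = spark.range(10).filter("id % 2 = 0")
    print(df.count())  # executed remotely, result returned to the client

    spark.stop()

[With the proposed connect-by-default distribution, presumably the same
session would be obtained without the explicit .remote(...) call; the
exact behavior of the updated spark-submit/spark-shell scripts is not
specified in this thread.]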