I am fine with providing another option: +1, with the others left as they are. Once the vote passes, we should probably make it ready ASAP - I don't think it will need a lot of changes in any event.
On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:

> Many of the remaining PRs relate to Spark ML Connect support, but they
> are not critical blockers for offering an additional Spark distribution
> with Spark Connect enabled by default in Spark 4.0, allowing users to
> try it out and provide more feedback.
>
> I agree that we should not postpone the Spark 4.0 release. If these PRs
> do not land before the RC cut, we should ensure they are properly
> documented.
>
> Thanks,
>
> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>
> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> Many new-feature `Connect` patches are still landing in `branch-4.0`
> during the QA period after February 1st.
>
> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
> SPARK-50104 Support SparkSession.executeCommand in Connect
> SPARK-50943 Support `Correlation` on Connect
> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
> SPARK-50942 Support `ChiSquareTest` on Connect
> SPARK-50899 Support PrefixSpan on connect
> SPARK-51060 Support `QuantileDiscretizer` on Connect
> SPARK-50974 Add support foldCol for CrossValidator on connect
> SPARK-51015 Support RFormulaModel.toString on Connect
> SPARK-50843 Support return a new model from existing one
>
> AFAIK, the only thing the community can agree on is that `Connect`
> development is still unfinished.
> - Since `Connect` development is unfinished, more patches will land if
> we want it to be complete.
> - Since `Connect` development is unfinished, there are more concerns
> about adding this as a new distribution.
>
> That's the reason why I asked about the release schedule only.
> We need to consider not only your new patch, but also the remaining
> `Connect` PRs, in order to deliver the proposed new distribution
> meaningfully and completely in Spark 4.0.
>
> So, let me ask you again: are you sure that there will be no delay?
> Judging from the commit history, I'm wondering whether both Herman and
> Ruifeng agree with you.
>
> To be clear, if there is no harm to the Apache Spark community, I'll
> give a +1, of course. Why not?
>
> Thanks,
> Dongjoon.
>
> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi Dongjoon,
>>
>> This is a big decision but not a big project. We just need to update
>> the release scripts to produce the additional Spark distribution. If
>> people are positive about this, I can start implementing the script
>> changes now and merge them after this proposal has been voted on and
>> approved.
>>
>> Thanks,
>> Wenchen
>>
>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, Wenchen.
>>>
>>> I'm wondering whether this implies any delay to the existing QA and
>>> RC1 schedule.
>>>
>>> If so, why don't we properly schedule this new alternative proposal
>>> for Spark 4.1?
>>>
>>> Best regards,
>>> Dongjoon
>>>
>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> There is partial agreement and consensus that Spark Connect is
>>>> crucial for the future stability of Spark APIs for both end users
>>>> and developers. At the same time, a couple of PMC members raised
>>>> concerns about making Spark Connect the default in the upcoming
>>>> Spark 4.0 release. I'm proposing an alternative approach here:
>>>> publish an additional Spark distribution with Spark Connect enabled
>>>> by default.
>>>> This approach will help promote the adoption of Spark Connect among
>>>> new users while allowing us to gather valuable feedback. A separate
>>>> distribution with Spark Connect enabled by default can also promote
>>>> future adoption of Spark Connect for languages like Rust, Go, or
>>>> Scala 3.
>>>>
>>>> Here are the details of the proposal:
>>>>
>>>> - Spark 4.0 will include three PyPI packages:
>>>>   - pyspark: The classic package.
>>>>   - pyspark-client: The thin Spark Connect Python client. Note: in
>>>>     the Spark 4.0 preview releases we published the pyspark-connect
>>>>     package for the thin client; we will need to rename it in the
>>>>     official 4.0 release.
>>>>   - pyspark-connect: Spark Connect enabled by default.
>>>> - An additional tarball will be added to the Spark 4.0 download page
>>>>   with updated scripts (spark-submit, spark-shell, etc.) to enable
>>>>   Spark Connect by default.
>>>> - A new Docker image will be provided with Spark Connect enabled by
>>>>   default.
>>>>
>>>> By taking this approach, we can make Spark Connect more visible and
>>>> accessible to users, which is more effective than simply asking them
>>>> to configure it manually.
>>>>
>>>> Looking forward to hearing your thoughts!
>>>>
>>>> Thanks,
>>>> Wenchen
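
For anyone who wants to kick the tires once the packages land, here is a
minimal sketch of using the thin client. It assumes the pyspark-client
package name from the proposal above, the existing
SparkSession.builder.remote() API, and a Spark Connect server already
running; the sc:// URL below is a placeholder to adjust for your setup.

    # pip install pyspark-client   (thin client, name per the proposal)
    from pyspark.sql import SparkSession

    # Connect to a running Spark Connect server (15002 is the default port).
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # DataFrame operations are built client-side and executed on the server.
    spark.range(10).filter("id % 2 == 0").show()

If I read the proposal correctly, the pyspark-connect package would
instead ship the full distribution with Connect mode on by default, so a
plain SparkSession.builder.getOrCreate() would already go through Connect
without the explicit remote() call.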