+1 to new distribution mechanisms which will increase Spark adoption! I do agree with Dongjoon's concern that this should not result in slipping the schedule; something to watch out for.
Regards,
Mridul

On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org> wrote:

> I am fine with providing another option, +1, with leaving the others as
> they are. Once the vote passes, we should probably make it ready ASAP - I
> don't think it will need a lot of changes in any event.
>
> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> Many of the remaining PRs relate to Spark ML Connect support, but they
>> are not critical blockers for offering an additional Spark distribution
>> with Spark Connect enabled by default in Spark 4.0, allowing users to
>> try it out and provide more feedback.
>>
>> I agree that we should not postpone the Spark 4.0 release. If these PRs
>> do not land before the RC cut, we should ensure they are properly
>> documented.
>>
>> Thanks,
>>
>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>
>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>> Many new-feature `Connect` patches are still landing in `branch-4.0`
>> during the QA period after February 1st:
>>
>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala
>> Client
>> SPARK-50104 Support SparkSession.executeCommand in Connect
>> SPARK-50943 Support `Correlation` on Connect
>> SPARK-50133 Support DataFrame conversion to table argument in Spark
>> Connect Python Client
>> SPARK-50942 Support `ChiSquareTest` on Connect
>> SPARK-50899 Support PrefixSpan on connect
>> SPARK-51060 Support `QuantileDiscretizer` on Connect
>> SPARK-50974 Add support foldCol for CrossValidator on connect
>> SPARK-51015 Support RFormulaModel.toString on Connect
>> SPARK-50843 Support return a new model from existing one
>>
>> AFAIK, the only thing the community can agree on is that `Connect`
>> development is still unfinished:
>> - Since `Connect` development is still unfinished, more patches will
>> land if we want it to be complete.
>> - Since `Connect` development is still unfinished, there are more
>> concerns about adding this as a new distribution.
>>
>> That's the reason why I asked only about the release schedule.
>> We need to consider not only your new patch but also the remaining
>> `Connect` PRs in order to deliver the newly proposed distribution
>> meaningfully and completely in Spark 4.0.
>>
>> So, let me ask you again: are you sure that there will be no delay?
>> Given the commit history, I'm wondering whether both Herman and Ruifeng
>> agree with you or not.
>>
>> To be clear, if there is no harm to the Apache Spark community,
>> I'll give +1 of course. Why not?
>>
>> Thanks,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi Dongjoon,
>>>
>>> This is a big decision but not a big project. We just need to update
>>> the release scripts to produce the additional Spark distribution. If
>>> people are positive about this, I can start implementing the script
>>> changes now and merge them after this proposal has been voted on and
>>> approved.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Wenchen.
>>>>
>>>> I'm wondering whether this implies any delay of the existing QA and
>>>> RC1 schedule or not.
>>>>
>>>> If so, why don't we schedule this new alternative proposal for Spark
>>>> 4.1 properly?
>>>>
>>>> Best regards,
>>>> Dongjoon
>>>>
>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> There is partial agreement and consensus that Spark Connect is
>>>>> crucial for the future stability of Spark APIs, for both end users
>>>>> and developers. At the same time, a couple of PMC members raised
>>>>> concerns about making Spark Connect the default in the upcoming
>>>>> Spark 4.0 release.
>>>>> I'm proposing an alternative approach here: publish an additional
>>>>> Spark distribution with Spark Connect enabled by default. This
>>>>> approach will help promote the adoption of Spark Connect among new
>>>>> users while allowing us to gather valuable feedback. A separate
>>>>> distribution with Spark Connect enabled by default can also promote
>>>>> future adoption of Spark Connect for languages like Rust, Go, or
>>>>> Scala 3.
>>>>>
>>>>> Here are the details of the proposal:
>>>>>
>>>>> - Spark 4.0 will include three PyPI packages:
>>>>>   - pyspark: the classic package.
>>>>>   - pyspark-client: the thin Spark Connect Python client. Note that
>>>>>     in the Spark 4.0 preview releases we published the
>>>>>     pyspark-connect package for the thin client; we will need to
>>>>>     rename it in the official 4.0 release.
>>>>>   - pyspark-connect: Spark Connect enabled by default.
>>>>> - An additional tarball will be added to the Spark 4.0 download page
>>>>>   with updated scripts (spark-submit, spark-shell, etc.) that enable
>>>>>   Spark Connect by default.
>>>>> - A new Docker image will be provided with Spark Connect enabled by
>>>>>   default.
>>>>>
>>>>> By taking this approach, we can make Spark Connect more visible and
>>>>> accessible to users, which is more effective than simply asking them
>>>>> to configure it manually.
>>>>>
>>>>> Looking forward to hearing your thoughts!
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>
>>
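For readers following the thread: the manual configuration the proposal would make unnecessary looks roughly like the following on a classic Spark distribution. This is an illustrative sketch, not part of the proposal; it assumes a standard Spark tarball layout, the documented `start-connect-server.sh` script, and the default Spark Connect port 15002.

```shell
# Start a Spark Connect server from a classic Spark distribution.
# (In Spark 3.x the server plugin had to be pulled in via --packages;
# in Spark 4.0 the Connect jars are expected to ship in the tarball.)
./sbin/start-connect-server.sh

# Point an interactive PySpark shell at the running Connect server
# using the documented --remote option and connection string syntax.
./bin/pyspark --remote "sc://localhost:15002"
```

In the proposed additional distribution, spark-submit, spark-shell, and friends would behave this way out of the box, so new users would not need these extra steps.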