I am fine with providing another option (+1) while leaving the others as
they are. Once the vote passes, we should probably make it ready ASAP - I
don't think it will need a lot of changes in any event.

On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:

> Many of the remaining PRs relate to Spark ML Connect support, but they are
> not critical blockers for offering an additional Spark distribution with
> Spark Connect enabled by default in Spark 4.0, allowing users to try it out
> and provide more feedback.
>
> I agree that we should not postpone the Spark 4.0 release. If these PRs do
> not land before the RC cut, we should ensure they are properly documented.
>
> Thanks,
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> Many new `Connect` feature patches are still landing in `branch-4.0`
> during the QA period, after February 1st.
>
> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala
> Client
> SPARK-50104 Support SparkSession.executeCommand in Connect
> SPARK-50943 Support `Correlation` on Connect
> SPARK-50133 Support DataFrame conversion to table argument in Spark
> Connect Python Client
> SPARK-50942 Support `ChiSquareTest` on Connect
> SPARK-50899 Support PrefixSpan on connect
> SPARK-51060 Support `QuantileDiscretizer` on Connect
> SPARK-50974 Add support foldCol for CrossValidator on connect
> SPARK-51015 Support RFormulaModel.toString on Connect
> SPARK-50843 Support return a new model from existing one
>
> AFAIK, the only thing the community can agree on is that `Connect`
> development is not finished yet.
> - Since `Connect` development is not finished yet, more patches will land
> if we want it to be complete.
> - Since `Connect` development is not finished yet, there are more
> concerns about adding this as a new distribution.
>
> That's why I asked only about the release schedule.
> We need to consider not only your new patch, but also the remaining
> `Connect` PRs, in order to deliver the newly proposed distribution
> meaningfully and completely in Spark 4.0.
>
> So, let me ask you again: are you sure that there will be no delay?
> Based on the commit history, I'm wondering whether
> both Herman and Ruifeng agree with you.
>
> To be clear, if there is no harm to the Apache Spark community,
> I'll give +1 of course. Why not?
>
> Thanks,
> Dongjoon.
>
> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi Dongjoon,
>>
>> This is a big decision but not a big project. We just need to update the
>> release scripts to produce the additional Spark distribution. If people are
>> positive about this, I can start implementing the script changes now and
>> merge them after this proposal has been voted on and approved.
>>
>> Thanks,
>> Wenchen
>>
>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, Wenchen.
>>>
>>> I'm wondering whether this implies any delay to the existing QA and RC1
>>> schedule.
>>>
>>> If so, why don't we properly schedule this new alternative proposal for
>>> Spark 4.1?
>>>
>>> Best regards,
>>> Dongjoon
>>>
>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> There is partial agreement that Spark Connect is crucial for the future
>>>> stability of Spark APIs, for both end users and developers. At the same
>>>> time, a couple of PMC members raised concerns about making Spark Connect
>>>> the default in the upcoming Spark 4.0 release. I’m proposing an
>>>> alternative approach here: publish an additional Spark distribution with
>>>> Spark Connect enabled by default. This approach will help promote the
>>>> adoption of Spark Connect among new users while allowing us to gather
>>>> valuable feedback, and it can also pave the way for future Spark Connect
>>>> clients in languages like Rust, Go, or Scala 3.
>>>>
>>>> Here are the details of the proposal:
>>>>
>>>>    - Spark 4.0 will include three PyPI packages:
>>>>       - pyspark: The classic package.
>>>>       - pyspark-client: The thin Spark Connect Python client. Note: in
>>>>       the Spark 4.0 preview releases we published the pyspark-connect
>>>>       package for the thin client; we will need to rename it in the
>>>>       official 4.0 release.
>>>>       - pyspark-connect: Spark Connect enabled by default.
>>>>    - An additional tarball will be added to the Spark 4.0 download page,
>>>>    with updated scripts (spark-submit, spark-shell, etc.) to enable
>>>>    Spark Connect by default.
>>>>    - A new Docker image will be provided with Spark Connect enabled by
>>>>    default.
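>>>>
>>>> For illustration, a minimal sketch of using the thin client (it assumes
>>>> a Spark Connect server is already running on the default local
>>>> endpoint, and uses the final package name proposed above):
>>>>
>>>>     # pip install pyspark-client   (proposed final package name)
>>>>     from pyspark.sql import SparkSession
>>>>
>>>>     # Attach to a running Spark Connect server; 15002 is the server's
>>>>     # default port.
>>>>     spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
>>>>
>>>>     df = spark.range(10)   # the plan is built on the client
>>>>     print(df.count())      # and executed on the server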
>>>>
>>>> By taking this approach, we can make Spark Connect more visible and
>>>> accessible to users, which is more effective than simply asking them to
>>>> configure it manually.
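>>>>
>>>> For reference, "configuring it manually" today means starting the
>>>> Connect server yourself and pointing each session at it - a minimal
>>>> sketch, assuming a local server started via
>>>> ./sbin/start-connect-server.sh:
>>>>
>>>>     import os
>>>>
>>>>     # PySpark honors the SPARK_REMOTE environment variable, so
>>>>     # getOrCreate() below returns a Connect session, not a classic one.
>>>>     os.environ["SPARK_REMOTE"] = "sc://localhost:15002"
>>>>
>>>>     from pyspark.sql import SparkSession
>>>>     spark = SparkSession.builder.getOrCreate()
>>>>     spark.sql("SELECT 1 AS ok").show()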
>>>>
>>>> Looking forward to hearing your thoughts!
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>
>
