+1 to new distribution mechanisms that will increase Spark adoption!

I do agree with Dongjoon’s concerns that this should not result in slipping
the schedule; something to watch out for.

Regards,
Mridul



On Tue, Feb 4, 2025 at 8:07 PM Hyukjin Kwon <gurwls...@apache.org> wrote:

> I am fine with providing another option (+1) while leaving the others as
> they are. Once the vote passes, we should probably make it ready ASAP - I
> don't think it will need many changes in any event.
>
> On Wed, 5 Feb 2025 at 02:40, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> Many of the remaining PRs relate to Spark ML Connect support, but they
>> are not critical blockers for offering an additional Spark distribution
>> with Spark Connect enabled by default in Spark 4.0, allowing users to try
>> it out and provide more feedback.
>>
>> I agree that we should not postpone the Spark 4.0 release. If these PRs
>> do not land before the RC cut, we should ensure they are properly
>> documented.
>>
>> Thanks,
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Feb 4, 2025, at 7:23 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>> Many new `Connect` feature patches are still landing in `branch-4.0`
>> during the QA period, after February 1st.
>>
>> SPARK-49308 Support UserDefinedAggregateFunction in Spark Connect Scala Client
>> SPARK-50104 Support SparkSession.executeCommand in Connect
>> SPARK-50943 Support `Correlation` on Connect
>> SPARK-50133 Support DataFrame conversion to table argument in Spark Connect Python Client
>> SPARK-50942 Support `ChiSquareTest` on Connect
>> SPARK-50899 Support PrefixSpan on connect
>> SPARK-51060 Support `QuantileDiscretizer` on Connect
>> SPARK-50974 Add support foldCol for CrossValidator on connect
>> SPARK-51015 Support RFormulaModel.toString on Connect
>> SPARK-50843 Support return a new model from existing one
>>
>> AFAIK, the only thing the community can agree on is that `Connect`
>> development is not finished yet.
>> - Since `Connect` development is unfinished, more patches will have to
>> land if we want it to be complete.
>> - Since `Connect` development is unfinished, there are more concerns
>> about adding it as a new distribution.
>>
>> That's why I asked only about the release schedule.
>> We need to consider not only your new patch but also the remaining
>> `Connect` PRs in order to deliver the proposed new distribution
>> meaningfully and completely in Spark 4.0.
>>
>> So, let me ask you again: are you sure that there will be no delay?
>> Judging by the commit history, I'm wondering whether Herman and Ruifeng
>> both agree with you.
>>
>> To be clear, if there is no harm to the Apache Spark community,
>> I'll give +1 of course. Why not?
>>
>> Thanks,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Feb 4, 2025 at 1:10 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi Dongjoon,
>>>
>>> This is a big decision but not a big project. We just need to update the
>>> release scripts to produce the additional Spark distribution. If people
>>> are positive about this, I can start implementing the script changes now
>>> and merge them after this proposal has been voted on and approved.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Tue, Feb 4, 2025 at 4:10 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Wenchen.
>>>>
>>>> I'm wondering whether this implies any delay to the existing QA and RC1
>>>> schedule.
>>>>
>>>> If so, why don't we properly schedule this alternative proposal for
>>>> Spark 4.1?
>>>>
>>>> Best regards,
>>>> Dongjoon
>>>>
>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> There is partial agreement in the community that Spark Connect is
>>>>> crucial for the future stability of Spark APIs, for both end users and
>>>>> developers. At the same time, a couple of PMC members raised concerns
>>>>> about making Spark Connect the default in the upcoming Spark 4.0
>>>>> release. I'm proposing an alternative approach here: publish an
>>>>> additional Spark distribution with Spark Connect enabled by default.
>>>>> This will help promote the adoption of Spark Connect among new users
>>>>> while allowing us to gather valuable feedback, and it can also pave the
>>>>> way for future Spark Connect clients in languages like Rust, Go, or
>>>>> Scala 3.
>>>>>
>>>>> Here are the details of the proposal:
>>>>>
>>>>>    - Spark 4.0 will include three PyPI packages:
>>>>>       - pyspark: the classic package.
>>>>>       - pyspark-client: the thin Spark Connect Python client. Note
>>>>>       that in the Spark 4.0 preview releases we published the
>>>>>       pyspark-connect package for the thin client; we will need to
>>>>>       rename it in the official 4.0 release.
>>>>>       - pyspark-connect: Spark Connect enabled by default.
>>>>>    - An additional tarball will be added to the Spark 4.0 download
>>>>>    page with updated scripts (spark-submit, spark-shell, etc.) to
>>>>>    enable Spark Connect by default.
>>>>>    - A new Docker image will be provided with Spark Connect enabled
>>>>>    by default.
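[Editor's sketch] As a rough illustration of how a user would pick between the proposed packages, the choices above might look like the following. Package names are taken from the proposal itself; whether they ship under these exact names, and the default Connect port shown, are assumptions pending the final 4.0 release.

```shell
# Pick exactly one of the three proposed PyPI packages:
pip install pyspark           # classic distribution, Connect off by default
pip install pyspark-client    # thin client only; requires a remote Spark Connect server
pip install pyspark-connect   # full distribution with Spark Connect on by default

# Regardless of distribution, a shell can target a Connect server explicitly
# (15002 is the conventional Spark Connect port):
pyspark --remote "sc://localhost:15002"
```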
>>>>>
>>>>> By taking this approach, we can make Spark Connect more visible and
>>>>> accessible to users, which is more effective than simply asking them to
>>>>> configure it manually.
>>>>>
>>>>> Looking forward to hearing your thoughts!
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>
>>
