For the RDD part, I also disagree with Martin. I believe the RDD API should be supported permanently as a public API. Otherwise, it would be a surprise to me and my colleagues, at least.
> I would assume that we all agree that > 99% of the _new_ users in Spark should not try to write code in RDDs.

Given this long discussion, I have also decided to switch my vote from +1 to -1, because it seems too early to make this decision given the pending `Spark Connect` work and the still-active discussion. Previously, I was focused too much on only the SQL part.

As a side note, I hope the Apache Spark 4.0.0 release is not going to be blocked by the pending `Spark Connect` work and decision.

Dongjoon.

On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com> wrote:
>
>> At the risk of repeating what Herman said word for word :) I would like to call out the following:
>>
>> 1. The goal of setting the default is to guide users to the Spark SQL APIs that have proven themselves over time. We shouldn't underestimate the power of the default. I would assume that we all agree that 99% of the _new_ users in Spark should not try to write code in RDDs.
>
> I would disagree here. Maybe more like 75%.
>
>> 2. Any user, organization, or vendor can leverage *all* of their existing code by simply changing *one* configuration during startup: switching spark.api.mode to classic (e.g., similar to ANSI mode). This means all existing RDD and library code just works.
>>
>> 3. Creating a fractured user experience by using some logic to identify which API mode is used is not ideal.
>> For many of the use cases that I've seen that require additional jars (e.g., data sources, drivers), they just work because Spark already has the right abstractions. JARs used in the client-side part of the code also just work, as Herman said.
>
> Introducing a config flag that defaults to a limited API already introduces a fractured user experience, where an application may fail partway through running.
>
>> Similarly, based on the experience of running Spark Connect in production, the coexistence of workloads running in classic mode and connect mode works fine.
>
> I still don't like "classic" mode (maybe "full" and "restricted").
>
>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> I would switch to +0 if connect were the default only for apps without any user-provided jars / non-JVM apps.
>>>
>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> Given there is no plan to support RDDs, I'll update to -0.9.
>>>>
>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com> wrote:
>>>>
>>>>> Hi Holden and Mridul,
>>>>>
>>>>> Just to be clear.
>>>>> What API parity are you expecting here? We have parity for everything that is exposed in org.apache.spark.sql. Connect does not support RDDs, SparkContext, etc. There are currently no plans to support these. We are considering adding a compatibility layer, but it will be limited in scope. From running Connect in production for the last year, we see that most users can migrate their workloads without any problems.
>>>>>
>>>>> I do want to call out that this proposal is mostly aimed at how new users will interact with Spark. Existing users, when they migrate their application to Spark 4, have to set a conf if it turns out their application is not working. This should be a minor inconvenience compared to the headaches that a new Scala version or other library upgrades can cause.
>>>>>
>>>>> Since this is a breaking change, I do think this should be done in a major version.
>>>>>
>>>>> At the risk of repeating the SPIP: using Connect as the default brings a lot to the table (e.g., simplicity, easier upgrades, extensibility, etc.), so I'd urge you to also factor this into your decision making.
>>>>>
>>>>> Happy Thanksgiving!
>>>>>
>>>>> Cheers,
>>>>> Herman
>>>>>
>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I agree with Holden; I am leaning -1 on the proposal as well. Unlike the removal of deprecated features, which we align on a major version boundary, changing the default is something we could also do in a minor version - once there is API parity.
>>>>>>
>>>>>> Whichever major or minor version we make the switch in, there could be user impact; minimizing this impact would be greatly appreciated by our users.
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> -0.5: I don't think this is a good idea for JVM apps until we have API parity. (Binding, but to be clear, not a veto.)
>>>>>>>
>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Thank you Herman!
>>>>>>>>
>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> As part of the discussion on this topic, I would love to highlight the work that the community is currently doing to support SparkML, which is traditionally very RDD-heavy, natively in Spark Connect. Bobby's awesome work shows that, over time, we can extend the features of Spark Connect and support workloads that we previously thought could not be supported easily.
>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>
>>>>>>>>>>> Martin
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF) <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>> Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev list <dev@spark.apache.org>
>>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 14:48, Herman van Hovell <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the default API in Spark 4.0".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a lot of improvements with respect to simplicity, stability, isolation, upgradability, and extensibility (all detailed in the SPIP). In a nutshell: we want to introduce a flag, spark.api.mode, that allows a user to choose between classic and connect mode, the default being connect. A user can easily fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>> SPIP: https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>>>>>>> +47 480 94 297
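For readers following along: the mechanism the thread is voting on is the single `spark.api.mode` configuration described in the SPIP above. A minimal sketch of the fallback it would enable (the application class and jar names here are hypothetical; the flag name and its `connect`/`classic` values come from the proposal):

```shell
# Under the proposal, Spark 4.0 would default spark.api.mode to "connect".
# An existing RDD-based application would opt back into the classic API
# with one configuration flag at submit time:
spark-submit \
  --conf spark.api.mode=classic \
  --class com.example.LegacyRddJob \
  legacy-rdd-job.jar
```

The same setting could also go into `spark-defaults.conf`, which is how the "change *one* configuration during startup" argument above would play out for organizations with large existing codebases.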