At the risk of repeating what Herman said word for word :) I would like to
call out the following:

   1. The goal of setting the default is to guide users toward the Spark
   SQL APIs that have proven themselves over time. We shouldn't
   underestimate the power of the default. I assume we all agree that 99%
   of _new_ Spark users should not be writing code against RDDs.

   2. Any user, organization, or vendor can leverage *all* of their
   existing code by changing just *one* configuration at startup:
   switching spark.api.mode to classic (similar to, e.g., ANSI mode). This
   means all existing RDD and library code keeps working as before.

   3. Creating a fractured user experience by using some logic to decide
   which API mode is used is not ideal. Many of the use cases I've seen
   that require additional jars (e.g., data sources, drivers) work fine
   because Spark already has the right abstractions. JARs used on the
   client side of the code also just work, as Herman said.

Similarly, based on the experience of running Spark Connect in production,
workloads running in classic mode and connect mode coexist without
problems.
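As an illustration of point 2, the fallback could look like this at submit
time (a sketch only: the application file names are placeholders, and
spark.api.mode with values connect/classic is the flag proposed in the SPIP):

```shell
# Sketch: keep an existing application on the classic API by flipping one
# configuration at startup; the application code itself is unchanged.
spark-submit --conf spark.api.mode=classic my_existing_app.py

# Under the proposal, new workloads get connect mode by default:
spark-submit my_new_app.py
```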



On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com> wrote:

> I would switch to +0 if Connect were the default only for apps without
> any user-provided jars, or for non-JVM apps.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com>
> wrote:
>
>> Given there is no plan to support RDDs, I’ll update to -0.9
>>
>>
>>
>>
>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <her...@databricks.com>
>> wrote:
>>
>>> Hi Holden and Mridul,
>>>
>>> Just to be clear. What API parity are you expecting here? We have parity
>>> for everything that is exposed in org.apache.spark.sql. Connect does
>>> not support RDDs, SparkContext, etc... There are currently no plans to
>>> support this. We are considering adding a compatibility layer but that will
>>> be limited in scope. From running Connect in production for the last year,
>>> we see that most users can migrate their workloads without any problems.
>>>
>>> I do want to call out that this proposal is mostly aimed at how new
>>> users will interact with Spark. Existing users who migrate their
>>> application to Spark 4 only need to set one conf if it turns out their
>>> application does not work. This should be a minor inconvenience compared
>>> to the headaches that a new Scala version or other library upgrades can
>>> cause.
>>>
>>> Since this is a breaking change, I do think this should be done in a
>>> major version.
>>>
>>> At the risk of repeating the SPIP: using Connect as the default brings
>>> a lot to the table (e.g., simplicity, easier upgrades, extensibility).
>>> I'd urge you to factor this into your decision making as well.
>>>
>>> Happy thanksgiving!
>>>
>>> Cheers,
>>> Herman
>>>
>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <mri...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>   I agree with Holden; I am leaning -1 on the proposal as well.
>>>> Unlike the removal of deprecated features, which we align on a major
>>>> version boundary, changing the default is something we can also do in a
>>>> minor version - once there is API parity.
>>>>
>>>> Irrespective of which major/minor version we make the switch in - there
>>>> could be user impact; minimizing this impact would be greatly appreciated
>>>> by our users.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <holden.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> -0.5: I don’t think this a good idea for JVM apps until we have API
>>>>> parity. (Binding but to be clear not a veto)
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Thank you Herman!
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund
>>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> As part of the discussion on this topic, I would love to highlight
>>>>>>>>> the work that the community is currently doing to support SparkML,
>>>>>>>>> which is traditionally very RDD-heavy, natively in Spark Connect.
>>>>>>>>> Bobby's awesome work shows that, over time, we can extend the
>>>>>>>>> features of Spark Connect and support workloads that we previously
>>>>>>>>> thought could not be supported easily.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF)
>>>>>>>>> <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>> -------- Original Message --------
>>>>>>>>>> From: Hyukjin Kwon <gurwls...@apache.org>
>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>> Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>> To: Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>>>>>> Cc: Herman van Hovell <her...@databricks.com.invalid>; Spark dev
>>>>>>>>>> list <dev@spark.apache.org>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <
>>>>>>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 25, 2024 at 14:48 Herman van Hovell
>>>>>>>>>>> <her...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the
>>>>>>>>>>>> default API in Spark 4.0".
>>>>>>>>>>>>
>>>>>>>>>>>> The rationale for this change is that Spark Connect brings a
>>>>>>>>>>>> lot of improvements with respect to simplicity, stability, 
>>>>>>>>>>>> isolation,
>>>>>>>>>>>> upgradability, and extensibility (all detailed in the SPIP). In a 
>>>>>>>>>>>> nutshell:
>>>>>>>>>>>> we want to introduce a flag, spark.api.mode, that allows a
>>>>>>>>>>>> user to choose between classic or connect mode, the default
>>>>>>>>>>>> being connect. A user can easily fall back to classic by
>>>>>>>>>>>> setting spark.api.mode to classic.
>>>>>>>>>>>>
>>>>>>>>>>>> SPIP:
>>>>>>>>>>>> https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>
>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Herman
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>>>>>>> Norge
>>>>>>>>>>>
>>>>>>>>>>> +47 480 94 297
>>>>>>>>>>>
>>>>>>>>>>
