Dongjoon, nobody is saying that RDD should not be part of the public API.
It is very important to understand the distinction here: changing the default
API mode is not the same as removing RDD support.

I've articulated this before and will try again. Existing workloads that
require RDDs are very much supported: they set the Spark conf for the API
mode, just as any application deployment sets the other Spark confs it needs
to run correctly.

The point of making Spark Connect the default is guidance: it gives future
developers and users of Spark a path where they leverage the declarative
interface by default.

I would really like to look at this proposal as a forward-looking decision
that aims to ease the life of Spark users with better classpath isolation,
better upgrade behavior, and better application integration. The goal is to
optimize for the new users and workloads that will come over time, while
allowing all existing workloads to run by setting exactly one Spark conf.
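
To make the "exactly one Spark conf" point concrete, here is a rough sketch of
what an existing RDD job could look like when pinned to the classic API. The
conf name spark.api.mode and the classic/connect values are the ones from the
SPIP; wiring it through the session builder (rather than, say,
spark-submit --conf spark.api.mode=classic at startup) is only an
illustration, not a statement about the final mechanics:

import org.apache.spark.sql.SparkSession

object LegacyRddJob {
  def main(args: Array[String]): Unit = {
    // Illustrative only: pin this application to the classic API mode.
    // spark.api.mode is the flag proposed in the SPIP; how exactly it is
    // supplied at startup may differ in the final implementation.
    val spark = SparkSession.builder()
      .appName("legacy-rdd-job")
      .config("spark.api.mode", "classic")
      .getOrCreate()

    // The existing RDD code itself does not change under classic mode.
    val counts = spark.sparkContext
      .parallelize(Seq("spark", "connect", "spark"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}

Nothing in the job's logic changes; only the startup configuration differs,
which is exactly the migration story Herman and I have been describing.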


On Sat, Dec 14, 2024 at 04:22 Ángel <angel.alvarez.pas...@gmail.com> wrote:

> -1
>
>
> El sáb, 14 dic 2024 a las 1:36, Dongjoon Hyun (<dongjoon.h...@gmail.com>)
> escribió:
>
>> For the RDD part, I also disagree with Martin.
>> I believe RDD should be supported permanently as the public API.
>> Otherwise, it would be a surprise to me and my colleagues at least.
>>
>> >  I would assume that we all agree that
>> > 99% of the _new_ users in Spark should not try to write code in RDDs.
>>
>> Given this long discussion,
>> I have also decided to switch my vote from +1 to -1,
>> because it seems too early to make this decision
>> given the pending `Spark Connect` work and active discussion.
>> Previously, I was too focused on only the SQL part.
>>
>> As a side note, I hope the Apache Spark 4.0.0 release is not going
>> to be blocked by the pending `Spark Connect` work and decision.
>>
>> Dongjoon.
>>
>> On Tue, Dec 3, 2024 at 7:51 PM Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>>>
>>>
>>> On Fri, Nov 29, 2024 at 12:24 AM Martin Grund <mar...@databricks.com>
>>> wrote:
>>>
>>>> At the risk of repeating what Herman said word for word :) I would
>>>> like to call out the following:
>>>>
>>>>    1. The goal of setting the default is to guide users to use the
>>>>    Spark SQL APIs that have proven themselves over time. We shouldn't
>>>>    underestimate the power of the default. I would assume that we all
>>>>    agree that 99% of the _new_ users in Spark should not try to write
>>>>    code in RDDs.
>>>>
>>> I would disagree here. Maybe like 75%.
>>>
>>>>
>>>>    2. Any user, organization, or vendor can leverage *all* of their
>>>>    existing code by simply changing *one* configuration during
>>>>    startup: switching spark.api.mode to classic (similar to ANSI
>>>>    mode). This means all existing RDD and library code just works fine.
>>>>
>>>>    3. Creating a fractured user experience by using some logic to
>>>>    identify which API mode is used is not ideal. Many of the use cases
>>>>    that I've seen that require additional jars (e.g., data sources,
>>>>    drivers) just work fine because Spark already has the right
>>>>    abstractions. JARs used in the client-side part of the code just
>>>>    work as well, as Herman said.
>>>>
>>> Introducing a config flag that defaults to a limited API already creates
>>> a fractured user experience, where an application may fail partway
>>> through running.
>>>
>>>>
>>>> Similarly, based on the experience of running Spark Connect in
>>>> production, the coexistence of workloads running in classic mode and
>>>> connect mode works fine.
>>>>
>>>>
>>> I still don’t like the “classic” naming (maybe “full” and “restricted” instead).
>>>
>>>>
>>>>
>>>> On Fri, Nov 29, 2024 at 3:18 AM Holden Karau <holden.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> I would switch to +0 if connect were the default only for apps without
>>>>> any user-provided jars or for non-JVM apps.
>>>>>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>>
>>>>>
>>>>> On Thu, Nov 28, 2024 at 6:11 PM Holden Karau <holden.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Given there is no plan to support RDDs, I’ll update to -0.9.
>>>>>>
>>>>>>
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> Pronouns: she/her
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 28, 2024 at 6:00 PM Herman van Hovell <
>>>>>> her...@databricks.com> wrote:
>>>>>>
>>>>>>> Hi Holden and Mridul,
>>>>>>>
>>>>>>> Just to be clear: what API parity are you expecting here? We have
>>>>>>> parity for everything that is exposed in org.apache.spark.sql.
>>>>>>> Connect does not support RDDs, SparkContext, etc. There are
>>>>>>> currently no plans to support these. We are considering adding a
>>>>>>> compatibility layer, but that will be limited in scope. From running
>>>>>>> Connect in production for the last year, we see that most users can
>>>>>>> migrate their workloads without any problems.
>>>>>>>
>>>>>>> I do want to call out that this proposal is mostly aimed at how new
>>>>>>> users will interact with Spark. Existing users, when they migrate
>>>>>>> their application to Spark 4, only have to set a conf if it turns out
>>>>>>> their application does not work. This should be a minor inconvenience
>>>>>>> compared to the headaches that a new Scala version or other library
>>>>>>> upgrades can cause.
>>>>>>>
>>>>>>> Since this is a breaking change, I do think this should be done in a
>>>>>>> major version.
>>>>>>>
>>>>>>> At the risk of repeating the SPIP: using Connect as the default
>>>>>>> brings a lot to the table (e.g. simplicity, easier upgrades,
>>>>>>> extensibility, etc.), so I'd urge you to also factor this into your
>>>>>>> decision making.
>>>>>>>
>>>>>>> Happy thanksgiving!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Herman
>>>>>>>
>>>>>>> On Thu, Nov 28, 2024 at 8:43 PM Mridul Muralidharan <
>>>>>>> mri...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>   I agree with Holden; I am leaning -1 on the proposal as well.
>>>>>>>> Unlike the removal of deprecated features, which we align on a major
>>>>>>>> version boundary, changing the default is something we can do in a
>>>>>>>> minor version as well - once there is API parity.
>>>>>>>>
>>>>>>>> Irrespective of which major/minor version we make the switch in -
>>>>>>>> there could be user impact; minimizing this impact would be greatly
>>>>>>>> appreciated by our users.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 27, 2024 at 8:31 PM Holden Karau <
>>>>>>>> holden.ka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> -0.5: I don’t think this is a good idea for JVM apps until we have
>>>>>>>>> API parity. (Binding, but to be clear, not a veto.)
>>>>>>>>>
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>> Pronouns: she/her
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 27, 2024 at 6:27 PM Xinrong Meng <xinr...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Thank you Herman!
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 28, 2024 at 3:37 AM Dongjoon Hyun <
>>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 27, 2024 at 09:16 Denny Lee <denny.g....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 27, 2024 at 3:07 AM Martin Grund
>>>>>>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> As part of the discussion on this topic, I would love to
>>>>>>>>>>>>> highlight the work that the community is currently doing to
>>>>>>>>>>>>> support SparkML, which is traditionally very RDD-heavy, natively
>>>>>>>>>>>>> in Spark Connect. Bobby's awesome work shows that, over time, we
>>>>>>>>>>>>> can extend the features of Spark Connect and support workloads
>>>>>>>>>>>>> that we previously thought could not be supported easily.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/spark/pull/48791
>>>>>>>>>>>>>
>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 27, 2024 at 11:42 AM Yang,Jie(INF)
>>>>>>>>>>>>> <yangji...@baidu.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>>>>> From: Hyukjin Kwon<gurwls...@apache.org>
>>>>>>>>>>>>>> Date: 2024-11-27 08:04:06
>>>>>>>>>>>>>> Subject: [External Mail] Re: Spark Connect the default API in Spark 4.0
>>>>>>>>>>>>>> To: Bjørn Jørgensen<bjornjorgen...@gmail.com>;
>>>>>>>>>>>>>> Cc: Herman van Hovell<her...@databricks.com.invalid>;Spark
>>>>>>>>>>>>>> dev list<dev@spark.apache.org>;
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 25 Nov 2024 at 23:33, Bjørn Jørgensen <
>>>>>>>>>>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> man. 25. nov. 2024 kl. 14:48 skrev Herman van Hovell
>>>>>>>>>>>>>>> <her...@databricks.com.invalid>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to start a discussion on "Spark Connect the
>>>>>>>>>>>>>>>> default API in Spark 4.0".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The rationale for this change is that Spark Connect brings
>>>>>>>>>>>>>>>> a lot of improvements with respect to simplicity, stability,
>>>>>>>>>>>>>>>> isolation, upgradability, and extensibility (all detailed in
>>>>>>>>>>>>>>>> the SPIP). In a nutshell: we want to introduce a flag,
>>>>>>>>>>>>>>>> spark.api.mode, that allows a user to choose between classic
>>>>>>>>>>>>>>>> or connect mode, the default being connect. A user can easily
>>>>>>>>>>>>>>>> fall back to classic by setting spark.api.mode to classic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> SPIP:
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1C0kuQEliG78HujVwdnSk0wjNwHEDdwo2o8aVq7kbhTo/edit?tab=t.0#heading=h.r2c3xrbiklu3
>>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-50411
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am looking forward to your feedback!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Herman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Bjørn Jørgensen
>>>>>>>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>>>>>>>>>>> Norge
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +47 480 94 297
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
