Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-04 Thread Jay Han
Hi,
What about also supporting a fix for the disk space problem of "device space
isn't enough"? I think it's the same kind of failure as an OOM exception.

On Sat, Jan 27, 2024 at 1:00 PM kalyan wrote:

> Hi all,
>
> Sorry for the delay in getting the first draft of (my first) SPIP out.
>
> https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1
>
> Let me know what you think.
>
> Regards
> kalyan.
>
> On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh  wrote:
>
>> Hey all,
>>
>> Thanks for this discussion, the timing of this couldn't be better!
>>
>> At Pinterest, we recently started to look into reducing OOM failures
>> while also reducing the memory consumption of Spark applications. We
>> considered the following options:
>> 1. Changing the core count on the executor to change the memory available
>> per task in the executor.
>> 2. Changing the resource profile based on task failures and GC metrics to
>> grow or shrink the executor memory size. Today we do this at the
>> application level, based on the app's past runs.
>> 3. K8s vertical pod autoscaler.
>> 
>>
>> Internally, we are mostly getting aligned on option 2. We would love to
>> make this happen and are looking forward to the SPIP.
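
For illustration only, option 2 above (growing or shrinking executor memory from the signals of a past run) could be sketched roughly as below. This is not Pinterest's implementation or a Spark API: the PastRun fields, the thresholds, and the grow/shrink factors are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PastRun:
    """Summary of one earlier run of the same application (hypothetical schema)."""
    executor_memory_mb: int
    oom_task_failures: int
    gc_time_fraction: float  # share of task time spent in GC, 0.0 to 1.0


def next_executor_memory_mb(run: PastRun,
                            grow_factor: float = 1.5,
                            shrink_factor: float = 0.8,
                            high_gc: float = 0.10,
                            low_gc: float = 0.02,
                            floor_mb: int = 1024) -> int:
    """Suggest executor memory for the next run (all thresholds are assumed)."""
    if run.oom_task_failures > 0 or run.gc_time_fraction > high_gc:
        # Signs of memory pressure: grow for the next run.
        return int(run.executor_memory_mb * grow_factor)
    if run.gc_time_fraction < low_gc:
        # Very little GC activity: probably overprovisioned, so shrink,
        # but never below the configured floor.
        return max(floor_mb, int(run.executor_memory_mb * shrink_factor))
    # Neither signal fired: keep the current size.
    return run.executor_memory_mb


if __name__ == "__main__":
    print(next_executor_memory_mb(PastRun(8192, oom_task_failures=3, gc_time_fraction=0.05)))
    print(next_executor_memory_mb(PastRun(8192, oom_task_failures=0, gc_time_fraction=0.01)))
```

A real policy would probably look at more than one past run and at off-heap usage as well; the point here is only the shape of the decision.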
>>
>>
>> On Wed, Jan 17, 2024 at 9:34 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>   We are internally exploring adding support for dynamically changing
>>> the resource profile of a stage based on runtime characteristics.
>>> This includes failures due to OOM and the like, slowness due to
>>> excessive GC, resource wastage due to excessive overprovisioning, etc.
>>> Essentially, it handles scaling resources up and down.
>>> Instead of baking these into the scheduler directly (which is already
>>> complex), we are modeling it as a plugin - so that the 'business logic' of
>>> how to handle task events and mutate state is pluggable.
>>>
>>> The main limitation I find with mutating only the cores is the limits it
>>> places on what kind of problems can be solved with it - and mutating
>>> resource profiles is a much more natural way to handle this
>>> (spark.task.cpus predates RP).
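
As a rough illustration of the plugin idea described above, the 'business logic' could sit behind a small interface that receives task events and optionally returns a new profile for the stage. Everything in this sketch is hypothetical: Profile, TaskEvent, ResizePolicy, and GrowOnOom are names invented for the example and are not Spark classes or the design being explored internally.

```python
from dataclasses import dataclass, replace
from typing import Optional, Protocol


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a stage's resource profile."""
    executor_memory_mb: int
    executor_cores: int


@dataclass(frozen=True)
class TaskEvent:
    """Hypothetical task event handed to the plugin."""
    stage_id: int
    failed_with_oom: bool
    gc_time_fraction: float


class ResizePolicy(Protocol):
    """The pluggable part: decide whether a stage's profile should change."""

    def on_task_event(self, event: TaskEvent, current: Profile) -> Optional[Profile]:
        """Return a new profile for the stage, or None to leave it unchanged."""
        ...


class GrowOnOom:
    """Example policy: multiply executor memory after an OOM, up to a cap."""

    def __init__(self, factor: float = 2.0, max_memory_mb: int = 65536) -> None:
        self.factor = factor
        self.max_memory_mb = max_memory_mb

    def on_task_event(self, event: TaskEvent, current: Profile) -> Optional[Profile]:
        if not event.failed_with_oom:
            return None
        grown = min(self.max_memory_mb, int(current.executor_memory_mb * self.factor))
        if grown == current.executor_memory_mb:
            return None  # already at the cap, nothing to change
        return replace(current, executor_memory_mb=grown)


def apply_policy(policy: ResizePolicy, event: TaskEvent, current: Profile) -> Profile:
    """What a scheduler hook might do: ask the plugin, else keep the profile."""
    proposed = policy.on_task_event(event, current)
    return proposed if proposed is not None else current
```

Because the policy returns a whole profile rather than a core count, the same hook can express scale down or a change in cores without touching the scheduler side.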
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, Jan 17, 2024 at 9:18 AM Tom Graves 
>>> wrote:
>>>
 It is interesting. I think there are definitely some discussion points
 around this. Reliability vs. performance is always a trade-off. It's great
 that the job doesn't fail, but if it no longer meets someone's SLA, that
 could be just as bad if it's hard to figure out why. I think if something
 like this kicks in, it needs to be very obvious to the user so they can see
 that it occurred. Do you have something in place in the UI, or something
 else that indicates this? The nice thing is also that you aren't wasting
 memory by increasing it for all tasks when maybe you only need it for one
 or two. The downside is that you only find out after a failure.

 I do also worry a little bit that in your blog post, the error you
 pointed out isn't a Java OOM but an off-heap memory issue (overhead + heap
 usage). You don't really address heap memory vs. off-heap memory in that
 article. The only thing I see mentioned is spark.executor.memory, which is
 heap memory. Obviously, adjusting to run only one task is going to give
 that task more overall memory, but the reasons it's running out in the
 first place could be different. If it were on-heap memory, for instance,
 with more tasks I would expect to see more GC and not an executor OOM. If
 you are getting an executor OOM, you are likely using more off-heap memory,
 stack space, etc. than you allocated. Ultimately, it would be nice to know
 why that is happening and see if we can address it so it doesn't fail in
 the first place. That could be extremely difficult though, especially if
 you are using software outside Spark that is using that memory.
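
To make the heap vs. off-heap distinction above concrete, here is a rough sketch of the memory accounting. It assumes the documented defaults for spark.executor.memoryOverhead (executor memory times spark.executor.memoryOverheadFactor, 0.10 for JVM jobs, with a 384 MiB minimum). The function names are made up for this example, and other contributions to the container size, such as spark.memory.offHeap.size or PySpark memory, are deliberately left out.

```python
from typing import Optional


def default_overhead_mb(executor_memory_mb: int,
                        factor: float = 0.10,
                        minimum_mb: int = 384) -> int:
    """Overhead assumed when spark.executor.memoryOverhead is not set."""
    return max(minimum_mb, int(executor_memory_mb * factor))


def container_limit_mb(executor_memory_mb: int,
                       overhead_mb: Optional[int] = None) -> int:
    """Approximate container size: JVM heap plus non-heap overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def overhead_exceeded(non_heap_used_mb: int,
                      executor_memory_mb: int,
                      overhead_mb: Optional[int] = None) -> bool:
    """True if non-heap usage alone no longer fits in the overhead budget.

    Assumes the JVM heap may grow to its full configured size, so whatever
    the process uses outside the heap has to fit in the overhead.
    """
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return non_heap_used_mb > overhead_mb


if __name__ == "__main__":
    print(container_limit_mb(8192))        # heap 8192 MiB plus default overhead
    print(overhead_exceeded(1200, 8192))   # 1200 MiB of non-heap usage
```

Under these assumptions, raising spark.executor.memory alone only helps off-heap usage indirectly, through the 10% factor; setting the overhead explicitly targets it directly.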

 As Holden said, we need to make sure this would play nicely with resource
 profiles, or see whether we can potentially use the resource profile
 functionality. Theoretically, you could extend this to try to get a new
 executor if using dynamic allocation, for instance.

 I agree doing a SPIP would be a good place to start to have more
 discussions.

 Tom

 On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan <
 justfors...@gmail.com> wrote:


 Hello All,

 At Uber, we recently did some work on improving the reliability of Spark
 applications in scenarios where fatter executors run out of memory and
 cause the application to fail. Fatter executors are those that have more
 than one task running on them concurrently at a given time. This has
 significantly improved the reliability of many Spark applications for us at
 Uber. We recently published a blog post about this. Link:
 https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
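
As a back-of-the-envelope view of why a "fatter" executor leaves less memory per task: the number of tasks an executor runs concurrently is spark.executor.cores divided by spark.task.cpus. The sketch below uses that to estimate a per-task share. The even split is a simplifying assumption (Spark does not partition memory strictly per task), and the function names are invented for this example.

```python
def concurrent_task_slots(executor_cores: int, task_cpus: int = 1) -> int:
    """Tasks an executor can run at once: executor cores // cpus per task."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("executor_cores and task_cpus must be positive")
    return executor_cores // task_cpus


def rough_memory_per_task_mb(executor_memory_mb: int,
                             executor_cores: int,
                             task_cpus: int = 1) -> int:
    """Rough per-task share, assuming memory is split evenly across slots."""
    slots = concurrent_task_slots(executor_cores, task_cpus)
    if slots == 0:
        raise ValueError("task_cpus exceeds executor_cores: no task can run")
    return executor_memory_mb // slots


if __name__ == "__main__":
    # Same executor size, fewer concurrent tasks, larger share per task.
    print(rough_memory_per_task_mb(8192, executor_cores=4))
    print(rough_memory_per_task_mb(8192, executor_cores=4, task_cpus=4))
```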

 At a high level, we have done the below 

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Gengliang Wang
+1

On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala  wrote:

> +1
>
> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge  wrote:
>
>> +1
>>
>> John Zhuge
>>
>>
>> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
>>  wrote:
>>
>>> +1
>>>
>>> On Sun, Feb 4, 2024, 8:18 PM Xiao Li 
>>> wrote:
>>>
 +1

 On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:

> +1
>
>
>
> At 2024-02-04 15:26:13, "Dongjoon Hyun" wrote:
>
> +1
>
> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
> wrote:
>
>> +1
>>
>> On 2024/2/4 at 13:13, "Kent Yao" <y...@apache.org> wrote:
>>
>>
>> +1
>>
>>
>> On Sat, Feb 3, 2024 at 21:14, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Hi dev,
>> >
>> > It looks like a huge number of commits (200+) have been pushed to
>> > branch-3.5 since 3.5.0 was released.
>> >
>> > $ git log --oneline v3.5.0..HEAD | wc -l
>> > 202
>> >
>> > Also, 180 JIRA tickets list 3.5.1 as the fix version, and 10 resolved
>> > issues are marked as either blocker (including correctness issues) or
>> > critical, which justifies the release.
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495
>> >
>> > What do you think about releasing 3.5.1 with the current head of
>> > branch-3.5? I'm happy to volunteer as the release manager.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

 --




Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Hussein Awala
+1


Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread John Zhuge
+1

John Zhuge



Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Santosh Pingale
+1


Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Xiao Li
+1


Re:Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread beliefer
+1






