Hey,

Running out of disk space is also a reliability concern, but it might need a different strategy to handle. As suggested by Mridul, I am working on making things more configurable in a separate (new) module… with that, we can plug in a new rule for each type of error.
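For illustration, a minimal sketch of what such pluggable, per-error-type rules could look like (all names below are hypothetical, not an existing Spark API):

    // Hypothetical sketch only: one rule per failure category, each mapping
    // a task failure to a remediation action.
    sealed trait Remediation
    case object NoAction extends Remediation
    case class ResizeTaskCores(cores: Int) extends Remediation
    case class IncreaseLocalDisk(extraGb: Int) extends Remediation

    trait FailureRule {
      // Return Some(remediation) if this rule recognizes the failure.
      def evaluate(error: Throwable): Option[Remediation]
    }

    object OomRule extends FailureRule {
      def evaluate(error: Throwable): Option[Remediation] = error match {
        case _: OutOfMemoryError => Some(ResizeTaskCores(Int.MaxValue)) // clamped to executor cores
        case _                   => None
      }
    }

    object DiskFullRule extends FailureRule {
      def evaluate(error: Throwable): Option[Remediation] = error match {
        case e: java.io.IOException
            if Option(e.getMessage).exists(_.contains("No space left on device")) =>
          Some(IncreaseLocalDisk(extraGb = 10))
        case _ => None
      }
    }

A rule registry would try each rule in order and fall back to NoAction, keeping the scheduler itself free of error-specific logic.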
Regards,
Kalyan.

On Mon, 5 Feb 2024 at 1:10 PM, Jay Han <tunyu...@gmail.com> wrote:

> Hi,
> What about also supporting the disk-space problem of "device space
> isn't enough"? I think it's the same as the OOM exception.
>
> On Sat, Jan 27, 2024 at 1:00 PM, kalyan <justfors...@gmail.com> wrote:
>
>> Hi all,
>>
>> Sorry for the delay in getting the first draft of (my first) SPIP out.
>>
>> https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1
>>
>> Let me know what you think.
>>
>> Regards,
>> kalyan.
>>
>> On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh <asi...@apache.org> wrote:
>>
>>> Hey all,
>>>
>>> Thanks for this discussion, the timing of this couldn't be better!
>>>
>>> At Pinterest, we recently started to look into reducing OOM failures
>>> while also reducing the memory consumption of Spark applications. We
>>> considered the following options:
>>> 1. Changing the core count on an executor to change the memory
>>> available per task in the executor.
>>> 2. Changing the resource profile based on task failures and GC
>>> metrics to grow or shrink executor memory. Today we do this at the
>>> application level, based on the app's past runs.
>>> 3. The K8s vertical pod autoscaler
>>> <https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler>
>>>
>>> Internally, we are mostly aligned on option 2. We would love to make
>>> this happen and are looking forward to the SPIP.
>>>
>>> On Wed, Jan 17, 2024 at 9:34 AM Mridul Muralidharan <mri...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are internally exploring adding support for dynamically changing
>>>> the resource profile of a stage based on runtime characteristics.
>>>> This includes failures due to OOM and the like, slowness due to
>>>> excessive GC, resource wastage due to excessive overprovisioning,
>>>> etc. Essentially, it handles both scale-up and scale-down of
>>>> resources.
>>>> Instead of baking these into the scheduler directly (which is
>>>> already complex), we are modeling it as a plugin - so that the
>>>> "business logic" of how to handle task events and mutate state is
>>>> pluggable.
>>>>
>>>> The main limitation I find with mutating only the cores is the limit
>>>> it places on what kinds of problems can be solved with it - mutating
>>>> resource profiles is a much more natural way to handle this
>>>> (spark.task.cpus predates resource profiles).
>>>>
>>>> Regards,
>>>> Mridul
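For context, the resource-profile mutation Mridul describes builds on Spark's stage-level scheduling API (Spark 3.1+), which already lets a job attach per-stage resource requirements to an RDD. A minimal sketch with illustrative sizes; someRdd and expensiveTransform are placeholders, and dynamic allocation is typically required for a new profile to take effect:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    // Request bigger executors for a memory-hungry stage, and make each
    // task claim all executor cores so it runs alone.
    val execReqs = new ExecutorResourceRequests()
      .cores(4)
      .memory("8g")          // on-heap memory
      .memoryOverhead("2g")  // off-heap / overhead
    val taskReqs = new TaskResourceRequests().cpus(4)

    val profile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build

    // Placeholders: the profile applies to the stages computing this RDD.
    val resized = someRdd.map(expensiveTransform).withResources(profile)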
>>>> On Wed, Jan 17, 2024 at 9:18 AM Tom Graves
>>>> <tgraves...@yahoo.com.invalid> wrote:
>>>>
>>>>> It is interesting. I think there are definitely some discussion
>>>>> points around this. Reliability vs. performance is always a
>>>>> trade-off, and it's great that the job doesn't fail, but if it no
>>>>> longer meets someone's SLA that could be just as bad, especially if
>>>>> it's hard to figure out why. I think if something like this kicks
>>>>> in, it needs to be very obvious to the user so they can see that it
>>>>> occurred. Do you have something in place in the UI, or something
>>>>> else, that indicates this? The nice thing is also that you aren't
>>>>> wasting memory by increasing it for all tasks when maybe you only
>>>>> need it for one or two. The downside is that you only find out
>>>>> after a failure.
>>>>>
>>>>> I do also worry a little bit that in your blog post, the error you
>>>>> pointed out isn't a Java OOM but an off-heap memory issue (overhead
>>>>> + heap usage). You don't really address heap vs. off-heap memory in
>>>>> that article; the only thing I see mentioned is
>>>>> spark.executor.memory, which is heap memory. Obviously, adjusting
>>>>> to run only one task is going to give that task more overall
>>>>> memory, but the reasons it runs out in the first place could
>>>>> differ. If it were on-heap memory, for instance, with more tasks I
>>>>> would expect to see more GC rather than an executor OOM. If you are
>>>>> getting an executor OOM, you are likely using more off-heap memory,
>>>>> stack space, etc. than you allocated. Ultimately it would be nice
>>>>> to know why that is happening and see if we can address it so it
>>>>> doesn't fail in the first place. That could be extremely difficult,
>>>>> though, especially if software outside Spark is using that memory.
>>>>>
>>>>> As Holden said, we need to make sure this plays nicely with
>>>>> resource profiles, or potentially see whether we can use the
>>>>> resource-profile functionality itself. Theoretically you could
>>>>> extend this to try to get a new executor when using dynamic
>>>>> allocation, for instance.
>>>>>
>>>>> I agree a SPIP would be a good place to start to have more
>>>>> discussion.
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan
>>>>> <justfors...@gmail.com> wrote:
>>>>>
>>>>> Hello All,
>>>>>
>>>>> At Uber, we recently did some work on improving the reliability of
>>>>> Spark applications in scenarios where fatter executors go out of
>>>>> memory, leading to application failure. Fatter executors are those
>>>>> that run more than one task concurrently at a given time. This has
>>>>> significantly improved the reliability of many Spark applications
>>>>> for us at Uber. We recently published a blog post about this:
>>>>> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>>>>>
>>>>> At a high level, we made the following changes:
>>>>>
>>>>> 1. When a task fails with an executor OOM, we update the task's
>>>>> core requirement to the maximum executor cores.
>>>>> 2. When the task is picked for rescheduling, the new attempt lands
>>>>> on an executor where no other task can run concurrently; all cores
>>>>> are allocated to this one task.
>>>>> 3. This way we ensure that the configured memory is entirely at the
>>>>> disposal of a single task, eliminating memory contention.
>>>>>
>>>>> The best part of this solution is that it's reactive: it kicks in
>>>>> only when an executor fails with an OOM exception.
>>>>>
>>>>> We understand that the problem statement is very common, and we
>>>>> expect our solution to be effective in many cases.
>>>>>
>>>>> More cases could be covered. An executor failing with OOM is a hard
>>>>> signal. The framework (making the driver aware of what's happening
>>>>> on the executor) can be extended to handle other forms of memory
>>>>> pressure, such as excessive spilling to disk.
>>>>>
>>>>> While we developed this in-house on Spark 2.4.3, we would like to
>>>>> collaborate and contribute this work to the latest versions of
>>>>> Spark.
>>>>>
>>>>> What is the best way forward here? Would an SPIP proposal detailing
>>>>> the changes help?
>>>>>
>>>>> Regards,
>>>>> Kalyan.
>>>>> Uber India.
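To make the reactive flow above concrete: the OOM signal in step 1 can already be observed from the driver with Spark's public listener API, while the remediation hook is the part the proposal would add. A minimal sketch (markForExclusiveCores is hypothetical):

    import org.apache.spark.ExceptionFailure
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Detection uses the existing SparkListener API; the remediation is a
    // hypothetical stand-in for the proposed scheduler change.
    class OomRemediationListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        taskEnd.reason match {
          case ef: ExceptionFailure if ef.className.contains("OutOfMemoryError") =>
            markForExclusiveCores(taskEnd.stageId, taskEnd.taskInfo.index)
          case _ => // not an OOM task failure (note: executor-level OOM kills
                    // may instead surface as ExecutorLostFailure)
        }

      // Placeholder: the proposal would raise the task's core requirement to
      // spark.executor.cores so its next attempt runs alone on an executor.
      private def markForExclusiveCores(stageId: Int, taskIndex: Int): Unit =
        println(s"Would resize task $taskIndex of stage $stageId to all executor cores")
    }

    // Registration on the driver:
    // spark.sparkContext.addSparkListener(new OomRemediationListener)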