Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread Holden Karau
Oh, interesting solution. A co-worker was suggesting something similar using
resource profiles to increase memory, but your approach avoids a lot of
complexity; I like it (and we could extend it to support resource profile
growth too).

I think an SPIP sounds like a great next step.

On Tue, Jan 16, 2024 at 10:46 PM kalyan  wrote:



-- 
Cell : 425-233-8271


[Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread kalyan
Hello All,

At Uber, we recently did some work on improving the reliability of Spark
applications in scenarios where fatter executors run out of memory, leading
to application failure. Fatter executors are those that run more than one
task concurrently at a given time. This has significantly improved the
reliability of many Spark applications for us at Uber. We recently
published a blog post about this. Link:
https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/

At a high level, we made the following changes:

   1. When a task fails due to an executor OOM, we update the task's core
   requirement to the maximum number of executor cores.
   2. When the task is picked for rescheduling, the new attempt is placed on
   an executor where no other task can run concurrently; all of the
   executor's cores are allocated to this task.
   3. This ensures the configured memory is entirely at the disposal of a
   single task, eliminating memory contention.
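The steps above can be sketched in plain Scala (no Spark dependencies). The
names here (TaskSpec, Executor, rescheduleAfterOom) are hypothetical
stand-ins for illustration, not Spark's actual scheduler classes:

```scala
// Minimal sketch of the OOM-driven core-resizing idea. TaskSpec and
// Executor are simplified models, not Spark internals.
case class Executor(totalCores: Int, freeCores: Int)
case class TaskSpec(id: Int, coresRequired: Int)

object OomResizing {
  // Step 1: on an executor OOM, raise the failed task's core requirement
  // to the executor's full core count.
  def rescheduleAfterOom(task: TaskSpec, executorCores: Int): TaskSpec =
    task.copy(coresRequired = executorCores)

  // Steps 2 and 3: the resized task only fits on an executor with all
  // cores free, so no other task can run alongside it and the executor's
  // entire memory is available to the retry.
  def canSchedule(task: TaskSpec, exec: Executor): Boolean =
    exec.freeCores >= task.coresRequired
}

object Demo {
  def main(args: Array[String]): Unit = {
    val task = TaskSpec(id = 7, coresRequired = 1)
    val retried = OomResizing.rescheduleAfterOom(task, executorCores = 4)
    println(retried.coresRequired)                            // 4
    println(OomResizing.canSchedule(retried, Executor(4, 4))) // true: idle executor
    println(OomResizing.canSchedule(retried, Executor(4, 3))) // false: busy executor
  }
}
```

Because the retried task demands every core, the scheduler's normal
core-accounting is enough to guarantee exclusivity; no new memory-tracking
machinery is needed.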

The best part of this solution is that it's reactive: it kicks in only when
an executor fails with an OOM exception.

We believe this problem statement is very common, and we expect our
solution to be effective in many cases.

There could be more cases that can be covered. An executor failing with OOM
is a hard signal. The framework (making the driver aware of what's
happening on the executor) can be extended to handle other forms of memory
pressure, such as excessive spilling to disk.

While we developed this in-house on Spark 2.4.3, we would like to
collaborate on contributing this work to the latest versions of Spark.

What is the best way forward here? Will an SPIP proposal to detail the
changes help?

Regards,
Kalyan.
Uber India.


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-16 Thread Adam Hobbs
Hi,

This is my first time using the dev mailing list, so I hope this is the
correct way to do it.

I would like to lend my support to this proposal and offer my experience as
a consumer of Spark, specifically Spark Structured Streaming (SSS). I am
more of a cloud infrastructure DevOps engineer than a Spark/Scala coder.

Over the last couple of years I have been a member of a team that has built
a banking application on top of SSS, Kafka and microservices. We currently
run about 40 SSS apps 24x7. The load on the jobs fluctuates throughout the
day based on customer activity, and overnight there is a large amount of
data that comes from core banking batch runs.

We have been down the path of trying to make DRA work within our Spark
infrastructure, and it has taken a long time to properly understand that
the existing DRA mechanisms in Spark are mostly useless for SSS. We chased
dynamic allocation for some time until we finally realised it is focused on
batch jobs and would not work properly with our SSS jobs (documentation
relating to SSS and DRA is sparse to non-existent, and it was not at first
clear that the DRA behaviour that is well documented isn't relevant to
SSS). Most of our jobs have enough data flow that they never hit the idle
timeout that governs standard DRA. Those that do have low data flow tend to
cause cluster flapping, as scaling takes longer than simply processing the
data.
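For context, the standard batch-oriented DRA knobs we tried are along these
lines (illustrative values only; the idle timeout is the setting our
steady micro-batch jobs never trip):

```shell
# Standard dynamic allocation settings. executorIdleTimeout only releases
# an executor after it has been task-free for the whole period, which a
# continuously running micro-batch stream rarely is.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.shuffle.service.enabled=true \
  my_streaming_app.jar
```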

Eventually we landed on the best stability and performance compromise by
completely disabling all DRA and deploying our SSS apps at a static size
whose resourcing can cope with daily peaks and overnight batch load.
Obviously this means that for much of the day the deployed apps run heavily
over-provisioned.

Proper DRA that is built to work with SSS would be a massive money saver for us.

To me it seems that Pavan has a very good understanding of the same sort of
issues we have found, and he seems to have a working solution (I'm sure I
read that his code is in place and working successfully for his
organisation).

I think it would be a great thing to get some form of DRA in place for SSS,
even if it is rudimentary in form, as it would be a definite step up from
the essentially zero support that currently works with 24x7-style SSS apps.

If there is more that I can do to support this initiative and get this code 
included in an official Spark release, please let me know.


Regards,

Adam Hobbs


