Re: [Spark-Core] Improving Reliability of spark when Executors OOM
Oh, interesting solution. A co-worker was suggesting something similar using resource profiles to increase memory -- but your approach avoids a lot of complexity; I like it (and we could extend it to support resource profile growth too). I think an SPIP sounds like a great next step.

On Tue, Jan 16, 2024 at 10:46 PM kalyan wrote:
> Hello All,
>
> At Uber, we recently did some work on improving the reliability of Spark
> applications in scenarios where fatter executors go out of memory and
> cause application failure. Fatter executors are those that run more than
> one task concurrently at a given time. This has significantly improved
> the reliability of many Spark applications for us at Uber. We recently
> wrote a blog about this. Link:
> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>
> At a high level, we made the changes below:
>
> 1. When a task fails with an executor OOM, we update the task's core
> requirement to the maximum executor cores.
> 2. When the task is picked for rescheduling, the new attempt lands on an
> executor where no other task can run concurrently; all cores are
> allocated to this task itself.
> 3. This way we ensure that the configured memory is completely at the
> disposal of a single task, eliminating memory contention.
>
> The best part of this solution is that it is reactive: it kicks in only
> when an executor fails with an OOM exception.
>
> We understand that the problem statement is very common, and we expect
> our solution to be effective in many cases.
>
> There could be more cases to cover. An executor failing with OOM is a
> hard signal. The framework (making the driver aware of what's happening
> on the executor) can be extended to handle other forms of memory
> pressure, such as excessive spilling to disk.
>
> While we developed this in-house on Spark 2.4.3, we would like to
> collaborate and contribute this work to the latest versions of Spark.
>
> What is the best way forward here? Would an SPIP proposal detailing the
> changes help?
>
> Regards,
> Kalyan.
> Uber India.
[Spark-Core] Improving Reliability of spark when Executors OOM
Hello All,

At Uber, we recently did some work on improving the reliability of Spark applications in scenarios where fatter executors go out of memory and cause application failure. Fatter executors are those that run more than one task concurrently at a given time. This has significantly improved the reliability of many Spark applications for us at Uber. We recently wrote a blog about this. Link: https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/

At a high level, we made the changes below:

1. When a task fails with an executor OOM, we update the task's core requirement to the maximum executor cores.
2. When the task is picked for rescheduling, the new attempt lands on an executor where no other task can run concurrently; all cores are allocated to this task itself.
3. This way we ensure that the configured memory is completely at the disposal of a single task, eliminating memory contention.

The best part of this solution is that it is reactive: it kicks in only when an executor fails with an OOM exception.

We understand that the problem statement is very common, and we expect our solution to be effective in many cases.

There could be more cases to cover. An executor failing with OOM is a hard signal. The framework (making the driver aware of what's happening on the executor) can be extended to handle other forms of memory pressure, such as excessive spilling to disk.

While we developed this in-house on Spark 2.4.3, we would like to collaborate and contribute this work to the latest versions of Spark.

What is the best way forward here? Would an SPIP proposal detailing the changes help?

Regards,
Kalyan.
Uber India.
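The three steps above amount to a small scheduling rule: on an OOM failure, inflate the task's core requirement to the executor's full core count so the retry runs alone. Here is a minimal, self-contained Python model of that reactive behaviour (the names `Task`, `Executor`, and `reschedule_after_oom` are illustrative, not Spark internals):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: int
    cores_required: int = 1  # normal tasks need one core each

@dataclass
class Executor:
    max_cores: int
    running: list = field(default_factory=list)  # tasks currently scheduled

    def free_cores(self):
        return self.max_cores - sum(t.cores_required for t in self.running)

def reschedule_after_oom(task, executor):
    """Step 1: on OOM, grow the task's core requirement to all executor cores.
    Steps 2-3: the retried attempt is only placed once the executor is
    completely idle, so the whole configured executor memory belongs to
    this single task and memory contention is eliminated."""
    task.cores_required = executor.max_cores          # step 1
    if executor.free_cores() == executor.max_cores:   # step 2: exclusive placement
        executor.running.append(task)
        return True
    return False  # wait for the executor to drain its other tasks

# Usage: a 4-core "fat" executor; task 7 OOMed and is retried exclusively.
ex = Executor(max_cores=4)
t = Task(task_id=7)
assert reschedule_after_oom(t, ex)
assert ex.free_cores() == 0  # no cores left, so no concurrent tasks
```

The key design point (per the post) is that the resizing is purely reactive: nothing changes until an OOM is actually observed, so well-behaved jobs pay no cost.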
Re: Dynamic resource allocation for structured streaming [SPARK-24815]
Hi,

This is my first time using the dev mailing list, so I hope this is the correct way to do it. I would like to lend my support to this proposal and offer my experiences as a consumer of Spark, and specifically Spark Structured Streaming (SSS). I am more of a cloud infrastructure DevOps engineer than a Spark/Scala coder.

Over the last couple of years I have been a member of a team that has built a banking application on top of SSS, Kafka, and microservices. We currently run about 40 SSS apps 24x7. The load on the jobs fluctuates throughout the day based on customer activity, and overnight there is a large amount of data that comes from core banking batch runs.

We have been down the path of trying to make DRA work within our Spark infrastructure, and it has taken a long time to properly understand that the existing DRA mechanisms in Spark are mostly useless for SSS. We chased dynamic allocation for some time until we finally realised it is focussed on batch jobs and would not work properly with our SSS jobs (documentation relating to SSS and DRA is sparse to non-existent, and it was not clear at first that the DRA behaviour which is well documented isn't relevant to SSS). Most of our jobs have enough data flow that they never hit the idle timeout that governs standard DRA. Those that do have low data flow would tend to cause cluster flapping, as scaling would take longer than processing the data itself.

Eventually we landed on the best stability and performance compromise by completely disabling all DRA and deploying our SSS apps at a static size whose resourcing can cope with daily peaks and the overnight batch load. Obviously this means that for much of the day the deployed apps run heavily over-provisioned. Proper DRA that is built to work with SSS would be a massive money saver for us.
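For context, the batch-oriented DRA described above is driven by standard Spark configuration keys like the following (a spark-defaults.conf sketch; values are illustrative). The idle-timeout mechanism is exactly what a continuously loaded streaming query never triggers:

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         20
# Executors are released only after sitting idle for this long -- a
# condition a 24x7 streaming job with steady input rarely meets.
spark.dynamicAllocation.executorIdleTimeout  60s
# Required so shuffle files survive executor removal.
spark.shuffle.service.enabled                true
```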
To me it seems that Pavan has a very good understanding of the same sort of issues that we have found, and he seems to have a working solution (I'm sure I read that he has his code in place and working successfully for his organisation). I think it would be a great thing to get some form of DRA in place for SSS, even if it is rudimentary in form, as it would be a definite step up from what is essentially zero support for 24x7-style SSS apps.

If there is more that I can do to support this initiative and get this code included in an official Spark release, please let me know.

Regards,
Adam Hobbs