Re: SPIP: Burst-aware memoryOverhead allocation algorithm for Spark@K8S

Nan Zhu Thu, 11 Dec 2025 18:23:57 -0800

yes, that's the worst case in the scenario, please check my earlier
response to Qiegang's question, we have a set of strategies adopted in prod
to mitigate the issue


On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:

> Thanks for the explanation! So the executor is not guaranteed to get 50 GB
> physical memory, right? All pods on the same host may reach peak memory
> usage at the same time and cause paging/swapping which hurts performance?
>
> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>
>> np, let me try to explain
>>
>> 1. Each executor container will be run in a pod together with some other
>> sidecar containers taking care of tasks like authentication, etc. , for
>> simplicity, we assume each pod has only one container which is the executor
>> container
>>
>> 2. Each container is assigned with two values, r*equest&limit** (limit
>> >= request),* for both of CPU/memory resources (we only discuss memory
>> here). Each pod will have request/limit values as the sum of all containers
>> belonging to this pod
>>
>> 3. K8S Scheduler chooses a machine to host a pod based on *request*
>> value, and cap the resource usage of each container based on their
>> *limit* value, e.g. if I have a pod with a single container in it , and
>> it has 1G/2G as request and limit value respectively, any machine with 1G
>> free RAM space will be a candidate to host this pod, and when the container
>> use more than 2G memory, it will be killed by cgroup oomkiller. Once a pod
>> is scheduled to a host, the memory space sized at "sum of all its
>> containers' request values" will be booked exclusively for this pod.
>>
>> 4. By default, Spark *sets request/limit as the same value for executors
>> in k8s*, and this value is basically spark.executor.memory +
>> spark.executor.memoryOverhead in most cases . However,
>> spark.executor.memoryOverhead usage is very bursty, the user setting
>> spark.executor.memoryOverhead as 10G usually means each executor only needs
>> 10G in a very small portion of the executor's whole lifecycle
>>
>> 5. The proposed SPIP is essentially to decouple request/limit value in
>> spark@k8s for executors in a safe way (this idea is from the bytedance
>> paper we refer to in SPIP paper).
>>
>> Using the aforementioned example ,
>>
>> if we have a single node cluster with 100G RAM space, we have two pods
>> requesting 40G + 10G (on-heap + memoryOverhead) and we set bursty factor to
>> 1.2, without the mechanism proposed in this SPIP, we can at most host 2
>> pods with this machine, and because of the bursty usage of that 10G space,
>> the memory utilization would be compromised.
>>
>> When applying the burst-aware memory allocation, we only need 40 + 10 -
>> min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G free
>> memory space left in the machine which can be used to host some smaller
>> pods. At the same time, as we didn't change the limit value of the executor
>> pods, these executors can still use 50G at max.
>>
>>
>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>
>>> Sorry I'm not very familiar with the k8s infra, how does it work under
>>> the hood? The container will adjust its system memory size depending on the
>>> actual memory usage of the processes in this container?
>>>
>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>
>>>> yeah, we have a few cases that we have significantly larger O than
>>>> H, the proposed algorithm is actually a great fit
>>>>
>>>> as I explained in SPIP doc Appendix C, the proposed algorithm will
>>>> allocate a non-trivial G to ensure the safety of running but still cut a
>>>> big chunk of memory (10s of GBs) and treat them as S , saving tons of money
>>>> burnt by them
>>>>
>>>> but regarding native accelerators, some native acceleration engines do
>>>> not use memoryOverhead but use off-heap (spark.memory.offHeap.size)
>>>> explicitly (e.g. Gluten). The current implementation does not cover this
>>>> part , while that will be an easy extension
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> Have you tested in environments where O is bigger than H? Wondering if
>>>>> the proposed algorithm would help more in those environments (eg. with
>>>>> native accelerators)?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi, Qiegang, thanks for the good questions as well
>>>>>>
>>>>>> please check the following answer
>>>>>>
>>>>>> > My initial understanding is that Kubernetes will use the Executor
>>>>>> Memory Request (H + G) for scheduling decisions, which allows for
>>>>>> better resource packing.
>>>>>>
>>>>>> yes, your understanding is correct
>>>>>>
>>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>>> potential usage  sum of H+G+S across all pods on a node exceeds its
>>>>>> allocatable capacity? Does the proposal implicitly rely on the cluster
>>>>>> operator to manually ensure an unrequested memory buffer exists on the 
>>>>>> node
>>>>>> to serve as the shared pool?
>>>>>>
>>>>>> in PINS, we basically apply a set of strategies, setting conservative
>>>>>> bursty factor, progressive rollout, monitor the cluster metrics like 
>>>>>> Linux
>>>>>> Kernel OOMKiller occurrence to guide us to the optimal setup of bursty
>>>>>> factor... in usual, K8S operators will set a reserved space for daemon
>>>>>> processes on each host, we found it is sufficient to in our case and our
>>>>>> major tuning focuses on bursty factor value
>>>>>>
>>>>>>
>>>>>> > Have you considered scheduling optimizations to ensure a strategic
>>>>>> mix of executors with large S and small S values on a single node?  I am
>>>>>> wondering if this would reduce the probability of concurrent bursting and
>>>>>> host-level OOM.
>>>>>>
>>>>>> Yes, when we work on this project, we put some attention on the
>>>>>> cluster scheduling policy/behavior... two things we mostly care about
>>>>>>
>>>>>> 1. as stated in the SPIP doc, the cluster should have certain level
>>>>>> of diversity of workloads so that we have enough candidates to form a 
>>>>>> mixed
>>>>>> set of executors with large S and small S values
>>>>>>
>>>>>> 2. we avoid using binpack scheduling algorithm which tends to pack
>>>>>> more pods from the same job to the same host, which can create troubles 
>>>>>> as
>>>>>> they are more likely to ask for max memory at the same time
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>
>>>>>>> My initial understanding is that Kubernetes will use the Executor
>>>>>>> Memory Request (H + G) for scheduling decisions, which allows for
>>>>>>> better resource packing.  I have a few questions regarding the
>>>>>>> shared portion S:
>>>>>>>
>>>>>>>    1. How is the risk of host-level OOM mitigated when the total
>>>>>>>    potential usage  sum of H+G+S across all pods on a node exceeds its
>>>>>>>    allocatable capacity? Does the proposal implicitly rely on the 
>>>>>>> cluster
>>>>>>>    operator to manually ensure an unrequested memory buffer exists on 
>>>>>>> the node
>>>>>>>    to serve as the shared pool?
>>>>>>>    2. Have you considered scheduling optimizations to ensure a
>>>>>>>    strategic mix of executors with large S and small S values on a
>>>>>>>    single node?  I am wondering if this would reduce the probability of
>>>>>>>    concurrent bursting and host-level OOM.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>
>>>>>>>>    - Is the memory overhead off-heap? The formular indicates a
>>>>>>>>    fixed heap size, and memory overhead can't be dynamic if it's 
>>>>>>>> on-heap.
>>>>>>>>    - Do Spark applications have static profiles? When we submit
>>>>>>>>    stages, the cluster is already allocated, how can we change 
>>>>>>>> anything?
>>>>>>>>    - How do we assign the shared memory overhead? Fairly among all
>>>>>>>>    applications on the same physical node?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> we didn't separate the design into another doc since the main idea
>>>>>>>>> is relatively simple...
>>>>>>>>>
>>>>>>>>> for request/limit calculation, I described it in Q4 of the SPIP
>>>>>>>>> doc
>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>
>>>>>>>>> it is calculated based on per profile (you can say it is based on
>>>>>>>>> per stage), when the cluster manager compose the pod spec, it 
>>>>>>>>> calculates
>>>>>>>>> the new memory overhead based on what user asks for in that resource 
>>>>>>>>> profile
>>>>>>>>>
>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Do we have a design sketch? How to determine the memory request
>>>>>>>>>> and limit? Is it per stage or per executor?
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> yeah, the implementation is basically relying on the
>>>>>>>>>>> request/limit concept in K8S, ...
>>>>>>>>>>>
>>>>>>>>>>> but if there is any other cluster manager coming in future,  as
>>>>>>>>>>> long as it has a similar concept , it can leverage this easily as 
>>>>>>>>>>> the main
>>>>>>>>>>> logic is implemented in ResourceProfile
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>>>>>> containers to have dynamic resources?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>>>>>> algorithm for Spark@K8S to improve memory utilization of
>>>>>>>>>>>>> spark clusters.
>>>>>>>>>>>>> Please see more details in SPIP doc
>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>> Feedbacks and discussions are welcomed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Chao for being shepard of this feature.
>>>>>>>>>>>>> Also want to thank the authors of the original paper
>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>>> ByteDance, specifically Rui([email protected]) and Yixin(
>>>>>>>>>>>>> [email protected]).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: SPIP: Burst-aware memoryOverhead allocation algorithm for Spark@K8S

Reply via email to