Thanks for the explanation! So the executor is not guaranteed to get 50 GB of physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
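To make my concern concrete with the numbers from the thread below (my own
back-of-envelope, not from the SPIP doc): on a 100G node, two executors each
*request* 40G but keep a 50G *limit*, so 20G becomes bookable for smaller
pods; if both executors burst to their 50G limits while the smaller pods use
their own requests, demand is 2 x 50G = 100G plus whatever the small pods
need, which exceeds the node's capacity, so the kernel has to reclaim or
kill something.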
On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:

> np, let me try to explain
>
> 1. Each executor container runs in a pod together with some other sidecar
> containers taking care of tasks like authentication, etc. For simplicity,
> we assume each pod has only one container, which is the executor container.
>
> 2. Each container is assigned two values, *request* and *limit* (limit >=
> request), for both CPU and memory resources (we only discuss memory here).
> Each pod's request/limit values are the sum of those of all containers
> belonging to it.
>
> 3. The K8S scheduler chooses a machine to host a pod based on the
> *request* value, and caps the resource usage of each container based on
> its *limit* value. E.g., if I have a pod with a single container that has
> 1G/2G as request and limit respectively, any machine with 1G of free RAM
> is a candidate to host this pod, and when the container uses more than 2G
> of memory, it is killed by the cgroup OOM killer. Once a pod is scheduled
> to a host, the memory space sized at "sum of all its containers' request
> values" is booked exclusively for this pod.
>
> 4. By default, Spark *sets request and limit to the same value for
> executors in k8s*, and this value is basically spark.executor.memory +
> spark.executor.memoryOverhead in most cases. However,
> spark.executor.memoryOverhead usage is very bursty: a user setting
> spark.executor.memoryOverhead to 10G usually means each executor needs the
> full 10G only in a very small portion of the executor's whole lifecycle.
>
> 5. The proposed SPIP is essentially to decouple the request/limit values
> in spark@k8s for executors in a safe way (this idea is from the ByteDance
> paper we refer to in the SPIP doc).
>
> Using the aforementioned example: if we have a single-node cluster with
> 100G of RAM and two pods each requesting 40G + 10G (on-heap +
> memoryOverhead), and we set the bursty factor to 1.2, then without the
> mechanism proposed in this SPIP we can host at most 2 pods on this
> machine, and because of the bursty usage of that 10G space, memory
> utilization is compromised.
>
> When applying the burst-aware memory allocation, we only need 40 + 10 -
> min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G of free
> memory left on the machine which can be used to host some smaller pods. At
> the same time, since we didn't change the limit value of the executor
> pods, these executors can still use 50G at max.
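> In code form, the calculation above is just the following (a minimal
> sketch; burstyFactor is written as a plain multiplier and the variable
> names are illustrative, not the actual config keys):
>
>     // Scala sketch of the burst-aware request, with the numbers above
>     val heapGb       = 40.0  // spark.executor.memory
>     val overheadGb   = 10.0  // spark.executor.memoryOverhead
>     val burstyFactor = 1.2
>     val limitGb   = heapGb + overheadGb  // 50.0, unchanged by the proposal
>     val requestGb = limitGb - math.min(limitGb * (burstyFactor - 1), overheadGb)
>     // requestGb ~= 40.0 (modulo floating point): what the scheduler books
>
> so the scheduler books 40G per executor while the cgroup cap stays at 50G.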
> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>
>> Sorry, I'm not very familiar with the k8s infra, how does it work under
>> the hood? Will the container adjust its system memory size depending on
>> the actual memory usage of the processes in this container?
>>
>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>
>>> yeah, we have a few cases where O is significantly larger than H, and
>>> the proposed algorithm is actually a great fit
>>>
>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm
>>> will allocate a non-trivial G to ensure safe execution but still cut a
>>> big chunk of memory (10s of GBs) and treat it as S, saving a lot of the
>>> money burnt by it
>>>
>>> but regarding native accelerators, some native acceleration engines do
>>> not use memoryOverhead but use off-heap (spark.memory.offHeap.size)
>>> explicitly (e.g. Gluten). The current implementation does not cover
>>> this part, but that would be an easy extension
>>>
>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>
>>>> Thanks for the reply.
>>>>
>>>> Have you tested in environments where O is bigger than H? Wondering if
>>>> the proposed algorithm would help more in those environments (e.g.
>>>> with native accelerators)?
>>>>
>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> Hi Qiegang, thanks for the good questions as well
>>>>>
>>>>> please check the following answers
>>>>>
>>>>> > My initial understanding is that Kubernetes will use the Executor
>>>>> > Memory Request (H + G) for scheduling decisions, which allows for
>>>>> > better resource packing.
>>>>>
>>>>> yes, your understanding is correct
>>>>>
>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>> > potential usage sum of H+G+S across all pods on a node exceeds its
>>>>> > allocatable capacity? Does the proposal implicitly rely on the
>>>>> > cluster operator to manually ensure an unrequested memory buffer
>>>>> > exists on the node to serve as the shared pool?
>>>>>
>>>>> at PINS, we basically apply a set of strategies: setting a
>>>>> conservative bursty factor, rolling out progressively, and monitoring
>>>>> cluster metrics like Linux kernel OOM-killer occurrences to guide us
>>>>> to the optimal bursty-factor setup... usually, K8S operators set
>>>>> aside reserved space for daemon processes on each host; we found it
>>>>> sufficient in our case, and our major tuning focuses on the bursty
>>>>> factor value
>>>>>
>>>>> > Have you considered scheduling optimizations to ensure a strategic
>>>>> > mix of executors with large S and small S values on a single node?
>>>>> > I am wondering if this would reduce the probability of concurrent
>>>>> > bursting and host-level OOM.
>>>>>
>>>>> Yes, when we worked on this project, we paid attention to the cluster
>>>>> scheduling policy/behavior... two things we mostly care about:
>>>>>
>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level
>>>>> of diversity of workloads so that we have enough candidates to form a
>>>>> mixed set of executors with large S and small S values
>>>>>
>>>>> 2. we avoid bin-packing scheduling algorithms, which tend to pack
>>>>> more pods from the same job onto the same host; that can create
>>>>> trouble as those pods are more likely to ask for max memory at the
>>>>> same time
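>>>>> to put toy numbers on the host-level picture (illustrative only, not
>>>>> from the SPIP doc): the scheduler admits pods by the sum of requests,
>>>>> while the worst case is the sum of limits, e.g.
>>>>>
>>>>>     // toy Scala sketch: admission uses requests, the risk lives in limits
>>>>>     case class Pod(requestGb: Double, limitGb: Double)
>>>>>     val allocatableGb = 100.0
>>>>>     val pods = Seq(Pod(40, 50), Pod(40, 50), Pod(15, 18))
>>>>>     val schedulable = pods.map(_.requestGb).sum <= allocatableGb // 95 <= 100
>>>>>     val worstCaseGb  = pods.map(_.limitGb).sum                   // 118
>>>>>     // 118G > 100G only materializes if all pods burst at once; a
>>>>>     // conservative bursty factor, a mixed workload, and the
>>>>>     // operator's reserved headroom keep that probability low
>>>>>
>>>>> which is exactly why the workload mix and the bursty factor matter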
>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>
>>>>>> Thanks for sharing this interesting proposal.
>>>>>>
>>>>>> My initial understanding is that Kubernetes will use the Executor
>>>>>> Memory Request (H + G) for scheduling decisions, which allows for
>>>>>> better resource packing. I have a few questions regarding the shared
>>>>>> portion S:
>>>>>>
>>>>>> 1. How is the risk of host-level OOM mitigated when the total
>>>>>> potential usage sum of H+G+S across all pods on a node exceeds its
>>>>>> allocatable capacity? Does the proposal implicitly rely on the
>>>>>> cluster operator to manually ensure an unrequested memory buffer
>>>>>> exists on the node to serve as the shared pool?
>>>>>> 2. Have you considered scheduling optimizations to ensure a
>>>>>> strategic mix of executors with large S and small S values on a
>>>>>> single node? I am wondering if this would reduce the probability of
>>>>>> concurrent bursting and host-level OOM.
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>
>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed
>>>>>>> heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>>>> - Do Spark applications have static profiles? When we submit
>>>>>>> stages, the cluster is already allocated, how can we change
>>>>>>> anything?
>>>>>>> - How do we assign the shared memory overhead? Fairly among all
>>>>>>> applications on the same physical node?
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>
>>>>>>>> we didn't separate the design into another doc since the main idea
>>>>>>>> is relatively simple...
>>>>>>>>
>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc
>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>
>>>>>>>> it is calculated per resource profile (you could say per stage):
>>>>>>>> when the cluster manager composes the pod spec, it calculates the
>>>>>>>> new memory overhead based on what the user asks for in that
>>>>>>>> resource profile
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Do we have a design sketch? How do we determine the memory
>>>>>>>>> request and limit? Is it per stage or per executor?
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, the implementation basically relies on the request/limit
>>>>>>>>>> concept in K8S, ...
>>>>>>>>>>
>>>>>>>>>> but if any other cluster manager comes along in the future, as
>>>>>>>>>> long as it has a similar concept it can leverage this easily, as
>>>>>>>>>> the main logic is implemented in ResourceProfile
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>>>>> containers to have dynamic resources?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>
>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>>>>> algorithm for Spark@K8S to improve the memory utilization of
>>>>>>>>>>>> Spark clusters. Please see more details in the SPIP doc
>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks to Chao for being the shepherd of this feature.
>>>>>>>>>>>> Also want to thank the authors of the original paper
>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>> ByteDance, specifically Rui ([email protected]) and Yixin
>>>>>>>>>>>> ([email protected]).
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>> Yao Wang
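P.S. From Q4 of the SPIP doc and the per-profile answer above, the hook
seems to reduce to something like this (my own sketch; adjustedRequestMiB
is a hypothetical helper name, not the actual code path):

    // Hypothetical helper: derive the k8s memory request for one
    // ResourceProfile while leaving the limit at heap + overhead.
    def adjustedRequestMiB(heapMiB: Long, overheadMiB: Long,
                           burstyFactor: Double): Long = {
      val limitMiB = heapMiB + overheadMiB
      limitMiB - math.min(math.round(limitMiB * (burstyFactor - 1)), overheadMiB)
    }

    adjustedRequestMiB(40L * 1024, 10L * 1024, 1.2) // 40960 MiB = 40G, as in the example

i.e. the limit the cgroup enforces stays at heap + overhead, and only the
amount the scheduler books per pod shrinks. Is that the right reading?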
