+1 (non-binding)

On Sat, Jan 3, 2026 at 7:23 PM Kent Yao <[email protected]> wrote:
+1

Kent

On 2025/12/17 07:48:06 Wenchen Fan wrote:

+1

On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> wrote:

+1 from me.
I think it's well-scoped and takes advantage of Kubernetes' features
exactly for what they are designed for (as per my understanding).

On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:

Thanks Yao and Nan for the proposal, and thanks everyone for the detailed
and thoughtful discussion.

Overall, this looks like a valuable addition for organizations running
Spark on Kubernetes, especially given how bursty memoryOverhead usage
tends to be in practice. I appreciate that the change is relatively small
in scope and fully opt-in, which helps keep the risk low.

From my perspective, the questions raised on the thread and in the SPIP
have been addressed. If others feel the same, do we have consensus to
move forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
Best,
Chao

On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:

This is a good question.

> a stage is bursty and consumes the shared portion and fails to release
> it for subsequent stages

In the scenario you described, since the memory-leaking stage and the
subsequent ones are from the same job, the pod will likely be killed by
the cgroup OOM killer.

Taking the following as an example: the usage pattern is G = 5GB,
S = 2GB, so it uses G + S at max. In theory, it should release all 7G
and then claim 7G again in some later stages. However, due to the memory
peak, it holds 2G forever and asks for another 7G; as a result, it hits
the pod memory limit, and the cgroup OOM killer will take action to
terminate the pod.

So this should be safe for the system.

However, we should be careful about the memory peak for sure, because it
essentially breaks the assumption that memoryOverhead usage is bursty
(memory peak ~= using the memory forever)... Unfortunately,
shared/guaranteed memory is managed by user applications instead of at
the cluster level; they, especially S, are just logical concepts instead
of a physical memory pool which pods can explicitly claim memory from.

On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:

Thanks for the interesting proposal.
The design seems to rely on memoryOverhead being transient.
What happens when a stage is bursty and consumes the shared portion and
fails to release it for subsequent stages (e.g., off-heap buffers that
are not garbage collected since they are off-heap)? Would this trigger
the host-level OOM described in Q6?
Or are there strategies to release the shared portion?

On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:

Yes, that's the worst case in the scenario. Please check my earlier
response to Qiegang's question; we have a set of strategies adopted in
prod to mitigate the issue.

On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:

Thanks for the explanation! So the executor is not guaranteed to get
50 GB physical memory, right? All pods on the same host may reach peak
memory usage at the same time and cause paging/swapping which hurts
performance?

On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:

No problem, let me try to explain.

1. Each executor container will run in a pod together with some other
sidecar containers taking care of tasks like authentication, etc. For
simplicity, we assume each pod has only one container, which is the
executor container.

2. Each container is assigned two values, request & limit (limit >=
request), for both CPU and memory resources (we only discuss memory
here). Each pod has request/limit values equal to the sum of those of
all containers belonging to this pod.

3. The K8S scheduler chooses a machine to host a pod based on the
request value, and caps the resource usage of each container based on
its limit value, e.g.
if I have a pod with a single container in it, and it has 1G/2G as the
request and limit values respectively, any machine with 1G of free RAM
will be a candidate to host this pod, and when the container uses more
than 2G of memory, it will be killed by the cgroup OOM killer. Once a
pod is scheduled to a host, the memory space sized at "sum of all its
containers' request values" will be booked exclusively for this pod.

4. By default, Spark sets request and limit to the same value for
executors on K8S, and this value is basically spark.executor.memory +
spark.executor.memoryOverhead in most cases. However,
spark.executor.memoryOverhead usage is very bursty: a user setting
spark.executor.memoryOverhead to 10G usually means each executor only
needs the 10G in a very small portion of the executor's whole lifecycle.

5. The proposed SPIP essentially decouples the request and limit values
in Spark@K8S for executors in a safe way (this idea is from the
ByteDance paper we refer to in the SPIP doc).

Using the aforementioned example: if we have a single-node cluster with
100G of RAM, two pods requesting 40G + 10G (on-heap + memoryOverhead)
each, and a bursty factor of 1.2, then without the mechanism proposed in
this SPIP we can host at most 2 pods on this machine, and because of the
bursty usage of that 10G space, memory utilization would be compromised.

When applying the burst-aware memory allocation, we only need
40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have
20G of free memory left on the machine which can be used to host some
smaller pods.
At the same time, as we didn't change the limit value of the executor
pods, these executors can still use 50G at max.

On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:

Sorry, I'm not very familiar with the K8S infra; how does it work under
the hood? The container will adjust its system memory size depending on
the actual memory usage of the processes in this container?

On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:

Yeah, we have a few cases where O is significantly larger than H, and
the proposed algorithm is actually a great fit.

As I explained in SPIP doc Appendix C, the proposed algorithm will
allocate a non-trivial G to ensure the safety of running, but still cut
a big chunk of memory (10s of GBs) and treat it as S, saving tons of
money burnt by it.

But regarding native accelerators: some native acceleration engines do
not use memoryOverhead but use off-heap memory
(spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
implementation does not cover this part, though that would be an easy
extension.

On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:

Thanks for the reply.

Have you tested in environments where O is bigger than H? Wondering if
the proposed algorithm would help more in those environments (e.g. with
native accelerators)?
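[For readers following the arithmetic: the request/limit derivation Nan
walks through above can be sketched as below. This is a minimal
illustration of the worked example (request = H + O - min((H + O) x
(burstyFactor - 1), O), limit = H + O), not the SPIP's actual
implementation; the function and parameter names are made up here.]

```python
def burst_aware_memory(heap_gb, overhead_gb, bursty_factor):
    """Derive pod memory request/limit under the burst-aware scheme.

    A shared portion S is carved out of memoryOverhead and dropped from
    the scheduler-visible request, while the cgroup limit stays at
    heap + overhead, so the executor can still burst to the full amount.
    """
    total = heap_gb + overhead_gb
    # S is capped by the overhead itself: the heap is always fully booked.
    shared = min(total * bursty_factor - total, overhead_gb)
    request = total - shared  # what the K8S scheduler books on the node
    limit = total             # cgroup OOM-killer threshold, unchanged
    return request, limit

# Nan's example: 40G heap + 10G overhead, bursty factor 1.2.
# The request drops to ~40G while the limit stays at 50G, so a 100G node
# hosting two such executors keeps ~20G free for smaller pods.
request, limit = burst_aware_memory(40, 10, 1.2)
```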
On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:

Hi Qiegang, thanks for the good questions as well.

Please check the following answers.

> My initial understanding is that Kubernetes will use the Executor
> Memory Request (H + G) for scheduling decisions, which allows for
> better resource packing.

Yes, your understanding is correct.

> How is the risk of host-level OOM mitigated when the total potential
> usage sum of H+G+S across all pods on a node exceeds its allocatable
> capacity? Does the proposal implicitly rely on the cluster operator to
> manually ensure an unrequested memory buffer exists on the node to
> serve as the shared pool?

In PINS, we basically apply a set of strategies: setting a conservative
bursty factor, progressive rollout, and monitoring cluster metrics like
Linux kernel OOM killer occurrences to guide us to the optimal bursty
factor setup. Usually, K8S operators set a reserved space for daemon
processes on each host; we found that sufficient in our case, and our
major tuning focuses on the bursty factor value.

> Have you considered scheduling optimizations to ensure a strategic mix
> of executors with large S and small S values on a single node? I am
> wondering if this would reduce the probability of concurrent bursting
> and host-level OOM.

Yes, when we worked on this project, we paid some attention to the
cluster scheduling policy/behavior.
Two things we mostly care about:

1. As stated in the SPIP doc, the cluster should have a certain level of
workload diversity so that we have enough candidates to form a mixed set
of executors with large S and small S values.

2. We avoid using a binpack scheduling algorithm, which tends to pack
more pods from the same job onto the same host; that can create trouble,
as those pods are more likely to ask for max memory at the same time.

On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:

Thanks for sharing this interesting proposal.

My initial understanding is that Kubernetes will use the Executor Memory
Request (H + G) for scheduling decisions, which allows for better
resource packing. I have a few questions regarding the shared portion S:

1. How is the risk of host-level OOM mitigated when the total potential
usage sum of H+G+S across all pods on a node exceeds its allocatable
capacity? Does the proposal implicitly rely on the cluster operator to
manually ensure an unrequested memory buffer exists on the node to serve
as the shared pool?

2. Have you considered scheduling optimizations to ensure a strategic
mix of executors with large S and small S values on a single node? I am
wondering if this would reduce the probability of concurrent bursting
and host-level OOM.
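[The host-level OOM risk raised in question 1 above can be made concrete
with a small worst-case check: once request and limit are decoupled, a
node can be fully booked by requests yet overcommitted on limits. The
function and numbers below are a hypothetical illustration, not part of
the SPIP; they also show why mixing large-S and small-S pods shrinks the
worst case.]

```python
def worst_case_shortfall_gb(pods, allocatable_gb):
    """How many GB the node would be short if every pod burst to its
    cgroup limit at the same time. `pods` is a list of
    (request_gb, limit_gb) pairs already placed on the node."""
    # The scheduler only places pods whose requests fit the node.
    assert sum(req for req, _ in pods) <= allocatable_gb
    return max(0, sum(lim for _, lim in pods) - allocatable_gb)

# 100G node: two burst-aware executors (request 40G, limit 50G) plus a
# conventional 20G/20G pod. Requests fit exactly, but a simultaneous
# burst would need 120G, overshooting by 20G.
shortfall = worst_case_shortfall_gb([(40, 50), (40, 50), (20, 20)], 100)

# Replacing one large-S executor with two small-S (20G/22G) pods lowers
# the worst case: limits now sum to 114G, an overshoot of only 14G.
mixed = worst_case_shortfall_gb([(40, 50), (20, 22), (20, 22), (20, 20)], 100)
```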
On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:

I think I'm still missing something in the big picture:

- Is the memory overhead off-heap? The formula indicates a fixed heap
size, and memory overhead can't be dynamic if it's on-heap.
- Do Spark applications have static profiles? When we submit stages, the
cluster is already allocated; how can we change anything?
- How do we assign the shared memory overhead? Fairly among all
applications on the same physical node?

On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:

We didn't separate the design into another doc since the main idea is
relatively simple.

For the request/limit calculation, I described it in Q4 of the SPIP doc:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0

It is calculated per profile (you can say it is per stage): when the
cluster manager composes the pod spec, it calculates the new memory
overhead based on what the user asks for in that resource profile.

On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:

Do we have a design sketch? How do we determine the memory request and
limit? Is it per stage or per executor?
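[Nan's per-profile answer above could look roughly like the sketch
below: for each resource profile, derive the container's memory request
and limit when composing the pod spec. The function name, dict keys, and
MiB units are assumptions for illustration; the real logic lives in
Spark's ResourceProfile handling and K8S pod builder and may differ.]

```python
def pod_memory_for_profile(profile_mib, bursty_factor):
    """Per-resource-profile request/limit derivation (illustrative only).

    profile_mib: one profile's executor memory settings in MiB, e.g.
    {"memory": 40960, "memoryOverhead": 10240}. Returns the values a pod
    builder would put into resources.requests / resources.limits.
    """
    heap = profile_mib["memory"]
    overhead = profile_mib["memoryOverhead"]
    total = heap + overhead
    # Shared portion dropped from the request, capped by the overhead.
    shared = min(total * bursty_factor - total, overhead)
    return {"requests": round(total - shared), "limits": total}

# A stage-level profile asking for 40G heap + 10G overhead, factor 1.2:
spec = pod_memory_for_profile({"memory": 40960, "memoryOverhead": 10240}, 1.2)
```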
On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:

Yeah, the implementation basically relies on the request/limit concept
in K8S.

But if any other cluster manager comes along in the future, as long as
it has a similar concept it can leverage this easily, as the main logic
is implemented in ResourceProfile.

On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:

This feature is only available on K8S because it allows containers to
have dynamic resources?

On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:

Hi Folks,

We are proposing a burst-aware memoryOverhead allocation algorithm for
Spark@K8S to improve the memory utilization of Spark clusters.
Please see more details in the SPIP doc:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0
Feedback and discussion are welcome.

Thanks Chao for being the shepherd of this feature.
Also want to thank the authors of the original paper
(https://www.vldb.org/pvldb/vol17/p3759-shi.pdf) from ByteDance,
specifically Rui ([email protected]) and Yixin
([email protected]).

Thank you.
Yao Wang

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

--
John Zhuge
