Hi Yao, Nan, and Chao,
Thank you for this proposal; I have already approved it. The cost-efficiency
goals are very compelling, and the cited $6M annual savings at Pinterest
clearly demonstrate the value of moving away from rigid peak provisioning.
However, after modeling the proposed design against standard Kubernetes
behavior and modern Spark workloads, I have identified *three critical
stability risks* that need to be addressed before this is finalized.
I have drafted a *Supplementary Design Amendment* (linked below/attached)
that proposes fixes for these issues, but here is the summary:
1. The "Guaranteed QoS" Contradiction
The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for stability
risks2.
The Issue: Technically, this mitigation is impossible under your proposal.
- In Kubernetes, a Pod is assigned the *Guaranteed* QoS class *only* if
Request == Limit for both CPU and Memory.
- Your proposal explicitly sets Memory Request < Memory Limit (specifically
$H+G < H+O$).
- *Consequence:* This configuration *automatically downgrades* the Pod to
the *Burstable* QoS class. In a multi-tenant cluster, the Kubelet eviction
manager will kill these Burstable Spark pods *before* any Guaranteed system
pods during node pressure.
- *Proposed Fix:* We cannot rely on Guaranteed QoS. We must introduce a
priorityClassName configuration to offset this eviction risk; see the
sketch after this list.
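To make the downgrade concrete, here is a minimal Scala sketch of the
Kubernetes QoS classification rule as I understand it (my own illustration,
not code from the SPIP; the priority class name is hypothetical):

sealed trait QosClass
case object Guaranteed extends QosClass
case object Burstable extends QosClass
case object BestEffort extends QosClass

// Kubernetes grants Guaranteed only when request == limit for every
// resource; any request < limit demotes the pod to Burstable.
def qosClass(memReq: Long, memLim: Long, cpuReq: Long, cpuLim: Long): QosClass =
  if (memReq == 0 && memLim == 0 && cpuReq == 0 && cpuLim == 0) BestEffort
  else if (memReq == memLim && cpuReq == cpuLim) Guaranteed
  else Burstable

// Under the SPIP: request = H + G, limit = H + O, with G < O.
val (h, g, o) = (40L << 30, 2L << 30, 10L << 30) // bytes: 40G, 2G, 10G
assert(qosClass(h + g, h + o, 4, 4) == Burstable) // never Guaranteed
// Mitigation: attach a priorityClassName such as "spark-executor" (name
// hypothetical) to the pod spec so eviction ordering is explicit.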
2. The "Zero-Guarantee" Edge Case
The formula $G = O - \min\{(H+O) \times (B-1), O\}$ has a dangerous edge
case for High-Heap/Low-Overhead jobs (common in ETL).
- *Scenario:* If a job has a large Heap ($H$) relative to Overhead ($O$),
the calculated burst deduction often exceeds the total Overhead.
- *Result:* The formula yields $G = 0$.
- *Risk:* Allocating 0 MB of guaranteed overhead is unsafe. Essential JVM
operations (thread stacks, Netty control buffers) require a non-zero
baseline. Relying 100% on a shared burst pool for basic functionality will
lead to immediate container failures if the node is contended.
- *Proposed Fix:* Implement a safety floor using a minGuaranteedRatio
(e.g., max(Calculated_G, O * 0.1)); see the sketch after this list.
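As a worked instance plus the proposed floor, a small Scala sketch
(minGuaranteedRatio is the hypothetical knob from this email, not an
existing Spark config):

// G = O - min((H + O) * (B - 1), O), with a safety floor applied.
def guaranteedOverhead(h: Long, o: Long, b: Double,
                       minGuaranteedRatio: Double = 0.1): Long = {
  val raw = o - math.min(((h + o) * (b - 1)).toLong, o) // SPIP formula
  math.max(raw, (o * minGuaranteedRatio).toLong)        // proposed floor
}

// High-Heap/Low-Overhead ETL job: H = 40 GiB, O = 4 GiB, B = 1.2.
// Raw G = 4 - min(44 * 0.2, 4) GiB = 0; the floor preserves 0.4 GiB.
val g = guaranteedOverhead(40L << 30, 4L << 30, 1.2)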
3. Native Execution Gap (Off-Heap)
The proposal focuses entirely on memoryOverhead.
The Issue: Modern native engines (Gluten, Velox, Photon) shift execution
memory to spark.memory.offHeap.size. This memory is equally "bursty" but is
excluded from your optimization.
*Proposed Fix:* The burst-aware logic should be extensible to include
Off-Heap memory if enabled; see the sketch below.
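A sketch of what that extension might look like; folding off-heap into the
deductible pool is my assumption about the shape, not part of the current
proposal:

// Request = total - min(total * (B - 1), burstyPool). When off-heap is
// enabled (spark.memory.offHeap.enabled), its size joins the bursty pool;
// otherwise the request calculation behaves as in the SPIP today.
def executorMemRequest(h: Long, o: Long, offHeap: Long,
                       offHeapEnabled: Boolean, b: Double): Long = {
  val total = h + o + offHeap
  val burstyPool = if (offHeapEnabled) o + offHeap else o
  total - math.min((total * (b - 1)).toLong, burstyPool)
}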
Supplementary Design Amendment:
https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing
I believe these changes are necessary to make the feature robust enough for
general community adoption beyond specific controlled environments.
Regards,
Vaquar Khan
Sr Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/
On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote:
> +1
>
> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> wrote:
>
>> +1
>>
>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]>
>> wrote:
>>
>>> +1 from me.
>>> I think it's well-scoped and takes advantage of Kubernetes' features
>>> exactly for what they are designed for (as per my understanding).
>>>
>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:
>>>
>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the
>>>> detailed and thoughtful discussion.
>>>>
>>>> Overall, this looks like a valuable addition for organizations running
>>>> Spark on Kubernetes, especially given how bursty memoryOverhead usage
>>>> tends to be in practice. I appreciate that the change is relatively small
>>>> in scope and fully opt-in, which helps keep the risk low.
>>>>
>>>> From my perspective, the questions raised on the thread and in the SPIP
>>>> have been addressed. If others feel the same, do we have consensus to move
>>>> forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
>>>>
>>>> Best,
>>>> Chao
>>>>
>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]>
>>>> wrote:
>>>>
>>>>> this is a good question
>>>>>
>>>>> > a stage is bursty and consumes the shared portion and fails to
>>>>> release it for subsequent stages
>>>>>
>>>>> in the scenario you described, since the memory-leaking stage and the
>>>>> subsequent ones are from the same job, the pod will likely be killed by
>>>>> the cgroup oomkiller
>>>>>
>>>>> taking the following as the example
>>>>>
>>>>> the usage pattern is G = 5GB, S = 2GB; it uses G + S at max. In
>>>>> theory, it should release all 7G and then claim 7G again in some later
>>>>> stages. However, due to the memory peak, it holds 2G forever and asks
>>>>> for another 7G; as a result, it hits the pod memory limit and the
>>>>> cgroup oomkiller will take action to terminate the pod
>>>>>
>>>>> so this should be safe to the system
>>>>>
>>>>>
>>>>>
>>>>> however, we should be careful about the memory peak for sure, because
>>>>> it essentially breaks the assumption that the usage of memoryOverhead is
>>>>> bursty (memory peak ~= using the memory forever)... unfortunately,
>>>>> shared/guaranteed memory is managed by user applications instead of at
>>>>> the cluster level; they, especially S, are just logical concepts rather
>>>>> than a physical memory pool which pods can explicitly claim memory
>>>>> from...
>>>>>
>>>>>
>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the interesting proposal.
>>>>>> The design seems to rely on memoryOverhead being transient.
>>>>>> What happens when a stage is bursty and consumes the shared portion
>>>>>> and fails to release it for subsequent stages (e.g., off-heap buffers
>>>>>> that are not garbage collected since they are off-heap)? Would this
>>>>>> trigger the host-level OOM described in Q6? Or are there strategies to
>>>>>> release the shared portion?
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> yes, that's the worst case in the scenario, please check my earlier
>>>>>>> response to Qiegang's question, we have a set of strategies adopted in
>>>>>>> prod
>>>>>>> to mitigate the issue
>>>>>>>
>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the explanation! So the executor is not guaranteed to
>>>>>>>> get 50 GB physical memory, right? All pods on the same host may reach
>>>>>>>> peak
>>>>>>>> memory usage at the same time and cause paging/swapping which hurts
>>>>>>>> performance?
>>>>>>>>
>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> np, let me try to explain
>>>>>>>>>
>>>>>>>>> 1. Each executor container runs in a pod together with some other
>>>>>>>>> sidecar containers taking care of tasks like authentication, etc.;
>>>>>>>>> for simplicity, we assume each pod has only one container, which is
>>>>>>>>> the executor container
>>>>>>>>>
>>>>>>>>> 2. Each container is assigned two values, *request* and *limit*
>>>>>>>>> (limit >= request), for both CPU and memory resources (we only
>>>>>>>>> discuss memory here). Each pod will have request/limit values equal
>>>>>>>>> to the sum over all containers belonging to this pod
>>>>>>>>>
>>>>>>>>> 3. The K8S scheduler chooses a machine to host a pod based on the
>>>>>>>>> *request* value, and caps the resource usage of each container
>>>>>>>>> based on its *limit* value. E.g., if I have a pod with a single
>>>>>>>>> container in it, and it has 1G/2G as request and limit values
>>>>>>>>> respectively, any machine with 1G of free RAM will be a candidate
>>>>>>>>> to host this pod, and when the container uses more than 2G of
>>>>>>>>> memory, it will be killed by the cgroup oomkiller. Once a pod is
>>>>>>>>> scheduled to a host, the memory space sized at "sum of all its
>>>>>>>>> containers' request values" will be booked exclusively for this
>>>>>>>>> pod.
>>>>>>>>>
>>>>>>>>> 4. By default, Spark *sets request/limit to the same value for
>>>>>>>>> executors in k8s*, and this value is basically
>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most cases.
>>>>>>>>> However, spark.executor.memoryOverhead usage is very bursty: a user
>>>>>>>>> setting spark.executor.memoryOverhead to 10G usually means each
>>>>>>>>> executor only needs 10G during a very small portion of the
>>>>>>>>> executor's whole lifecycle
>>>>>>>>>
>>>>>>>>> 5. The proposed SPIP essentially decouples the request/limit
>>>>>>>>> values in spark@k8s for executors in a safe way (this idea is from
>>>>>>>>> the ByteDance paper we refer to in the SPIP doc).
>>>>>>>>>
>>>>>>>>> Using the aforementioned example,
>>>>>>>>>
>>>>>>>>> if we have a single-node cluster with 100G of RAM and two pods each
>>>>>>>>> requesting 40G + 10G (on-heap + memoryOverhead), with the bursty
>>>>>>>>> factor set to 1.2: without the mechanism proposed in this SPIP, we
>>>>>>>>> can host at most 2 pods on this machine, and because of the bursty
>>>>>>>>> usage of that 10G space, the memory utilization would be
>>>>>>>>> compromised.
>>>>>>>>>
>>>>>>>>> When applying the burst-aware memory allocation, we only need 40 +
>>>>>>>>> 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have
>>>>>>>>> 20G of free memory left on the machine which can be used to host
>>>>>>>>> some smaller pods. At the same time, as we didn't change the limit
>>>>>>>>> value of the executor pods, these executors can still use 50G at
>>>>>>>>> max.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does it work
>>>>>>>>>> under the hood? The container will adjust its system memory size
>>>>>>>>>> depending on the actual memory usage of the processes in this
>>>>>>>>>> container?
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> yeah, we have a few cases where we have significantly larger O
>>>>>>>>>>> than H, and the proposed algorithm is actually a great fit
>>>>>>>>>>>
>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed algorithm
>>>>>>>>>>> will allocate a non-trivial G to ensure the safety of running,
>>>>>>>>>>> but still cut a big chunk of memory (10s of GBs) and treat it as
>>>>>>>>>>> S, saving the tons of money burnt by it
>>>>>>>>>>>
>>>>>>>>>>> but regarding native accelerators, some native acceleration
>>>>>>>>>>> engines do not use memoryOverhead but use off-heap
>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
>>>>>>>>>>> implementation does not cover this part, but that will be an easy
>>>>>>>>>>> extension
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>
>>>>>>>>>>>> Have you tested in environments where O is bigger than H?
>>>>>>>>>>>> Wondering if the proposed algorithm would help more in those
>>>>>>>>>>>> environments
>>>>>>>>>>>> (eg. with
>>>>>>>>>>>> native accelerators)?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well
>>>>>>>>>>>>>
>>>>>>>>>>>>> please check the following answer
>>>>>>>>>>>>>
>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use the
>>>>>>>>>>>>> > Executor
>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which allows
>>>>>>>>>>>>> for better resource packing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>>
>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>>>>>>>>>> potential usage sum of H+G+S across all pods on a node exceeds
>>>>>>>>>>>>> its
>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on the
>>>>>>>>>>>>> cluster
>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer exists
>>>>>>>>>>>>> on the node
>>>>>>>>>>>>> to serve as the shared pool?
>>>>>>>>>>>>>
>>>>>>>>>>>>> in PINS, we basically apply a set of strategies: setting a
>>>>>>>>>>>>> conservative bursty factor, progressive rollout, and monitoring
>>>>>>>>>>>>> cluster metrics like Linux kernel OOMKiller occurrences to
>>>>>>>>>>>>> guide us to the optimal bursty factor setup... usually, K8S
>>>>>>>>>>>>> operators will set a reserved space for daemon processes on
>>>>>>>>>>>>> each host; we found it sufficient in our case, and our major
>>>>>>>>>>>>> tuning focuses on the bursty factor value
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure a
>>>>>>>>>>>>> strategic mix of executors with large S and small S values on a
>>>>>>>>>>>>> single
>>>>>>>>>>>>> node? I am wondering if this would reduce the probability of
>>>>>>>>>>>>> concurrent
>>>>>>>>>>>>> bursting and host-level OOM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, when we worked on this project, we paid some attention to
>>>>>>>>>>>>> the cluster scheduling policy/behavior... two things we mostly
>>>>>>>>>>>>> care about:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain
>>>>>>>>>>>>> level of workload diversity so that we have enough candidates
>>>>>>>>>>>>> to form a mixed set of executors with large S and small S
>>>>>>>>>>>>> values
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. we avoid using the binpack scheduling algorithm, which tends
>>>>>>>>>>>>> to pack more pods from the same job onto the same host; this
>>>>>>>>>>>>> can create trouble as those pods are more likely to ask for max
>>>>>>>>>>>>> memory at the same time
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor
>>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which
>>>>>>>>>>>>>> allows for better resource packing. I have a few questions
>>>>>>>>>>>>>> regarding the shared portion S:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when the
>>>>>>>>>>>>>> total potential usage sum of H+G+S across all pods on a node
>>>>>>>>>>>>>> exceeds its
>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on
>>>>>>>>>>>>>> the cluster
>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer
>>>>>>>>>>>>>> exists on the node
>>>>>>>>>>>>>> to serve as the shared pool?
>>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to ensure
>>>>>>>>>>>>>> a strategic mix of executors with large S and small S values
>>>>>>>>>>>>>> on a single node? I am wondering if this would reduce the
>>>>>>>>>>>>>> probability of
>>>>>>>>>>>>>> concurrent bursting and host-level OOM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formula
>>>>>>>>>>>>>>> indicates a fixed heap size, and memory overhead can't be
>>>>>>>>>>>>>>> dynamic if it's
>>>>>>>>>>>>>>> on-heap.
>>>>>>>>>>>>>>> - Do Spark applications have static profiles? When we
>>>>>>>>>>>>>>> submit stages, the cluster is already allocated, how can we
>>>>>>>>>>>>>>> change anything?
>>>>>>>>>>>>>>> - How do we assign the shared memory overhead? Fairly
>>>>>>>>>>>>>>> among all applications on the same physical node?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we didn't separate the design into another doc since the
>>>>>>>>>>>>>>>> main idea is relatively simple...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 of the
>>>>>>>>>>>>>>>> SPIP doc
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> it is calculated per resource profile (you can say per
>>>>>>>>>>>>>>>> stage): when the cluster manager composes the pod spec, it
>>>>>>>>>>>>>>>> calculates the new memory overhead based on what the user
>>>>>>>>>>>>>>>> asks for in that resource profile
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the memory
>>>>>>>>>>>>>>>>> request and limit? Is it per stage or per executor?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the
>>>>>>>>>>>>>>>>>> request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming in the
>>>>>>>>>>>>>>>>>> future, as long as it has a similar concept, it can
>>>>>>>>>>>>>>>>>> leverage this easily, as the main logic is implemented in
>>>>>>>>>>>>>>>>>> ResourceProfile
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>>>>>>>>>>>>> containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead
>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve memory
>>>>>>>>>>>>>>>>>>>> utilization of Spark clusters.
>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc
>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original paper
>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui ([email protected])
>>>>>>>>>>>>>>>>>>>> and Yixin ([email protected]).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>