Hi Nan,

Thanks for the detailed reply. I appreciate you sharing the specific context from the Pinterest implementation—it helps clarify the operational model you are using.
However, I maintain that for a general-purpose open-source feature (one that will be used by teams without dedicated platform engineers to manage rollouts), we need structural safety guardrails.

*Here is my response to your points:*

1. Re: "Zero-Guarantee" & Safety (Critical)

You suggested that "setting a conservative bursty factor" resolves the risk of zero guaranteed overhead. For high-heap jobs it does not, because the problem is structural. The formula is $G = O - \min((H+O) \times (B-1), O)$.

Consider a standard ETL job with Heap (H) = 100GB and Overhead (O) = 5GB. Even with a very conservative bursty factor (B) of 1.06 (only a 6% burst):

- Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 = 6.3$ GB.
- Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*.

So even an extremely conservative factor forces this pod to have zero guaranteed memory for OS/JVM threads. This is not a tuning issue; it is a formulaic edge case for high-memory jobs.

*A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing up the design"—it is mathematically necessary to prevent the formula from collapsing to zero in valid production scenarios.*
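To make the edge case concrete, here is a minimal, self-contained Scala sketch of the SPIP formula together with the floor I am proposing. The object and method names, and the minGuaranteedRatio semantics, are mine, for illustration only; this is not existing Spark code:

```scala
object BurstyOverheadMath {
  // SPIP formula: G = O - min((H + O) * (B - 1), O)
  def guaranteedOverhead(heapGb: Double, overheadGb: Double, burstyFactor: Double): Double =
    overheadGb - math.min((heapGb + overheadGb) * (burstyFactor - 1), overheadGb)

  // Proposed safety floor: never guarantee less than minRatio * O.
  // A default of 0.0 reproduces the current design exactly.
  def withFloor(heapGb: Double, overheadGb: Double, burstyFactor: Double,
                minRatio: Double = 0.0): Double =
    math.max(guaranteedOverhead(heapGb, overheadGb, burstyFactor), overheadGb * minRatio)

  def main(args: Array[String]): Unit = {
    // The high-heap ETL job above: H = 100GB, O = 5GB, B = 1.06.
    println(guaranteedOverhead(100, 5, 1.06))        // prints 0.0 -> zero guaranteed overhead
    println(withFloor(100, 5, 1.06, minRatio = 0.1)) // prints 0.5 -> 10% of O kept as a floor
  }
}
```

With the default ratio of 0.0 the sketch reproduces your current formula exactly, so the knob is purely additive for existing rollouts.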
2. Re: Kubelet Eviction vs. OOMKiller

I concede that during sudden memory spikes the Kernel OOMKiller often acts faster than Kubelet eviction. However, Kubelet eviction is still the primary mechanism for other pressure types (DiskPressure, PIDPressure) and for "slow leak" scenarios where memory.available crosses the eviction threshold before the Kernel OOMKiller ever fires.

*Adding priorityClassName support to the pod spec is a low-effort, zero-risk change that aligns with Kubernetes best practices for "Defense in Depth." It costs nothing to expose this config.*
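To show how small the change is, here is a rough Scala sketch of the kind of pod-builder step I have in mind, using the fabric8 PodBuilder API that Spark's K8s backend already depends on. The config key in the sketch is hypothetical; I am proposing it, not claiming it exists today:

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

object PriorityClassSupport {
  // Hypothetical config key, named here purely for illustration.
  val ExecutorPriorityClassKey = "spark.kubernetes.executor.podPriorityClassName"

  /** If the user configured a priority class, stamp it onto the executor pod.
   *  The kubelet eviction manager ranks candidates by pod priority, so a
   *  high-priority class keeps these (now Burstable) executors from being
   *  evicted before lower-priority batch pods. */
  def applyPriorityClass(pod: Pod, conf: Map[String, String]): Pod =
    conf.get(ExecutorPriorityClassKey) match {
      case Some(priorityClass) =>
        new PodBuilder(pod)
          .editOrNewSpec()
            .withPriorityClassName(priorityClass)
          .endSpec()
          .build()
      case None => pod // config unset: pod unchanged, exactly today's behavior
    }
}
```

Admins would define the PriorityClass object once per cluster and point the config at it. (Users can already inject this via spark.kubernetes.executor.podTemplateFile, but a first-class config makes the knob discoverable and documentable.)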
3. Re: Native Support

Fair point. To keep the scope tight, I am happy to drop the Native Support request for this SPIP. We can treat that as a separate follow-up.

Path Forward

I am happy to support the SPIP if we can agree to:

1. Add the minGuaranteedRatio config (to handle the high-heap math shown above).
2. Expose the priorityClassName config (standard K8s practice).

Regards,

Viquar Khan
Sr Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/

On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote: > > Kubelet Eviction is the first line of defense before the Kernel > OOMKiller strikes. > > This is *NOT* true. Eviction will be the first to kill some best effort > pod which doesn't make any difference on memory pressure in most cases and > before it takes action again, Kernel OOMKiller already killed some executor > pods. This is exactly the reason for me to say, we don't really worry about > eviction here, before eviction touches those executors, OOMKiller already > killed them. This behavior is consistently observed and we also had > discussions with other companies who had to modify Kernel code to mitigate > this behavior. > > > Re: "Zero-Guarantee" & Safety > > you basically want to tradeoff saving with system safety , then why not > just setting a conservative value of bursty factor? it is exactly what we > did in PINS, please check my earlier response in the thread ... key part as > following: > > "in PINS, we basically apply a set of strategies, setting conservative > bursty factor, progressive rollout, monitor the cluster metrics like Linux > Kernel OOMKiller occurrence to guide us to the optimal setup of bursty > factor... in usual, K8S operators will set a reserved space for daemon > processes on each host, we found it is sufficient to in our case and our > major tuning focuses on bursty factor value " > > If you really want, you can enable this feature only for jobs when > OnHeap/MemoryOverhead is smaller than a certain value... > > I just didn't see the value of bringing another configuration > > > > Re: Native Support > > I mean....this SPIP is NOT about native execution engine's memory pattern at > all..... why do we bother to bring it up.... > > > > On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]> wrote: > >> Hi Nan, >> >> Thanks for the prompt response and for clarifying the design intent. >> >> I understand the goal is to maximize savings—and I agree we shouldn't >> block the current momentum even though you can see my vote is +1—but I want >> to ensure we aren't over-optimizing for specific internal environments at >> the cost of general community stability. >> >> *Here is my rejoinder on the technical points:* >> >> 1. Re: PriorityClass & OOMKiller (Defense in Depth) >> >> You mentioned that “priorityClassName is NOT the solution... What we >> worry about is the Linux Kernel OOMKiller.” >> >> I agree that the Kernel OOMKiller (cgroup) primarily looks at >> oom_score_adj (which is determined by QoS Class). However, Kubelet Eviction >> is the first line of defense before the Kernel OOMKiller strikes. >> >> When a node comes under memory pressure (e.g., memory.available drops >> below evictionHard), the *Kubelet* actively selects pods to evict to >> reclaim resources. Unlike the Kernel, the Kubelet *does* explicitly use >> PriorityClass when ranking candidates for eviction. >> >> - >> >> *The Risk:* Since we are downgrading these pods to *Burstable* >> (increasing their OOM risk), we lose the "Guaranteed" protection shield. >> - >> >> *The Fix:* By assigning a high PriorityClass, we ensure that if the >> Kubelet needs to free space, it evicts lower-priority batch jobs >> *before* these Spark executors. It is a necessary "Defense in Depth" >> strategy for multi-tenant clusters that prevents optimized Spark jobs from >> being the first victims of node pressure. >> >> 2. Re: "Zero-Guarantee" & Safety >> >> You noted that “savings come from these 0 memory overhead pods.” >> >> While G=0 maximizes "on-paper" savings, it is theoretically unsafe for a >> JVM. A JVM physically requires non-heap memory for Thread Stacks, >> CodeCache, and Metaspace just to run. >> >> - >> >> *The Reality:* If G=0, then Pod Request == Heap. If a node is fully >> packed (Sum of Requests ≈ Node Capacity), the pod relies *100%* on >> the burst pool for basic thread allocation. If neighbors are noisy, that >> pod cannot even spawn a thread. >> - >> >> *The Compromise:* I strongly suggest we add the configuration >> spark.executor.memoryOverhead.minGuaranteedRatio but set the *default >> to 0.0*. >> - >> >> This preserves your logic/savings by default. >> - >> >> But it gives platform admins a "safety knob" to turn (e.g., to >> 0.1) when they inevitably encounter instability in high-contention >> environments, without needing a code patch. >> >> 3. Re: Native Support >> >> Agreed. We can treat Off-Heap support as a follow-up item. I would just >> request that we add a "Known Limitation" note in the SPIP stating that this >> optimization does not yet apply to spark.memory.offHeap.size, so users of >> Gluten/Velox are aware. >> >> I am happy to support the PR moving forward if we can agree to include >> the *PriorityClass* config support and the *Safety Floor* config (even >> if disabled by default). Please update your SPIP. This ensures the feature >> is robust enough for the wider user base. >> >> Regards, >> >> Viquar Khan >> >> Sr Data Architect >> >> https://www.linkedin.com/in/vaquar-khan-b695577/ >> >> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote: >> >>> Hi, Vaquar >>> >>> thanks for the replies, >>> >>> 1. for Guaranteed QoS >>> >>> I may missed some words in the original doc, the idea I would like to >>> convey is that we essentially need to give up this idea due to what cannot >>> achieve Guaranteed QoS as we already have different values of memory and >>> even we only consider CPU request/limit values, it brings other risks to us >>> >>> Additionally , priorityClassName is NOT the solution here. What we >>> really worry about is NOT eviction, but Linux Kernel OOMKiller where we >>> cannot pass pod priority information into. With burstable pods, the >>> only thing Linux Kernel OOMKiller considers is the memory request size >>> which not necessary maps to priority information >>> >>> 2. The "Zero-Guarantee" Edge Case >>> >>> Actually, a lot of savings are from these 0 memory overhead pods... I am >>> curious if you have adopted the PoC PR in prod as you have identified it is >>> "unsafe"? >>> >>> Something like a minGuaranteedRatio is not a good idea , it will mess >>> up the original design idea of the formula (check Appendix C), the simplest >>> thing you can do is to avoid rolling out features to the jobs which you >>> feel will be unsafe... >>> >>> 3. Native Execution Gap (Off-Heap) >>> >>> I am not sure the off-heap memory usage of gluten/comet shows the >>> same/similar pattern as memoryOverhead. No one has validated that in >>> production environment, but both PINS/Bytedance has validated >>> memoryOverhead part thoroughly in their clusters >>> >>> Additionally, the key design of the proposal is to capture the >>> relationship between on-heap and memoryOverhead sizes, in another word, >>> they co-exist.... offheap memory used by native engines are different >>> stories where , ideally, the on-heap usage should be minimum and most of >>> memory usage should come from off-heap part...so the formula here may not >>> work out of box >>> >>> my suggestion is, since the community has approved the original design >>> which have been tested by at least 2 companies in production environments, >>> we go with the current design and continue code review , in future, we can >>> add what have been found/tested in production as followups >>> >>> Thanks >>> >>> Nan >>> >>> >>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]> >>> wrote: >>> >>>> Hi Yao, Nan, and Chao, >>>> >>>> Thank you for this proposal though I already approved. The >>>> cost-efficiency goals are very compelling, and the cited $6M annual savings >>>> at Pinterest clearly demonstrates the value of moving away from rigid >>>> peak provisioning. >>>> >>>> However, after modeling the proposed design against standard Kubernetes >>>> behavior and modern Spark workloads, I have identified *three critical >>>> stability risks* that need to be addressed before this is finalized. >>>> >>>> I have drafted a *Supplementary Design Amendment* (linked >>>> below/attached) that proposes fixes for these issues, but here is the >>>> summary: 1.
The "Guaranteed QoS" Contradiction >>>> >>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for >>>> stability risks. >>>> >>>> The Issue: Technically, this mitigation is impossible under your >>>> proposal. >>>> >>>> - >>>> >>>> In Kubernetes, a Pod is assigned the *Guaranteed* QoS class *only* >>>> if Request == Limit for both CPU and Memory. >>>> - >>>> >>>> Your proposal explicitly sets Memory Request < Memory Limit >>>> (specifically $H+G < H+O$). >>>> >>>> - >>>> >>>> *Consequence:* This configuration *automatically downgrades* the >>>> Pod to the *Burstable* QoS class. In a multi-tenant cluster, the >>>> Kubelet eviction manager will kill these "Burstable" Spark pods >>>> *before* any Guaranteed system pods during node pressure. >>>> - >>>> >>>> *Proposed Fix:* We cannot rely on Guaranteed QoS. We must introduce >>>> a priorityClassName configuration to offset this eviction risk. >>>> >>>> 2. The "Zero-Guarantee" Edge Case >>>> >>>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ has a dangerous >>>> edge case for High-Heap/Low-Overhead jobs (common in ETL). >>>> >>>> >>>> - >>>> >>>> *Scenario:* If a job has a large Heap ($H$) relative to Overhead ( >>>> $O$), the calculated burst deduction often exceeds the total >>>> Overhead. >>>> - >>>> >>>> *Result:* The formula yields *$G = 0$*. >>>> - >>>> >>>> *Risk:* Allocating 0MB of guaranteed overhead is unsafe. Essential >>>> JVM operations (thread stacks, Netty control buffers) require a non-zero >>>> baseline. Relying 100% on a shared burst pool for basic functionality >>>> will >>>> lead to immediate container failures if the node is contended. >>>> - >>>> >>>> *Proposed Fix:* Implement a safety floor using a minGuaranteedRatio >>>> (e.g., max(Calculated_G, O * 0.1)). >>>> >>>> 3. Native Execution Gap (Off-Heap) >>>> >>>> The proposal focuses entirely on memoryOverhead. >>>> >>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift >>>> execution memory to spark.memory.offHeap.size. This memory is equally >>>> "bursty" but is excluded from your optimization. >>>> >>>> *Proposed Fix:* The burst-aware logic should be extensible to include >>>> Off-Heap memory if enabled. >>>> >>>> >>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing >>>> >>>> >>>> I believe these changes are necessary to make the feature robust enough >>>> for general community adoption beyond specific controlled environments. >>>> >>>> >>>> Regards, >>>> >>>> Viquar Khan >>>> >>>> Sr Data Architect >>>> >>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>> >>>> >>>> >>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote: >>>>> +1 >>>>> >>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> wrote: >>>>>> +1 >>>>>> >>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> >>>>>> wrote: >>>>>>> +1 from me. >>>>>>> I think it's well-scoped and takes advantage of Kubernetes' features >>>>>>> exactly for what they are designed for(as per my understanding). >>>>>>> >>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote: >>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the >>>>>>>> detailed and thoughtful discussion. >>>>>>>> >>>>>>>> Overall, this looks like a valuable addition for organizations >>>>>>>> running Spark on Kubernetes, especially given how bursty >>>>>>>> memoryOverhead usage tends to be in practice.
I appreciate that >>>>>>>> the change is relatively small in scope and fully opt-in, which helps >>>>>>>> keep >>>>>>>> the risk low. >>>>>>>> >>>>>>>> From my perspective, the questions raised on the thread and in the >>>>>>>> SPIP have been addressed. If others feel the same, do we have >>>>>>>> consensus to >>>>>>>> move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya. >>>>>>>> >>>>>>>> Best, >>>>>>>> Chao >>>>>>>> >>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> this is a good question >>>>>>>>> >>>>>>>>> > a stage is bursty and consumes the shared portion and fails to >>>>>>>>> release it for subsequent stages >>>>>>>>> >>>>>>>>> in the scenario you described, since the memory-leaking stage and >>>>>>>>> the subsequence ones are from the same job , the pod will likely be >>>>>>>>> killed >>>>>>>>> by cgroup oomkiller >>>>>>>>> >>>>>>>>> taking the following as the example >>>>>>>>> >>>>>>>>> the usage pattern is G = 5GB S = 2GB, it uses G + S at max and in >>>>>>>>> theory, it should release all 7G and then claim 7G again in some later >>>>>>>>> stages, however, due to the memory peak, it holds 2G forever and ask >>>>>>>>> for >>>>>>>>> another 7G, as a result, it hits the pod memory limit and cgroup >>>>>>>>> oomkiller will take action to terminate the pod >>>>>>>>> >>>>>>>>> so this should be safe to the system >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> however, we should be careful about the memory peak for sure, >>>>>>>>> because it essentially breaks the assumption that the usage of >>>>>>>>> memoryOverhead is bursty (memory peak ~= use memory forever)... >>>>>>>>> unfortunately, shared/guaranteed memory is managed by user >>>>>>>>> applications >>>>>>>>> instead of on cluster level , they, especially S, are just logical >>>>>>>>> concepts instead of a physical memory pool which pods can explicitly >>>>>>>>> claim >>>>>>>>> memory from... >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Thanks for the interesting proposal. >>>>>>>>>> The design seems to rely on memoryOverhead being transient. >>>>>>>>>> What happens when a stage is bursty and consumes the shared >>>>>>>>>> portion and fails to release it for subsequent stages (e.g., >>>>>>>>>> off-heap >>>>>>>>>> buffers and its not garbage collected since its off-heap)? Would this >>>>>>>>>> trigger the host-level OOM like described in Q6? or are there >>>>>>>>>> strategies to >>>>>>>>>> release the shared portion? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> yes, that's the worst case in the scenario, please check my >>>>>>>>>>> earlier response to Qiegang's question, we have a set of strategies >>>>>>>>>>> adopted >>>>>>>>>>> in prod to mitigate the issue >>>>>>>>>>> >>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the explanation! So the executor is not guaranteed >>>>>>>>>>>> to get 50 GB physical memory, right? All pods on the same host may >>>>>>>>>>>> reach >>>>>>>>>>>> peak memory usage at the same time and cause paging/swapping which >>>>>>>>>>>> hurts >>>>>>>>>>>> performance? >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> np, let me try to explain >>>>>>>>>>>>> >>>>>>>>>>>>> 1. 
Each executor container will be run in a pod together with >>>>>>>>>>>>> some other sidecar containers taking care of tasks like >>>>>>>>>>>>> authentication, >>>>>>>>>>>>> etc. , for simplicity, we assume each pod has only one container >>>>>>>>>>>>> which is >>>>>>>>>>>>> the executor container >>>>>>>>>>>>> >>>>>>>>>>>>> 2. Each container is assigned with two values, r*equest&limit** >>>>>>>>>>>>> (limit >>>>>>>>>>>>> >= request),* for both of CPU/memory resources (we only >>>>>>>>>>>>> discuss memory here). Each pod will have request/limit values as >>>>>>>>>>>>> the sum of >>>>>>>>>>>>> all containers belonging to this pod >>>>>>>>>>>>> >>>>>>>>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on >>>>>>>>>>>>> *request* value, and cap the resource usage of each container >>>>>>>>>>>>> based on their *limit* value, e.g. if I have a pod with a >>>>>>>>>>>>> single container in it , and it has 1G/2G as request and limit >>>>>>>>>>>>> value >>>>>>>>>>>>> respectively, any machine with 1G free RAM space will be a >>>>>>>>>>>>> candidate to >>>>>>>>>>>>> host this pod, and when the container use more than 2G memory, it >>>>>>>>>>>>> will be >>>>>>>>>>>>> killed by cgroup oomkiller. Once a pod is scheduled to a host, >>>>>>>>>>>>> the memory >>>>>>>>>>>>> space sized at "sum of all its containers' request values" will >>>>>>>>>>>>> be booked >>>>>>>>>>>>> exclusively for this pod. >>>>>>>>>>>>> >>>>>>>>>>>>> 4. By default, Spark *sets request/limit as the same value >>>>>>>>>>>>> for executors in k8s*, and this value is basically >>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most >>>>>>>>>>>>> cases . >>>>>>>>>>>>> However, spark.executor.memoryOverhead usage is very bursty, the >>>>>>>>>>>>> user >>>>>>>>>>>>> setting spark.executor.memoryOverhead as 10G usually means each >>>>>>>>>>>>> executor >>>>>>>>>>>>> only needs 10G in a very small portion of the executor's whole >>>>>>>>>>>>> lifecycle >>>>>>>>>>>>> >>>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple request/limit >>>>>>>>>>>>> value in spark@k8s for executors in a safe way (this idea is >>>>>>>>>>>>> from the bytedance paper we refer to in SPIP paper). >>>>>>>>>>>>> >>>>>>>>>>>>> Using the aforementioned example , >>>>>>>>>>>>> >>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we have >>>>>>>>>>>>> two pods requesting 40G + 10G (on-heap + memoryOverhead) and we >>>>>>>>>>>>> set bursty >>>>>>>>>>>>> factor to 1.2, without the mechanism proposed in this SPIP, we >>>>>>>>>>>>> can at most >>>>>>>>>>>>> host 2 pods with this machine, and because of the bursty usage of >>>>>>>>>>>>> that 10G >>>>>>>>>>>>> space, the memory utilization would be compromised. >>>>>>>>>>>>> >>>>>>>>>>>>> When applying the burst-aware memory allocation, we only >>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, >>>>>>>>>>>>> i.e. we >>>>>>>>>>>>> have 20G free memory space left in the machine which can be used >>>>>>>>>>>>> to host >>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the >>>>>>>>>>>>> limit value of >>>>>>>>>>>>> the executor pods, these executors can still use 50G at max. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does >>>>>>>>>>>>>> it work under the hood? 
The container will adjust its system >>>>>>>>>>>>>> memory size >>>>>>>>>>>>>> depending on the actual memory usage of the processes in this >>>>>>>>>>>>>> container? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> yeah, we have a few cases that we have significantly larger >>>>>>>>>>>>>>> O than H, the proposed algorithm is actually a great fit >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed >>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure the safety of >>>>>>>>>>>>>>> running but >>>>>>>>>>>>>>> still cut a big chunk of memory (10s of GBs) and treat them as >>>>>>>>>>>>>>> S , saving >>>>>>>>>>>>>>> tons of money burnt by them >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> but regarding native accelerators, some native acceleration >>>>>>>>>>>>>>> engines do not use memoryOverhead but use off-heap >>>>>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The >>>>>>>>>>>>>>> current >>>>>>>>>>>>>>> implementation does not cover this part , while that will be an >>>>>>>>>>>>>>> easy >>>>>>>>>>>>>>> extension >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for the reply. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than H? >>>>>>>>>>>>>>>> Wondering if the proposed algorithm would help more in those >>>>>>>>>>>>>>>> environments >>>>>>>>>>>>>>>> (eg. with >>>>>>>>>>>>>>>> native accelerators)? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> please check the following answer >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use the >>>>>>>>>>>>>>>>> > Executor >>>>>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which >>>>>>>>>>>>>>>>> allows for better resource packing. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> yes, your understanding is correct >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the >>>>>>>>>>>>>>>>> total potential usage sum of H+G+S across all pods on a node >>>>>>>>>>>>>>>>> exceeds its >>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on >>>>>>>>>>>>>>>>> the cluster >>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer >>>>>>>>>>>>>>>>> exists on the node >>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies, setting >>>>>>>>>>>>>>>>> conservative bursty factor, progressive rollout, monitor the >>>>>>>>>>>>>>>>> cluster >>>>>>>>>>>>>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us to >>>>>>>>>>>>>>>>> the optimal >>>>>>>>>>>>>>>>> setup of bursty factor... 
in usual, K8S operators will set a >>>>>>>>>>>>>>>>> reserved space >>>>>>>>>>>>>>>>> for daemon processes on each host, we found it is sufficient >>>>>>>>>>>>>>>>> to in our case >>>>>>>>>>>>>>>>> and our major tuning focuses on bursty factor value >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure a >>>>>>>>>>>>>>>>> strategic mix of executors with large S and small S values on >>>>>>>>>>>>>>>>> a single >>>>>>>>>>>>>>>>> node? I am wondering if this would reduce the probability of >>>>>>>>>>>>>>>>> concurrent >>>>>>>>>>>>>>>>> bursting and host-level OOM. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes, when we work on this project, we put some attention >>>>>>>>>>>>>>>>> on the cluster scheduling policy/behavior... two things we >>>>>>>>>>>>>>>>> mostly care >>>>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have >>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have >>>>>>>>>>>>>>>>> enough candidates >>>>>>>>>>>>>>>>> to form a mixed set of executors with large S and >>>>>>>>>>>>>>>>> small S values >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which tends >>>>>>>>>>>>>>>>> to pack more pods from the same job to the same host, which >>>>>>>>>>>>>>>>> can create >>>>>>>>>>>>>>>>> troubles as they are more likely to ask for max memory at the >>>>>>>>>>>>>>>>> same time >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use the >>>>>>>>>>>>>>>>>> Executor >>>>>>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which >>>>>>>>>>>>>>>>>> allows for better resource packing. I have a few >>>>>>>>>>>>>>>>>> questions regarding the shared portion S: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when >>>>>>>>>>>>>>>>>> the total potential usage sum of H+G+S across all pods >>>>>>>>>>>>>>>>>> on a node exceeds >>>>>>>>>>>>>>>>>> its allocatable capacity? Does the proposal implicitly >>>>>>>>>>>>>>>>>> rely on the cluster >>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer >>>>>>>>>>>>>>>>>> exists on the node >>>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to >>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and >>>>>>>>>>>>>>>>>> small S values on a single node? I am wondering if >>>>>>>>>>>>>>>>>> this would reduce the probability of concurrent bursting >>>>>>>>>>>>>>>>>> and host-level OOM. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big picture: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formular >>>>>>>>>>>>>>>>>>> indicates a fixed heap size, and memory overhead can't >>>>>>>>>>>>>>>>>>> be dynamic if it's >>>>>>>>>>>>>>>>>>> on-heap. >>>>>>>>>>>>>>>>>>> - Do Spark applications have static profiles? When >>>>>>>>>>>>>>>>>>> we submit stages, the cluster is already allocated, how >>>>>>>>>>>>>>>>>>> can we change >>>>>>>>>>>>>>>>>>> anything? 
>>>>>>>>>>>>>>>>>>> - How do we assign the shared memory overhead? >>>>>>>>>>>>>>>>>>> Fairly among all applications on the same physical node? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc >>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 of >>>>>>>>>>>>>>>>>>>> the SPIP doc >>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> it is calculated based on per profile (you can say it >>>>>>>>>>>>>>>>>>>> is based on per stage), when the cluster manager compose >>>>>>>>>>>>>>>>>>>> the pod spec, it >>>>>>>>>>>>>>>>>>>> calculates the new memory overhead based on what user asks >>>>>>>>>>>>>>>>>>>> for in that >>>>>>>>>>>>>>>>>>>> resource profile >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the >>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per executor? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the >>>>>>>>>>>>>>>>>>>>>> request/limit concept in K8S, ... >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming in >>>>>>>>>>>>>>>>>>>>>> future, as long as it has a similar concept , it can >>>>>>>>>>>>>>>>>>>>>> leverage this easily >>>>>>>>>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it >>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi Folks, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead >>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve >>>>>>>>>>>>>>>>>>>>>>>> memory utilization of spark clusters. >>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc >>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. >>>>>>>>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature. >>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original >>>>>>>>>>>>>>>>>>>>>>>> paper >>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from >>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected]) >>>>>>>>>>>>>>>>>>>>>>>> and Yixin([email protected]). >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thank you. 
>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> >>>> >> >> -- >> Regards, >> Vaquar Khan >> >> -- Regards, Vaquar Khan
