Hi Nan, Yao, and Chao,

I have done a deep dive into the underlying Linux kernel and Kubernetes
behaviors to validate our respective positions. While I fully support the
economic goal of reclaiming the estimated 30-50% of stranded memory in
static clusters, the technical evidence suggests that the "Zero-Guarantee"
configuration is not just an optimization choice: it is architecturally
unsafe for standard Kubernetes environments because of how OOM scores are
calculated for Burstable pods.

I am sharing these findings to explain why I have insisted on the *Safety
Floor (minGuaranteedRatio)* as a necessary guardrail.

*1. The "Death Trap" of OOM Scores (The Math)* Nan mentioned that
"Zero-Guarantee" pods work fine in Pinterest's environment. However, in a
standard environment, the math works against us. The Linux kernel
calculates oom_score_adj inversely to the Request size: 1000 - (1000 *
Request / Capacity).

   - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the
   Request), we mathematically inflate the OOM score. On a typically sized
   node, a Zero-Guarantee pod ends up with a significantly higher OOM score
   (more likely to be killed) than a standard pod; see the sketch below.
   - *The Consequence:* Under node memory pressure, the kernel will
   deterministically target these "optimized" Spark pods for termination
   *before* their neighbors, regardless of our intent.
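
To make the math concrete, here is a minimal Scala sketch of the Kubelet's
Burstable-pod formula. The node capacity and pod sizes are illustrative
assumptions, not measurements from any cluster:

    // Sketch only: the oom_score_adj the Kubelet assigns to Burstable pods,
    // evaluated for two hypothetical pods on an assumed 128 GiB node.
    object OomScoreSketch {
      // Kubelet: min(max(2, 1000 - (1000 * request) / capacity), 999)
      def oomScoreAdj(requestBytes: Long, capacityBytes: Long): Long =
        math.min(math.max(2L, 1000L - (1000L * requestBytes) / capacityBytes), 999L)

      def main(args: Array[String]): Unit = {
        val nodeCapacity  = 128L << 30 // assumed 128 GiB allocatable
        val standardPod   = 105L << 30 // Request = 100G heap + 5G overhead
        val zeroGuarantee = 100L << 30 // Request = heap only (G = 0)
        println(oomScoreAdj(standardPod, nodeCapacity))   // 180
        println(oomScoreAdj(zeroGuarantee, nodeCapacity)) // 219 -> killed first
      }
    }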

*2. The "Smoking Gun": Kubelet Bug #131169* There is a known defect in the
Kubelet (Issue #131169) where *PriorityClass is ignored when calculating
OOM scores for Burstable pods*.

   - This invalidates the assumption that we can simply "manage" the risk
   with priorities later.
   - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is
   effectively indistinguishable from a "Best Effort" pod in the eyes of the
   OOM killer.
   - *Conclusion:* We should enforce a minimum memory floor to keep the
   Request value high enough to secure a survivable OOM score.

*3. Silent Failures (Thread Exhaustion)*

The research confirms that "Zero-Guarantee" creates a vector for
java.lang.OutOfMemoryError: unable to create new native thread.

   - If a pod lands on a node with just enough RAM for the Heap (Request)
   but zero headroom for the OS and JVM, pthread_create will fail as soon as
   the executor tries to spawn a thread; see the rough sizing sketch below.
   - This results in "silent" application crashes that do not trigger
   standard K8s OOM alerts, leading to hard-to-debug support scenarios for
   general users.
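
As a rough illustration of why zero non-heap headroom fails, consider
thread stacks alone. Both numbers below are assumptions for illustration
(the actual -Xss default and thread count vary by platform and workload):

    // Back-of-envelope sketch: non-heap memory consumed by thread stacks,
    // which a Zero-Guarantee pod is never promised by its Request.
    object ThreadStackBudget {
      def main(args: Array[String]): Unit = {
        val stackBytesPerThread = 1L << 20 // typical -Xss default: 1 MiB
        val executorThreads     = 300      // hypothetical: task, Netty, GC, JIT threads
        val totalMiB = (executorThreads * stackBytesPerThread) >> 20
        println(s"~$totalMiB MiB of non-heap memory for stacks alone") // ~300 MiB
      }
    }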

*Final Proposal & Documentation Compromise*

My strong preference is to add the *Safety Floor (minGuaranteedRatio)*
configuration to the code.
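
For clarity, here is a sketch of what I have in mind, assuming the config
would follow Spark's usual ConfigBuilder pattern. The name, placement, and
default are my proposal, not merged code:

    // Proposed (not merged) config sketch, in Spark's ConfigBuilder style.
    import org.apache.spark.internal.config.ConfigBuilder

    val MIN_GUARANTEED_RATIO =
      ConfigBuilder("spark.kubernetes.executor.memoryOverhead.minGuaranteedRatio")
        .doc("Minimum fraction of memoryOverhead that must remain in the pod " +
          "Request (guaranteed) regardless of the bursty-factor deduction. " +
          "The default 0.0 preserves the SPIP formula exactly.")
        .doubleConf
        .checkValue(r => r >= 0.0 && r <= 1.0, "must be in [0.0, 1.0]")
        .createWithDefault(0.0)

    // Applying the floor to the formula's output: G' = max(G, O * ratio)
    def guaranteedWithFloor(g: Long, o: Long, ratio: Double): Long =
      math.max(g, (o * ratio).toLong)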

However, if after reviewing this evidence you are *adamant* that no new
configurations should be added to the code, I am willing to *unblock the
vote* on one strict condition:

*The SPIP and Documentation must explicitly flag this risk.* We cannot
simply leave this as an implementation detail. The documentation must
contain a "Critical Warning" block stating:

*"Warning: High-Heap/Low-Overhead configurations may result in 0MB
guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may
bypass PriorityClass protections and lead to silent 'Native Thread'
exhaustion failures on contended nodes. Users are responsible for
validating stability."*
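
Independently of which option we choose, users who adopt the feature as-is
can pre-check whether a job degenerates to zero guarantee using the simple
arithmetic Nan suggested earlier in this thread. A sketch, with
illustrative inputs:

    // Pre-submission check: compute the guaranteed overhead G for a job
    // and skip the feature when it collapses to 0. Inputs are examples.
    def guaranteedOverheadGb(heapGb: Double, overheadGb: Double, bursty: Double): Double =
      overheadGb - math.min((heapGb + overheadGb) * (bursty - 1.0), overheadGb)

    val g = guaranteedOverheadGb(100.0, 5.0, 1.06) // 0.0 for the 100G/5G ETL example
    if (g <= 0.0) println("Zero-Guarantee: do not enable the feature for this job")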

If you agree to either the code change (preferred) or this specific
documentation warning, please update the SPIP doc; I am happy to support.


Regards,

Viquar Khan

Sr Data Architect

https://www.linkedin.com/in/vaquar-khan-b695577/



On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:

> 1. Re: "Imagined Reasons" & Zero Overhead
> when I said "imagined reasons", I meant that I didn't see the issue you
> described appear in a prod environment running millions of jobs every
> month, and I have also explained why it won't happen in PINS and other
> normal cases: in a K8S cluster there will be reserved space for system
> daemons on each host, so even with many 0-memoryOverhead jobs, nodes won't
> be "fully packed" as you imagined, since these 0-memoryOverhead jobs don't
> need much memory overhead space anyway
>
> let me bring up my earlier suggestion again: if you don't want any job to
> have 0 memoryOverhead, you can just calculate how much memoryOverhead is
> guaranteed with simple arithmetic; if it is 0, do not use this feature
>
> In general, I don't really suggest using this feature if you cannot
> manage the rollout process, just like no one should apply something like
> auto-tuning to all of their jobs without a dedicated Spark platform team.
>
> 2. Kubelet Eviction Relevance
>
> 2.a my question is: how is PID/Disk pressure related to the memory-related
> feature we are discussing here? please don't fan the discussion scope out
> without limit
> 2.b exposing spark.kubernetes.executor.bursty.priorityClassName is far
> from a reasonable design; the priority class name should be controlled at
> the cluster level and then specified via something like the Spark operator,
> or via the pod spec if you can specify one, instead of being embedded in a
> memory-related feature
>
> 3. Can we agree to simply *add these two parameters as optional
> configurations*?
>
> unfortunately no...
>
> some of the problems you raised will probably happen only in very extreme
> cases, and I have provided solutions to them without the need to add
> additional configs... Other problems you raised are not related to what
> this SPIP is about, e.g. PID exhaustion, etc., and some of your proposed
> design doesn't make sense to me, e.g. specifying the executor's priority
> class via such a memory-related feature....
>
>
> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]>
> wrote:
>
>> Hi Nan,
>>
>> Thanks for the candid response. I see where you are coming from regarding
>> managed rollouts, but I think we are viewing this from two different
>> lenses: "Internal Platform" vs. "General Open Source Product."
>>
>> Here is why I am pushing for these two specific configuration hooks:
>>
>> 1. Re: "Imagined Reasons" & Zero Overhead
>>
>> You mentioned that you have observed jobs running fine with zero
>> memoryOverhead.
>>
>>    While that may be true for specific workloads in your environment, the
>>    requirement for non-heap memory is not "imagined"; it is inherent to
>>    the JVM. Thread stacks, CodeCache, and Netty DirectByteBuffer control
>>    structures must live in non-heap memory.
>>
>>    - *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully
>>    bin-packed (Sum of Requests = Node Capacity), your executor is
>>    mathematically guaranteed *zero bytes* of non-heap memory unless it
>>    can steal from the burst pool.
>>    - *The Risk:* If the burst pool is temporarily exhausted by neighbors,
>>    a simple thread creation will throw OutOfMemoryError: unable to
>>    create new native thread.
>>    - *The Fix:* I am not asking to change your default behavior. I am
>>    asking to *expose the config* (minGuaranteedRatio). If you set it to
>>    0.0 (default), your behavior is unchanged. But for those of us
>>    running high-concurrency environments who need a 5-10% safety buffer
>>    for thread stacks, we need the *capability* to configure it without
>>    maintaining a fork or writing complex pre-submission wrappers.
>>
>> 2. Re: Kubelet Eviction Relevance
>>
>> You asked how Disk/PID pressure is related.
>>
>> In Kubernetes, PriorityClass is the universal signal for pod importance
>> during any node-pressure event (not just memory).
>>
>>    - If a node runs out of Ephemeral Storage (common with Spark Shuffle),
>>    the Kubelet evicts pods.
>>    - Without a priorityClassName config, these Spark pods (which are now
>>    QoS-downgraded to Burstable) will be evicted *before* Best-Effort
>>    jobs that might have a higher priority class.
>>    - Again, this is a standard Kubernetes spec feature. There is no
>>    downside to exposing
>>    spark.kubernetes.executor.bursty.priorityClassName as an optional
>>    config.
>>
>> *Proposal to Unblock*
>>
>> We both want this feature merged. I am not asking to change your
>> formula's default behavior.
>>
>> Can we agree to simply *add these two parameters as optional
>> configurations*?
>>
>>    1. minGuaranteedRatio (Default: 0.0 -> preserves your logic exactly).
>>    2. priorityClassName (Default: null -> preserves your logic exactly).
>>
>> This satisfies your design goals while making the feature robust enough
>> for my production requirements.
>>
>>
>> Regards,
>>
>> Viquar Khan
>>
>> Sr Data Architect
>>
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>>
>>
>> On Tue, 30 Dec 2025 at 01:04, Nan Zhu <[email protected]> wrote:
>>
>>> > However, I maintain that for a general-purpose open-source feature
>>> (which will be used by teams without dedicated platform engineers to manage
>>> rollouts), we need structural safety guardrails.
>>>
>>> I am not sure we can roll out such a feature to all jobs without a
>>> managed rollout process; that is an anti-pattern in any engineering org.
>>> This feature is disabled by default, which is already a guard that
>>> prevents users from silently getting into something they don't expect
>>>
>>>
>>> > A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing
>>> up the design"—it is mathematically necessary to prevent the formula from
>>> collapsing to zero in valid production scenarios.
>>>
>>> this formula *IS* designed to output 0 in some cases, so it is *NOT*
>>> collapsing to zero. I have observed that, even with 0 memoryOverhead in
>>> many jobs, a proper bursty factor saved tons of money in a real PROD
>>> environment, not just in my imagination. If you don't want any
>>> memoryOverhead to be zero in your jobs, for your imagined reasons, you
>>> can just calculate your own threshold for the on-heap/memoryOverhead
>>> ratio for rolling out
>>>
>>> step back... if your team doesn't know how to manage a rollout, most
>>> likely you are rolling out this feature for individual jobs without a
>>> centralized feature rollout point, right? then, you can just use simple
>>> arithmetic to calculate whether the resulting memoryOverhead is 0; if
>>> yes, don't use this feature, that's it....
>>>
>>>
>>> > However, Kubelet eviction is the primary mechanism for other pressure
>>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios
>>> where memory.available crosses the eviction threshold before the kernel
>>> panics.
>>>
>>> How are they related to this feature?
>>>
>>>
>>> On Mon, Dec 29, 2025 at 10:37 PM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi Nan,
>>>>
>>>> Thanks for the detailed reply. I appreciate you sharing the specific
>>>> context from the Pinterest implementation—it helps clarify the operational
>>>> model you are using.
>>>>
>>>> However, I maintain that for a general-purpose open-source feature
>>>> (which will be used by teams without dedicated platform engineers to manage
>>>> rollouts), we need structural safety guardrails.
>>>>
>>>> *Here is my response to your points:*
>>>>
>>>> 1. Re: "Zero-Guarantee" & Safety (Critical)
>>>>
>>>> You suggested that "setting a conservative bursty factor" resolves the
>>>> risk of zero-guaranteed overhead.
>>>>
>>>> Mathematically, this is incorrect for High-Heap jobs. The formula is
>>>> structural: $G = O - \min((H+O) \times (B-1), O)$.
>>>>
>>>> Consider a standard ETL job: Heap (H) = 100GB, Overhead (O) = 5GB.
>>>>
>>>> Even if we set a very conservative Bursty Factor (B) of 1.06 (only 6%
>>>> burst):
>>>>
>>>>    - Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 =
>>>>    6.3$ GB.
>>>>    - Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*.
>>>>
>>>> Even with an extremely conservative factor, the design forces this pod
>>>> to have zero guaranteed memory for OS/JVM threads. This is not a tuning
>>>> issue; it is a formulaic edge case for high-memory jobs.
>>>>
>>>> * A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing
>>>> up the design"—it is mathematically necessary to prevent the formula from
>>>> collapsing to zero in valid production scenarios.*
>>>>
>>>> 2. Re: Kubelet Eviction vs. OOMKiller
>>>>
>>>> I concede that in sudden memory spikes, the Kernel OOMKiller often acts
>>>> faster than Kubelet eviction.
>>>>
>>>> However, Kubelet eviction is the primary mechanism for other pressure
>>>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios
>>>> where memory.available crosses the eviction threshold before the kernel
>>>> panics.
>>>>
>>>> * Adding priorityClassName support to the Pod spec is a low-effort,
>>>> zero-risk change that aligns with Kubernetes best practices for "Defense in
>>>> Depth." It costs nothing to expose this config.*
>>>>
>>>> 3. Re: Native Support
>>>>
>>>> Fair point. To keep the scope tight, I am happy to drop the Native
>>>> Support request for this SPIP. We can treat that as a separate follow-up.
>>>>
>>>> Path Forward
>>>>
>>>> I am happy to support  if we can agree to:
>>>>
>>>>    1. Add the minGuaranteedRatio config (to handle the High-Heap math
>>>>    proven above).
>>>>    2. Expose the priorityClassName config (standard K8S practice).
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Viquar Khan
>>>>
>>>> Sr Data Architect
>>>>
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>>
>>>> On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote:
>>>>
>>>>> > Kubelet Eviction is the first line of defense before the Kernel
>>>>> OOMKiller strikes.
>>>>>
>>>>> This is *NOT* true. Eviction will first kill some best-effort pod,
>>>>> which doesn't make any difference to memory pressure in most cases,
>>>>> and before it takes action again, the Kernel OOMKiller has already
>>>>> killed some executor pods. This is exactly the reason for me to say
>>>>> that we don't really worry about eviction here: before eviction
>>>>> touches those executors, the OOMKiller has already killed them. This
>>>>> behavior is consistently observed, and we have also had discussions
>>>>> with other companies who had to modify Kernel code to mitigate it.
>>>>>
>>>>> > Re: "Zero-Guarantee" & Safety
>>>>>
>>>>> you basically want to trade off savings against system safety; then
>>>>> why not just set a conservative value for the bursty factor? it is
>>>>> exactly what we did in PINS, please check my earlier response in the
>>>>> thread... key part as follows:
>>>>>
>>>>> "in PINS, we basically apply a set of strategies, setting
>>>>> conservative bursty factor, progressive rollout, monitor the cluster
>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us to the optimal
>>>>> setup of bursty factor... in usual, K8S operators will set a reserved 
>>>>> space
>>>>> for daemon processes on each host, we found it is sufficient to in our 
>>>>> case
>>>>> and our major tuning focuses on bursty factor value "
>>>>>
>>>>> If you really want, you can enable this feature only for jobs when
>>>>> OnHeap/MemoryOverhead is smaller than a certain value...
>>>>>
>>>>> I just didn't see the value of bringing another configuration
>>>>>
>>>>>
>>>>> > Re: Native Support
>>>>>
>>>>> I mean....this SPIP is NOT about native execution engine's memory
>>>>> pattern at all..... why do we bother to bring it up....
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Nan,
>>>>>>
>>>>>> Thanks for the prompt response and for clarifying the design intent.
>>>>>>
>>>>>> I understand the goal is to maximize savings, and I agree we
>>>>>> shouldn't block the current momentum (as you can see, my vote is +1),
>>>>>> but I want to ensure we aren't over-optimizing for specific internal
>>>>>> environments at the cost of general community stability.
>>>>>>
>>>>>> *Here is my rejoinder on the technical points:*
>>>>>>
>>>>>> 1. Re: PriorityClass & OOMKiller (Defense in Depth)
>>>>>>
>>>>>> You mentioned that “priorityClassName is NOT the solution... What we
>>>>>> worry about is the Linux Kernel OOMKiller.”
>>>>>>
>>>>>> I agree that the Kernel OOMKiller (cgroup) primarily looks at
>>>>>> oom_score_adj (which is determined by QoS Class). However, Kubelet 
>>>>>> Eviction
>>>>>> is the first line of defense before the Kernel OOMKiller strikes.
>>>>>>
>>>>>> When a node comes under memory pressure (e.g., memory.available
>>>>>> drops below evictionHard), the *Kubelet* actively selects pods to
>>>>>> evict to reclaim resources. Unlike the Kernel, the Kubelet *does*
>>>>>> explicitly use PriorityClass when ranking candidates for eviction.
>>>>>>
>>>>>>    - *The Risk:* Since we are downgrading these pods to *Burstable*
>>>>>>    (increasing their OOM risk), we lose the "Guaranteed" protection
>>>>>>    shield.
>>>>>>    - *The Fix:* By assigning a high PriorityClass, we ensure that if
>>>>>>    the Kubelet needs to free space, it evicts lower-priority batch
>>>>>>    jobs *before* these Spark executors. It is a necessary "Defense in
>>>>>>    Depth" strategy for multi-tenant clusters that prevents optimized
>>>>>>    Spark jobs from being the first victims of node pressure.
>>>>>>
>>>>>> 2. Re: "Zero-Guarantee" & Safety
>>>>>>
>>>>>> You noted that “savings come from these 0 memory overhead pods.”
>>>>>>
>>>>>> While G=0 maximizes "on-paper" savings, it is theoretically unsafe
>>>>>> for a JVM. A JVM physically requires non-heap memory for Thread Stacks,
>>>>>> CodeCache, and Metaspace just to run.
>>>>>>
>>>>>>    - *The Reality:* If G=0, then Pod Request == Heap. If a node is
>>>>>>    fully packed (Sum of Requests ≈ Node Capacity), the pod relies
>>>>>>    *100%* on the burst pool for basic thread allocation. If
>>>>>>    neighbors are noisy, that pod cannot even spawn a thread.
>>>>>>    - *The Compromise:* I strongly suggest we add the configuration
>>>>>>    spark.executor.memoryOverhead.minGuaranteedRatio but set the
>>>>>>    *default to 0.0*.
>>>>>>       - This preserves your logic/savings by default.
>>>>>>       - But it gives platform admins a "safety knob" to turn (e.g.,
>>>>>>       to 0.1) when they inevitably encounter instability in
>>>>>>       high-contention environments, without needing a code patch.
>>>>>>
>>>>>> 3. Re: Native Support
>>>>>>
>>>>>> Agreed. We can treat Off-Heap support as a follow-up item. I would
>>>>>> just request that we add a "Known Limitation" note in the SPIP stating 
>>>>>> that
>>>>>> this optimization does not yet apply to spark.memory.offHeap.size, so 
>>>>>> users
>>>>>> of Gluten/Velox are aware.
>>>>>>
>>>>>> I am happy to support the PR moving forward if we can agree to
>>>>>> include the *PriorityClass* config support and the *Safety Floor*
>>>>>> config (even if disabled by default). Please update your SPIP. This
>>>>>> ensures the feature is robust enough for the wider user base.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Viquar Khan
>>>>>>
>>>>>> Sr Data Architect
>>>>>>
>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>
>>>>>> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, Vaquar
>>>>>>>
>>>>>>> thanks for the replies,
>>>>>>>
>>>>>>> 1. for Guaranteed QoS
>>>>>>>
>>>>>>> I may have missed some words in the original doc; the idea I would
>>>>>>> like to convey is that we essentially need to give up this idea
>>>>>>> because we cannot achieve Guaranteed QoS, as we already have
>>>>>>> different request/limit values for memory, and even if we only
>>>>>>> aligned the CPU request/limit values, it would bring other risks to
>>>>>>> us
>>>>>>>
>>>>>>> Additionally, priorityClassName is NOT the solution here. What we
>>>>>>> really worry about is NOT eviction, but the Linux Kernel OOMKiller,
>>>>>>> into which we cannot pass pod priority information. With burstable
>>>>>>> pods, the only thing the Linux Kernel OOMKiller considers is the
>>>>>>> memory request size, which does not necessarily map to priority
>>>>>>> information
>>>>>>>
>>>>>>> 2. The "Zero-Guarantee" Edge Case
>>>>>>>
>>>>>>> Actually, a lot of savings come from these 0-memory-overhead pods...
>>>>>>> I am curious whether you have adopted the PoC PR in prod, since you
>>>>>>> have identified it as "unsafe"?
>>>>>>>
>>>>>>> Something like a minGuaranteedRatio is not a good idea; it will
>>>>>>> mess up the original design idea of the formula (check Appendix C).
>>>>>>> The simplest thing you can do is to avoid rolling out the feature to
>>>>>>> the jobs which you feel will be unsafe...
>>>>>>>
>>>>>>> 3. Native Execution Gap (Off-Heap)
>>>>>>>
>>>>>>> I am not sure the off-heap memory usage of gluten/comet shows the
>>>>>>> same or a similar pattern as memoryOverhead. No one has validated
>>>>>>> that in a production environment, but both PINS and Bytedance have
>>>>>>> validated the memoryOverhead part thoroughly in their clusters
>>>>>>>
>>>>>>> Additionally, the key design of the proposal is to capture the
>>>>>>> relationship between on-heap and memoryOverhead sizes; in other
>>>>>>> words, they co-exist.... off-heap memory used by native engines is a
>>>>>>> different story where, ideally, the on-heap usage should be minimal
>>>>>>> and most of the memory usage should come from the off-heap part...
>>>>>>> so the formula here may not work out of the box
>>>>>>>
>>>>>>> my suggestion is, since the community has approved the original
>>>>>>> design, which has been tested by at least 2 companies in production
>>>>>>> environments, we go with the current design and continue the code
>>>>>>> review; in the future, we can add what has been found/tested in
>>>>>>> production as follow-ups
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Nan
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Yao, Nan, and Chao,
>>>>>>>>
>>>>>>>> Thank you for this proposal, which I have already approved. The
>>>>>>>> cost-efficiency goals are very compelling, and the cited $6M annual
>>>>>>>> savings at Pinterest clearly demonstrate the value of moving away
>>>>>>>> from rigid peak provisioning.
>>>>>>>>
>>>>>>>> However, after modeling the proposed design against standard
>>>>>>>> Kubernetes behavior and modern Spark workloads, I have identified 
>>>>>>>> *three
>>>>>>>> critical stability risks* that need to be addressed before this is
>>>>>>>> finalized.
>>>>>>>>
>>>>>>>> I have drafted a *Supplementary Design Amendment* (linked
>>>>>>>> below/attached) that proposes fixes for these issues, but here is the
>>>>>>>> summary:
>>>>>>>> 1. The "Guaranteed QoS" Contradiction
>>>>>>>>
>>>>>>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for
>>>>>>>> stability risks.
>>>>>>>>
>>>>>>>> The Issue: Technically, this mitigation is impossible under your
>>>>>>>> proposal.
>>>>>>>>
>>>>>>>>    - In Kubernetes, a Pod is assigned the *Guaranteed* QoS class
>>>>>>>>    *only* if Request == Limit for both CPU and Memory.
>>>>>>>>    - Your proposal explicitly sets Memory Request < Memory Limit
>>>>>>>>    (specifically $H+G < H+O$).
>>>>>>>>    - *Consequence:* This configuration *automatically downgrades*
>>>>>>>>    the Pod to the *Burstable* QoS class. In a multi-tenant cluster,
>>>>>>>>    the Kubelet eviction manager will kill these "Burstable" Spark
>>>>>>>>    pods *before* any Guaranteed system pods during node pressure.
>>>>>>>>    - *Proposed Fix:* We cannot rely on Guaranteed QoS. We must
>>>>>>>>    introduce a priorityClassName configuration to offset this
>>>>>>>>    eviction risk.
>>>>>>>>
>>>>>>>> 2. The "Zero-Guarantee" Edge Case
>>>>>>>>
>>>>>>>> The formula $G = O - \min((H+O) \times (B-1), O)$ has a
>>>>>>>> dangerous edge case for High-Heap/Low-Overhead jobs (common in ETL).
>>>>>>>>
>>>>>>>>
>>>>>>>>    - *Scenario:* If a job has a large Heap ($H$) relative to
>>>>>>>>    Overhead ($O$), the calculated burst deduction often exceeds
>>>>>>>>    the total Overhead.
>>>>>>>>    - *Result:* The formula yields *$G = 0$*.
>>>>>>>>    - *Risk:* Allocating 0MB of guaranteed overhead is unsafe.
>>>>>>>>    Essential JVM operations (thread stacks, Netty control buffers)
>>>>>>>>    require a non-zero baseline. Relying 100% on a shared burst pool
>>>>>>>>    for basic functionality will lead to immediate container failures
>>>>>>>>    if the node is contended.
>>>>>>>>    - *Proposed Fix:* Implement a safety floor using a
>>>>>>>>    minGuaranteedRatio (e.g., max(Calculated_G, O * 0.1)).
>>>>>>>>
>>>>>>>> 3. Native Execution Gap (Off-Heap)
>>>>>>>>
>>>>>>>> The proposal focuses entirely on memoryOverhead.
>>>>>>>>
>>>>>>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift
>>>>>>>> execution memory to spark.memory.offHeap.size. This memory is equally
>>>>>>>> "bursty" but is excluded from your optimization.
>>>>>>>>
>>>>>>>> *Proposed Fix: *The burst-aware logic should be extensible to
>>>>>>>> include Off-Heap memory if enabled.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe these changes are necessary to make the feature robust
>>>>>>>> enough for general community adoption beyond specific controlled
>>>>>>>> environments.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Viquar Khan
>>>>>>>>
>>>>>>>> Sr Data Architect
>>>>>>>>
>>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 from me.
>>>>>>>>>>> I think it's well-scoped and takes advantage of Kubernetes'
>>>>>>>>>>> features exactly for what they are designed for (as per my
>>>>>>>>>>> understanding).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for
>>>>>>>>>>>> the detailed and thoughtful discussion.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall, this looks like a valuable addition for organizations
>>>>>>>>>>>> running Spark on Kubernetes, especially given how bursty
>>>>>>>>>>>> memoryOverhead usage tends to be in practice. I appreciate
>>>>>>>>>>>> that the change is relatively small in scope and fully opt-in, 
>>>>>>>>>>>> which helps
>>>>>>>>>>>> keep the risk low.
>>>>>>>>>>>>
>>>>>>>>>>>> From my perspective, the questions raised on the thread and in
>>>>>>>>>>>> the SPIP have been addressed. If others feel the same, do we have 
>>>>>>>>>>>> consensus
>>>>>>>>>>>> to move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Chao
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> this is a good question
>>>>>>>>>>>>>
>>>>>>>>>>>>> > a stage is bursty and consumes the shared portion and fails
>>>>>>>>>>>>> to release it for subsequent stages
>>>>>>>>>>>>>
>>>>>>>>>>>>> in the scenario you described, since the memory-leaking stage
>>>>>>>>>>>>> and the subsequent ones are from the same job, the pod will
>>>>>>>>>>>>> likely be killed by the cgroup oomkiller
>>>>>>>>>>>>>
>>>>>>>>>>>>> taking the following as the example
>>>>>>>>>>>>>
>>>>>>>>>>>>> the usage pattern is G = 5GB, S = 2GB; it uses G + S at max,
>>>>>>>>>>>>> and in theory it should release all 7G and then claim 7G again
>>>>>>>>>>>>> in some later stages. however, due to the memory peak, it holds
>>>>>>>>>>>>> 2G forever and asks for another 7G; as a result, it hits the
>>>>>>>>>>>>> pod memory limit and the cgroup oomkiller will take action to
>>>>>>>>>>>>> terminate the pod
>>>>>>>>>>>>>
>>>>>>>>>>>>> so this should be safe to the system
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> however, we should be careful about the memory peak for sure,
>>>>>>>>>>>>> because it essentially breaks the assumption that the usage of
>>>>>>>>>>>>> memoryOverhead is bursty (a held memory peak ~= using memory
>>>>>>>>>>>>> forever)... unfortunately, shared/guaranteed memory is managed
>>>>>>>>>>>>> by user applications instead of at the cluster level; they,
>>>>>>>>>>>>> especially S, are just logical concepts rather than a physical
>>>>>>>>>>>>> memory pool which pods can explicitly claim memory from...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the interesting proposal.
>>>>>>>>>>>>>> The design seems to rely on memoryOverhead being transient.
>>>>>>>>>>>>>> What happens when a stage is bursty and consumes the shared
>>>>>>>>>>>>>> portion and fails to release it for subsequent stages (e.g.,  
>>>>>>>>>>>>>> off-heap
>>>>>>>>>>>>>> buffers and its not garbage collected since its off-heap)? Would 
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> trigger the host-level OOM like described in Q6? or are there 
>>>>>>>>>>>>>> strategies to
>>>>>>>>>>>>>> release the shared portion?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> yes, that's the worst case in the scenario, please check my
>>>>>>>>>>>>>>> earlier response to Qiegang's question, we have a set of 
>>>>>>>>>>>>>>> strategies adopted
>>>>>>>>>>>>>>> in prod to mitigate the issue
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the explanation! So the executor is not
>>>>>>>>>>>>>>>> guaranteed to get 50 GB physical memory, right? All pods on 
>>>>>>>>>>>>>>>> the same host
>>>>>>>>>>>>>>>> may reach peak memory usage at the same time and cause 
>>>>>>>>>>>>>>>> paging/swapping
>>>>>>>>>>>>>>>> which hurts performance?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> np, let me try to explain
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Each executor container will be run in a pod together
>>>>>>>>>>>>>>>>> with some other sidecar containers taking care of tasks like
>>>>>>>>>>>>>>>>> authentication, etc. , for simplicity, we assume each pod has 
>>>>>>>>>>>>>>>>> only one
>>>>>>>>>>>>>>>>> container which is the executor container
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. Each container is assigned two values,
>>>>>>>>>>>>>>>>> *request & limit* (limit >= request), for both
>>>>>>>>>>>>>>>>> CPU/memory resources (we only discuss memory here). Each
>>>>>>>>>>>>>>>>> pod will have request/limit values equal to the sum over
>>>>>>>>>>>>>>>>> all containers belonging to the pod
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3. The K8S Scheduler chooses a machine to host a pod based
>>>>>>>>>>>>>>>>> on the *request* value, and caps the resource usage of each
>>>>>>>>>>>>>>>>> container based on its *limit* value. e.g. if I have a pod
>>>>>>>>>>>>>>>>> with a single container in it, and it has 1G/2G as its
>>>>>>>>>>>>>>>>> request and limit values respectively, any machine with 1G
>>>>>>>>>>>>>>>>> of free RAM will be a candidate to host this pod, and when
>>>>>>>>>>>>>>>>> the container uses more than 2G of memory, it will be
>>>>>>>>>>>>>>>>> killed by the cgroup oomkiller. Once a pod is scheduled to
>>>>>>>>>>>>>>>>> a host, the memory space sized at "sum of all its
>>>>>>>>>>>>>>>>> containers' request values" will be booked exclusively for
>>>>>>>>>>>>>>>>> this pod.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 4. By default, Spark *sets request/limit to the same value
>>>>>>>>>>>>>>>>> for executors in k8s*, and this value is basically
>>>>>>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in
>>>>>>>>>>>>>>>>> most cases. However, spark.executor.memoryOverhead usage is
>>>>>>>>>>>>>>>>> very bursty: a user setting spark.executor.memoryOverhead
>>>>>>>>>>>>>>>>> to 10G usually means each executor needs the full 10G only
>>>>>>>>>>>>>>>>> in a very small portion of the executor's whole lifecycle
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple
>>>>>>>>>>>>>>>>> request/limit value in spark@k8s for executors in a safe
>>>>>>>>>>>>>>>>> way (this idea is from the bytedance paper we refer to in 
>>>>>>>>>>>>>>>>> SPIP paper).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Using the aforementioned example ,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we
>>>>>>>>>>>>>>>>> have two pods requesting 40G + 10G (on-heap + memoryOverhead) 
>>>>>>>>>>>>>>>>> and we set
>>>>>>>>>>>>>>>>> bursty factor to 1.2, without the mechanism proposed in this 
>>>>>>>>>>>>>>>>> SPIP, we can
>>>>>>>>>>>>>>>>> at most host 2 pods with this machine, and because of the 
>>>>>>>>>>>>>>>>> bursty usage of
>>>>>>>>>>>>>>>>> that 10G space, the memory utilization would be compromised.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When applying the burst-aware memory allocation, we only
>>>>>>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each 
>>>>>>>>>>>>>>>>> pod, i.e. we
>>>>>>>>>>>>>>>>> have 20G free memory space left in the machine which can be 
>>>>>>>>>>>>>>>>> used to host
>>>>>>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the 
>>>>>>>>>>>>>>>>> limit value of
>>>>>>>>>>>>>>>>> the executor pods, these executors can still use 50G at max.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does
>>>>>>>>>>>>>>>>>> it work under the hood? The container will adjust its system 
>>>>>>>>>>>>>>>>>> memory size
>>>>>>>>>>>>>>>>>> depending on the actual memory usage of the processes in 
>>>>>>>>>>>>>>>>>> this container?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> yeah, we have a few cases where we have significantly
>>>>>>>>>>>>>>>>>>> larger O than H, and the proposed algorithm is actually a
>>>>>>>>>>>>>>>>>>> great fit
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed
>>>>>>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure safe
>>>>>>>>>>>>>>>>>>> running but still cut a big chunk of memory (10s of GBs)
>>>>>>>>>>>>>>>>>>> and treat it as S, saving the tons of money burnt by it
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> but regarding native accelerators, some native
>>>>>>>>>>>>>>>>>>> acceleration engines do not use memoryOverhead but use
>>>>>>>>>>>>>>>>>>> off-heap (spark.memory.offHeap.size) explicitly (e.g.
>>>>>>>>>>>>>>>>>>> Gluten). The current implementation does not cover this
>>>>>>>>>>>>>>>>>>> part, though that would be an easy extension
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than
>>>>>>>>>>>>>>>>>>>> H? Wondering if the proposed algorithm would help more in 
>>>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> environments (eg. with
>>>>>>>>>>>>>>>>>>>> native accelerators)?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> please check the following answer
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use
>>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling
>>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the
>>>>>>>>>>>>>>>>>>>>> total potential usage  sum of H+G+S across all pods on a 
>>>>>>>>>>>>>>>>>>>>> node exceeds its
>>>>>>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely 
>>>>>>>>>>>>>>>>>>>>> on the cluster
>>>>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer 
>>>>>>>>>>>>>>>>>>>>> exists on the node
>>>>>>>>>>>>>>>>>>>>> to serve as the shared pool?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies:
>>>>>>>>>>>>>>>>>>>>> setting a conservative bursty factor, progressive
>>>>>>>>>>>>>>>>>>>>> rollout, and monitoring cluster metrics like Linux
>>>>>>>>>>>>>>>>>>>>> Kernel OOMKiller occurrence to guide us to the optimal
>>>>>>>>>>>>>>>>>>>>> setup of the bursty factor... usually, K8S operators
>>>>>>>>>>>>>>>>>>>>> will set reserved space for daemon processes on each
>>>>>>>>>>>>>>>>>>>>> host; we found it sufficient in our case and our major
>>>>>>>>>>>>>>>>>>>>> tuning focuses on the bursty factor value
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to
>>>>>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and 
>>>>>>>>>>>>>>>>>>>>> small S values on a
>>>>>>>>>>>>>>>>>>>>> single node?  I am wondering if this would reduce the 
>>>>>>>>>>>>>>>>>>>>> probability of
>>>>>>>>>>>>>>>>>>>>> concurrent bursting and host-level OOM.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, when we worked on this project, we paid some
>>>>>>>>>>>>>>>>>>>>> attention to the cluster scheduling policy/behavior...
>>>>>>>>>>>>>>>>>>>>> two things we mostly care about:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a
>>>>>>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have
>>>>>>>>>>>>>>>>>>>>> enough candidates to form a mixed set of executors with
>>>>>>>>>>>>>>>>>>>>> large S and small S values
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. we avoid using the binpack scheduling algorithm,
>>>>>>>>>>>>>>>>>>>>> which tends to pack more pods from the same job onto
>>>>>>>>>>>>>>>>>>>>> the same host; this can create trouble as they are more
>>>>>>>>>>>>>>>>>>>>> likely to ask for max memory at the same time
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use
>>>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling
>>>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing.  I
>>>>>>>>>>>>>>>>>>>>>> have a few questions regarding the shared portion S:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    1. How is the risk of host-level OOM mitigated
>>>>>>>>>>>>>>>>>>>>>>    when the total potential usage  sum of H+G+S across 
>>>>>>>>>>>>>>>>>>>>>> all pods on a node
>>>>>>>>>>>>>>>>>>>>>>    exceeds its allocatable capacity? Does the proposal 
>>>>>>>>>>>>>>>>>>>>>> implicitly rely on the
>>>>>>>>>>>>>>>>>>>>>>    cluster operator to manually ensure an unrequested 
>>>>>>>>>>>>>>>>>>>>>> memory buffer exists on
>>>>>>>>>>>>>>>>>>>>>>    the node to serve as the shared pool?
>>>>>>>>>>>>>>>>>>>>>>    2. Have you considered scheduling optimizations
>>>>>>>>>>>>>>>>>>>>>>    to ensure a strategic mix of executors with large S 
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>    small S values on a single node?  I am wondering
>>>>>>>>>>>>>>>>>>>>>>    if this would reduce the probability of concurrent 
>>>>>>>>>>>>>>>>>>>>>> bursting and host-level
>>>>>>>>>>>>>>>>>>>>>>    OOM.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big
>>>>>>>>>>>>>>>>>>>>>>> picture:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Is the memory overhead off-heap? The formula
>>>>>>>>>>>>>>>>>>>>>>>    indicates a fixed heap size, and memory overhead
>>>>>>>>>>>>>>>>>>>>>>>    can't be dynamic if it's on-heap.
>>>>>>>>>>>>>>>>>>>>>>>    - Do Spark applications have static profiles?
>>>>>>>>>>>>>>>>>>>>>>>    When we submit stages, the cluster is already 
>>>>>>>>>>>>>>>>>>>>>>> allocated, how can we change
>>>>>>>>>>>>>>>>>>>>>>>    anything?
>>>>>>>>>>>>>>>>>>>>>>>    - How do we assign the shared memory overhead?
>>>>>>>>>>>>>>>>>>>>>>>    Fairly among all applications on the same physical 
>>>>>>>>>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc
>>>>>>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple...
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4
>>>>>>>>>>>>>>>>>>>>>>>> of the SPIP doc
>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> it is calculated per profile (you can say it is per
>>>>>>>>>>>>>>>>>>>>>>>> stage); when the cluster manager composes the pod
>>>>>>>>>>>>>>>>>>>>>>>> spec, it calculates the new memory overhead based on
>>>>>>>>>>>>>>>>>>>>>>>> what the user asks for in that resource profile
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the
>>>>>>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per 
>>>>>>>>>>>>>>>>>>>>>>>>> executor?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> yeah, the implementation basically relies on
>>>>>>>>>>>>>>>>>>>>>>>>>> the request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> but if any other cluster manager comes along in
>>>>>>>>>>>>>>>>>>>>>>>>>> the future, as long as it has a similar concept,
>>>>>>>>>>>>>>>>>>>>>>>>>> it can leverage this easily as the main logic is
>>>>>>>>>>>>>>>>>>>>>>>>>> implemented in ResourceProfile
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it
>>>>>>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <
>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead
>>>>>>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve
>>>>>>>>>>>>>>>>>>>>>>>>>>>> memory utilization of Spark clusters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original
>>>>>>>>>>>>>>>>>>>>>>>>>>>> paper
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui(
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]) and Yixin(
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Vaquar Khan
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>>
>>>>
>>
>> --
>> Regards,
>> Vaquar Khan
>>
>>

-- 
Regards,
Vaquar Khan
