vaquar, I think I need to ask: how much of your messages is written by an AI? They have many of the stylistic characteristics of that kind of output. This is not by itself wrong, but while well-formed, the replies are verbose and repetitive, and seem to be talking past the responses you receive. There are thousands of subscribers here, and I want to make sure we are spending everyone's time in good faith.
On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:

> Hi Nan, Yao, and Chao,
>
> I have done a deep dive into the underlying Linux kernel and Kubernetes
> behaviors to validate our respective positions. While I fully support the
> economic goal of reclaiming the estimated 30-50% of stranded memory in
> static clusters, the technical evidence suggests that the "Zero-Guarantee"
> configuration is not just an optimization choice: it is architecturally
> unsafe for standard Kubernetes environments because of how OOM scores are
> calculated.
>
> I am sharing these findings to explain why I have insisted on the *Safety
> Floor (minGuaranteedRatio)* as a necessary guardrail.
>
> *1. The "Death Trap" of OOM Scores (The Math)*
> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's
> environment. However, in a standard environment the math works against us.
> For Burstable pods, the Kubelet sets oom_score_adj inversely to the
> Request size: 1000 - (1000 * Request / Capacity).
>
> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the
>   Request), we are mathematically inflating the OOM score. For example, on
>   a standard node, a Zero-Guarantee pod ends up with a significantly
>   higher OOM score (more likely to be killed) than a standard pod.
> - *The Consequence:* When memory pressure forces an OOM kill, the kernel
>   will target these "optimized" Spark pods for termination *before* their
>   neighbors, regardless of our intent.
>
> *2. The "Smoking Gun": Kubelet Bug #131169*
> There is a known defect in the Kubelet (Issue #131169) where
> *PriorityClass is ignored when calculating OOM scores for Burstable pods*.
>
> - This invalidates the assumption that we can simply "manage" the risk
>   with priorities later.
> - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is
>   effectively identical to a "Best Effort" pod in the eyes of the OOM
>   killer.
> - *Conclusion:* Ideally, we *should* enforce a minimum memory floor to
>   keep the Request value high enough to secure a survivable OOM score.
>
> *3. Silent Failures (Thread Exhaustion)*
> The research confirms that "Zero-Guarantee" creates a vector for
> java.lang.OutOfMemoryError: unable to create new native thread.
>
> - If a pod lands on a node with just enough RAM for the Heap (Request)
>   but zero extra for the OS, the pthread_create call will fail
>   immediately.
> - This results in "silent" application crashes that do not trigger
>   standard K8s OOM alerts, leading to un-debuggable support scenarios for
>   general users.
>
> *Final Proposal & Documentation Compromise*
>
> My strong preference is to add the *Safety Floor (minGuaranteedRatio)*
> configuration to the code.
>
> However, if after reviewing this evidence you are *adamant* that no new
> configurations should be added to the code, I am willing to *unblock the
> vote* on one strict condition:
>
> *The SPIP and documentation must explicitly flag this risk.* We cannot
> simply leave this as an implementation detail. The documentation must
> contain a "Critical Warning" block stating:
>
> *"Warning: High-Heap/Low-Overhead configurations may result in 0 MB of
> guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may
> bypass PriorityClass protections and lead to silent 'Native Thread'
> exhaustion failures on contended nodes. Users are responsible for
> validating stability."*
>
> If you agree to either the code change (preferred) or this specific
> documentation warning, please update the SPIP doc and I am happy to
> support.
>
> Regards,
>
> Viquar Khan
>
> Sr Data Architect
>
> https://www.linkedin.com/in/vaquar-khan-b695577/
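The Burstable oom_score_adj arithmetic cited in point 1 above can be checked with a short calculation. The sketch below uses the formula as quoted in the thread; the node and executor sizes are made up for illustration and are not taken from the discussion.

```scala
// A sketch of the Burstable oom_score_adj formula quoted above:
//   adj = 1000 - (1000 * memoryRequest / nodeCapacity)
// The node and executor sizes here are hypothetical, not from the thread.
object OomScoreSketch {

  def oomScoreAdj(requestBytes: Long, capacityBytes: Long): Long =
    1000L - (1000L * requestBytes) / capacityBytes

  def main(args: Array[String]): Unit = {
    val GiB          = 1024L * 1024L * 1024L
    val nodeCapacity = 64L * GiB

    // "Zero-Guarantee": the pod request equals the heap alone (8 GiB).
    val zeroOverhead = oomScoreAdj(8L * GiB, nodeCapacity)           // 875
    // Same executor keeping a 10% overhead floor in its request (8.8 GiB).
    val withFloor    = oomScoreAdj((8.8 * GiB).toLong, nodeCapacity) // 863

    println(s"zero-overhead adj = $zeroOverhead, with-floor adj = $withFloor")
    // The larger the adjustment, the earlier the kernel OOM killer picks the pod.
  }
}
```

A higher adjustment means the pod is selected earlier under node memory pressure, which is the asymmetry the proposed safety floor is meant to soften.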
> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>
>> 1. Re: "Imagined Reasons" & Zero Overhead
>>
>> When I said "imagined reasons", I meant that I haven't seen the issue you
>> described appear in a prod environment running millions of jobs every
>> month, and I have also explained why it won't happen at PINS or in other
>> normal cases: in a K8s cluster there is reserved space for system daemons
>> on each host, so even with many 0-memoryOverhead jobs, nodes won't be
>> "fully packed" as you imagine, since those 0-memoryOverhead jobs don't
>> need much overhead space anyway.
>>
>> Let me repeat my earlier suggestion: if you don't want any job to have 0
>> memoryOverhead, you can calculate how much memoryOverhead is guaranteed
>> with simple arithmetic, and if it is 0, do not use this feature.
>>
>> In general, I don't suggest using this feature if you cannot manage the
>> rollout process, just as no one should apply something like auto-tuning
>> to all of their jobs without a dedicated Spark platform team.
>>
>> 2. Kubelet Eviction Relevance
>>
>> 2.a My question is: how is PID/disk pressure related to the memory
>> feature we are discussing here? Please don't expand the scope of the
>> discussion without limit.
>>
>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far
>> from a reasonable design. The priority class name should be controlled at
>> the cluster level and then specified via something like the Spark
>> operator, or via the pod spec if you can supply one, instead of being
>> embedded in a memory-related feature.
>>
>> 3. Can we agree to simply *add these two parameters as optional
>> configurations*?
>>
>> Unfortunately, no...
>>
>> Some of the problems you raised will probably only happen in very extreme
>> cases, and I have provided solutions to them that don't require
>> additional configs. Other problems you raised are not related to what
>> this SPIP is about, e.g. PID exhaustion. And some of your proposed design
>> doesn't make sense to me, e.g. specifying the executor's priority class
>> via such a memory-related feature.
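The "simple arithmetic" pre-check Nan suggests above can be expressed as a small guard run before enabling the feature for a job. The model below (guaranteed overhead = pod request minus heap) and the 384 MiB floor are assumptions for illustration; the SPIP's exact request calculation may differ.

```scala
// A minimal pre-submission guard in the spirit of the suggestion above:
// compute how much overhead would actually be guaranteed and fall back to
// the classic (fully guaranteed) settings when it comes out too low.
// "guaranteed = podRequest - heap" is an assumed model, not the SPIP formula.
object OverheadGuard {

  /** True if the job should NOT use the bursty-memory feature. */
  def shouldFallBack(heapMiB: Long, podRequestMiB: Long, minOverheadMiB: Long = 384L): Boolean = {
    val guaranteedOverheadMiB = podRequestMiB - heapMiB
    guaranteedOverheadMiB < minOverheadMiB
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical job: 8 GiB heap, pod request equal to the heap -> 0 MiB guaranteed.
    println(shouldFallBack(heapMiB = 8192L, podRequestMiB = 8192L)) // true
    // Same heap with 1 GiB kept in the request -> 1024 MiB guaranteed.
    println(shouldFallBack(heapMiB = 8192L, podRequestMiB = 9216L)) // false
  }
}
```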
>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:
>>
>>> Hi Nan,
>>>
>>> Thanks for the candid response. I see where you are coming from
>>> regarding managed rollouts, but I think we are viewing this from two
>>> different lenses: "Internal Platform" vs. "General Open Source Product."
>>>
>>> Here is why I am pushing for these two specific configuration hooks:
>>>
>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>
>>> You mentioned that you have observed jobs running fine with zero
>>> memoryOverhead. While that may be true for specific workloads in your
>>> environment, the need for non-heap memory is not "imagined"; it is a
>>> JVM-level requirement. Thread stacks, the CodeCache, and Netty
>>> DirectByteBuffer control structures must live in non-heap memory.
>>>
>>> - *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully
>>>   bin-packed (Sum of Requests = Node Capacity), your executor is
>>>   mathematically guaranteed *zero bytes* of non-heap memory unless it
>>>   can steal from the burst pool.
>>> - *The Risk:* If the burst pool is temporarily exhausted by neighbors,
>>>   a simple thread creation will throw OutOfMemoryError: unable to
>>>   create new native thread.
>>> - *The Fix:* I am not asking to change your default behavior. I am
>>>   asking to *expose the config* (minGuaranteedRatio). If you set it to
>>>   0.0 (the default), your behavior is unchanged. But those of us running
>>>   high-concurrency environments who need a 5-10% safety buffer for
>>>   thread stacks need the *capability* to configure it without
>>>   maintaining a fork or writing complex pre-submission wrappers.
>>>
>>> 2. Re: Kubelet Eviction Relevance
>>>
>>> You asked how Disk/PID pressure is related. In Kubernetes, PriorityClass
>>> is the universal signal for pod importance during any node-pressure
>>> event (not just memory).
>>>
>>> - If a node runs out of Ephemeral Storage (common with Spark shuffle),
>>>   the Kubelet evicts pods.
>>> - Without a priorityClassName config, these Spark pods (which are now
>>>   QoS-downgraded to Burstable) will be evicted *before* Best-Effort
>>>   jobs that might have a higher priority class.
>>> - Again, this is a standard Kubernetes spec feature. There is no
>>>   downside to exposing spark.kubernetes.executor.bursty.priorityClassName
>>>   as an optional config.
>>>
>>> *Proposal to Unblock*
>>>
>>> We both want this feature merged. I am not asking to change your
>>> formula's default behavior. Can we agree to simply *add these two
>>> parameters as optional configurations*?
>>>
>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>
>>> This satisfies your design goals while making the feature robust enough
>>> for my production requirements.
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>>
>>> Sr Data Architect
>>>
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
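To make the minGuaranteedRatio proposal above concrete, here is a sketch of how such a floor could feed into the pod memory request. The function name, the surrounding formula, and the numbers are illustrative only; neither this config nor this calculation exists in Spark or in the SPIP as written.

```scala
// A sketch of how the proposed (not existing) minGuaranteedRatio floor
// could be applied when deriving the pod memory request. Everything here
// is illustrative; it is not Spark code and not the SPIP's formula.
object SafetyFloorSketch {

  /** heapMiB: executor heap; overheadMiB: configured memoryOverhead;
    * minGuaranteedRatio: fraction of the overhead kept in the pod request
    * (0.0 reproduces the zero-guarantee behaviour discussed in the thread). */
  def podMemoryRequestMiB(heapMiB: Long, overheadMiB: Long, minGuaranteedRatio: Double): Long = {
    require(minGuaranteedRatio >= 0.0 && minGuaranteedRatio <= 1.0)
    heapMiB + math.ceil(overheadMiB * minGuaranteedRatio).toLong
  }

  def main(args: Array[String]): Unit = {
    // 8 GiB heap, 800 MiB overhead (10%), all of it burstable by default.
    println(podMemoryRequestMiB(8192L, 800L, 0.0)) // 8192 -> zero guaranteed overhead
    println(podMemoryRequestMiB(8192L, 800L, 0.5)) // 8592 -> 400 MiB survives bin-packing
  }
}
```

With the ratio at 0.0 the request collapses to the heap alone, which is exactly the zero-guarantee case the OOM-score and native-thread arguments in the thread are about.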
