vaquar, I think I need to ask: how much of your messages is written by an AI? They have many of the stylistic characteristics of that kind of output. This is not by itself wrong, but while well-formed, the replies are verbose and repetitive, and seem to be talking past the responses you receive. There are thousands of subscribers here, and I want to make sure we are spending everyone's time in good faith.
On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:

> Hi Nan, Yao, and Chao,
>
> I have done a deep dive into the underlying Linux kernel and Kubernetes
> behaviors to validate our respective positions. While I fully support the
> economic goal of reclaiming the estimated 30-50% of stranded memory in
> static clusters, the technical evidence suggests that the "Zero-Guarantee"
> configuration is not just an optimization choice: it is architecturally
> unsafe for standard Kubernetes environments because of how OOM scores are
> calculated.
>
> I am sharing these findings to explain why I have insisted on the *Safety
> Floor (minGuaranteedRatio)* as a necessary guardrail.
>
> *1. The "Death Trap" of OOM Scores (The Math)*
> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's
> environment. However, in a standard environment the math works against us.
> For Burstable pods, the Kubelet sets oom_score_adj inversely to the
> Request size: 1000 - (1000 * Request / Capacity).
>
> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the
>   Request), we are mathematically inflating the OOM score. For example, on
>   a standard node, a Zero-Guarantee pod ends up with a significantly
>   higher OOM score (more likely to be killed) than a standard pod.
> - *The Consequence:* When memory pressure forces an OOM kill, the kernel
>   will target these "optimized" Spark pods for termination *before* their
>   neighbors, regardless of our intent.
>
> *2. The "Smoking Gun": Kubelet Bug #131169*
> There is a known defect in the Kubelet (Issue #131169) where
> *PriorityClass is ignored when calculating OOM scores for Burstable pods*.
>
> - This invalidates the assumption that we can simply "manage" the risk
>   with priorities later.
> - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is
>   effectively identical to a "Best Effort" pod in the eyes of the OOM
>   killer.
> - *Conclusion:* Ideally, we *should* enforce a minimum memory floor to
>   keep the Request value high enough to secure a survivable OOM score.
>
> *3. Silent Failures (Thread Exhaustion)*
> The research confirms that "Zero-Guarantee" creates a vector for
> java.lang.OutOfMemoryError: unable to create new native thread.
>
> - If a pod lands on a node with just enough RAM for the Heap (Request)
>   but zero extra for the OS, the pthread_create call will fail
>   immediately.
> - This results in "silent" application crashes that do not trigger
>   standard K8s OOM alerts, leading to un-debuggable support scenarios for
>   general users.
>
> *Final Proposal & Documentation Compromise*
>
> My strong preference is to add the *Safety Floor (minGuaranteedRatio)*
> configuration to the code.
>
> However, if after reviewing this evidence you are *adamant* that no new
> configurations should be added to the code, I am willing to *unblock the
> vote* on one strict condition:
>
> *The SPIP and documentation must explicitly flag this risk.* We cannot
> simply leave this as an implementation detail. The documentation must
> contain a "Critical Warning" block stating:
>
> *"Warning: High-Heap/Low-Overhead configurations may result in 0 MB of
> guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may
> bypass PriorityClass protections and lead to silent 'Native Thread'
> exhaustion failures on contended nodes. Users are responsible for
> validating stability."*
>
> If you agree to either the code change (preferred) or this specific
> documentation warning, please update the SPIP doc and I am happy to
> support.
>
> Regards,
>
> Viquar Khan
>
> Sr Data Architect
>
> https://www.linkedin.com/in/vaquar-khan-b695577/
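The Burstable oom_score_adj arithmetic cited in point 1 above can be checked with a short calculation. The sketch below uses the formula as quoted in the thread; the node and executor sizes are made up for illustration and are not taken from the discussion.

```scala
// A sketch of the Burstable oom_score_adj formula quoted above:
//   adj = 1000 - (1000 * memoryRequest / nodeCapacity)
// The node and executor sizes here are hypothetical, not from the thread.
object OomScoreSketch {

  def oomScoreAdj(requestBytes: Long, capacityBytes: Long): Long =
    1000L - (1000L * requestBytes) / capacityBytes

  def main(args: Array[String]): Unit = {
    val GiB          = 1024L * 1024L * 1024L
    val nodeCapacity = 64L * GiB

    // "Zero-Guarantee": the pod request equals the heap alone (8 GiB).
    val zeroOverhead = oomScoreAdj(8L * GiB, nodeCapacity)           // 875
    // Same executor keeping a 10% overhead floor in its request (8.8 GiB).
    val withFloor    = oomScoreAdj((8.8 * GiB).toLong, nodeCapacity) // 863

    println(s"zero-overhead adj = $zeroOverhead, with-floor adj = $withFloor")
    // The larger the adjustment, the earlier the kernel OOM killer picks the pod.
  }
}
```

A higher adjustment means the pod is selected earlier under node memory pressure, which is the asymmetry the proposed safety floor is meant to soften.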
> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>
>> 1. Re: "Imagined Reasons" & Zero Overhead
>>
>> When I said "imagined reasons", I meant that I haven't seen the issue you
>> described appear in a prod environment running millions of jobs every
>> month, and I have also explained why it won't happen at PINS or in other
>> normal cases: in a K8s cluster there is reserved space for system daemons
>> on each host, so even with many 0-memoryOverhead jobs, nodes won't be
>> "fully packed" as you imagine, since those 0-memoryOverhead jobs don't
>> need much overhead space anyway.
>>
>> Let me repeat my earlier suggestion: if you don't want any job to have 0
>> memoryOverhead, you can calculate how much memoryOverhead is guaranteed
>> with simple arithmetic, and if it is 0, do not use this feature.
>>
>> In general, I don't suggest using this feature if you cannot manage the
>> rollout process, just as no one should apply something like auto-tuning
>> to all of their jobs without a dedicated Spark platform team.
>>
>> 2. Kubelet Eviction Relevance
>>
>> 2.a My question is: how is PID/disk pressure related to the memory
>> feature we are discussing here? Please don't expand the scope of the
>> discussion without limit.
>>
>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far
>> from a reasonable design. The priority class name should be controlled at
>> the cluster level and then specified via something like the Spark
>> operator, or via the pod spec if you can supply one, instead of being
>> embedded in a memory-related feature.
>>
>> 3. Can we agree to simply *add these two parameters as optional
>> configurations*?
>>
>> Unfortunately, no...
>>
>> Some of the problems you raised will probably only happen in very extreme
>> cases, and I have provided solutions to them that don't require
>> additional configs. Other problems you raised are not related to what
>> this SPIP is about, e.g. PID exhaustion. And some of your proposed design
>> doesn't make sense to me, e.g. specifying the executor's priority class
>> via such a memory-related feature.
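The "simple arithmetic" pre-check Nan suggests above can be expressed as a small guard run before enabling the feature for a job. The model below (guaranteed overhead = pod request minus heap) and the 384 MiB floor are assumptions for illustration; the SPIP's exact request calculation may differ.

```scala
// A minimal pre-submission guard in the spirit of the suggestion above:
// compute how much overhead would actually be guaranteed and fall back to
// the classic (fully guaranteed) settings when it comes out too low.
// "guaranteed = podRequest - heap" is an assumed model, not the SPIP formula.
object OverheadGuard {

  /** True if the job should NOT use the bursty-memory feature. */
  def shouldFallBack(heapMiB: Long, podRequestMiB: Long, minOverheadMiB: Long = 384L): Boolean = {
    val guaranteedOverheadMiB = podRequestMiB - heapMiB
    guaranteedOverheadMiB < minOverheadMiB
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical job: 8 GiB heap, pod request equal to the heap -> 0 MiB guaranteed.
    println(shouldFallBack(heapMiB = 8192L, podRequestMiB = 8192L)) // true
    // Same heap with 1 GiB kept in the request -> 1024 MiB guaranteed.
    println(shouldFallBack(heapMiB = 8192L, podRequestMiB = 9216L)) // false
  }
}
```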
>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:
>>
>>> Hi Nan,
>>>
>>> Thanks for the candid response. I see where you are coming from
>>> regarding managed rollouts, but I think we are viewing this from two
>>> different lenses: "Internal Platform" vs. "General Open Source Product."
>>>
>>> Here is why I am pushing for these two specific configuration hooks:
>>>
>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>
>>> You mentioned that you have observed jobs running fine with zero
>>> memoryOverhead. While that may be true for specific workloads in your
>>> environment, the need for non-heap memory is not "imagined"; it is a
>>> JVM-level requirement. Thread stacks, the CodeCache, and Netty
>>> DirectByteBuffer control structures must live in non-heap memory.
>>>
>>> - *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully
>>>   bin-packed (Sum of Requests = Node Capacity), your executor is
>>>   mathematically guaranteed *zero bytes* of non-heap memory unless it
>>>   can steal from the burst pool.
>>> - *The Risk:* If the burst pool is temporarily exhausted by neighbors,
>>>   a simple thread creation will throw OutOfMemoryError: unable to
>>>   create new native thread.
>>> - *The Fix:* I am not asking to change your default behavior. I am
>>>   asking to *expose the config* (minGuaranteedRatio). If you set it to
>>>   0.0 (the default), your behavior is unchanged. But those of us running
>>>   high-concurrency environments who need a 5-10% safety buffer for
>>>   thread stacks need the *capability* to configure it without
>>>   maintaining a fork or writing complex pre-submission wrappers.
>>>
>>> 2. Re: Kubelet Eviction Relevance
>>>
>>> You asked how Disk/PID pressure is related. In Kubernetes, PriorityClass
>>> is the universal signal for pod importance during any node-pressure
>>> event (not just memory).
>>>
>>> - If a node runs out of Ephemeral Storage (common with Spark shuffle),
>>>   the Kubelet evicts pods.
>>> - Without a priorityClassName config, these Spark pods (which are now
>>>   QoS-downgraded to Burstable) will be evicted *before* Best-Effort
>>>   jobs that might have a higher priority class.
>>> - Again, this is a standard Kubernetes spec feature. There is no
>>>   downside to exposing spark.kubernetes.executor.bursty.priorityClassName
>>>   as an optional config.
>>>
>>> *Proposal to Unblock*
>>>
>>> We both want this feature merged. I am not asking to change your
>>> formula's default behavior. Can we agree to simply *add these two
>>> parameters as optional configurations*?
>>>
>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>
>>> This satisfies your design goals while making the feature robust enough
>>> for my production requirements.
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>>
>>> Sr Data Architect
>>>
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
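To make the minGuaranteedRatio proposal above concrete, here is a sketch of how such a floor could feed into the pod memory request. The function name, the surrounding formula, and the numbers are illustrative only; neither this config nor this calculation exists in Spark or in the SPIP as written.

```scala
// A sketch of how the proposed (not existing) minGuaranteedRatio floor
// could be applied when deriving the pod memory request. Everything here
// is illustrative; it is not Spark code and not the SPIP's formula.
object SafetyFloorSketch {

  /** heapMiB: executor heap; overheadMiB: configured memoryOverhead;
    * minGuaranteedRatio: fraction of the overhead kept in the pod request
    * (0.0 reproduces the zero-guarantee behaviour discussed in the thread). */
  def podMemoryRequestMiB(heapMiB: Long, overheadMiB: Long, minGuaranteedRatio: Double): Long = {
    require(minGuaranteedRatio >= 0.0 && minGuaranteedRatio <= 1.0)
    heapMiB + math.ceil(overheadMiB * minGuaranteedRatio).toLong
  }

  def main(args: Array[String]): Unit = {
    // 8 GiB heap, 800 MiB overhead (10%), all of it burstable by default.
    println(podMemoryRequestMiB(8192L, 800L, 0.0)) // 8192 -> zero guaranteed overhead
    println(podMemoryRequestMiB(8192L, 800L, 0.5)) // 8592 -> 400 MiB survives bin-packing
  }
}
```

With the ratio at 0.0 the request collapses to the heap alone, which is exactly the zero-guarantee case the OOM-score and native-thread arguments in the thread are about.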
