+1 (non-binding)

On Sat, Jan 3, 2026 at 7:23 PM Kent Yao <[email protected]> wrote:
+1

Kent

On 2025/12/17 07:48:06 Wenchen Fan wrote:

+1

On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> wrote:

+1 from me.
I think it's well-scoped and takes advantage of Kubernetes' features
exactly for what they are designed for (as per my understanding).

On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:

Thanks Yao and Nan for the proposal, and thanks everyone for the detailed
and thoughtful discussion.

Overall, this looks like a valuable addition for organizations running
Spark on Kubernetes, especially given how bursty memoryOverhead usage
tends to be in practice. I appreciate that the change is relatively small
in scope and fully opt-in, which helps keep the risk low.

From my perspective, the questions raised on the thread and in the SPIP
have been addressed. If others feel the same, do we have consensus to
move forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
Best,
Chao

On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:

This is a good question.

> a stage is bursty and consumes the shared portion and fails to release
> it for subsequent stages

In the scenario you described, since the memory-leaking stage and the
subsequent ones are from the same job, the pod will likely be killed by
the cgroup OOM killer.

Taking the following as an example: the usage pattern is G = 5GB,
S = 2GB, so it uses G + S at max. In theory, it should release all 7G
and then claim 7G again in some later stages. However, due to the memory
peak, it holds 2G forever and asks for another 7G; as a result, it hits
the pod memory limit, and the cgroup OOM killer will take action to
terminate the pod.

So this should be safe for the system.

However, we should be careful about the memory peak for sure, because it
essentially breaks the assumption that memoryOverhead usage is bursty
(memory peak ~= using the memory forever)... Unfortunately,
shared/guaranteed memory is managed by user applications instead of at
the cluster level; they, especially S, are just logical concepts instead
of a physical memory pool which pods can explicitly claim memory from.

On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:

Thanks for the interesting proposal.
The design seems to rely on memoryOverhead being transient.
What happens when a stage is bursty and consumes the shared portion and
fails to release it for subsequent stages (e.g., off-heap buffers that
are not garbage collected since they are off-heap)? Would this trigger
the host-level OOM described in Q6?
Or are there strategies to release the shared portion?

On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:

Yes, that's the worst case in the scenario. Please check my earlier
response to Qiegang's question; we have a set of strategies adopted in
prod to mitigate the issue.

On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:

Thanks for the explanation! So the executor is not guaranteed to get
50 GB physical memory, right? All pods on the same host may reach peak
memory usage at the same time and cause paging/swapping which hurts
performance?

On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:

No problem, let me try to explain.

1. Each executor container will run in a pod together with some other
sidecar containers taking care of tasks like authentication, etc. For
simplicity, we assume each pod has only one container, which is the
executor container.

2. Each container is assigned two values, request & limit (limit >=
request), for both CPU and memory resources (we only discuss memory
here). Each pod has request/limit values equal to the sum of those of
all containers belonging to this pod.

3. The K8S scheduler chooses a machine to host a pod based on the
request value, and caps the resource usage of each container based on
its limit value, e.g.
if I have a pod with a single container in it, and it has 1G/2G as the
request and limit values respectively, any machine with 1G of free RAM
will be a candidate to host this pod, and when the container uses more
than 2G of memory, it will be killed by the cgroup OOM killer. Once a
pod is scheduled to a host, the memory space sized at "sum of all its
containers' request values" will be booked exclusively for this pod.

4. By default, Spark sets request and limit to the same value for
executors on K8S, and this value is basically spark.executor.memory +
spark.executor.memoryOverhead in most cases. However,
spark.executor.memoryOverhead usage is very bursty: a user setting
spark.executor.memoryOverhead to 10G usually means each executor only
needs the 10G in a very small portion of the executor's whole lifecycle.

5. The proposed SPIP essentially decouples the request and limit values
in Spark@K8S for executors in a safe way (this idea is from the
ByteDance paper we refer to in the SPIP doc).

Using the aforementioned example: if we have a single-node cluster with
100G of RAM, two pods requesting 40G + 10G (on-heap + memoryOverhead)
each, and a bursty factor of 1.2, then without the mechanism proposed in
this SPIP we can host at most 2 pods on this machine, and because of the
bursty usage of that 10G space, memory utilization would be compromised.

When applying the burst-aware memory allocation, we only need
40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have
20G of free memory left on the machine which can be used to host some
smaller pods.
At the same time, as we didn't change the limit value of the executor
pods, these executors can still use 50G at max.

On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:

Sorry, I'm not very familiar with the K8S infra; how does it work under
the hood? The container will adjust its system memory size depending on
the actual memory usage of the processes in this container?

On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:

Yeah, we have a few cases where O is significantly larger than H, and
the proposed algorithm is actually a great fit.

As I explained in SPIP doc Appendix C, the proposed algorithm will
allocate a non-trivial G to ensure the safety of running, but still cut
a big chunk of memory (10s of GBs) and treat it as S, saving tons of
money burnt by it.

But regarding native accelerators: some native acceleration engines do
not use memoryOverhead but use off-heap memory
(spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
implementation does not cover this part, though that would be an easy
extension.

On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:

Thanks for the reply.

Have you tested in environments where O is bigger than H? Wondering if
the proposed algorithm would help more in those environments (e.g. with
native accelerators)?
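[For readers following the arithmetic: the request/limit derivation Nan
walks through above can be sketched as below. This is a minimal
illustration of the worked example (request = H + O - min((H + O) x
(burstyFactor - 1), O), limit = H + O), not the SPIP's actual
implementation; the function and parameter names are made up here.]

```python
def burst_aware_memory(heap_gb, overhead_gb, bursty_factor):
    """Derive pod memory request/limit under the burst-aware scheme.

    A shared portion S is carved out of memoryOverhead and dropped from
    the scheduler-visible request, while the cgroup limit stays at
    heap + overhead, so the executor can still burst to the full amount.
    """
    total = heap_gb + overhead_gb
    # S is capped by the overhead itself: the heap is always fully booked.
    shared = min(total * bursty_factor - total, overhead_gb)
    request = total - shared  # what the K8S scheduler books on the node
    limit = total             # cgroup OOM-killer threshold, unchanged
    return request, limit

# Nan's example: 40G heap + 10G overhead, bursty factor 1.2.
# The request drops to ~40G while the limit stays at 50G, so a 100G node
# hosting two such executors keeps ~20G free for smaller pods.
request, limit = burst_aware_memory(40, 10, 1.2)
```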
On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:

Hi Qiegang, thanks for the good questions as well.

Please check the following answers.

> My initial understanding is that Kubernetes will use the Executor
> Memory Request (H + G) for scheduling decisions, which allows for
> better resource packing.

Yes, your understanding is correct.

> How is the risk of host-level OOM mitigated when the total potential
> usage sum of H+G+S across all pods on a node exceeds its allocatable
> capacity? Does the proposal implicitly rely on the cluster operator to
> manually ensure an unrequested memory buffer exists on the node to
> serve as the shared pool?

In PINS, we basically apply a set of strategies: setting a conservative
bursty factor, progressive rollout, and monitoring cluster metrics like
Linux kernel OOM killer occurrences to guide us to the optimal bursty
factor setup. Usually, K8S operators set a reserved space for daemon
processes on each host; we found that sufficient in our case, and our
major tuning focuses on the bursty factor value.

> Have you considered scheduling optimizations to ensure a strategic mix
> of executors with large S and small S values on a single node? I am
> wondering if this would reduce the probability of concurrent bursting
> and host-level OOM.

Yes, when we worked on this project, we paid some attention to the
cluster scheduling policy/behavior.
Two things we mostly care about:

1. As stated in the SPIP doc, the cluster should have a certain level of
workload diversity so that we have enough candidates to form a mixed set
of executors with large S and small S values.

2. We avoid using a binpack scheduling algorithm, which tends to pack
more pods from the same job onto the same host; that can create trouble,
as those pods are more likely to ask for max memory at the same time.

On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:

Thanks for sharing this interesting proposal.

My initial understanding is that Kubernetes will use the Executor Memory
Request (H + G) for scheduling decisions, which allows for better
resource packing. I have a few questions regarding the shared portion S:

1. How is the risk of host-level OOM mitigated when the total potential
usage sum of H+G+S across all pods on a node exceeds its allocatable
capacity? Does the proposal implicitly rely on the cluster operator to
manually ensure an unrequested memory buffer exists on the node to serve
as the shared pool?

2. Have you considered scheduling optimizations to ensure a strategic
mix of executors with large S and small S values on a single node? I am
wondering if this would reduce the probability of concurrent bursting
and host-level OOM.
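[The host-level OOM risk raised in question 1 above can be made concrete
with a small worst-case check: once request and limit are decoupled, a
node can be fully booked by requests yet overcommitted on limits. The
function and numbers below are a hypothetical illustration, not part of
the SPIP; they also show why mixing large-S and small-S pods shrinks the
worst case.]

```python
def worst_case_shortfall_gb(pods, allocatable_gb):
    """How many GB the node would be short if every pod burst to its
    cgroup limit at the same time. `pods` is a list of
    (request_gb, limit_gb) pairs already placed on the node."""
    # The scheduler only places pods whose requests fit the node.
    assert sum(req for req, _ in pods) <= allocatable_gb
    return max(0, sum(lim for _, lim in pods) - allocatable_gb)

# 100G node: two burst-aware executors (request 40G, limit 50G) plus a
# conventional 20G/20G pod. Requests fit exactly, but a simultaneous
# burst would need 120G, overshooting by 20G.
shortfall = worst_case_shortfall_gb([(40, 50), (40, 50), (20, 20)], 100)

# Replacing one large-S executor with two small-S (20G/22G) pods lowers
# the worst case: limits now sum to 114G, an overshoot of only 14G.
mixed = worst_case_shortfall_gb([(40, 50), (20, 22), (20, 22), (20, 20)], 100)
```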
On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:

I think I'm still missing something in the big picture:

- Is the memory overhead off-heap? The formula indicates a fixed heap
size, and memory overhead can't be dynamic if it's on-heap.
- Do Spark applications have static profiles? When we submit stages, the
cluster is already allocated; how can we change anything?
- How do we assign the shared memory overhead? Fairly among all
applications on the same physical node?

On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:

We didn't separate the design into another doc since the main idea is
relatively simple.

For the request/limit calculation, I described it in Q4 of the SPIP doc:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0

It is calculated per profile (you can say it is per stage): when the
cluster manager composes the pod spec, it calculates the new memory
overhead based on what the user asks for in that resource profile.

On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:

Do we have a design sketch? How do we determine the memory request and
limit? Is it per stage or per executor?
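[Nan's per-profile answer above could look roughly like the sketch
below: for each resource profile, derive the container's memory request
and limit when composing the pod spec. The function name, dict keys, and
MiB units are assumptions for illustration; the real logic lives in
Spark's ResourceProfile handling and K8S pod builder and may differ.]

```python
def pod_memory_for_profile(profile_mib, bursty_factor):
    """Per-resource-profile request/limit derivation (illustrative only).

    profile_mib: one profile's executor memory settings in MiB, e.g.
    {"memory": 40960, "memoryOverhead": 10240}. Returns the values a pod
    builder would put into resources.requests / resources.limits.
    """
    heap = profile_mib["memory"]
    overhead = profile_mib["memoryOverhead"]
    total = heap + overhead
    # Shared portion dropped from the request, capped by the overhead.
    shared = min(total * bursty_factor - total, overhead)
    return {"requests": round(total - shared), "limits": total}

# A stage-level profile asking for 40G heap + 10G overhead, factor 1.2:
spec = pod_memory_for_profile({"memory": 40960, "memoryOverhead": 10240}, 1.2)
```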
On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:

Yeah, the implementation basically relies on the request/limit concept
in K8S.

But if any other cluster manager comes along in the future, as long as
it has a similar concept it can leverage this easily, as the main logic
is implemented in ResourceProfile.

On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:

This feature is only available on K8S because it allows containers to
have dynamic resources?

On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:

Hi Folks,

We are proposing a burst-aware memoryOverhead allocation algorithm for
Spark@K8S to improve the memory utilization of Spark clusters.
Please see more details in the SPIP doc:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0
Feedback and discussion are welcome.

Thanks Chao for being the shepherd of this feature.
Also want to thank the authors of the original paper
(https://www.vldb.org/pvldb/vol17/p3759-shi.pdf) from ByteDance,
specifically Rui ([email protected]) and Yixin
([email protected]).

Thank you.
Yao Wang

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

--
John Zhuge
