Yeah, we have a few cases where O is significantly larger than H; the proposed algorithm is actually a great fit there.

As I explained in Appendix C of the SPIP doc, the proposed algorithm allocates a non-trivial G to keep the executors running safely, but still carves out a big chunk of memory (tens of GBs) and treats it as S, saving a lot of the money that memory would otherwise burn.

Regarding native accelerators: some native acceleration engines do not use memoryOverhead but set off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover that case, but it would be an easy extension.
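To make the notation concrete, here is a minimal, self-contained sketch of the split in plain Scala. The fractional split rule and the names (OverheadSplitSketch, guaranteedFraction) are illustrative assumptions for this email, not the exact formulas in the SPIP doc:

object OverheadSplitSketch {
  // H = executor heap, O = memoryOverhead, G = guaranteed slice of O,
  // S = shared burstable slice of O. The pod request covers H + G (used
  // by the K8S scheduler); the pod limit covers H + G + S (burst ceiling).
  final case class Split(guaranteedMiB: Long, sharedMiB: Long)

  def split(overheadMiB: Long, guaranteedFraction: Double): Split = {
    require(guaranteedFraction >= 0.0 && guaranteedFraction <= 1.0)
    val g = (overheadMiB * guaranteedFraction).toLong     // G: always reserved
    Split(guaranteedMiB = g, sharedMiB = overheadMiB - g) // S: burstable
  }

  def main(args: Array[String]): Unit = {
    val heapMiB = 8192L      // H
    val overheadMiB = 40960L // O: a large overhead, e.g. a native-engine job
    val sp = split(overheadMiB, guaranteedFraction = 0.25)
    println(s"request (H + G) = ${heapMiB + sp.guaranteedMiB} MiB")
    println(s"limit (H + G + S) = ${heapMiB + sp.guaranteedMiB + sp.sharedMiB} MiB")
  }
}

With these made-up numbers, the pod requests 18432 MiB but can burst up to 49152 MiB, and the 30 GiB of S no longer has to be pre-provisioned on every executor.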
On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:

> Thanks for the reply.
>
> Have you tested in environments where O is bigger than H? Wondering if
> the proposed algorithm would help more in those environments (e.g. with
> native accelerators)?
>
> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>
>> Hi Qiegang, thanks for the good questions as well.
>>
>> Please check the following answers.
>>
>> > My initial understanding is that Kubernetes will use the Executor
>> > Memory Request (H + G) for scheduling decisions, which allows for
>> > better resource packing.
>>
>> Yes, your understanding is correct.
>>
>> > How is the risk of host-level OOM mitigated when the total potential
>> > usage sum of H+G+S across all pods on a node exceeds its allocatable
>> > capacity? Does the proposal implicitly rely on the cluster operator
>> > to manually ensure an unrequested memory buffer exists on the node
>> > to serve as the shared pool?
>>
>> In PINS, we basically apply a set of strategies: setting a conservative
>> bursty factor, rolling out progressively, and monitoring cluster
>> metrics such as Linux kernel OOMKiller occurrences to guide us toward
>> the optimal bursty factor setup. Usually, K8S operators reserve space
>> for daemon processes on each host; we found that sufficient in our
>> case, and our major tuning focuses on the bursty factor value.
>>
>> > Have you considered scheduling optimizations to ensure a strategic
>> > mix of executors with large S and small S values on a single node?
>> > I am wondering if this would reduce the probability of concurrent
>> > bursting and host-level OOM.
>>
>> Yes, when we worked on this project, we paid some attention to the
>> cluster scheduling policy/behavior. Two things we mostly care about:
>>
>> 1. As stated in the SPIP doc, the cluster should have a certain level
>> of workload diversity, so that we have enough candidates to form a
>> mixed set of executors with large S and small S values.
>>
>> 2. We avoid binpack scheduling algorithms, which tend to pack more
>> pods from the same job onto the same host; that can create trouble,
>> as those pods are more likely to ask for their max memory at the same
>> time.
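As a back-of-the-envelope illustration of the node-level accounting behind the answer above: the host size, reserved space, and executor shape in this sketch are made-up numbers, not from the SPIP.

object NodeAccountingSketch {
  // The scheduler only guarantees that the sum of requests fits within
  // the node's allocatable memory (capacity minus reserved daemon space).
  // Limits may overcommit; the bursty factor bounds how far.
  final case class PodMem(requestMiB: Long, limitMiB: Long)

  def main(args: Array[String]): Unit = {
    val capacityMiB = 262144L // 256 GiB host
    val reservedMiB = 8192L   // space reserved for daemon processes
    val allocatableMiB = capacityMiB - reservedMiB

    // Six identical executors: request = H + G, limit = H + G + S.
    val pods = Seq.fill(6)(PodMem(requestMiB = 18432L, limitMiB = 49152L))
    val requests = pods.map(_.requestMiB).sum
    val limits = pods.map(_.limitMiB).sum

    println(s"requests fit in allocatable: ${requests <= allocatableMiB}")
    println(f"limit overcommit ratio: ${limits.toDouble / allocatableMiB}%.2f")
    // Host-level OOM requires many pods to burst into S at the same time;
    // a conservative bursty factor and mixed workloads make that unlikely.
  }
}

Here requests total 108 GiB against 248 GiB allocatable, while limits total 288 GiB, an overcommit ratio of about 1.16.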
>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>
>>> Thanks for sharing this interesting proposal.
>>>
>>> My initial understanding is that Kubernetes will use the Executor
>>> Memory Request (H + G) for scheduling decisions, which allows for
>>> better resource packing. I have a few questions regarding the shared
>>> portion S:
>>>
>>> 1. How is the risk of host-level OOM mitigated when the total
>>> potential usage sum of H+G+S across all pods on a node exceeds its
>>> allocatable capacity? Does the proposal implicitly rely on the
>>> cluster operator to manually ensure an unrequested memory buffer
>>> exists on the node to serve as the shared pool?
>>>
>>> 2. Have you considered scheduling optimizations to ensure a strategic
>>> mix of executors with large S and small S values on a single node?
>>> I am wondering if this would reduce the probability of concurrent
>>> bursting and host-level OOM.
>>>
>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>
>>>> I think I'm still missing something in the big picture:
>>>>
>>>> - Is the memory overhead off-heap? The formula indicates a fixed
>>>> heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>
>>>> - Do Spark applications have static profiles? When we submit stages,
>>>> the cluster is already allocated; how can we change anything?
>>>>
>>>> - How do we assign the shared memory overhead? Fairly among all
>>>> applications on the same physical node?
>>>>
>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> We didn't separate the design into another doc, since the main idea
>>>>> is relatively simple.
>>>>>
>>>>> For the request/limit calculation, I described it in Q4 of the SPIP
>>>>> doc:
>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>
>>>>> It is calculated per profile (you can say per stage): when the
>>>>> cluster manager composes the pod spec, it calculates the new memory
>>>>> overhead based on what the user asks for in that resource profile.
>>>>>
>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Do we have a design sketch? How do we determine the memory request
>>>>>> and limit? Is it per stage or per executor?
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yeah, the implementation basically relies on the request/limit
>>>>>>> concept in K8S.
>>>>>>>
>>>>>>> But if any other cluster manager comes along in the future, as
>>>>>>> long as it has a similar concept, it can leverage this easily,
>>>>>>> since the main logic is implemented in ResourceProfile.
>>>>>>>
>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>> containers to have dynamic resources?
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>> algorithm for Spark@K8S to improve memory utilization of Spark
>>>>>>>>> clusters. Please see more details in the SPIP doc
>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>
>>>>>>>>> Thanks to Chao for being the shepherd of this feature.
>>>>>>>>> I also want to thank the authors of the original paper
>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance,
>>>>>>>>> specifically Rui ([email protected]) and Yixin
>>>>>>>>> ([email protected]).
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>> Yao Wang
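To close the loop on the per-ResourceProfile calculation Nan points to in the Q4 answer above, here is a simplified stand-in, not Spark's actual internals: the Profile and PodMemory types and the compose function are invented for illustration.

object PodSpecPerProfileSketch {
  // Stand-in for the per-profile calculation described in Q4 of the SPIP:
  // when the K8S backend composes a pod spec for a resource profile, it
  // derives the memory request and limit from that profile's heap and
  // overhead. Mirrors Spark's ResourceProfile concept, not the real API.
  final case class Profile(heapMiB: Long, overheadMiB: Long)
  final case class PodMemory(requestMiB: Long, limitMiB: Long)

  def compose(p: Profile, guaranteedFraction: Double): PodMemory = {
    val g = (p.overheadMiB * guaranteedFraction).toLong // guaranteed part G
    val s = p.overheadMiB - g                           // shared part S
    // Possible extension mentioned in the thread: engines like Gluten size
    // native memory via spark.memory.offHeap.size rather than memoryOverhead;
    // that value could be folded into this calculation as well.
    PodMemory(requestMiB = p.heapMiB + g, limitMiB = p.heapMiB + g + s)
  }

  def main(args: Array[String]): Unit = {
    // Different stages can carry different profiles, each getting its own spec.
    val etlProfile = Profile(heapMiB = 8192L, overheadMiB = 4096L)
    val nativeProfile = Profile(heapMiB = 4096L, overheadMiB = 32768L)
    Seq(etlProfile, nativeProfile).foreach { p =>
      println(compose(p, guaranteedFraction = 0.3))
    }
  }
}

The point of doing this per profile is that a heap-heavy ETL stage and an overhead-heavy native stage end up with very different request/limit pairs from the same rule.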
