Yeah, we have a few cases where O is significantly larger than H; the proposed algorithm is actually a great fit there.

As I explained in Appendix C of the SPIP doc, the proposed algorithm allocates a non-trivial G to keep the executors running safely, but still carves out a big chunk of memory (tens of GBs) and treats it as S, saving a lot of the money that memory would otherwise burn.

Regarding native accelerators: some native acceleration engines do not use memoryOverhead but set off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover that case, but it would be an easy extension.
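To make the notation concrete, here is a minimal, self-contained sketch of the split in plain Scala. The fractional split rule and the names (OverheadSplitSketch, guaranteedFraction) are illustrative assumptions for this email, not the exact formulas in the SPIP doc:

object OverheadSplitSketch {
  // H = executor heap, O = memoryOverhead, G = guaranteed slice of O,
  // S = shared burstable slice of O. The pod request covers H + G (used
  // by the K8S scheduler); the pod limit covers H + G + S (burst ceiling).
  final case class Split(guaranteedMiB: Long, sharedMiB: Long)

  def split(overheadMiB: Long, guaranteedFraction: Double): Split = {
    require(guaranteedFraction >= 0.0 && guaranteedFraction <= 1.0)
    val g = (overheadMiB * guaranteedFraction).toLong     // G: always reserved
    Split(guaranteedMiB = g, sharedMiB = overheadMiB - g) // S: burstable
  }

  def main(args: Array[String]): Unit = {
    val heapMiB = 8192L      // H
    val overheadMiB = 40960L // O: a large overhead, e.g. a native-engine job
    val sp = split(overheadMiB, guaranteedFraction = 0.25)
    println(s"request (H + G) = ${heapMiB + sp.guaranteedMiB} MiB")
    println(s"limit (H + G + S) = ${heapMiB + sp.guaranteedMiB + sp.sharedMiB} MiB")
  }
}

With these made-up numbers, the pod requests 18432 MiB but can burst up to 49152 MiB, and the 30 GiB of S no longer has to be pre-provisioned on every executor.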
On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:

> Thanks for the reply.
>
> Have you tested in environments where O is bigger than H? Wondering if
> the proposed algorithm would help more in those environments (e.g. with
> native accelerators)?
>
> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>
>> Hi Qiegang, thanks for the good questions as well.
>>
>> Please check the following answers.
>>
>> > My initial understanding is that Kubernetes will use the Executor
>> > Memory Request (H + G) for scheduling decisions, which allows for
>> > better resource packing.
>>
>> Yes, your understanding is correct.
>>
>> > How is the risk of host-level OOM mitigated when the total potential
>> > usage sum of H+G+S across all pods on a node exceeds its allocatable
>> > capacity? Does the proposal implicitly rely on the cluster operator
>> > to manually ensure an unrequested memory buffer exists on the node
>> > to serve as the shared pool?
>>
>> In PINS, we basically apply a set of strategies: setting a conservative
>> bursty factor, rolling out progressively, and monitoring cluster
>> metrics such as Linux kernel OOMKiller occurrences to guide us toward
>> the optimal bursty factor setup. Usually, K8S operators reserve space
>> for daemon processes on each host; we found that sufficient in our
>> case, and our major tuning focuses on the bursty factor value.
>>
>> > Have you considered scheduling optimizations to ensure a strategic
>> > mix of executors with large S and small S values on a single node?
>> > I am wondering if this would reduce the probability of concurrent
>> > bursting and host-level OOM.
>>
>> Yes, when we worked on this project, we paid some attention to the
>> cluster scheduling policy/behavior. Two things we mostly care about:
>>
>> 1. As stated in the SPIP doc, the cluster should have a certain level
>> of workload diversity, so that we have enough candidates to form a
>> mixed set of executors with large S and small S values.
>>
>> 2. We avoid binpack scheduling algorithms, which tend to pack more
>> pods from the same job onto the same host; that can create trouble,
>> as those pods are more likely to ask for their max memory at the same
>> time.
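As a back-of-the-envelope illustration of the node-level accounting behind the answer above: the host size, reserved space, and executor shape in this sketch are made-up numbers, not from the SPIP.

object NodeAccountingSketch {
  // The scheduler only guarantees that the sum of requests fits within
  // the node's allocatable memory (capacity minus reserved daemon space).
  // Limits may overcommit; the bursty factor bounds how far.
  final case class PodMem(requestMiB: Long, limitMiB: Long)

  def main(args: Array[String]): Unit = {
    val capacityMiB = 262144L // 256 GiB host
    val reservedMiB = 8192L   // space reserved for daemon processes
    val allocatableMiB = capacityMiB - reservedMiB

    // Six identical executors: request = H + G, limit = H + G + S.
    val pods = Seq.fill(6)(PodMem(requestMiB = 18432L, limitMiB = 49152L))
    val requests = pods.map(_.requestMiB).sum
    val limits = pods.map(_.limitMiB).sum

    println(s"requests fit in allocatable: ${requests <= allocatableMiB}")
    println(f"limit overcommit ratio: ${limits.toDouble / allocatableMiB}%.2f")
    // Host-level OOM requires many pods to burst into S at the same time;
    // a conservative bursty factor and mixed workloads make that unlikely.
  }
}

Here requests total 108 GiB against 248 GiB allocatable, while limits total 288 GiB, an overcommit ratio of about 1.16.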
>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>
>>> Thanks for sharing this interesting proposal.
>>>
>>> My initial understanding is that Kubernetes will use the Executor
>>> Memory Request (H + G) for scheduling decisions, which allows for
>>> better resource packing. I have a few questions regarding the shared
>>> portion S:
>>>
>>> 1. How is the risk of host-level OOM mitigated when the total
>>> potential usage sum of H+G+S across all pods on a node exceeds its
>>> allocatable capacity? Does the proposal implicitly rely on the
>>> cluster operator to manually ensure an unrequested memory buffer
>>> exists on the node to serve as the shared pool?
>>>
>>> 2. Have you considered scheduling optimizations to ensure a strategic
>>> mix of executors with large S and small S values on a single node?
>>> I am wondering if this would reduce the probability of concurrent
>>> bursting and host-level OOM.
>>>
>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>
>>>> I think I'm still missing something in the big picture:
>>>>
>>>> - Is the memory overhead off-heap? The formula indicates a fixed
>>>> heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>
>>>> - Do Spark applications have static profiles? When we submit stages,
>>>> the cluster is already allocated; how can we change anything?
>>>>
>>>> - How do we assign the shared memory overhead? Fairly among all
>>>> applications on the same physical node?
>>>>
>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> We didn't separate the design into another doc, since the main idea
>>>>> is relatively simple.
>>>>>
>>>>> For the request/limit calculation, I described it in Q4 of the SPIP
>>>>> doc:
>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>
>>>>> It is calculated per profile (you can say per stage): when the
>>>>> cluster manager composes the pod spec, it calculates the new memory
>>>>> overhead based on what the user asks for in that resource profile.
>>>>>
>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Do we have a design sketch? How do we determine the memory request
>>>>>> and limit? Is it per stage or per executor?
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yeah, the implementation basically relies on the request/limit
>>>>>>> concept in K8S.
>>>>>>>
>>>>>>> But if any other cluster manager comes along in the future, as
>>>>>>> long as it has a similar concept, it can leverage this easily,
>>>>>>> since the main logic is implemented in ResourceProfile.
>>>>>>>
>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This feature is only available on k8s because it allows
>>>>>>>> containers to have dynamic resources?
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>> algorithm for Spark@K8S to improve memory utilization of Spark
>>>>>>>>> clusters. Please see more details in the SPIP doc
>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>
>>>>>>>>> Thanks to Chao for being the shepherd of this feature.
>>>>>>>>> I also want to thank the authors of the original paper
>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance,
>>>>>>>>> specifically Rui ([email protected]) and Yixin
>>>>>>>>> ([email protected]).
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>> Yao Wang
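To close the loop on the per-ResourceProfile calculation Nan points to in the Q4 answer above, here is a simplified stand-in, not Spark's actual internals: the Profile and PodMemory types and the compose function are invented for illustration.

object PodSpecPerProfileSketch {
  // Stand-in for the per-profile calculation described in Q4 of the SPIP:
  // when the K8S backend composes a pod spec for a resource profile, it
  // derives the memory request and limit from that profile's heap and
  // overhead. Mirrors Spark's ResourceProfile concept, not the real API.
  final case class Profile(heapMiB: Long, overheadMiB: Long)
  final case class PodMemory(requestMiB: Long, limitMiB: Long)

  def compose(p: Profile, guaranteedFraction: Double): PodMemory = {
    val g = (p.overheadMiB * guaranteedFraction).toLong // guaranteed part G
    val s = p.overheadMiB - g                           // shared part S
    // Possible extension mentioned in the thread: engines like Gluten size
    // native memory via spark.memory.offHeap.size rather than memoryOverhead;
    // that value could be folded into this calculation as well.
    PodMemory(requestMiB = p.heapMiB + g, limitMiB = p.heapMiB + g + s)
  }

  def main(args: Array[String]): Unit = {
    // Different stages can carry different profiles, each getting its own spec.
    val etlProfile = Profile(heapMiB = 8192L, overheadMiB = 4096L)
    val nativeProfile = Profile(heapMiB = 4096L, overheadMiB = 32768L)
    Seq(etlProfile, nativeProfile).foreach { p =>
      println(compose(p, guaranteedFraction = 0.3))
    }
  }
}

The point of doing this per profile is that a heap-heavy ETL stage and an overhead-heavy native stage end up with very different request/limit pairs from the same rule.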
