Sorry, I'm not very familiar with the K8s infra. How does it work under the
hood? Does the container adjust its memory size depending on the actual
memory usage of the processes running in it?

On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:

> yeah, we have a few cases where O is significantly larger than H; the
> proposed algorithm is actually a great fit there
>
> as I explained in Appendix C of the SPIP doc, the proposed algorithm
> allocates a non-trivial G to keep the executor running safely, but still
> carves out a big chunk of memory (10s of GBs) and treats it as S, saving
> tons of money that would otherwise be burnt on it
>
> but regarding native accelerators, some native acceleration engines do not
> use memoryOverhead but use off-heap (spark.memory.offHeap.size) explicitly
> (e.g. Gluten). The current implementation does not cover that part yet,
> but it would be an easy extension
>
> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>
>> Thanks for the reply.
>>
>> Have you tested in environments where O is bigger than H? Wondering if
>> the proposed algorithm would help even more in those environments (e.g.
>> with native accelerators)?
>>
>>
>>
>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>
>>> Hi Qiegang, thanks for the good questions as well
>>>
>>> please check the following answers
>>>
>>> > My initial understanding is that Kubernetes will use the Executor
>>> Memory Request (H + G) for scheduling decisions, which allows for
>>> better resource packing.
>>>
>>> yes, your understanding is correct
>>>
>>> > How is the risk of host-level OOM mitigated when the total potential
>>> usage (the sum of H+G+S) across all pods on a node exceeds its allocatable
>>> capacity? Does the proposal implicitly rely on the cluster operator to
>>> manually ensure an unrequested memory buffer exists on the node to serve as
>>> the shared pool?
>>>
>>> in PINS, we basically apply a set of strategies: setting a conservative
>>> bursty factor, rolling out progressively, and monitoring cluster metrics
>>> like Linux kernel OOMKiller occurrences to guide us toward the optimal
>>> bursty factor setup... usually, K8S operators set aside reserved space for
>>> daemon processes on each host; we found that to be sufficient in our case,
>>> and our major tuning focuses on the bursty factor value
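>>>
>>> as a purely illustrative back-of-the-envelope check (the numbers and
>>> names below are made up, not from our clusters): K8S schedules on the
>>> requests (H + G), so the requested sum never exceeds the node's
>>> allocatable memory, while the limits (H + G + S) may oversubscribe it,
>>> and the bursty factor bounds by how much
>>>
>>>   // hypothetical numbers: 8 executors on a node with 240 GiB allocatable
>>>   val allocatableGiB = 240.0
>>>   val podRequestsGiB = Seq.fill(8)(30.0) // H + G per pod
>>>   val podLimitsGiB   = Seq.fill(8)(42.0) // H + G + S per pod
>>>   assert(podRequestsGiB.sum <= allocatableGiB)             // guaranteed by the scheduler
>>>   val oversubscription = podLimitsGiB.sum / allocatableGiB // 1.4x: OOMKiller risk only
>>>                                                            // if many pods burst at once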
>>>
>>>
>>> > Have you considered scheduling optimizations to ensure a strategic mix
>>> of executors with large S and small S values on a single node?  I am
>>> wondering if this would reduce the probability of concurrent bursting and
>>> host-level OOM.
>>>
>>> Yes, while working on this project, we paid some attention to the cluster
>>> scheduling policy/behavior... two things we mostly care about:
>>>
>>> 1. as stated in the SPIP doc, the cluster should have a certain level of
>>> workload diversity so that we have enough candidates to form a mixed set
>>> of executors with large S and small S values
>>>
>>> 2. we avoid using a binpack scheduling algorithm, which tends to pack more
>>> pods from the same job onto the same host; that can create trouble since
>>> they are more likely to ask for their max memory at the same time
>>>
>>>
>>>
>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>
>>>> Thanks for sharing this interesting proposal.
>>>>
>>>> My initial understanding is that Kubernetes will use the Executor
>>>> Memory Request (H + G) for scheduling decisions, which allows for
>>>> better resource packing.  I have a few questions regarding the shared
>>>> portion S:
>>>>
>>>>    1. How is the risk of host-level OOM mitigated when the total
>>>>    potential usage (the sum of H+G+S) across all pods on a node exceeds
>>>>    its allocatable capacity? Does the proposal implicitly rely on the
>>>>    cluster operator to manually ensure an unrequested memory buffer
>>>>    exists on the node to serve as the shared pool?
>>>>    2. Have you considered scheduling optimizations to ensure a
>>>>    strategic mix of executors with large S and small S values on a
>>>>    single node?  I am wondering if this would reduce the probability of
>>>>    concurrent bursting and host-level OOM.
>>>>
>>>>
>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>
>>>>> I think I'm still missing something in the big picture:
>>>>>
>>>>>    - Is the memory overhead off-heap? The formula indicates a fixed
>>>>>    heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>>    - Do Spark applications have static profiles? When we submit
>>>>>    stages, the cluster is already allocated; how can we change anything?
>>>>>    - How do we assign the shared memory overhead? Fairly among all
>>>>>    applications on the same physical node?
>>>>>
>>>>>
>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> we didn't separate the design into another doc since the main idea is
>>>>>> relatively simple...
>>>>>>
>>>>>> for request/limit calculation, I described it in Q4 of the SPIP doc
>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>
>>>>>> it is calculated per resource profile (you could say per stage): when
>>>>>> the cluster manager composes the pod spec, it calculates the new memory
>>>>>> overhead based on what the user asks for in that resource profile
>>>>>>
>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Do we have a design sketch? How do we determine the memory request
>>>>>>> and limit? Is it per stage or per executor?
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> yeah, the implementation basically relies on the request/limit
>>>>>>>> concept in K8S, ...
>>>>>>>>
>>>>>>>> but if any other cluster manager comes along in the future, as long
>>>>>>>> as it has a similar concept, it can leverage this easily since the
>>>>>>>> main logic is implemented in ResourceProfile
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Is this feature only available on K8s because it allows containers
>>>>>>>>> to have dynamic resources?
>>>>>>>>>
>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Folks,
>>>>>>>>>>
>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
>>>>>>>>>> algorithm for Spark@K8S to improve the memory utilization of Spark
>>>>>>>>>> clusters.
>>>>>>>>>> Please see more details in the SPIP doc
>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>
>>>>>>>>>> Thanks to Chao for being the shepherd of this feature.
>>>>>>>>>> We also want to thank the authors of the original paper
>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance,
>>>>>>>>>> specifically Rui ([email protected]) and Yixin
>>>>>>>>>> ([email protected]).
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>> Yao Wang
>>>>>>>>>>
>>>>>>>>>
