[
https://issues.apache.org/jira/browse/YUNIKORN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067517#comment-18067517
]
Wilfred Spiegelenburg edited comment on YUNIKORN-3243 at 3/22/26 11:33 PM:
---------------------------------------------------------------------------
Tracking the slack discussion here:
{quote}We are running YuniKorn 1.6.1 on a 2.5k-node cluster. We ran into a
starvation issue in production that I wanted to flag and discuss.
{*}Problem{*}: When two sibling queues have highly asymmetric guaranteed
resources (e.g., 3600:1 ratio), the smaller queue gets completely starved —
zero allocations, indefinitely. This also makes the autoscaler (Karpenter in
our case) blind to the starved queue's asks, since {{schedulingAttempted}} is
only set inside {{tryAllocate()}}, which never visits the starved queue.
{*}JIRA{*}: https://issues.apache.org/jira/browse/YUNIKORN-3243 PR with
reproducing tests: [https://github.com/apache/yunikorn-core/pull/1077]
{*}Root cause{*}: {{TryAllocate()}} returns on the first child queue that
succeeds. With a 3600:1 guaranteed ratio, the large queue's DRF ratio stays
below the small queue's for thousands of cycles, so the small queue is never
reached. With continuous demand on the large queue, this becomes indefinite.
We've been looking at a few possible approaches to address this and would love
the community's input:
# Time-weighted ratio — factor in consecutive skipped cycles when computing
the sort order, e.g., {{effectiveRatio = ratio / (1 + skippedCycles * alpha)}}.
Lightweight change, but tricky to tune at extreme ratios without breaking DRF
fairness for normal cases.
# Bounded multi-allocation loop — wrap {{schedule()}} in a configurable loop
(e.g., {{maxAllocsPerCycle}}), re-sorting queues between each allocation.
This is similar to how YARN CapacityScheduler handles it with its
{{while(canAllocateMore)}} pattern. Would also need the RM callback to be
async/batched since it's currently synchronous.
# Automatic starvation detection with priority escalation — track consecutive
skipped cycles per queue, and dynamically boost priority when a threshold is
exceeded. Basically a self-healing version of the {{priority.offset}} config
workaround.
# Longer-term: heap-based pop-push loop — refactor {{TryAllocate()}} to pop
the best queue, allocate, push it back re-sorted, repeat. This is structurally
how Volcano and KAI-Scheduler prevent starvation.
Curious about the community's thoughts — especially around why the
single-allocation-per-cycle approach was chosen. Was it a deliberate tradeoff
for consistency/simplicity, or something that could be revisited?
For now, {{priority.offset}} on the smaller queue works as a config-level
mitigation, and we've validated it in the tests on the PR.{quote}
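The time-weighted ratio from option 1 can be sketched in Go as follows. This is a hypothetical helper under the assumptions in the quote above — {{effectiveRatio}}, {{skippedCycles}}, and {{alpha}} are names from the proposal, not existing yunikorn-core code:

```go
package main

import "fmt"

// effectiveRatio discounts a queue's DRF usage ratio by the number of
// consecutive scheduling cycles in which the queue was skipped, so a
// starved queue eventually sorts ahead of a sibling with a lower raw ratio.
// alpha controls how quickly starvation outweighs raw fairness.
func effectiveRatio(ratio float64, skippedCycles int, alpha float64) float64 {
	return ratio / (1 + float64(skippedCycles)*alpha)
}

func main() {
	const alpha = 0.1
	// Large queue: tiny usage ratio (asymmetric guarantee), never skipped.
	large := effectiveRatio(0.001, 0, alpha)
	// Small queue: high raw ratio, but skipped for many cycles.
	small := effectiveRatio(0.5, 10000, alpha)
	fmt.Printf("large=%.6f small=%.6f smallSortsFirst=%v\n",
		large, small, small < large)
}
```

Note how the tuning difficulty mentioned in the quote shows up directly: the number of skipped cycles needed before the starved queue wins scales with the gap between the raw ratios, so an alpha that works at 10:1 may be far too weak at 3600:1.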
> Fair-share queue sorting causes starvation of sibling queues with asymmetric
> guaranteed resources
> -------------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3243
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3243
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Shubham Mishra
> Assignee: Shubham Mishra
> Priority: Major
> Fix For: 1.6.1
>
>
> When two sibling queues under the same parent have vastly different
> guaranteed resources (e.g., 3600:1 ratio), the fair-share queue sorting in
> {{TryAllocate}} causes the smaller queue to be completely starved — its
> {{app.tryAllocate()}} is never called. This has two consequences:
> # {*}Scheduling starvation{*}: The smaller queue's asks are never evaluated
> for allocation, even when nodes have capacity.
> # {*}Autoscaler blindness{*}: Because
> {{[SetSchedulingAttempted(true)|https://github.com/apache/yunikorn-core/blob/cb7f2381b6098f8936fe57dd7f13f205939a0021/pkg/scheduler/objects/application.go#L1065]}}
> is only called inside {{app.tryAllocate()}} (line 1065 of
> {{application.go}}), the starved queue's asks never get this flag.
> {{inspectOutstandingRequests}} skips them, so the cluster autoscaler (e.g.,
> Karpenter) is never notified that capacity is needed.
> The second issue is the more critical one — even if scheduling is delayed,
> the autoscaler should be able to provision nodes in parallel. But with the
> current design, the autoscaler signal is gated on queue visitation.
> Unit tests reproducing the issue:
> [https://github.com/apache/yunikorn-core/pull/1077]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)