[ 
https://issues.apache.org/jira/browse/YUNIKORN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067517#comment-18067517
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-3243:
-------------------------------------------------

Tracking the slack discussion here:
{quote}we are running YK 1.6.1 on a 2.5k-node cluster. We ran into a starvation 
issue in production that I wanted to flag and discuss.
*Problem*: When two sibling queues have highly asymmetric guaranteed 
resources (e.g., a 3600:1 ratio), the smaller queue is completely starved: 
zero allocations, indefinitely. This also leaves the autoscaler (Karpenter in 
our case) blind to the starved queue's asks, since {{schedulingAttempted}} is 
only set inside {{tryAllocate()}}, which never visits the starved queue.
*JIRA*: https://issues.apache.org/jira/browse/YUNIKORN-3243
PR with reproducing tests: [https://github.com/apache/yunikorn-core/pull/1077]
*Root cause*: {{TryAllocate()}} returns on the first child queue that 
succeeds. With a 3600:1 guaranteed ratio, the large queue's DRF ratio stays 
below the small queue's for thousands of cycles, so the small queue is never 
reached. With continuous demand on the large queue, this becomes indefinite.
We've been looking at a few possible approaches to address this and would love 
the community's input:
 # Time-weighted ratio: factor consecutive skipped cycles into the sort order, 
e.g., {{effectiveRatio = ratio / (1 + skippedCycles * alpha)}}. A lightweight 
change, but tricky to tune at extreme ratios without breaking DRF fairness for 
normal cases.
 # Bounded multi-allocation loop: wrap {{schedule()}} in a configurable loop 
(e.g., {{maxAllocsPerCycle}}), re-sorting queues between allocations. This is 
similar to how the YARN CapacityScheduler handles it with its 
{{while(canAllocateMore)}} pattern. It would also need the RM callback to be 
async/batched, since it is currently synchronous.
 # Automatic starvation detection with priority escalation: track consecutive 
skipped cycles per queue and dynamically boost priority when a threshold is 
exceeded; essentially a self-healing version of the {{priority.offset}} config 
workaround.
 # Longer term, a heap-based pop-push loop: refactor {{TryAllocate()}} to pop 
the best queue, allocate, push it back re-sorted, and repeat. This is 
structurally how Volcano and KAI-Scheduler prevent starvation.

Curious about the community's thoughts, especially on why the 
single-allocation-per-cycle approach was chosen. Was it a deliberate tradeoff 
for consistency/simplicity, or something that could be revisited?
For now, {{priority.offset}} on the smaller queue works as a config-level 
mitigation, and we've validated it in the tests on the PR.{quote}
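To make the first proposal above concrete, here is a minimal, self-contained Go sketch of the time-weighted ratio idea. The names ({{queueState}}, {{effectiveRatio}}, {{cyclesUntilScheduled}}, {{alpha}}) are illustrative assumptions, not yunikorn-core APIs, and the large queue's ratio is held constant for simplicity:

```go
package main

import "fmt"

// Hypothetical sketch of approach 1 (time-weighted ratio).
// These types and names are illustrative, not yunikorn-core code.
type queueState struct {
	drfRatio      float64 // dominant-resource usage / guaranteed share
	skippedCycles int     // consecutive cycles the queue was not visited
}

// effectiveRatio discounts a queue's DRF ratio by how long it has been
// skipped, so a starved queue eventually sorts ahead of its sibling.
func effectiveRatio(q queueState, alpha float64) float64 {
	return q.drfRatio / (1 + float64(q.skippedCycles)*alpha)
}

// cyclesUntilScheduled simulates how many cycles the small queue is
// skipped before its discounted ratio drops below the large queue's.
// The large queue is visited every cycle, so its counter stays at zero.
func cyclesUntilScheduled(smallRatio, largeRatio, alpha float64) int {
	small := queueState{drfRatio: smallRatio}
	large := queueState{drfRatio: largeRatio}
	cycles := 0
	for effectiveRatio(large, alpha) < effectiveRatio(small, alpha) {
		small.skippedCycles++
		cycles++
	}
	return cycles
}

func main() {
	// With a 3600:1 ratio gap and alpha = 1, the small queue wins a
	// cycle after roughly 3600 skips instead of never.
	fmt.Println(cyclesUntilScheduled(0.36, 0.0001, 1.0))
}
```

This also shows the tuning tension called out above: with a small alpha the wait grows proportionally to the ratio gap, while a large alpha starts to override DRF ordering even in healthy configurations.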

> Fair-share queue sorting causes starvation of sibling queues with asymmetric 
> guaranteed resources
> -------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3243
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3243
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Shubham Mishra
>            Assignee: Shubham Mishra
>            Priority: Major
>             Fix For: 1.6.1
>
>
> When two sibling queues under the same parent have vastly different 
> guaranteed resources (e.g., 3600:1 ratio), the fair-share queue sorting in 
> {{TryAllocate}} causes the smaller queue to be completely starved — its 
> {{app.tryAllocate()}} is never called. This has two consequences:
>  # *Scheduling starvation*: The smaller queue's asks are never evaluated 
> for allocation, even when nodes have capacity.
>  # *Autoscaler blindness*: Because 
> [SetSchedulingAttempted(true)|https://github.com/apache/yunikorn-core/blob/cb7f2381b6098f8936fe57dd7f13f205939a0021/pkg/scheduler/objects/application.go#L1065] 
> is only set inside {{app.tryAllocate()}} (line 1065 of {{application.go}}), 
> the starved queue's asks never get this flag. {{inspectOutstandingRequests}} 
> skips them, so the cluster autoscaler (e.g., Karpenter) is never notified 
> that capacity is needed.
> The second issue is the more critical one: even if scheduling is delayed, 
> the autoscaler should be able to provision nodes in parallel. With the 
> current design, however, the autoscaler signal is gated on queue visitation.
> Unit tests reproducing the issue: 
> [https://github.com/apache/yunikorn-core/pull/1077]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
