Shubham Mishra created YUNIKORN-3243:
----------------------------------------
Summary: Fair-share queue sorting causes indefinite starvation of
sibling queues with asymmetric guaranteed resources
Key: YUNIKORN-3243
URL: https://issues.apache.org/jira/browse/YUNIKORN-3243
Project: Apache YuniKorn
Issue Type: Bug
Reporter: Shubham Mishra
When two sibling queues under the same parent have vastly different guaranteed
resources (e.g., a 3600:1 ratio), the fair-share queue sorting in {{TryAllocate}}
completely starves the smaller queue: its {{app.tryAllocate()}} is never
called. This has two consequences:
# {*}Scheduling starvation{*}: The smaller queue's asks are never evaluated
for allocation, even when nodes have capacity.
# {*}Autoscaler blindness{*}: Because {{SetSchedulingAttempted(true)}} is only
set inside {{app.tryAllocate()}} (line 1035 of {{{}application.go{}}}), the
starved queue's asks never get this flag. {{inspectOutstandingRequests}} skips
them, so the cluster autoscaler (e.g., Karpenter) is never notified that
capacity is needed.
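The starvation in the first point can be illustrated with a toy model. This is only a sketch under stated assumptions, not YuniKorn code: the {{queue}} type, the {{share()}} key (allocated divided by guaranteed) and {{runCycles}} are hypothetical, resources are reduced to one dimension, the smaller queue is assumed to already sit at its guarantee, and every visited queue with pending asks is assumed to allocate one unit per cycle.

```go
package main

import (
	"fmt"
	"sort"
)

// queue is a hypothetical stand-in for a leaf queue; it is not YuniKorn's type.
type queue struct {
	name       string
	guaranteed float64 // guaranteed resource, single dimension (assumption)
	allocated  float64 // currently allocated resource
	pending    int     // asks waiting to be scheduled
	visits     int     // how often the scheduler tried this queue
}

// share is the assumed fair-share sort key: usage relative to the guarantee.
func (q *queue) share() float64 { return q.allocated / q.guaranteed }

// runCycles simulates scheduling cycles. Each cycle sorts the siblings by
// share (lowest first), visits them in that order, and stops at the first
// queue that allocates, so later siblings are not visited in that cycle.
func runCycles(queues []*queue, cycles int) {
	for i := 0; i < cycles; i++ {
		sort.SliceStable(queues, func(a, b int) bool {
			return queues[a].share() < queues[b].share()
		})
		for _, q := range queues {
			if q.pending == 0 {
				continue // nothing to schedule in this queue
			}
			q.visits++
			q.pending--
			q.allocated++ // assumed: one unit allocated per successful visit
			break
		}
	}
}

func main() {
	big := &queue{name: "big", guaranteed: 3600, pending: 1000}
	small := &queue{name: "small", guaranteed: 1, allocated: 1, pending: 5}
	runCycles([]*queue{big, small}, 1000)
	fmt.Println(big.name, big.visits, small.name, small.visits)
}
```

In this model the larger queue's share stays below 1.0 for all 1000 cycles while the smaller queue's share is exactly 1.0, so the smaller queue is sorted last every time and is never reached.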
The second issue is the more critical one: even if scheduling is delayed, the
autoscaler should be able to provision nodes in parallel. With the current
design, however, the autoscaler signal is gated on queue visitation, so a queue
that is never visited never triggers a scale-up.
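The gating described above can be sketched in a few lines. This is a simplified illustration, not the actual implementation: the {{ask}} type and the {{tryAllocate}} and {{outstanding}} helpers are hypothetical names, and the only behaviour taken from the report is that the flag is set during the allocation attempt and that unflagged asks are skipped when outstanding requests are collected.

```go
package main

import "fmt"

// ask is a hypothetical pending request; the field mirrors the flag
// named in the report.
type ask struct {
	id                  string
	schedulingAttempted bool
}

// tryAllocate models the allocation attempt, assumed to be the only place
// where the flag is set.
func tryAllocate(asks []*ask) {
	for _, a := range asks {
		a.schedulingAttempted = true
	}
}

// outstanding returns the ids of asks that would be reported to the
// autoscaler; asks without the flag are skipped.
func outstanding(asks []*ask) []string {
	var ids []string
	for _, a := range asks {
		if a.schedulingAttempted {
			ids = append(ids, a.id)
		}
	}
	return ids
}

func main() {
	visited := []*ask{{id: "big-1"}}
	starved := []*ask{{id: "small-1"}}
	tryAllocate(visited) // only the visited queue reaches this call
	all := append(visited, starved...)
	fmt.Println(outstanding(all)) // prints [big-1]
}
```

The starved ask is pending the whole time, yet it never appears in the list that the autoscaler would act on.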