[
https://issues.apache.org/jira/browse/YUNIKORN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shubham Mishra updated YUNIKORN-3243:
-------------------------------------
Summary: Fair-share queue sorting causes starvation of sibling queues with
asymmetric guaranteed resources (was: Fair-share queue sorting causes
indefinite starvation of sibling queues with asymmetric guaranteed resources)
> Fair-share queue sorting causes starvation of sibling queues with asymmetric
> guaranteed resources
> -------------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3243
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3243
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Shubham Mishra
> Priority: Major
>
> When two sibling queues under the same parent have vastly different
> guaranteed resources (e.g., a 3600:1 ratio), the fair-share queue sorting in
> {{TryAllocate}} causes the smaller queue to be completely starved: its
> {{app.tryAllocate()}} is never called. This has two consequences:
> # *Scheduling starvation*: The smaller queue's asks are never evaluated
> for allocation, even when nodes have capacity.
> # *Autoscaler blindness*: {{SetSchedulingAttempted(true)}} is called only
> inside {{app.tryAllocate()}} (line 1035 of {{application.go}}), so the
> starved queue's asks never get this flag. {{inspectOutstandingRequests}}
> skips them, and the cluster autoscaler (e.g., Karpenter) is never notified
> that capacity is needed.
> The second issue is the more critical one: even if scheduling is delayed,
> the autoscaler should be able to provision nodes in parallel. With the
> current design, however, the autoscaler signal is gated on queue visitation.
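The mechanism described above can be illustrated with a small, self-contained Go model. This is only a sketch under stated assumptions and is not YuniKorn code: the {{queue}} type, the {{share}} metric (allocated divided by guaranteed), the {{tryAllocate}} and {{simulate}} helpers, and the rule that a cycle returns after the first successful allocation are all hypothetical simplifications. The {{attempted}} field stands in for the flag that {{SetSchedulingAttempted(true)}} sets.

```go
package main

import (
	"fmt"
	"sort"
)

// queue is a hypothetical stand-in for a leaf queue; it is not YuniKorn's type.
type queue struct {
	name       string
	guaranteed float64 // guaranteed resource, arbitrary units
	allocated  float64 // currently allocated resource
	pending    int     // outstanding asks
	attempted  bool    // stand-in for the "scheduling attempted" flag
}

// share is the assumed fair-share metric: usage relative to the guarantee.
func (q *queue) share() float64 { return q.allocated / q.guaranteed }

// tryAllocate visits queues in ascending share order and returns after the
// first successful allocation (an assumption of this sketch).
func tryAllocate(queues []*queue, free *float64) *queue {
	sort.SliceStable(queues, func(i, j int) bool {
		return queues[i].share() < queues[j].share()
	})
	for _, q := range queues {
		if q.pending == 0 {
			continue
		}
		q.attempted = true // the flag is set only when the queue is visited
		if *free >= 1 {
			*free--
			q.allocated++
			q.pending--
			return q
		}
	}
	return nil
}

// simulate runs the given number of cycles with one free unit of capacity
// available per cycle, using a 3600:1 guaranteed-resource ratio.
func simulate(cycles int) (large, small *queue) {
	large = &queue{name: "large", guaranteed: 3600, pending: cycles}
	small = &queue{name: "small", guaranteed: 1, allocated: 1, pending: 5}
	queues := []*queue{small, large}
	free := float64(cycles)
	for i := 0; i < cycles; i++ {
		tryAllocate(queues, &free)
	}
	return large, small
}

func main() {
	large, small := simulate(1000)
	fmt.Printf("large: allocated=%v attempted=%v\n", large.allocated, large.attempted)
	fmt.Printf("small: allocated=%v attempted=%v\n", small.allocated, small.attempted)
}
```

In this model the small queue already holds one unit, so its share is 1.0, while the large queue would need 3600 units to reach the same share. The large queue therefore sorts first in all 1000 cycles, and the small queue's {{attempted}} flag stays false even though capacity was free the whole time. Anything that reads that flag, such as an autoscaler hook, would see nothing to report for the small queue.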
--
This message was sent by Atlassian Jira
(v8.20.10#820010)