[
https://issues.apache.org/jira/browse/YUNIKORN-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg resolved YUNIKORN-2950.
---------------------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed
Resolving as fixed in 1.6.0 via YUNIKORN-2790
> Race condition in 1.5.2 causes queue usage to be incorrectly calculated
> -----------------------------------------------------------------------
>
> Key: YUNIKORN-2950
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2950
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Paul Santa Clara
> Priority: Major
> Fix For: 1.6.0
>
>
> We observed that some of our larger clusters (with autoscaling via Karpenter)
> began to misreport their queue usage after upgrading to 1.5.2.
> As an example, given a leaf queue root.tiers.2, we observed the following
> state:
> {code:java}
> allocated capacity(root.tiers.2) : pods 200 memory 2.187347412109375 Tib vcore 0.2 k ephemeral-storage 4.294967295999999TB{code}
> but when we summed the allocations in the full-state dump, we found:
> {code:java}
> root.tiers.2 : pods 0 memory 0.0 Tib vcore 0.0 k ephemeral-storage 0.0 TB{code}
> Similarly, we examined the number of running pods in K8s, and we found 0.
> The queue allocations were clearly off.
> This was fixed by applying the following patch to remove the race condition:
> {code:java}
>  func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error {
>      // check this queue: failure stops checks if the allocation is not part of a node addition
> -    fit, newAllocated := sq.allocatedResFits(alloc)
> +    fit, _ := sq.allocatedResFits(alloc)
>      if !nodeReported && !fit {
>          return fmt.Errorf("allocation (%v) puts queue '%s' over maximum allocation (%v), current usage (%v)",
>              alloc, sq.QueuePath, sq.maxResource, sq.allocatedResource)
> @@ -1058,6 +1058,7 @@ func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bo
>      sq.Lock()
>      defer sq.Unlock()
>      // all OK update this queue
> +    newAllocated := resources.Add(sq.allocatedResource, alloc)
>      sq.allocatedResource = newAllocated
>      sq.updateAllocatedResourceMetrics()
>      return nil {code}
>
> This appears to be [fixed|#L1041] in 1.6.0, although I have not confirmed it.
> It looks to me as though the race condition was introduced
> [here|https://github.com/apache/yunikorn-core/pull/839/files#diff-27632d48eb925e150a33bc92370ceaa66c31048018d11ca7a53a0b50ab7250acL1033].