[ https://issues.apache.org/jira/browse/YUNIKORN-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg resolved YUNIKORN-2950.
---------------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

Resolving as fixed in 1.6.0 via YUNIKORN-2790

> Race condition in 1.5.2 causes queue usage to be incorrectly calculated
> -----------------------------------------------------------------------
>
>                 Key: YUNIKORN-2950
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2950
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Paul Santa Clara
>            Priority: Major
>             Fix For: 1.6.0
>
>
> We observed some of our larger clusters (with autoscaling via Karpenter) begin
> to mis-report their queue usages after upgrading to 1.5.2.
> As an example, given a leaf queue root.tiers.2, we observed the following
> state:
> {code:java}
> allocated capacity(root.tiers.2) : pods 200 memory 2.187347412109375 Tib vcore 0.2 k ephemeral-storage 4.294967295999999TB{code}
> but when we summed the allocations in the full-state dump, we found:
> {code:java}
> root.tiers.2 : pods 0 memory 0.0 Tib vcore 0.0 k ephemeral-storage 0.0 TB{code}
> Similarly, we examined the number of running pods in K8s, and we found 0.
> The queue allocations were clearly off.
> This was fixed by applying the following patch to remove the race condition:
> {code:java}
>  func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error {
>      // check this queue: failure stops checks if the allocation is not part of a node addition
> -    fit, newAllocated := sq.allocatedResFits(alloc)
> +    fit, _ := sq.allocatedResFits(alloc)
>      if !nodeReported && !fit {
>          return fmt.Errorf("allocation (%v) puts queue '%s' over maximum allocation (%v), current usage (%v)",
>              alloc, sq.QueuePath, sq.maxResource, sq.allocatedResource)
> @@ -1058,6 +1058,7 @@ func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bo
>      sq.Lock()
>      defer sq.Unlock()
>      // all OK update this queue
> +    newAllocated := resources.Add(sq.allocatedResource, alloc)
>      sq.allocatedResource = newAllocated
>      sq.updateAllocatedResourceMetrics()
>      return nil {code}
>
> This appears to be [fixed|#L1041] in 1.6.0, although I have not confirmed it.
> It looks to me that the race condition was introduced
> [here|https://github.com/apache/yunikorn-core/pull/839/files#diff-27632d48eb925e150a33bc92370ceaa66c31048018d11ca7a53a0b50ab7250acL1033].
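
For readers who want to see why the patch in the quoted report matters, here is a minimal sketch of the pattern it changes. It is not the yunikorn-core code: the `queue` type, the single `int64` standing in for `resources.Resource`, and the helper names `fits`, `storeStale`, `incFixed`, `lostUpdate` and `concurrentFixed` are all hypothetical. It also assumes, as the patch suggests, that the fit check releases its lock before the caller takes the write lock. Under that assumption, a sum computed during the check can be stale by the time it is stored, so one caller can overwrite another's update.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical, simplified stand-in for the scheduler queue.
// A single int64 replaces the multi-dimensional resource type.
type queue struct {
	sync.RWMutex
	allocated int64
	max       int64
}

// fits takes the read lock, computes the prospective total, and releases
// the lock before returning. The returned total is only a snapshot.
func (q *queue) fits(alloc int64) (bool, int64) {
	q.RLock()
	defer q.RUnlock()
	next := q.allocated + alloc
	return next <= q.max, next
}

// storeStale is the second half of the pre-patch pattern: it stores a total
// that was computed earlier, outside the write lock.
func (q *queue) storeStale(next int64) {
	q.Lock()
	defer q.Unlock()
	q.allocated = next
}

// incFixed follows the patched pattern: the fit check is kept, but the new
// total is recomputed from the current value while holding the write lock.
func (q *queue) incFixed(alloc int64) error {
	if fit, _ := q.fits(alloc); !fit {
		return fmt.Errorf("allocation %d puts queue over maximum %d", alloc, q.max)
	}
	q.Lock()
	defer q.Unlock()
	q.allocated += alloc
	return nil
}

// lostUpdate replays one unlucky interleaving step by step: both callers
// run the check before either of them stores its result.
func lostUpdate() int64 {
	q := &queue{max: 100}
	_, a := q.fits(10) // caller A sees 0 and computes 10
	_, b := q.fits(20) // caller B also sees 0 and computes 20
	q.storeStale(a)
	q.storeStale(b) // overwrites A's update
	return q.allocated
}

// concurrentFixed runs n goroutines that each add 1 through incFixed and
// returns the final total.
func concurrentFixed(n int) int64 {
	q := &queue{max: int64(n)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = q.incFixed(1)
		}()
	}
	wg.Wait()
	return q.allocated
}

func main() {
	fmt.Println("stale write:", lostUpdate())         // prints "stale write: 20"
	fmt.Println("fixed path:", concurrentFixed(1000)) // prints "fixed path: 1000"
}
```

The stale-write replay ends at 20 rather than 30 because the second store discards the first. In the patched shape the limit check still runs before the write lock is taken, as in the quoted diff, so only the accounting is made consistent here, not the limit enforcement.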