Paul Santa Clara created YUNIKORN-2950:
------------------------------------------
             Summary: Race condition in 1.5.2 causes queue usage to be incorrectly calculated
                 Key: YUNIKORN-2950
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2950
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Paul Santa Clara

After upgrading to 1.5.2, we observed some of our larger clusters (with autoscaling via Karpenter) begin to misreport their queue usage. As an example, for the leaf queue root.tiers.2 we observed the following state:

{{allocated capacity(root.tiers.2) : pods 200 memory 2.187347412109375 Tib vcore 0.2 k ephemeral-storage 4.294967295999999TB}}

but when we summed the allocations in the full-state dump, we found:

{{root.tiers.2 : pods 0 memory 0.0 Tib vcore 0.0 k ephemeral-storage 0.0 TB}}

Similarly, we examined the number of running pods in K8s and found 0. The queue allocations were clearly off.

This was fixed by applying the following patch to remove the race condition. The patch computes newAllocated after sq.Lock() is taken, rather than reusing the value returned by allocatedResFits before the lock is acquired:

{code:java}
 func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error {
 	// check this queue: failure stops checks if the allocation is not part of a node addition
-	fit, newAllocated := sq.allocatedResFits(alloc)
+	fit, _ := sq.allocatedResFits(alloc)
 	if !nodeReported && !fit {
 		return fmt.Errorf("allocation (%v) puts queue '%s' over maximum allocation (%v), current usage (%v)",
 			alloc, sq.QueuePath, sq.maxResource, sq.allocatedResource)
@@ -1058,6 +1058,7 @@ func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bo
 	sq.Lock()
 	defer sq.Unlock()
 	// all OK update this queue
+	newAllocated := resources.Add(sq.allocatedResource, alloc)
 	sq.allocatedResource = newAllocated
 	sq.updateAllocatedResourceMetrics()
 	return nil
{code}

This appears to be [fixed|https://github.com/apache/yunikorn-core/blob/v1.6.0/pkg/scheduler/objects/queue.go#L1041] in 1.6.0, although I have not confirmed it.