Paul Santa Clara created YUNIKORN-2950:
------------------------------------------
Summary: Race condition in 1.5.2 causes queue usage to be incorrectly calculated
Key: YUNIKORN-2950
URL: https://issues.apache.org/jira/browse/YUNIKORN-2950
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Paul Santa Clara
We observed some of our larger clusters (with autoscaling via Karpenter) begin
to misreport their queue usage after upgrading to 1.5.2.
As an example, given a leaf queue root.tiers.2, we observed the following state:
{{allocated capacity(root.tiers.2) : pods 200 memory 2.187347412109375 Tib vcore 0.2 k ephemeral-storage 4.294967295999999 TB}}
but when we summed the allocations in the full-state dump, we found:
{{root.tiers.2 : pods 0 memory 0.0 Tib vcore 0.0 k ephemeral-storage 0.0 TB}}
Similarly, we examined the number of running pods in K8s, and we found 0. The
queue allocations were clearly off.
This was fixed by applying the following patch to remove the race condition:
{code:java}
 func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error {
 	// check this queue: failure stops checks if the allocation is not part of a node addition
-	fit, newAllocated := sq.allocatedResFits(alloc)
+	fit, _ := sq.allocatedResFits(alloc)
 	if !nodeReported && !fit {
 		return fmt.Errorf("allocation (%v) puts queue '%s' over maximum allocation (%v), current usage (%v)",
 			alloc, sq.QueuePath, sq.maxResource, sq.allocatedResource)
@@ -1058,6 +1058,7 @@ func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bo
 	sq.Lock()
 	defer sq.Unlock()
 	// all OK update this queue
+	newAllocated := resources.Add(sq.allocatedResource, alloc)
 	sq.allocatedResource = newAllocated
 	sq.updateAllocatedResourceMetrics()
 	return nil
{code}
This appears to be
[fixed|https://github.com/apache/yunikorn-core/blob/v1.6.0/pkg/scheduler/objects/queue.go#L1041]
in 1.6.0, although I have not confirmed it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)