[ https://issues.apache.org/jira/browse/YUNIKORN-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg resolved YUNIKORN-2950.
---------------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

Resolving as fixed in 1.6.0 via YUNIKORN-2790

> Race condition in 1.5.2 causes queue usage to be incorrectly calculated
> -----------------------------------------------------------------------
>
>                 Key: YUNIKORN-2950
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2950
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Paul Santa Clara
>            Priority: Major
>             Fix For: 1.6.0
>
>
> We observed that some of our larger clusters (with autoscaling via Karpenter)
> began to misreport their queue usage after upgrading to 1.5.2.
> As an example, for the leaf queue root.tiers.2, we observed the following
> state:
> {code:java}
> allocated capacity(root.tiers.2) : pods 200 memory 2.187347412109375 Tib vcore 0.2 k ephemeral-storage 4.294967295999999TB{code}
> but when we summed the allocations in the full-state dump, we found:
> {code:java}
> root.tiers.2 : pods 0 memory 0.0 Tib vcore 0.0 k ephemeral-storage 0.0 TB{code}
> We also checked the number of running pods in K8s and found 0.
> The queue allocations were clearly wrong.
> Applying the following patch removed the race condition and fixed the problem:
> {code:java}
>  func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error {
>         // check this queue: failure stops checks if the allocation is not part of a node addition
> -       fit, newAllocated := sq.allocatedResFits(alloc)
> +       fit, _ := sq.allocatedResFits(alloc)
>         if !nodeReported && !fit {
>                 return fmt.Errorf("allocation (%v) puts queue '%s' over maximum allocation (%v), current usage (%v)",
>                         alloc, sq.QueuePath, sq.maxResource, sq.allocatedResource)
> @@ -1058,6 +1058,7 @@ func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bo
>         sq.Lock()
>         defer sq.Unlock()
>         // all OK update this queue
> +       newAllocated := resources.Add(sq.allocatedResource, alloc)
>         sq.allocatedResource = newAllocated
>         sq.updateAllocatedResourceMetrics()
>         return nil {code}
>   
> This appears to be [fixed|#L1041] in 1.6.0, although I have not confirmed it.
> It looks to me as though the race condition was introduced
> [here|https://github.com/apache/yunikorn-core/pull/839/files#diff-27632d48eb925e150a33bc92370ceaa66c31048018d11ca7a53a0b50ab7250acL1033].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
