[ https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YUNIKORN-2543:
-----------------------------------
Description:
After merging YUNIKORN-2539, we already saw a potential issue with
{{rmproxy.RMProxy}} and {{cache.Context}}:
Goroutine 1:
{noformat}
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
rmproxy.(*RMProxy).GetResourceManagerCallback ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
rmproxy.(*RMProxy).UpdateNode ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
cache.(*Context).updateNodeResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
cache.(*Context).updateNodeOccupiedResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
cache.(*Context).updateForeignPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
cache.(*Context).UpdatePod ???
{noformat}
Goroutine 2:
{noformat}
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
cache.(*Context).ForgetPod ??? <<<<<
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
cache.(*Context).ForgetPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
cache.(*AsyncRMCallback).UpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
rmproxy.(*RMProxy).triggerUpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
rmproxy.(*RMProxy).handleRMEvents ???
{noformat}
Right now this seems to be safe because the {{RMProxy}} methods only take
{{RLock()}}, and read locks don't exclude each other. However, should any of
this change, the inconsistent lock ordering (Cache->RMProxy in goroutine 1,
RMProxy->Cache in goroutine 2) turns into a classic AB-BA deadlock, as
illustrated below.
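To make the risk concrete, here is a minimal, self-contained Go sketch (not YuniKorn code; {{cacheLock}}, {{rmProxyLock}} and the function names are placeholders) of the deadlock we would hit if both paths took the write lock:
{code:go}
// Minimal sketch (not YuniKorn code): two goroutines acquire the same two
// locks in opposite order, mirroring the traces above. With write locks,
// each goroutine ends up blocked on the lock the other one holds.
package main

import (
	"sync"
	"time"
)

var (
	cacheLock   sync.RWMutex // stand-in for the cache.Context lock
	rmProxyLock sync.RWMutex // stand-in for the RMProxy lock
)

// Mirrors goroutine 1: Context lock held, then the RMProxy lock is requested.
func updatePodPath() {
	cacheLock.Lock()
	defer cacheLock.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	rmProxyLock.Lock()                // RLock() today; Lock() deadlocks
	defer rmProxyLock.Unlock()
}

// Mirrors goroutine 2: RMProxy lock held, then the Context lock is requested.
func rmEventPath() {
	rmProxyLock.Lock()
	defer rmProxyLock.Unlock()
	time.Sleep(10 * time.Millisecond)
	cacheLock.Lock()
	defer cacheLock.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); updatePodPath() }()
	go func() { defer wg.Done(); rmEventPath() }()
	wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}
{code}
Note also that Go's {{sync.RWMutex}} blocks new readers once a writer is waiting, so even a single {{Lock()}} call elsewhere on the {{RMProxy}} mutex could chain the two read paths above into the same deadlock.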
We need to investigate why only {{RLock()}} is used and whether the lock is
needed at all. If the protected state is never modified, we can drop the mutex
completely; see the sketch below for one possible direction.
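If the investigation shows the callback is only set once at RM registration, one possible direction (a hypothetical sketch, not the actual {{RMProxy}} API; {{Callback}}, {{register}} and {{getCallback}} are made-up names) is to publish it via {{atomic.Pointer}} and read it lock-free, which removes {{RMProxy}} from the lock-ordering question entirely:
{code:go}
// Hypothetical sketch: the callback is stored once at registration and read
// lock-free afterwards, so RMProxy no longer participates in any lock order.
package main

import (
	"fmt"
	"sync/atomic"
)

// Callback is a placeholder for the real ResourceManagerCallback interface.
type Callback interface {
	UpdateAllocation(allocationKey string) error
}

// proxy is a stripped-down stand-in for RMProxy.
type proxy struct {
	cb atomic.Pointer[Callback] // requires Go 1.19+
}

// register publishes the callback; safe even with concurrent readers.
func (p *proxy) register(cb Callback) {
	p.cb.Store(&cb)
}

// getCallback returns the registered callback, or nil before registration.
func (p *proxy) getCallback() Callback {
	if v := p.cb.Load(); v != nil {
		return *v
	}
	return nil
}

type noopCallback struct{}

func (noopCallback) UpdateAllocation(string) error { return nil }

func main() {
	p := &proxy{}
	p.register(noopCallback{})
	fmt.Println(p.getCallback().UpdateAllocation("alloc-1")) // <nil>
}
{code}
If registration is additionally guaranteed to complete strictly before scheduling starts, even the atomic could go and the field becomes a plain assignment, which matches the "drop the mutex completely" option above.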
> Examine locking in RMProxy
> --------------------------
>
> Key: YUNIKORN-2543
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Priority: Critical
>