Peter Bacsko created YUNIKORN-2543:
--------------------------------------

             Summary: Examine locking in RMProxy
                 Key: YUNIKORN-2543
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - scheduler
            Reporter: Peter Bacsko


After merging YUNIKORN-2539, we already saw a potential issue with 
{{rmproxy.RMProxy}} and {{cache.Context}}:

{noformat}
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
 rmproxy.(*RMProxy).GetResourceManagerCallback ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
 rmproxy.(*RMProxy).UpdateNode ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 
cache.(*Context).updateNodeResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 
cache.(*Context).updateNodeOccupiedResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 
cache.(*Context).updateForeignPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 
cache.(*Context).UpdatePod ???

github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 
cache.(*Context).ForgetPod ??? <<<<<
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 
cache.(*Context).ForgetPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 
cache.(*AsyncRMCallback).UpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
 rmproxy.(*RMProxy).triggerUpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
 rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
 rmproxy.(*RMProxy).handleRMEvents ???
{noformat}

Right now this seems to be safe because the RMProxy methods only ever call 
{{RLock()}}, so no writer can block them. However, as soon as any code path 
takes the write lock on RMProxy, the inconsistent lock ordering 
(Cache->RMProxy and RMProxy->Cache) can deadlock.
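
To illustrate the ordering problem, here is a minimal sketch in Go. The types 
{{Context}} and {{Proxy}} and all of their methods are hypothetical stand-ins 
for {{cache.Context}} and {{rmproxy.RMProxy}}, not the real code; which 
function actually holds which lock in the real call paths is an assumption. 
It only shows that the two opposite acquisition orders complete as long as the 
proxy side is read-locked exclusively.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Context and Proxy are hypothetical stand-ins for cache.Context and
// rmproxy.RMProxy; only the lock ordering is modelled here.
type Context struct {
	lock  sync.RWMutex
	proxy *Proxy
}

type Proxy struct {
	lock sync.RWMutex
	ctx  *Context
}

// UpdatePod takes the Context lock, then the Proxy lock (Cache->RMProxy).
func (c *Context) UpdatePod() {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.proxy.GetCallback()
}

// GetCallback only reads, so it takes the shared (read) lock.
func (p *Proxy) GetCallback() {
	p.lock.RLock()
	defer p.lock.RUnlock()
}

// HandleEvent takes the Proxy lock, then the Context lock (RMProxy->Cache).
func (p *Proxy) HandleEvent() {
	p.lock.RLock()
	defer p.lock.RUnlock()
	p.ctx.ForgetPod()
}

// ForgetPod takes the Context write lock.
func (c *Context) ForgetPod() {
	c.lock.Lock()
	defer c.lock.Unlock()
}

// runBoth runs both call paths concurrently and reports whether they
// finished before the timeout.
func runBoth(rounds int, timeout time.Duration) bool {
	ctx := &Context{}
	proxy := &Proxy{ctx: ctx}
	ctx.proxy = proxy

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			ctx.UpdatePod()
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			proxy.HandleEvent()
		}
	}()

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demo wraps runBoth with fixed, arbitrary parameters.
func demo() bool {
	return runBoth(10000, 5*time.Second)
}

func main() {
	fmt.Println("completed:", demo())
}
```

If {{HandleEvent}} in this sketch took {{p.lock.Lock()}} instead, one goroutine 
could hold the Context lock while waiting for the Proxy lock and the other 
could hold the Proxy lock while waiting for the Context lock. A third 
goroutine merely waiting in {{Lock()}} on the proxy would have the same effect, 
because a pending writer blocks new readers of a {{sync.RWMutex}}.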

We need to investigate why these methods only ever take {{RLock()}} and 
whether the lock is needed at all. If the guarded state is never modified 
after initialization, then we can drop the mutex completely.
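
As a rough sketch of what dropping the mutex could look like, assume the 
guarded state is fully populated at construction time and never written 
afterwards. The names {{Callback}}, {{shimCallback}}, {{immutableProxy}} and 
{{newImmutableProxy}} below are made up for illustration and are not taken 
from the RMProxy code; whether RMProxy actually satisfies this assumption is 
exactly what needs to be examined.

```go
package main

import (
	"fmt"
	"sync"
)

// Callback is a hypothetical placeholder for whatever the proxy hands out.
type Callback interface {
	Name() string
}

type shimCallback struct{ id string }

func (s shimCallback) Name() string { return s.id }

// immutableProxy has no mutex: its map is filled once in the constructor
// and is never written again, so concurrent reads need no locking.
type immutableProxy struct {
	callbacks map[string]Callback
}

// newImmutableProxy copies the input so that later changes made by the
// caller cannot leak into the proxy.
func newImmutableProxy(cbs map[string]Callback) *immutableProxy {
	m := make(map[string]Callback, len(cbs))
	for id, cb := range cbs {
		m[id] = cb
	}
	return &immutableProxy{callbacks: m}
}

// Get is a plain read; it returns nil if the ID is unknown.
func (p *immutableProxy) Get(id string) Callback {
	return p.callbacks[id]
}

func main() {
	p := newImmutableProxy(map[string]Callback{"rm-1": shimCallback{"rm-1"}})

	// Many concurrent readers, no lock involved. The WaitGroup is only
	// used to wait for the goroutines to finish.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = p.Get("rm-1")
		}()
	}
	wg.Wait()
	fmt.Println(p.Get("rm-1").Name())
}
```

This only holds if there really is no writer. If the state can change after 
startup, for example when a resource manager registers later, some form of 
synchronization is still required.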


