Peter Bacsko created YUNIKORN-2543:
--------------------------------------
Summary: Examine locking in RMProxy
Key: YUNIKORN-2543
URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - scheduler
Reporter: Peter Bacsko
After merging YUNIKORN-2539, we already saw a potential issue with
{{rmproxy.RMProxy}} and {{cache.Context}}:
{noformat}
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
rmproxy.(*RMProxy).GetResourceManagerCallback ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
rmproxy.(*RMProxy).UpdateNode ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
cache.(*Context).updateNodeResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
cache.(*Context).updateNodeOccupiedResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
cache.(*Context).updateForeignPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
cache.(*Context).UpdatePod ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
cache.(*Context).ForgetPod ??? <<<<<
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
cache.(*Context).ForgetPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
cache.(*AsyncRMCallback).UpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
rmproxy.(*RMProxy).triggerUpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
rmproxy.(*RMProxy).handleRMEvents ???
{noformat}
Right now this seems to be safe because the RMProxy methods only ever call
{{RLock()}}. However, if any of them starts taking the write lock, we risk a
deadlock immediately, because the two locks are acquired in both orders
(Cache->RMProxy and RMProxy->Cache).
We need to investigate why RMProxy only calls {{RLock()}} and whether the mutex
is needed at all. If nothing it protects is ever modified, we can drop the
mutex completely.
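To illustrate the lock-ordering concern, here is a minimal sketch, not the actual YuniKorn code. The types {{proxy}} and {{shimContext}} and the helper {{crossAcquire}} are hypothetical stand-ins for {{rmproxy.RMProxy}} and {{cache.Context}}. It freezes two call paths halfway through an opposite-order acquisition and uses {{TryLock}}/{{TryRLock}} (Go 1.18+) to check whether each could proceed, assuming no other goroutine touches the locks.

```go
package main

import (
	"fmt"
	"sync"
)

// proxy and shimContext are hypothetical stand-ins for
// rmproxy.RMProxy and cache.Context; only the locks matter here.
type proxy struct {
	mu sync.RWMutex
}

type shimContext struct {
	mu sync.RWMutex
}

// crossAcquire models two call paths stopped halfway through their
// lock acquisition:
//   path A holds the proxy lock and now wants the context lock,
//   path B holds the context lock and now wants the proxy lock.
// It reports whether each path could take its second lock.
func crossAcquire(exclusive bool) (aProceeds, bProceeds bool) {
	var p proxy
	var c shimContext

	if exclusive {
		p.mu.Lock() // path A, first lock
		c.mu.Lock() // path B, first lock
		// Neither attempt can succeed while the other side holds
		// its write lock, so no extra unlock is needed.
		aProceeds = c.mu.TryLock()
		bProceeds = p.mu.TryLock()
		c.mu.Unlock()
		p.mu.Unlock()
		return aProceeds, bProceeds
	}

	p.mu.RLock() // path A, first lock
	c.mu.RLock() // path B, first lock
	if aProceeds = c.mu.TryRLock(); aProceeds {
		c.mu.RUnlock()
	}
	if bProceeds = p.mu.TryRLock(); bProceeds {
		p.mu.RUnlock()
	}
	c.mu.RUnlock()
	p.mu.RUnlock()
	return aProceeds, bProceeds
}

func main() {
	a, b := crossAcquire(false)
	fmt.Println("read locks only: ", a, b)
	a, b = crossAcquire(true)
	fmt.Println("write locks:     ", a, b)
}
```

With read locks only, both paths can take their second lock. With write locks, neither can, which with blocking {{Lock()}} calls would be a deadlock. This is why the opposite ordering is harmless only as long as no writer is involved.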