[
https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bacsko resolved YUNIKORN-2543.
------------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed
> Fix locking in RMProxy
> ----------------------
>
> Key: YUNIKORN-2543
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> After merging YUNIKORN-2539, we already saw a potential issue with
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Goroutine 1:
> {noformat}
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
> rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
> rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
> rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
> cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
> cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
> cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
> cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
> cache.(*Context).ForgetPod ??? <<<<<
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
> cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
> cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
> rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
> rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
> rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe because the {{RMProxy}} methods only ever call
> {{RLock()}}, and read locks do not block one another. However, if any of them
> starts taking the write lock, we risk a deadlock because the two goroutines
> acquire the locks in opposite order (Cache->RMProxy and RMProxy->Cache).
> We need to investigate why only {{RLock()}} is used and whether the lock is
> needed at all. If nothing is modified, then we can drop the mutex completely.