[ https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YUNIKORN-2543:
-----------------------------------
Description:
After merging YUNIKORN-2539, we already saw a potential issue with
{{rmproxy.RMProxy}} and {{cache.Context}}:
Goroutine 1:
{noformat}
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
rmproxy.(*RMProxy).GetResourceManagerCallback ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
rmproxy.(*RMProxy).UpdateNode ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
cache.(*Context).updateNodeResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
cache.(*Context).updateNodeOccupiedResources ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
cache.(*Context).updateForeignPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
cache.(*Context).UpdatePod ???
{noformat}
Goroutine 2:
{noformat}
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
cache.(*Context).ForgetPod ??? <<<<<
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
cache.(*Context).ForgetPod ???
github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
cache.(*AsyncRMCallback).UpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
rmproxy.(*RMProxy).triggerUpdateAllocation ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
rmproxy.(*RMProxy).handleRMEvents ???
{noformat}
Right now this seems to be safe because the {{RMProxy}} methods only take
{{RLock()}}, and read locks don't exclude each other. However, should any of
this change, the inconsistent lock ordering (Cache->RMProxy in goroutine 1,
RMProxy->Cache in goroutine 2) turns into a classic AB-BA deadlock, as
illustrated below.
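To make the risk concrete, here is a minimal, self-contained Go sketch (not YuniKorn code; {{cacheLock}}, {{rmProxyLock}} and the function names are placeholders) of the deadlock we would hit if both paths took the write lock:
{code:go}
// Minimal sketch (not YuniKorn code): two goroutines acquire the same two
// locks in opposite order, mirroring the traces above. With write locks,
// each goroutine ends up blocked on the lock the other one holds.
package main

import (
	"sync"
	"time"
)

var (
	cacheLock   sync.RWMutex // stand-in for the cache.Context lock
	rmProxyLock sync.RWMutex // stand-in for the RMProxy lock
)

// Mirrors goroutine 1: Context lock held, then the RMProxy lock is requested.
func updatePodPath() {
	cacheLock.Lock()
	defer cacheLock.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	rmProxyLock.Lock()                // RLock() today; Lock() deadlocks
	defer rmProxyLock.Unlock()
}

// Mirrors goroutine 2: RMProxy lock held, then the Context lock is requested.
func rmEventPath() {
	rmProxyLock.Lock()
	defer rmProxyLock.Unlock()
	time.Sleep(10 * time.Millisecond)
	cacheLock.Lock()
	defer cacheLock.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); updatePodPath() }()
	go func() { defer wg.Done(); rmEventPath() }()
	wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}
{code}
Note also that Go's {{sync.RWMutex}} blocks new readers once a writer is waiting, so even a single {{Lock()}} call elsewhere on the {{RMProxy}} mutex could chain the two read paths above into the same deadlock.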
We need to investigate why only {{RLock()}} is used and whether the lock is
needed at all. If the protected state is never modified, we can drop the mutex
completely; see the sketch below for one possible direction.
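If the investigation shows the callback is only set once at RM registration, one possible direction (a hypothetical sketch, not the actual {{RMProxy}} API; {{Callback}}, {{register}} and {{getCallback}} are made-up names) is to publish it via {{atomic.Pointer}} and read it lock-free, which removes {{RMProxy}} from the lock-ordering question entirely:
{code:go}
// Hypothetical sketch: the callback is stored once at registration and read
// lock-free afterwards, so RMProxy no longer participates in any lock order.
package main

import (
	"fmt"
	"sync/atomic"
)

// Callback is a placeholder for the real ResourceManagerCallback interface.
type Callback interface {
	UpdateAllocation(allocationKey string) error
}

// proxy is a stripped-down stand-in for RMProxy.
type proxy struct {
	cb atomic.Pointer[Callback] // requires Go 1.19+
}

// register publishes the callback; safe even with concurrent readers.
func (p *proxy) register(cb Callback) {
	p.cb.Store(&cb)
}

// getCallback returns the registered callback, or nil before registration.
func (p *proxy) getCallback() Callback {
	if v := p.cb.Load(); v != nil {
		return *v
	}
	return nil
}

type noopCallback struct{}

func (noopCallback) UpdateAllocation(string) error { return nil }

func main() {
	p := &proxy{}
	p.register(noopCallback{})
	fmt.Println(p.getCallback().UpdateAllocation("alloc-1")) // <nil>
}
{code}
If registration is additionally guaranteed to complete strictly before scheduling starts, even the atomic could go and the field becomes a plain assignment, which matches the "drop the mutex completely" option above.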
> Examine locking in RMProxy
> --------------------------
>
> Key: YUNIKORN-2543
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Priority: Critical
>