[ 
https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834423#comment-17834423
 ] 

Peter Bacsko edited comment on YUNIKORN-2543 at 4/5/24 7:57 PM:
----------------------------------------------------------------

cc [~wilfreds] [~chia7712] [~chenyulin0719] what's the point of having read 
lock only? That means there's nothing to protect from concurrent modification, 
that is, it can be removed. Am I missing something?

EDIT: ok, actually it's needed. The instance already exists when 
\{{RegisterResourceManager()}} is called. First I missed it and I thought it 
was a function not tied to an instance. IMO {{RegisterResourceManager()}} 
should act as a factory which creates immutable, lockless {{RMProxy}} 
instances. Therefore, {{RegisterResourceManager()}} should return a new object 
to the caller. 

In short:
# New type {{RMProxyFactory}} type with method {{RegisterResourceManager()}}. 
This would return an {{*RMProxy}} instance.
# {{RMproxy}} with the current methods, but without any locking


was (Author: pbacsko):
cc [~wilfreds] [~chia7712] [~chenyulin0719] what's the point of having read 
lock only? That means there's nothing to protect from concurrent modification, 
that is, it can be removed. Am I missing something?

EDIT: ok, actually it's needed. The instance already exists when 
\{{RegisterResourceManager()}} is called. First I missed it and I thought it 
was a function not tied to an instance. IMO {{RegisterResourceManager()}} 
should act as a factory which creates immutable, lockless {{RMProxy}} 
instances. Therefore, {{RegisterResourceManager()}} should return a new object 
to the caller. 

In short:
# New type {{RMProxyFactory}} type with method {{RegisterResourceManager()}}
# {{RMproxy}} with the current methods, but without any locking

> Examine locking in RMProxy
> --------------------------
>
>                 Key: YUNIKORN-2543
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Priority: Critical
>
> After merging YUNIKORN-2539, we already saw a potential issue with 
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Gourutine 1:
> {noformat}
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
>  rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
>  rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
>  rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 
> cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 
> cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 
> cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 
> cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 
> cache.(*Context).ForgetPod ??? <<<<<
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 
> cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 
> cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
>  rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
>  rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
>  rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe because we only call {{RLock()}} in the 
> {{RMProxy}} methods. However, should any of this change, we're in trouble due 
> to lock ordering (Cache->RMProxy and RMProxy->Cache).
> We need to investigate why we use only {{RLock()}} and whether it's needed at 
> all. If nothing is modified, then we can drop the mutex completely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to