[ https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834538#comment-17834538 ]

Peter Bacsko commented on YUNIKORN-2543:
----------------------------------------

I created a POC:

* core: 
https://github.com/pbacsko/incubator-yunikorn-core/commit/1a07bc3f4dd7983b26ebb60b94b9d70bcaf5693f
* SI: 
https://github.com/pbacsko/yunikorn-scheduler-interface/commit/a5ac4e8cdf5627b8bf3a47a61e6e2eb4373ca5c6
* k8shim: 
https://github.com/pbacsko/incubator-yunikorn-k8shim/commit/cf73f8979c2d2a3c288c75347f445f254dc1923a

The shim part is not ready yet, but the core compiles and all tests pass. The 
main idea is a new Factory API that creates a separate proxy instance for each 
client (each distinct rmID). There are obviously lots of TODOs. I like this 
approach because the RMProxy returned by the factory is completely lockless.

> Examine locking in RMProxy
> --------------------------
>
>                 Key: YUNIKORN-2543
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Critical
>
> After merging YUNIKORN-2539, we already saw a potential issue with 
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Goroutine 1:
> {noformat}
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
>  rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
>  rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
>  rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
>  cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
>  cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
>  cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
>  cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
>  cache.(*Context).ForgetPod ??? <<<<<
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
>  cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
>  cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
>  rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
>  rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
>  rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe, because the {{RMProxy}} methods only ever 
> call {{RLock()}}. However, should any of them start taking the write lock, 
> we're in trouble due to inconsistent lock ordering (Cache->RMProxy in one 
> path, RMProxy->Cache in the other).
> We need to investigate why only {{RLock()}} is used and whether the lock is 
> needed at all. If nothing is ever modified, we can drop the mutex completely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
