[ 
https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834431#comment-17834431
 ] 

Chia-Ping Tsai commented on YUNIKORN-2543:
------------------------------------------

Here is my two cents: we can narrow the scope of rmp.RLock() by following steps

1. remove rmp.RLock from processRMReleaseAllocationEvent 
([https://github.com/apache/yunikorn-core/blob/d687411221c5d27c5071c58dc4f4fc603dfaea65/pkg/rmproxy/rmproxy.go#L145])
{code:go}
func (rmp *RMProxy) processRMReleaseAllocationEvent(event 
*rmevent.RMReleaseAllocationEvent) {
        allocationsCount := len(event.ReleasedAllocations)
        if allocationsCount != 0 {
                // we don't need to hold the read lock right now. Instead, we 
get read lock in triggerUpdateAllocation
                // rmp.RLock()
                // defer rmp.RUnlock()
                response := &si.AllocationResponse{
                        Released: event.ReleasedAllocations,
                }
                rmp.triggerUpdateAllocation(event.RmID, response)
                
metrics.GetSchedulerMetrics().AddReleasedContainers(len(event.ReleasedAllocations))
        }

        // Done, notify channel
        event.Channel <- &rmevent.Result{
                Succeeded: true,
                Reason:    "no. of allocations: " + 
strconv.Itoa(allocationsCount),
        }
}
{code}
2. hold the read lock only when we wan to get ResourceManagerCallback 
([https://github.com/apache/yunikorn-core/blob/d687411221c5d27c5071c58dc4f4fc603dfaea65/pkg/rmproxy/rmproxy.go#L161])
{code:go}
func (rmp *RMProxy) triggerUpdateAllocation(rmID string, response 
*si.AllocationResponse) {
        // replace rmp.rmIDToCallback[rmID] by 
rmp.GetResourceManagerCallback(rmID)
        if callback := rmp.rmIDToCallback[rmID]; callback != nil {
                if err := callback.UpdateAllocation(response); err != nil {
                        // this method does not need lock
                        rmp.handleUpdateResponseError(rmID, err)
                }
        } else {
                log.Log(log.RMProxy).DPanic("RM is not registered",
                        zap.String("rmID", rmID))
        }
}
{code}
In short, we avoid holding rmp .RLock when calling `callback.UpdateAllocation`.

> Examine locking in RMProxy
> --------------------------
>
>                 Key: YUNIKORN-2543
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Priority: Critical
>
> After merging YUNIKORN-2539, we already saw a potential issue with 
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Gourutine 1:
> {noformat}
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
>  rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
>  rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
>  rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 
> cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 
> cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 
> cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 
> cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 
> cache.(*Context).ForgetPod ??? <<<<<
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 
> cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 
> cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
>  rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
>  rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
>  rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe because we only call {{RLock()}} in the 
> {{RMProxy}} methods. However, should any of this change, we're in trouble due 
> to lock ordering (Cache->RMProxy and RMProxy->Cache).
> We need to investigate why we use only {{RLock()}} and whether it's needed at 
> all. If nothing is modified, then we can drop the mutex completely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to