[
https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834431#comment-17834431
]
Chia-Ping Tsai commented on YUNIKORN-2543:
------------------------------------------
Here are my two cents: we can narrow the scope of rmp.RLock() with the following steps
1. remove rmp.RLock from processRMReleaseAllocationEvent
([https://github.com/apache/yunikorn-core/blob/d687411221c5d27c5071c58dc4f4fc603dfaea65/pkg/rmproxy/rmproxy.go#L145])
{code:go}
func (rmp *RMProxy) processRMReleaseAllocationEvent(event *rmevent.RMReleaseAllocationEvent) {
	allocationsCount := len(event.ReleasedAllocations)
	if allocationsCount != 0 {
		// we don't need to hold the read lock here; instead, we take the read lock in triggerUpdateAllocation
		// rmp.RLock()
		// defer rmp.RUnlock()
		response := &si.AllocationResponse{
			Released: event.ReleasedAllocations,
		}
		rmp.triggerUpdateAllocation(event.RmID, response)
		metrics.GetSchedulerMetrics().AddReleasedContainers(len(event.ReleasedAllocations))
	}
	// Done, notify channel
	event.Channel <- &rmevent.Result{
		Succeeded: true,
		Reason:    "no. of allocations: " + strconv.Itoa(allocationsCount),
	}
}
{code}
2. hold the read lock only when we want to get the ResourceManagerCallback
([https://github.com/apache/yunikorn-core/blob/d687411221c5d27c5071c58dc4f4fc603dfaea65/pkg/rmproxy/rmproxy.go#L161])
{code:go}
func (rmp *RMProxy) triggerUpdateAllocation(rmID string, response *si.AllocationResponse) {
	// replace rmp.rmIDToCallback[rmID] by rmp.GetResourceManagerCallback(rmID)
	if callback := rmp.rmIDToCallback[rmID]; callback != nil {
		if err := callback.UpdateAllocation(response); err != nil {
			// this method does not need the lock
			rmp.handleUpdateResponseError(rmID, err)
		}
	} else {
		log.Log(log.RMProxy).DPanic("RM is not registered",
			zap.String("rmID", rmID))
	}
}
{code}
In short, we avoid holding rmp.RLock() while calling {{callback.UpdateAllocation}}.
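To make the narrowed scope concrete, here is a self-contained sketch (simplified stand-in types and signatures, not the real yunikorn-core code) showing that once the read lock covers only the map lookup, a callback can even re-enter RMProxy and take the write lock without self-deadlocking:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins so the sketch compiles on its own; the real types
// in yunikorn-core and the scheduler-interface differ in detail.
type ResourceManagerCallback interface {
	UpdateAllocation(response string) error
}

type RMProxy struct {
	sync.RWMutex
	rmIDToCallback map[string]ResourceManagerCallback
}

// The read lock covers only the map lookup; it is released before the
// caller ever invokes the returned callback.
func (rmp *RMProxy) GetResourceManagerCallback(rmID string) ResourceManagerCallback {
	rmp.RLock()
	defer rmp.RUnlock()
	return rmp.rmIDToCallback[rmID]
}

func (rmp *RMProxy) triggerUpdateAllocation(rmID string, response string) {
	// no lock held here: the callback is free to re-enter RMProxy
	if callback := rmp.GetResourceManagerCallback(rmID); callback != nil {
		if err := callback.UpdateAllocation(response); err != nil {
			fmt.Println("update failed:", err)
		}
	}
}

// reentrantCallback re-enters RMProxy with the write lock; under the old
// scheme (RLock held across the callback) this would deadlock, because
// Go's sync.RWMutex is not reentrant.
type reentrantCallback struct {
	rmp    *RMProxy
	called bool
}

func (c *reentrantCallback) UpdateAllocation(response string) error {
	c.rmp.Lock() // would deadlock if the caller still held rmp.RLock()
	defer c.rmp.Unlock()
	c.called = true
	return nil
}

func main() {
	rmp := &RMProxy{rmIDToCallback: map[string]ResourceManagerCallback{}}
	cb := &reentrantCallback{rmp: rmp}
	rmp.rmIDToCallback["rm-1"] = cb
	rmp.triggerUpdateAllocation("rm-1", "response")
	fmt.Println("callback invoked:", cb.called) // prints "callback invoked: true"
}
```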
> Examine locking in RMProxy
> --------------------------
>
> Key: YUNIKORN-2543
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Priority: Critical
>
> After merging YUNIKORN-2539, we already saw a potential issue with
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Goroutine 1:
> {noformat}
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:307
> rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:306
> rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:359
> rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603
> cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484
> cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392
> cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286
> cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847
> cache.(*Context).ForgetPod ??? <<<<<
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846
> cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104
> cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:162
> rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:150
> rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/[email protected]/pkg/rmproxy/rmproxy.go:234
> rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe because we only call {{RLock()}} in the
> {{RMProxy}} methods. However, should any of this change, we're in trouble due
> to lock ordering (Cache->RMProxy and RMProxy->Cache).
> We need to investigate why we use only {{RLock()}} and whether it's needed at
> all. If nothing is modified, then we can drop the mutex completely.
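For illustration, the opposite lock orders in the two quoted goroutines reduce to a tiny sketch (names invented, not the real types; the calls here run sequentially and complete, but the same two paths running concurrently with blocking acquisitions are the classic ABBA deadlock):

```go
package main

import (
	"fmt"
	"sync"
)

// Two locks standing in for the shim's cache.Context and the core's
// rmproxy.RMProxy.
var cacheLock, proxyLock sync.Mutex

// Mirrors goroutine 1: Cache -> RMProxy.
func updateNodeResources() string {
	cacheLock.Lock()
	defer cacheLock.Unlock()
	proxyLock.Lock() // second lock acquired while the first is held
	defer proxyLock.Unlock()
	return "Cache -> RMProxy"
}

// Mirrors goroutine 2: RMProxy -> Cache. If each goroutine holds its
// first lock while waiting for the other's (possible once RMProxy uses
// a blocking write lock), neither can ever proceed.
func updateAllocation() string {
	proxyLock.Lock()
	defer proxyLock.Unlock()
	cacheLock.Lock()
	defer cacheLock.Unlock()
	return "RMProxy -> Cache"
}

func main() {
	fmt.Println(updateNodeResources())
	fmt.Println(updateAllocation())
}
```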