I’m all for fixing these… and in general, where lockless algorithms can be implemented cleanly, I’m in favor of them over requiring locks, so for RMProxy I’m +1 on that. The extra memory for an RMProxy instance is irrelevant.
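For what it’s worth, the shape I have in mind for the factory approach Peter describes below is the usual Go pattern of "wire everything up in the factory, publish, never mutate afterwards". A rough sketch with made-up names (not Peter’s actual POC, just the general idea):

package main

import "fmt"

// Toy stand-in for RMProxy; the fields and names are made up and far
// simpler than the real struct. The point is the construction pattern,
// not the API.
type proxy struct {
    rmID     string
    callback func(event string) // stand-in for the resource manager callback
}

// newProxy plays the role of the factory: the value it returns is fully
// wired up before anyone else can see it and is never mutated afterwards,
// so concurrent readers need no mutex at all.
func newProxy(rmID string, cb func(event string)) *proxy {
    return &proxy{rmID: rmID, callback: cb}
}

// No RLock needed: callback is immutable after construction.
func (p *proxy) getCallback() func(event string) {
    return p.callback
}

func main() {
    p := newProxy("rm-1", func(event string) { fmt.Println("callback:", event) })
    p.getCallback()("node update")
}

Since nothing is written after the factory returns, the read path needs no RLock at all, which also takes RMProxy out of the lock-ordering picture in #1 below.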
The recursive locking case is a real problem, and I’m surprised that hasn’t bitten us harder. It can cause all sorts of issues.
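To spell out why (this covers #2 and #3 in Peter’s mail below): with Go’s sync.RWMutex, a blocked Lock() call stops new readers from acquiring the lock. So if a goroutine already holds a read lock, tries to take it a second time, and a writer squeezes in between, the second RLock() waits for the writer while the writer waits for the first read lock to be released, and nobody makes progress. The fix Peter calls straightforward usually looks like this: lock once at the exported entry point and keep the helpers lock-free. A toy sketch, with made-up names rather than the real Queue code:

package main

import (
    "fmt"
    "sync"
)

// Toy queue, not the real objects.Queue. The convention that avoids the
// recursive-RLock trap: the exported method takes the read lock exactly
// once, and everything below it calls unlocked helpers that assume the
// lock is already held.
type queue struct {
    sync.RWMutex
    maxResource int
    allocated   int
}

// HeadRoom is the only place on this call path that touches the lock.
func (q *queue) HeadRoom() int {
    q.RLock()
    defer q.RUnlock()
    return q.headRoomInternal()
}

// headRoomInternal must only be called with q's lock already held. Because
// it never calls RLock() itself, a writer queuing up on Lock() in between
// can no longer wedge this goroutine.
func (q *queue) headRoomInternal() int {
    return q.maxResource - q.allocated
}

func main() {
    q := &queue{maxResource: 10, allocated: 3}
    fmt.Println("head room:", q.HeadRoom())
}

If I read Peter’s annotations right, that is roughly the split the existing getHeadRoom/internalHeadRoom names already suggest; the problem is only that the inner call takes the lock again.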
Craig

> On Apr 6, 2024, at 2:19 PM, Peter Bacsko <[email protected]> wrote:
>
> Hi all,
>
> after YUNIKORN-2539 got merged, we identified some potential deadlocks.
> These are false positives now, but a small change can cause Yunikorn to
> fall apart, so the term "potential deadlock" describes them properly.
>
> Thoughts and opinions are welcome. IMO we should handle these with
> priority to ensure stability.
>
> *1. Locking order: Cache→RMProxy vs RMProxy→Cache*
>
> We grab the locks of these types in different order (read bottom to top):
>
> pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
> pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
> cache.(*Context).updateNodeResources ???
> cache.(*Context).updateNodeOccupiedResources ???
> cache.(*Context).updateForeignPod ???
> cache.(*Context).UpdatePod ???
>
> cache.(*Context).ForgetPod ??? <<<<<
> cache.(*Context).ForgetPod ???
> cache.(*AsyncRMCallback).UpdateAllocation ???
> rmproxy.(*RMProxy).triggerUpdateAllocation ???
> rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> rmproxy.(*RMProxy).handleRMEvents ???
>
> I already created a JIRA for this one:
> https://issues.apache.org/jira/browse/YUNIKORN-2543.
>
> Luckily we only call RLock() inside RMProxy while processing allocations,
> but if this changes, a deadlock is guaranteed. I made a POC which creates
> a lockless RMProxy instance after calling a factory method, see my comment
> in the JIRA. The scheduler core compiles & all tests pass.
>
> *2. Lock order: multiple lock calls on the same goroutine which interfere
> with another goroutine*
>
> We shouldn't call RLock() multiple times or after Lock() on the same
> mutex instance. This is dangerous as described here:
> https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
>
> objects.(*Queue).GetCurrentPriority ??? <<<<< ← Queue RLock
> objects.(*Queue).GetCurrentPriority ???
> objects.(*Queue).addChildQueue ??? ← Queue Lock
> objects.NewConfiguredQueue ???
> scheduler.(*PartitionContext).addQueue ???
> scheduler.(*PartitionContext).addQueue ???
> scheduler.(*PartitionContext).initialPartitionFromConfig ???
> scheduler.newPartitionContext ???
> scheduler.(*ClusterContext).updateSchedulerConfig ???
> scheduler.(*ClusterContext).processRMRegistrationEvent ???
> scheduler.(*Scheduler).handleRMEvent ???
>
> objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
> objects.(*Queue).internalHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> scheduler.(*PartitionContext).GetPartitionQueues ???
> webservice.getPartitionQueues ???
> http.HandlerFunc.ServeHTTP ???
>
> No JIRA yet. The fix looks straightforward.
>
> *3. Recursive locking*
>
> It's basically the same as #2, but it's not about an interaction between
> two goroutines. It's just a single goroutine that grabs RLock() multiple
> times. This can be dangerous in itself.
>
> objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
> objects.(*Queue).internalHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> scheduler.(*PartitionContext).GetPartitionQueues ???
> webservice.getPartitionQueues ???
> http.HandlerFunc.ServeHTTP ???
>
> No JIRA yet. The fix looks straightforward.
>
> I can see more from the tool, but these are the most obvious ones. Not
> every "potential deadlock" is entirely clear to me at the moment. See the
> JIRA attachment:
> https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt
>
> These add a lot of noise to the log files that were uploaded to
> YUNIKORN-2521, which can slow down our efforts to track down the real
> problem.
>
> I suggest fixing all of these.
>
> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
