Hi all,

after YUNIKORN-2539 got merged, we identified some potential deadlocks. These are false positives right now, but a small change could cause YuniKorn to fall apart, so the term "potential deadlock" describes them properly.
Thoughts and opinions are welcome. IMO we should handle these with priority to ensure stability.

*1. Locking order: Cache→RMProxy vs RMProxy→Cache*

We grab the locks of these two types in different orders (read each trace bottom to top):

pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
cache.(*Context).updateNodeResources ???
cache.(*Context).updateNodeOccupiedResources ???
cache.(*Context).updateForeignPod ???
cache.(*Context).UpdatePod ???

cache.(*Context).ForgetPod ??? <<<<<
cache.(*Context).ForgetPod ???
cache.(*AsyncRMCallback).UpdateAllocation ???
rmproxy.(*RMProxy).triggerUpdateAllocation ???
rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
rmproxy.(*RMProxy).handleRMEvents ???

I already created a JIRA for this one: https://issues.apache.org/jira/browse/YUNIKORN-2543. Luckily we only call RLock() inside RMProxy while processing allocations, but if this ever changes, a deadlock is guaranteed (see the first sketch below). I made a POC which creates a lockless RMProxy instance after calling a factory method; see my comment in the JIRA. The scheduler core compiles & all tests pass.

*2. Lock order: multiple lock calls on the same goroutine which interfere with another goroutine*

We shouldn't call RLock() multiple times on the same mutex instance from a single goroutine, nor call RLock() after Lock(), because a Lock() from another goroutine can arrive in between. This is dangerous, as described here: https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine (see the second sketch below).

objects.(*Queue).GetCurrentPriority ??? <<<<< ← Queue RLock
objects.(*Queue).GetCurrentPriority ???
objects.(*Queue).addChildQueue ??? ← Queue Lock
objects.NewConfiguredQueue ???
scheduler.(*PartitionContext).addQueue ???
scheduler.(*PartitionContext).addQueue ???
scheduler.(*PartitionContext).initialPartitionFromConfig ???
scheduler.newPartitionContext ???
scheduler.(*ClusterContext).updateSchedulerConfig ???
scheduler.(*ClusterContext).processRMRegistrationEvent ???
scheduler.(*Scheduler).handleRMEvent ???

objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
objects.(*Queue).internalHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
objects.(*Queue).GetPartitionQueueDAOInfo ???
objects.(*Queue).GetPartitionQueueDAOInfo ???
scheduler.(*PartitionContext).GetPartitionQueues ???
webservice.getPartitionQueues ???
http.HandlerFunc.ServeHTTP ???

No JIRA yet. The fix looks straightforward.

*3. Recursive locking*

This is basically the same as #2, but it's not about an interaction between two goroutines: a single goroutine grabs RLock() on the same mutex multiple times, which can be dangerous in itself.

objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
objects.(*Queue).internalHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
objects.(*Queue).GetPartitionQueueDAOInfo ???
objects.(*Queue).GetPartitionQueueDAOInfo ???
scheduler.(*PartitionContext).GetPartitionQueues ???
webservice.getPartitionQueues ???
http.HandlerFunc.ServeHTTP ???

No JIRA yet. The fix looks straightforward.

I can see more reports from the tool, but these are the most obvious ones. Not every "potential deadlock" is entirely clear to me at the moment; see the JIRA attachment https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt. These reports also add a lot of noise to the log files that were uploaded to YUNIKORN-2521.
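To make #1 a bit more concrete, here is a minimal sketch of the inverted lock order. The types and function names below are hypothetical stand-ins (only loosely modelled on the call paths in the traces), not the actual core/shim code:

package main

import "sync"

// Hypothetical stand-ins for the two components, each guarded by its own
// RWMutex; NOT the real YuniKorn types, just an illustration of the ordering.
type schedulerCache struct{ lock sync.RWMutex }
type rmProxy struct{ lock sync.RWMutex }

// Path A, roughly like UpdatePod -> ... -> RMProxy.UpdateNode:
// Cache lock held first, RMProxy lock requested second.
func updatePod(c *schedulerCache, p *rmProxy) {
    c.lock.Lock()
    defer c.lock.Unlock()

    p.lock.RLock() // today this is "only" an RLock()
    defer p.lock.RUnlock()
    // ... tell the RM about the updated node resources ...
}

// Path B, roughly like handleRMEvents -> ... -> Context.ForgetPod:
// RMProxy lock held first, Cache lock requested second.
func processRelease(p *rmProxy, c *schedulerCache) {
    p.lock.RLock() // also "only" an RLock() at the moment
    defer p.lock.RUnlock()

    c.lock.Lock()
    defer c.lock.Unlock()
    // ... forget the pod ...
}

func main() {
    c, p := &schedulerCache{}, &rmProxy{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); updatePod(c, p) }()      // Cache -> RMProxy
    go func() { defer wg.Done(); processRelease(p, c) }() // RMProxy -> Cache
    wg.Wait() // terminates today, because both RMProxy acquisitions are read locks
}

As long as both RMProxy acquisitions stay RLock(), the two goroutines never exclude each other and this terminates. If either of them is ever changed to Lock(), or another goroutine queues a Lock() on the RMProxy mutex at the wrong moment, path A waits for the RMProxy lock while holding the Cache lock, path B waits for the Cache lock while holding the RMProxy lock, and neither makes progress.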
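And a second minimal sketch for the pattern behind #2 and #3 (again, hypothetical names, not the real objects.Queue):

package main

import "sync"

// Hypothetical stand-in for a queue guarded by an RWMutex;
// NOT the real objects.Queue, just an illustration of the pattern.
type queue struct {
    lock     sync.RWMutex
    children []*queue
}

// Roughly like GetPartitionQueueDAOInfo -> getHeadRoom -> internalHeadRoom:
// the same goroutine takes the read lock on the same mutex twice.
func (q *queue) daoInfo() {
    q.lock.RLock() // RLock #1
    defer q.lock.RUnlock()
    q.headRoom()
}

func (q *queue) headRoom() {
    q.lock.RLock() // RLock #2 (the recursive read lock from #3)
    defer q.lock.RUnlock()
    // ... calculate headroom ...
}

// Roughly like addChildQueue: another goroutine takes the write lock.
func (q *queue) addChild(child *queue) {
    q.lock.Lock()
    defer q.lock.Unlock()
    q.children = append(q.children, child)
}

func main() {
    root := &queue{}
    root.daoInfo()          // safe only while no writer is waiting on the mutex
    root.addChild(&queue{}) // if this Lock() ran concurrently and landed between
                            // RLock #1 and RLock #2, all of these calls would hang
}

sync.RWMutex stops admitting new readers once a writer is blocked on Lock(). With the interleaving RLock #1 (REST handler) -> Lock (queue config update) -> RLock #2 (same REST handler), the writer waits for RLock #1 to be released, RLock #2 queues behind the pending writer, and RLock #1 is never released because its goroutine is stuck in RLock #2. This is also why the purely recursive case in #3 is dangerous on its own.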
The noise in those logs can also slow down our efforts to track down the real problem. I suggest fixing all of these.

Peter
