Hi all,

after YUNIKORN-2539 got merged, we identified some potential deadlocks.
They are false positives for now, but a small change could cause Yunikorn to
fall apart, so the term "potential deadlock" describes them well.

Thoughts and opinions are welcome. IMO we should handle these with priority
to ensure stability.

*1. Locking order: Cache→RMProxy vs RMProxy→Cache*

We grab the locks of these two types in a different order (read the traces
bottom to top):

pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback
??? <<<<<
pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
cache.(*Context).updateNodeResources ???
cache.(*Context).updateNodeOccupiedResources ???
cache.(*Context).updateForeignPod ???
cache.(*Context).UpdatePod ???

cache.(*Context).ForgetPod ??? <<<<<
cache.(*Context).ForgetPod ???
cache.(*AsyncRMCallback).UpdateAllocation ???
rmproxy.(*RMProxy).triggerUpdateAllocation ???
rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
rmproxy.(*RMProxy).handleRMEvents ???

I already created a JIRA for this one:
https://issues.apache.org/jira/browse/YUNIKORN-2543.

Luckily we only call RLock() inside RMProxy while processing allocations,
but if this changes, a deadlock is guaranteed. I made a POC that creates a
lockless RMProxy instance after calling a factory method; see my comment in
the JIRA. The scheduler core compiles & all tests pass.
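
To illustrate why the inverted order is dangerous, here is a minimal
standalone sketch (not YuniKorn code, the mutexes are just placeholders for
the Context and RMProxy locks) of what happens once both paths take the
write lock, which is exactly the "if this changes" case above:

package main

import (
    "sync"
    "time"
)

var (
    cacheLock   sync.RWMutex // placeholder for the cache.Context lock
    rmProxyLock sync.RWMutex // placeholder for the rmproxy.RMProxy lock
)

func main() {
    go func() {
        // Path 1: Context.UpdatePod -> ... -> RMProxy.UpdateNode
        cacheLock.Lock()
        defer cacheLock.Unlock()
        time.Sleep(100 * time.Millisecond) // widen the race window
        rmProxyLock.Lock()                 // blocks: the other goroutine holds it
        defer rmProxyLock.Unlock()
    }()

    go func() {
        // Path 2: RMProxy.handleRMEvents -> ... -> Context.ForgetPod
        rmProxyLock.Lock()
        defer rmProxyLock.Unlock()
        time.Sleep(100 * time.Millisecond)
        cacheLock.Lock() // blocks: the other goroutine holds it
        defer cacheLock.Unlock()
    }()

    select {} // main blocks as well; nothing can make progress anymore
}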

*2. Lock order: multiple lock calls on the same goroutine interfering
with another goroutine*

We shouldn't call RLock() multiple times or after Lock() on the same mutex
instance. This is dangerous, as described here:
https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
A minimal sketch of that scenario follows the traces below.

objects.(*Queue).GetCurrentPriority ??? <<<<<   ← Queue RLock
objects.(*Queue).GetCurrentPriority ???
objects.(*Queue).addChildQueue ???    ← Queue Lock
objects.NewConfiguredQueue ???
scheduler.(*PartitionContext).addQueue ???
scheduler.(*PartitionContext).addQueue ???
scheduler.(*PartitionContext).initialPartitionFromConfig ???
scheduler.newPartitionContext ???
scheduler.(*ClusterContext).updateSchedulerConfig ???
scheduler.(*ClusterContext).processRMRegistrationEvent ???
scheduler.(*Scheduler).handleRMEvent ???

objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
objects.(*Queue).internalHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
objects.(*Queue).GetPartitionQueueDAOInfo ???
objects.(*Queue).GetPartitionQueueDAOInfo ???
scheduler.(*PartitionContext).GetPartitionQueues ???
webservice.getPartitionQueues ???
http.HandlerFunc.ServeHTTP ???
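
Here is the promised sketch of the scenario from the go-deadlock README,
mapped loosely onto the annotations in the traces above (placeholder names,
not the real Queue code). The second RLock() queues behind the waiting
Lock(), which itself waits for the first RLock(), so nobody makes progress:

package main

import (
    "sync"
    "time"
)

var queueLock sync.RWMutex // placeholder for the objects.Queue lock

func main() {
    go func() {
        queueLock.RLock()                  // RLock #1 (GetPartitionQueueDAOInfo in the trace)
        time.Sleep(100 * time.Millisecond) // give the writer time to start waiting
        queueLock.RLock()                  // RLock #2 (internalHeadRoom): queues behind the waiting writer
        queueLock.RUnlock()
        queueLock.RUnlock()
    }()

    go func() {
        time.Sleep(50 * time.Millisecond)
        queueLock.Lock() // writer (addChildQueue): waits for RLock #1, which is never released
        queueLock.Unlock()
    }()

    select {} // nobody makes progress anymore
}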

No JIRA yet. The fix looks straightforward.

*3. Recursive locking*

It's basically the same as #2, but it's not about an interaction between
two goroutines. It's a single goroutine that grabs RLock() multiple times,
which can be dangerous in itself.

objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
objects.(*Queue).internalHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).getHeadRoom ???
objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
objects.(*Queue).GetPartitionQueueDAOInfo ???
objects.(*Queue).GetPartitionQueueDAOInfo ???
scheduler.(*PartitionContext).GetPartitionQueues ???
webservice.getPartitionQueues ???
http.HandlerFunc.ServeHTTP ???

No JIRA yet. Fix looks straightforward.
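
The usual fix for this pattern is to split the method into a locking entry
point and a "locked" variant that assumes the lock is already held. A
minimal sketch with hypothetical types (not the actual objects.Queue code):

package main

import (
    "fmt"
    "sync"
)

// queue stands in for objects.Queue.
type queue struct {
    sync.RWMutex
    guaranteed int64
    allocated  int64
}

// HeadRoom is the public entry point: it takes the read lock exactly once.
func (q *queue) HeadRoom() int64 {
    q.RLock()
    defer q.RUnlock()
    return q.headRoomLocked()
}

// headRoomLocked assumes the caller already holds q's lock and never locks
// q again, so it is safe to call from any method that holds the lock.
func (q *queue) headRoomLocked() int64 {
    return q.guaranteed - q.allocated
}

// DAOInfo mirrors the GetPartitionQueueDAOInfo shape from the trace: it
// already holds the read lock, so it calls headRoomLocked, not HeadRoom.
func (q *queue) DAOInfo() string {
    q.RLock()
    defer q.RUnlock()
    return fmt.Sprintf("headroom=%d", q.headRoomLocked())
}

func main() {
    q := &queue{guaranteed: 100, allocated: 40}
    fmt.Println(q.DAOInfo()) // single RLock, no recursive locking
}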

I can see more reports from the tool, but these are the most obvious ones.
Not every "potential deadlock" is entirely clear to me at the moment; see
the JIRA attachment
https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt.

These reports also add a lot of noise to the log files that were uploaded
to YUNIKORN-2521, which can slow down our efforts to track down the real
problem.

I suggest fixing all of these.

Peter
