I’m all for fixing these… and in general where lockless algorithms can be 
implemented cleanly, I’m in favor of those implementations instead of requiring 
locks, so for RMProxy I’m +1 on that. The extra memory for an RMProxy instance 
is irrelevant.

The recursive locking case is a real problem, and I’m surprised it hasn’t 
bitten us harder. It can cause all sorts of issues.

Craig

> On Apr 6, 2024, at 2:19 PM, Peter Bacsko <[email protected]> wrote:
> 
> Hi all,
> 
> After YUNIKORN-2539 was merged, we identified some potential deadlocks.
> They are false positives right now, but a small change could make YuniKorn
> fall apart, so the term "potential deadlock" describes them well.
> 
> Thoughts and opinions are welcome. IMO we should handle these with priority
> to ensure stability.
> 
> *1. Locking order: Cache→RMProxy vs RMProxy→Cache*
> 
> We grab the locks of these two types in opposite orders (read each trace bottom to top):
> 
> pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback
> ??? <<<<<
> pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
> pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
> cache.(*Context).updateNodeResources ???
> cache.(*Context).updateNodeOccupiedResources ???
> cache.(*Context).updateForeignPod ???
> cache.(*Context).UpdatePod ???
> 
> cache.(*Context).ForgetPod ??? <<<<<
> cache.(*Context).ForgetPod ???
> cache.(*AsyncRMCallback).UpdateAllocation ???
> rmproxy.(*RMProxy).triggerUpdateAllocation ???
> rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> rmproxy.(*RMProxy).handleRMEvents ???
> 
> I already created a JIRA for this one:
> https://issues.apache.org/jira/browse/YUNIKORN-2543.
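> 
> To make the inversion concrete, here is a minimal, hypothetical Go sketch
> (simplified stand-ins, not the real Context/RMProxy code) of what happens
> once one of the RMProxy acquisitions becomes a write lock: each goroutine
> ends up waiting on the lock the other already holds.
> 
> package main
> 
> import (
>     "sync"
>     "time"
> )
> 
> var cacheLock, proxyLock sync.RWMutex
> 
> // cache -> rmproxy path (think UpdatePod ... RMProxy.UpdateNode)
> func updatePod() {
>     cacheLock.Lock()
>     defer cacheLock.Unlock()
>     time.Sleep(10 * time.Millisecond) // widen the window so both goroutines take their first lock
>     proxyLock.RLock()                 // blocks: the other goroutine holds proxyLock for writing
>     defer proxyLock.RUnlock()
> }
> 
> // rmproxy -> cache path (think handleRMEvents ... ForgetPod)
> func handleRMEvent() {
>     proxyLock.Lock()
>     defer proxyLock.Unlock()
>     time.Sleep(10 * time.Millisecond)
>     cacheLock.Lock() // blocks: the other goroutine holds cacheLock
>     defer cacheLock.Unlock()
> }
> 
> func main() {
>     var wg sync.WaitGroup
>     wg.Add(2)
>     go func() { defer wg.Done(); updatePod() }()
>     go func() { defer wg.Done(); handleRMEvent() }()
>     wg.Wait() // never returns: the runtime reports "all goroutines are asleep - deadlock!"
> }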
> 
> Luckily we only call RLock() inside RMProxy while processing allocations,
> but if this changes, deadlock is guaranteed. I made a POC which creates a
> lockless RMProxy instance after calling a factory method, see my comment in
> the JIRA. The scheduler core compiles & all tests pass.
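> 
> As a sketch of the idea (simplified, with illustrative names rather than the
> real RMProxy fields): every field is assigned once in the factory and never
> mutated afterwards, so concurrent readers need no mutex at all.
> 
> package rmproxy
> 
> // Callback stands in for the resource manager callback interface.
> type Callback interface {
>     UpdateAllocation() error
> }
> 
> // locklessProxy holds no mutex: its fields are set exactly once in the
> // factory below and never change, so it is safe to read from any goroutine.
> type locklessProxy struct {
>     rmID     string
>     callback Callback
> }
> 
> // newLocklessProxy is the factory; after it returns, the instance is
> // effectively immutable.
> func newLocklessProxy(rmID string, cb Callback) *locklessProxy {
>     return &locklessProxy{rmID: rmID, callback: cb}
> }
> 
> // GetResourceManagerCallback needs no RLock() because nothing is ever mutated.
> func (p *locklessProxy) GetResourceManagerCallback() Callback {
>     return p.callback
> }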
> 
> *2. Lock order: multiple lock calls on the same goroutine that interfere
> with another goroutine*
> 
> We shouldn't call RLock() multiple times, or after Lock(), on the same mutex
> instance. This is dangerous, as described here (a minimal reproduction
> follows the traces below):
> https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
> 
> objects.(*Queue).GetCurrentPriority ??? <<<<<   ← Queue RLock
> objects.(*Queue).GetCurrentPriority ???
> objects.(*Queue).addChildQueue ???    ← Queue Lock
> objects.NewConfiguredQueue ???
> scheduler.(*PartitionContext).addQueue ???
> scheduler.(*PartitionContext).addQueue ???
> scheduler.(*PartitionContext).initialPartitionFromConfig ???
> scheduler.newPartitionContext ???
> scheduler.(*ClusterContext).updateSchedulerConfig ???
> scheduler.(*ClusterContext).processRMRegistrationEvent ???
> scheduler.(*Scheduler).handleRMEvent ???
> 
> objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
> objects.(*Queue).internalHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> scheduler.(*PartitionContext).GetPartitionQueues ???
> webservice.getPartitionQueues ???
> http.HandlerFunc.ServeHTTP ???
> 
> No JIRA yet. Fix looks straightforward.
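> 
> A minimal standalone reproduction of what the go-deadlock README describes
> (plain Go, not YuniKorn code): once a writer is queued on a sync.RWMutex, new
> readers block, so a goroutine that already holds an RLock() and asks for a
> second one waits for the writer, while the writer waits for that first
> RLock() to be released.
> 
> package main
> 
> import (
>     "sync"
>     "time"
> )
> 
> func main() {
>     var mu sync.RWMutex
> 
>     go func() {
>         mu.RLock()                         // first read lock (think GetCurrentPriority)
>         time.Sleep(100 * time.Millisecond) // give the writer time to queue up
>         mu.RLock()                         // second read lock: blocks behind the queued writer
>         mu.RUnlock()
>         mu.RUnlock()
>     }()
> 
>     time.Sleep(50 * time.Millisecond)
>     mu.Lock()   // writer (think addChildQueue): waits for the first read lock forever
>     mu.Unlock() // never reached; the runtime aborts with "all goroutines are asleep - deadlock!"
> }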
> 
> *3. Recursive locking*
> 
> It's basically the same as #2, but it's not about an interaction between
> two goroutines: a single goroutine grabs RLock() multiple times, which can be
> dangerous in itself (a sketch of the usual fix follows the trace below).
> 
> objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
> objects.(*Queue).internalHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).getHeadRoom ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> objects.(*Queue).GetPartitionQueueDAOInfo ???
> scheduler.(*PartitionContext).GetPartitionQueues ???
> webservice.getPartitionQueues ???
> http.HandlerFunc.ServeHTTP ???
> 
> No JIRA yet. Fix looks straightforward.
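> 
> The usual fix for both #2 and #3 is to take the lock only in the exported
> entry point and let the internal helpers assume it is already held. A
> simplified sketch (not the real objects.Queue):
> 
> package objects
> 
> import "sync"
> 
> // queue is a simplified stand-in for objects.Queue.
> type queue struct {
>     sync.RWMutex
>     priority int
> }
> 
> // GetCurrentPriority is the exported entry point: it takes the read lock
> // exactly once.
> func (q *queue) GetCurrentPriority() int {
>     q.RLock()
>     defer q.RUnlock()
>     return q.currentPriority()
> }
> 
> // currentPriority assumes the caller already holds q's lock, so other methods
> // that already hold RLock() (e.g. a DAO builder) can call it without locking
> // again.
> func (q *queue) currentPriority() int {
>     return q.priority
> }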
> 
> I can see more reports from the tool, but these are the most obvious ones.
> Not every "potential deadlock" is entirely clear to me at the moment. See the
> JIRA attachment:
> https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt.
> 
> These add a lot of noise to the log files that were uploaded to
> YUNIKORN-2521, which can slow down our efforts to track down the real
> problem.
> 
> I suggest fixing all of these.
> 
> Peter

