Scheduling is lock free until we get to any of the application.Try...() calls: the scheduling thread does not hold any locks before that point. That is how it was designed and implemented. Once we get there, nothing but the scheduling thread is allowed to make changes to the application for the duration of the scheduler working on that application. Letting go of the application lock would break that assumption, and that cannot happen.
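
To make that assumption concrete, here is a minimal sketch of the pattern
(illustrative types and names only, not the actual yunikorn-core code):

package sketch

import "sync"

// Illustrative types only; not the real yunikorn-core structures.
type allocation struct{ nodeID string }

type application struct {
    sync.RWMutex
    pendingAsks int
}

// tryAllocate takes the application lock on entry and holds it for the whole
// attempt, so nothing else can change the application while the scheduler
// works on it.
func (sa *application) tryAllocate(nodes []string) *allocation {
    sa.Lock()
    defer sa.Unlock()
    // headroom checks, required-node handling, node iteration, preemption ...
    if sa.pendingAsks > 0 && len(nodes) > 0 {
        sa.pendingAsks--
        return &allocation{nodeID: nodes[0]}
    }
    return nil
}

// schedule is the scheduling-thread side: it holds no locks at all until it
// calls into tryAllocate on an application.
func schedule(apps []*application, nodes []string) []*allocation {
    var allocs []*allocation
    for _, app := range apps {
        // no locks held by the scheduling thread up to this point
        if alloc := app.tryAllocate(nodes); alloc != nil {
            allocs = append(allocs, alloc)
        }
    }
    return allocs
}

Releasing and re-taking the application lock anywhere inside tryAllocate would
let another goroutine change the application halfway through the attempt,
which is exactly the assumption described above.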
Setting the termination callback for the application only happens once: when we add the newly created application to the partition. We should fix the timing and/or locking around this as part of PR #838, in which AddApplication in the partition is refactored.

The locking for accessing nodes in the partition is unneeded. Those locks are left over from before the node iterator was changed: the nodes used to be a simple map directly in the partition, so any call needed to lock the partition. Now the nodes are an object that is set on partition creation and handles its own locking. All of the locks in the partition around the nodes object should be removed; they are not needed.
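
Roughly, the shape after that change looks like this (a sketch with made-up
names, not the actual partition code):

package sketch

import "sync"

type node struct{ nodeID string }

// nodeCollection owns its own lock: callers never need to hold the partition
// lock to read or update nodes.
type nodeCollection struct {
    sync.RWMutex
    nodes map[string]*node
}

func newNodeCollection() *nodeCollection {
    return &nodeCollection{nodes: make(map[string]*node)}
}

func (nc *nodeCollection) addNode(n *node) {
    nc.Lock()
    defer nc.Unlock()
    nc.nodes[n.nodeID] = n
}

func (nc *nodeCollection) getNode(nodeID string) *node {
    nc.RLock()
    defer nc.RUnlock()
    return nc.nodes[nodeID]
}

// partition only keeps the reference that was set at creation time; calls into
// the nodes object need no partition-level lock around them.
type partition struct {
    nodes *nodeCollection
}

func (p *partition) getNode(nodeID string) *node {
    return p.nodes.getNode(nodeID) // no partition lock taken or needed
}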
Wilfred

On Thu, 11 Apr 2024 at 21:44, Peter Bacsko <[email protected]> wrote:
>
> Thanks for the replies. I managed to get good progress on this issue.
>
> There's a thing which I'd like to talk about. It's not something which is
> critical but it needs to be addressed IMO.
>
> The scope of the mutex-protected critical section is too large in
> tryAllocate, tryReservedAllocate and tryPlaceholderAllocate:
>
> func (sa *Application) trySomeAllocation() {
>     sa.Lock()
>     defer sa.Unlock()
>
>     .. doing things ...
> }
>
> The "doing things" can mean a lot of different operations. For example, in
> tryAllocate(), this means:
> - checking headroom/quota
> - checking requiredNode: if it's set, then it tries to allocate on a
>   specific node
> - iterating through available nodes with the node iterator
> - if we don't have an allocation, we attempt to preempt existing
>   allocations
>
> This is a lot of work while holding the application lock. In YUNIKORN-2521
> <https://issues.apache.org/jira/browse/YUNIKORN-2521>, we can clearly see
> why this is bad: the state dump times out, and other REST endpoints which
> try to access the application will also time out.
>
> I suggest re-thinking the scopes to avoid this.
>
> Peter
>
> On Mon, Apr 8, 2024 at 4:58 AM Wilfred Spiegelenburg <[email protected]>
> wrote:
>
> > Case 1:
> > I am all for simplifying and removing locks. Changing the SI like you
> > propose will trigger a YuniKorn 2.0 as it is incompatible with the
> > current setup. There is a much simpler change that does not require a
> > 2.0 version. See comments in the jira.
> >
> > Case 2:
> > This is a bug I think, which has nothing to do with locking. The call
> > to get the priority of the child is made before we have updated the
> > properties for the child or even applied the template. We should not be
> > calling that until after the properties are converted into settings in
> > UpdateQueueProperties. That should solve the double lock issue also.
> >
> > Case 3:
> > The call(s) in GetPartitionQueueDAOInfo inside the lock could be
> > dangerous. The getHeadRoom call walks up the queue and takes locks
> > there. It should be moved outside of the existing queue lock. The
> > comment on the lock even says so but it has slipped through.
> >
> > Wilfred
> >
> > On Sun, 7 Apr 2024 at 05:33, Craig Condit <[email protected]> wrote:
> > >
> > > I'm all for fixing these… and in general where lockless algorithms
> > > can be implemented cleanly, I'm in favor of those implementations
> > > instead of requiring locks, so for RMProxy I'm +1 on that. The extra
> > > memory for an RMProxy instance is irrelevant.
> > >
> > > The recursive locking case is a real problem, and I'm surprised that
> > > hasn't bitten us harder. It can cause all sorts of issues.
> > >
> > > Craig
> > >
> > > > On Apr 6, 2024, at 2:19 PM, Peter Bacsko <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > after YUNIKORN-2539 got merged, we identified some potential
> > > > deadlocks. These are false positives now, but a small change can
> > > > cause Yunikorn to fall apart, so the term "potential deadlock"
> > > > describes them properly.
> > > >
> > > > Thoughts, opinions are welcome. IMO we should handle these with
> > > > priority to ensure stability.
> > > >
> > > > *1. Locking order: Cache→RMProxy vs RMProxy→Cache*
> > > >
> > > > We grab the locks of these types in different order (read bottom to
> > > > top):
> > > >
> > > > pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> > > > pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
> > > > pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
> > > > cache.(*Context).updateNodeResources ???
> > > > cache.(*Context).updateNodeOccupiedResources ???
> > > > cache.(*Context).updateForeignPod ???
> > > > cache.(*Context).UpdatePod ???
> > > >
> > > > cache.(*Context).ForgetPod ??? <<<<<
> > > > cache.(*Context).ForgetPod ???
> > > > cache.(*AsyncRMCallback).UpdateAllocation ???
> > > > rmproxy.(*RMProxy).triggerUpdateAllocation ???
> > > > rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> > > > rmproxy.(*RMProxy).handleRMEvents ???
> > > >
> > > > I already created a JIRA for this one:
> > > > https://issues.apache.org/jira/browse/YUNIKORN-2543.
> > > >
> > > > Luckily we only call RLock() inside RMProxy while processing
> > > > allocations, but if this changes, deadlock is guaranteed. I made a
> > > > POC which creates a lockless RMProxy instance after calling a
> > > > factory method, see my comment in the JIRA. The scheduler core
> > > > compiles & all tests pass.
> > > >
> > > > *2. Lock order: multiple lock calls on the same goroutine which
> > > > interfere with another goroutine*
> > > >
> > > > We shouldn't call RLock() multiple times or after Lock() on the
> > > > same mutex instance. This is dangerous as described here:
> > > > https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
> > > >
> > > > objects.(*Queue).GetCurrentPriority ??? <<<<< ← Queue RLock
> > > > objects.(*Queue).GetCurrentPriority ???
> > > > objects.(*Queue).addChildQueue ??? ← Queue Lock
> > > > objects.NewConfiguredQueue ???
> > > > scheduler.(*PartitionContext).addQueue ???
> > > > scheduler.(*PartitionContext).addQueue ???
> > > > scheduler.(*PartitionContext).initialPartitionFromConfig ???
> > > > scheduler.newPartitionContext ???
> > > > scheduler.(*ClusterContext).updateSchedulerConfig ???
> > > > scheduler.(*ClusterContext).processRMRegistrationEvent ???
> > > > scheduler.(*Scheduler).handleRMEvent ???
> > > >
> > > > objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
> > > > objects.(*Queue).internalHeadRoom ???
> > > > objects.(*Queue).getHeadRoom ???
> > > > objects.(*Queue).getHeadRoom ???
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > > scheduler.(*PartitionContext).GetPartitionQueues ???
> > > > webservice.getPartitionQueues ???
> > > > http.HandlerFunc.ServeHTTP ???
> > > >
> > > > No JIRA yet. Fix looks straightforward.
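
For reference, a minimal standalone program (hypothetical code, not from
yunikorn-core) showing the failure mode behind that go-deadlock warning: the
second RLock() blocks as soon as a writer queues up between the two read
locks.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.RWMutex
    done := make(chan struct{})

    // Goroutine A: grabs the same read lock twice, like the trace above.
    go func() {
        mu.RLock()
        time.Sleep(100 * time.Millisecond) // window in which the writer queues up
        mu.RLock()                         // blocks behind the pending writer
        mu.RUnlock()
        mu.RUnlock()
        close(done)
    }()

    // Goroutine B: a writer arriving between A's two RLock() calls.
    go func() {
        time.Sleep(50 * time.Millisecond)
        mu.Lock() // waits for A's RUnlock and blocks new readers, including A
        mu.Unlock()
    }()

    select {
    case <-done:
        fmt.Println("finished: the writer did not land between the RLock calls")
    case <-time.After(2 * time.Second):
        fmt.Println("deadlock: reader waits for writer, writer waits for reader")
    }
}

With sync.RWMutex a pending Lock() blocks new readers, so the re-entrant read
lock and the queued writer end up waiting on each other.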
> > > > *3. Recursive locking*
> > > >
> > > > It's basically the same as #2, but it's not about an interaction
> > > > between two goroutines. It's just a single goroutine which grabs
> > > > RLock() multiple times. This can be dangerous in itself.
> > > >
> > > > objects.(*Queue).internalHeadRoom ??? <<<<< ← Queue RLock #2
> > > > objects.(*Queue).internalHeadRoom ???
> > > > objects.(*Queue).getHeadRoom ???
> > > > objects.(*Queue).getHeadRoom ???
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ??? ← Queue RLock #1
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > > scheduler.(*PartitionContext).GetPartitionQueues ???
> > > > webservice.getPartitionQueues ???
> > > > http.HandlerFunc.ServeHTTP ???
> > > >
> > > > No JIRA yet. Fix looks straightforward.
> > > >
> > > > I can see more from the tool but these are the most obvious ones.
> > > > Not every "potential deadlock" is exactly clear to me at the moment.
> > > > See the JIRA attachment
> > > > https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt.
> > > >
> > > > These cause a lot of noise in the log files which were uploaded to
> > > > YUNIKORN-2521. It can slow down our efforts to track down the real
> > > > problem.
> > > >
> > > > I suggest fixing all of these.
> > > >
> > > > Peter
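
For the GetPartitionQueueDAOInfo traces in #2 and #3, the direction described
under Case 3 above boils down to copying the queue's own fields under a short
read lock and computing the headroom only after that lock is released. A
minimal sketch under that assumption (illustrative types and names, not the
actual Queue code):

package sketch

import "sync"

type queue struct {
    sync.RWMutex
    parent *queue
    max    int64
    used   int64
}

// headRoom walks up the hierarchy, taking each queue's read lock on the way.
// It must therefore never be called while a queue lock is already held.
func (q *queue) headRoom() int64 {
    q.RLock()
    head := q.max - q.used
    parent := q.parent
    q.RUnlock()
    if parent != nil {
        if ph := parent.headRoom(); ph < head {
            head = ph
        }
    }
    return head
}

// GetQueueDAOInfo copies the per-queue fields under a short read lock and
// computes the headroom afterwards, outside the lock, so there is no
// re-entrant RLock() on the same queue.
func (q *queue) GetQueueDAOInfo() map[string]int64 {
    q.RLock()
    maxRes, usedRes := q.max, q.used
    q.RUnlock()
    return map[string]int64{
        "maxMemory":  maxRes,
        "usedMemory": usedRes,
        "headRoom":   q.headRoom(), // taken after the queue lock was released
    }
}

The headroom walk still takes child-then-parent locks, but never while the
caller already holds the same queue's lock.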
