Thanks for the replies. I managed to make good progress on this issue.

There's one thing I'd like to raise. It's not critical, but IMO it needs
to be addressed.

The scope of the mutex-protected critical section is too large in
tryAllocate, tryReservedAllocate and tryPlaceholderAllocate:

func (sa *Application) trySomeAllocation() {
  sa.Lock()
  defer sa.Unlock()

  // ... doing things ...
}

The "doing things" can mean a lot of different operations. For example, in
tryAllocate(), this means:
- checking headroom/quota
- checking requiredNode: if it's set, then it tries to allocate on a
specific node
- iterating through available nodes with the node iterator
- if we don't have an allocation, we attempt to preempt existing allocations

This is a lot of work to do while holding the application lock. In YUNIKORN-2521
<https://issues.apache.org/jira/browse/YUNIKORN-2521> we can clearly see
why this is bad: the state dump times out, and other REST endpoints which try
to access the application also time out.

I suggest re-thinking these scopes so that the lock is only held while we
actually read or modify application state.
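
A rough sketch of the direction I'm thinking about (the type, field and
helper names below are made up for illustration, this is not the actual
Application code):

type app struct {
  sync.RWMutex
  pendingAsks []string
  allocations []string
}

// pickNode stands in for the real node iterator and preemption logic
func pickNode(asks, nodes []string) string {
  if len(asks) == 0 || len(nodes) == 0 {
    return ""
  }
  return nodes[0]
}

func (a *app) tryAllocate(nodes []string) string {
  // 1. copy the state we need while holding the lock
  a.RLock()
  asks := append([]string(nil), a.pendingAsks...)
  a.RUnlock()

  // 2. the expensive part (node iteration, preemption checks) runs unlocked
  node := pickNode(asks, nodes)

  // 3. re-acquire the lock only to validate and commit the outcome
  a.Lock()
  defer a.Unlock()
  if node == "" || len(a.pendingAsks) == 0 {
    return ""
  }
  a.allocations = append(a.allocations, node)
  a.pendingAsks = a.pendingAsks[1:]
  return node
}

The tricky part is that the application state can change between the
snapshot and the commit, so the commit step has to re-validate the ask,
but that is still much cheaper than holding the lock across the whole
node iteration and the preemption checks.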

Peter

On Mon, Apr 8, 2024 at 4:58 AM Wilfred Spiegelenburg <[email protected]>
wrote:

> Case 1:
> I am all for simplifying and removing locks. Changing the SI like you
> propose will trigger a YuniKorn 2.0 as it is incompatible with the
> current setup. There is a much simpler change that does not require a
> 2.0 version. See the comments in the JIRA.
>
> Case 2:
> This is a bug, I think, which has nothing to do with locking. The call
> to get the priority of the child is made before we have updated the
> properties for the child or even applied the template. We would not be
> calling that until after the properties are converted into settings in
> UpdateQueueProperties. That should solve the double lock issue as well.
>
> Case 3:
> The call(s) in GetPartitionQueueDAOInfo inside the lock could be
> dangerous. The getHeadRoom call walks up the queue hierarchy and takes
> locks there. It should be moved outside of the existing queue lock.
> The comment on the lock even says so, but it has slipped through.
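>
> Roughly something like this (just a sketch, the DAO fields are elided):
>
> func (sq *Queue) GetPartitionQueueDAOInfo() dao.PartitionQueueDAOInfo {
>   // getHeadRoom() walks up the parent queues and takes their locks,
>   // so call it before taking this queue's own lock
>   headRoom := sq.getHeadRoom()
>
>   sq.RLock()
>   defer sq.RUnlock()
>   queueInfo := dao.PartitionQueueDAOInfo{}
>   // ... fill queueInfo from the fields guarded by sq's lock, using the
>   // pre-computed headRoom where needed ...
>   return queueInfo
> }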
>
> Wilfred
>
> On Sun, 7 Apr 2024 at 05:33, Craig Condit <[email protected]> wrote:
> >
> > I’m all for fixing these… and in general where lockless algorithms can
> be implemented cleanly, I’m in favor of those implementations instead of
> requiring locks, so for RMProxy I’m +1 on that. The extra memory for an
> RMProxy instance is irrelevant.
> >
> > The recursive locking case is a real problem, and I’m surprised that
> hasn’t bitten us harder. It can cause all sorts of issues.
> >
> > Craig
> >
> > > On Apr 6, 2024, at 2:19 PM, Peter Bacsko <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > after YUNIKORN-2539 got merged, we identified some potential deadlocks.
> > > They are false positives for now, but a small change could make YuniKorn
> > > fall apart, so the term "potential deadlock" describes them well.
> > >
> > > Thoughts and opinions are welcome. IMO we should handle these with
> > > priority to ensure stability.
> > >
> > > *1. Locking order: Cache→RMProxy vs RMProxy→Cache*
> > >
> > > We grab the locks on these types in a different order (read the traces
> > > bottom to top):
> > >
> > > pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
> > > pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
> > > pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
> > > cache.(*Context).updateNodeResources ???
> > > cache.(*Context).updateNodeOccupiedResources ???
> > > cache.(*Context).updateForeignPod ???
> > > cache.(*Context).UpdatePod ???
> > >
> > > cache.(*Context).ForgetPod ??? <<<<<
> > > cache.(*Context).ForgetPod ???
> > > cache.(*AsyncRMCallback).UpdateAllocation ???
> > > rmproxy.(*RMProxy).triggerUpdateAllocation ???
> > > rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> > > rmproxy.(*RMProxy).handleRMEvents ???
> > >
> > > I already created a JIRA for this one:
> > > https://issues.apache.org/jira/browse/YUNIKORN-2543.
> > >
> > > Luckily we only call RLock() inside RMProxy while processing allocations,
> > > but if this changes, a deadlock is guaranteed. I made a POC which creates
> > > a lockless RMProxy instance via a factory method, see my comment in the
> > > JIRA. The scheduler core compiles & all tests pass.
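> > >
> > > Roughly the idea (a simplified sketch with made-up names, not the actual
> > > POC code; see the JIRA comment for the real change):
> > >
> > > // the callback is fixed when the factory builds the proxy, so reading
> > > // it later needs no lock at all
> > > type lockfreeRMProxy struct {
> > >   rmID     string
> > >   callback api.ResourceManagerCallback
> > > }
> > >
> > > func newRMProxy(rmID string, cb api.ResourceManagerCallback) *lockfreeRMProxy {
> > >   return &lockfreeRMProxy{rmID: rmID, callback: cb}
> > > }
> > >
> > > func (p *lockfreeRMProxy) GetResourceManagerCallback() api.ResourceManagerCallback {
> > >   return p.callback // immutable after construction
> > > }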
> > >
> > > *2. Lock order: multiple lock calls on the same goroutine which
> > > interfere with another goroutine*
> > >
> > > We shouldn't call RLock() multiple times, or after Lock(), on the same
> > > mutex instance. This is dangerous, as described here:
> > >
> > > https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
> > >
> > > objects.(*Queue).GetCurrentPriority ??? <<<<<   ← Queue RLock
> > > objects.(*Queue).GetCurrentPriority ???
> > > objects.(*Queue).addChildQueue ???    ← Queue Lock
> > > objects.NewConfiguredQueue ???
> > > scheduler.(*PartitionContext).addQueue ???
> > > scheduler.(*PartitionContext).addQueue ???
> > > scheduler.(*PartitionContext).initialPartitionFromConfig ???
> > > scheduler.newPartitionContext ???
> > > scheduler.(*ClusterContext).updateSchedulerConfig ???
> > > scheduler.(*ClusterContext).processRMRegistrationEvent ???
> > > scheduler.(*Scheduler).handleRMEvent ???
> > >
> > > objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
> > > objects.(*Queue).internalHeadRoom ???
> > > objects.(*Queue).getHeadRoom ???
> > > objects.(*Queue).getHeadRoom ???
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > scheduler.(*PartitionContext).GetPartitionQueues ???
> > > webservice.getPartitionQueues ???
> > > http.HandlerFunc.ServeHTTP ???
> > >
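> > > To show the hazard in isolation, here is a toy program (nothing to do
> > > with the scheduler code) that deadlocks for exactly this reason:
> > >
> > > package main
> > >
> > > import (
> > >     "sync"
> > >     "time"
> > > )
> > >
> > > var mu sync.RWMutex
> > >
> > > func main() {
> > >     mu.RLock() // first read lock, think GetPartitionQueueDAOInfo
> > >
> > >     go func() {
> > >         mu.Lock() // writer, think addChildQueue; waits for the reader
> > >         mu.Unlock()
> > >     }()
> > >
> > >     time.Sleep(100 * time.Millisecond) // let the writer start waiting
> > >
> > >     // second read lock on the same goroutine, think internalHeadRoom:
> > >     // sync.RWMutex blocks new readers once a writer is waiting, so this
> > >     // blocks while the writer waits for the first RUnlock() -> deadlock
> > >     mu.RLock()
> > >
> > >     mu.RUnlock()
> > >     mu.RUnlock()
> > > }
> > >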
> > > No JIRA yet. Fix looks straightforward.
> > >
> > > *3. Recursive locking*
> > >
> > > It's basically the same as #2, but it's not about an interaction between
> > > two goroutines: it's a single goroutine that grabs RLock() multiple times,
> > > which can be dangerous in itself.
> > >
> > > objects.(*Queue).internalHeadRoom ??? <<<<<   ← Queue RLock #2
> > > objects.(*Queue).internalHeadRoom ???
> > > objects.(*Queue).getHeadRoom ???
> > > objects.(*Queue).getHeadRoom ???
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???  ← Queue RLock #1
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > objects.(*Queue).GetPartitionQueueDAOInfo ???
> > > scheduler.(*PartitionContext).GetPartitionQueues ???
> > > webservice.getPartitionQueues ???
> > > http.HandlerFunc.ServeHTTP ???
> > >
> > > No JIRA yet. Fix looks straightforward.
> > >
> > > I can see more from the tool, but these are the most obvious ones. Not
> > > every "potential deadlock" is entirely clear to me at the moment. See the
> > > JIRA attachment:
> > >
> > > https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt
> > >
> > > These entries add a lot of noise to the log files uploaded
> > > to YUNIKORN-2521, and that can slow down our efforts to track down the
> > > real problem.
> > >
> > > I suggest fixing all of these.
> > >
> > > Peter
> >
> >
