And, sadly, the change to wait all also doesn't prevent the schedule_timeout
from occurring. Something more subtle going on obviously.
Mark.
-----Original Message-----
From: Tim Smith <[email protected]>
Sent: 10 October 2018 09:23
To: [email protected]
Cc: Andreas Gruenbacher <[email protected]>; Ross Lagerwall
<[email protected]>; Mark Syms <[email protected]>
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find
insert glock
On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote:
> On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote:
> > On Tue, 9 Oct 2018 at 14:46, Tim Smith <[email protected]> wrote:
> > > On Tuesday, 9 October 2018 13:34:47 BST you wrote:
> > > > There must be another reason for the missed wake-up. I'll have
> > > > to study the code some more.
> > >
> > > I think it needs to go into gfs2_glock_dealloc(), in such a way as
> > > to avoid that problem. Currently working out a patch to do that.
> >
> > That doesn't sound right: find_insert_glock is waiting for the glock
> > to be removed from the rhashtable. In gfs2_glock_free, we remove the
> > glock from the rhashtable and then we do the wake-up. Delaying the
> > wakeup further isn't going to help (but it might hide the problem).
>
> The only way we can get the problem we're seeing is if we get an
> effective order of
>
> T0: wake_up_glock()
> T1: prepare_to_wait()
> T1: schedule()
>
> so clearly there's a way for that to happen. Any other order and
> either
> schedule() doesn't sleep or it gets woken.
>
> The only way I can see at the moment to ensure that wake_up_glock()
> *cannot* get called until after prepare_to_wait() is to delay it until
> the read_side critical sections are done, and the first place that's
> got that property is the start of gfs2_glock_dealloc(), unless we want
> to add synchronise_rcu() to gfs2_glock_free() and I'm guessing there's
> a reason it's using
> call_rcu() instead.
>
> I'll keep thinking about it.
OK, we have a result that this definitely *isn't* the right answer (still
getting the wakeup happening). This is lends more weight to the idea that there
are multiple waiters, so we'll try your patch.
--
Tim Smith <[email protected]>