And, sadly, the change to wait all also doesn't prevent the schedule_timeout from occurring. Something more subtle going on obviously.
Mark. -----Original Message----- From: Tim Smith <tim.sm...@citrix.com> Sent: 10 October 2018 09:23 To: cluster-devel@redhat.com Cc: Andreas Gruenbacher <agrue...@redhat.com>; Ross Lagerwall <ross.lagerw...@citrix.com>; Mark Syms <mark.s...@citrix.com> Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote: > On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote: > > On Tue, 9 Oct 2018 at 14:46, Tim Smith <tim.sm...@citrix.com> wrote: > > > On Tuesday, 9 October 2018 13:34:47 BST you wrote: > > > > There must be another reason for the missed wake-up. I'll have > > > > to study the code some more. > > > > > > I think it needs to go into gfs2_glock_dealloc(), in such a way as > > > to avoid that problem. Currently working out a patch to do that. > > > > That doesn't sound right: find_insert_glock is waiting for the glock > > to be removed from the rhashtable. In gfs2_glock_free, we remove the > > glock from the rhashtable and then we do the wake-up. Delaying the > > wakeup further isn't going to help (but it might hide the problem). > > The only way we can get the problem we're seeing is if we get an > effective order of > > T0: wake_up_glock() > T1: prepare_to_wait() > T1: schedule() > > so clearly there's a way for that to happen. Any other order and > either > schedule() doesn't sleep or it gets woken. > > The only way I can see at the moment to ensure that wake_up_glock() > *cannot* get called until after prepare_to_wait() is to delay it until > the read_side critical sections are done, and the first place that's > got that property is the start of gfs2_glock_dealloc(), unless we want > to add synchronise_rcu() to gfs2_glock_free() and I'm guessing there's > a reason it's using > call_rcu() instead. > > I'll keep thinking about it. OK, we have a result that this definitely *isn't* the right answer (still getting the wakeup happening). This is lends more weight to the idea that there are multiple waiters, so we'll try your patch. -- Tim Smith <tim.sm...@citrix.com>