On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote: > On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote: > > On Tue, 9 Oct 2018 at 14:46, Tim Smith <tim.sm...@citrix.com> wrote: > > > On Tuesday, 9 October 2018 13:34:47 BST you wrote: > > > > There must be another reason for the missed wake-up. I'll have to > > > > study the code some more. > > > > > > I think it needs to go into gfs2_glock_dealloc(), in such a way as to > > > avoid > > > that problem. Currently working out a patch to do that. > > > > That doesn't sound right: find_insert_glock is waiting for the glock > > to be removed from the rhashtable. In gfs2_glock_free, we remove the > > glock from the rhashtable and then we do the wake-up. Delaying the > > wakeup further isn't going to help (but it might hide the problem). > > The only way we can get the problem we're seeing is if we get an effective > order of > > T0: wake_up_glock() > T1: prepare_to_wait() > T1: schedule() > > so clearly there's a way for that to happen. Any other order and either > schedule() doesn't sleep or it gets woken. > > The only way I can see at the moment to ensure that wake_up_glock() *cannot* > get called until after prepare_to_wait() is to delay it until the read_side > critical sections are done, and the first place that's got that property is > the start of gfs2_glock_dealloc(), unless we want to add synchronise_rcu() > to gfs2_glock_free() and I'm guessing there's a reason it's using > call_rcu() instead. > > I'll keep thinking about it.
OK, we have a result that this definitely *isn't* the right answer (still getting the wakeup happening). This is lends more weight to the idea that there are multiple waiters, so we'll try your patch. -- Tim Smith <tim.sm...@citrix.com>