On Tuesday, 9 October 2018 13:34:47 BST you wrote: > On Mon, 8 Oct 2018 at 15:27, Tim Smith <tim.sm...@citrix.com> wrote: > > On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote: > > > Hi, > > > > > > On 08/10/18 14:10, Tim Smith wrote: > > > > On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote: > > > >> On 08/10/18 13:59, Mark Syms wrote: > > > >>> That sounds entirely reasonable so long as you are absolutely sure > > > >>> that > > > >>> nothing is ever going to mess with that glock, we erred on the side > > > >>> of > > > >>> more caution not knowing whether it would be guaranteed safe or not. > > > >>> > > > >>> Thanks, > > > >>> > > > >>> Mark > > > >> > > > >> We should have a look at the history to see how that wait got added. > > > >> However the "dead" flag here means "don't touch this glock" and is > > > >> there > > > >> so that we can separate the marking dead from the actual removal from > > > >> the list (which simplifies the locking during the scanning > > > >> procedures) > > > > > > > > You beat me to it :-) > > > > > > > > I think there might be a bit of a problem inserting a new entry with > > > > the > > > > same name before the old entry has been fully destroyed (or at least > > > > removed), which would be why the schedule() is there. > > > > > > If the old entry is marked dead, all future lookups should ignore it. We > > > should only have a single non-dead entry at a time, but that doesn't > > > seem like it should need us to wait for it. > > > > On the second call we do have the new glock to insert as arg2, so we could > > try to swap them cleanly, yeah. > > > > > If we do discover that the wait is really required, then it sounds like > > > as you mentioned above there is a lost wakeup, and that must presumably > > > be on a code path that sets the dead flag and then fails to send a wake > > > up later on. If we can drop the wait in the first place, that seems like > > > a better plan, > > > > Ooooh, I wonder if these two lines: > > wake_up_glock(gl); > > call_rcu(&gl->gl_rcu, gfs2_glock_dealloc); > > > > in gfs2_glock_free() are the wrong way round? > > No, what's important here is that the wake-up happens after the glock > being freed has been removed from the rhashtable, and that's the case > here. wake_up_glock actually accesses the glock, which may no longer > be around after the call_rcu, so swapping the two lines would > introduce a use-after-free bug.
Yes, realised that :-/ > There must be another reason for the missed wake-up. I'll have to > study the code some more. I think it needs to go into gfs2_glock_dealloc(), in such a way as to avoid that problem. Currently working out a patch to do that. > Thanks, > Andreas -- Tim Smith <tim.sm...@citrix.com>