We think we have, just making a build to test. Will follow up later.
Mark. -----Original Message----- From: Steven Whitehouse <swhit...@redhat.com> Sent: 09 October 2018 09:41 To: Mark Syms <mark.s...@citrix.com>; Tim Smith <tim.sm...@citrix.com> Cc: cluster-devel@redhat.com; Ross Lagerwall <ross.lagerw...@citrix.com> Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock On 09/10/18 09:13, Mark Syms wrote: > Having swapped the line below around we still see the timeout on > schedule fire, but only once in a fairly mega stress test. This is why > we weren't worried about the timeout being HZ, the situation is hardly > ever hit as having to wait is rare and normally we are woken from > schedule and without a timeout on schedule we never wake up so a rare > occurrence of waiting a second really doesn't seem too bad. > > Mark. We should still get to the bottom of why the wake up is missing though, since without that fix we won't know if there is something else wrong somewhere, Steve. > > -----Original Message----- > From: Tim Smith <tim.sm...@citrix.com> > Sent: 08 October 2018 14:27 > To: Steven Whitehouse <swhit...@redhat.com> > Cc: Mark Syms <mark.s...@citrix.com>; cluster-devel@redhat.com; Ross > Lagerwall <ross.lagerw...@citrix.com> > Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in > find insert glock > > On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote: >> Hi, >> >> On 08/10/18 14:10, Tim Smith wrote: >>> On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote: >>>> On 08/10/18 13:59, Mark Syms wrote: >>>>> That sounds entirely reasonable so long as you are absolutely sure >>>>> that nothing is ever going to mess with that glock, we erred on >>>>> the side of more caution not knowing whether it would be guaranteed safe >>>>> or not. >>>>> >>>>> Thanks, >>>>> >>>>> Mark >>>> We should have a look at the history to see how that wait got added. >>>> However the "dead" flag here means "don't touch this glock" and is >>>> there so that we can separate the marking dead from the actual >>>> removal from the list (which simplifies the locking during the >>>> scanning procedures) >>> You beat me to it :-) >>> >>> I think there might be a bit of a problem inserting a new entry with >>> the same name before the old entry has been fully destroyed (or at >>> least removed), which would be why the schedule() is there. >> If the old entry is marked dead, all future lookups should ignore it. >> We should only have a single non-dead entry at a time, but that >> doesn't seem like it should need us to wait for it. > On the second call we do have the new glock to insert as arg2, so we could > try to swap them cleanly, yeah. > >> If we do discover that the wait is really required, then it sounds >> like as you mentioned above there is a lost wakeup, and that must >> presumably be on a code path that sets the dead flag and then fails >> to send a wake up later on. If we can drop the wait in the first >> place, that seems like a better plan, > Ooooh, I wonder if these two lines: > > wake_up_glock(gl); > call_rcu(&gl->gl_rcu, gfs2_glock_dealloc); > > in gfs2_glock_free() are the wrong way round? > > -- > Tim Smith <tim.sm...@citrix.com> > >