On 09/10/18 09:13, Mark Syms wrote:
Having swapped the line below around we still see the timeout on schedule fire, 
but only once in
a fairly mega stress test. This is why we weren't worried about the timeout 
being HZ, the situation
is hardly ever hit as having to wait is rare and normally we are woken from 
schedule and without
a timeout on schedule we never wake up so a rare occurrence of waiting a second 
really doesn't
seem too bad.

Mark.
We should still get to the bottom of why the wake up is missing though, since without that fix we won't know if there is something else wrong somewhere,

Steve.


-----Original Message-----
From: Tim Smith <tim.sm...@citrix.com>
Sent: 08 October 2018 14:27
To: Steven Whitehouse <swhit...@redhat.com>
Cc: Mark Syms <mark.s...@citrix.com>; cluster-devel@redhat.com; Ross Lagerwall 
<ross.lagerw...@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock

On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:
Hi,

On 08/10/18 14:10, Tim Smith wrote:
On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:
On 08/10/18 13:59, Mark Syms wrote:
That sounds entirely reasonable so long as you are absolutely sure
that nothing is ever going to mess with that glock, we erred on
the side of more caution not knowing whether it would be guaranteed safe or not.

Thanks,

        Mark
We should have a look at the history to see how that wait got added.
However the "dead" flag here means "don't touch this glock" and is
there so that we can separate the marking dead from the actual
removal from the list (which simplifies the locking during the
scanning procedures)
You beat me to it :-)

I think there might be a bit of a problem inserting a new entry with
the same name before the old entry has been fully destroyed (or at
least removed), which would be why the schedule() is there.
If the old entry is marked dead, all future lookups should ignore it.
We should only have a single non-dead entry at a time, but that
doesn't seem like it should need us to wait for it.
On the second call we do have the new glock to insert as arg2, so we could try 
to swap them cleanly, yeah.

If we do discover that the wait is really required, then it sounds
like as you mentioned above there is a lost wakeup, and that must
presumably be on a code path that sets the dead flag and then fails to
send a wake up later on. If we can drop the wait in the first place,
that seems like a better plan,
Ooooh, I wonder if these two lines:

        wake_up_glock(gl);
        call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);

in gfs2_glock_free() are the wrong way round?

--
Tim Smith <tim.sm...@citrix.com>



Reply via email to