Bob,

On Tue, Aug 2, 2022 at 7:58 PM Bob Peterson <rpete...@redhat.com> wrote:
> There are a couple places in function do_xmote where normal processing
> is circumvented due to withdraws in progress. However, since we bypass
> most of do_xmote() we bypass telling dlm to lock the dlm lock, which
> means dlm will never respond with a completion callback. Since the
> completion callback ordinarily clears GLF_LOCK, this patch changes
> function do_xmote to handle those situations more gracefully so the
> file system may be unmounted after withdraw.
>
> A very similar situation happens with the GLF_DEMOTE_IN_PROGRESS flag,
> which is cleared by function finish_xmote(). Since the withdraw causes
> us to skip the majority of do_xmote, it therefore also skips the call
> to finish_xmote() so the DEMOTE_IN_PROGRESS flag needs to be cleared
> manually as well.
>
> Signed-off-by: Bob Peterson <rpete...@redhat.com>
> ---
>  fs/gfs2/glock.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 0bfecffd71f1..d508d8fa0838 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -59,6 +59,8 @@ typedef void (*glock_examiner) (struct gfs2_glock * gl);
>
>  static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
>  static void __gfs2_glock_dq(struct gfs2_holder *gh);
> +static void handle_callback(struct gfs2_glock *gl, unsigned int state,
> +			    unsigned long delay, bool remote);
>
>  static struct dentry *gfs2_root;
>  static struct workqueue_struct *glock_workqueue;
> @@ -762,8 +764,21 @@ __acquires(&gl->gl_lockref.lock)
>  	int ret;
>
>  	if (target != LM_ST_UNLOCKED && glock_blocked_by_withdraw(gl) &&
> -	    gh && !(gh->gh_flags & LM_FLAG_NOEXP))
> +	    gh && !(gh->gh_flags & LM_FLAG_NOEXP)) {
> +		/*
> +		 * We won't tell dlm to perform the lock, so we won't get a
> +		 * reply that would otherwise clear GLF_LOCK. So we clear it.
> +		 */
> +		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> +		clear_bit(GLF_LOCK, &gl->gl_flags);
> +		clear_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
> +		/*
> +		 * Don't increment lockref here. The next time the worker runs it will do
> +		 * glock_put, which will decrement it to 0, and free the glock.
> +		 */

I don't understand the reference counting logic here: where does the
reference that we're supposedly passing on to the work function here come
from?

Note that further below in do_xmote(), we're calling gfs2_glock_hold()
followed by gfs2_glock_queue_work(), so the reference counting logic
seems normal there, except that when ->lm_lock returns an error, we're
apparently leaking a reference. So maybe the gfs2_glock_hold() should be
moved right in front of the gfs2_glock_queue_work() calls to make the
code less fragile?

> +		__gfs2_glock_queue_work(gl, GL_GLOCK_DFT_HOLD);
>  		return;
> +	}
>  	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
>  		      LM_FLAG_PRIORITY);
>  	GLOCK_BUG_ON(gl, gl->gl_state == target);
> @@ -848,6 +863,8 @@ __acquires(&gl->gl_lockref.lock)
>  	    (target != LM_ST_UNLOCKED ||
>  	     test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags))) {
>  		if (!is_system_glock(gl)) {
> +			clear_bit(GLF_LOCK, &gl->gl_flags);
> +			clear_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
>  			gfs2_glock_queue_work(gl, GL_GLOCK_DFT_HOLD);
>  			goto out;
>  		} else {
> --
> 2.36.1

Thanks,
Andreas