On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >          int ls_jid;
> >          unsigned int ls_first;
> > -        unsigned int ls_first_done;
> >          unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.
yes, I can use a flag in place of ls_first.
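Something like this, reusing the DFL_FIRST_MOUNT bit that this patch
already adds to ls_recover_flags (just a sketch of the shape):

        /* instead of setting/clearing ls->ls_first */
        set_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
        clear_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);

        /* instead of reading ls->ls_first */
        if (test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags))
                fs_info(sdp, "first mounter\n");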
> > +        int ls_recover_jid_done; /* read by gfs_controld */
> > +        int ls_recover_jid_status; /* read by gfs_controld */
>                                          ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.
I'll add a longer comment saying something like that.
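Roughly this (wording to be refined):

        /*
         * Only used by old versions of gfs_controld that read the
         * recovery state from here.  Recent gfs_controld versions
         * take their cue from the uevents instead, so these two
         * fields exist only for backwards compatibility and will
         * be removed at some future date.
         */
        int ls_recover_jid_done;
        int ls_recover_jid_status;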
> > +        /*
> > +         * Other nodes need to do some work in dlm recovery and gfs2_control
> > +         * before the recover_done and control_lock will be ready for us below.
> > +         * A delay here is not required but often avoids having to retry.
> > +         */
> > +
> > +        msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock than
> add delays of arbitrary time periods into the code.
I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below; but with NOQUEUE you have a
much better chance of being able to kill a mount command, which is a
fairly nice feature, I think.
- removing the delay, which results in nodes often making fast, repeated
lock attempts, which could get rather excessive. I'd be worried about
having that kind of unbounded loop sitting there.
- using some kind of delay (a sketch of the shape is below).
While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?
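For reference, the shape I have in mind for the delay option is below.
It's only a sketch: control_lock() stands for the NOQUEUE requests this
patch makes during mount, and the fatal_signal_pending() check is what
keeps the mount command killable.

        for (;;) {
                error = control_lock(sdp, DLM_LOCK_EX,
                                     DLM_LKF_NOQUEUE | DLM_LKF_VALBLK);
                if (error != -EAGAIN)
                        break;
                /* NOQUEUE request was rejected, another node holds the
                   lock; back off so the retries don't get excessive */
                msleep(500);
                if (fatal_signal_pending(current)) {
                        error = -EINTR;
                        break;
                }
        }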
> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +        struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +        char lvb_bits[GDLM_LVB_SIZE];
> > +        uint32_t start_gen, block_gen;
> > +        int error;
> > +
> > +restart:
> > +        spin_lock(&ls->ls_recover_spin);
> > +        start_gen = ls->ls_recover_start;
> > +        block_gen = ls->ls_recover_block;
> > +
> > +        if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +            !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +            !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +                /* sanity check, should not happen */
> > +                fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +                       start_gen, block_gen, ls->ls_recover_flags);
> > +                spin_unlock(&ls->ls_recover_spin);
> > +                control_unlock(sdp);
> > +                return -1;
> > +        }
> > +
> > +        if (start_gen == block_gen) {
> > +                /*
> > +                 * Wait for the end of a dlm recovery cycle to switch from
> > +                 * first mounter recovery. We can ignore any recover_slot
> > +                 * callbacks between the recover_prep and next recover_done
> > +                 * because we are still the first mounter and any failed nodes
> > +                 * have not fully mounted, so they don't need recovery.
> > +                 */
> > +                spin_unlock(&ls->ls_recover_spin);
> > +                fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +                msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.
This msleep just slows down a rare loop so that it wakes up a couple of
times rather than once, as it would with a proper wait mechanism. It's
waiting for the next recover_done() callback, which the dlm will call
when it is done with recovery. We do have the option here of using a
standard wait mechanism, wait_on_bit() or something similar. I'll see if
any of those would work here without adding too much to the code.
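For example, a wait queue that recover_done() wakes up would look
something like this (a sketch; ls_recover_wait would be a new field
initialized at mount, and the helper name is made up):

        /* helper to sample the generation under the spinlock */
        static int recover_gen_changed(struct lm_lockstruct *ls, uint32_t gen)
        {
                int changed;

                spin_lock(&ls->ls_recover_spin);
                changed = (ls->ls_recover_start != gen);
                spin_unlock(&ls->ls_recover_spin);
                return changed;
        }

        /* at the end of gdlm_recover_done() */
        wake_up(&ls->ls_recover_wait);

        /* in control_first_done(), in place of the msleep(500) */
        spin_unlock(&ls->ls_recover_spin);
        fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
        wait_event(ls->ls_recover_wait, recover_gen_changed(ls, start_gen));
        goto restart;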
> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +                                 unsigned int result)
> > +{
> > +        struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +        /* don't care about the recovery of own journal during mount */
> > +        if (jid == ls->ls_jid)
> > +                return;
> > +
> > +        /* another node is recovering the journal, give it a chance to
> > +           finish before trying again */
> > +        if (result == LM_RD_GAVEUP)
> > +                msleep(1000);
> Again, let's put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.
In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can. I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.
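i.e. something like this in place of the msleep (the workqueue and work
item names are placeholders for however the retry ends up being driven):

        /* another node is recovering the journal, give it a chance to
           finish before trying again */
        if (result == LM_RD_GAVEUP)
                queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, HZ);
        else
                queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);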
> > +        error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +                                  &ops, &ls->ls_dlm);
> > +
> > +        if (error == -EOPNOTSUPP) {
> > +                /*
> > +                 * dlm does not support ops callbacks,
> > +                 * old dlm_controld/gfs_controld are used, try without ops.
> > +                 */
> > +                fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +                free_recover_size(ls);
> > +
> > +                error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +                                          NULL, &ls->ls_dlm);
> > +                if (error)
> > +                        fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +                return error;
> > +        }
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> let's just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.
I went with this because I thought it was simpler than adding a second
return value for the ops status. It would also let us simply drop the
special case in the future. The alternative is:
int dlm_new_lockspace(const char *name, const char *cluster,
                      uint32_t flags, int lvblen,
                      struct dlm_lockspace_ops *ops, void *ops_arg,
                      int *ops_error, dlm_lockspace_t **lockspace);
I'm willing to try that if you think it's clearer.
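The caller side would then look something like this (a sketch; the ops
struct and ops_arg are whatever we would pass for the ops today):

        int ops_error;

        error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
                                  &gdlm_lockspace_ops, sdp, &ops_error,
                                  &ls->ls_dlm);
        if (error)
                return error;

        if (ops_error < 0) {
                /* dlm does not support ops callbacks, old
                   dlm_controld/gfs_controld are used, recovery is
                   driven without the ops */
                fs_info(sdp, "dlm lockspace ops not used\n");
                free_recover_size(ls);
        }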
Dave