On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >          int ls_jid;
> >          unsigned int ls_first;
> > -        unsigned int ls_first_done;
> >          unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.
yes, I can use a flag in place of ls_first.
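Something like this, reusing the DFL_FIRST_MOUNT bit that this patch
already adds to ls_recover_flags (just a sketch of the shape):

        /* instead of setting/clearing ls->ls_first */
        set_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
        clear_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);

        /* instead of reading ls->ls_first */
        if (test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags))
                fs_info(sdp, "first mounter\n");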
> > +        int ls_recover_jid_done; /* read by gfs_controld */
> > +        int ls_recover_jid_status; /* read by gfs_controld */
>                                          ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.
I'll add a longer comment saying something like that.
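Roughly this (wording to be refined):

        /*
         * Only used by old versions of gfs_controld that read the
         * recovery state from here.  Recent gfs_controld versions
         * take their cue from the uevents instead, so these two
         * fields exist only for backwards compatibility and will
         * be removed at some future date.
         */
        int ls_recover_jid_done;
        int ls_recover_jid_status;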
> > +        /*
> > +         * Other nodes need to do some work in dlm recovery and gfs2_control
> > +         * before the recover_done and control_lock will be ready for us below.
> > +         * A delay here is not required but often avoids having to retry.
> > +         */
> > +
> > +        msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock than
> add delays of arbitrary time periods into the code.
I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below; but with NOQUEUE you have a
much better chance of being able to kill a mount command, which is a
fairly nice feature, I think.
- removing the delay, which results in nodes often making fast, repeated
lock attempts, which could get rather excessive. I'd be worried about
having that kind of unbounded loop sitting there.
- using some kind of delay (a sketch of the shape is below).
While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?
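For reference, the shape I have in mind for the delay option is below.
It's only a sketch: control_lock() stands for the NOQUEUE requests this
patch makes during mount, and the fatal_signal_pending() check is what
keeps the mount command killable.

        for (;;) {
                error = control_lock(sdp, DLM_LOCK_EX,
                                     DLM_LKF_NOQUEUE | DLM_LKF_VALBLK);
                if (error != -EAGAIN)
                        break;
                /* NOQUEUE request was rejected, another node holds the
                   lock; back off so the retries don't get excessive */
                msleep(500);
                if (fatal_signal_pending(current)) {
                        error = -EINTR;
                        break;
                }
        }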
> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +        struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +        char lvb_bits[GDLM_LVB_SIZE];
> > +        uint32_t start_gen, block_gen;
> > +        int error;
> > +
> > +restart:
> > +        spin_lock(&ls->ls_recover_spin);
> > +        start_gen = ls->ls_recover_start;
> > +        block_gen = ls->ls_recover_block;
> > +
> > +        if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +            !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +            !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +                /* sanity check, should not happen */
> > +                fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +                       start_gen, block_gen, ls->ls_recover_flags);
> > +                spin_unlock(&ls->ls_recover_spin);
> > +                control_unlock(sdp);
> > +                return -1;
> > +        }
> > +
> > +        if (start_gen == block_gen) {
> > +                /*
> > +                 * Wait for the end of a dlm recovery cycle to switch from
> > +                 * first mounter recovery. We can ignore any recover_slot
> > +                 * callbacks between the recover_prep and next recover_done
> > +                 * because we are still the first mounter and any failed nodes
> > +                 * have not fully mounted, so they don't need recovery.
> > +                 */
> > +                spin_unlock(&ls->ls_recover_spin);
> > +                fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +                msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.
This msleep just slows down a rare loop so that it wakes up a couple of
times rather than once, as it would with a proper wait mechanism. It's
waiting for the next recover_done() callback, which the dlm will call
when it is done with recovery. We do have the option here of using a
standard wait mechanism, wait_on_bit() or something similar. I'll see if
any of those would work here without adding too much to the code.
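For example, a wait queue that recover_done() wakes up would look
something like this (a sketch; ls_recover_wait would be a new field
initialized at mount, and the helper name is made up):

        /* helper to sample the generation under the spinlock */
        static int recover_gen_changed(struct lm_lockstruct *ls, uint32_t gen)
        {
                int changed;

                spin_lock(&ls->ls_recover_spin);
                changed = (ls->ls_recover_start != gen);
                spin_unlock(&ls->ls_recover_spin);
                return changed;
        }

        /* at the end of gdlm_recover_done() */
        wake_up(&ls->ls_recover_wait);

        /* in control_first_done(), in place of the msleep(500) */
        spin_unlock(&ls->ls_recover_spin);
        fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
        wait_event(ls->ls_recover_wait, recover_gen_changed(ls, start_gen));
        goto restart;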
> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +                                 unsigned int result)
> > +{
> > +        struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +        /* don't care about the recovery of own journal during mount */
> > +        if (jid == ls->ls_jid)
> > +                return;
> > +
> > +        /* another node is recovering the journal, give it a chance to
> > +           finish before trying again */
> > +        if (result == LM_RD_GAVEUP)
> > +                msleep(1000);
> Again, let's put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.
In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can. I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.
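i.e. something like this in place of the msleep (the workqueue and work
item names are placeholders for however the retry ends up being driven):

        /* another node is recovering the journal, give it a chance to
           finish before trying again */
        if (result == LM_RD_GAVEUP)
                queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, HZ);
        else
                queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);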
> > +        error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +                                  &ops, &ls->ls_dlm);
> > +
> > +        if (error == -EOPNOTSUPP) {
> > +                /*
> > +                 * dlm does not support ops callbacks,
> > +                 * old dlm_controld/gfs_controld are used, try without ops.
> > +                 */
> > +                fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +                free_recover_size(ls);
> > +
> > +                error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +                                          NULL, &ls->ls_dlm);
> > +                if (error)
> > +                        fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +                return error;
> > +        }
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> let's just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.
I went with this because I thought it was simpler than adding a second
return value for the ops status. It would also let us simply drop the
special case in the future. The alternative is:
int dlm_new_lockspace(const char *name, const char *cluster,
                      uint32_t flags, int lvblen,
                      struct dlm_lockspace_ops *ops, void *ops_arg,
                      int *ops_error, dlm_lockspace_t **lockspace);
I'm willing to try that if you think it's clearer.
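The caller side would then look something like this (a sketch; the ops
struct and ops_arg are whatever we would pass for the ops today):

        int ops_error;

        error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
                                  &gdlm_lockspace_ops, sdp, &ops_error,
                                  &ls->ls_dlm);
        if (error)
                return error;

        if (ops_error < 0) {
                /* dlm does not support ops callbacks, old
                   dlm_controld/gfs_controld are used, recovery is
                   driven without the ops */
                fs_info(sdp, "dlm lockspace ops not used\n");
                free_recover_size(ls);
        }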
Dave