On Tue, May 3, 2011 at 3:22 PM, Adeodato Simo <[email protected]> wrote:

Hi,

Thanks, Dato, for sharing these.

> Hi,
>
> Guido is going to be taking over this, and we agreed that I send my
> most recent version of the patch series to the list, since it should
> be in relatively good form already.
>

The plan is for me to review them and apply them with as few changes
as possible, perhaps as part of a longer series with patches from me
in between (so we can keep authorship consistent in git).

>  - it's rebased against current master, and adjusted to work against it
>
>  - locking in LUClusterVerifyGroup has been updated not to lock the
>    entire cluster (this was one of important things to fix)
>
>  - a `--node-group` option has been added to `gnt-cluster verify` in
>    the last patch, which hadn't been included in the original series
>
>  - some minor fixes
>

Thanks for fixing these! :)

> I know of the **following standing issues**, which I'll present to Guido
> later today:
>
>  1. Regarding locking, as mentioned LUClusterVerifyGroup now locks only
>     the nodes/instances it needs, which is what Iustin indicated it
>     should do. But I'm unsure what the locks in LUClusterVerifyConfig
>     should be: I have an uncommitted diff that locks everything, but I
>     can't remember if that's what was concluded or not.
>

I think the best thing would be to grab just the "in-config" leaf
lock, so that it verifies only "internal" consistency but allows other
work to proceed. This is complicated a bit by the fact that ongoing
jobs may modify the config outside of the config lock, but that
shouldn't be a big deal in practice (and if it is, it needs a better
fix anyway than locking everything at that stage).
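The idea above can be sketched generically. This is a hypothetical illustration in plain Python, not the actual Ganeti locking API: a verify step that holds only a leaf "config" lock and checks internal consistency of the configuration, while node-level work can proceed under separate node locks.

```python
import threading

# Hypothetical sketch (not real Ganeti code): a leaf config lock, held
# only while checking the configuration's internal consistency.
config_lock = threading.Lock()

def verify_config(config):
    """Check internal consistency of the config under the leaf lock only.

    'config' is an illustrative dict with "nodes" (a set of node names)
    and "instances" (mapping instance name -> primary node name).
    """
    with config_lock:
        # Example invariant: every instance's primary node must exist.
        return [inst for inst, pnode in config["instances"].items()
                if pnode not in config["nodes"]]
```

Jobs operating on nodes would take their own node locks and never touch `config_lock` except when writing the config, which is why verification under this lock alone cannot see concurrent out-of-lock modifications.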

>  2. Regarding file checksums: in aef59ae, Michael refactored the
>     _VerifyNodeFiles test to happen once per cluster, rather than once
>     per node. With this patch series, it happens once per group.
>
>     It is necessary, however, that the checksums from the master are
>     always present (so that verification actually works). Prior to
>     Michael's patch, these were obtained locally via
>     utils.FingerprintFiles(), but now the ones from all_nvinfo are used.
>
>     I've changed the code to do an extra RPC call to the master in all
>     groups but the one that holds the master node; but I was left
>     wondering whether we shouldn't go back to utils.FingerprintFiles()
>     for this.
>
>     And, if we keep the extra RPC call, is a lock on the master needed?
>

Not unless some other job can modify those files. Note that if a job
can modify them without locking the master, then locking here is
pointless anyway. To give you an example:
- suppose we verify the checksum of config.data on the master and the
candidates, and we lock all nodes in a group
- at the same time an instance is being added in another group, which
locks the relevant node and modifies the config
- then no matter what, the checksum check risks failing, because the
files were modified on the master by a job that didn't lock the master
(so I guess more work might be needed for that)
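To make the comparison concrete, here is a small sketch in the spirit of utils.FingerprintFiles; the function names and digest choice are illustrative, not the real Ganeti API. It shows why the check can fail spuriously: if another job rewrites a file on the master between fingerprinting and comparison, the node-side digests no longer match.

```python
import hashlib

def fingerprint(path):
    """Return the sha1 hex digest of a file's contents (illustrative)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def check_against_master(master_sums, node_sums):
    """Report files whose checksum on a node differs from the master's.

    Both arguments map file name -> hex digest. Without a lock on the
    master, master_sums can be stale by the time node_sums is gathered,
    producing false mismatches (the race described above).
    """
    return [f for f, digest in node_sums.items()
            if f in master_sums and master_sums[f] != digest]
```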

>  3. I see that, as of 6a37364, ResultWithJobs has been introduced, but
>     that nothing is using it yet. This patch series, as it stands,
>     still does the old agreed-upon behavior of "return the list of
>     group names from the global LUClusterVerifyConfig operation".
>
>     This should be a good candidate for using ResultWithJobs, but I'm
>     thinking perhaps it'll be wiser to do it in a later cleanup series?
>

Yes, no problem with this, I can look into adding more patches to
actually do this after this series.

>  4. Patch 8/11 may need attention: it's important because, without it,
>     each node in the cluster would try to contact (NV_NODELIST) every
>     other online node in the cluster. I discussed this with Iustin, and
>     it seemed too much; we agreed on contacting all online nodes in the
>     current group, plus one node from every other cluster.
>

Every other group, yes; that sounds like a good approach. One or two
nodes per group, no more.

>     However, contacted nodes from other groups are not being locked:
>     should they be? Should we just restrict NV_NODELIST to online nodes
>     in the same group, and be done with it?
>

No, I don't think we should lock nodes just for contacting them. Node
locks exist to keep operations on a node from conflicting with each
other. We may even decide in the future to split them into classes of
operations that don't conflict with one another; in the meantime they
are unique, but there's no need to grab them for no-ops.
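The agreed target selection for NV_NODELIST (all online nodes in the
current group, plus one node from every other group) could look
roughly like this. The helper name and data shapes are hypothetical,
purely to illustrate the scheme, not the actual patch:

```python
import random

def nodelist_targets(my_group, groups):
    """Pick which nodes a node should try to contact for NV_NODELIST.

    'groups' maps group name -> list of online node names. Returns all
    online nodes in my_group plus one randomly chosen online node from
    each other group (illustrative helper, not real Ganeti code).
    """
    targets = list(groups[my_group])
    for name, nodes in groups.items():
        if name != my_group and nodes:
            targets.append(random.choice(nodes))
    return targets
```

This keeps the per-node contact count close to the group size instead of the cluster size, which is the point of patch 8/11.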

>  5. I haven't verified whether QA tests need updates to explicitly run cluster
>     verify in a multi-group environment, or if that's happening already.
>

Thanks, I will verify that.

Guido
