Hi,
Guido is going to be taking over this, and we agreed that I'd send the
most recent version of the patch series to the list, since it's already
in relatively good shape.
Compared to the version I sent some weeks ago, the following has changed:
- it's rebased against current master, and adjusted to work against it
- locking in LUClusterVerifyGroup has been updated not to lock the
entire cluster (this was one of the important things to fix)
- a `--node-group` option has been added to `gnt-cluster verify` in
the last patch, which hadn't been included in the original series
- some minor fixes
I know of the following standing issues, which I'll present to Guido
later today:
1. Regarding locking, as mentioned LUClusterVerifyGroup now locks only
the nodes/instances it needs, which is what Iustin indicated it
should do. But I'm unsure what the locks in LUClusterVerifyConfig
should be: I have an uncommitted diff that locks everything, but I
can't remember if that's what was concluded or not.
2. Regarding file checksums: in aef59ae, Michael refactored the
_VerifyNodeFiles test to happen once per cluster, rather than once
per node. With this patch series, it happens once per group.
It is necessary, however, that the checksums from the master are
always present (so that verification actually works). Prior to
Michael's patch, these were obtained locally via
utils.FingerprintFiles(), but now the ones from all_nvinfo are used.
I've changed the code to do an extra RPC call to the master in all
groups but the one that holds the master node; but I was left
wondering whether we shouldn't go back to utils.FingerprintFiles()
for this.
And, if we keep the extra RPC call, is a lock on the master needed?
3. I see that, as of 6a37364, ResultWithJobs has been introduced, but
that nothing is using it yet. This patch series, as it stands,
still implements the previously agreed behavior of "return the list of
group names from the global LUClusterVerifyConfig operation".
This should be a good candidate for using ResultWithJobs, but I'm
thinking perhaps it'll be wiser to do it in a later cleanup series?
4. Patch 8/11 may need attention: it's important because, without it,
each node in the cluster would try to contact (NV_NODELIST) every
other online node in the cluster. I discussed this with Iustin, and
that seemed like too much; we agreed on contacting all online nodes in
the current group, plus one node from every other group.
However, contacted nodes from other groups are not being locked:
should they be? Should we just restrict NV_NODELIST to online nodes
in the same group, and be done with it?
5. I haven't verified whether QA tests need updates to explicitly run cluster
verify in a multi-group environment, or if that's happening already.
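For clarity, here is a small self-contained sketch of the node-selection
scheme from point 4 ("all online nodes in the current group, plus one
node from every other group"). This is purely illustrative, not code
from the series; the function name and the (name, group, online) tuple
shape are my own invention for the example:

```python
# Illustrative sketch only (not actual Ganeti code): pick the targets a
# node should contact for NV_NODELIST under the scheme agreed with
# Iustin: every online node in the node's own group, plus one online
# node from each other group.

def select_nodelist_targets(nodes, current_group):
    """nodes: iterable of (name, group_name, online) tuples.

    Returns the sorted list of node names to contact.
    """
    targets = []
    seen_other_groups = set()
    # Sorting makes the "one node per other group" choice deterministic.
    for name, group, online in sorted(nodes):
        if not online:
            continue  # offline nodes are never contacted
        if group == current_group:
            targets.append(name)  # all online nodes in our own group
        elif group not in seen_other_groups:
            seen_other_groups.add(group)
            targets.append(name)  # first online node of each other group
    return targets

nodes = [
    ("node1", "g1", True),
    ("node2", "g1", False),  # offline: skipped
    ("node3", "g2", True),
    ("node4", "g2", True),   # g2 already represented: skipped
    ("node5", "g3", True),
]
print(select_nodelist_targets(nodes, "g1"))  # ['node1', 'node3', 'node5']
```

This keeps cross-group connectivity checks at one probe per foreign
group instead of O(cluster size) per node; the open question from point
4 (whether those foreign-group nodes need locking) is unaffected by how
they are picked.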
And this is all I have for now.
Cheers!
Adeodato Simo (11):
Cluster verify: gather node/instance list in CheckPrereq
Cluster verify: don't assume we're verifying all nodes/instances
Cluster verify: master must be present for _VerifyFiles
Cluster verify: make "instance runs in wrong node" node-driven
Cluster verify: factor out error codes and functions
Split LUClusterVerify into LUClusterVerify{Config,Group}
Cluster verify: verify hypervisor parameters only once
Cluster verify: make NV_NODELIST smaller
Cluster verify: fix LV checks for split instances
Cluster verify: check for nodes/instances with no group
Cluster verify: accept a --node-group option
lib/client/gnt_cluster.py | 38 +++-
lib/cmdlib.py | 485 ++++++++++++++++++++++++++++------------
lib/opcodes.py | 17 ++-
man/gnt-cluster.rst | 7 +-
test/ganeti.cmdlib_unittest.py | 4 +-
5 files changed, 399 insertions(+), 152 deletions(-)
--
1.7.3.1