Hi,

Guido is going to be taking over this, and we agreed that I'd send my
most recent version of the patch series to the list, since it should
already be in relatively good shape.

Compared with the version I sent some weeks ago, the following has changed:

  - it's rebased against current master, and adjusted to work against it

  - locking in LUClusterVerifyGroup has been updated so it no longer
    locks the entire cluster (this was one of the important things to fix)

  - a `--node-group` option has been added to `gnt-cluster verify` in
    the last patch, which hadn't been included in the original series

  - some minor fixes

I know of the following standing issues, which I'll present to Guido
later today:

  1. Regarding locking, as mentioned LUClusterVerifyGroup now locks only
     the nodes/instances it needs, which is what Iustin indicated it
     should do. But I'm unsure what the locks in LUClusterVerifyConfig
     should be: I have an uncommitted diff that locks everything, but I
     can't remember if that's what was concluded or not.
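     As a rough illustration of the group-scoped locking above, here's a
     minimal sketch (function and dict names are mine, not Ganeti's
     actual API) of how the lock sets for one group could be computed:

     ```python
     def locks_for_group(group, nodes_by_group, instances_by_node):
         """Return the node/instance lock sets needed to verify one group.

         Only the nodes belonging to the given group, and the instances
         living on those nodes, are locked -- not the whole cluster.
         """
         nodes = set(nodes_by_group.get(group, ()))
         instances = set()
         for node in nodes:
             instances.update(instances_by_node.get(node, ()))
         return {"node": nodes, "instance": instances}
     ```

     The open question in the LU is whether LUClusterVerifyConfig should
     instead ask for everything, i.e. the union over all groups.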

  2. Regarding file checksums: in aef59ae, Michael refactored the
     _VerifyNodeFiles test to happen once per cluster, rather than once
     per node. With this patch series, it happens once per group.

     It is necessary, however, that the checksums from the master are
     always present (so that verification actually works). Prior to
     Michael's patch, these were obtained locally via
     utils.FingerprintFiles(), but now the ones from all_nvinfo are used.

     I've changed the code to do an extra RPC call to the master in all
     groups but the one that holds the master node; but I was left
     wondering whether we shouldn't go back to utils.FingerprintFiles()
     for this.

     And, if we keep the extra RPC call, is a lock on the master needed?
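     For reference, computing the checksums locally (the pre-aef59ae
     approach) amounts to something like the following sketch; this is
     an illustrative stand-in for utils.FingerprintFiles(), not its
     actual implementation, and the SHA-1 choice here is an assumption:

     ```python
     import hashlib
     import os

     def fingerprint_files(paths):
         """Return {path: sha1 hex digest} for each existing regular file.

         Missing paths are silently skipped, so callers can pass the full
         candidate list without checking existence first.
         """
         fingerprints = {}
         for path in paths:
             if os.path.isfile(path):
                 digest = hashlib.sha1()
                 with open(path, "rb") as fobj:
                     for chunk in iter(lambda: fobj.read(65536), b""):
                         digest.update(chunk)
                 fingerprints[path] = digest.hexdigest()
         return fingerprints
     ```

     Going back to this would avoid both the extra RPC and the question
     of locking the master, at the cost of trusting the local files.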

  3. I see that, as of 6a37364, ResultWithJobs has been introduced, but
     that nothing is using it yet. This patch series, as it stands,
     still does the old agreed-upon behavior of "return the list of
     group names from the global LUClusterVerifyConfig operation".

     This should be a good candidate for using ResultWithJobs, but I'm
     thinking perhaps it'll be wiser to do it in a later cleanup series?
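     To make the idea concrete, the ResultWithJobs pattern would let the
     config LU hand back one follow-up job per group instead of a list
     of group names. A toy sketch (class and opcode names are
     illustrative, not the real ones from 6a37364):

     ```python
     class ResultWithJobs:
         """LU result carrying job definitions for the master to submit."""
         def __init__(self, jobs):
             # each job is a list of opcodes executed as one job
             self.jobs = jobs

     def cluster_verify_config(group_names):
         """Return one per-group verification job for each node group."""
         jobs = [[("OP_CLUSTER_VERIFY_GROUP", name)] for name in group_names]
         return ResultWithJobs(jobs)
     ```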

  4. Patch 8/11 may need attention: it's important because, without it,
     each node in the cluster would try to contact (NV_NODELIST) every
     other online node in the cluster. I discussed this with Iustin, and
     it seemed too much; we agreed on contacting all online nodes in the
     current group, plus one node from every other group.

     However, contacted nodes from other groups are not being locked:
     should they be? Should we just restrict NV_NODELIST to online nodes
     in the same group, and be done with it?
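     The agreed-upon NV_NODELIST reduction could be sketched like this
     (names are illustrative; the representative node per foreign group
     is picked deterministically here, which is an assumption):

     ```python
     def build_nodelist(current_group, online_nodes_by_group):
         """Build the NV_NODELIST target set for one group.

         Includes every online node in the current group, plus a single
         representative online node from each other group.
         """
         nodes = list(online_nodes_by_group.get(current_group, ()))
         for group, members in sorted(online_nodes_by_group.items()):
             if group != current_group and members:
                 # one deterministic representative per foreign group
                 nodes.append(sorted(members)[0])
         return nodes
     ```

     Restricting to the same group would drop the loop over foreign
     groups entirely, which also makes the locking question moot.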

  5. I haven't verified whether the QA tests need updates to explicitly
     run cluster verify in a multi-group environment, or whether that
     happens already.

And this is all I have for now.

Cheers!

Adeodato Simo (11):
  Cluster verify: gather node/instance list in CheckPrereq
  Cluster verify: don't assume we're verifying all nodes/instances
  Cluster verify: master must be present for _VerifyFiles
  Cluster verify: make "instance runs in wrong node" node-driven
  Cluster verify: factor out error codes and functions
  Split LUClusterVerify into LUClusterVerify{Config,Group}
  Cluster verify: verify hypervisor parameters only once
  Cluster verify: make NV_NODELIST smaller
  Cluster verify: fix LV checks for split instances
  Cluster verify: check for nodes/instances with no group
  Cluster verify: accept a --node-group option

 lib/client/gnt_cluster.py      |   38 +++-
 lib/cmdlib.py                  |  485 ++++++++++++++++++++++++++++------------
 lib/opcodes.py                 |   17 ++-
 man/gnt-cluster.rst            |    7 +-
 test/ganeti.cmdlib_unittest.py |    4 +-
 5 files changed, 399 insertions(+), 152 deletions(-)

-- 
1.7.3.1
