corosync scales certain timeouts with the cluster size. As a result, in a cluster with default corosync settings and more than ~30 nodes, the time until corosync reestablishes a membership after a node failure can approach or exceed 50-60 seconds. In an HA cluster, this can be too long, because the watchdog may start fencing nodes after ~60 seconds without a membership.
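
To illustrate the scaling, here is a rough Python sketch. It assumes the formula and defaults documented in corosync.conf(5): effective token timeout = token + (nodes - 2) * token_coefficient for clusters with at least 3 nodes, with token = 3000 ms, token_coefficient = 650 ms, and consensus = 1.2 * token. It further assumes, as an approximation only, that forming a new membership takes about token + consensus. The function names and the lower coefficient of 125 ms are hypothetical and chosen for illustration; they are not taken from the patches.

```python
# Rough estimate of corosync membership timeouts (all values in milliseconds).
# Assumed defaults per corosync.conf(5): token=3000, token_coefficient=650,
# consensus = 1.2 * effective token timeout.

def effective_token_ms(nodes, token_ms=3000, token_coefficient_ms=650):
    """Token timeout after scaling with the number of nodes."""
    if nodes < 3:
        # The coefficient only applies with at least 3 nodes in the nodelist.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


def consensus_ms(token_ms):
    """Default consensus timeout: 1.2 times the effective token timeout."""
    return token_ms * 12 // 10  # integer math avoids float rounding


def membership_estimate_ms(nodes, token_coefficient_ms=650):
    """Approximation (an assumption): token timeout plus consensus timeout."""
    token = effective_token_ms(nodes, token_coefficient_ms=token_coefficient_ms)
    return token + consensus_ms(token)


if __name__ == "__main__":
    for n in (10, 30, 40):
        default = membership_estimate_ms(n)
        # 125 ms is a hypothetical lower coefficient, for comparison only.
        lowered = membership_estimate_ms(n, token_coefficient_ms=125)
        print(f"{n} nodes: default ~{default} ms, coefficient 125 ~{lowered} ms")
```

Under these assumptions, the estimate for 30 nodes with default settings is about 46.6 seconds, and for 40 nodes about 60.9 seconds, which matches the range described above.
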

To avoid this, create new corosync clusters with smaller timeouts. See patch #2 for more details.

In a future patch series, we may also want to detect problematic settings at certain points and notify the user accordingly (e.g. when adding a new node or in `pvecm status`). This will also benefit users with existing clusters with more than ~30 nodes.

pve-cluster:

Friedrich Weber (2):
  corosync: create config: allow setting token coefficient
  api: cluster config: create new clusters with lower token coefficient

 src/PVE/API2/ClusterConfig.pm | 9 +++++++++
 src/PVE/Corosync.pm           | 3 +++
 2 files changed, 12 insertions(+)

pve-docs:

Friedrich Weber (1):
  pvecm: config: document how to change the token coefficient

 pvecm.adoc | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Summary over all repositories:
  3 files changed, 29 insertions(+), 0 deletions(-)

-- 
Generated by git-murpp 0.8.1
