corosync calculates certain timeouts based on the cluster size. As a result,
in a cluster with default corosync settings and more than ~30 nodes, the time
until corosync reestablishes a membership after a node failure can approach or
exceed 50-60 seconds. In an HA cluster, this can be too long, because the
watchdog may start fencing nodes after ~60 seconds without a membership.
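
For illustration, here is a rough sketch of how these timeouts scale with the
node count. It assumes the defaults described in corosync.conf(5) (token =
3000 ms, token_coefficient = 650 ms, consensus = 1.2 * token timeout) and
approximates the time to a new membership as token timeout + consensus
timeout. Both the function names and that approximation are assumptions made
for this example, not something taken from the corosync sources.

```python
# Rough sketch of corosync's size-dependent timeouts.
# Assumed defaults (see corosync.conf(5)): token = 3000 ms,
# token_coefficient = 650 ms, consensus = 1.2 * token timeout.


def effective_token_ms(nodes, token_ms=3000, coefficient_ms=650):
    """Effective token timeout; the coefficient applies from 3 nodes on."""
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * coefficient_ms


def membership_estimate_ms(nodes, token_ms=3000, coefficient_ms=650):
    """Assumed approximation: token timeout plus consensus timeout."""
    token = effective_token_ms(nodes, token_ms, coefficient_ms)
    consensus = token * 12 // 10  # 1.2 * token, in integer milliseconds
    return token + consensus


if __name__ == "__main__":
    for n in (3, 16, 30, 40):
        print(n, effective_token_ms(n), membership_estimate_ms(n))
```

Under these assumptions, a 30-node cluster ends up at roughly 47 seconds and a
40-node cluster at roughly 61 seconds, which is in line with the numbers above.
Passing a smaller coefficient_ms shows how lowering the token coefficient
shortens the estimate.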

To avoid this, create new corosync clusters with a lower token coefficient,
which results in smaller timeouts. See patch #2 for more details.

In a future patch series, we may also want to detect problematic settings at
certain points (e.g. when adding a new node or in `pvecm status`) and notify
the user accordingly. This would also benefit users whose existing clusters
have more than ~30 nodes.

pve-cluster:

Friedrich Weber (2):
  corosync: create config: allow setting token coefficient
  api: cluster config: create new clusters with lower token coefficient

 src/PVE/API2/ClusterConfig.pm | 9 +++++++++
 src/PVE/Corosync.pm           | 3 +++
 2 files changed, 12 insertions(+)


pve-docs:

Friedrich Weber (1):
  pvecm: config: document how to change the token coefficient

 pvecm.adoc | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)


Summary over all repositories:
  3 files changed, 29 insertions(+), 0 deletions(-)

-- 
Generated by git-murpp 0.8.1


