Hi,
On 05/02/2017 05:40 PM, Adam Carheden wrote:
> What's supposed to happen if two nodes in a 4-node HA cluster go offline?
If all of them have HA services configured, a full cluster reset may
happen.
If two nodes go offline, the whole cluster loses quorum, so all nodes
with an active watchdog (i.e. all nodes that have, or have had, active
services) will reset.
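To make the arithmetic concrete, here is a minimal Python sketch of the strict-majority rule, assuming one vote per node and no special votequorum options (no two_node, no QDevice). The function names are made up for illustration; this is not PVE or corosync code.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Strict majority: more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(partition_votes: int, expected_votes: int) -> bool:
    """A partition keeps quorum only if it holds at least the threshold."""
    return partition_votes >= quorum_threshold(expected_votes)


# 4-node cluster split 2/2, as in the question (one vote per node assumed)
expected = 4
for side_votes in (2, 2):
    state = "quorate" if is_quorate(side_votes, expected) else "no quorum"
    print(f"{side_votes} of {expected} votes: {state}")
```

With four expected votes the threshold is 3, so neither half of a 2/2 split is quorate.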
For such a situation, where there's a tie, an external voting arbitrator
would help. This could be a fifth (tiny) node or a corosync QDevice.
QDevices have the advantage that they can run on any newer Linux distro
that ships corosync (2.4 and newer, AFAIK), independent of the PVE stack.
They can provide arbitrator votes to multiple clusters and have fewer
constraints regarding network setup and latency, as the communication
happens over TCP.
This is usable from PVE, but we haven't documented it yet. I started on
that and need to pick it up again soon.
Just a note for any other readers: while this can boost reliability and
recovery in clusters with an even vote count (you can only 'win' there),
it can do the reverse in clusters with an odd node count.
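The even/odd difference can be illustrated with a small Python sketch. It uses a deliberately simplified model in which the arbitrator always contributes exactly one vote; real QDevice vote algorithms may behave differently, and the function names here are hypothetical.

```python
def quorum_threshold(total_votes: int) -> int:
    """Strict majority of all expected votes."""
    return total_votes // 2 + 1


def tolerated_node_failures(nodes: int, arbitrator_votes: int = 0,
                            arbitrator_up: bool = True) -> int:
    """How many nodes may fail while the survivors stay quorate.

    Simplified model: one vote per node, and the arbitrator's votes only
    count while it is reachable.
    """
    total = nodes + arbitrator_votes
    extra = arbitrator_votes if arbitrator_up else 0
    min_survivors = max(quorum_threshold(total) - extra, 0)
    return nodes - min_survivors


for n in (4, 5):
    print(f"{n} nodes:",
          "plain =", tolerated_node_failures(n),
          "| arbitrator up =", tolerated_node_failures(n, 1, True),
          "| arbitrator down =", tolerated_node_failures(n, 1, False))
```

Under this model a 4-node cluster never does worse with the extra vote, while a 5-node cluster tolerates fewer node failures whenever the arbitrator itself is unreachable.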
> I have a 4-node test cluster, two nodes are in one server room and the
> other two in another server room. I had HA inadvertently tested for me
> this morning due to an unexpected network issue and watchdog rebooted
> two of the nodes.
> I think this is the expected behavior, and certainly seems like what I
> want to happen. However, quorum is 3, not 2, so why didn't all 4 nodes
> reboot?
Because, if the `ha-manager status` output still mirrors the setup at
the time of the network failure (i.e. the same services configured on
the same nodes), I see that just one node has active services running.
We do not fence nodes which have no configured HA services, or whose
configured HA services are all disabled, as we think this would just
lower reliability for non-HA services without increasing reliability
for HA services.
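As a rough mental model of that rule, here is a Python sketch. It is not the actual PVE HA code: the `Node` type, its fields, and the node names are hypothetical, and it only encodes the condition described above (a node resets itself when it has lost quorum and its watchdog is active).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    watchdog_active: bool  # assumed True only if the node has (had) active HA services
    has_quorum: bool


def nodes_that_self_fence(nodes: List[Node]) -> List[str]:
    """Names of nodes expected to reset: quorum lost AND watchdog active."""
    return [n.name for n in nodes if n.watchdog_active and not n.has_quorum]


# Hypothetical 2/2 split: nobody has quorum, but only some nodes run HA services.
cluster = [
    Node("a", watchdog_active=True, has_quorum=False),
    Node("b", watchdog_active=False, has_quorum=False),
    Node("c", watchdog_active=True, has_quorum=False),
    Node("d", watchdog_active=False, has_quorum=False),
]
print(nodes_that_self_fence(cluster))
```

In this made-up example only "a" and "c" reset, even though all four nodes lost quorum.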
> # pvecm status
> Quorum information
> ------------------
> Date: Tue May 2 09:35:23 2017
> Quorum provider: corosync_votequorum
> Nodes: 4
> Node ID: 0x00000001
> Ring ID: 4/524
> Quorate: Yes
> Votequorum information
> ----------------------
> Expected votes: 4
> Highest expected: 4
> Total votes: 4
> Quorum: 3
> Flags: Quorate
> Membership information
> ----------------------
> Nodeid Votes Name
> 0x00000004 1 192.168.0.11
> 0x00000003 1 192.168.0.203
> 0x00000001 1 192.168.0.204 (local)
> 0x00000002 1 192.168.0.206
> # ha-manager status
> quorum OK
> master node3 (active, Tue May 2 09:35:24 2017)
> lrm node1 (idle, Tue May 2 09:35:27 2017)
> lrm node2 (active, Tue May 2 09:35:26 2017)
> lrm node3 (idle, Tue May 2 09:35:23 2017)
> lrm node3 (idle, Tue May 2 09:35:23 2017)
> Somehow proxmox was smart enough to keep two of the nodes online, but
> with a quorum of 3 neither group should have had quorum. How does it
> decide which group to keep online?
see above
Cheers,
Thomas
_______________________________________________
pve-user mailing list
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user