Hello fellow linux-ha people
I have some scenarios where quorumd+heartbeat make stretch clusters somewhat
difficult to work with.
Summary of issues:
quorumd + 4 node cluster - short-lived have_quorum="false" caused by joining
nodes and removing nodes
quorumd + 4 node cluster - partition off the DC node and the partitioned DC node
never loses quorum.
Details below... feedback welcome.
Clayton
Setup:
4 node cluster + quorumd running on a 5th remote node
all hosts are SLES10 SP1 with heartbeat-2.0.8-0.19
Default quorumd timings (http://www.linux-ha.org/QuorumServerGuide)
Default crm_config timings
ha.cf -> crm true for heartbeat v2
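For reference, the relevant client-side configuration looks roughly like this - a sketch only; the host names are placeholders and the quorum-server directive names are taken on trust from the QuorumServerGuide linked above:

```
# ha.cf on each of the 4 cluster nodes (sketch, not the exact file)
crm true                       # heartbeat v2 CRM, as noted above
node node1 node2 node3 node4   # placeholder host names
# quorum-server client settings per the QuorumServerGuide, with the
# default timings left in place (directive names per that guide):
# cluster mycluster
# quorum_server quorumhost
```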
Case 1A: 4 node cluster + quorum server - cluster is quorate all IP
connections working.
Event: Ungracefully power off a node
Result: When the remaining nodes detect the powered-off node as dead, quorum
is momentarily set to false, then returns to true around 30 seconds
later.
Case 1B: Continue on from Case 1A
Event: Power on the node from Case 1A
Result: When the nodes re-vote/agree on the new DC, quorum is momentarily set
to false on all nodes.
Quorum is true around 30 seconds later on all nodes.
This is somewhat of a nuisance... no_quorum_policy="stop" seems sensible for
stretch clusters. However, these brief quorum changes cause resources to be
unnecessarily restarted. I've tried a wide variety of quorumd.conf timing
parameters, but all result in at least a momentary state of have_quorum="false".
This one is more problematic...
Case 2: 4 node cluster + quorum server - cluster is quorate all IP connections
working.
Event: Using iptables, entirely partition off the DC node so that it cannot
talk IP to anything.
Result: The partitioned DC node doesn't notice a loss of connectivity to the
quorum server and remains quorate.
The partitioned DC eventually takes over all resources.
At the same time the remaining nodes from the other partition
vote on a new DC, connect
to the quorum server and take over all resources.
End state: duplicate resources and data corruption.
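For reproducibility, the partition in Case 2 can be simulated with something like the following - a sketch to be run as root on the DC node; the chain positions are assumptions, and partition_off is just the matching cleanup:

```shell
# Wrap the iptables calls in functions so the isolation can be
# applied and later undone cleanly.
partition_on() {
  iptables -I INPUT 1 -j DROP    # drop all inbound IP traffic
  iptables -I OUTPUT 1 -j DROP   # drop all outbound IP traffic
}

partition_off() {
  iptables -D INPUT -j DROP      # remove the inbound drop rule
  iptables -D OUTPUT -j DROP     # remove the outbound drop rule
}
```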
Case 2 isn't a problem when STONITH is in place. The other 3 nodes would
STONITH the partitioned DC.
STONITH isn't particularly useful in stretch clusters. Instead, quorum seems
like a good choice as to whether a node should be available to run resources or not.
Given the two cases above, though, quorum isn't strictly a good choice on its
own - loss of quorum for "too" long would be enough to indicate the node
should stop its resources.
Case 1 can be worked around by using prereq="quorum" on the start/stop
operations of all resources.
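In CIB terms that workaround looks roughly like this (the resource type and ids are made up for illustration; prereq="quorum" is the heartbeat v2 operation attribute):

```
<primitive id="rsc_ip" class="ocf" provider="heartbeat" type="IPaddr">
  <operations>
    <!-- run start/stop only while this partition has quorum -->
    <op id="rsc_ip_start" name="start" timeout="60s" prereq="quorum"/>
    <op id="rsc_ip_stop"  name="stop"  timeout="60s" prereq="quorum"/>
  </operations>
</primitive>
```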
If quorum is false or undefined for too long, then the node can be rebooted by
a script. But Case 2 prevents this, as the DC never loses quorum.
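For what it's worth, such a watchdog script might be sketched as follows - the have_quorum parsing, grace period, and poll interval are all assumptions, and the actual reboot is left commented out:

```shell
#!/bin/sh
# quorum-watchdog.sh - hedged sketch: reboot the node if have_quorum
# stays false/undefined for too long. Not from the original post.

GRACE=120  # seconds quorum may stay false before acting (assumed value)

# Pull have_quorum="true|false" out of a cib header line; prints
# nothing if the attribute is absent.
parse_quorum() {
  echo "$1" | sed -n 's/.*have_quorum="\([^"]*\)".*/\1/p'
}

watch_quorum() {
  lost_since=0
  while :; do
    state=$(parse_quorum "$(cibadmin -Q 2>/dev/null | head -1)")
    now=$(date +%s)
    if [ "$state" = "true" ]; then
      lost_since=0                           # quorum is back; reset
    elif [ "$lost_since" -eq 0 ]; then
      lost_since=$now                        # start the grace timer
    elif [ $((now - lost_since)) -ge "$GRACE" ]; then
      logger "quorum lost for ${GRACE}s - rebooting"
      # reboot -f   # uncomment for real use
      return 1
    fi
    sleep 10
  done
}
```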
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems