Hello fellow linux-ha people
I have some scenarios where quorumd+heartbeat make stretch clusters somewhat
difficult to work with.
Summary of issues:
quorumd + 4 node cluster - short-lived have_quorum="false" caused by joining
nodes and removing nodes
quorumd + 4 node cluster - partition off the DC node and the partitioned DC node
never loses quorum.
Details below... feedback welcome.
Clayton
Setup:
4 node cluster + quorumd running on a 5th remote node
all hosts are SLES10 SP1 with heartbeat-2.0.8-0.19
Default quorumd timings (http://www.linux-ha.org/QuorumServerGuide)
Default crm_config timings
ha.cf -> crm true for heartbeat v2
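For reference, the relevant client-side configuration looks roughly like this - a sketch only; the host names are placeholders and the quorum-server directive names are taken on trust from the QuorumServerGuide linked above:

```
# ha.cf on each of the 4 cluster nodes (sketch, not the exact file)
crm true                       # heartbeat v2 CRM, as noted above
node node1 node2 node3 node4   # placeholder host names
# quorum-server client settings per the QuorumServerGuide, with the
# default timings left in place (directive names per that guide):
# cluster mycluster
# quorum_server quorumhost
```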
Case 1A: 4 node cluster + quorum server - cluster is quorate all IP
connections working.
Event: Ungracefully power off a node
Result: When the remaining nodes detect the powered-off node as dead, quorum
is momentarily set to false, then returns to true around 30 seconds
later.
Case 1B: Continue on from Case 1A
Event: Power on the node from Case 1A
Result: When the nodes re-vote/agree on the new DC, quorum is momentarily set
to false on all nodes.
Quorum is true around 30 seconds later on all nodes.
This is somewhat of a nuisance... no_quorum_policy="stop" seems sensible for
stretch clusters. However, these brief quorum changes cause resources to be
unnecessarily restarted. I've tried a wide variety of quorumd.conf timing
parameters, but all result in at least a momentary state of have_quorum="false".
This one is more problematic...
Case 2: 4 node cluster + quorum server - cluster is quorate all IP connections
working.
Event: Using iptables, entirely partition off the DC node so that it cannot
talk IP to anything.
Result: The partitioned DC node doesn't notice a loss of connectivity to the
quorum server and remains quorate.
The partitioned DC eventually takes over all resources.
At the same time the remaining nodes from the other partition
vote on a new DC, connect
to the quorum server and take over all resources.
End state: duplicate resources and data corruption.
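For reproducibility, the partition in Case 2 can be simulated with something like the following - a sketch to be run as root on the DC node; the chain positions are assumptions, and partition_off is just the matching cleanup:

```shell
# Wrap the iptables calls in functions so the isolation can be
# applied and later undone cleanly.
partition_on() {
  iptables -I INPUT 1 -j DROP    # drop all inbound IP traffic
  iptables -I OUTPUT 1 -j DROP   # drop all outbound IP traffic
}

partition_off() {
  iptables -D INPUT -j DROP      # remove the inbound drop rule
  iptables -D OUTPUT -j DROP     # remove the outbound drop rule
}
```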
Case 2 isn't a problem when STONITH is in place. The other 3 nodes would
STONITH the partitioned DC.
STONITH isn't particularly useful in stretch clusters. Instead, quorum seems
like a good choice as to whether a node should be available to run resources or not.
Given the two cases above, though, quorum isn't strictly a good choice on its
own - loss of quorum for "too" long would be enough to indicate the node
should stop its resources.
Case 1 can be worked around by using prereq="quorum" on the start/stop
operations of all resources.
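In CIB terms that workaround looks roughly like this (the resource type and ids are made up for illustration; prereq="quorum" is the heartbeat v2 operation attribute):

```
<primitive id="rsc_ip" class="ocf" provider="heartbeat" type="IPaddr">
  <operations>
    <!-- run start/stop only while this partition has quorum -->
    <op id="rsc_ip_start" name="start" timeout="60s" prereq="quorum"/>
    <op id="rsc_ip_stop"  name="stop"  timeout="60s" prereq="quorum"/>
  </operations>
</primitive>
```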
If quorum is false or undefined for too long, then the node can be rebooted by
a script. But Case 2 prevents this, as the DC never loses quorum.
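For what it's worth, such a watchdog script might be sketched as follows - the have_quorum parsing, grace period, and poll interval are all assumptions, and the actual reboot is left commented out:

```shell
#!/bin/sh
# quorum-watchdog.sh - hedged sketch: reboot the node if have_quorum
# stays false/undefined for too long. Not from the original post.

GRACE=120  # seconds quorum may stay false before acting (assumed value)

# Pull have_quorum="true|false" out of a cib header line; prints
# nothing if the attribute is absent.
parse_quorum() {
  echo "$1" | sed -n 's/.*have_quorum="\([^"]*\)".*/\1/p'
}

watch_quorum() {
  lost_since=0
  while :; do
    state=$(parse_quorum "$(cibadmin -Q 2>/dev/null | head -1)")
    now=$(date +%s)
    if [ "$state" = "true" ]; then
      lost_since=0                           # quorum is back; reset
    elif [ "$lost_since" -eq 0 ]; then
      lost_since=$now                        # start the grace timer
    elif [ $((now - lost_since)) -ge "$GRACE" ]; then
      logger "quorum lost for ${GRACE}s - rebooting"
      # reboot -f   # uncomment for real use
      return 1
    fi
    sleep 10
  done
}
```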
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems