Dear Heartbeat user community (and Masters),
I'm having a lot of trouble getting a "simple" DRBD/NFS
Active/Passive configuration working in a 2-node cluster as soon as I
add features meant to increase availability in case of network
failure: STONITH (suicide) and pingd (failover if the default gateway
becomes unreachable).
What I'm trying to do is pretty simple: if the network is detected as
down (by pingd), the DC asks the "isolated" node's lrmd to gracefully
shut down the resources (in particular, to force the DRBD volume back
into the "Secondary" state), so that the failover node can start the
resources as soon as possible. Using STONITH to reboot the
network-isolated node would also ensure that it becomes the new backup
as soon as the network outage is fixed.
Useful data:
- HB version: 2.0.7, used in R2 (CRM) config mode (OS: FC5)
- Both network (bonded NICs) and serial interfaces are used for
the heartbeat communication
- Here's the ha.cf:
use_logd yes
record_pengine_inputs off
enable_config_writes off
record_config_changes off
keepalive 1
deadtime 10
warntime 2
initdead 20
udpport 694
autojoin none
msgfmt netstring
#Serial Config
baud 115200
serial /dev/ttyS0
#Soft Watchdog
watchdog /dev/watchdog
auto_failback off
crm yes
traditional_compression false
ucast bond0.52 192.168.52.12
node fs1-certif fs2-certif
ping dgw
- And here's the cib.xml:
<cib>
<configuration>
<crm_config>
<cluster_property_set id="default">
<attributes>
<nvpair id="symmetric_cluster" name="symmetric_cluster"
value="true"/>
<nvpair id="no_quorum_policy" name="no_quorum_policy"
value="ignore"/>
<nvpair id="default_resource_stickiness"
name="default_resource_stickiness" value="500"/>
<nvpair id="default_resource_failure_stickiness"
name="default_resource_failure_stickiness" value="-100"/>
<nvpair id="cib-bootstrap-options-default_action_timeout"
name="default_action_timeout" value="10s"/>
<nvpair id="stonith_enabled" name="stonith_enabled"
value="true"/>
<nvpair id="stonith_action" name="stonith_action"
value="reboot"/>
<nvpair id="stop_orphan_resources"
name="stop_orphan_resources" value="false"/>
<nvpair id="stop_orphan_actions" name="stop_orphan_actions"
value="false"/>
<nvpair id="remove_after_stop" name="remove_after_stop"
value="false"/>
<nvpair id="short_resource_names" name="short_resource_names"
value="true"/>
<nvpair id="transition_idle_timeout"
name="transition_idle_timeout" value="20s"/>
<nvpair id="is_managed_default" name="is_managed_default"
value="true"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes/>
<resources>
<group id="group_1">
<primitive class="heartbeat" id="drbddisk_1"
provider="heartbeat" type="drbddisk">
<operations>
<op id="drbddisk_1_mon" interval="5min" name="monitor"
timeout="5s"/>
</operations>
<instance_attributes id="drbd">
<attributes>
<nvpair id="drbddisk_1_attr_1" name="1"
value="drbd-resource-0"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" id="fuc_2" provider="heartbeat" type="fuc">
<operations>
<op id="fuc_2_mon" interval="4min" name="monitor"
timeout="5s"/>
</operations>
</primitive>
<primitive class="heartbeat" id="Filesystem_3"
provider="heartbeat" type="Filesystem">
<operations>
<op id="Filesystem_3_start" name="start" timeout="15s"/>
<op id="Filesystem_3_stop" name="stop" timeout="15s"/>
<op id="Filesystem_3_mon" interval="3min" name="monitor"
timeout="5s"/>
</operations>
<instance_attributes id="mount">
<attributes>
<nvpair id="Filesystem_3_attr_1" name="1"
value="/dev/drbd0"/>
<nvpair id="Filesystem_3_attr_2" name="2"
value="/mnt/centile"/>
<nvpair id="Filesystem_3_attr_3" name="3" value="ext3"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" id="nfs-ha_4" provider="heartbeat"
type="nfs-ha">
<operations>
<op id="nfs-ha_4_mon" interval="30s" name="monitor"
timeout="5s"/>
</operations>
</primitive>
<primitive class="lsb" id="nfslock-ha_5" provider="heartbeat"
type="nfslock-ha">
<operations>
<op id="nfslock-ha_5_mon" interval="50s" name="monitor"
timeout="5s"/>
</operations>
</primitive>
<primitive class="heartbeat" id="IPaddr2_6"
provider="heartbeat" type="IPaddr2">
<operations>
<op id="IPaddr2_6_mon" interval="30s" name="monitor"
timeout="5s"/>
</operations>
<instance_attributes id="vip">
<attributes>
<nvpair id="IPaddr2_6_attr_1" name="1"
value="192.168.52.19/24/bond0.52"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" id="edns-ha_7" provider="heartbeat"
type="edns-ha">
<operations>
<op id="edns-ha_7_mon" interval="30s" name="monitor"
timeout="5s"/>
</operations>
</primitive>
</group>
<clone id="pingd">
<instance_attributes id="pingd">
<attributes>
<nvpair id="pingd-clone_node_max" name="clone_node_max"
value="1"/>
<nvpair id="pingd-dampen" name="dampen" value="5s"/>
<nvpair id="pingd-multiplier" name="multiplier" value="100"/>
<nvpair id="pingd-pidfile" name="pidfile"
value="/tmp/pingd.pid"/>
<nvpair id="pingd-user" name="user" value="hacluster"/>
</attributes>
</instance_attributes>
<primitive id="Check_DefaultGW" provider="heartbeat" class="ocf"
type="pingd">
<operations>
<op id="Check_Default_GW-monitor" name="monitor"
interval="60s" timeout="10s" prereq="nothing"/>
<op id="Check_Default_GW-start" name="start" prereq="nothing"/>
</operations>
</primitive>
</clone>
<clone id="DoFencing">
<instance_attributes id="DoFencing">
<attributes>
<nvpair id="fencing-clone_max" name="clone_max" value="2"/>
<nvpair id="fencing-clone_node_max" name="clone_node_max"
value="1"/>
</attributes>
</instance_attributes>
<primitive id="child_DoFencing" class="stonith" type="suicide"
provider="heartbeat">
<operations>
<op id="Fencing-monitor" name="monitor" interval="60s"
timeout="10s" prereq="nothing"/>
<op id="Fencing-start" name="start" timeout="20s"
prereq="nothing"/>
</operations>
</primitive>
</clone>
</resources>
<constraints>
<rsc_location id="rsc_location_group_1" rsc="group_1">
<rule id="prefered_location_group_1" score="100">
<expression attribute="#uname"
id="prefered_location_group_1_expr" operation="eq" value="fs1-certif"/>
</rule>
</rsc_location>
<rsc_location id="group_1:not_connected" rsc="group_1">
<rule id="group_1:not_connected:rule" score="-INFINITY">
<expression id="group_1:not_connected:expr"
attribute="pingd" operation="not_defined"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
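By the way, from what I understand of the pingd docs, when pingd is
running but the gateway is unreachable, the "pingd" attribute is set to
0 rather than removed, so my "not_defined" rule alone may never fire.
If that's right, I guess the location constraint should look more like
this (untested sketch, the ids are made up):

```xml
<rsc_location id="group_1:not_connected" rsc="group_1">
  <rule id="group_1:not_connected:rule" score="-INFINITY"
      boolean_op="or">
    <!-- fires while pingd has not (yet) set the attribute -->
    <expression id="group_1:not_connected:undef"
        attribute="pingd" operation="not_defined"/>
    <!-- fires when pingd is running but no ping node answers -->
    <expression id="group_1:not_connected:zero"
        attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```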
Test cases:
* STONITH and pingd disabled in the cib.xml
Everything works fine, but the resources (like nfsd) are unavailable
to their clients during the whole network downtime. No split-brain,
probably thanks to the serial cable, but the resources remain
unavailable during that period.
* pingd enabled, STONITH disabled in the cib.xml
The failover is correctly started when pingd detects that its ping
node is unreachable. The problem is that the HB status of the isolated
node becomes OFFLINE while the resources are still running on it!
* pingd and STONITH enabled
The network outage is detected by pingd, and the isolated node (which
is the DC in my test) shuts down all its resources, but the backup
node doesn't restart them. The STONITH suicide operation never
occurred either. Maybe the backup node waits for the STONITH ack
before starting the resources?
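One thing I'm suspicious about for that last case: as far as I
understand, the suicide plugin can only kill the node it runs on, so
the surviving node may have no way to execute (and confirm) the fence
of the isolated one. If that's the issue, a plugin operable from the
peer would be needed; a hypothetical sketch using external/ssh (a
test-only plugin; the hostlist parameter is assumed from its docs):

```xml
<primitive id="child_DoFencing" class="stonith" type="external/ssh"
    provider="heartbeat">
  <operations>
    <op id="Fencing-monitor" name="monitor" interval="60s"
        timeout="10s" prereq="nothing"/>
    <op id="Fencing-start" name="start" timeout="20s"
        prereq="nothing"/>
  </operations>
  <instance_attributes id="fencing_attrs">
    <attributes>
      <!-- nodes this device is allowed to fence (assumed parameter) -->
      <nvpair id="fencing-hostlist" name="hostlist"
          value="fs1-certif fs2-certif"/>
    </attributes>
  </instance_attributes>
</primitive>
```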
Before sending a lot of traces, and because I'm *sure* I made a
mistake somewhere in the cib.xml, I'd rather you point out my obvious
errors first :). I can easily reproduce the problems here, so I'm
ready to feed the thread with logs if needed :)
I hope I was clear enough, even if my English looks pretty wacky to
your eyes :) Sorry, I'm French!
Many thanks for your help and for this great program! :)
Regards,
-Yann
_______________________________________________
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems