Dear Heartbeat user community (and Masters),

I'm having a lot of trouble getting a "simple" DRBD/NFS Active/Passive configuration working in a 2-node cluster as soon as I add features meant to increase availability in case of network failure: STONITH (suicide) and pingd (failover if the default gateway becomes unreachable).

What I'm trying to do is pretty simple: if the network is detected as down (by pingd), the DC asks the lrmd on the "isolated" node to gracefully shut down the resources (in particular, to force the drbd volume back to the "Secondary" state), so that the failover node can start the resources as soon as possible. Using STONITH to reboot the network-isolated node would also ensure that it becomes the new backup as soon as the network outage is fixed.
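As I understand the pingd documentation (so this is my reading of it, not gospel), the chain I'm relying on is roughly:

```
ha.cf       : "ping dgw" declares the default gateway as a ping node
pingd clone : publishes a node attribute, roughly
              pingd = (number of reachable ping nodes) * multiplier
              i.e. 100 here when dgw answers, 0 (or undefined) otherwise
constraint  : a -INFINITY location rule evicts group_1 from any node
              where the pingd attribute is missing
```

Please correct me if that picture is already wrong somewhere.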


   Useful data:

        - HB version: 2.0.7 used in R2 config mode (OS: FC5)
        - Both network (bonded NICs) and serial interfaces used for heartbeat communication
        - here's the ha.cf:

use_logd yes
record_pengine_inputs off
enable_config_writes off
record_config_changes off
keepalive 1
deadtime 10
warntime 2
initdead 20
udpport 694
autojoin none
msgfmt netstring
#Serial Config
baud 115200
serial /dev/ttyS0
#Soft Watchdog
watchdog /dev/watchdog
auto_failback off
crm yes
traditional_compression false
ucast bond0.52 192.168.52.12
node fs1-certif fs2-certif
ping dgw


      - and the cib.xml:

<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="default">
        <attributes>
          <nvpair id="symmetric_cluster" name="symmetric_cluster" value="true"/>
          <nvpair id="no_quorum_policy" name="no_quorum_policy" value="ignore"/>
          <nvpair id="default_resource_stickiness" name="default_resource_stickiness" value="500"/>
          <nvpair id="default_resource_failure_stickiness" name="default_resource_failure_stickiness" value="-100"/>
          <nvpair id="cib-bootstrap-options-default_action_timeout" name="default_action_timeout" value="10s"/>
          <nvpair id="stonith_enabled" name="stonith_enabled" value="true"/>
          <nvpair id="stonith_action" name="stonith_action" value="reboot"/>
          <nvpair id="stop_orphan_resources" name="stop_orphan_resources" value="false"/>
          <nvpair id="stop_orphan_actions" name="stop_orphan_actions" value="false"/>
          <nvpair id="remove_after_stop" name="remove_after_stop" value="false"/>
          <nvpair id="short_resource_names" name="short_resource_names" value="true"/>
          <nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="20s"/>
          <nvpair id="is_managed_default" name="is_managed_default" value="true"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes/>
    <resources>

      <group id="group_1">
        <primitive class="heartbeat" id="drbddisk_1" provider="heartbeat" type="drbddisk">
          <operations>
            <op id="drbddisk_1_mon" interval="5min" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="drbd">
            <attributes>
              <nvpair id="drbddisk_1_attr_1" name="1" value="drbd-resource-0"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="fuc_2" provider="heartbeat" type="fuc">
          <operations>
            <op id="fuc_2_mon" interval="4min" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="heartbeat" id="Filesystem_3" provider="heartbeat" type="Filesystem">
          <operations>
            <op id="Filesystem_3_start" name="start" timeout="15s"/>
            <op id="Filesystem_3_stop" name="stop" timeout="15s"/>
            <op id="Filesystem_3_mon" interval="3min" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="mount">
            <attributes>
              <nvpair id="Filesystem_3_attr_1" name="1" value="/dev/drbd0"/>
              <nvpair id="Filesystem_3_attr_2" name="2" value="/mnt/centile"/>
              <nvpair id="Filesystem_3_attr_3" name="3" value="ext3"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="nfs-ha_4" provider="heartbeat" type="nfs-ha">
          <operations>
            <op id="nfs-ha_4_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="lsb" id="nfslock-ha_5" provider="heartbeat" type="nfslock-ha">
          <operations>
            <op id="nfslock-ha_5_mon" interval="50s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="heartbeat" id="IPaddr2_6" provider="heartbeat" type="IPaddr2">
          <operations>
            <op id="IPaddr2_6_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="vip">
            <attributes>
              <nvpair id="IPaddr2_6_attr_1" name="1" value="192.168.52.19/24/bond0.52"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="edns-ha_7" provider="heartbeat" type="edns-ha">
          <operations>
            <op id="edns-ha_7_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
      </group>

      <clone id="pingd">
         <instance_attributes id="pingd">
           <attributes>
             <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
             <nvpair id="pingd-dampen"     name="dampen" value="5s"/>
             <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
             <nvpair id="pingd-pidfile" name="pidfile" value="/tmp/pingd.pid"/>
             <nvpair id="pingd-user" name="user" value="hacluster"/>
           </attributes>
         </instance_attributes>
       <primitive id="Check_DefaultGW" provider="heartbeat" class="ocf" type="pingd">
        <operations>
           <op id="Check_Default_GW-monitor" name="monitor" interval="60s" timeout="10s" prereq="nothing"/>
           <op id="Check_Default_GW-start" name="start" prereq="nothing"/>
        </operations>
       </primitive>
     </clone>

    <clone id="DoFencing">
      <instance_attributes id="DoFencing">
       <attributes>
        <nvpair id="fencing-clone_max" name="clone_max" value="2"/>
        <nvpair id="fencing-clone_node_max" name="clone_node_max" value="1"/>
       </attributes>
      </instance_attributes>
     <primitive id="child_DoFencing" class="stonith" type="suicide" provider="heartbeat">
      <operations>
        <op id="Fencing-monitor" name="monitor" interval="60s" timeout="10s" prereq="nothing"/>
        <op id="Fencing-start" name="start" timeout="20s" prereq="nothing"/>
      </operations>
     </primitive>
   </clone>

    </resources>

    <constraints>
      <rsc_location id="rsc_location_group_1" rsc="group_1">
        <rule id="prefered_location_group_1" score="100">
          <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="fs1-certif"/>
        </rule>
      </rsc_location>
      <rsc_location id="group_1:not_connected" rsc="group_1">
          <rule id="group_1:not_connected:rule" score="-INFINITY">
            <expression id="group_1:not_connected:expr" attribute="pingd" operation="not_defined"/>
          </rule>
      </rsc_location>
    </constraints>

  </configuration>
</cib>
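By the way, the examples I've seen on the Linux-HA wiki's pingd page also catch the case where the pingd attribute is defined but equal to 0 (pingd running, but no ping node reachable). I'm not sure my not_defined-only rule above covers that case. A sketch of the wiki-style rule, if I read it correctly (the ids here are invented by me, not taken from my cib), would be:

```xml
<rsc_location id="group_1:pingd" rsc="group_1">
  <rule id="group_1:pingd:rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1:pingd:expr1" attribute="pingd" operation="not_defined"/>
    <expression id="group_1:pingd:expr2" attribute="pingd" operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

If that distinction matters for my test cases below, please tell me.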


Test cases:

      * STONITH and PINGD disabled in the cib.xml

Everything works fine, but the resources (like NFSd) are unavailable to their clients during the whole network downtime. No split-brain, thanks to the serial cable perhaps, but the resources remain unavailable during that period.

      * PINGD enabled, STONITH disabled in the cib.xml

The failover is correctly started when pingd detects its ping node as unreachable. The problem is that the HB status of the isolated node becomes OFFLINE while the resources are still running on it!!

      * PINGD and STONITH enabled

The network outage is detected by pingd, and the isolated node (which is the DC in my test) shuts down all its resources, but the backup node doesn't restart them. Also, the STONITH suicide operation never occurred. Maybe the backup node waits for the STONITH ack before starting the resources?


Before sending a lot of traces, and because I'm *sure* I made a mistake somewhere in the cib.xml file, I'd prefer you point out my obvious errors first :). I can easily reproduce the problems here, so I'm ready to feed the thread with logs if needed :)

I hope I was clear enough, even if my English looks pretty wacky to your eyes :) Sorry, I'm French!

Many thanks for your help and for this great program! :)

Regards,

-Yann



_______________________________________________
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
