Dear Heartbeat user community (and Masters),

I'm having a lot of trouble getting a "simple" DRBD/NFS Active/Passive configuration working in a 2-node cluster as soon as I add features meant to increase availability in case of network failure: STONITH (suicide) and pingd (failover if the default gateway becomes unreachable).

What I'm trying to do is pretty simple: if the network is detected as down (by pingd), the DC asks the lrmd on the "isolated" node to gracefully shut down the resources (in particular, to force the drbd volume back to the "Secondary" state), so that the failover node can start the resources as soon as possible. Using STONITH to reboot the network-isolated node would also ensure that it becomes the new backup as soon as the network outage is fixed.
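As I understand the pingd documentation (so this is my reading of it, not gospel), the chain I'm relying on is roughly:

```
ha.cf       : "ping dgw" declares the default gateway as a ping node
pingd clone : publishes a node attribute, roughly
              pingd = (number of reachable ping nodes) * multiplier
              i.e. 100 here when dgw answers, 0 (or undefined) otherwise
constraint  : a -INFINITY location rule evicts group_1 from any node
              where the pingd attribute is missing
```

Please correct me if that picture is already wrong somewhere.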


   Useful data:

        - HB version: 2.0.7 used in R2 config mode (OS: FC5)
        - Both network (bonded NICs) and serial interfaces used for heartbeat communication
        - here's the ha.cf:

use_logd yes
record_pengine_inputs off
enable_config_writes off
record_config_changes off
keepalive 1
deadtime 10
warntime 2
initdead 20
udpport 694
autojoin none
msgfmt netstring
#Serial Config
baud 115200
serial /dev/ttyS0
#Soft Watchdog
watchdog /dev/watchdog
auto_failback off
crm yes
traditional_compression false
ucast bond0.52 192.168.52.12
node fs1-certif fs2-certif
ping dgw


      - and the cib.xml:

<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="default">
        <attributes>
          <nvpair id="symmetric_cluster" name="symmetric_cluster" value="true"/>
          <nvpair id="no_quorum_policy" name="no_quorum_policy" value="ignore"/>
          <nvpair id="default_resource_stickiness" name="default_resource_stickiness" value="500"/>
          <nvpair id="default_resource_failure_stickiness" name="default_resource_failure_stickiness" value="-100"/>
          <nvpair id="cib-bootstrap-options-default_action_timeout" name="default_action_timeout" value="10s"/>
          <nvpair id="stonith_enabled" name="stonith_enabled" value="true"/>
          <nvpair id="stonith_action" name="stonith_action" value="reboot"/>
          <nvpair id="stop_orphan_resources" name="stop_orphan_resources" value="false"/>
          <nvpair id="stop_orphan_actions" name="stop_orphan_actions" value="false"/>
          <nvpair id="remove_after_stop" name="remove_after_stop" value="false"/>
          <nvpair id="short_resource_names" name="short_resource_names" value="true"/>
          <nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="20s"/>
          <nvpair id="is_managed_default" name="is_managed_default" value="true"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes/>
    <resources>

      <group id="group_1">
        <primitive class="heartbeat" id="drbddisk_1" provider="heartbeat" type="drbddisk">
          <operations>
            <op id="drbddisk_1_mon" interval="5min" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="drbd">
            <attributes>
              <nvpair id="drbddisk_1_attr_1" name="1" value="drbd-resource-0"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="fuc_2" provider="heartbeat" type="fuc">
          <operations>
            <op id="fuc_2_mon" interval="4min" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="heartbeat" id="Filesystem_3" provider="heartbeat" type="Filesystem">
          <operations>
            <op id="Filesystem_3_start" name="start" timeout="15s"/>
            <op id="Filesystem_3_stop" name="stop" timeout="15s"/>
            <op id="Filesystem_3_mon" interval="3min" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="mount">
            <attributes>
              <nvpair id="Filesystem_3_attr_1" name="1" value="/dev/drbd0"/>
              <nvpair id="Filesystem_3_attr_2" name="2" value="/mnt/centile"/>
              <nvpair id="Filesystem_3_attr_3" name="3" value="ext3"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="nfs-ha_4" provider="heartbeat" type="nfs-ha">
          <operations>
            <op id="nfs-ha_4_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="lsb" id="nfslock-ha_5" provider="heartbeat" type="nfslock-ha">
          <operations>
            <op id="nfslock-ha_5_mon" interval="50s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="heartbeat" id="IPaddr2_6" provider="heartbeat" type="IPaddr2">
          <operations>
            <op id="IPaddr2_6_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="vip">
            <attributes>
              <nvpair id="IPaddr2_6_attr_1" name="1" value="192.168.52.19/24/bond0.52"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="lsb" id="edns-ha_7" provider="heartbeat" type="edns-ha">
          <operations>
            <op id="edns-ha_7_mon" interval="30s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
      </group>

      <clone id="pingd">
         <instance_attributes id="pingd">
           <attributes>
             <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
             <nvpair id="pingd-dampen"     name="dampen" value="5s"/>
             <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
             <nvpair id="pingd-pidfile" name="pidfile" value="/tmp/pingd.pid"/>
             <nvpair id="pingd-user" name="user" value="hacluster"/>
           </attributes>
         </instance_attributes>
       <primitive id="Check_DefaultGW" provider="heartbeat" class="ocf" type="pingd">
        <operations>
           <op id="Check_Default_GW-monitor" name="monitor" interval="60s" timeout="10s" prereq="nothing"/>
           <op id="Check_Default_GW-start" name="start" prereq="nothing"/>
        </operations>
       </primitive>
     </clone>

    <clone id="DoFencing">
      <instance_attributes id="DoFencing">
       <attributes>
        <nvpair id="fencing-clone_max" name="clone_max" value="2"/>
        <nvpair id="fencing-clone_node_max" name="clone_node_max" value="1"/>
       </attributes>
      </instance_attributes>
     <primitive id="child_DoFencing" class="stonith" type="suicide" provider="heartbeat">
      <operations>
        <op id="Fencing-monitor" name="monitor" interval="60s" timeout="10s" prereq="nothing"/>
        <op id="Fencing-start" name="start" timeout="20s" prereq="nothing"/>
      </operations>
     </primitive>
   </clone>

    </resources>

    <constraints>
      <rsc_location id="rsc_location_group_1" rsc="group_1">
        <rule id="prefered_location_group_1" score="100">
          <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="fs1-certif"/>
        </rule>
      </rsc_location>
      <rsc_location id="group_1:not_connected" rsc="group_1">
          <rule id="group_1:not_connected:rule" score="-INFINITY">
            <expression id="group_1:not_connected:expr" attribute="pingd" operation="not_defined"/>
          </rule>
      </rsc_location>
    </constraints>

  </configuration>
</cib>
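By the way, the examples I've seen on the Linux-HA wiki's pingd page also catch the case where the pingd attribute is defined but equal to 0 (pingd running, but no ping node reachable). I'm not sure my not_defined-only rule above covers that case. A sketch of the wiki-style rule, if I read it correctly (the ids here are invented by me, not taken from my cib), would be:

```xml
<rsc_location id="group_1:pingd" rsc="group_1">
  <rule id="group_1:pingd:rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1:pingd:expr1" attribute="pingd" operation="not_defined"/>
    <expression id="group_1:pingd:expr2" attribute="pingd" operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

If that distinction matters for my test cases below, please tell me.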


Test cases:

      * STONITH and PINGD disabled in the cib.xml

Everything works fine, but the resources (like NFSd) are unavailable to their clients during the whole network downtime. No split-brain, thanks to the serial cable perhaps, but the resources remain unavailable during that period.

      * PINGD enabled, STONITH disabled in the cib.xml

The failover is correctly started when pingd detects its ping node as unreachable. The problem is that the HB status of the isolated node becomes OFFLINE while the resources are still running on it!!

      * PINGD and STONITH enabled

The network outage is detected by pingd, and the isolated node (which is the DC in my test) shuts down all its resources, but the backup node doesn't restart them. Also, the STONITH suicide operation never occurred. Maybe the backup node waits for the STONITH ack before starting the resources?


Before sending a lot of traces, and because I'm *sure* I made a mistake somewhere in the cib.xml file, I'd prefer you point out my obvious errors first :). I can easily reproduce the problems here, so I'm ready to feed the thread with logs if needed :)

I hope I was clear enough, even if my English looks pretty wacky to your eyes :) Sorry, I'm French!

Many thanks for your help and for this great program! :)

Regards,

-Yann



_______________________________________________
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
