I have heartbeat running on two nodes with the following resources:

  * 2 DRBD Master/Slave sets + 2 respective filesystems
  * 2 IP addresses
  * NFS

The two DRBD disks sync between the two servers, and heartbeat handles
promoting in the event one server fails. It also mounts the filesystem
and starts NFS on the master.

For the most part this is working very well; however, I am occasionally
experiencing a timeout in the monitor operation for DRBD, which causes
everything to break down. I have tried increasing the timeout from 10
seconds to 30 and then to 60 seconds, but a timeout has still occurred
even at 60 seconds. The reason for the timeout may be legitimate, but
when the monitor times out for one of the DRBD disks it breaks
everything. Because of how everything is set up, both disks must be
master on the same node. Instead, this is what's happening:

  - monitor for drbd1 fails on the master
  - the other node takes over as master for drbd1
  - the first node is still master for drbd0
  - none of the other resources are started (nfs, ip, etc)

So we end up with one DRBD disk primary on one node and the other disk
primary on the other node, which then requires manual intervention to
resolve. So far, if I manually migrate the remaining disk from the
first node to the second, everything else starts up.

Essentially, I need to make sure that if there is a problem with either
DRBD disk on either node, BOTH DRBD resources will be migrated to the
other node, not just the one affected.

Is there a way to do this with constraints or resource groups?
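In case it helps frame the question: what I imagine might be needed is a
colocation constraint tying the two DRBD masters together, something
along these lines. This is untested, and I'm guessing at the syntax
based on my existing rsc_colocation entries, so please correct me if
this isn't the right approach:

```xml
<!-- Untested sketch: force the ms-drbd1 master onto the same node
     as the ms-drbd0 master, so a failover of one drags the other. -->
<rsc_colocation id="drbd1_master_with_drbd0_master"
                from="ms-drbd1" from_role="master"
                to="ms-drbd0" to_role="master"
                score="infinity"/>
```

My hope is that with an INFINITY score, a monitor failure demoting one
master would make the other master move as well, rather than leaving
them split across the two nodes.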

I have attached the following:

  - cib.xml: excerpt of my cib.xml including resources and constraints
  - crm_mon.txt: output of crm_mon when everything is working correctly
  - crm_mon_problem.txt: output of crm_mon when the problem occurs

Thanks!

crm_mon.txt:

============
Last updated: Fri Mar 13 17:06:05 2009
Current DC: ramsbottom (d3e0959f-8eea-4a28-8fe0-e4d3d092945e)
2 Nodes configured.
3 Resources configured.
============

Node: ramsbottom (d3e0959f-8eea-4a28-8fe0-e4d3d092945e): online
Node: ladytot (e90db3e4-4a60-442d-9ac5-4038e4dcd666): online

Master/Slave Set: ms-drbd0
    drbd0:0     (ocf::heartbeat:drbd):  Master ramsbottom
    drbd0:1     (ocf::heartbeat:drbd):  Started ladytot
Master/Slave Set: ms-drbd1
    drbd1:0     (ocf::heartbeat:drbd):  Master ramsbottom
    drbd1:1     (ocf::heartbeat:drbd):  Started ladytot
Resource Group: group_1
    fs0 (ocf::heartbeat:Filesystem):    Started ramsbottom
    fs1 (ocf::heartbeat:Filesystem):    Started ramsbottom
    nfs_resource        (lsb:nfs-kernel-server):        Started ramsbottom
    ip_resource (ocf::heartbeat:IPaddr):        Started ramsbottom
    ip_resource2        (ocf::heartbeat:IPaddr):        Started ramsbottom

cib.xml (excerpt):

     <resources>
       <master_slave id="ms-drbd0">
         <meta_attributes id="ma-ms-drbd0">
           <attributes>
             <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
             <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
             <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
             <nvpair id="ma-ms-drbd0-7" name="target_role" value="#default"/>
           </attributes>
         </meta_attributes>
         <primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
           <instance_attributes id="ia-drbd0">
             <attributes>
               <nvpair id="ia-drbd0-1" name="drbd_resource" value="data"/>
             </attributes>
           </instance_attributes>
           <operations>
             <op id="op-drbd0-1" name="monitor" interval="2m" timeout="60s" role="Master"/>
             <op id="op-drbd0-2" name="monitor" interval="5m" timeout="60s" role="Slave"/>
           </operations>
         </primitive>
       </master_slave>
       <master_slave id="ms-drbd1">
         <meta_attributes id="ma-ms-drbd1">
           <attributes>
             <nvpair id="ma-ms-drbd1-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-drbd1-2" name="clone_node_max" value="1"/>
             <nvpair id="ma-ms-drbd1-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-drbd1-4" name="master_node_max" value="1"/>
             <nvpair id="ma-ms-drbd1-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-drbd1-6" name="globally_unique" value="false"/>
             <nvpair id="ma-ms-drbd1-7" name="target_role" value="#default"/>
           </attributes>
         </meta_attributes>
         <primitive id="drbd1" class="ocf" provider="heartbeat" type="drbd">
           <instance_attributes id="ia-drbd1">
             <attributes>
               <nvpair id="ia-drbd1-1" name="drbd_resource" value="webmail"/>
             </attributes>
           </instance_attributes>
           <operations>
             <op id="op-drbd1-1" name="monitor" interval="2m" timeout="60s" role="Master"/>
             <op id="op-drbd1-2" name="monitor" interval="5m" timeout="60s" role="Slave"/>
           </operations>
         </primitive>
       </master_slave>
       <group id="group_1">
         <primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
           <instance_attributes id="ia-fs0">
             <attributes>
               <nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
               <nvpair id="ia-fs0-2" name="directory" value="/export/data"/>
               <nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs1">
           <instance_attributes id="ia-fs1">
             <attributes>
               <nvpair id="ia-fs1-1" name="fstype" value="ext3"/>
               <nvpair id="ia-fs1-2" name="directory" value="/export/webmail"/>
               <nvpair id="ia-fs1-3" name="device" value="/dev/drbd1"/>
             </attributes>
           </instance_attributes>
           <instance_attributes id="fs1">
             <attributes>
               <nvpair id="fs1-target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive id="nfs_resource" class="lsb" type="nfs-kernel-server"/>
         <primitive id="ip_resource" class="ocf" type="IPaddr" provider="heartbeat">
           <instance_attributes id="ip_resource_attributes">
             <attributes>
               <nvpair id="failover_ip" name="ip" value="192.168.0.1"/>
               <nvpair id="failover_nic" name="nic" value="eth1"/>
               <nvpair id="failover_netmask" name="netmask" value="255.255.255.0"/>
             </attributes>
           </instance_attributes>
           <instance_attributes id="ip_resource">
             <attributes>
               <nvpair id="ip_resource-target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive id="ip_resource2" class="ocf" type="IPaddr" provider="heartbeat">
           <instance_attributes id="ip_resource2_attributes">
             <attributes>
               <nvpair id="ip_resource2_ip" name="ip" value="xxx.xxx.xxx.xxx"/>
               <nvpair id="ip_resource2_nic" name="nic" value="eth0"/>
               <nvpair id="ip_resource2_netmask" name="netmask" value="255.255.255.248"/>
             </attributes>
           </instance_attributes>
           <instance_attributes id="ip_resource2">
             <attributes>
               <nvpair id="ip_resource2-target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
       </group>
     </resources>
     <constraints>
       <rsc_location id="run_ip_resource" rsc="group_1">
         <rule id="preferered_location_group_1" score="100">
           <expression id="preferred_hostname" attribute="#uname" operation="eq" value="ramsbottom"/>
         </rule>
       </rsc_location>
       <rsc_order id="drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>
       <rsc_colocation id="fs0_on_drbd0" to="ms-drbd0" to_role="master" from="fs0" score="infinity"/>
       <rsc_order id="drbd1_before_fs1" from="fs1" action="start" to="ms-drbd1" to_action="promote"/>
       <rsc_colocation id="fs1_on_drbd1" to="ms-drbd1" to_role="master" from="fs1" score="infinity"/>
     </constraints>

crm_mon_problem.txt:

============
Last updated: Fri Mar 13 17:08:35 2009
Current DC: ramsbottom (d3e0959f-8eea-4a28-8fe0-e4d3d092945e)
2 Nodes configured.
3 Resources configured.
============

Node: ramsbottom (d3e0959f-8eea-4a28-8fe0-e4d3d092945e): online
Node: ladytot (e90db3e4-4a60-442d-9ac5-4038e4dcd666): online

Master/Slave Set: ms-drbd0
    drbd0:0     (ocf::heartbeat:drbd):  Master ramsbottom
    drbd0:1     (ocf::heartbeat:drbd):  Started ladytot
Master/Slave Set: ms-drbd1
    drbd1:0     (ocf::heartbeat:drbd):  Master ramsbottom
    drbd1:1     (ocf::heartbeat:drbd):  Started ladytot
Resource Group: group_1
    fs0 (ocf::heartbeat:Filesystem):    Started ramsbottom
    fs1 (ocf::heartbeat:Filesystem):    Started ramsbottom
    nfs_resource        (lsb:nfs-kernel-server):        Started ramsbottom
    ip_resource (ocf::heartbeat:IPaddr):        Started ramsbottom
    ip_resource2        (ocf::heartbeat:IPaddr):        Started ramsbottom
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
