Hi

William Francis wrote:
Ubuntu 7.10 with DRBD 8.0.3 and Heartbeat 2.1.2 with an updated Filesystem file

kernel 2.6.22-14 (updated from stock)

I have possibly two problems, a heartbeat and a DRBD issue. My goal is
to get a pair of machines working with a large /opt partition for
zimbra (my mail server software) and a virtual IP.


1.  I can configure heartbeat and DRBD with a virtual IP with no
problems at all. I can start and stop heartbeat on the two machines,
and because of the colocations I have set up, the resources move
around properly. If I start zimbra manually on the machine that
currently has the /opt partition mounted and the virtual IP, it
works fine (I installed it with no issues).

I then add Zimbra, an LSB resource, like so:

<primitive id="zimbra" class="lsb" type="zimbra"/>

Did you check that this script is LSB compliant? If not, see http://wiki.linux-ha.org/LSBResourceAgent and adjust the script as necessary.
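A quick way to spot-check one of the exit codes the LSB requires ("status" must exit 3 when the service is stopped, 0 when it is running) is a sketch like the following. It uses a throwaway dummy script so it runs anywhere; on a real node you would point SCRIPT at /etc/init.d/zimbra instead:

```shell
#!/bin/sh
# LSB spot-check sketch: "status" must exit 3 for a stopped service.
# The dummy script below stands in for /etc/init.d/zimbra.
SCRIPT=$(mktemp)
cat > "$SCRIPT" <<'EOF'
#!/bin/sh
case "$1" in
  status) exit 3 ;;   # dummy: pretend the service is stopped
  *)      exit 0 ;;
esac
EOF
chmod +x "$SCRIPT"

"$SCRIPT" status
rc=$?
if [ "$rc" -eq 3 ]; then
    echo "status-when-stopped: OK (rc=$rc)"
else
    echo "status-when-stopped: NOT LSB compliant (rc=$rc)"
fi
rm -f "$SCRIPT"
```

Run the same "status" check against the real zimbra script while it is stopped and while it is running; any other exit codes will confuse the cluster's monitor logic.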

in crm_mon, I can see it start the zimbra resource (on the machine
with the other resources). However, after several seconds it reports a
failure and I see something like this in crm_mon
Master/Slave Set: ms-drbd0
    drbd0:0     (heartbeat::ocf:drbd):  Master d243
    drbd0:1     (heartbeat::ocf:drbd):  Started d242
fs0     (heartbeat::ocf:Filesystem):    Started d243
ip_resource     (heartbeat::ocf:IPaddr):        Started d243
zimbra  (lsb:zimbra):   Started d243 (unmanaged) FAILED

Failed actions:
    zimbra_start_0 (node=d243, call=7, rc=1): Error
    zimbra_stop_0 (node=d243, call=8, rc=1): Error

Well, as you can see, the start operation failed. The resource is therefore stopped afterwards (notice the larger "call" number), but the stop operation failed as well. Since the cluster can no longer tell what state this resource is in, it will not touch it anymore (hence the "unmanaged").

Actually, if you had stonith configured, your node would be rebooted now, but that's another topic.


It should be noted that zimbra takes a long time to start and stop,
maybe as long as two minutes

Then you should set an appropriate value for timeout in the "start" operation. Something like this:

<primitive ...>
  <operations>
    <op id="zimbra-start-op" name="start" timeout="120s"/>
  </operations>
</primitive>

since it launches many subprocesses. If
there is a way to take that into account, I don't know where to do it.

Also, I have made rsc_order and rsc_colocation constraints, but I get
the same results as here. If I start zimbra via its init.d script and
then 'echo $?', it returns 0 and zimbra starts properly.

That's because the default operation timeout is 20s (or something in that range, at least), and that apparently is not enough for you.
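Since the stop operation timed out as well, it is probably worth raising its timeout too. A sketch (the op ids are made up, and you should adjust the values to what zimbra actually needs):

```xml
<primitive id="zimbra" class="lsb" type="zimbra">
  <operations>
    <op id="zimbra-start-op" name="start" timeout="180s"/>
    <op id="zimbra-stop-op" name="stop" timeout="180s"/>
  </operations>
</primitive>
```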

What I don't get is that it looks like it's trying to start zimbra
before DRBD is active even though I have a rsc_order set not to do so.
The constraints are below and I've included a small part of the logs
at the bottom. It seems to fail because it can't write out to a file
on /opt, which it can't do because it's not mounted.
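One way to double-check that the CIB, including the ordering constraints, is at least accepted as valid by the cluster is crm_verify against the live CIB (run on one of the cluster nodes):

```shell
# Validate the running cluster's configuration:
# -L reads the live CIB, -V makes the output verbose.
crm_verify -L -V
```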


2. Let's say I restart heartbeat on the other machine. DRBD does not
seem to reconnect properly, and the nodes get stuck in
WFReportParams/WFBitMapT; I have yet to find a way to fix this short
of rebooting one machine. This only happens when I have zimbra as a
resource; when nothing is really using /opt, I can switch back and
forth with no problems. I've seen some reports that this might be a
DRBD/kernel version problem, but it seems like most of those were
under DRBD 7.
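Before rebooting, it may be worth tearing down and re-establishing the DRBD connection by hand; that forces a new handshake and can clear a stuck WFReportParams/WFBitMapT state. A sketch, assuming the resource name drbd0 from your config (run on one of the stuck nodes):

```shell
# Drop the stale network connection for resource drbd0 ...
drbdadm disconnect drbd0
# ... then re-run the DRBD handshake with the peer.
drbdadm connect drbd0
# Watch the connection state; it should go via WFConnection to Connected.
cat /proc/drbd
```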

For master/slave resources, you should use a newer version of heartbeat, and especially of the crm (which is now called pacemaker and needs to be installed separately). To install a newer version, please read http://www.clusterlabs.org/mw/Install and use this pacemaker version: http://hg.clusterlabs.org/pacemaker/stable-0.6/archive/tip.tar.gz

I have removed all files in the rc*.d directories for drbd and zimbra.
Much of this was taken directly from FAQs and HOWTOs.

I will happily provide logs or other debugging info. Configs follow below.



[EMAIL PROTECTED]:/root/tmp# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-25 00:46:06
 0: cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0


[EMAIL PROTECTED]:/etc/init.d# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-24 16:02:09
 0: cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:4 nr:42960 dw:43504 dr:45105 al:0 bm:7 lo:2 pe:0 ua:0 ap:1
        resync: used:0/31 hits:51 misses:7 starving:0 dirty:0 changed:7
        act_log: used:1/257 hits:136 misses:1 starving:0 dirty:0 changed:0



drbd.conf

global {
    usage-count yes;
}
common {
  syncer { rate 50M; }
}
resource drbd0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";
  }
  startup {
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri discard-younger-primary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 50M;
    al-extents 257;
  }
  on d242 {
    device     /dev/drbd0;
    disk       /dev/sda3;
    address    192.168.243.242:7788;
    meta-disk  internal;
  }
  on d243 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.243.243:7788;
    meta-disk internal;
  }
}


the configuration part of cib.xml

 <configuration>
     <crm_config>
       <cluster_property_set id="bootstrap">
         <attributes>
           <nvpair id="bootstrap01" name="transition-idle-timeout" value="60"/>
           <nvpair id="bootstrap02" name="default-resource-stickiness"
value="INFINITY"/>
           <nvpair id="bootstrap03"
name="default-resource-failure-stickiness" value="-500"/>
           <nvpair id="bootstrap04" name="stonith-enabled" value="false"/>
           <nvpair id="bootstrap05" name="stonith-action" value="reboot"/>
           <nvpair id="bootstrap06" name="symmetric-cluster" value="true"/>
           <nvpair id="bootstrap07" name="no-quorum-policy" value="stop"/>
           <nvpair id="bootstrap08" name="stop-orphan-resources" value="true"/>
           <nvpair id="bootstrap09" name="stop-orphan-actions" value="true"/>
           <nvpair id="bootstrap10" name="is-managed-default" value="true"/>
         </attributes>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="0ed23ab0-3b94-40d2-858d-c5b5c437f1b6" uname="d243"
type="normal"/>
       <node id="6778303d-77cc-49b4-8704-15c5da3c55fe" uname="d242"
type="normal"/>
     </nodes>
     <resources>
       <master_slave id="ms-drbd0">
         <meta_attributes id="ma-ms-drbd0">
           <attributes>
             <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
             <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
             <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
             <nvpair id="ma-ms-drbd0-7" name="target_role" value="#default"/>
           </attributes>
         </meta_attributes>
         <primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
           <instance_attributes id="ia-drbd0">
             <attributes>
               <nvpair id="ia-drbd0-1" name="drbd_resource" value="drbd0"/>
             </attributes>
           </instance_attributes>
         </primitive>
       </master_slave>
       <primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
         <meta_attributes id="ma-fs0">
           <attributes>
             <nvpair name="target_role" id="ma-fs0-1" value="#default"/>
           </attributes>
         </meta_attributes>
         <instance_attributes id="ia-fs0">
           <attributes>
             <nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
             <nvpair id="ia-fs0-2" name="directory" value="/opt"/>
             <nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
           </attributes>
         </instance_attributes>
       </primitive>
       <primitive id="ip_resource" class="ocf" type="IPaddr"
provider="heartbeat">
         <instance_attributes id="0a922086-cf51-47ef-b027-7b9d65f30a24">
           <attributes>
             <nvpair name="ip" value="192.168.243.244"
id="fd11e0eb-1b24-4552-a13b-d07afd57f046"/>
           </attributes>
         </instance_attributes>
       </primitive>
       <primitive id="zimbra" class="lsb" type="zimbra"/>
     </resources>
     <constraints>
    <rsc_order id="drbd0_before_fs0" from="fs0" action="start"
to="ms-drbd0" to_action="promote"/>
       <rsc_colocation id="fs0_on_drbd0" to="ms-drbd0"
to_role="master" from="fs0" score="infinity"/>
       <rsc_colocation id="ip_on_drbd0" to="ms-drbd0" to_role="master"
from="ip_resource" score="infinity"/>
       <rsc_order from="zimbra" to="fs0"
id="20e679fd-50a2-4ab5-b7a0-961ac7169569"/>

IMHO, these constraints look good for what you said you want to do.

     </constraints>
   </configuration>

Regards
Dominik
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
