Ubuntu 7.10 with DRBD 8.0.3 and Heartbeat 2.1.2 with an updated Filesystem file
kernel 2.6.22-14 (updated from stock)
I possibly have two problems: a heartbeat issue and a DRBD issue. My goal is
to get a pair of machines working with a large /opt partition for
zimbra (my mail server software) and a virtual IP.
1. I can configure heartbeat and DRBD with a virtual IP with no
problems at all. I can start and stop heartbeat on the two machines,
and because of the colocation constraints I have set up, the resources
move around properly with no problems. If I start zimbra manually on
the machine that currently has the /opt partition mounted and the
virtual IP, it works with no problem (I installed it with no issues).
I then add zimbra as an LSB resource, like so:
<primitive id="zimbra" class="lsb" type="zimbra"/>
In crm_mon I can see the zimbra resource start (on the machine
with the other resources). However, after several seconds it reports a
failure and I see something like this in crm_mon:
Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd): Master d243
drbd0:1 (heartbeat::ocf:drbd): Started d242
fs0 (heartbeat::ocf:Filesystem): Started d243
ip_resource (heartbeat::ocf:IPaddr): Started d243
zimbra (lsb:zimbra): Started d243 (unmanaged) FAILED
Failed actions:
zimbra_start_0 (node=d243, call=7, rc=1): Error
zimbra_stop_0 (node=d243, call=8, rc=1): Error
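Since both zimbra_start_0 and zimbra_stop_0 come back with rc=1, one thing I want to rule out is the init script itself not being LSB-compliant in the environment heartbeat runs it from (no login shell, minimal PATH). A rough check I could run by hand on the node that has /opt mounted (untested sketch; it just reports the exit codes heartbeat would see):

```shell
# Exercise the zimbra init script the way heartbeat does and show the
# exit codes. LSB expects: start/stop -> 0, status -> 0 while running
# and 3 after a clean stop.
check() {
    script=/etc/init.d/zimbra
    if [ -x "$script" ]; then
        "$script" "$1"
        echo "$1 rc=$?"
    else
        echo "$1: zimbra init script not found here"
    fi
}
check start    # expect rc=0 once fully started
check status   # expect rc=0 while running
check stop     # expect rc=0
check status   # expect rc=3 after a clean stop
```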
It should be noted that zimbra takes a long time to start and stop,
maybe as long as two minutes, since it launches many sub-processes. If
there is a way to take that into account, I don't know where to do it.
Also, I have made rsc_order and rsc_colocation constraints, but I get
the same results as shown here. If I start zimbra via its init.d
script and then 'echo $?', it returns 0 and zimbra starts properly.
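For the slow start/stop, I think the place to take it into account would be per-operation timeouts on the primitive itself, since the default action timeout in heartbeat 2.x is quite short (20s, if I remember right). An untested sketch, with op ids and timeout values I made up:

```xml
<primitive id="zimbra" class="lsb" type="zimbra">
  <operations>
    <op id="zimbra-op-start" name="start" timeout="180s"/>
    <op id="zimbra-op-stop" name="stop" timeout="180s"/>
    <op id="zimbra-op-monitor" name="monitor" interval="60s" timeout="60s"/>
  </operations>
</primitive>
```

Alternatively, default-action-timeout could be raised cluster-wide in the bootstrap property set, but per-op timeouts seem less invasive.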
What I don't get is that it looks like it's trying to start zimbra
before DRBD is active, even though I have an rsc_order constraint that
should prevent that. The constraints are below and I've included a
small part of the logs at the bottom. zimbra seems to fail because it
can't write to a file on /opt, which it can't do because /opt isn't
mounted.
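In case it matters, what I believe a fully explicit chain would look like is a start-after-start order plus a colocation tying zimbra to fs0, in the same syntax as my config below (untested sketch, ids made up):

```xml
<rsc_order id="fs0_before_zimbra" from="zimbra" action="start"
           to="fs0" to_action="start"/>
<rsc_colocation id="zimbra_on_fs0" from="zimbra" to="fs0"
                score="infinity"/>
```

My actual zimbra rsc_order (below) has neither action nor to_action set, and zimbra has no colocation of its own, so maybe that is where this goes wrong.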
2. Let's say I restart heartbeat on the other machine. DRBD does not
seem to reconnect properly and I get stuck with the two nodes in
WFReportParams/WFBitMapT, and I have yet to find a way to fix this
short of rebooting one machine. This only happens when I have zimbra
as a resource; when nothing is really using /opt I can switch back and
forth with no problems. I've seen some reports that this might be a
DRBD/kernel version problem, but it seems like most of those were
under DRBD 0.7.
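What I would want to try next instead of rebooting is tearing the connection down and re-establishing it by hand (untested sketch; the resource name drbd0 matches my drbd.conf, and the wrapper only echoes the steps on a box without drbdadm):

```shell
# Reconnect attempt for a node stuck in WFReportParams/WFBitMapT.
RES=drbd0
run() {
    # Execute for real only where drbdadm exists; otherwise show the step.
    if command -v drbdadm >/dev/null 2>&1; then "$@"; else echo "would run: $*"; fi
}
run drbdadm disconnect "$RES"   # drop the half-open connection
run drbdadm connect "$RES"      # attempt a fresh handshake
# If the handshake then fails with a split-brain message, discard the
# changes on ONE node (anything written there since the split is lost):
#   drbdadm -- --discard-my-data connect drbd0
```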
I have removed all the files in the rc*.d directories for drbd and
zimbra. Much of this setup was taken directly from FAQs and howtos.
I will happily provide logs or other debugging info; configs follow.
[EMAIL PROTECTED]:/root/tmp# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-25 00:46:06
0: cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
[EMAIL PROTECTED]:/etc/init.d# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-24 16:02:09
0: cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown C r---
ns:4 nr:42960 dw:43504 dr:45105 al:0 bm:7 lo:2 pe:0 ua:0 ap:1
resync: used:0/31 hits:51 misses:7 starving:0 dirty:0 changed:7
act_log: used:1/257 hits:136 misses:1 starving:0 dirty:0 changed:0
drbd.conf
global {
usage-count yes;
}
common {
syncer { rate 50M; }
}
resource drbd0 {
protocol C;
handlers {
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
outdate-peer "/usr/sbin/drbd-peer-outdater";
}
startup {
}
disk {
on-io-error detach;
}
net {
after-sb-0pri discard-younger-primary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
syncer {
rate 50M;
al-extents 257;
}
on d242 {
device /dev/drbd0;
disk /dev/sda3;
address 192.168.243.242:7788;
meta-disk internal;
}
on d243 {
device /dev/drbd0;
disk /dev/sda3;
address 192.168.243.243:7788;
meta-disk internal;
}
}
The configuration part of cib.xml:
<configuration>
<crm_config>
<cluster_property_set id="bootstrap">
<attributes>
<nvpair id="bootstrap01" name="transition-idle-timeout" value="60"/>
<nvpair id="bootstrap02" name="default-resource-stickiness"
value="INFINITY"/>
<nvpair id="bootstrap03"
name="default-resource-failure-stickiness" value="-500"/>
<nvpair id="bootstrap04" name="stonith-enabled" value="false"/>
<nvpair id="bootstrap05" name="stonith-action" value="reboot"/>
<nvpair id="bootstrap06" name="symmetric-cluster" value="true"/>
<nvpair id="bootstrap07" name="no-quorum-policy" value="stop"/>
<nvpair id="bootstrap08" name="stop-orphan-resources" value="true"/>
<nvpair id="bootstrap09" name="stop-orphan-actions" value="true"/>
<nvpair id="bootstrap10" name="is-managed-default" value="true"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes>
<node id="0ed23ab0-3b94-40d2-858d-c5b5c437f1b6" uname="d243"
type="normal"/>
<node id="6778303d-77cc-49b4-8704-15c5da3c55fe" uname="d242"
type="normal"/>
</nodes>
<resources>
<master_slave id="ms-drbd0">
<meta_attributes id="ma-ms-drbd0">
<attributes>
<nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
<nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
<nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
<nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
<nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
<nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
<nvpair id="ma-ms-drbd0-7" name="target_role" value="#default"/>
</attributes>
</meta_attributes>
<primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
<instance_attributes id="ia-drbd0">
<attributes>
<nvpair id="ia-drbd0-1" name="drbd_resource" value="drbd0"/>
</attributes>
</instance_attributes>
</primitive>
</master_slave>
<primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
<meta_attributes id="ma-fs0">
<attributes>
<nvpair name="target_role" id="ma-fs0-1" value="#default"/>
</attributes>
</meta_attributes>
<instance_attributes id="ia-fs0">
<attributes>
<nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
<nvpair id="ia-fs0-2" name="directory" value="/opt"/>
<nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="ip_resource" class="ocf" type="IPaddr"
provider="heartbeat">
<instance_attributes id="0a922086-cf51-47ef-b027-7b9d65f30a24">
<attributes>
<nvpair name="ip" value="192.168.243.244"
id="fd11e0eb-1b24-4552-a13b-d07afd57f046"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="zimbra" class="lsb" type="zimbra"/>
</resources>
<constraints>
<rsc_order id="drbd0_before_fs0" from="fs0" action="start"
to="ms-drbd0" to_action="promote"/>
<rsc_colocation id="fs0_on_drbd0" to="ms-drbd0"
to_role="master" from="fs0" score="infinity"/>
<rsc_colocation id="ip_on_drbd0" to="ms-drbd0" to_role="master"
from="ip_resource" score="infinity"/>
<rsc_order from="zimbra" to="fs0"
id="20e679fd-50a2-4ab5-b7a0-961ac7169569"/>
</constraints>
</configuration>
Apr 2 16:18:21 d243 pengine: [9607]: info: determine_online_status: Node d243 is online
Apr 2 16:18:21 d243 pengine: [9607]: WARN: unpack_rsc_op: Processing failed op (zimbra_start_0) on d243
Apr 2 16:18:21 d243 pengine: [9607]: WARN: unpack_rsc_op: Handling failed start for zimbra on d243
Apr 2 16:18:21 d243 pengine: [9607]: info: determine_online_status: Node d242 is online
Apr 2 16:18:21 d243 pengine: [9607]: info: clone_print: Master/Slave Set: ms-drbd0
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: drbd0:0 (heartbeat::ocf:drbd): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: drbd0:1 (heartbeat::ocf:drbd): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: fs0 (heartbeat::ocf:Filesystem): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: ip_resource (heartbeat::ocf:IPaddr): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: zimbra (lsb:zimbra): Started d243 FAILED
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d243 Start drbd0:0
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start drbd0:1
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d243 Start drbd0:0
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start drbd0:1
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: WARN: native_color: Resource fs0 cannot run anywhere
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: WARN: native_color: Resource ip_resource cannot run anywhere
Apr 2 16:18:21 d243 pengine: [9607]: notice: NoRoleChange: Recover resource zimbra (d242)
Apr 2 16:18:21 d243 pengine: [9607]: notice: StopRsc: d243 Stop zimbra
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start zimbra
Apr 2 16:18:21 d243 pengine: [9607]: WARN: process_pe_message: Transition 2: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-1870.bz2
Apr 2 16:18:21 d243 pengine: [9607]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Apr 2 16:18:21 d243 crmd: [9599]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
Apr 2 16:18:21 d243 tengine: [9606]: info: unpack_graph: Unpacked transition 2: 12 actions in 12 synapses
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 10 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 1: zimbra_stop_0 on d243
Apr 2 16:18:21 d243 crmd: [9599]: info: do_lrm_rsc_op: Performing op=zimbra_stop_0 key=1:2:952627c4-4e23-4f85-9b0f-5245d074753c)
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 7 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 5: drbd0:0_start_0 on d243
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 6: drbd0:1_start_0 on d242
Apr 2 16:18:21 d243 lrmd: [9733]: WARN: For LSB init script, no additional parameters are needed.
Apr 2 16:18:21 d243 crmd: [9599]: info: do_lrm_rsc_op: Performing op=drbd0:0_start_0 key=5:2:952627c4-4e23-4f85-9b0f-5245d074753c)
Apr 2 16:18:21 d243 lrmd: [9596]: info: RA output: (zimbra:stop:stderr) -su: /opt/zimbra/log/startup.log: No such file or directory
Apr 2 16:18:21 d243 lrmd: [9596]: WARN: Exiting zimbra:stop process 9733 returned rc 1.
Apr 2 16:18:21 d243 tengine: [9606]: WARN: status_from_rc: Action stop on d243 failed (target: <null> vs. rc: 1): Error
Apr 2 16:18:21 d243 tengine: [9606]: info: update_abort_priority: Abort priority upgraded to 1
Apr 2 16:18:21 d243 tengine: [9606]: info: update_abort_priority: Abort action 0 superceeded by 2
Apr 2 16:18:21 d243 tengine: [9606]: info: match_graph_event: Action zimbra_stop_0 (1) confirmed on d243
Apr 2 16:18:21 d243 kernel: [ 2991.120000] drbd0: disk( Diskless -> Attaching )
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: Found 6 transactions (324 active extents) in activity log.
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: max_segment_size ( = BIO size ) = 32768
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: drbd_bm_resize called with capacity == 2711914064
Apr 2 16:18:21 d243 kernel: [ 2991.160000] drbd0: resync bitmap: bits=338989258 words=10593416
Apr 2 16:18:21 d243 kernel: [ 2991.160000] drbd0: size = 1293 GB (1355957032 KB)
Apr 2 16:18:21 d243 kernel: [ 2991.400000] drbd0: reading of bitmap took 24 jiffies
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: recounting of set bits took additional 6 jiffies
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: 88 KB marked out-of-sync by on disk bit-map.
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: disk( Attaching -> UpToDate )
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: Writing meta data super block now.
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: conn( StandAlone -> Unconnected )
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: receiver (re)started
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: conn( Unconnected -> WFConnection )
Apr 2 16:18:22 d243 lrmd: [9596]: info: RA output: (drbd0:0:start:stdout)
Apr 2 16:18:22 d243 lrmd: [9596]: info: Exiting drbd0:0:start process 9736 returned rc 0.
Apr 2 16:18:22 d243 crmd: [9599]: info: process_lrm_event: LRM operation drbd0:0_start_0 (call=9, rc=0) complete
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems