Ubuntu 7.10 with DRBD 8.0.3 and Heartbeat 2.1.2 with an updated Filesystem file
kernel 2.6.22-14 (updated from stock)
I possibly have two problems: a heartbeat issue and a DRBD issue. My goal is
to get a pair of machines working with a large /opt partition for
zimbra (my mail server software) and a virtual IP.
1. I can configure heartbeat and DRBD with a virtual IP with no
problems at all. I can start and stop heartbeat on the two machines,
and because of the colocation constraints I have set up, the resources
move around properly with no problems. If I start zimbra manually on
the machine that currently has the /opt partition mounted and the
virtual IP, it works with no problem (I installed it with no issues).
I then add zimbra as an LSB resource, like so:
<primitive id="zimbra" class="lsb" type="zimbra"/>
In crm_mon I can see the zimbra resource start (on the machine
with the other resources). However, after several seconds it reports a
failure and I see something like this in crm_mon:
Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd): Master d243
drbd0:1 (heartbeat::ocf:drbd): Started d242
fs0 (heartbeat::ocf:Filesystem): Started d243
ip_resource (heartbeat::ocf:IPaddr): Started d243
zimbra (lsb:zimbra): Started d243 (unmanaged) FAILED
Failed actions:
zimbra_start_0 (node=d243, call=7, rc=1): Error
zimbra_stop_0 (node=d243, call=8, rc=1): Error
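Since both zimbra_start_0 and zimbra_stop_0 come back with rc=1, one thing I want to rule out is the init script itself not being LSB-compliant in the environment heartbeat runs it from (no login shell, minimal PATH). A rough check I could run by hand on the node that has /opt mounted (untested sketch; it just reports the exit codes heartbeat would see):

```shell
# Exercise the zimbra init script the way heartbeat does and show the
# exit codes. LSB expects: start/stop -> 0, status -> 0 while running
# and 3 after a clean stop.
check() {
    script=/etc/init.d/zimbra
    if [ -x "$script" ]; then
        "$script" "$1"
        echo "$1 rc=$?"
    else
        echo "$1: zimbra init script not found here"
    fi
}
check start    # expect rc=0 once fully started
check status   # expect rc=0 while running
check stop     # expect rc=0
check status   # expect rc=3 after a clean stop
```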
It should be noted that zimbra takes a long time to start and stop,
maybe as long as two minutes, since it launches many sub-processes. If
there is a way to take that into account, I don't know where to do it.
Also, I have made rsc_order and rsc_colocation constraints, but I get
the same results as shown here. If I start zimbra via its init.d
script and then 'echo $?', it returns 0 and zimbra starts properly.
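For the slow start/stop, I think the place to take it into account would be per-operation timeouts on the primitive itself, since the default action timeout in heartbeat 2.x is quite short (20s, if I remember right). An untested sketch, with op ids and timeout values I made up:

```xml
<primitive id="zimbra" class="lsb" type="zimbra">
  <operations>
    <op id="zimbra-op-start" name="start" timeout="180s"/>
    <op id="zimbra-op-stop" name="stop" timeout="180s"/>
    <op id="zimbra-op-monitor" name="monitor" interval="60s" timeout="60s"/>
  </operations>
</primitive>
```

Alternatively, default-action-timeout could be raised cluster-wide in the bootstrap property set, but per-op timeouts seem less invasive.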
What I don't get is that it looks like it's trying to start zimbra
before DRBD is active, even though I have an rsc_order constraint that
should prevent that. The constraints are below and I've included a
small part of the logs at the bottom. zimbra seems to fail because it
can't write to a file on /opt, which it can't do because /opt isn't
mounted.
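In case it matters, what I believe a fully explicit chain would look like is a start-after-start order plus a colocation tying zimbra to fs0, in the same syntax as my config below (untested sketch, ids made up):

```xml
<rsc_order id="fs0_before_zimbra" from="zimbra" action="start"
           to="fs0" to_action="start"/>
<rsc_colocation id="zimbra_on_fs0" from="zimbra" to="fs0"
                score="infinity"/>
```

My actual zimbra rsc_order (below) has neither action nor to_action set, and zimbra has no colocation of its own, so maybe that is where this goes wrong.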
2. Let's say I restart heartbeat on the other machine. DRBD does not
seem to reconnect properly and I get stuck with the two nodes in
WFReportParams/WFBitMapT, and I have yet to find a way to fix this
short of rebooting one machine. This only happens when I have zimbra
as a resource; when nothing is really using /opt I can switch back and
forth with no problems. I've seen some reports that this might be a
DRBD/kernel version problem, but it seems like most of those were
under DRBD 0.7.
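What I would want to try next instead of rebooting is tearing the connection down and re-establishing it by hand (untested sketch; the resource name drbd0 matches my drbd.conf, and the wrapper only echoes the steps on a box without drbdadm):

```shell
# Reconnect attempt for a node stuck in WFReportParams/WFBitMapT.
RES=drbd0
run() {
    # Execute for real only where drbdadm exists; otherwise show the step.
    if command -v drbdadm >/dev/null 2>&1; then "$@"; else echo "would run: $*"; fi
}
run drbdadm disconnect "$RES"   # drop the half-open connection
run drbdadm connect "$RES"      # attempt a fresh handshake
# If the handshake then fails with a split-brain message, discard the
# changes on ONE node (anything written there since the split is lost):
#   drbdadm -- --discard-my-data connect drbd0
```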
I have removed all the files in the rc*.d directories for drbd and
zimbra. Much of this setup was taken directly from FAQs and howtos.
I will happily provide logs or other debugging info; configs follow.
[EMAIL PROTECTED]:/root/tmp# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-25 00:46:06
0: cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
[EMAIL PROTECTED]:/etc/init.d# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-24 16:02:09
0: cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown C r---
ns:4 nr:42960 dw:43504 dr:45105 al:0 bm:7 lo:2 pe:0 ua:0 ap:1
resync: used:0/31 hits:51 misses:7 starving:0 dirty:0 changed:7
act_log: used:1/257 hits:136 misses:1 starving:0 dirty:0 changed:0
drbd.conf
global {
usage-count yes;
}
common {
syncer { rate 50M; }
}
resource drbd0 {
protocol C;
handlers {
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
outdate-peer "/usr/sbin/drbd-peer-outdater";
}
startup {
}
disk {
on-io-error detach;
}
net {
after-sb-0pri discard-younger-primary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
syncer {
rate 50M;
al-extents 257;
}
on d242 {
device /dev/drbd0;
disk /dev/sda3;
address 192.168.243.242:7788;
meta-disk internal;
}
on d243 {
device /dev/drbd0;
disk /dev/sda3;
address 192.168.243.243:7788;
meta-disk internal;
}
}
The configuration part of cib.xml:
<configuration>
<crm_config>
<cluster_property_set id="bootstrap">
<attributes>
<nvpair id="bootstrap01" name="transition-idle-timeout" value="60"/>
<nvpair id="bootstrap02" name="default-resource-stickiness"
value="INFINITY"/>
<nvpair id="bootstrap03"
name="default-resource-failure-stickiness" value="-500"/>
<nvpair id="bootstrap04" name="stonith-enabled" value="false"/>
<nvpair id="bootstrap05" name="stonith-action" value="reboot"/>
<nvpair id="bootstrap06" name="symmetric-cluster" value="true"/>
<nvpair id="bootstrap07" name="no-quorum-policy" value="stop"/>
<nvpair id="bootstrap08" name="stop-orphan-resources" value="true"/>
<nvpair id="bootstrap09" name="stop-orphan-actions" value="true"/>
<nvpair id="bootstrap10" name="is-managed-default" value="true"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes>
<node id="0ed23ab0-3b94-40d2-858d-c5b5c437f1b6" uname="d243"
type="normal"/>
<node id="6778303d-77cc-49b4-8704-15c5da3c55fe" uname="d242"
type="normal"/>
</nodes>
<resources>
<master_slave id="ms-drbd0">
<meta_attributes id="ma-ms-drbd0">
<attributes>
<nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
<nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
<nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
<nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
<nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
<nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
<nvpair id="ma-ms-drbd0-7" name="target_role" value="#default"/>
</attributes>
</meta_attributes>
<primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
<instance_attributes id="ia-drbd0">
<attributes>
<nvpair id="ia-drbd0-1" name="drbd_resource" value="drbd0"/>
</attributes>
</instance_attributes>
</primitive>
</master_slave>
<primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
<meta_attributes id="ma-fs0">
<attributes>
<nvpair name="target_role" id="ma-fs0-1" value="#default"/>
</attributes>
</meta_attributes>
<instance_attributes id="ia-fs0">
<attributes>
<nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
<nvpair id="ia-fs0-2" name="directory" value="/opt"/>
<nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="ip_resource" class="ocf" type="IPaddr"
provider="heartbeat">
<instance_attributes id="0a922086-cf51-47ef-b027-7b9d65f30a24">
<attributes>
<nvpair name="ip" value="192.168.243.244"
id="fd11e0eb-1b24-4552-a13b-d07afd57f046"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="zimbra" class="lsb" type="zimbra"/>
</resources>
<constraints>
<rsc_order id="drbd0_before_fs0" from="fs0" action="start"
to="ms-drbd0" to_action="promote"/>
<rsc_colocation id="fs0_on_drbd0" to="ms-drbd0"
to_role="master" from="fs0" score="infinity"/>
<rsc_colocation id="ip_on_drbd0" to="ms-drbd0" to_role="master"
from="ip_resource" score="infinity"/>
<rsc_order from="zimbra" to="fs0"
id="20e679fd-50a2-4ab5-b7a0-961ac7169569"/>
</constraints>
</configuration>
Apr 2 16:18:21 d243 pengine: [9607]: info: determine_online_status: Node d243 is online
Apr 2 16:18:21 d243 pengine: [9607]: WARN: unpack_rsc_op: Processing failed op (zimbra_start_0) on d243
Apr 2 16:18:21 d243 pengine: [9607]: WARN: unpack_rsc_op: Handling failed start for zimbra on d243
Apr 2 16:18:21 d243 pengine: [9607]: info: determine_online_status: Node d242 is online
Apr 2 16:18:21 d243 pengine: [9607]: info: clone_print: Master/Slave Set: ms-drbd0
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: drbd0:0 (heartbeat::ocf:drbd): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: drbd0:1 (heartbeat::ocf:drbd): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: fs0 (heartbeat::ocf:Filesystem): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: ip_resource (heartbeat::ocf:IPaddr): Stopped
Apr 2 16:18:21 d243 pengine: [9607]: info: native_print: zimbra (lsb:zimbra): Started d243 FAILED
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d243 Start drbd0:0
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start drbd0:1
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d243 Start drbd0:0
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start drbd0:1
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: WARN: native_color: Resource fs0 cannot run anywhere
Apr 2 16:18:21 d243 pengine: [9607]: info: master_color: Promoted 0 instances of a possible 1 to master
Apr 2 16:18:21 d243 pengine: [9607]: WARN: native_color: Resource ip_resource cannot run anywhere
Apr 2 16:18:21 d243 pengine: [9607]: notice: NoRoleChange: Recover resource zimbra (d242)
Apr 2 16:18:21 d243 pengine: [9607]: notice: StopRsc: d243 Stop zimbra
Apr 2 16:18:21 d243 pengine: [9607]: notice: StartRsc: d242 Start zimbra
Apr 2 16:18:21 d243 pengine: [9607]: WARN: process_pe_message: Transition 2: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-1870.bz2
Apr 2 16:18:21 d243 pengine: [9607]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Apr 2 16:18:21 d243 crmd: [9599]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
Apr 2 16:18:21 d243 tengine: [9606]: info: unpack_graph: Unpacked transition 2: 12 actions in 12 synapses
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 10 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 1: zimbra_stop_0 on d243
Apr 2 16:18:21 d243 crmd: [9599]: info: do_lrm_rsc_op: Performing op=zimbra_stop_0 key=1:2:952627c4-4e23-4f85-9b0f-5245d074753c)
Apr 2 16:18:21 d243 tengine: [9606]: info: te_pseudo_action: Pseudo action 7 fired and confirmed
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 5: drbd0:0_start_0 on d243
Apr 2 16:18:21 d243 tengine: [9606]: info: send_rsc_command: Initiating action 6: drbd0:1_start_0 on d242
Apr 2 16:18:21 d243 lrmd: [9733]: WARN: For LSB init script, no additional parameters are needed.
Apr 2 16:18:21 d243 crmd: [9599]: info: do_lrm_rsc_op: Performing op=drbd0:0_start_0 key=5:2:952627c4-4e23-4f85-9b0f-5245d074753c)
Apr 2 16:18:21 d243 lrmd: [9596]: info: RA output: (zimbra:stop:stderr) -su: /opt/zimbra/log/startup.log: No such file or directory
Apr 2 16:18:21 d243 lrmd: [9596]: WARN: Exiting zimbra:stop process 9733 returned rc 1.
Apr 2 16:18:21 d243 tengine: [9606]: WARN: status_from_rc: Action stop on d243 failed (target: <null> vs. rc: 1): Error
Apr 2 16:18:21 d243 tengine: [9606]: info: update_abort_priority: Abort priority upgraded to 1
Apr 2 16:18:21 d243 tengine: [9606]: info: update_abort_priority: Abort action 0 superceeded by 2
Apr 2 16:18:21 d243 tengine: [9606]: info: match_graph_event: Action zimbra_stop_0 (1) confirmed on d243
Apr 2 16:18:21 d243 kernel: [ 2991.120000] drbd0: disk( Diskless -> Attaching )
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: Found 6 transactions (324 active extents) in activity log.
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: max_segment_size ( = BIO size ) = 32768
Apr 2 16:18:21 d243 kernel: [ 2991.140000] drbd0: drbd_bm_resize called with capacity == 2711914064
Apr 2 16:18:21 d243 kernel: [ 2991.160000] drbd0: resync bitmap: bits=338989258 words=10593416
Apr 2 16:18:21 d243 kernel: [ 2991.160000] drbd0: size = 1293 GB (1355957032 KB)
Apr 2 16:18:21 d243 kernel: [ 2991.400000] drbd0: reading of bitmap took 24 jiffies
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: recounting of set bits took additional 6 jiffies
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: 88 KB marked out-of-sync by on disk bit-map.
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: disk( Attaching -> UpToDate )
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: Writing meta data super block now.
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: conn( StandAlone -> Unconnected )
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: receiver (re)started
Apr 2 16:18:22 d243 kernel: [ 2991.460000] drbd0: conn( Unconnected -> WFConnection )
Apr 2 16:18:22 d243 lrmd: [9596]: info: RA output: (drbd0:0:start:stdout)
Apr 2 16:18:22 d243 lrmd: [9596]: info: Exiting drbd0:0:start process 9736 returned rc 0.
Apr 2 16:18:22 d243 crmd: [9599]: info: process_lrm_event: LRM operation drbd0:0_start_0 (call=9, rc=0) complete
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems