Good evening,
I noticed that when corosync is set to start at boot my stonith devices
don't start up correctly.
Here is some version info:
cluster-glue: 1.0.6
corosync: 1.2.7 (SVN revision 3008)
pacemaker: 1.0.9.1, release 1.15.el5
I've read in several places that stonith devices may rely on atd. I
haven't dug deep enough to fully understand why this dependency exists,
but I believe it's the cause of the problem I'm experiencing. On my
RHEL 5.5 system the corosync init script is configured with start and
stop priority 20, while atd starts at priority 95 and stops at 5, so
corosync comes up well before atd. If I move corosync's start priority
to 98 (after atd), my stonith devices start just fine. If I add a
start-delay to the stonith device that pushes its start past the
startup of atd, the device also starts just fine. With the default init
script and no start-delay, the stonith device ends up with a Failed
Action and never recovers without manual intervention.
My questions are: Why is the default init script shipped with the RPM
from the clusterlabs repo configured to start before atd if atd is a
dependency of certain parts of the pacemaker framework (if this is
indeed the case)? And is it safe/recommended to add a start-delay of
several minutes to a stonith device to work around this problem?
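For anyone who wants to check the boot ordering described above on
their own system, here is a minimal sketch that reads the start
priority out of the SysV "SNNname" links in a runlevel directory. The
function names start_prio and starts_before are made up for
illustration, and the runlevel directory in the example comment
(/etc/rc3.d) is an assumption about the system, not something taken
from the RPM.

```shell
#!/bin/sh
# Print the SysV start priority (the NN in SNNname) of a service in one
# runlevel directory. Returns 1 if the service has no start link there.
# Note: plain sh has no "local", so these helper variables are global.
start_prio() {
    rcdir=$1
    service=$2
    for link in "$rcdir"/S[0-9][0-9]"$service"; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$link" ] || [ -L "$link" ] || continue
        name=${link##*/}            # e.g. S20corosync
        prio=${name#S}              # e.g. 20corosync
        prio=${prio%"$service"}     # e.g. 20
        echo "$prio"
        return 0
    done
    return 1
}

# Succeed when FIRST has a strictly lower start priority than SECOND.
# Returns 2 if either service has no start link in the directory.
# Equal priorities (which init orders alphabetically) count as "not before".
starts_before() {
    a=$(start_prio "$1" "$2") || return 2
    b=$(start_prio "$1" "$3") || return 2
    [ "$a" -lt "$b" ]
}

# Example (runlevel 3 is an assumption):
#   starts_before /etc/rc3.d atd corosync || echo "atd is not started before corosync"
```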
Thanks!!
Eric Schoeller
Here are some logs:
Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
key=52:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_start_0 )
Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:8: start
Oct 11 20:33:14 nodea lrmd: [3397]: info: Try to start STONITH resource
<rsc_id=st-nodeb-ipmi> : Device=external/ipmi
Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
key=12:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=drbd_nfs:0_start_0 )
Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:drbd_nfs:0:9: start
Oct 11 20:33:37 nodea external/ipmi[3433]: ERROR: error executing
ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session^M Unable to
get Chassis Power Status
Oct 11 20:33:38 nodea stonithd: [3432]: info: external_run_cmd: Calling
'/usr/lib64/stonith/plugins/external/ipmi status' returned 256
Oct 11 20:33:38 nodea stonithd: [3432]: CRIT: external_status: 'ipmi
status' failed with rc 256
Oct 11 20:33:38 nodea stonithd: [3151]: WARN: start st-nodeb-ipmi
failed, because its hostlist is empty
Oct 11 20:33:38 nodea lrmd: [3153]: WARN: Managed st-nodeb-ipmi:start
process 3397 exited with return code 1.
Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM
operation st-nodeb-ipmi_start_0 (call=8, rc=1, cib-update=16,
confirmed=true) unknown error
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update
relayed from nodeb.domain.com
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-st-nodeb-ipmi (INFINITY)
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent
update 26: fail-count-st-nodeb-ipmi=INFINITY
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update
relayed from nodeb.domain.com
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-st-nodeb-ipmi (1286850818)
Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent
update 29: last-failure-st-nodeb-ipmi=1286850818
Oct 11 20:33:38 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
key=1:58:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_stop_0 )
Oct 11 20:33:38 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:12: stop
Oct 11 20:33:38 nodea lrmd: [5063]: info: Try to stop STONITH resource
<rsc_id=st-nodeb-ipmi> : Device=external/ipmi
Oct 11 20:33:38 nodea stonithd: [3151]: notice: try to stop a resource
st-nodeb-ipmi who is not in started resource queue.
Oct 11 20:33:38 nodea lrmd: [3153]: info: Managed st-nodeb-ipmi:stop
process 5063 exited with return code 0.
Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM
operation st-nodeb-ipmi_stop_0 (call=12, rc=0, cib-update=17,
confirmed=true) ok
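To pick the affected resource out of a longer log, a filter along the
following lines may help. It is only a sketch: the function name
failed_resources is hypothetical, the log path in the example comment is
an assumption, and the pattern is based solely on the
attrd_perform_update lines shown above (with each syslog entry on a
single line, as it appears in the log file before mail wrapping).

```shell
#!/bin/sh
# List resources whose fail-count was set to INFINITY in a log file,
# one resource name per line, without duplicates.
failed_resources() {
    # Matches e.g. "... Sent update 26: fail-count-st-nodeb-ipmi=INFINITY"
    # and prints only the resource name (st-nodeb-ipmi).
    sed -n 's/.*fail-count-\([^ =]*\)=INFINITY.*/\1/p' "$1" | sort -u
}

# Example (the log location is an assumption):
#   failed_resources /var/log/messages
```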
Here is my cluster configuration:
node nodea.domain.com \
        attributes standby="off"
node nodeb.domain.com \
        attributes standby="off"
primitive drbd_nfs ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s"
primitive fs_nfs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/drbd0" fstype="ext3" \
        meta is-managed="true"
primitive ip_nfs ocf:heartbeat:IPaddr2 \
        params ip="1.2.3.20" cidr_netmask="32" nic="bond0"
primitive nfsserver ocf:heartbeat:nfsserver \
        params nfs_shared_infodir="/mnt/drbd0/nfs" nfs_ip="1.2.3.20" nfs_init_script="/etc/init.d/nfs"
primitive st-nodea-ipmi stonith:external/ipmi \
        params hostname="nodea.domain.com" ipaddr="1.2.3.23" userid="coolguy" passwd="changeme" interface="lanplus" \
        op monitor interval="20m" timeout="1m" \
        op start interval="0" timeout="1m" start-delay="360s" \
        meta target-role="Started"
primitive st-nodeb-ipmi stonith:external/ipmi \
        params hostname="nodeb.domain.com" ipaddr="1.2.3.25" userid="coolguy" passwd="changeme" interface="lanplus" \
        op monitor interval="20m" timeout="1m" \
        op start interval="0" timeout="1m" start-delay="360s" \
        meta target-role="Started"
group nfs fs_nfs ip_nfs nfsserver \
        meta target-role="Started"
ms ms_drbd_nfs drbd_nfs \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
location l-st-nodea st-nodea-ipmi -inf: nodea.domain.com
location l-st-nodeb st-nodeb-ipmi -inf: nodeb.domain.com
colocation nfs_on_drbd inf: nfs ms_drbd_nfs:Master
order nfs_after_drbd inf: ms_drbd_nfs:promote nfs:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1286851694"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems