Hi everyone,
I have a two-node cluster running Pacemaker 1.1.2 and DRBD 8.3.7. It is an
active/active cluster, so the DRBD partition is used with OCFS2. For testing
purposes I have configured external/ssh as the STONITH device. The
following test left the surviving node unstable and unusable:
- Pulled the HA network cable
- Put it back after a couple of seconds
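The two steps above can be approximated in software, which makes the test easier to repeat. The sketch below is only an approximation of physically pulling the cable. It assumes the HA link is on an interface named eth1, which is a placeholder, since the real interface name is not given here. The helper names are hypothetical as well.

```python
import subprocess
import time


def flap_commands(iface):
    """Return the two `ip link` commands that take iface down and back up."""
    down = ["ip", "link", "set", "dev", iface, "down"]
    up = ["ip", "link", "set", "dev", iface, "up"]
    return down, up


def flap_link(iface="eth1", downtime=5.0, dry_run=True):
    """Take iface down, wait downtime seconds, then bring it up again.

    iface="eth1" is a placeholder, not the real HA interface name.
    With dry_run=True nothing is executed; the commands are only
    returned so they can be inspected first.
    """
    down, up = flap_commands(iface)
    if not dry_run:
        subprocess.run(down, check=True)  # requires root privileges
        try:
            time.sleep(downtime)
        finally:
            # Always try to restore the link, even if the wait is interrupted.
            subprocess.run(up, check=True)
    return [down, up]
```

Calling flap_link("eth1", downtime=5, dry_run=False) as root on one node would roughly match "pull the cable, put it back after a couple of seconds". The default dry run only returns the commands.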
Result:
- Node 2 is being restarted
- Load average on Node 1 increases until the system becomes unreachable
- OCFS2 produces a massive number of log messages on the
surviving node (see below)
- The DRBD partition is not accessible
- Node 1 cannot be rebooted; only a hard reset brings it back to life
Log messages:
May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified:
9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
May 16 11:46:40 node1 ocfs2_controld[4058]: Notified for
"9253D9511DFE4DBD9FB87368177ECC25", node 1, status -11
May 16 11:46:40 node1 ocfs2_controld[4058]: Sending notification of node 1 for
"9253D9511DFE4DBD9FB87368177ECC25"
May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified:
9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
May 16 11:46:40 node1 ocfs2_controld[4058]: Notified for
"9253D9511DFE4DBD9FB87368177ECC25", node 1, status -11
May 16 11:46:40 node1 ocfs2_controld[4058]: Sending notification of node 1 for
"9253D9511DFE4DBD9FB87368177ECC25"
May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified:
9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
These messages repeat endlessly, and /var/log/messages grows to
a couple of gigabytes.
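To put a number on the flood, the log can be grouped by daemon and message text. This is a minimal sketch that assumes the classic syslog line layout seen in the excerpt above (timestamp, host, daemon[pid]: message). The function name summarize is hypothetical, and lines that do not match the layout are skipped.

```python
import re
from collections import Counter

# Matches lines such as:
#   May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
SYSLOG_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"(?P<daemon>[^\s\[:]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def summarize(lines, top=10):
    """Count identical (daemon, message) pairs and return the most frequent."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a syslog-formatted line, ignore it
        counts[(match.group("daemon"), match.group("msg"))] += 1
    return counts.most_common(top)
```

Because a file object is iterated line by line, something like summarize(open("/var/log/messages", errors="replace")) should work on a log of several gigabytes without loading it into memory.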
Cluster configuration:
node node1
node node2 \
attributes standby="off"
primitive p_dlm ocf:pacemaker:controld \
op monitor interval="120s"
primitive p_drbd ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="20" role="Master" timeout="20" \
op monitor interval="30" role="Slave" timeout="20"
primitive p_fs ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/data" fstype="ocfs2" \
op monitor interval="120s" \
meta target-role="Started"
primitive p_nfsserver lsb:nfsserver \
op monitor interval="10s" timeout="30s"
primitive p_o2cb ocf:ocfs2:o2cb \
op monitor interval="120s"
primitive p_rpcbind lsb:rpcbind \
op monitor interval="10s" timeout="30s"
primitive p_stonith-ssh stonith:external/ssh \
params hostlist="node1 node2"
group nfs p_rpcbind p_nfsserver
group share-fs p_dlm p_o2cb p_fs
ms ms_drbd p_drbd \
meta resource-stickines="100" notify="true" master-max="2" \
interleave="true" target-role="Started" is-managed="true"
clone cl_nfs nfs \
meta target-role="Started"
clone cl_share-fs share-fs \
meta target-role="Started" is-managed="true"
clone cl_stonith-ssh p_stonith-ssh \
meta is-managed="true"
colocation co_sharefs-drbd inf: cl_share-fs ms_drbd:Master
colocation co_sharefs-nfs inf: cl_share-fs cl_nfs
order o_drbd-sharefs inf: ms_drbd:promote cl_share-fs
order o_sharefs-nfs inf: cl_share-fs cl_nfs
property $id="cib-bootstrap-options" \
dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
stonith-enabled="true" \
last-lrm-refresh="1305635327" \
stonith-action="reboot"
Any ideas on how to make the cluster more robust? I really don't want to end up
with no nodes at all after a failure.
Thanks,
Sascha