Hi everyone,

I have a two-node cluster running Pacemaker 1.1.2 and DRBD 8.3.7. It is an 
active/active cluster, so the DRBD partition is used with OCFS2. For testing 
purposes I have configured external/ssh as the STONITH device. The following 
test left the surviving node unstable and unusable:


- Pulled the HA network cable
- Put it back after a couple of seconds

Result:

- Node 2 is restarted
- Load average on Node 1 increases until the system becomes unreachable
- OCFS2 produces a massive amount of log messages on the surviving node
  (see below)
- The DRBD partition is not accessible
- Node 1 cannot be rebooted; only a hard reset brings it back to life

Log messages:

May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified: 9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
May 16 11:46:40 node1 ocfs2_controld[4058]: Notified for "9253D9511DFE4DBD9FB87368177ECC25", node 1, status -11
May 16 11:46:40 node1 ocfs2_controld[4058]: Sending notification of node 1 for "9253D9511DFE4DBD9FB87368177ECC25"
May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified: 9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol
May 16 11:46:40 node1 ocfs2_controld[4058]: Notified for "9253D9511DFE4DBD9FB87368177ECC25", node 1, status -11
May 16 11:46:40 node1 ocfs2_controld[4058]: Sending notification of node 1 for "9253D9511DFE4DBD9FB87368177ECC25"
May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified: 9253D9511DFE4DBD9FB87368177ECC25 set_fs_notified 1 zero check_fs
May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol

These messages repeat endlessly, and /var/log/messages grows to a couple of 
gigabytes.
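
To see which daemon dominates a flood like this without opening a 
multi-gigabyte file in an editor, a small script along the following lines 
could tally the messages. This is only a minimal sketch: the regular 
expression assumes the traditional syslog line format seen in the excerpt 
above, and the names summarize and SAMPLE are illustrative, not part of any 
cluster tool.

```python
import re
from collections import Counter

# Assumed line format: "May 16 11:46:40 node1 daemon[pid]: message"
SYSLOG_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\s"
    r"(?P<daemon>[\w\-/.]+)\[(?P<pid>\d+)\]:\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Count occurrences of each (daemon, message) pair, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line.rstrip("\n"))
        if match:  # lines that do not look like syslog entries are skipped
            counts[(match.group("daemon"), match.group("msg"))] += 1
    return counts


# The excerpt from this post, with wrapped lines joined back together.
FS_UUID = "9253D9511DFE4DBD9FB87368177ECC25"
CYCLE = [
    f"May 16 11:46:40 node1 cluster-dlm[3964]: set_fs_notified: {FS_UUID} "
    "set_fs_notified 1 zero check_fs",
    "May 16 11:46:40 node1 ocfs2_controld[4058]: message from dlmcontrol",
    f'May 16 11:46:40 node1 ocfs2_controld[4058]: Notified for "{FS_UUID}", '
    "node 1, status -11",
    "May 16 11:46:40 node1 ocfs2_controld[4058]: Sending notification of "
    f'node 1 for "{FS_UUID}"',
]
SAMPLE = CYCLE + CYCLE + CYCLE[:2]

counts = summarize(SAMPLE)
for (daemon, msg), n in counts.most_common():
    print(f"{n:6d}  {daemon}: {msg}")
```

For the real file, passing an open file object works the same way, for 
example summarize(open("/var/log/messages")), since the function reads one 
line at a time and never holds the whole log in memory.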

Cluster configuration:

node node1
node node2 \
        attributes standby="off"
primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="120s"
primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20"
primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/data" fstype="ocfs2" \
        op monitor interval="120s" \
        meta target-role="Started"
primitive p_nfsserver lsb:nfsserver \
        op monitor interval="10s" timeout="30s"
primitive p_o2cb ocf:ocfs2:o2cb \
        op monitor interval="120s"
primitive p_rpcbind lsb:rpcbind \
        op monitor interval="10s" timeout="30s"
primitive p_stonith-ssh stonith:external/ssh \
        params hostlist="node1 node2"
group nfs p_rpcbind p_nfsserver
group share-fs p_dlm p_o2cb p_fs
ms ms_drbd p_drbd \
        meta resource-stickines="100" notify="true" master-max="2" \
        interleave="true" target-role="Started" is-managed="true"
clone cl_nfs nfs \
        meta target-role="Started"
clone cl_share-fs share-fs \
        meta target-role="Started" is-managed="true"
clone cl_stonith-ssh p_stonith-ssh \
        meta is-managed="true"
colocation co_sharefs-drbd inf: cl_share-fs ms_drbd:Master
colocation co_sharefs-nfs inf: cl_share-fs cl_nfs
order o_drbd-sharefs inf: ms_drbd:promote cl_share-fs
order o_sharefs-nfs inf: cl_share-fs cl_nfs
property $id="cib-bootstrap-options" \
        dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="true" \
        last-lrm-refresh="1305635327" \
        stonith-action="reboot"

Any ideas on how to make the cluster more robust? I really don't want to end up 
with no nodes at all after a failure.

Thanks,
Sascha
