Hi,

On Thu, Feb 18, 2010 at 05:15:42PM +0100, Patrick Zwahlen wrote:
> Dear list,
>
> I am looking for some advice regarding a freeze that we experienced. My
> project is a 2-node active-passive NFS cluster on two virtual machines.
> I am using CentOS 5.4 x86_64, drbd, xfs, corosync and pacemaker.
> Following are the RPM versions:
>
> From clusterlabs:
> cluster-glue.x86_64        1.0.1-1.el5
> cluster-glue-libs.x86_64   1.0.1-1.el5

Please upgrade to 1.0.3. I'm not sure, but the versions you have may
contain a bad bug.

> corosync.x86_64            1.2.0-1.el5
> corosynclib.x86_64         1.2.0-1.el5
> heartbeat.x86_64           3.0.1-1.el5
> heartbeat-libs.x86_64      3.0.1-1.el5

You don't need both heartbeat and corosync.

> pacemaker.x86_64           1.0.7-2.el5
> pacemaker-libs.x86_64      1.0.7-2.el5
> resource-agents.x86_64     1.0.1-1.el5
>
> From CentOS extras:
> drbd83.x86_64              8.3.2-6.el5_3
> kmod-drbd83.x86_64         8.3.2-6.el5_3
>
> I made many tests before going into production, and the cluster has been
> running fine for some weeks. We regularly test failover by powering off
> one of the physical nodes running the VMs.
>
> Our problem appeared after shutting down the host that was hosting the
> backup node. After powering off the backup node, the primary became
> totally unresponsive, and we lost the NFS store. We had to reboot the
> primary node.
>
> I rebuilt a lab and tried to replicate the problem by powering off the
> backup node. After about 50 tries, I could replicate it and saw that:
>
> - It was not a kernel panic
> - The VM console was totally unresponsive
> - The VM was using 100% CPU
> - I was still able to ping the VM
> - I was unable to log in on the console or via ssh

Anything in the logs? Or is that the log attached?

> I have attached all my config files, as well as the /var/log/messages

You can use hb_report to collect all relevant info.
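Something along these lines should do; the destination path is just an
example, and the exact options can differ a little between cluster-glue
versions, so check hb_report(8) on your nodes:

  # run on one node; it gathers logs, the CIB and other state from both
  # nodes for the given time window and packs everything into a tarball
  hb_report -f "2010-02-04 17:00" -t "2010-02-04 18:00" /tmp/nfs2-freeze

Then attach the resulting tarball (/tmp/nfs2-freeze.tar.bz2 or similar)
to your post.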
> around the crash (messages from the primary node). We see the secondary
> leaving the cluster, drbd activity and then nothing until the reboot.
> Since the crash, I have made one single change to the pacemaker config,
> which was to change my drbd location rule from +INF to 1000, as I
> thought the rule included by drbd fencing (with a -INF weight) could
> conflict with my +INF.

Feb 4 17:41:54 nfs2a lrmd: [3072]: info: RA output: (res_drbd:1:start:stderr) 0 : Failure: (124) Device is attached to a disk (use detach first)
Feb 4 17:41:54 nfs2a lrmd: [3072]: info: RA output: (res_drbd:1:start:stderr) Command 'drbdsetup 0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --fencing=resource-only --on-io-error=detach' terminated with exit code 10
Feb 4 17:41:54 nfs2a drbd[3243]: ERROR: nfs: Called drbdadm -c /etc/drbd.conf --peer nfs2b.test.local up nfs
Feb 4 17:41:54 nfs2a drbd[3243]: ERROR: nfs: Exit code 1
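The "Device is attached to a disk (use detach first)" part suggests the
backing device was already configured when the resource agent tried to
bring drbd up again, so that start failed. Not necessarily related to
the freeze itself, but if you see it again, something like the sketch
below (resource names taken from your config, untested here) shows and
clears the leftover state before letting the cluster retry:

  # what does drbd itself think the resource is doing?
  cat /proc/drbd
  drbdadm cstate nfs      # connection state
  drbdadm dstate nfs      # disk state

  # if it is half-configured outside of the cluster's control,
  # take it down once by hand and let pacemaker start it again
  drbdadm down nfs
  crm resource cleanup res_drbd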
That's what I could find in the logs.

Thanks,

Dejan

> Of course, I have no clue whether this is a
> pacemaker/drbd/corosync/other issue, and I am just looking for advice or
> similar experience. Corosync 1.2.0 being quite new, I thought I might
> make another test using the heartbeat stack.
>
> Any hint appreciated. Thx,
>
> - Patrick -
>
> node nfs2a.test.local \
>         attributes standby="off"
> node nfs2b.test.local \
>         attributes standby="off"
> primitive res_drbd ocf:linbit:drbd \
>         params drbd_resource="nfs" \
>         op monitor interval="9s" role="Master" timeout="20s" \
>         op monitor interval="10s" role="Slave" timeout="20s"
> primitive res_fs ocf:heartbeat:Filesystem \
>         params fstype="xfs" directory="/mnt/drbd" device="/dev/drbd0" options="noatime,nodiratime,logbufs=8" \
>         op monitor interval="10s"
> primitive res_ip ocf:heartbeat:IPaddr2 \
>         params ip="10.1.111.33" \
>         op monitor interval="10s"
> primitive res_nfs lsb:nfs \
>         op monitor interval="10s"
> group grp_nfs res_fs res_nfs res_ip \
>         meta target-role="Started"
> ms ms_drbd res_drbd \
>         meta clone-max="2" notify="true"
> location loc_drbd-master ms_drbd \
>         rule $id="loc_drbd-master-rule" $role="master" 1000: #uname eq nfs2a.test.local
> colocation col_grp_nfs_on_drbd_master inf: grp_nfs ms_drbd:Master
> order ord_drbd_before_grp_nfs inf: ms_drbd:promote grp_nfs:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1263554345"
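P.S. Regarding the +INF vs -INF question above: the constraint which the
drbd fence-peer handler (crm-fence-peer.sh) inserts into the CIB looks
roughly like the one below. The exact id and node expression depend on
the handler version, so take it only as an illustration, but it is the
-INF rule your own location preference has to coexist with, and a finite
score such as your new 1000 is the usual choice next to it.

  location drbd-fence-by-handler-nfs ms_drbd \
          rule $id="drbd-fence-by-handler-rule-nfs" $role="Master" -inf: #uname ne nfs2a.test.local

The counterpart (crm-unfence-peer.sh, run from the after-resync-target
handler) removes the rule again once the peer has resynced, if you have
that configured.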