Dear guys, I have a small problem with an NFS/DRBD/heartbeat cluster: when the primary node comes up again, the secondary node (which had been promoted to primary in the meantime) is unable to release the resources (IP/DRBD) gracefully, and reboots. I know there's resource stickiness with pacemaker, but I couldn't find any info for plain heartbeat. Moreover, is the behavior of node2 normal in my case? (config files are below).
But let's start from the beginning: I set up a cluster following (more or less) this guide (http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat) on a CentOS system.

# node1 is primary, works ok, etc.
[p...@node1 ~]$ /sbin/ifconfig eth0:0
eth0:0    Link encap:Ethernet  HWaddr 00:25:64:3A:DE:68
          inet addr:172.16.6.77  Bcast:172.16.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:dfdf0000-dfe00000

[p...@node1 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by [email protected], 2009-08-22 13:26:57
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:20484 nr:0 dw:136 dr:20865 al:5 bm:6 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:1279 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/257 hits:29 misses:5 starving:0 dirty:0 changed:5

# while node2 is secondary, waiting to take over
[p...@node2 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by [email protected], 2009-08-22 13:26:57
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:20484 dw:20484 dr:0 al:0 bm:5 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:1279 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

# to simulate a disaster, I ungracefully unplug the power cord from node1, and node2 takes over
[p...@node2 ~]$ /sbin/ifconfig eth0:0
eth0:0    Link encap:Ethernet  HWaddr 00:25:64:3A:E3:15
          inet addr:172.16.6.77  Bcast:172.16.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:dfdf0000-dfe00000

[p...@node2 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by [email protected], 2009-08-22 13:26:57
 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:0 nr:72 dw:164 dr:381 al:4 bm:2 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:19 misses:4 starving:0 dirty:0 changed:4

# HERE COMES THE PROBLEM: at this point node1 recovers and becomes
# the primary again BUT the secondary complains loudly

[/var/log/messages]
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Releasing resource group: node1.cluster.nfs.contactlab.lan IPaddr::172.16.6.77/24/eth0 drbddisk::nfsdata Filesystem::/dev/drbd0::/data::ext3 nfslock nfs
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running /etc/init.d/nfs stop
Oct 20 17:44:53 node2 mountd[3630]: Caught signal 15, un-registering and exiting.
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running /etc/init.d/nfslock stop
Oct 20 17:44:53 node2 rpc.statd[3520]: Caught signal 15, un-registering and exiting.
Oct 20 17:44:53 node2 portmap[3914]: connect from 127.0.0.1 to unset(status): request from unprivileged port
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: Running stop for /dev/drbd0 on /data
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: Trying to unmount /data
Oct 20 17:44:53 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGTERM
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:54 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGTERM
Oct 20 17:44:54 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:55 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGTERM
Oct 20 17:44:55 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:56 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGKILL
Oct 20 17:44:56 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:57 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGKILL
Oct 20 17:44:57 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:58 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; trying cleanup with SIGKILL
Oct 20 17:44:58 node2 Filesystem[3953]: INFO: No processes on /data were signalled
Oct 20 17:44:59 node2 Filesystem[3953]: ERROR: Couldn't unmount /data, giving up!
Oct 20 17:44:59 node2 Filesystem[3942]: ERROR: Generic error
[snip]
Oct 20 17:46:12 node2 Filesystem[5844]: ERROR: Couldn't unmount /data, giving up!
Oct 20 17:46:12 node2 Filesystem[5833]: ERROR: Generic error
Oct 20 17:46:12 node2 ResourceManager[3809]: ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem
Oct 20 17:46:12 node2 Filesystem[6006]: INFO: Running OK
Oct 20 17:46:12 node2 ResourceManager[3809]: CRIT: Resource STOP failure. Reboot required!
Oct 20 17:46:12 node2 ResourceManager[3809]: CRIT: Killing heartbeat ungracefully!
Oct 20 17:46:12 node2 kernel: md: stopping all md devices.
Oct 20 17:46:13 node2 kernel: Synchronizing SCSI cache for disk sda:
Oct 20 17:46:13 node2 kernel: usb 4-1: new full speed USB device using uhci_hcd and address 2
Oct 20 17:46:13 node2 kernel: usb 4-1: not running at top speed; connect to a high speed hub
Oct 20 17:46:13 node2 kernel: usb 4-1: configuration #1 chosen from 1 choice
Oct 20 17:46:13 node2 kernel: hub 4-1:1.0: USB hub found
Oct 20 17:46:13 node2 kernel: hub 4-1:1.0: 4 ports detected
Oct 20 17:46:14 node2 kernel: ACPI: PCI interrupt for device 0000:02:00.1 disabled
Oct 20 17:46:14 node2 kernel: ACPI: PCI interrupt for device 0000:02:00.0 disabled
Oct 20 17:48:30 node2 syslogd 1.4.1: restart.
Oct 20 17:48:30 node2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 20 17:48:30 node2 kernel: Linux version 2.6.18-164.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 03:28:30 EDT 2009

/etc/drbd.conf:
--------------
global {
    minor-count 1;
}

resource nfsdata {
    protocol C;

    on node1 {
        device    /dev/drbd0;
        disk      /dev/sda4;
        address   192.168.0.1:7788;
        meta-disk internal;
    }

    on node2 {
        device    /dev/drbd0;
        disk      /dev/sda4;
        address   192.168.0.2:7788;
        meta-disk internal;
    }

    disk {
        on-io-error detach;
    }

    net {
        max-buffers 2048;        #datablock buffers used before writing to disk.
        ko-count 4;              # Peer is dead if this count is exceeded.
        #on-disconnect reconnect; # Peer disconnected, try to reconnect.
    }

    syncer {
        rate 10M;                # Synchronization rate, in megebytes. Good for 100Mb network.
        #group 1;                # Used for grouping resources, parallel sync.
        al-extents 257;          # Must be prime, number of active sets.
    }

    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }
}

/etc/ha.d/ha.cf:
---------------
logfacility local0
keepalive 2
deadtime 10
bcast eth1
node node1 node2

/etc/ha.d/haresources:
---------------------
gnode1 IPaddr::172.16.6.77/24/eth0 drbddisk::nfsdata Filesystem::/dev/drbd0::/data::ext3 nfslock nfs

/etc/ha.d/authkeys:
------------------
auth 3
3 md5 f00b4r
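
For anyone comparing the /proc/drbd dumps above side by side, here is a minimal Python sketch that pulls the connection state (cs), the local/peer roles (st) and the local/peer disk states (ds) out of an 8.0-style status line. It is only an illustration: the name parse_drbd_status is made up, and the regular expression assumes nothing beyond the `cs:`/`st:`/`ds:` layout visible in the output above. It is not taken from any DRBD or heartbeat tool.

```python
import re

# Matches an 8.0-style status line such as:
#   0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
# Assumption: the field order is exactly the one shown in the dumps above.
STATUS_RE = re.compile(
    r"(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<local_role>\w+)/(?P<peer_role>\w+)"
    r"\s+ds:(?P<local_disk>\w+)/(?P<peer_disk>\w+)"
)


def parse_drbd_status(text):
    """Return one dict per DRBD minor found in /proc/drbd-style text.

    Hypothetical helper, for illustration only.
    """
    return [match.groupdict() for match in STATUS_RE.finditer(text)]


if __name__ == "__main__":
    # Status line copied from node2 after node1 lost power.
    sample = "0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---"
    for device in parse_drbd_status(sample):
        print(device)
```

On a live node the same function could be fed the contents of /proc/drbd, for example `parse_drbd_status(open("/proc/drbd").read())`.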
