Any idea why the resource manager tries to run this command: ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[4059]: 2008/07/14_09:55:03 INFO: Running start for /dev/drbd1 on //mnt/data
twice? I think this is the source of all my problems, because the mount point is busy (since its already mounted), heartbeat just disconnects. Has any one else seen this kind of a behavior? I'm running CentOS 5.1 Any help is appreciated. Thanks, Nikhil -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nikhil Kulkarni Sent: Monday, July 14, 2008 10:13 AM To: [email protected] Subject: [Linux-HA] Wierd heartbeat problem. Hi, I have a weird heartbeat problem and am hoping that someone on this email list can point to something that I am doing wrong. I have drbd+heartbeat+nfs running over 2 systems installed with CentOS 5.1 operating system. My ha.cf is: debugfile /var/log/ha-debug.log logfile /var/log/ha.log logfacility local0 keepalive 2 warntime 10 deadtime 20 initdead 75 bcast eth0 auto_failback off node watchdog-client1 node watchdog-client2 My drbd.conf is global { usage-count yes; } common { protocol C; } resource r0 { startup { wfc-timeout 0; degr-wfc-timeout 120; # 2 minutes. } disk { on-io-error detach; } net { timeout 60; connect-int 10; ping-int 10; max-buffers 4096; max-epoch-size 4096; } syncer { rate 500M; al-extents 257; } on watchdog-client1 { device /dev/drbd1; disk /dev/sdb1; address 10.0.38.40:7789; meta-disk /dev/sda6[0]; } on watchdog-client2 { device /dev/drbd1; disk /dev/sdb1; address 10.0.38.43:7789; meta-disk /dev/sda6[0]; } } This is my haresources file: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 kill nfs Delay::3::0 nfs nfslock watchdog-client1 is the initial Primary node and client2 is the secondary node. When I shutdown client1, client2 becomes the primary and mounts the file system as expected. Everything works like a charm!!! Now, when I power back client1, it never becomes the secondary and the file system on client2 for some reason gets un-mounted and never gets mounted again. The logs on the client1 machine, tell me that it's trying to mount the file system twice (as shown in bold below) and it fails the second time, since the file system is already mounted the first time successfully. Since, it fails the second time, the heartbeat resource just stops all its other processes. Below is my client1 logs: heartbeat[2542]: 2008/07/14_09:54:48 info: Received shutdown notice from 'watchdog-client2'. heartbeat[2542]: 2008/07/14_09:54:48 info: Resources being acquired from watchdog-client2. heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq(): child count 1 heartbeat[2847]: 2008/07/14_09:54:48 info: acquire all HA resources (standby). ResourceManager[2873]: 2008/07/14_09:54:48 info: Acquiring resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock IPaddr[2915]: 2008/07/14_09:54:48 INFO: Resource is stopped IPaddr[2929]: 2008/07/14_09:54:48 INFO: Resource is stopped heartbeat[2848]: 2008/07/14_09:54:48 info: Local Resource acquisition completed. heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq(): child count 2 heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq(): child count 1 ResourceManager[2873]: 2008/07/14_09:54:48 info: Running /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start ResourceManager[2873]: 2008/07/14_09:54:48 debug: Starting /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start IPaddr[3072]: 2008/07/14_09:54:48 INFO: Using calculated netmask for 10.0.38.71: 255.255.255.0 IPaddr[3072]: 2008/07/14_09:54:48 DEBUG: Using calculated broadcast for 10.0.38.71: 10.0.38.255 IPaddr[3072]: 2008/07/14_09:54:48 INFO: eval ifconfig eth0:0 10.0.38.71 netmask 255.255.255.0 broadcast 10.0.38.255 IPaddr[3072]: 2008/07/14_09:54:48 DEBUG: Sending Gratuitous Arp for 10.0.38.71 on eth0:0 [eth0] IPaddr[3043]: 2008/07/14_09:54:48 INFO: Success INFO: Success ResourceManager[2873]: 2008/07/14_09:54:48 debug: /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start done. RC=0 ResourceManager[2873]: 2008/07/14_09:54:48 info: Running /etc/ha.d/resource.d/drbddisk r0 start ResourceManager[2873]: 2008/07/14_09:54:48 debug: Starting /etc/ha.d/resource.d/drbddisk r0 start ResourceManager[2873]: 2008/07/14_09:54:48 debug: /etc/ha.d/resource.d/drbddisk r0 start done. RC=0 Delay[3231]: 2008/07/14_09:54:51 INFO: Delay is stopped Delay[3220]: 2008/07/14_09:54:51 INFO: Resource is stopped ResourceManager[2873]: 2008/07/14_09:54:51 info: Running /etc/ha.d/resource.d/Delay 3 0 start ResourceManager[2873]: 2008/07/14_09:54:51 debug: Starting /etc/ha.d/resource.d/Delay 3 0 start Delay[3278]: 2008/07/14_09:54:54 INFO: Success INFO: Success ResourceManager[2873]: 2008/07/14_09:54:54 debug: /etc/ha.d/resource.d/Delay 3 0 start done. RC=0 Filesystem[3325]: 2008/07/14_09:54:54 INFO: Resource is stopped ResourceManager[2873]: 2008/07/14_09:54:55 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start ResourceManager[2873]: 2008/07/14_09:54:55 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[3406]: 2008/07/14_09:54:55 INFO: Running start for /dev/drbd1 on //mnt/data Filesystem[3395]: 2008/07/14_09:54:55 INFO: Success INFO: Success ResourceManager[2873]: 2008/07/14_09:54:55 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=0 ResourceManager[2873]: 2008/07/14_09:54:55 info: Running /etc/init.d/killnfs start ResourceManager[2873]: 2008/07/14_09:54:55 debug: Starting /etc/init.d/killnfs start kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec] ResourceManager[2873]: 2008/07/14_09:54:56 debug: /etc/init.d/killnfs start done. RC=0 Delay[3515]: 2008/07/14_09:54:59 INFO: Delay is running OK Delay[3504]: 2008/07/14_09:54:59 INFO: Running OK ResourceManager[2873]: 2008/07/14_09:54:59 info: Running /etc/init.d/nfs start ResourceManager[2873]: 2008/07/14_09:54:59 debug: Starting /etc/init.d/nfs start Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] Starting RPC idmapd: [ OK ] ResourceManager[2873]: 2008/07/14_09:55:00 debug: /etc/init.d/nfs start done. RC=0 ResourceManager[2873]: 2008/07/14_09:55:00 info: Running /etc/init.d/nfslock start ResourceManager[2873]: 2008/07/14_09:55:00 debug: Starting /etc/init.d/nfslock start Starting NFS statd: [ OK ] ResourceManager[2873]: 2008/07/14_09:55:00 debug: /etc/init.d/nfslock start done. RC=0 heartbeat[2847]: 2008/07/14_09:55:00 info: all HA resource acquisition completed (standby). heartbeat[2542]: 2008/07/14_09:55:00 info: Standby resource acquisition done [all]. heartbeat[3753]: 2008/07/14_09:55:00 debug: notify_world: setting SIGCHLD Handler to SIG_DFL harc[3753]: 2008/07/14_09:55:00 info: Running /etc/ha.d/rc.d/status status mach_down[3769]: 2008/07/14_09:55:00 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[3769]: 2008/07/14_09:55:00 info: mach_down takeover complete for node watchdog-client2. heartbeat[2542]: 2008/07/14_09:55:00 info: mach_down takeover complete. heartbeat[3803]: 2008/07/14_09:55:00 debug: notify_world: setting SIGCHLD Handler to SIG_DFL harc[3803]: 2008/07/14_09:55:00 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp ip-request-resp[3803]: 2008/07/14_09:55:00 received ip-request-resp IPaddr::10.0.38.71/24/eth0 OK yes ResourceManager[3824]: 2008/07/14_09:55:00 info: Acquiring resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock IPaddr[3851]: 2008/07/14_09:55:00 INFO: Running OK Delay[3941]: 2008/07/14_09:55:03 INFO: Delay is running OK Delay[3930]: 2008/07/14_09:55:03 INFO: Running OK Filesystem[3978]: 2008/07/14_09:55:03 INFO: Resource is stopped ResourceManager[3824]: 2008/07/14_09:55:03 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[4059]: 2008/07/14_09:55:03 INFO: Running start for /dev/drbd1 on //mnt/data mount: /dev/drbd1 already mounted or //mnt/data busy mount: according to mtab, /dev/drbd1 is mounted on /mnt/data Filesystem[4059]: 2008/07/14_09:55:03 ERROR: Couldn't mount filesystem /dev/drbd1 on //mnt/data Filesystem[4048]: 2008/07/14_09:55:03 ERROR: Generic error ERROR: Generic error ResourceManager[3824]: 2008/07/14_09:55:03 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=1 ResourceManager[3824]: 2008/07/14_09:55:03 ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem ResourceManager[3824]: 2008/07/14_09:55:03 CRIT: Giving up resources due to failure of Filesystem::/dev/drbd1:://mnt/data::ext3 ResourceManager[3824]: 2008/07/14_09:55:03 info: Releasing resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock ResourceManager[3824]: 2008/07/14_09:55:03 info: Running /etc/init.d/nfslock stop ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting /etc/init.d/nfslock stop Stopping NFS locking: [ OK ] Stopping NFS statd: [ OK ] ResourceManager[3824]: 2008/07/14_09:55:03 debug: /etc/init.d/nfslock stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:03 info: Running /etc/init.d/nfs stop ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting /etc/init.d/nfs stop Shutting down NFS mountd: [ OK ] Shutting down NFS daemon: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] ResourceManager[3824]: 2008/07/14_09:55:04 debug: /etc/init.d/nfs stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:04 info: Running /etc/ha.d/resource.d/Delay 3 0 stop ResourceManager[3824]: 2008/07/14_09:55:04 debug: Starting /etc/ha.d/resource.d/Delay 3 0 stop Delay[4249]: 2008/07/14_09:55:04 INFO: Success INFO: Success ResourceManager[3824]: 2008/07/14_09:55:04 debug: /etc/ha.d/resource.d/Delay 3 0 stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:04 info: Running /etc/init.d/killnfs stop ResourceManager[3824]: 2008/07/14_09:55:04 debug: Starting /etc/init.d/killnfs stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/init.d/killnfs stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:05 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop Filesystem[4343]: 2008/07/14_09:55:05 INFO: Running stop for /dev/drbd1 on //mnt/data Filesystem[4332]: 2008/07/14_09:55:05 INFO: Success INFO: Success ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:05 info: Running /etc/ha.d/resource.d/Delay 3 0 stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting /etc/ha.d/resource.d/Delay 3 0 stop Delay[4426]: 2008/07/14_09:55:05 INFO: Delay already stopped. Delay[4415]: 2008/07/14_09:55:05 INFO: Success INFO: Success ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/ha.d/resource.d/Delay 3 0 stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:05 info: Running /etc/ha.d/resource.d/drbddisk r0 stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting /etc/ha.d/resource.d/drbddisk r0 stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/ha.d/resource.d/drbddisk r0 stop done. RC=0 ResourceManager[3824]: 2008/07/14_09:55:05 info: Running /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop In IP Stop SIOCDELRT: No such process IPaddr[4522]: 2008/07/14_09:55:05 INFO: ifconfig eth0:0 down IPaddr[4493]: 2008/07/14_09:55:05 INFO: Success INFO: Success ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop done. RC=0 heartbeat[2542]: 2008/07/14_09:55:09 WARN: node watchdog-client2: is dead heartbeat[2542]: 2008/07/14_09:55:09 info: Dead node watchdog-client2 gave up resources. heartbeat[2542]: 2008/07/14_09:55:09 info: Link watchdog-client2:eth0 dead. hb_standby[4564]: 2008/07/14_09:55:35 Going standby [foreign]. heartbeat[2542]: 2008/07/14_09:55:35 info: watchdog-client1 wants to go standby [foreign] heartbeat[2542]: 2008/07/14_09:55:45 WARN: No reply to standby request. Standby request cancelled. [EMAIL PROTECTED] log]# Any idea why it's trying to mount the file system twice? I think that's what's causing the problem. Also, on a reboot, shouldn't client1 become the secondary and client 2 stay as the primary node? If anyone needs any other information, please let me know. I'm trying to fight this problem un-successfully for the past 2-3 days now. Thanks, Nikhil _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
