Hi, With a bit of delay ;-)
On Tue, Oct 28, 2008 at 10:45:49AM -0400, David Miller wrote: > Hi All; > > I've looked at the log files until I'm blind and I've no idea why a simple > two node HA pair failed over. > > The setup is straightforward; two boxes with external RAID storage setup > for failover NFS services. > > Saturday morning the backup server had a problem with the scsi bus, > renegotiated to 40 MB/s, and remounted root in read only mode. It > effectively went offline for a couple of minutes where ipfail and heartbeat > couldn't talk to the primary. When it couldn't see the primary it decided > it was to take over, which makes sense. It may make sense to the computers, but you'd probably hate to lose data: use stonith! > What doesn't make sense is that a couple of minutes later, when the scsi > bit was sorted out and the two servers talked to each other again, is that > they decided to failover to the backup. > > There was never anything wrong with the primary server. > > Any ideas what would have induced the failover? > > Relevant bits: > > Heartbeat v 2.0.7-2 installed via debian packages. Debian etch, updated a > couple of weeks ago. 2.6.18-6-686 kernel. > > haresources has only this resource group: > > nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all > Filesystem::/dev/drbd0::/nfs_exports::ext3 > Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server > start_dnotify.sh > > ha.cfg: > > logfacility local0 > keepalive 4 You could heartbeat more often. Also, set warntime to twice keepalive. > deadtime 30 Increase deadtime to 60. > bcast eth0 eth1 > auto_failback off > node nfs1 > node nfs2 > respawn hacluster /usr/lib/heartbeat/ipfail > ping 192.168.120.1 > > > Logfile from nfs1 (primary server): > > Oct 25 06:43:16 nfs1 heartbeat: [9320]: WARN: node nfs2: is dead > Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Dead node nfs2 gave up > resources. > Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 dead. > Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 dead. > Oct 25 06:43:16 nfs1 ipfail: [9342]: info: Status update: Node nfs2 now has > status dead > Oct 25 06:43:18 nfs1 ipfail: [9342]: info: NS: We are still alive! > Oct 25 06:43:18 nfs1 ipfail: [9342]: info: Link Status update: Link > nfs2/eth0 now has status dead > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: For information on cluster > partitions, See URL: http://linux-ha.org/SplitBrain > Oct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Deadtime value may be too > small. > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: See FAQ for information on > tuning deadtime. > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: URL: > http://linux-ha.org/FAQ#heavy_load > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 up. > Oct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Late heartbeat: Node nfs2: > interval 33030 ms > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Status update for node nfs2: > status active > Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 up. > Oct 25 06:43:19 nfs1 harc[6029]: info: Running /etc/ha.d/rc.d/status status > Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Asking other side for ping node > count. > Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Checking remote count of ping > nodes. > Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Link Status update: Link > nfs2/eth1 now has status dead > Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Asking other side for ping node > count. > Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Checking remote count of ping > nodes. > Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Link > nfs2/eth1 now has status up > Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Status update: Node nfs2 now has > status active > Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Link > nfs2/eth0 now has status up > Oct 25 06:43:21 nfs1 heartbeat: [9320]: info: Heartbeat shutdown in > progress. (9320) > Oct 25 06:43:21 nfs1 heartbeat: [6039]: info: Giving up all HA resources. > Oct 25 06:43:21 nfs1 ResourceManager[6049]: info: Releasing resource group: > nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all > Filesystem::/dev/drbd0::/nfs_exports::ext3 > Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server > start_dnotify.sh > Oct 25 06:43:21 nfs1 ResourceManager[6049]: info: Running > /etc/ha.d/resource.d/start_dnotify.sh stop > Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running > /etc/init.d/nfs-kernel-server stop > Oct 25 06:43:22 nfs1 ipfail: [9342]: info: No giveup timer to abort. > Oct 25 06:43:22 nfs1 kernel: nfsd: last server has exited > Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 stop > Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Running stop for /dev/drbd1 on > /llnw_backups > Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Trying to unmount /llnw_backups > Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: unmounted /llnw_backups > successfully > Oct 25 06:43:22 nfs1 Filesystem[6139]: INFO: Filesystem Success > Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 stop > Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Running stop for /dev/drbd0 on > /nfs_exports > Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Trying to unmount /nfs_exports > Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Some processes on /nfs_exports > were signalled > Oct 25 06:43:23 nfs1 ipfail: [9342]: info: No giveup timer to abort. > Oct 25 06:43:25 nfs1 Filesystem[6329]: INFO: unmounted /nfs_exports > successfully > Oct 25 06:43:25 nfs1 Filesystem[6265]: INFO: Filesystem Success > Oct 25 06:43:25 nfs1 ResourceManager[6049]: info: Running > /etc/ha.d/resource.d/drbddisk all stop > Oct 25 06:43:26 nfs1 kernel: drbd0: Primary/Secondary --> > Secondary/Secondary > Oct 25 06:43:26 nfs1 kernel: drbd1: Primary/Secondary --> > Secondary/Secondary > Oct 25 06:43:26 nfs1 ResourceManager[6049]: info: Running > /etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 stop > Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/route -n del -host > 192.168.120.230 > Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/ifconfig eth0:0 > 192.168.120.230 down > Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: IP Address 192.168.120.230 released > Oct 25 06:43:27 nfs1 IPaddr[6443]: INFO: IPaddr Success > Oct 25 06:43:27 nfs1 heartbeat: [6039]: info: All HA resources relinquished. > Oct 25 06:43:28 nfs1 kernel: drbd0: Secondary/Secondary --> > Secondary/Primary > Oct 25 06:43:28 nfs1 kernel: drbd1: Secondary/Secondary --> > Secondary/Primary > Oct 25 06:43:28 nfs1 heartbeat: [9320]: info: killing > /usr/lib/heartbeat/ipfail process group 9342 with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9328 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9329 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBFIFO process 9323 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9324 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9325 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9326 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9327 > with signal 15 > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9328 exited. 7 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9323 exited. 6 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9329 exited. 5 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9325 exited. 4 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9326 exited. 3 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9327 exited. 2 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9324 exited. 1 > remaining > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: nfs1 Heartbeat shutdown > complete. > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Heartbeat restart triggered. > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Restarting heartbeat. > Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Performing heartbeat restart > exec. > Oct 25 06:43:48 nfs1 syslogd 1.4.1#18: restart. > Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Core dumps could be lost if > multiple dumps occur > Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Consider setting > /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum > supportability > Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Logging daemon is disabled > --enabling logging daemon is recommended > Oct 25 06:44:01 nfs1 heartbeat: [9320]: info: ************************** > [finishes restarting heartbeat] This node decided to reboot after realizing that there was a split brain condition. That's a (very simple) recovery procedure. The other node never figured out that there was a split brain due to lack of cpu attention/hw problems. It, the secondary node, just noticed that the primary node was leaving. In short: heartbeat reacted as it should have. If you experience often hardware induced heartbeat delays, consider increasing deadtime, or, even better, fixing/replacing that SCSI controller. Sorry for such a late reply. Thanks, Dejan > =============================== > > Log file from nfs2: > > Oct 25 06:42:00 nfs2 kernel: mptscsih: ioc0: attempting task abort! > (sc=df8111c0) > Oct 25 06:42:00 nfs2 kernel: sd 0:0:1:0: > Oct 25 06:42:00 nfs2 kernel: command: Write(10): 2a 00 28 6c 02 8f > 00 00 30 00 > Oct 25 06:42:02 nfs2 kernel: mptbase: Initiating ioc0 recovery > Oct 25 06:42:08 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS > (sc=df8111c0) > Oct 25 06:42:08 nfs2 kernel: target0:0:0: Beginning Domain Validation > Oct 25 06:42:09 nfs2 kernel: target0:0:0: Domain Validation skipping write > tests > Oct 25 06:42:09 nfs2 kernel: target0:0:0: Ending Domain Validation > Oct 25 06:42:09 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25 > ns, offset 127) > Oct 25 06:42:09 nfs2 kernel: target0:0:1: Beginning Domain Validation > Oct 25 06:42:09 nfs2 kernel: target0:0:1: Domain Validation skipping write > tests > Oct 25 06:42:09 nfs2 kernel: target0:0:1: Ending Domain Validation > Oct 25 06:42:09 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25 > ns, offset 127) > Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: attempting task abort! > (sc=eb523980) > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: G_CH_dispatch_int: Dispatch > function for read child took too long to execute: 31980 ms (> 50 ms) > (GSource: 0x80fbbb8) > Oct 25 06:43:19 nfs2 kernel: sd 0:0:1:0: > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: > Dispatch function for send local status was delayed 29200 ms (> 2010 ms) > before being called (GSource: 0x80fbeb0) > Oct 25 06:43:19 nfs2 kernel: command: Write(10): 2a 00 00 01 b9 17 > 00 00 28 00 > Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: > started at 1751846343 should have started at 1751843423 > Oct 25 06:43:19 nfs2 kernel: mptbase: Initiating ioc0 recovery > Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS > (sc=eb523980) > Oct 25 06:43:19 nfs2 kernel: target0:0:0: Beginning Domain Validation > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node nfs2: > interval 33200 ms > Oct 25 06:43:19 nfs2 kernel: target0:0:0: Domain Validation skipping write > tests > Oct 25 06:43:19 nfs2 kernel: target0:0:0: Ending Domain Validation > Oct 25 06:43:19 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25 > ns, offset 127) > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node nfs1: > interval 33980 ms > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node > 192.168.120.1: interval 32980 ms > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: > Dispatch function for check for signals was delayed 31670 ms (> 2010 ms) > before being called (GSource: 0x80fc1b8) > Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: > started at 1751846344 should have started at 1751843177 > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: > Dispatch function for update msgfree count was delayed 31990 ms (> 30000 > ms) before being called (GSource: 0x80fc288) > Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: > started at 1751846344 should have started at 1751843145 > Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: > Dispatch function for client audit was delayed 27510 ms (> 5000 ms) before > being called (GSource: 0x80fc0e8) > Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: > started at 1751846344 should have started at 1751843593 > Oct 25 06:43:19 nfs2 kernel: target0:0:1: Beginning Domain Validation > Oct 25 06:43:19 nfs2 kernel: target0:0:1: Domain Validation skipping write > tests > Oct 25 06:43:19 nfs2 kernel: target0:0:1: Ending Domain Validation > Oct 25 06:43:19 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25 > ns, offset 127) > Oct 25 06:43:21 nfs2 ipfail: [22731]: info: Ping node count is balanced. > Oct 25 06:43:22 nfs2 ipfail: [22731]: info: Ping node count is balanced. > Oct 25 06:43:26 nfs2 kernel: drbd0: Secondary/Primary --> > Secondary/Secondary > Oct 25 06:43:26 nfs2 kernel: drbd1: Secondary/Primary --> > Secondary/Secondary > Oct 25 06:43:27 nfs2 heartbeat: [22707]: info: Received shutdown notice > from 'nfs1'. > Oct 25 06:43:27 nfs2 heartbeat: [22707]: info: Resources being acquired > from nfs1. > Oct 25 06:43:27 nfs2 heartbeat: [32715]: info: acquire all HA resources > (standby). > Oct 25 06:43:28 nfs2 heartbeat: [32716]: info: No local resources > [/usr/lib/heartbeat/ResourceManager listkeys nfs2] to acquire. > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Acquiring resource > group: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all > Filesystem::/dev/drbd0::/nfs_exports::ext3 > Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server > start_dnotify.sh > Oct 25 06:43:28 nfs2 IPaddr[32759]: INFO: IPaddr Resource is stopped > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 start > Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: eval /sbin/ifconfig eth0:0 > 192.168.120.230 netmask 255.255.255.0 broadcast 192.168.120.255 > Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: Sending Gratuitous Arp for > 192.168.120.230 on eth0:0 [eth0] > Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: /usr/lib/heartbeat/send_arp -i 500 > -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.120.230 eth0 > 192.168.120.230 auto 192.168.120.230 ffffffffffff > Oct 25 06:43:28 nfs2 IPaddr[409]: INFO: IPaddr Success > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/ha.d/resource.d/drbddisk all start > Oct 25 06:43:28 nfs2 kernel: drbd0: Secondary/Secondary --> > Primary/Secondary > Oct 25 06:43:28 nfs2 kernel: drbd1: Secondary/Secondary --> > Primary/Secondary > Oct 25 06:43:28 nfs2 Filesystem[697]: INFO: Running status for /dev/drbd0 > on /nfs_exports > Oct 25 06:43:28 nfs2 Filesystem[697]: INFO: /nfs_exports is unmounted > (stopped) > Oct 25 06:43:28 nfs2 Filesystem[633]: INFO: Filesystem Resource is stopped > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 start > Oct 25 06:43:28 nfs2 Filesystem[812]: INFO: Running start for /dev/drbd0 on > /nfs_exports > Oct 25 06:43:28 nfs2 kernel: kjournald starting. Commit interval 5 seconds > Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd0, internal journal > Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordered data > mode. > Oct 25 06:43:28 nfs2 Filesystem[743]: INFO: Filesystem Success > Oct 25 06:43:28 nfs2 Filesystem[921]: INFO: Running status for /dev/drbd1 > on /llnw_backups > Oct 25 06:43:28 nfs2 Filesystem[921]: INFO: /llnw_backups is unmounted > (stopped) > Oct 25 06:43:28 nfs2 Filesystem[853]: INFO: Filesystem Resource is stopped > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 start > Oct 25 06:43:28 nfs2 Filesystem[1030]: INFO: Running start for /dev/drbd1 > on /llnw_backups > Oct 25 06:43:28 nfs2 kernel: kjournald starting. Commit interval 5 seconds > Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd1, internal journal > Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordered data > mode. > Oct 25 06:43:28 nfs2 Filesystem[966]: INFO: Filesystem Success > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/init.d/nfs-kernel-server start > Oct 25 06:43:28 nfs2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the > NFSv4 state recovery directory > Oct 25 06:43:28 nfs2 kernel: NFSD: starting 90-second grace period > Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running > /etc/ha.d/resource.d/start_dnotify.sh start > Oct 25 06:43:32 nfs2 heartbeat: [32715]: info: all HA resource acquisition > completed (standby). > Oct 25 06:43:32 nfs2 heartbeat: [22707]: info: Standby resource acquisition > done [all]. > Oct 25 06:43:32 nfs2 harc[1324]: info: Running /etc/ha.d/rc.d/status status > Oct 25 06:43:32 nfs2 mach_down[1334]: info: Taking over resource group > IPaddr::192.168.120.230/24/eth0 > Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Acquiring resource group: > nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all > Filesystem::/dev/drbd0::/nfs_exports::ext3 > Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server > start_dnotify.sh > Oct 25 06:43:32 nfs2 IPaddr[1378]: INFO: IPaddr Running OK > Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Running > /etc/ha.d/resource.d/drbddisk all start > Oct 25 06:43:32 nfs2 Filesystem[1601]: INFO: Running status for /dev/drbd0 > on /nfs_exports > Oct 25 06:43:32 nfs2 Filesystem[1601]: INFO: /nfs_exports is mounted > (running) > Oct 25 06:43:32 nfs2 Filesystem[1537]: INFO: Filesystem Running OK > Oct 25 06:43:33 nfs2 Filesystem[1700]: INFO: Running status for /dev/drbd1 > on /llnw_backups > Oct 25 06:43:33 nfs2 Filesystem[1700]: INFO: /llnw_backups is mounted > (running) > Oct 25 06:43:33 nfs2 Filesystem[1636]: INFO: Filesystem Running OK > Oct 25 06:43:33 nfs2 ResourceManager[1354]: info: Running > /etc/ha.d/resource.d/start_dnotify.sh start > Oct 25 06:43:33 nfs2 mach_down[1334]: info: /usr/lib/heartbeat/mach_down: > nice_failback: foreign resources acquired > Oct 25 06:43:33 nfs2 mach_down[1334]: info: mach_down takeover complete for > node nfs1. > Oct 25 06:43:33 nfs2 heartbeat: [22707]: info: mach_down takeover complete. > > > > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
