Just refreshing this topic. I really want to understand why healthy node ftp02 gave up its resources to overloaded node ftp01. From looking at the logs below, it appears that ftp02 did not received ftp01's heartbeats within the deadtime and decalred ftp01 dead. Then ftp01 came back a short while later and ftp02 said, "Nice to have you back. Here's my resources."
> I woke up this morning and discovered that one of my clusters > had failed > over during the night. Everything was working fine, but I > wanted to know > what happened. From reading the logs, it looks to like the > primary node > ftp02 gave up its resources to the secondary node ftp01, which had > become overloaded and failed to report in within the deadtime. (ftp01 > was busy taking a backup and was experiencing high iowait, > which I think > explains why it failed to send heartbeats soon enough.) > > Why would apparently healthy node ftp02 give its resources to the > overloaded ftp01? > > Here is the pertinent section of the log from ftp02, which was primary > before the failover: > > > > heartbeat[5391]: 2011/03/03_04:03:20 info: Link > ftp01.mycharts.md:/dev/ttyS0 dead. > heartbeat[5391]: 2011/03/03_04:03:20 info: Link > ftp01.mycharts.md:bond0 > dead. > ipfail[5775]: 2011/03/03_04:03:20 info: Link Status update: Link > ftp01.mycharts.md//dev/ttyS0 now has status dead > heartbeat[5391]: 2011/03/03_04:03:20 info: Link ftp01.mycharts.md:eth2 > dead. > ipfail[5775]: 2011/03/03_04:03:20 debug: Found ping node > 192.168.10.100! > ipfail[5775]: 2011/03/03_04:03:20 info: Asking other side for > ping node > count. > ipfail[5775]: 2011/03/03_04:03:20 debug: Message [num_ping] sent. > ipfail[5775]: 2011/03/03_04:03:20 info: Checking remote count of ping > nodes. > ipfail[5775]: 2011/03/03_04:03:20 info: Link Status update: Link > ftp01.mycharts.md/bond0 now has status dead > ipfail[5775]: 2011/03/03_04:03:21 debug: Found ping node > 192.168.10.100! > heartbeat[5391]: 2011/03/03_04:03:21 WARN: node ftp01.mycharts.md: is > dead > heartbeat[5391]: 2011/03/03_04:03:21 info: Dead node ftp01.mycharts.md > gave up resources. > heartbeat[5391]: 2011/03/03_04:03:21 debug: > hb_rsc_recover_dead_resources: other now stable > ipfail[5775]: 2011/03/03_04:03:21 info: Asking other side for > ping node > count. > ipfail[5775]: 2011/03/03_04:03:21 debug: Message [num_ping] sent. > ipfail[5775]: 2011/03/03_04:03:21 info: Checking remote count of ping > nodes. > ipfail[5775]: 2011/03/03_04:03:21 info: Link Status update: Link > ftp01.mycharts.md/eth2 now has status dead > ipfail[5775]: 2011/03/03_04:03:21 debug: Found ping node > 192.168.10.100! > ipfail[5775]: 2011/03/03_04:03:21 info: Asking other side for > ping node > count. > ipfail[5775]: 2011/03/03_04:03:21 debug: Message [num_ping] sent. > ipfail[5775]: 2011/03/03_04:03:21 info: Checking remote count of ping > nodes. > ipfail[5775]: 2011/03/03_04:03:21 info: Status update: Node > ftp01.mycharts.md now has status dead > ipfail[5775]: 2011/03/03_04:03:22 debug: Found ping node > 192.168.10.100! > ipfail[5775]: 2011/03/03_04:03:22 info: NS: We are still alive! > heartbeat[5391]: 2011/03/03_04:03:38 CRIT: Cluster node > ftp01.mycharts.md returning after partition. > heartbeat[5391]: 2011/03/03_04:03:38 info: For information on cluster > partitions, See URL: http://linux-ha.org/SplitBrain > <http://linux-ha.org/SplitBrain> > heartbeat[5391]: 2011/03/03_04:03:38 WARN: Deadtime value may be too > small. > heartbeat[5391]: 2011/03/03_04:03:38 info: See FAQ for information on > tuning deadtime. > heartbeat[5391]: 2011/03/03_04:03:38 info: URL: > http://linux-ha.org/FAQ#heavy_load > <http://linux-ha.org/FAQ#heavy_load> > heartbeat[5391]: 2011/03/03_04:03:38 info: Link > ftp01.mycharts.md:/dev/ttyS0 up. > heartbeat[5391]: 2011/03/03_04:03:38 WARN: Late heartbeat: Node > ftp01.mycharts.md: interval 49030 ms > heartbeat[5391]: 2011/03/03_04:03:38 info: Status update for node > ftp01.mycharts.md: status active > ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link > ftp01.mycharts.md//dev/ttyS0 now has status up > heartbeat[5391]: 2011/03/03_04:03:38 debug: Status seqno: 568925 > msgtime: 1299153818 > heartbeat[5391]: 2011/03/03_04:03:38 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5391]: 2011/03/03_04:03:38 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5391]: 2011/03/03_04:03:38 debug: Process [status] > started pid > 16973 > heartbeat[5391]: 2011/03/03_04:03:38 debug: Starting notify process > [status] > ipfail[5775]: 2011/03/03_04:03:38 info: Status update: Node > ftp01.mycharts.md now has status active > heartbeat[5391]: 2011/03/03_04:03:38 info: Link > ftp01.mycharts.md:bond0 > up. > ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link > ftp01.mycharts.md/bond0 now has status up > heartbeat[5391]: 2011/03/03_04:03:38 info: Link ftp01.mycharts.md:eth2 > up. > heartbeat[16973]: 2011/03/03_04:03:38 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link > ftp01.mycharts.md/eth2 now has status up > heartbeat[16973]: 2011/03/03_04:03:38 debug: notify_world: > Running harc > status > harc[16973]: 2011/03/03_04:03:38 info: Running /etc/ha.d/rc.d/status > status > heartbeat[5391]: 2011/03/03_04:03:38 info: Managed status > process 16973 > exited with return code 0. > heartbeat[5391]: 2011/03/03_04:03:38 debug: RscMgmtProc > 'status' exited > code 0 > heartbeat[5391]: 2011/03/03_04:03:40 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5391]: 2011/03/03_04:03:40 info: hb_giveup_resources(): > current status: active > heartbeat[5391]: 2011/03/03_04:03:40 debug: Sending hold > resources msg: > none, stable=0 # shutdown > heartbeat[5391]: 2011/03/03_04:03:40 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 > heartbeat[5391]: 2011/03/03_04:03:40 info: Heartbeat shutdown in > progress. (5391) > heartbeat[5391]: 2011/03/03_04:03:40 debug: Process > [hb_giveup_resources] started pid 17023 > heartbeat[17023]: 2011/03/03_04:03:40 info: Giving up all HA > resources. > ResourceManager[17036]: 2011/03/03_04:03:40 info: Releasing resource > group: ftp01.mycharts.md drbddisk::ha_ftp > Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime > IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs > ResourceManager[17036]: 2011/03/03_04:03:40 info: Running > /etc/init.d/nfs stop > ResourceManager[17036]: 2011/03/03_04:03:40 debug: Starting > /etc/init.d/nfs stop > Shutting down NFS mountd: [ OK ] > Shutting down NFS daemon: [ OK ] > Shutting down NFS quotas: [ OK ] > Shutting down NFS services: [ OK ] > ResourceManager[17036]: 2011/03/03_04:03:41 debug: > /etc/init.d/nfs stop > done. RC=0 > ResourceManager[17036]: 2011/03/03_04:03:41 info: Running > /etc/init.d/pure-ftpd stop > ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting > /etc/init.d/pure-ftpd stop > Stopping pure-ftpd: [ OK ] > ResourceManager[17036]: 2011/03/03_04:03:41 debug: > /etc/init.d/pure-ftpd > stop done. RC=0 > ResourceManager[17036]: 2011/03/03_04:03:41 info: Running > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop > ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop > IPaddr2[17196]: 2011/03/03_04:03:41 INFO: ip -f inet addr delete > 192.168.10.203/24 dev bond0 > IPaddr2[17196]: 2011/03/03_04:03:41 INFO: ip -o -f inet addr > show bond0 > IPaddr2[17167]: 2011/03/03_04:03:41 INFO: Success > INFO: Success > ResourceManager[17036]: 2011/03/03_04:03:41 debug: > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop done. RC=0 > ResourceManager[17036]: 2011/03/03_04:03:41 info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop > ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop > Filesystem[17277]: 2011/03/03_04:03:41 INFO: Running stop for > /dev/drbd0 > on /ha_ftp > Filesystem[17277]: 2011/03/03_04:03:41 INFO: Trying to unmount /ha_ftp > umount: /ha_ftp: device is busy > umount: /ha_ftp: device is busy > Filesystem[17277]: 2011/03/03_04:03:41 ERROR: Couldn't > unmount /ha_ftp; > trying cleanup with SIGTERM > /ha_ftp: 8796c > Filesystem[17277]: 2011/03/03_04:03:42 INFO: Some processes on /ha_ftp > were signalled > ipfail[5775]: 2011/03/03_04:03:45 info: No giveup timer to abort. > ipfail[5775]: 2011/03/03_04:03:47 info: No giveup timer to abort. > ipfail[5775]: 2011/03/03_04:03:48 info: No giveup timer to abort. > Filesystem[17277]: 2011/03/03_04:03:54 INFO: unmounted /ha_ftp > successfully > Filesystem[17266]: 2011/03/03_04:03:54 INFO: Success > INFO: Success > ResourceManager[17036]: 2011/03/03_04:03:54 debug: > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop > done. RC=0 > ResourceManager[17036]: 2011/03/03_04:03:54 info: Running > /etc/ha.d/resource.d/drbddisk ha_ftp stop > ResourceManager[17036]: 2011/03/03_04:03:54 debug: Starting > /etc/ha.d/resource.d/drbddisk ha_ftp stop > ResourceManager[17036]: 2011/03/03_04:03:54 debug: > /etc/ha.d/resource.d/drbddisk ha_ftp stop done. RC=0 > heartbeat[17023]: 2011/03/03_04:03:54 info: All HA resources > relinquished. > heartbeat[17023]: 2011/03/03_04:03:54 debug: Sending T_SHUTDONE. > heartbeat[17023]: 2011/03/03_04:03:54 info: FIFO message > [type shutdone] > written rc=27 > heartbeat[5391]: 2011/03/03_04:03:54 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 > heartbeat[5391]: 2011/03/03_04:03:54 debug: Received > T_SHUTDONE from us. > heartbeat[5391]: 2011/03/03_04:03:54 debug: Calling > hb_mcp_final_shutdown in a second. > heartbeat[5391]: 2011/03/03_04:03:54 debug: RscMgmtProc > 'hb_giveup_resources' exited code 0 > heartbeat[5391]: 2011/03/03_04:03:55 debug: hb_mcp_final_shutdown() > phase 0 > heartbeat[5391]: 2011/03/03_04:03:55 debug: Shutting down client > /usr/lib/heartbeat/ipfail > heartbeat[5391]: 2011/03/03_04:03:55 info: killing > /usr/lib/heartbeat/ipfail process group 5775 with signal 15 > heartbeat[5391]: 2011/03/03_04:03:55 info: other_holds_resources: 3 > heartbeat[5391]: 2011/03/03_04:03:55 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 > heartbeat[5391]: 2011/03/03_04:03:56 debug: G_remove_client(pid=5775, > reason='died' gsource=0x9e4b348) { > heartbeat[5391]: 2011/03/03_04:03:56 debug: api_remove_client_int: > removing pid [5775] reason: died > heartbeat[5391]: 2011/03/03_04:03:56 debug: api_send_client: > client 5775 > died > heartbeat[5391]: 2011/03/03_04:03:56 debug: }/*G_remove_client;*/ > heartbeat[5391]: 2011/03/03_04:03:56 debug: Final client > "/usr/lib/heartbeat/ipfail" died. > heartbeat[5391]: 2011/03/03_04:03:56 debug: hb_mcp_final_shutdown() > phase 1 > heartbeat[5391]: 2011/03/03_04:03:57 debug: hb_mcp_final_shutdown() > phase 2 > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5500 > with signal 15 > heartbeat[5497]: 2011/03/03_04:03:57 debug: Process 5497 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE > process 5501 > with signal 15 > heartbeat[5497]: 2011/03/03_04:03:57 debug: Exiting from pid 5497 > [rc=15] > heartbeat[5503]: 2011/03/03_04:03:57 debug: Process 5503 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5502 > with signal 15 > heartbeat[5503]: 2011/03/03_04:03:57 debug: Exiting from pid 5503 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE > process 5503 > with signal 15 > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5504 > with signal 15 > heartbeat[5501]: 2011/03/03_04:03:57 debug: Process 5501 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE > process 5494 > with signal 15 > heartbeat[5504]: 2011/03/03_04:03:57 debug: Process 5504 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5495 > with signal 15 > heartbeat[5501]: 2011/03/03_04:03:57 debug: Exiting from pid 5501 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBFIFO process 5419 > with signal 15 > heartbeat[5419]: 2011/03/03_04:03:57 debug: Process 5419 processing > SIGTERM > heartbeat[5502]: 2011/03/03_04:03:57 debug: Process 5502 processing > SIGTERM > heartbeat[5502]: 2011/03/03_04:03:57 debug: Exiting from pid 5502 > [rc=15] > heartbeat[5500]: 2011/03/03_04:03:57 debug: Process 5500 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE > process 5497 > with signal 15 > heartbeat[5494]: 2011/03/03_04:03:57 debug: Process 5494 processing > SIGTERM > heartbeat[5494]: 2011/03/03_04:03:57 debug: Calling > localdie() function > heartbeat[5495]: 2011/03/03_04:03:57 debug: Process 5495 processing > SIGTERM > heartbeat[5495]: 2011/03/03_04:03:57 debug: Exiting from pid 5495 > [rc=15] > heartbeat[5499]: 2011/03/03_04:03:57 debug: Process 5499 processing > SIGTERM > heartbeat[5498]: 2011/03/03_04:03:57 debug: Process 5498 processing > SIGTERM > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5498 > with signal 15 > heartbeat[5494]: 2011/03/03_04:03:57 debug: glib: serial_localdie: > Flushing tty > heartbeat[5499]: 2011/03/03_04:03:57 debug: Exiting from pid 5499 > [rc=15] > heartbeat[5494]: 2011/03/03_04:03:57 debug: Exiting from pid 5494 > [rc=15] > heartbeat[5498]: 2011/03/03_04:03:57 debug: Exiting from pid 5498 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE > process 5499 > with signal 15 > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5494 > exited. 11 > remaining > heartbeat[5419]: 2011/03/03_04:03:57 debug: Exiting from pid 5419 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5495 > exited. 10 > remaining > heartbeat[5504]: 2011/03/03_04:03:57 debug: Exiting from pid 5504 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5497 exited. 9 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5498 exited. 8 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5499 exited. 7 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5501 exited. 6 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5502 exited. 5 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5503 exited. 4 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5419 exited. 3 > remaining > heartbeat[5500]: 2011/03/03_04:03:57 debug: Exiting from pid 5500 > [rc=15] > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5504 exited. 2 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5500 exited. 1 > remaining > heartbeat[5391]: 2011/03/03_04:03:57 info: ftp02.mycharts.md Heartbeat > shutdown complete. > heartbeat[5391]: 2011/03/03_04:03:57 info: Heartbeat restart > triggered. > heartbeat[5391]: 2011/03/03_04:03:57 info: Restarting heartbeat. > heartbeat[5391]: 2011/03/03_04:03:57 info: Performing > heartbeat restart > exec. > heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=hacluster, > gid=haclient > uid=hacluster > heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=hacluster, gid=<null> > heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=<null>, gid=haclient > heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=root, gid=<null> > heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=<null>, gid=haclient > heartbeat[5391]: 2011/03/03_04:04:28 debug: Beginning authentication > parsing > heartbeat[5391]: 2011/03/03_04:04:28 debug: 16 max authentication > methods > heartbeat[5391]: 2011/03/03_04:04:28 debug: Keyfile opened > heartbeat[5391]: 2011/03/03_04:04:28 debug: Keyfile perms OK > heartbeat[5391]: 2011/03/03_04:04:28 debug: 16 max authentication > methods > heartbeat[5391]: 2011/03/03_04:04:28 debug: Found > authentication method > [sha1] > heartbeat[5391]: 2011/03/03_04:04:28 info: AUTH: i=2: key = > 0x1412cf48, > auth=0x2b4e4e66a610, authname=sha1 > heartbeat[5391]: 2011/03/03_04:04:28 debug: Outbound signing > method is 2 > heartbeat[5391]: 2011/03/03_04:04:28 debug: Authentication parsing > complete [1] > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(cluster,linux-ha) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(hopfudge,1) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(hbgenmethod,file) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(realtime,true) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(msgfmt,classic) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(conn_logd_time,60) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(log_badpack,true) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(syslogmsgfmt,false) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(coredumps,true) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(crm,false) > heartbeat[5391]: 2011/03/03_04:04:28 info: Version 2 support: false > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(autojoin,none) > heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(uuidfrom,file) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(compression,zlib) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(compression_threshold,2) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(max_rexmit_delay,250) > heartbeat[5391]: 2011/03/03_04:04:28 debug: Setting > max_rexmit_delay to > 250 ms > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(record_config_changes,on) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(record_pengine_inputs,on) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(enable_config_writes,on) > heartbeat[5391]: 2011/03/03_04:04:28 debug: > add_option(memreserve,6500) > heartbeat[5391]: 2011/03/03_04:04:28 WARN: Logging daemon is disabled > --enabling logging daemon is recommended > heartbeat[5391]: 2011/03/03_04:04:28 info: ************************** > heartbeat[5391]: 2011/03/03_04:04:28 info: Configuration validated. > Starting heartbeat 2.1.3 > heartbeat[5391]: 2011/03/03_04:04:28 debug: HA configuration OK. > Heartbeat starting. > heartbeat[18456]: 2011/03/03_04:04:28 info: heartbeat: version 2.1.3 > heartbeat[18456]: 2011/03/03_04:04:28 info: Heartbeat generation: > 1248828643 > heartbeat[18456]: 2011/03/03_04:04:28 debug: uuid > is:4ae573d0-2ccd-4f5b-b769-ddcb47da58c9 > heartbeat[18456]: 2011/03/03_04:04:28 debug: FIFO process pid: 18458 > heartbeat[18456]: 2011/03/03_04:04:28 debug: opening serial /dev/ttyS0 > (serial ring) > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: tty setup on > /dev/ttyS0 complete. > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: Baud rate > set to: 0xe > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_iflag = 0xb2 > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_oflag = 0x4 > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_cflag = > 0x80000cbe > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_lflag = 0x8a30 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: Starting serial > heartbeat on tty /dev/ttyS0 (19200 baud) > heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18466 > heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: > 18467 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair wchan socket 9 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair rchan socket 11 > heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast > bond0 (UDP/IP > broadcast) > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) > set for device bond0 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat started on port 698 (698) interface bond0 > heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18468 > heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: > 18469 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat closed on port 698 interface bond0 - Status: 1 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair wchan socket 12 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair rchan socket 14 > heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast > eth2 (UDP/IP > broadcast) > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) > set for device eth2 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat started on port 698 (698) interface eth2 > heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18470 > heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: > 18471 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat closed on port 698 interface eth2 - Status: 1 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair wchan socket 13 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair rchan socket 16 > heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast > eth3 (UDP/IP > broadcast) > heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) > set for device eth3 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat started on port 698 (698) interface eth3 > heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18472 > heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: > 18473 > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast > heartbeat closed on port 698 interface eth3 - Status: 1 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair wchan socket 15 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair rchan socket 18 > heartbeat[18456]: 2011/03/03_04:04:28 debug: opening ping > 192.168.10.100 > (ping membership) > heartbeat[18456]: 2011/03/03_04:04:28 info: glib: ping heartbeat > started. > heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18474 > heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: > 18475 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair wchan socket 10 > heartbeat[18456]: 2011/03/03_04:04:28 debug: > make_io_childpair: CREATED > childpair rchan socket 19 > heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_TriggerHandler: > Added signal manual handler > heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_TriggerHandler: > Added signal manual handler > heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_SignalHandler: > Added signal handler for signal 17 > heartbeat[18456]: 2011/03/03_04:04:28 debug: Limiting CPU: 42 CPU > seconds every 60000 milliseconds > heartbeat[18456]: 2011/03/03_04:04:28 debug: pid 18456 locked > in memory. > heartbeat[18456]: 2011/03/03_04:04:28 debug: Waiting for > child processes > to start > heartbeat[18456]: 2011/03/03_04:04:28 info: Local status now set to: > 'up' > heartbeat[18456]: 2011/03/03_04:04:28 debug: All your child > process are > belong to us > heartbeat[18456]: 2011/03/03_04:04:28 debug: Starting local status > message @ 2000 ms intervals > heartbeat[18456]: 2011/03/03_04:04:28 debug: Forking temp process > write_hostcachedata > heartbeat[18456]: 2011/03/03_04:04:28 info: Managed > write_hostcachedata > process 18476 exited with return code 0. > heartbeat[18458]: 2011/03/03_04:04:29 debug: pid 18458 locked > in memory. > heartbeat[18458]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18466]: 2011/03/03_04:04:29 debug: pid 18466 locked > in memory. > heartbeat[18466]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU > seconds every 60000 milliseconds > heartbeat[18467]: 2011/03/03_04:04:29 debug: pid 18467 locked > in memory. > heartbeat[18467]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18469]: 2011/03/03_04:04:29 debug: pid 18469 locked > in memory. > heartbeat[18468]: 2011/03/03_04:04:29 debug: pid 18468 locked > in memory. > heartbeat[18469]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18468]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU > seconds every 60000 milliseconds > heartbeat[18471]: 2011/03/03_04:04:29 debug: pid 18471 locked > in memory. > heartbeat[18471]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18472]: 2011/03/03_04:04:29 debug: pid 18472 locked > in memory. > heartbeat[18472]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU > seconds every 60000 milliseconds > heartbeat[18470]: 2011/03/03_04:04:29 debug: pid 18470 locked > in memory. > heartbeat[18470]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU > seconds every 60000 milliseconds > heartbeat[18475]: 2011/03/03_04:04:29 debug: pid 18475 locked > in memory. > heartbeat[18475]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18473]: 2011/03/03_04:04:29 debug: pid 18473 locked > in memory. > heartbeat[18473]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 > CPU seconds > every 60000 milliseconds > heartbeat[18474]: 2011/03/03_04:04:29 debug: pid 18474 locked > in memory. > heartbeat[18474]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU > seconds every 60000 milliseconds > heartbeat[18456]: 2011/03/03_04:04:29 info: Link > 192.168.10.100:192.168.10.100 up. > heartbeat[18456]: 2011/03/03_04:04:29 debug: > CreateInitialFilter: status > heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: > ask_resources > heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: > ip-request > heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: > ip-request-resp > heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: > hb_takeover > heartbeat[18456]: 2011/03/03_04:04:29 info: Status update for node > 192.168.10.100: status ping > heartbeat[18456]: 2011/03/03_04:04:29 debug: Status seqno: 0 msgtime: > 1299153868 > heartbeat[18456]: 2011/03/03_04:04:29 info: Link > ftp02.mycharts.md:bond0 > up. > heartbeat[18456]: 2011/03/03_04:04:29 info: Link > ftp02.mycharts.md:eth2 > up. > heartbeat[18456]: 2011/03/03_04:04:29 info: Link > ftp02.mycharts.md:eth3 > up. > heartbeat[18456]: 2011/03/03_04:04:30 WARN: string2msg_ll: node > [ftp01.mycharts.md] failed authentication > heartbeat[18456]: 2011/03/03_04:04:30 info: Link > ftp01.mycharts.md:bond0 > up. > heartbeat[18456]: 2011/03/03_04:04:30 debug: sending reqnodes msg to > node ftp01.mycharts.md > heartbeat[18456]: 2011/03/03_04:04:30 info: Link > ftp01.mycharts.md:eth2 > up. > heartbeat[18456]: 2011/03/03_04:04:30 info: Link > ftp01.mycharts.md:/dev/ttyS0 up. > heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process > write_hostcachedata > heartbeat[18456]: 2011/03/03_04:04:30 info: Managed > write_hostcachedata > process 18512 exited with return code 0. > heartbeat[18456]: 2011/03/03_04:04:30 info: Status update for node > ftp01.mycharts.md: status active > heartbeat[18456]: 2011/03/03_04:04:30 debug: Status seqno: 568961 > msgtime: 1299153870 > heartbeat[18456]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[18456]: 2011/03/03_04:04:30 debug: notify_world: invoking > harc: OLD status: up > heartbeat[18456]: 2011/03/03_04:04:30 debug: Process [status] started > pid 18540 > heartbeat[18456]: 2011/03/03_04:04:30 debug: Starting notify process > [status] > heartbeat[18456]: 2011/03/03_04:04:30 debug: Get a repnodes msg from > ftp01.mycharts.md > heartbeat[18456]: 2011/03/03_04:04:30 debug: nodelist > received:ftp01.mycharts.md ftp02.mycharts.md > heartbeat[18456]: 2011/03/03_04:04:30 info: Comm_now_up(): updating > status to active > heartbeat[18456]: 2011/03/03_04:04:30 info: Local status now set to: > 'active' > heartbeat[18456]: 2011/03/03_04:04:30 debug: Sending local > starting msg: > resourcestate = 0 > heartbeat[18456]: 2011/03/03_04:04:30 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 0 > heartbeat[18456]: 2011/03/03_04:04:30 info: Starting child client > "/usr/lib/heartbeat/ipfail" (498,496) > heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process > write_hostcachedata > heartbeat[18541]: 2011/03/03_04:04:30 info: Starting > "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 18541) > heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process > write_delcachedata > heartbeat[18540]: 2011/03/03_04:04:30 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[18540]: 2011/03/03_04:04:30 debug: notify_world: > Running harc > status > heartbeat[18456]: 2011/03/03_04:04:30 info: Managed > write_hostcachedata > process 18542 exited with return code 0. > heartbeat[18456]: 2011/03/03_04:04:30 info: Managed write_delcachedata > process 18543 exited with return code 0. > ipfail[18541]: 2011/03/03_04:04:30 debug: PID=18541 > ipfail[18541]: 2011/03/03_04:04:30 debug: Signing in with heartbeat > heartbeat[18456]: 2011/03/03_04:04:30 debug: > APIregistration_dispatch() > { > heartbeat[18456]: 2011/03/03_04:04:30 debug: process_registerevent() { > heartbeat[18456]: 2011/03/03_04:04:30 debug: client->gsource = > 0x14151c28 > heartbeat[18456]: 2011/03/03_04:04:30 debug: > }/*process_registerevent*/; > heartbeat[18456]: 2011/03/03_04:04:30 debug: > }/*APIregistration_dispatch*/; > heartbeat[18456]: 2011/03/03_04:04:30 debug: Checking client > authorization for client ipfail (498:496) > heartbeat[18456]: 2011/03/03_04:04:30 debug: > create_seq_snapshot_table:no missing packets found for node > ftp01.mycharts.md > heartbeat[18456]: 2011/03/03_04:04:30 debug: > create_seq_snapshot_table:no missing packets found for node > ftp02.mycharts.md > heartbeat[18456]: 2011/03/03_04:04:30 debug: Signing on API > client 18541 > (ipfail) > harc[18540]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status > status > heartbeat[18456]: 2011/03/03_04:04:30 info: Managed status > process 18540 > exited with return code 0. > heartbeat[18456]: 2011/03/03_04:04:30 debug: RscMgmtProc > 'status' exited > code 0 > heartbeat[18456]: 2011/03/03_04:04:30 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 1 > ipfail[18541]: 2011/03/03_04:04:30 debug: [We are ftp02.mycharts.md] > heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 0, > foreign 1, reason 'T_RESOURCES' (0)) > heartbeat[18456]: 2011/03/03_04:04:31 info: remote resource transition > completed. > heartbeat[18456]: 2011/03/03_04:04:31 debug: Sending hold > resources msg: > none, stable=0 # <none> > heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 0, > foreign 1, reason 'T_RESOURCES' (0)) > heartbeat[18456]: 2011/03/03_04:04:31 info: STATE 1 => 3 > heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 3 > heartbeat[18456]: 2011/03/03_04:04:31 debug: Calling > PerformAutoFailback() > heartbeat[18456]: 2011/03/03_04:04:31 info: other_holds_resources: 3 > heartbeat[18456]: 2011/03/03_04:04:31 info: remote resource transition > completed. > heartbeat[18456]: 2011/03/03_04:04:31 info: Local Resource acquisition > completed. (none) > heartbeat[18456]: 2011/03/03_04:04:31 debug: Sending hold > resources msg: > none, stable=1 # <none> > heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 3 > heartbeat[18456]: 2011/03/03_04:04:31 debug: Calling > PerformAutoFailback() > heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(them)' (0)) > heartbeat[18456]: 2011/03/03_04:04:31 info: Initial resource > acquisition > complete (T_RESOURCES(them)) > heartbeat[18456]: 2011/03/03_04:04:31 info: STATE 3 => 4 > heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > ipfail[18541]: 2011/03/03_04:04:31 debug: auto_failback -> 0 (off) > ipfail[18541]: 2011/03/03_04:04:31 debug: Setting message filter mode > heartbeat[18456]: 2011/03/03_04:04:31 info: other_holds_resources: 3 > heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > ipfail[18541]: 2011/03/03_04:04:31 debug: Starting node walk > ipfail[18541]: 2011/03/03_04:04:32 debug: Cluster node: > 192.168.10.100: > status: ping > ipfail[18541]: 2011/03/03_04:04:32 debug: Cluster node: > ftp02.mycharts.md: status: active > ipfail[18541]: 2011/03/03_04:04:33 debug: Cluster node: > ftp01.mycharts.md: status: active > ipfail[18541]: 2011/03/03_04:04:33 debug: [They are ftp01.mycharts.md] > ipfail[18541]: 2011/03/03_04:04:33 debug: Setting message signal > ipfail[18541]: 2011/03/03_04:04:33 debug: Waiting for messages... > ipfail[18541]: 2011/03/03_04:04:34 debug: Other side is now stable. > ipfail[18541]: 2011/03/03_04:04:34 debug: Other side is now stable. > ipfail[18541]: 2011/03/03_04:04:34 debug: Got asked for num_ping. > ipfail[18541]: 2011/03/03_04:04:35 debug: Found ping node > 192.168.10.100! > ipfail[18541]: 2011/03/03_04:04:36 info: Ping node count is balanced. > ipfail[18541]: 2011/03/03_04:04:36 debug: Abort message sent. > heartbeat[18456]: 2011/03/03_04:14:29 info: Daily informational memory > statistics > > > ////////////////////////////////////////////////////////////// > > > Now here is the ha-debug from server ftp01, which was secondary before > the failover: > > > > > heartbeat[5280]: 2011/03/03_04:03:38 WARN: Gmain_timeout_dispatch: > Dispatch function for memory stats took too long to execute: > 48880 ms (> > 100 ms) (GSource: 0xb29108) > heartbeat[5280]: 2011/03/03_04:03:38 WARN: Gmain_timeout_dispatch: > Dispatch function for send local status was delayed 47980 ms > (> 1010 ms) > before being called (GSource: 0xb28398) > heartbeat[5280]: 2011/03/03_04:03:40 info: Gmain_timeout_dispatch: > started at 542452156 should have started at 542447358 > heartbeat[5280]: 2011/03/03_04:03:41 WARN: Gmain_timeout_dispatch: > Dispatch function for send local status took too long to execute: 2480 > ms (> 50 ms) (GSource: 0xb28398) > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Late heartbeat: Node > ftp02.mycharts.md: interval 53380 ms > heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch > function for read child took too long to execute: 3280 ms (> 50 ms) > (GSource: 0xb1c388) > heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch > function for read child was delayed 3290 ms (> 500 ms) before being > called (GSource: 0xb1ee98) > heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: > started at > 542452778 should have started at 542452449 > heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch > function for read child was delayed 3290 ms (> 500 ms) before being > called (GSource: 0xb1f7d8) > heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: > started at > 542452778 should have started at 542452449 > heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch > function for read child was delayed 3280 ms (> 500 ms) before being > called (GSource: 0xb20528) > heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: > started at > 542452778 should have started at 542452450 > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Late heartbeat: Node > 192.168.10.100: interval 56190 ms > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: > Dispatch function for send local status was delayed 1740 ms > (> 1010 ms) > before being called (GSource: 0xb28398) > heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: > started at 542452778 should have started at 542452604 > ipfail[5456]: 2011/03/03_04:03:44 debug: Got asked for num_ping. > heartbeat[5280]: 2011/03/03_04:03:44 debug: > process_resources(2): other > now unstable > heartbeat[5280]: 2011/03/03_04:03:44 info: other_holds_resources: 0 > heartbeat[5280]: 2011/03/03_04:03:44 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: > Dispatch function for check for signals was delayed 54200 ms > (> 1010 ms) > before being called (GSource: 0xb29698) > heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: > started at 542452779 should have started at 542447359 > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: > Dispatch function for update msgfree count was delayed 55170 > ms (> 30000 > ms) before being called (GSource: 0xb29818) > heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: > started at 542452779 should have started at 542447262 > ipfail[5456]: 2011/03/03_04:03:44 debug: Found ping node > 192.168.10.100! > heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: > Dispatch function for client audit was delayed 52220 ms (> 5000 ms) > before being called (GSource: 0xb29518) > heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: > started at 542452779 should have started at 542447557 > ipfail[5456]: 2011/03/03_04:03:44 info: Ping node count is balanced. > ipfail[5456]: 2011/03/03_04:03:44 debug: Abort message sent. > ipfail[5456]: 2011/03/03_04:03:45 debug: Got asked for num_ping. > ipfail[5456]: 2011/03/03_04:03:45 debug: Found ping node > 192.168.10.100! > ipfail[5456]: 2011/03/03_04:03:46 info: Ping node count is balanced. > ipfail[5456]: 2011/03/03_04:03:46 debug: Abort message sent. > ipfail[5456]: 2011/03/03_04:03:46 debug: Got asked for num_ping. > ipfail[5456]: 2011/03/03_04:03:47 debug: Found ping node > 192.168.10.100! > ipfail[5456]: 2011/03/03_04:03:48 info: Ping node count is balanced. > ipfail[5456]: 2011/03/03_04:03:48 debug: Abort message sent. > ipfail[5456]: 2011/03/03_04:03:48 debug: Other side is unstable. > heartbeat[5280]: 2011/03/03_04:03:54 debug: > process_resources(4): other > now stable - T_SHUTDONE > heartbeat[5280]: 2011/03/03_04:03:54 debug: Process > [go_standby] started > pid 1142 > heartbeat[5280]: 2011/03/03_04:03:54 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 3, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:03:54 info: Received shutdown > notice from > 'ftp02.mycharts.md'. > heartbeat[5280]: 2011/03/03_04:03:54 info: Resources being > acquired from > ftp02.mycharts.md. > heartbeat[5280]: 2011/03/03_04:03:54 debug: StartNextRemoteRscReq(): > child count 1 > heartbeat[5280]: 2011/03/03_04:03:54 debug: takeover_from_node: other > now stable > heartbeat[5280]: 2011/03/03_04:03:54 debug: Process > [req_our_resources(ask)] started pid 1143 > heartbeat[1142]: 2011/03/03_04:03:54 info: acquire all HA resources > (standby). > heartbeat[1142]: 2011/03/03_04:03:54 info: go_standby: who: 2 resource > set: all > heartbeat[1142]: 2011/03/03_04:03:54 info: go_standby: (query/action): > (allkeys/takegroup) > heartbeat[1143]: 2011/03/03_04:03:54 debug: > req_our_resources(/usr/share/heartbeat/ResourceManager listkeys > ftp01.mycharts.md) > heartbeat[1143]: 2011/03/03_04:03:54 debug: > req_our_resources(): running > [/usr/share/heartbeat/req_resource drbddisk::ha_ftp] > ResourceManager[1169]: 2011/03/03_04:03:55 info: Acquiring resource > group: ftp01.mycharts.md drbddisk::ha_ftp > Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime > IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs > req_resource[1168]: 2011/03/03_04:03:55 debug: in > /usr/share/heartbeat/req_resource drbddisk::ha_ftp > req_resource[1168]: 2011/03/03_04:03:55 debug: dont_ask: > nice_failback: > yes > heartbeat[1143]: 2011/03/03_04:03:55 info: 1 local resources from > [/usr/share/heartbeat/ResourceManager listkeys ftp01.mycharts.md] > heartbeat[1143]: 2011/03/03_04:03:55 info: Local Resource acquisition > completed. > heartbeat[1143]: 2011/03/03_04:03:55 debug: Sending hold > resources msg: > all, stable=1 # req_our_resources() > heartbeat[1143]: 2011/03/03_04:03:55 info: FIFO message [type > resource] > written rc=79 > heartbeat[5280]: 2011/03/03_04:03:55 debug: StartNextRemoteRscReq(): > child count 2 > heartbeat[5280]: 2011/03/03_04:03:55 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(us)' (1)) > heartbeat[5280]: 2011/03/03_04:03:55 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 2, other_is_stable: 1, takeover_in_progress: > 1, going_standby: 3, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:03:55 info: Managed > req_our_resources(ask) process 1143 exited with return code 0. > heartbeat[5280]: 2011/03/03_04:03:55 debug: RscMgmtProc > 'req_our_resources(ask)' exited code 0 > heartbeat[5280]: 2011/03/03_04:03:55 debug: StartNextRemoteRscReq(): > child count 1 > ResourceManager[1169]: 2011/03/03_04:03:55 info: Running > /etc/ha.d/resource.d/drbddisk ha_ftp start > ResourceManager[1169]: 2011/03/03_04:03:55 debug: Starting > /etc/ha.d/resource.d/drbddisk ha_ftp start > ResourceManager[1169]: 2011/03/03_04:03:55 debug: > /etc/ha.d/resource.d/drbddisk ha_ftp start done. RC=0 > Filesystem[1280]: 2011/03/03_04:03:56 INFO: Resource is stopped > ResourceManager[1169]: 2011/03/03_04:03:56 info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start > ResourceManager[1169]: 2011/03/03_04:03:56 debug: Starting > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start > Filesystem[1366]: 2011/03/03_04:03:56 INFO: Running start for > /dev/drbd0 > on /ha_ftp > Filesystem[1355]: 2011/03/03_04:03:57 INFO: Success > INFO: Success > ResourceManager[1169]: 2011/03/03_04:03:57 debug: > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start > done. RC=0 > IPaddr2[1440]: 2011/03/03_04:03:57 INFO: Resource is stopped > ResourceManager[1169]: 2011/03/03_04:03:58 info: Running > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start > ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start > IPaddr2[1552]: 2011/03/03_04:03:58 INFO: ip -f inet addr add > 192.168.10.203/24 brd 192.168.10.255 dev bond0 > IPaddr2[1552]: 2011/03/03_04:03:58 INFO: ip link set bond0 up > IPaddr2[1552]: 2011/03/03_04:03:58 INFO: /usr/lib64/heartbeat/send_arp > -i 200 -r 5 -p > /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.10.203 bond0 > 192.168.10.203 auto not_used not_used > IPaddr2[1523]: 2011/03/03_04:03:58 INFO: Success > INFO: Success > ResourceManager[1169]: 2011/03/03_04:03:58 debug: > /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start done. RC=0 > ResourceManager[1169]: 2011/03/03_04:03:58 info: Running > /etc/init.d/pure-ftpd start > ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting > /etc/init.d/pure-ftpd start > Starting pure-ftpd: [ OK ] > ResourceManager[1169]: 2011/03/03_04:03:58 debug: > /etc/init.d/pure-ftpd > start done. RC=0 > ResourceManager[1169]: 2011/03/03_04:03:58 info: Running > /etc/init.d/nfs > start > ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting > /etc/init.d/nfs start > Starting NFS services: [ OK ] > Starting NFS quotas: [ OK ] > Starting NFS daemon: [ OK ] > Starting NFS mountd: [ OK ] > ResourceManager[1169]: 2011/03/03_04:03:59 debug: > /etc/init.d/nfs start > done. RC=0 > heartbeat[1142]: 2011/03/03_04:03:59 info: all HA resource acquisition > completed (standby). > heartbeat[1142]: 2011/03/03_04:03:59 debug: Sending standby [done] msg > heartbeat[1142]: 2011/03/03_04:03:59 info: FIFO message [type > ask_resources] written rc=47 > heartbeat[5280]: 2011/03/03_04:03:59 debug: Received standby message > done from ftp01.mycharts.md in state 3 > heartbeat[5280]: 2011/03/03_04:03:59 info: Standby resource > acquisition > done [all]. > heartbeat[5280]: 2011/03/03_04:03:59 debug: Sending hold > resources msg: > all, stable=1 # <none> > heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(us)' (1)) > heartbeat[5280]: 2011/03/03_04:03:59 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: > 1, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:03:59 info: New standby state: 0 > heartbeat[5280]: 2011/03/03_04:03:59 info: Managed go_standby process > 1142 exited with return code 0. > heartbeat[5280]: 2011/03/03_04:03:59 debug: RscMgmtProc 'go_standby' > exited code 0 > heartbeat[5280]: 2011/03/03_04:03:59 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5280]: 2011/03/03_04:03:59 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5280]: 2011/03/03_04:03:59 debug: Process [status] > started pid > 1802 > heartbeat[5280]: 2011/03/03_04:03:59 debug: Starting notify process > [status] > heartbeat[1802]: 2011/03/03_04:03:59 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[1802]: 2011/03/03_04:03:59 debug: notify_world: Running harc > status > harc[1802]: 2011/03/03_04:03:59 info: Running /etc/ha.d/rc.d/status > status > mach_down[1818]: 2011/03/03_04:03:59 info: > /usr/share/heartbeat/mach_down: nice_failback: foreign resources > acquired > mach_down[1818]: 2011/03/03_04:03:59 info: mach_down takeover complete > for node ftp02.mycharts.md. > heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(us)' (1)) > heartbeat[5280]: 2011/03/03_04:03:59 info: mach_down takeover > complete. > heartbeat[5280]: 2011/03/03_04:03:59 debug: > process_resources(3): other > now stable > heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, > foreign 1, reason 'mach_down' (1)) > heartbeat[5280]: 2011/03/03_04:03:59 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:03:59 info: Managed status process 1802 > exited with return code 0. > heartbeat[5280]: 2011/03/03_04:03:59 debug: RscMgmtProc > 'status' exited > code 0 > heartbeat[5280]: 2011/03/03_04:03:59 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5280]: 2011/03/03_04:03:59 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5280]: 2011/03/03_04:03:59 debug: Process [ip-request-resp] > started pid 1852 > heartbeat[5280]: 2011/03/03_04:03:59 debug: Starting notify process > [ip-request-resp] > heartbeat[1852]: 2011/03/03_04:03:59 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[1852]: 2011/03/03_04:03:59 debug: notify_world: Running harc > ip-request-resp > harc[1852]: 2011/03/03_04:03:59 info: Running > /etc/ha.d/rc.d/ip-request-resp ip-request-resp > ip-request-resp[1852]: 2011/03/03_04:03:59 received ip-request-resp > drbddisk::ha_ftp OK yes > ResourceManager[1873]: 2011/03/03_04:03:59 info: Acquiring resource > group: ftp01.mycharts.md drbddisk::ha_ftp > Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime > IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs > Filesystem[1915]: 2011/03/03_04:04:00 INFO: Running OK > IPaddr2[1975]: 2011/03/03_04:04:00 INFO: Running OK > heartbeat[5280]: 2011/03/03_04:04:00 info: Managed ip-request-resp > process 1852 exited with return code 0. > heartbeat[5280]: 2011/03/03_04:04:00 debug: RscMgmtProc > 'ip-request-resp' exited code 0 > heartbeat[5280]: 2011/03/03_04:04:00 info: AnnounceTakeover(local 1, > foreign 1, reason 'ip-request-resp' (1)) > heartbeat[5340]: 2011/03/03_04:04:25 WARN: glib: TTY write timeout on > [/dev/ttyS0] (no connection or bad cable? [see documentation]) > heartbeat[5340]: 2011/03/03_04:04:25 info: glib: See > http://linux-ha.org/FAQ#TTYtimeout > <http://linux-ha.org/FAQ#TTYtimeout> > for details > heartbeat[5280]: 2011/03/03_04:04:26 info: Link > ftp02.mycharts.md:bond0 > dead. > heartbeat[5280]: 2011/03/03_04:04:26 info: Link ftp02.mycharts.md:eth2 > dead. > ipfail[5456]: 2011/03/03_04:04:26 info: Link Status update: Link > ftp02.mycharts.md/bond0 now has status dead > ipfail[5456]: 2011/03/03_04:04:27 debug: Found ping node > 192.168.10.100! > heartbeat[5280]: 2011/03/03_04:04:27 info: Link > ftp02.mycharts.md:/dev/ttyS0 dead. > ipfail[5456]: 2011/03/03_04:04:27 info: Asking other side for > ping node > count. > ipfail[5456]: 2011/03/03_04:04:27 debug: Message [num_ping] sent. > ipfail[5456]: 2011/03/03_04:04:27 info: Checking remote count of ping > nodes. > ipfail[5456]: 2011/03/03_04:04:27 info: Link Status update: Link > ftp02.mycharts.md/eth2 now has status dead > ipfail[5456]: 2011/03/03_04:04:28 debug: Found ping node > 192.168.10.100! > ipfail[5456]: 2011/03/03_04:04:28 info: Asking other side for > ping node > count. > ipfail[5456]: 2011/03/03_04:04:28 debug: Message [num_ping] sent. > ipfail[5456]: 2011/03/03_04:04:28 info: Checking remote count of ping > nodes. > ipfail[5456]: 2011/03/03_04:04:28 info: Link Status update: Link > ftp02.mycharts.md//dev/ttyS0 now has status dead > ipfail[5456]: 2011/03/03_04:04:29 debug: Found ping node > 192.168.10.100! > ipfail[5456]: 2011/03/03_04:04:29 info: Asking other side for > ping node > count. > ipfail[5456]: 2011/03/03_04:04:29 debug: Message [num_ping] sent. > ipfail[5456]: 2011/03/03_04:04:29 info: Checking remote count of ping > nodes. > heartbeat[5280]: 2011/03/03_04:04:30 info: Heartbeat restart on node > ftp02.mycharts.md > heartbeat[5280]: 2011/03/03_04:04:30 info: Link > ftp02.mycharts.md:/dev/ttyS0 up. > heartbeat[5280]: 2011/03/03_04:04:30 info: Status update for node > ftp02.mycharts.md: status init > heartbeat[5280]: 2011/03/03_04:04:30 debug: Status seqno: 1 msgtime: > 1299153868 > ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link > ftp02.mycharts.md//dev/ttyS0 now has status up > heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5280]: 2011/03/03_04:04:30 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5280]: 2011/03/03_04:04:30 debug: Process [status] > started pid > 2976 > heartbeat[5280]: 2011/03/03_04:04:30 debug: Starting notify process > [status] > ipfail[5456]: 2011/03/03_04:04:30 info: Status update: Node > ftp02.mycharts.md now has status init > heartbeat[5280]: 2011/03/03_04:04:30 info: Link > ftp02.mycharts.md:bond0 > up. > ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link > ftp02.mycharts.md/bond0 now has status up > heartbeat[5280]: 2011/03/03_04:04:30 info: Link ftp02.mycharts.md:eth2 > up. > ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link > ftp02.mycharts.md/eth2 now has status up > heartbeat[5280]: 2011/03/03_04:04:30 info: Status update for node > ftp02.mycharts.md: status up > heartbeat[5280]: 2011/03/03_04:04:30 debug: Status seqno: 2 msgtime: > 1299153868 > heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq(): > child count 1 > ipfail[5456]: 2011/03/03_04:04:30 info: Status update: Node > ftp02.mycharts.md now has status up > heartbeat[2976]: 2011/03/03_04:04:30 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[2976]: 2011/03/03_04:04:30 debug: notify_world: Running harc > status > harc[2976]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status > status > heartbeat[5280]: 2011/03/03_04:04:30 info: Managed status process 2976 > exited with return code 0. > heartbeat[5280]: 2011/03/03_04:04:30 debug: RscMgmtProc > 'status' exited > code 0 > heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5280]: 2011/03/03_04:04:30 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5280]: 2011/03/03_04:04:30 debug: Process [status] > started pid > 2992 > heartbeat[5280]: 2011/03/03_04:04:30 debug: Starting notify process > [status] > heartbeat[2992]: 2011/03/03_04:04:30 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[2992]: 2011/03/03_04:04:30 debug: notify_world: Running harc > status > harc[2992]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status > status > heartbeat[5280]: 2011/03/03_04:04:30 info: Managed status process 2992 > exited with return code 0. > heartbeat[5280]: 2011/03/03_04:04:30 debug: RscMgmtProc > 'status' exited > code 0 > heartbeat[5280]: 2011/03/03_04:04:30 debug: Get a reqnodes > message from > ftp02.mycharts.md > heartbeat[5280]: 2011/03/03_04:04:30 debug: get_delnodelist: > delnodelist= > heartbeat[5280]: 2011/03/03_04:04:31 info: Status update for node > ftp02.mycharts.md: status active > heartbeat[5280]: 2011/03/03_04:04:31 debug: Status seqno: 6 msgtime: > 1299153870 > heartbeat[5280]: 2011/03/03_04:04:31 debug: StartNextRemoteRscReq() - > calling hook > heartbeat[5280]: 2011/03/03_04:04:31 debug: notify_world: > invoking harc: > OLD status: active > heartbeat[5280]: 2011/03/03_04:04:31 debug: Process [status] > started pid > 3046 > heartbeat[5280]: 2011/03/03_04:04:31 debug: Starting notify process > [status] > ipfail[5456]: 2011/03/03_04:04:31 info: Status update: Node > ftp02.mycharts.md now has status active > heartbeat[5280]: 2011/03/03_04:04:31 debug: > process_resources: other now > unstable > heartbeat[5280]: 2011/03/03_04:04:31 debug: Sending hold > resources msg: > all, stable=1 # <none> > heartbeat[5280]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(us)' (1)) > heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[3046]: 2011/03/03_04:04:31 debug: notify_world: setting > SIGCHLD Handler to SIG_DFL > heartbeat[3046]: 2011/03/03_04:04:31 debug: notify_world: Running harc > status > ipfail[5456]: 2011/03/03_04:04:31 debug: Got join message from another > ipfail client. (ftp02.mycharts.md) > harc[3046]: 2011/03/03_04:04:31 info: Running /etc/ha.d/rc.d/status > status > heartbeat[5280]: 2011/03/03_04:04:31 info: Managed status process 3046 > exited with return code 0. > heartbeat[5280]: 2011/03/03_04:04:31 debug: RscMgmtProc > 'status' exited > code 0 > ipfail[5456]: 2011/03/03_04:04:31 debug: Found ping node > 192.168.10.100! > heartbeat[5280]: 2011/03/03_04:04:31 debug: > process_resources(2): other > now unstable > heartbeat[5280]: 2011/03/03_04:04:31 info: other_holds_resources: 0 > heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:04:31 info: remote resource transition > completed. > heartbeat[5280]: 2011/03/03_04:04:31 debug: Sending hold > resources msg: > all, stable=1 # <none> > heartbeat[5280]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, > foreign 1, reason 'T_RESOURCES(us)' (1)) > heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > heartbeat[5280]: 2011/03/03_04:04:31 debug: Calling > PerformAutoFailback() > heartbeat[5280]: 2011/03/03_04:04:31 info: other_holds_resources: 0 > heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: > ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: > 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 > ipfail[5456]: 2011/03/03_04:04:31 info: Asking other side for > ping node > count. > ipfail[5456]: 2011/03/03_04:04:31 debug: Message [num_ping] sent. > ipfail[5456]: 2011/03/03_04:04:31 debug: Other side is unstable. > ipfail[5456]: 2011/03/03_04:04:31 debug: Other side is now stable. > ipfail[5456]: 2011/03/03_04:04:37 info: No giveup timer to abort. > heartbeat[5280]: 2011/03/03_04:12:49 info: Daily informational memory > statistics > > > > > -- > Eric Robinson > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Disclaimer - March 3, 2011 > This email and any files transmitted with it are confidential > and intended solely for General Linux-HA mailing list. If you > are not the named addressee you should not disseminate, > distribute, copy or alter this email. Any views or opinions > presented in this email are solely those of the author and > might not represent those of Physicians' Managed Care or > Physician Select Management. Warning: Although Physicians' > Managed Care or Physician Select Management has taken > reasonable precautions to ensure no viruses are present in > this email, the company cannot accept responsibility for any > loss or damage arising from the use of this email or attachments. > This disclaimer was added by Policy Patrol: > http://www.policypatrol.com/ > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems > _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
