Hello List. I came in this morning and found that one of my clusters had a failure over the weekend. This cluster is not yet production, but is slated to become a production cluster soon, so I am hoping I can get some help/advice to keep this type of failure from occurring again in the future.
This cluster contains two Dell 2950s whose primary job is to provide a high-availability Postgres server. The systems do not use shared disk; instead they leverage Slony to keep the databases in sync with each other. Heartbeat is used to provide a VIP address as well as to run a couple of custom scripts (one for a Java process, another to inform Slony about changes made to the status of the DB). We are using the old Heartbeat 1 style setup on this cluster. The operating system is Debian Etch, and heartbeat is version 2.0.7-2. Both machines use Ethernet bonding in failover mode and are connected to discrete blades on a Cisco 6509 switch.

Now on to the issue. Around 16:00 on Saturday afternoon, node2 declared node1 to be dead and initiated a failover. Shortly after, node1 recovered, reclaimed the resources, and caused node2 to restart heartbeat. I have both machines instrumented with Cacti, and neither saw heavy load or network traffic during this time. If this had occurred under high load or during a time of heavy network utilization, it would make more sense to me.

Looking over the configuration files, I am further perplexed: auto_failback is set to 'off'. I assumed this would prevent the failback to my primary node? Here are my log files and the relevant configuration files. After reviewing the configuration, I do think the multiple ucast directives could be problematic, but could they have caused this problem?
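For what it's worth, here is roughly how I am thinking of hardening ha.cf in response to the "Deadtime value may be too small" warning in the logs. The values and the second heartbeat path are assumptions on my part, not something I have tested yet; the crossover interface name and address are hypothetical:

```
# log to syslog
logfacility local0

# heartbeat over the bonded interface (the logs show bond0, not eth0),
# plus -- assumption -- a second, independent path over a direct crossover link
ucast bond0 10.0.0.1
ucast bond0 10.0.0.2
# ucast eth2 192.168.100.2    # hypothetical crossover link to the peer

# don't revert to the "primary" node automatically after a failure
auto_failback off

# list of cluster members
node node1 node2

keepalive 2     # heartbeat spacing in seconds
warntime 10     # warn about late heartbeats before declaring the peer dead
deadtime 60     # longer than the ~40 s stall seen in these logs
initdead 120    # extra allowance at startup
```

I have kept one ucast line per address since heartbeat appears to ignore its own address, but I would welcome confirmation on whether each node should instead list only its peer.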
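One detail that jumped out at me in the node1 log below: the "time_longclock: clock_t wrapped around (uptime)" messages, and the Gmain_timeout_dispatch timestamps sitting exactly at 4294967295, the 32-bit maximum. If my arithmetic is right (assuming a 32-bit clock_t and the common HZ=100 kernel tick rate), the counter wraps after roughly 497 days of uptime, which would be consistent with the ~40-second stall and the failover being triggered by the wrap rather than by load:

```python
# Rough sanity check (assumptions: 32-bit clock_t, kernel tick rate HZ=100).
# The counter wraps every 2**32 ticks.
wrap_seconds = 2**32 / 100          # seconds until a 32-bit jiffies counter wraps
wrap_days = wrap_seconds / 86400    # convert to days
print(f"wraps after ~{wrap_days:.1f} days")  # ~497.1 days
```

If these boxes have really been up that long, the dispatch delays and the wraparound messages may be the same event, and the fix may be a scheduled reboot (or a heartbeat build that handles the wrap more gracefully) rather than a tuning change.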
Names, IPs & scripts renamed to protect the guilty:

#### Messages from Primary Node

Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 40280 ms (> 510 ms) before being called (GSource: 0x811e098)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963267
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node1: interval 41280 ms
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node2: interval 41780 ms
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 40200 ms (> 510 ms) before being called (GSource: 0x811e3a0)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963275
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 40780 ms (> 15000 ms) before being called (GSource: 0x811e470)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963217
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 38380 ms (> 5000 ms) before being called (GSource: 0x811e2d0)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963457
Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 heartbeat: [5738]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 heartbeat: [5731]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 logd: [5692]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 logd: [5730]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:49 node1 heartbeat: [5735]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:49 node1 heartbeat: [5737]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:53 node1 heartbeat: [5731]: info: Received shutdown notice from 'node2'.
Nov 15 16:22:53 node1 heartbeat: [5731]: info: Resources being acquired from node2.
Nov 15 16:22:53 node1 harc[7346]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:53 node1 mach_down[7367]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Nov 15 16:22:53 node1 mach_down[7367]: info: mach_down takeover complete for node node2.
Nov 15 16:22:53 node1 heartbeat: [5731]: info: mach_down takeover complete.
Nov 15 16:22:53 node1 heartbeat: [7347]: info: Local Resource acquisition completed.
Nov 15 16:22:53 node1 harc[7408]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Nov 15 16:22:53 node1 ip-request-resp[7408]: received ip-request-resp custom_script1.sh OK yes
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/init.d/custom_script2.sh start
Nov 15 16:22:53 node1 IPaddr[7573]: INFO: IPaddr Running OK
Nov 15 16:22:54 node1 MailTo[7726]: WARNING: Don't stat/monitor me!
MailTo is a pseudo resource agent, so the status reported may be incorrect
Nov 15 16:22:54 node1 MailTo[7680]: INFO: MailTo Running OK
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Heartbeat restart on node node2
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status init
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status up
Nov 15 16:23:12 node1 harc[7751]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:12 node1 harc[7761]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node1 heartbeat: [5731]: info: Status update for node node2: status active
Nov 15 16:23:13 node1 harc[7771]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node1 heartbeat: [5731]: info: remote resource transition completed.

#### Messages from Secondary Node

Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: node node1: is dead
Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: No STONITH device configured.
Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: Shared disks are not protected.
Nov 15 16:22:23 node2 heartbeat: [3494]: info: Resources being acquired from node1.
Nov 15 16:22:23 node2 harc[14952]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:23 node2 heartbeat: [14953]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys node2] to acquire.
Nov 15 16:22:23 node2 mach_down[14972]: info: Taking over resource group custom_script1.sh
Nov 15 16:22:23 node2 ResourceManager[14992]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:23 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/init.d/custom_script2.sh start
Nov 15 16:22:32 node2 IPaddr[15200]: INFO: IPaddr Resource is stopped
Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 start
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: eval /sbin/ifconfig bond0:0 10.0.0.10 netmask 255.255.255.0 broadcast 192.168.24.255
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: Sending Gratuitous Arp for 10.0.0.10 on bond0:0 [bond0]
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: /usr/lib/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.0.0.10 bond0 10.0.0.10 auto 10.0.0.10 ffffffffffff
Nov 15 16:22:33 node2 IPaddr[15312]: INFO: IPaddr Success
Nov 15 16:22:33 node2 MailTo[15518]: WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect
Nov 15 16:22:33 node2 MailTo[15472]: INFO: MailTo Resource is stopped
Nov 15 16:22:33 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld start
Nov 15 16:22:33 node2 MailTo[15554]: INFO: MailTo Success
Nov 15 16:22:33 node2 mach_down[14972]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Nov 15 16:22:33 node2 mach_down[14972]: info: mach_down takeover complete for node node1.
Nov 15 16:22:33 node2 heartbeat: [3494]: info: mach_down takeover complete.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Deadtime value may be too small.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: See FAQ for information on tuning deadtime.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: URL: http://linux-ha.org/FAQ#heavy_load
Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Late heartbeat: Node node1: interval 41590 ms
Nov 15 16:22:49 node2 heartbeat: [3494]: info: Status update for node node1: status active
Nov 15 16:22:49 node2 harc[15678]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:51 node2 heartbeat: [3494]: info: Heartbeat shutdown in progress. (3494)
Nov 15 16:22:51 node2 heartbeat: [15689]: info: Giving up all HA resources.
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Releasing resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld stop
Nov 15 16:22:51 node2 MailTo[15734]: INFO: MailTo Success
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 stop
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/route -n del -host 10.0.0.10
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/ifconfig bond0:0 10.0.0.10 down
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: IP Address 10.0.0.10 released
Nov 15 16:22:51 node2 IPaddr[15824]: INFO: IPaddr Success
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/init.d/custom_script2.sh stop
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/custom_script1.sh stop
Nov 15 16:22:53 node2 heartbeat: [15689]: info: All HA resources relinquished.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBFIFO process 3509 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3510 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3511 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3512 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3513 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3511 exited. 5 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3509 exited. 4 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3510 exited. 3 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3513 exited. 2 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3512 exited. 1 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: node2 Heartbeat shutdown complete.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Heartbeat restart triggered.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Restarting heartbeat.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Performing heartbeat restart exec.
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Core dumps could be lost if multiple dumps occur
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Nov 15 16:23:11 node2 heartbeat: [3494]: info: **************************
Nov 15 16:23:11 node2 heartbeat: [3494]: info: Configuration validated.
Starting heartbeat 2.0.7
Nov 15 16:23:11 node2 heartbeat: [16005]: info: heartbeat: version 2.0.7
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Heartbeat generation: 13
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Removing /var/run/heartbeat/rsctmp failed, recreating.
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.1
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.2
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Local status now set to: 'up'
Nov 15 16:23:12 node2 heartbeat: [16005]: info: Link node1:bond0 up.
Nov 15 16:23:12 node2 heartbeat: [16005]: info: Status update for node node1: status active
Nov 15 16:23:12 node2 harc[16013]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Comm_now_up(): updating status to active
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local status now set to: 'active'
Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local Resource acquisition completed. (none)
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Initial resource acquisition complete (T_RESOURCES(them))

#### ha.cf

# log to syslog
logfacility local0

# UDP heartbeats over eth0
ucast eth0 10.0.0.1
ucast eth0 10.0.0.2

# don't try to revert to the "primary" node automatically after a failure
auto_failback off

# list of cluster members
node node1 node2

# heartbeat spacing in seconds
keepalive 1

# how long until we pronounce peer dead in seconds
deadtime 15

#### haresources

node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld

Thanks in advance for taking the time to read this.

Cheers,
Ryan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
