Hi,

On Tue, Nov 18, 2008 at 08:29:07AM -0700, Ryan Kish wrote:
> Hello List.
>
> I came in this morning and found that one of my clusters had a failure
> over the weekend. This cluster is not yet production, but is slated to
> become a production cluster soon, so I am hoping I can get some
> help/advice to keep this type of failure from occurring again in the
> future.
>
> This cluster contains two Dell 2950's whose primary job is to provide
> a high-availability Postgres server. The systems do not use shared
> disk; rather, they leverage Slony to keep the DBs in sync with each
> other. Heartbeat is used to provide a VIP address as well as to run a
> couple of custom scripts (one for a java process, another to inform
> Slony about the changes made to the status of the DB).
>
> We are using the old Heartbeat 1 style setup on this cluster. The
> operating system is Debian Etch, heartbeat is version 2.0.7-2. Both
> machines are using Ethernet bonding in failover mode, and are then
> connected to discrete blades on a Cisco 6509 switch.
>
> Now on to the issue. Around 16:00 on Saturday afternoon, node2
> declared node1 to be dead and initiated a failover. Shortly after,
> node1 recovered, reclaimed the resources and caused node2 to restart
> heartbeat. I have both machines instrumented with Cacti and neither
> saw heavy load or network traffic during this time. If this were
> to occur with high load or during a time of heavy network utilization,
> it would make more sense to me.
According to the logs, heartbeats were unscheduled for about 40 seconds:

> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 40280 ms (> 510 ms) before being called (GSource: 0x811e098)

There was also this:

> Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t wrapped around (uptime).

That reminds me of an old bug connected with uptime greater than something like 400 days (a 32-bit wrap), though that should have been fixed before 2.0.7. (It is described in bugzilla, too.)

At any rate, this made node2 think that node1 had died. That's a split-brain. You need STONITH to handle split-brains.

> Looking over the configuration files,
> I am further perplexed; auto_failback is set to 'off'. I assumed this
> would prevent the fail back to my primary node?

This wasn't exactly a failover, and it wasn't a failback either. node1 never stopped its resources; they were just (erroneously, because of the split-brain) started on node2. Once node2 realized that node1 was actually not dead, I guess it decided to shut down or reboot (sorry, my v1 knowledge is rusty).

Thanks,

Dejan

> Here are my log files and the relevant configuration files. After
> reviewing the configuration, I do think the multiple ucast directives
> could be problematic, but could they have caused this problem?
> Names, IP's & scripts renamed to protect the guilty:
>
> #### Messages from Primary Node
>
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 40280 ms (> 510 ms) before being called (GSource: 0x811e098)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963267
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node1: interval 41280 ms
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node2: interval 41780 ms
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 40200 ms (> 510 ms) before being called (GSource: 0x811e3a0)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963275
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 40780 ms (> 15000 ms) before being called (GSource: 0x811e470)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963217
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 38380 ms (> 5000 ms) before being called (GSource: 0x811e2d0)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963457
> Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:48 node1 heartbeat: [5738]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:48 node1 heartbeat: [5731]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:48 node1 logd: [5692]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:48 node1 logd: [5730]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:49 node1 heartbeat: [5735]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:49 node1 heartbeat: [5737]: info: time_longclock: clock_t wrapped around (uptime).
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: Received shutdown notice from 'node2'.
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: Resources being acquired from node2.
> Nov 15 16:22:53 node1 harc[7346]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:53 node1 mach_down[7367]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> Nov 15 16:22:53 node1 mach_down[7367]: info: mach_down takeover complete for node node2.
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: mach_down takeover complete.
> Nov 15 16:22:53 node1 heartbeat: [7347]: info: Local Resource acquisition completed.
> Nov 15 16:22:53 node1 harc[7408]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> Nov 15 16:22:53 node1 ip-request-resp[7408]: received ip-request-resp custom_script1.sh OK yes
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/init.d/custom_script2.sh start
> Nov 15 16:22:53 node1 IPaddr[7573]: INFO: IPaddr Running OK
> Nov 15 16:22:54 node1 MailTo[7726]: WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect
> Nov 15 16:22:54 node1 MailTo[7680]: INFO: MailTo Running OK
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Heartbeat restart on node node2
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status init
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status up
> Nov 15 16:23:12 node1 harc[7751]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:12 node1 harc[7761]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node1 heartbeat: [5731]: info: Status update for node node2: status active
> Nov 15 16:23:13 node1 harc[7771]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node1 heartbeat: [5731]: info: remote resource transition completed.
>
>
> #### Messages from Secondary Node
>
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: node node1: is dead
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: No STONITH device configured.
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: Shared disks are not protected.
> Nov 15 16:22:23 node2 heartbeat: [3494]: info: Resources being acquired from node1.
> Nov 15 16:22:23 node2 harc[14952]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:23 node2 heartbeat: [14953]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys node2] to acquire.
> Nov 15 16:22:23 node2 mach_down[14972]: info: Taking over resource group custom_script1.sh
> Nov 15 16:22:23 node2 ResourceManager[14992]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
> Nov 15 16:22:23 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
> Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/init.d/custom_script2.sh start
> Nov 15 16:22:32 node2 IPaddr[15200]: INFO: IPaddr Resource is stopped
> Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 start
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: eval /sbin/ifconfig bond0:0 10.0.0.10 netmask 255.255.255.0 broadcast 192.168.24.255
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: Sending Gratuitous Arp for 10.0.0.10 on bond0:0 [bond0]
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: /usr/lib/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.0.0.10 bond0 10.0.0.10 auto 10.0.0.10 ffffffffffff
> Nov 15 16:22:33 node2 IPaddr[15312]: INFO: IPaddr Success
> Nov 15 16:22:33 node2 MailTo[15518]: WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect
> Nov 15 16:22:33 node2 MailTo[15472]: INFO: MailTo Resource is stopped
> Nov 15 16:22:33 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld start
> Nov 15 16:22:33 node2 MailTo[15554]: INFO: MailTo Success
> Nov 15 16:22:33 node2 mach_down[14972]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> Nov 15 16:22:33 node2 mach_down[14972]: info: mach_down takeover complete for node node1.
> Nov 15 16:22:33 node2 heartbeat: [3494]: info: mach_down takeover complete.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
> Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Deadtime value may be too small.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: See FAQ for information on tuning deadtime.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: URL: http://linux-ha.org/FAQ#heavy_load
> Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Late heartbeat: Node node1: interval 41590 ms
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: Status update for node node1: status active
> Nov 15 16:22:49 node2 harc[15678]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:51 node2 heartbeat: [3494]: info: Heartbeat shutdown in progress. (3494)
> Nov 15 16:22:51 node2 heartbeat: [15689]: info: Giving up all HA resources.
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Releasing resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld stop
> Nov 15 16:22:51 node2 MailTo[15734]: INFO: MailTo Success
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 stop
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/route -n del -host 10.0.0.10
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/ifconfig bond0:0 10.0.0.10 down
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: IP Address 10.0.0.10 released
> Nov 15 16:22:51 node2 IPaddr[15824]: INFO: IPaddr Success
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/init.d/custom_script2.sh stop
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/custom_script1.sh stop
> Nov 15 16:22:53 node2 heartbeat: [15689]: info: All HA resources relinquished.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBFIFO process 3509 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3510 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3511 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3512 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3513 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3511 exited. 5 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3509 exited. 4 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3510 exited. 3 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3513 exited. 2 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3512 exited. 1 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: node2 Heartbeat shutdown complete.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Heartbeat restart triggered.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Restarting heartbeat.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Performing heartbeat restart exec.
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Core dumps could be lost if multiple dumps occur
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
> Nov 15 16:23:11 node2 heartbeat: [3494]: info: **************************
> Nov 15 16:23:11 node2 heartbeat: [3494]: info: Configuration validated. Starting heartbeat 2.0.7
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: heartbeat: version 2.0.7
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Heartbeat generation: 13
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.1
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.2
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Local status now set to: 'up'
> Nov 15 16:23:12 node2 heartbeat: [16005]: info: Link node1:bond0 up.
> Nov 15 16:23:12 node2 heartbeat: [16005]: info: Status update for node node1: status active
> Nov 15 16:23:12 node2 harc[16013]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Comm_now_up(): updating status to active
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local status now set to: 'active'
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local Resource acquisition completed. (none)
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Initial resource acquisition complete (T_RESOURCES(them))
>
>
> #### ha.cf
>
> # log to syslog
> logfacility local0
>
> # UDP heartbeats over eth0
> ucast eth0 10.0.0.1
> ucast eth0 10.0.0.2
>
> # don't try to revert to the "primary" node automatically after a failure
> auto_failback off
>
> # list of cluster members
> node node1 node2
>
> # heartbeat spacing in seconds
> keepalive 1
>
> # how long until we pronounce peer dead in seconds
> deadtime 15
>
> #### haresources
>
> node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
>
> Thanks in advance for taking the time to read this.
> Cheers,
> Ryan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
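P.P.S. Both Dejan's advice ("you need STONITH") and node2's own log ("Deadtime value may be too small") map to ha.cf directives in the v1-style setup. A hedged sketch of what the additions could look like; the STONITH device type and its parameters below are placeholders, not taken from the thread, and `stonith -L` lists the plugin types a given build actually ships:

```
# Sketch only -- device type and parameters are placeholders.

# Warn before declaring the peer dead, and give startup some slack.
# Note the stall in these logs was ~40 s, so no reasonable deadtime
# alone would have covered it; STONITH is the real safety net here.
warntime 5
deadtime 30
initdead 60

# Fence the peer on split-brain instead of letting both nodes run.
# Syntax: stonith_host <from-host> <device-type> <device-params...>
stonith_host * external/ipmi node1 10.0.0.101 admin secret
stonith_host * external/ipmi node2 10.0.0.102 admin secret
```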
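P.S. The 32-bit clock_t wrap Dejan mentions can be sanity-checked with a few lines of arithmetic. This is a sketch: HZ = 100 is an assumed tick rate (the common value on 32-bit kernels of that era), not something stated in the thread.

```python
# Back-of-the-envelope check of the 32-bit clock_t wrap behind the
# "time_longclock: clock_t wrapped around (uptime)" log messages.
# HZ = 100 is an assumption, not a value taken from this thread.
HZ = 100

wrap_seconds = 2**32 / HZ        # uptime until a 32-bit tick counter overflows
wrap_days = wrap_seconds / 86400

# node1's "started at 4294967295" is exactly 2**32 - 1, i.e. the tick
# counter was sitting at its maximum value right before the wrap.
assert 4294967295 == 2**32 - 1

print(f"clock_t wraps after about {wrap_days:.0f} days of uptime")  # about 497 days
```

That works out to roughly 497 days, in the same ballpark as the "something like 400 days" Dejan recalls, so a node that had simply been up well over a year fits the symptom.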
