Hi,

On Tue, Nov 18, 2008 at 08:29:07AM -0700, Ryan Kish wrote:
> Hello List.
> 
> I came in this morning and found that one of my clusters had a failure
> over the weekend. This cluster is not yet production, but is slated to
> become a production cluster soon, so I am hoping I can get some
> help/advice to keep this type of failure from occurring again in the
> future.
> 
> This cluster contains two Dell 2950s whose primary job is to provide
> a high-availability Postgres server. The systems do not use shared
> disk; rather, they leverage Slony to keep the DBs in sync with each
> other. Heartbeat is used to provide a VIP address as well as to run a
> couple of custom scripts (one for a java process, another to inform
> slony about the changes made to the status of the DB).
> 
> We are using the old Heartbeat 1 style setup on this cluster. The
> operating system is Debian Etch, heartbeat is version 2.0.7-2. Both
> machines are using Ethernet Bonding in failover mode, and are then
> connected to discrete blades on a Cisco 6509 switch.
> 
> Now on to the issue. Around 16:00 on Saturday afternoon, node2
> declared node1 to be dead and initiated a failover. Shortly after,
> node1 recovered, reclaimed the resources and caused node2 to restart
> heartbeat. I have both machines instrumented with Cacti and neither
> saw heavy load or network traffic during this time. If this were
> to occur with high load or during a time of heavy network utilization,
> it would make more sense to me.

According to the logs, heartbeats were unscheduled for about 40
seconds:

> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 40280 ms (> 510
> ms) before being called (GSource: 0x811e098)

There was also this:

> Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t
> wrapped around (uptime).

Which reminds me of an old bug connected with uptime greater than
roughly 497 days (a 32-bit clock_t wrap at HZ=100). Though that
should've been fixed before 2.0.7 (it's described in bugzilla too).

At any rate, this made node2 think that node1 had died. That's a
split-brain, and you need STONITH to handle split-brains.
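In a v1-style setup STONITH goes directly into ha.cf. A minimal sketch, assuming the 2950s' DRAC/IPMI BMCs are reachable on the management network (every address, username and password below is a placeholder, and you should run `stonith -L` to see which plugins your build actually ships):

```
# ha.cf fragment -- hypothetical addresses and credentials
# stonith_host <hostfrom> <stonith-type> <params...>
stonith_host node1 external/ipmi node2 10.0.0.102 admin password lan
stonith_host node2 external/ipmi node1 10.0.0.101 admin password lan
```

With that in place, a node that believes its peer is dead will power it off before taking over, so the "dead" node can never come back and reclaim resources mid-takeover.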

> Looking over the configuration files,
> I am further perplexed, auto_failback is set to 'off'. I assumed this
> would prevent the fail back to my primary node?

This wasn't exactly a failover, and not a failback either. node1 never
stopped its resources; they were just (erroneously, because of the
split-brain) started on node2 as well. Once node2 realized that node1
was not actually dead, I guess it decided to shut down and restart
heartbeat (sorry, my v1 knowledge is rusty).

Thanks,

Dejan

> Here are my log files and the relevant configuration files. After
> reviewing the configuration, I do think the multiple ucast directives
> could be problematic, but could they have caused this problem?  Names,
> IPs & scripts renamed to protect the guilty:
> 
> #### Messages from Primary Node
> 
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 40280 ms (> 510
> ms) before being called (GSource: 0x811e098)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch:
> started at 4294967295 should have started at 4294963267
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node
> node1: interval 41280 ms
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node
> node2: interval 41780 ms
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 40200 ms (> 510
> ms) before being called (GSource: 0x811e3a0)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch:
> started at 4294967295 should have started at 4294963275
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch:
> Dispatch function for update msgfree count was delayed 40780 ms (>
> 15000 ms) before being called (GSource: 0x811e470)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch:
> started at 4294967295 should have started at 4294963217
> Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch:
> Dispatch function for client audit was delayed 38380 ms (> 5000 ms)
> before being called (GSource: 0x811e2d0)
> Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch:
> started at 4294967295 should have started at 4294963457
> Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:48 node1 heartbeat: [5738]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:48 node1 heartbeat: [5731]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:48 node1 logd: [5692]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:48 node1 logd: [5730]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:49 node1 heartbeat: [5735]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:49 node1 heartbeat: [5737]: info: time_longclock: clock_t
> wrapped around (uptime).
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: Received shutdown
> notice from 'node2'.
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: Resources being
> acquired from node2.
> Nov 15 16:22:53 node1 harc[7346]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:53 node1 mach_down[7367]: info:
> /usr/lib/heartbeat/mach_down: nice_failback: foreign resources
> acquired
> Nov 15 16:22:53 node1 mach_down[7367]: info: mach_down takeover
> complete for node node2.
> Nov 15 16:22:53 node1 heartbeat: [5731]: info: mach_down takeover complete.
> Nov 15 16:22:53 node1 heartbeat: [7347]: info: Local Resource
> acquisition completed.
> Nov 15 16:22:53 node1 harc[7408]: info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> Nov 15 16:22:53 node1 ip-request-resp[7408]: received ip-request-resp
> custom_script1.sh OK yes
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Acquiring resource
> group: node1 custom_script1.sh custom_script2.sh 10.0.0.10
> MailTo::my_email_at_domain_tld
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running
> /etc/ha.d/resource.d/custom_script1.sh  start
> Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running
> /etc/init.d/custom_script2.sh  start
> Nov 15 16:22:53 node1 IPaddr[7573]: INFO: IPaddr Running OK
> Nov 15 16:22:54 node1 MailTo[7726]: WARNING: Don't stat/monitor me!
> MailTo is a pseudo resource agent, so the status reported may be
> incorrect
> Nov 15 16:22:54 node1 MailTo[7680]: INFO: MailTo Running OK
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Heartbeat restart on node node2
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node
> node2: status init
> Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node
> node2: status up
> Nov 15 16:23:12 node1 harc[7751]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:12 node1 harc[7761]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node1 heartbeat: [5731]: info: Status update for node
> node2: status active
> Nov 15 16:23:13 node1 harc[7771]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node1 heartbeat: [5731]: info: remote resource
> transition completed.
> 
> 
> #### Messages from Secondary Node
> 
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: node node1: is dead
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: No STONITH device configured.
> Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: Shared disks are not protected.
> Nov 15 16:22:23 node2 heartbeat: [3494]: info: Resources being
> acquired from node1.
> Nov 15 16:22:23 node2 harc[14952]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:23 node2 heartbeat: [14953]: info: No local resources
> [/usr/lib/heartbeat/ResourceManager listkeys node2] to acquire.
> Nov 15 16:22:23 node2 mach_down[14972]: info: Taking over resource
> group custom_script1.sh
> Nov 15 16:22:23 node2 ResourceManager[14992]: info: Acquiring resource
> group: node1 custom_script1.sh custom_script2.sh 10.0.0.10
> MailTo::my_email_at_domain_tld
> Nov 15 16:22:23 node2 ResourceManager[14992]: info: Running
> /etc/ha.d/resource.d/custom_script1.sh  start
> Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running
> /etc/init.d/custom_script2.sh  start
> Nov 15 16:22:32 node2 IPaddr[15200]: INFO: IPaddr Resource is stopped
> Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running
> /etc/ha.d/resource.d/IPaddr 10.0.0.10 start
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: eval /sbin/ifconfig bond0:0
> 10.0.0.10 netmask 255.255.255.0 broadcast 192.168.24.255
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: Sending Gratuitous Arp for
> 10.0.0.10 on bond0:0 [bond0]
> Nov 15 16:22:33 node2 IPaddr[15382]: INFO: /usr/lib/heartbeat/send_arp
> -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.0.0.10
> bond0 10.0.0.10 auto 10.0.0.10
> ffffffffffff
> Nov 15 16:22:33 node2 IPaddr[15312]: INFO: IPaddr Success
> Nov 15 16:22:33 node2 MailTo[15518]: WARNING: Don't stat/monitor me!
> MailTo is a pseudo resource agent, so the status reported may be
> incorrect
> Nov 15 16:22:33 node2 MailTo[15472]: INFO: MailTo Resource is stopped
> Nov 15 16:22:33 node2 ResourceManager[14992]: info: Running
> /etc/ha.d/resource.d/MailTo my_email_at_domain_tld start
> Nov 15 16:22:33 node2 MailTo[15554]: INFO: MailTo Success
> Nov 15 16:22:33 node2 mach_down[14972]: info:
> /usr/lib/heartbeat/mach_down: nice_failback: foreign resources
> acquired
> Nov 15 16:22:33 node2 mach_down[14972]: info: mach_down takeover
> complete for node node1.
> Nov 15 16:22:33 node2 heartbeat: [3494]: info: mach_down takeover complete.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: For information on
> cluster partitions, See URL: http://linux-ha.org/SplitBrain
> Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Deadtime value may be too 
> small.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: See FAQ for information
> on tuning deadtime.
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: URL:
> http://linux-ha.org/FAQ#heavy_load
> Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Late heartbeat: Node
> node1: interval 41590 ms
> Nov 15 16:22:49 node2 heartbeat: [3494]: info: Status update for node
> node1: status active
> Nov 15 16:22:49 node2 harc[15678]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:22:51 node2 heartbeat: [3494]: info: Heartbeat shutdown in
> progress. (3494)
> Nov 15 16:22:51 node2 heartbeat: [15689]: info: Giving up all HA resources.
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Releasing resource
> group: node1 custom_script1.sh custom_script2.sh 10.0.0.10
> MailTo::my_email_at_domain_tld
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running
> /etc/ha.d/resource.d/MailTo my_email_at_domain_tld stop
> Nov 15 16:22:51 node2 MailTo[15734]: INFO: MailTo Success
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running
> /etc/ha.d/resource.d/IPaddr 10.0.0.10 stop
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/route -n del -host 10.0.0.10
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/ifconfig bond0:0 10.0.0.10 
> down
> Nov 15 16:22:51 node2 IPaddr[15894]: INFO: IP Address 10.0.0.10 released
> Nov 15 16:22:51 node2 IPaddr[15824]: INFO: IPaddr Success
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running
> /etc/init.d/custom_script2.sh  stop
> Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running
> /etc/ha.d/resource.d/custom_script1.sh  stop
> Nov 15 16:22:53 node2 heartbeat: [15689]: info: All HA resources relinquished.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBFIFO process
> 3509 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process
> 3510 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process
> 3511 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process
> 3512 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process
> 3513 with signal 15
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3511
> exited. 5 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3509
> exited. 4 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3510
> exited. 3 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3513
> exited. 2 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3512
> exited. 1 remaining
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: node2 Heartbeat
> shutdown complete.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Heartbeat restart triggered.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Restarting heartbeat.
> Nov 15 16:22:55 node2 heartbeat: [3494]: info: Performing heartbeat
> restart exec.
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Core dumps could be
> lost if multiple dumps occur
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Logging daemon is
> disabled --enabling logging daemon is recommended
> Nov 15 16:23:11 node2 heartbeat: [3494]: info: **************************
> Nov 15 16:23:11 node2 heartbeat: [3494]: info: Configuration
> validated. Starting heartbeat 2.0.7
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: heartbeat: version 2.0.7
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Heartbeat generation: 13
> Nov 15 16:23:11 node2 heartbeat: [16005]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Nov 15 16:23:11 node2 heartbeat: [16005]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write
> socket priority set to IPTOS_LOWDELAY on bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound
> send socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound
> receive socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started
> on port 694 interface bond0 to 10.0.0.1
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write
> socket priority set to IPTOS_LOWDELAY on bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound
> send socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound
> receive socket to device: bond0
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started
> on port 694 interface bond0 to 10.0.0.2
> Nov 15 16:23:11 node2 heartbeat: [16005]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Nov 15 16:23:11 node2 heartbeat: [16005]: info: Local status now set to: 'up'
> Nov 15 16:23:12 node2 heartbeat: [16005]: info: Link node1:bond0 up.
> Nov 15 16:23:12 node2 heartbeat: [16005]: info: Status update for node
> node1: status active
> Nov 15 16:23:12 node2 harc[16013]: info: Running /etc/ha.d/rc.d/status status
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Comm_now_up():
> updating status to active
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local status now set
> to: 'active'
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource
> transition completed.
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource
> transition completed.
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local Resource
> acquisition completed. (none)
> Nov 15 16:23:13 node2 heartbeat: [16005]: info: Initial resource
> acquisition complete (T_RESOURCES(them))
> 
> 
> #### ha.cf
> 
> # log to syslog
> logfacility local0
> 
> # UDP heartbeats over eth0
> ucast eth0 10.0.0.1
> ucast eth0 10.0.0.2
> 
> # don't try to revert to the "primary" node automatically after a failure
> auto_failback off
> 
> # list of cluster members
> node node1 node2
> 
> # heartbeat spacing in seconds
> keepalive 1
> 
> # how long until we pronounce peer dead in seconds
> deadtime 15
> 
> #### haresources
> 
> node1 custom_script1.sh custom_script2.sh 10.0.0.10
> MailTo::my_email_at_domain_tld
> 
> Thanks in advance for taking the time to read this.
> Cheers,
> Ryan
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems