Hello List. I came in this morning and found that one of my clusters had a failure over the weekend. This cluster is not yet production, but is slated to become a production cluster soon, so I am hoping I can get some help/advice to keep this type of failure from occurring again in the future.
This cluster contains two Dell 2950s whose primary job is to provide a high-availability Postgres server. The systems do not use shared disk; instead they leverage Slony to keep the databases in sync with each other. Heartbeat is used to provide a VIP address as well as to run a couple of custom scripts (one for a Java process, another to inform Slony about changes made to the status of the DB). We are using the old Heartbeat 1 style setup on this cluster. The operating system is Debian Etch, and heartbeat is version 2.0.7-2. Both machines use Ethernet bonding in failover mode and are connected to discrete blades on a Cisco 6509 switch.

Now on to the issue. Around 16:00 on Saturday afternoon, node2 declared node1 to be dead and initiated a failover. Shortly after, node1 recovered, reclaimed the resources, and caused node2 to restart heartbeat. I have both machines instrumented with Cacti, and neither saw heavy load or network traffic during this time. If this had occurred under high load or during a time of heavy network utilization, it would make more sense to me.

Looking over the configuration files, I am further perplexed: auto_failback is set to 'off'. I assumed this would prevent the failback to my primary node? Here are my log files and the relevant configuration files. After reviewing the configuration, I do think the multiple ucast directives could be problematic, but could they have caused this problem?
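For what it's worth, here is roughly how I am thinking of hardening ha.cf in response to the "Deadtime value may be too small" warning in the logs. The values and the second heartbeat path are assumptions on my part, not something I have tested yet; the crossover interface name and address are hypothetical:

```
# log to syslog
logfacility local0

# heartbeat over the bonded interface (the logs show bond0, not eth0),
# plus -- assumption -- a second, independent path over a direct crossover link
ucast bond0 10.0.0.1
ucast bond0 10.0.0.2
# ucast eth2 192.168.100.2    # hypothetical crossover link to the peer

# don't revert to the "primary" node automatically after a failure
auto_failback off

# list of cluster members
node node1 node2

keepalive 2     # heartbeat spacing in seconds
warntime 10     # warn about late heartbeats before declaring the peer dead
deadtime 60     # longer than the ~40 s stall seen in these logs
initdead 120    # extra allowance at startup
```

I have kept one ucast line per address since heartbeat appears to ignore its own address, but I would welcome confirmation on whether each node should instead list only its peer.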
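One detail that jumped out at me in the node1 log below: the "time_longclock: clock_t wrapped around (uptime)" messages, and the Gmain_timeout_dispatch timestamps sitting exactly at 4294967295, the 32-bit maximum. If my arithmetic is right (assuming a 32-bit clock_t and the common HZ=100 kernel tick rate), the counter wraps after roughly 497 days of uptime, which would be consistent with the ~40-second stall and the failover being triggered by the wrap rather than by load:

```python
# Rough sanity check (assumptions: 32-bit clock_t, kernel tick rate HZ=100).
# The counter wraps every 2**32 ticks.
wrap_seconds = 2**32 / 100          # seconds until a 32-bit jiffies counter wraps
wrap_days = wrap_seconds / 86400    # convert to days
print(f"wraps after ~{wrap_days:.1f} days")  # ~497.1 days
```

If these boxes have really been up that long, the dispatch delays and the wraparound messages may be the same event, and the fix may be a scheduled reboot (or a heartbeat build that handles the wrap more gracefully) rather than a tuning change.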
Names, IPs & scripts renamed to protect the guilty:

#### Messages from Primary Node

Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 40280 ms (> 510 ms) before being called (GSource: 0x811e098)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963267
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node1: interval 41280 ms
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Late heartbeat: Node node2: interval 41780 ms
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 40200 ms (> 510 ms) before being called (GSource: 0x811e3a0)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963275
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 40780 ms (> 15000 ms) before being called (GSource: 0x811e470)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963217
Nov 15 16:22:07 node1 heartbeat: [5731]: WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 38380 ms (> 5000 ms) before being called (GSource: 0x811e2d0)
Nov 15 16:22:07 node1 heartbeat: [5731]: info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963457
Nov 15 16:22:48 node1 heartbeat: [5734]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 heartbeat: [5738]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 heartbeat: [5731]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 logd: [5692]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:48 node1 logd: [5730]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:49 node1 heartbeat: [5735]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:49 node1 heartbeat: [5737]: info: time_longclock: clock_t wrapped around (uptime).
Nov 15 16:22:53 node1 heartbeat: [5731]: info: Received shutdown notice from 'node2'.
Nov 15 16:22:53 node1 heartbeat: [5731]: info: Resources being acquired from node2.
Nov 15 16:22:53 node1 harc[7346]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:53 node1 mach_down[7367]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Nov 15 16:22:53 node1 mach_down[7367]: info: mach_down takeover complete for node node2.
Nov 15 16:22:53 node1 heartbeat: [5731]: info: mach_down takeover complete.
Nov 15 16:22:53 node1 heartbeat: [7347]: info: Local Resource acquisition completed.
Nov 15 16:22:53 node1 harc[7408]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Nov 15 16:22:53 node1 ip-request-resp[7408]: received ip-request-resp custom_script1.sh OK yes
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
Nov 15 16:22:53 node1 ResourceManager[7423]: info: Running /etc/init.d/custom_script2.sh start
Nov 15 16:22:53 node1 IPaddr[7573]: INFO: IPaddr Running OK
Nov 15 16:22:54 node1 MailTo[7726]: WARNING: Don't stat/monitor me!
MailTo is a pseudo resource agent, so the status reported may be incorrect
Nov 15 16:22:54 node1 MailTo[7680]: INFO: MailTo Running OK
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Heartbeat restart on node node2
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status init
Nov 15 16:23:12 node1 heartbeat: [5731]: info: Status update for node node2: status up
Nov 15 16:23:12 node1 harc[7751]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:12 node1 harc[7761]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node1 heartbeat: [5731]: info: Status update for node node2: status active
Nov 15 16:23:13 node1 harc[7771]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node1 heartbeat: [5731]: info: remote resource transition completed.

#### Messages from Secondary Node

Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: node node1: is dead
Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: No STONITH device configured.
Nov 15 16:22:23 node2 heartbeat: [3494]: WARN: Shared disks are not protected.
Nov 15 16:22:23 node2 heartbeat: [3494]: info: Resources being acquired from node1.
Nov 15 16:22:23 node2 harc[14952]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:23 node2 heartbeat: [14953]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys node2] to acquire.
Nov 15 16:22:23 node2 mach_down[14972]: info: Taking over resource group custom_script1.sh
Nov 15 16:22:23 node2 ResourceManager[14992]: info: Acquiring resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:23 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/custom_script1.sh start
Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/init.d/custom_script2.sh start
Nov 15 16:22:32 node2 IPaddr[15200]: INFO: IPaddr Resource is stopped
Nov 15 16:22:32 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 start
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: eval /sbin/ifconfig bond0:0 10.0.0.10 netmask 255.255.255.0 broadcast 192.168.24.255
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: Sending Gratuitous Arp for 10.0.0.10 on bond0:0 [bond0]
Nov 15 16:22:33 node2 IPaddr[15382]: INFO: /usr/lib/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.0.0.10 bond0 10.0.0.10 auto 10.0.0.10 ffffffffffff
Nov 15 16:22:33 node2 IPaddr[15312]: INFO: IPaddr Success
Nov 15 16:22:33 node2 MailTo[15518]: WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect
Nov 15 16:22:33 node2 MailTo[15472]: INFO: MailTo Resource is stopped
Nov 15 16:22:33 node2 ResourceManager[14992]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld start
Nov 15 16:22:33 node2 MailTo[15554]: INFO: MailTo Success
Nov 15 16:22:33 node2 mach_down[14972]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Nov 15 16:22:33 node2 mach_down[14972]: info: mach_down takeover complete for node node1.
Nov 15 16:22:33 node2 heartbeat: [3494]: info: mach_down takeover complete.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Deadtime value may be too small.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: See FAQ for information on tuning deadtime.
Nov 15 16:22:49 node2 heartbeat: [3494]: info: URL: http://linux-ha.org/FAQ#heavy_load
Nov 15 16:22:49 node2 heartbeat: [3494]: WARN: Late heartbeat: Node node1: interval 41590 ms
Nov 15 16:22:49 node2 heartbeat: [3494]: info: Status update for node node1: status active
Nov 15 16:22:49 node2 harc[15678]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:22:51 node2 heartbeat: [3494]: info: Heartbeat shutdown in progress. (3494)
Nov 15 16:22:51 node2 heartbeat: [15689]: info: Giving up all HA resources.
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Releasing resource group: node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/MailTo my_email_at_domain_tld stop
Nov 15 16:22:51 node2 MailTo[15734]: INFO: MailTo Success
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.10 stop
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/route -n del -host 10.0.0.10
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: /sbin/ifconfig bond0:0 10.0.0.10 down
Nov 15 16:22:51 node2 IPaddr[15894]: INFO: IP Address 10.0.0.10 released
Nov 15 16:22:51 node2 IPaddr[15824]: INFO: IPaddr Success
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/init.d/custom_script2.sh stop
Nov 15 16:22:51 node2 ResourceManager[15699]: info: Running /etc/ha.d/resource.d/custom_script1.sh stop
Nov 15 16:22:53 node2 heartbeat: [15689]: info: All HA resources relinquished.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBFIFO process 3509 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3510 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3511 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBWRITE process 3512 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: killing HBREAD process 3513 with signal 15
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3511 exited. 5 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3509 exited. 4 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3510 exited. 3 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3513 exited. 2 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Core process 3512 exited. 1 remaining
Nov 15 16:22:55 node2 heartbeat: [3494]: info: node2 Heartbeat shutdown complete.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Heartbeat restart triggered.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Restarting heartbeat.
Nov 15 16:22:55 node2 heartbeat: [3494]: info: Performing heartbeat restart exec.
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Core dumps could be lost if multiple dumps occur
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Nov 15 16:23:11 node2 heartbeat: [3494]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Nov 15 16:23:11 node2 heartbeat: [3494]: info: **************************
Nov 15 16:23:11 node2 heartbeat: [3494]: info: Configuration validated.
Starting heartbeat 2.0.7
Nov 15 16:23:11 node2 heartbeat: [16005]: info: heartbeat: version 2.0.7
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Heartbeat generation: 13
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Removing /var/run/heartbeat/rsctmp failed, recreating.
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.1
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound send socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: bound receive socket to device: bond0
Nov 15 16:23:11 node2 heartbeat: [16005]: info: glib: ucast: started on port 694 interface bond0 to 10.0.0.2
Nov 15 16:23:11 node2 heartbeat: [16005]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Nov 15 16:23:11 node2 heartbeat: [16005]: info: Local status now set to: 'up'
Nov 15 16:23:12 node2 heartbeat: [16005]: info: Link node1:bond0 up.
Nov 15 16:23:12 node2 heartbeat: [16005]: info: Status update for node node1: status active
Nov 15 16:23:12 node2 harc[16013]: info: Running /etc/ha.d/rc.d/status status
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Comm_now_up(): updating status to active
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local status now set to: 'active'
Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
Nov 15 16:23:13 node2 heartbeat: [16005]: info: remote resource transition completed.
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Local Resource acquisition completed. (none)
Nov 15 16:23:13 node2 heartbeat: [16005]: info: Initial resource acquisition complete (T_RESOURCES(them))

#### ha.cf

# log to syslog
logfacility local0

# UDP heartbeats over eth0
ucast eth0 10.0.0.1
ucast eth0 10.0.0.2

# don't try to revert to the "primary" node automatically after a failure
auto_failback off

# list of cluster members
node node1 node2

# heartbeat spacing in seconds
keepalive 1

# how long until we pronounce peer dead in seconds
deadtime 15

#### haresources

node1 custom_script1.sh custom_script2.sh 10.0.0.10 MailTo::my_email_at_domain_tld

Thanks in advance for taking the time to read this.

Cheers,
Ryan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
