Hello,
i've set up a two node cluster on FC4 with latest updates. When
restarting the active node i get some strange effects when it's coming
up again. In the ha-log of the rebooted node there are some messages
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1:
heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_t
wrapped around (uptime).
(see logs)
Later on heartbeat restarts two times with a split brain situation and
then is everything fine again.
So what could be wrong here? I can't see that the machine is under heavy
load when heartbeat is coming up. It's a dual xeon 3.0 ghz dell server.
Any suggestions?
Regards,
Markus
========== File ha.cf
logfacility local0
logfile /var/log/ha-log
debugfile /var/log/ha-debug
auto_failback off
crm false
deadping 20
deadtime 20
initdead 60
keepalive 2
warntime 10
udpport 694
ucast eth1 192.168.0.71
ucast eth1 192.168.0.72
node mynode1 mynode2
traditional_compression false
==========
192.168.0.71 is internal IP mynode1
192.168.0.72 is internal IP mynode2 (crossover cable)
========== ha-log on mynode1
-----> Reboot...
heartbeat[8812]: 2006/04/06_15:01:57 info: Heartbeat shutdown in
progress. (8812)
heartbeat[30362]: 2006/04/06_15:01:57 info: Giving up all HA resources.
ResourceManager[30372]: 2006/04/06_15:01:57 info: Releasing resource
group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk
Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
...
heartbeat[8812]: 2006/04/06_15:04:35 info: Heartbeat shutdown complete.
-----> So far, so good. Resources are now running on node2,
-----> heartbeat is coming up on mynode1 again...
heartbeat[2022]: 2006/04/06_15:07:35 info: **************************
heartbeat[2022]: 2006/04/06_15:07:35 info: Configuration validated.
Starting heartbeat 2.0.3
heartbeat[2023]: 2006/04/06_15:07:35 info: heartbeat: version 2.0.3
heartbeat[2023]: 2006/04/06_15:07:35 info: Heartbeat generation: 97
heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[2023]: 2006/04/06_15:07:35 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound sendsocket
to device: eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receive
socket todevice: eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port
694 interface eth1 to 192.168.0.71
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound send
socket to device: eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port
694 interface eth1 to 192.168.0.72
heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[2023]: 2006/04/06_15:07:35 info: Local status now set to: 'up'
heartbeat[2023]: 2006/04/06_15:07:35 info: Exiting write_hostcachedata
process 2038 returned rc 0.
heartbeat[2023]: 2006/04/06_15:07:37 info: Link mynode2:eth1 up.
heartbeat[2023]: 2006/04/06_15:07:37 info: Status update for node
mynode2: status active
heartbeat[2023]: 2006/04/06_15:07:37 info: Exiting write_hostcachedata
process 2107 returned rc 0.
harc[2106]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status
status
heartbeat[2023]: 2006/04/06_15:07:38 info: Comm_now_up(): updating
status to active
heartbeat[2023]: 2006/04/06_15:07:38 info: Local status now set to: 'active'
heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transition
completed.
heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transition
completed.
heartbeat[2023]: 2006/04/06_15:07:38 info: Local Resource acquisition
completed. (none)
heartbeat[2023]: 2006/04/06_15:07:38 info: Initial resource acquisition
complete (T_RESOURCES(them))
-----> Now it's going strange...
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 39430 ms before
being called (GSource: 0x81486b8)
heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:
started at 4294967295 should have started at 4294963352
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1:
interval 41430 ms
heartbeat[2023]: 2006/04/06_15:11:31 WARN: node mynode2: is dead
heartbeat[2023]: 2006/04/06_15:11:31 WARN: No STONITH device configured.
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Shared disks are not protected.
heartbeat[2023]: 2006/04/06_15:11:31 info: Resources being acquired from
mynode2.
heartbeat[2023]: 2006/04/06_15:11:31 info: Link mynode2:eth1 dead.
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 39430 ms before
being called (GSource: 0x81489c0)
heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:
started at 4294967295 should have started at 4294963352
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:
Dispatch function for update msgfree count was delayed 40930 ms before
being called (GSource: 0x8148a90)
heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:
started at 4294967295 should have started at 4294963202
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:
Dispatch function for client audit was delayed 33320 ms before being
called (GSource: 0x81488f0)
heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:
started at 4294967295 should have started at 4294963963
harc[2477]: 2006/04/06_15:11:31 info: Running /etc/ha.d/rc.d/status
status
mach_down[2498]: 2006/04/06_15:11:31 info:
/usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[2498]: 2006/04/06_15:11:31 info: mach_down takeover
complete for node mynode2.
heartbeat[2023]: 2006/04/06_15:11:31 info: mach_down takeover complete.
...
-----> After this heartbeat is running into a split brain situation with
-----> mynode2 holding the resources and mynode1 trying to get the
-----> resources:
...
heartbeat[2023]: 2006/04/06_15:11:33 CRIT: Cluster node mynode2
returning after partition.
heartbeat[2023]: 2006/04/06_15:11:33 info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[2023]: 2006/04/06_15:11:33 WARN: Deadtime value may be too small.
...
< Now everything is stopped on mynode1 >
...
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreign
resources!
ResourceManager[2655]: 2006/04/06_15:11:54 ERROR: Return code 1 from
/etc/ha.d/resource.d/sshd_cluster
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreign
resources!
...
-----> Strange messages again...
heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_t
wrapped around (uptime).
heartbeat[2033]: 2006/04/06_15:12:13 info: time_longclock: clock_t
wrapped around (uptime).
heartbeat[2037]: 2006/04/06_15:12:13 info: time_longclock: clock_t
wrapped around (uptime).
heartbeat[2034]: 2006/04/06_15:12:14 info: time_longclock: clock_t
wrapped around (uptime).
heartbeat[2023]: 2006/04/06_15:12:14 info: Heartbeat shutdown in
progress. (2023)
heartbeat[2036]: 2006/04/06_15:12:14 info: time_longclock: clock_t
wrapped around (uptime).
heartbeat[4962]: 2006/04/06_15:12:14 info: Giving up all HA resources.
...
========== ha-log on mynode2
-----> Takeover when rebooting mynode1
mach_down[29198]: 2006/04/06_15:04:49 info: mach_down takeover
complete for node mynode11.
heartbeat[17386]: 2006/04/06_15:04:49 info: mach_down takeover complete.
heartbeat[17386]: 2006/04/06_15:04:56 WARN: node mynode1: is dead
heartbeat[17386]: 2006/04/06_15:04:56 info: Dead node mynode1 gave up
resources.
heartbeat[17386]: 2006/04/06_15:04:56 info: Link mynode1:eth1 dead.
heartbeat[17386]: 2006/04/06_15:07:37 info: Heartbeat restart on node
mynode1
heartbeat[17386]: 2006/04/06_15:07:37 info: Link mynode1:eth1 up.
heartbeat[17386]: 2006/04/06_15:07:37 info: Status update for node
mynode1: status init
heartbeat[17386]: 2006/04/06_15:07:37 info: Status update for node
mynode1: status up
harc[29744]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status
status
harc[29754]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status
status
heartbeat[17386]: 2006/04/06_15:07:38 info: Status update for node
mynode1: status active
harc[29764]: 2006/04/06_15:07:38 info: Running /etc/ha.d/rc.d/status
status
heartbeat[17386]: 2006/04/06_15:07:39 info: remote resource transition
completed.
heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own our
resources!
heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own our
resources!
heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own foreign
resources!
heartbeat[17386]: 2006/04/06_15:11:53 WARN: node mynode1: is dead
heartbeat[17386]: 2006/04/06_15:11:53 WARN: No STONITH device configured.
heartbeat[17386]: 2006/04/06_15:11:53 WARN: Shared disks are not protected.
heartbeat[17386]: 2006/04/06_15:11:53 info: Resources being acquired
from mynode1.
harc[29787]: 2006/04/06_15:11:53 info: Running /etc/ha.d/rc.d/status
status
heartbeat[29788]: 2006/04/06_15:11:53 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys mynode2] to acquire.
mach_down[29807]: 2006/04/06_15:11:53 info: Taking over resource
group 10.0.0.1
ResourceManager[29827]: 2006/04/06_15:11:53 info: Acquiring resource
group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk
Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
IPaddr[29851]: 2006/04/06_15:11:54 INFO: IPaddr Running OK
IPsrcaddr[29958]: 2006/04/06_15:11:54 INFO: IPsrcaddr Running OK
ResourceManager[29827]: 2006/04/06_15:11:54 info: Running
/etc/ha.d/resource.d/drbddisk start
Filesystem[30125]: 2006/04/06_15:11:54 INFO: /home is mounted (running)
Filesystem[30061]: 2006/04/06_15:11:54 INFO: Filesystem Running OK
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[29807]: 2006/04/06_15:11:54 info: mach_down takeover
complete for node mynode1.
heartbeat[17386]: 2006/04/06_15:11:54 info: mach_down takeover complete.
heartbeat[17386]: 2006/04/06_15:12:14 CRIT: Cluster node mynode1
returning after partition.
heartbeat[17386]: 2006/04/06_15:12:14 info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[17386]: 2006/04/06_15:12:14 WARN: Deadtime value may be too small.
heartbeat[17386]: 2006/04/06_15:12:14 info: See FAQ for information on
tuning deadtime.
heartbeat[17386]: 2006/04/06_15:12:14 info: URL:
http://linux-ha.org/FAQ#heavy_load
heartbeat[17386]: 2006/04/06_15:12:14 WARN: Late heartbeat: Node
mynode1: interval 41310 ms
heartbeat[17386]: 2006/04/06_15:12:14 info: Status update for node
mynode1: status active
harc[30300]: 2006/04/06_15:12:14 info: Running /etc/ha.d/rc.d/status
status
heartbeat[17386]: 2006/04/06_15:12:16 WARN: Shutdown delayed until
current resource activity finishes.
heartbeat[17386]: 2006/04/06_15:12:30 WARN: standby message [me] from
mynode1 ignored. Other side is in flux.
heartbeat[17386]: 2006/04/06_15:12:36 info: Heartbeat shutdown in
progress. (17386)
heartbeat[17386]: 2006/04/06_15:12:36 info: Received shutdown notice
from 'mynode1'.
heartbeat[17386]: 2006/04/06_15:12:36 info: Resource takeover cancelled
- shutdown in progress.
heartbeat[30312]: 2006/04/06_15:12:36 info: Giving up all HA resources.
ResourceManager[30322]: 2006/04/06_15:12:36 info: Releasing resource
group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk
Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
heartbeat[17386]: 2006/04/06_15:12:59 info: Link mynode1:eth1 dead.
heartbeat[17386]: 2006/04/06_15:13:01 info: Heartbeat restart on node
mynode1
heartbeat[17386]: 2006/04/06_15:13:01 info: Link mynode1:eth1 up.
heartbeat[17386]: 2006/04/06_15:13:01 info: Status update for node
mynode1: status init
heartbeat[17386]: 2006/04/06_15:13:01 info: Status update for node
mynode1: status up
heartbeat[17386]: 2006/04/06_15:13:02 info: Status update for node
mynode1: status active
heartbeat[17386]: 2006/04/06_15:13:13 info: remote resource transition
completed.
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs defaults,noatime stop
Filesystem[30507]: 2006/04/06_15:14:41 INFO: Filesystem Success
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running
/etc/ha.d/resource.d/drbddisk stop
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running
/etc/ha.d/resource.d/IPsrcaddr 10.0.0.1 stop
IPsrcaddr[30637]: 2006/04/06_15:14:41 INFO: IPsrcaddr Success
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running
/etc/ha.d/resource.d/IPaddr 10.0.0.1 stop
IPaddr[30791]: 2006/04/06_15:14:41 INFO: /sbin/route -n del -host 10.0.0.1
IPaddr[30791]: 2006/04/06_15:14:41 INFO: /sbin/ifconfig eth0:0 10.0.0.1
down
IPaddr[30791]: 2006/04/06_15:14:41 INFO: IP Address 10.0.0.1 released
IPaddr[30721]: 2006/04/06_15:14:41 INFO: IPaddr Success
heartbeat[30312]: 2006/04/06_15:14:41 info: All HA resources relinquished.
harc[30838]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status
status
harc[30848]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status
status
harc[30858]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status
status
heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17391
with signal 15
heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process
17392 with signal 15
heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17393
with signal 15
heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBFIFO process 17389
with signal 15
heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process
17390 with signal 15
heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17389 exited. 5
remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17392 exited. 4
remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17393 exited. 3
remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17390 exited. 2
remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17391 exited. 1
remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat shutdown complete.
heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat restart triggered.
heartbeat[17386]: 2006/04/06_15:14:43 info: Restarting heartbeat.
heartbeat[17386]: 2006/04/06_15:14:43 info: Performing heartbeat restart
exec.
heartbeat[17386]: 2006/04/06_15:15:04 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[17386]: 2006/04/06_15:15:04 info: **************************
heartbeat[17386]: 2006/04/06_15:15:04 info: Configuration validated.
Starting heartbeat 2.0.3
heartbeat[30868]: 2006/04/06_15:15:04 info: heartbeat: version 2.0.3
heartbeat[30868]: 2006/04/06_15:15:05 info: Heartbeat generation: 72
heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[30868]: 2006/04/06_15:15:05 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound send
socket to device: eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port
694 interface eth1 to 192.168.0.71
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound send
socket to device: eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port
694 interface eth1 to 192.168.0.72
heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[30868]: 2006/04/06_15:15:05 info: Local status now set to: 'up'
heartbeat[30868]: 2006/04/06_15:15:05 info: Exiting write_hostcachedata
process 30875 returned rc 0.
heartbeat[30868]: 2006/04/06_15:15:06 info: Link mynode1:eth1 up.
heartbeat[30868]: 2006/04/06_15:15:06 info: Status update for node
mynode1: status active
heartbeat[30868]: 2006/04/06_15:15:06 info: Exiting write_hostcachedata
process 30877 returned rc 0.
harc[30876]: 2006/04/06_15:15:06 info: Running /etc/ha.d/rc.d/status
status
heartbeat[30868]: 2006/04/06_15:15:07 info: Comm_now_up(): updating
status to active
heartbeat[30868]: 2006/04/06_15:15:07 info: Local status now set to:
'active'
heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transition
completed.
heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transition
completed.
heartbeat[30868]: 2006/04/06_15:15:07 info: Local Resource acquisition
completed. (none)
heartbeat[30868]: 2006/04/06_15:15:07 info: Initial resource acquisition
complete (T_RESOURCES(them))
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/