Hello,

i've set up a two node cluster on FC4 with latest updates. When restarting the active node i get some strange effects when it's coming up again. In the ha-log of the rebooted node there are some messages

heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1:
heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_t wrapped around (uptime).
(see logs)

Later on heartbeat restarts two times with a split brain situation and then is everything fine again.

So what could be wrong here? I can't see that the machine is under heavy load when heartbeat is coming up. It's a dual xeon 3.0 ghz dell server.

Any suggestions?

Regards,
Markus

========== File ha.cf

logfacility local0
logfile /var/log/ha-log
debugfile /var/log/ha-debug

auto_failback off
crm false
deadping 20
deadtime 20
initdead 60
keepalive 2
warntime 10

udpport 694
ucast eth1 192.168.0.71
ucast eth1 192.168.0.72

node mynode1 mynode2

traditional_compression false

==========

192.168.0.71 is internal IP mynode1
192.168.0.72 is internal IP mynode2 (crossover cable)


========== ha-log on mynode1

-----> Reboot...

heartbeat[8812]: 2006/04/06_15:01:57 info: Heartbeat shutdown in progress. (8812) heartbeat[30362]: 2006/04/06_15:01:57 info: Giving up all HA resources. ResourceManager[30372]: 2006/04/06_15:01:57 info: Releasing resource group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
...
heartbeat[8812]: 2006/04/06_15:04:35 info: Heartbeat shutdown complete.

-----> So far, so good. Resources are now running on node2,
-----> heartbeat is coming up on mynode1 again...

heartbeat[2022]: 2006/04/06_15:07:35 info: **************************
heartbeat[2022]: 2006/04/06_15:07:35 info: Configuration validated. Starting heartbeat 2.0.3
heartbeat[2023]: 2006/04/06_15:07:35 info: heartbeat: version 2.0.3
heartbeat[2023]: 2006/04/06_15:07:35 info: Heartbeat generation: 97
heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[2023]: 2006/04/06_15:07:35 info: Removing /var/run/heartbeat/rsctmp failed, recreating. heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound sendsocket to device: eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receive socket todevice: eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port 694 interface eth1 to 192.168.0.71 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound send socket to device: eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receive socket to device: eth1 heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port 694 interface eth1 to 192.168.0.72 heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[2023]: 2006/04/06_15:07:35 info: Local status now set to: 'up'
heartbeat[2023]: 2006/04/06_15:07:35 info: Exiting write_hostcachedata process 2038 returned rc 0.
heartbeat[2023]: 2006/04/06_15:07:37 info: Link mynode2:eth1 up.
heartbeat[2023]: 2006/04/06_15:07:37 info: Status update for node mynode2: status active heartbeat[2023]: 2006/04/06_15:07:37 info: Exiting write_hostcachedata process 2107 returned rc 0. harc[2106]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status status heartbeat[2023]: 2006/04/06_15:07:38 info: Comm_now_up(): updating status to active
heartbeat[2023]: 2006/04/06_15:07:38 info: Local status now set to: 'active'
heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transition completed. heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transition completed. heartbeat[2023]: 2006/04/06_15:07:38 info: Local Resource acquisition completed. (none) heartbeat[2023]: 2006/04/06_15:07:38 info: Initial resource acquisition complete (T_RESOURCES(them))

-----> Now it's going strange...

heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 39430 ms before being called (GSource: 0x81486b8) heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963352 heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1: interval 41430 ms
heartbeat[2023]: 2006/04/06_15:11:31 WARN: node mynode2: is dead
heartbeat[2023]: 2006/04/06_15:11:31 WARN: No STONITH device configured.
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Shared disks are not protected.
heartbeat[2023]: 2006/04/06_15:11:31 info: Resources being acquired from mynode2.
heartbeat[2023]: 2006/04/06_15:11:31 info: Link mynode2:eth1 dead.
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 39430 ms before being called (GSource: 0x81489c0) heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963352 heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 40930 ms before being called (GSource: 0x8148a90) heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963202 heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 33320 ms before being called (GSource: 0x81488f0) heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch: started at 4294967295 should have started at 4294963963 harc[2477]: 2006/04/06_15:11:31 info: Running /etc/ha.d/rc.d/status status mach_down[2498]: 2006/04/06_15:11:31 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[2498]: 2006/04/06_15:11:31 info: mach_down takeover complete for node mynode2.
heartbeat[2023]: 2006/04/06_15:11:31 info: mach_down takeover complete.
...
-----> After this heartbeat is running into a split brain situation with
-----> mynode2 holding the resources and mynode1 trying to get the -----> resources:
...
heartbeat[2023]: 2006/04/06_15:11:33 CRIT: Cluster node mynode2 returning after partition. heartbeat[2023]: 2006/04/06_15:11:33 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[2023]: 2006/04/06_15:11:33 WARN: Deadtime value may be too small.
...
< Now everything is stopped on mynode1 >
...
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreign resources! ResourceManager[2655]: 2006/04/06_15:11:54 ERROR: Return code 1 from /etc/ha.d/resource.d/sshd_cluster
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreign resources!
...

-----> Strange messages again...

heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_t wrapped around (uptime). heartbeat[2033]: 2006/04/06_15:12:13 info: time_longclock: clock_t wrapped around (uptime). heartbeat[2037]: 2006/04/06_15:12:13 info: time_longclock: clock_t wrapped around (uptime). heartbeat[2034]: 2006/04/06_15:12:14 info: time_longclock: clock_t wrapped around (uptime). heartbeat[2023]: 2006/04/06_15:12:14 info: Heartbeat shutdown in progress. (2023) heartbeat[2036]: 2006/04/06_15:12:14 info: time_longclock: clock_t wrapped around (uptime).
heartbeat[4962]: 2006/04/06_15:12:14 info: Giving up all HA resources.
...


========== ha-log on mynode2

-----> Takeover when rebooting mynode1

mach_down[29198]: 2006/04/06_15:04:49 info: mach_down takeover complete for node mynode11.
heartbeat[17386]: 2006/04/06_15:04:49 info: mach_down takeover complete.
heartbeat[17386]: 2006/04/06_15:04:56 WARN: node mynode1: is dead
heartbeat[17386]: 2006/04/06_15:04:56 info: Dead node mynode1 gave up resources.
heartbeat[17386]: 2006/04/06_15:04:56 info: Link mynode1:eth1 dead.
heartbeat[17386]: 2006/04/06_15:07:37 info: Heartbeat restart on node mynode1
heartbeat[17386]: 2006/04/06_15:07:37 info: Link mynode1:eth1 up.
heartbeat[17386]: 2006/04/06_15:07:37 info: Status update for node mynode1: status init heartbeat[17386]: 2006/04/06_15:07:37 info: Status update for node mynode1: status up harc[29744]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status status harc[29754]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/status status heartbeat[17386]: 2006/04/06_15:07:38 info: Status update for node mynode1: status active harc[29764]: 2006/04/06_15:07:38 info: Running /etc/ha.d/rc.d/status status heartbeat[17386]: 2006/04/06_15:07:39 info: remote resource transition completed. heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own our resources! heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own our resources! heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own foreign resources!
heartbeat[17386]: 2006/04/06_15:11:53 WARN: node mynode1: is dead
heartbeat[17386]: 2006/04/06_15:11:53 WARN: No STONITH device configured.
heartbeat[17386]: 2006/04/06_15:11:53 WARN: Shared disks are not protected.
heartbeat[17386]: 2006/04/06_15:11:53 info: Resources being acquired from mynode1. harc[29787]: 2006/04/06_15:11:53 info: Running /etc/ha.d/rc.d/status status heartbeat[29788]: 2006/04/06_15:11:53 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys mynode2] to acquire. mach_down[29807]: 2006/04/06_15:11:53 info: Taking over resource group 10.0.0.1 ResourceManager[29827]: 2006/04/06_15:11:53 info: Acquiring resource group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
IPaddr[29851]:  2006/04/06_15:11:54 INFO: IPaddr Running OK
IPsrcaddr[29958]:       2006/04/06_15:11:54 INFO: IPsrcaddr Running OK
ResourceManager[29827]: 2006/04/06_15:11:54 info: Running /etc/ha.d/resource.d/drbddisk start
Filesystem[30125]:      2006/04/06_15:11:54 INFO: /home is mounted (running)
Filesystem[30061]:      2006/04/06_15:11:54 INFO: Filesystem Running OK
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[29807]: 2006/04/06_15:11:54 info: mach_down takeover complete for node mynode1.
heartbeat[17386]: 2006/04/06_15:11:54 info: mach_down takeover complete.
heartbeat[17386]: 2006/04/06_15:12:14 CRIT: Cluster node mynode1 returning after partition. heartbeat[17386]: 2006/04/06_15:12:14 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[17386]: 2006/04/06_15:12:14 WARN: Deadtime value may be too small.
heartbeat[17386]: 2006/04/06_15:12:14 info: See FAQ for information on tuning deadtime. heartbeat[17386]: 2006/04/06_15:12:14 info: URL: http://linux-ha.org/FAQ#heavy_load heartbeat[17386]: 2006/04/06_15:12:14 WARN: Late heartbeat: Node mynode1: interval 41310 ms heartbeat[17386]: 2006/04/06_15:12:14 info: Status update for node mynode1: status active harc[30300]: 2006/04/06_15:12:14 info: Running /etc/ha.d/rc.d/status status heartbeat[17386]: 2006/04/06_15:12:16 WARN: Shutdown delayed until current resource activity finishes. heartbeat[17386]: 2006/04/06_15:12:30 WARN: standby message [me] from mynode1 ignored. Other side is in flux. heartbeat[17386]: 2006/04/06_15:12:36 info: Heartbeat shutdown in progress. (17386) heartbeat[17386]: 2006/04/06_15:12:36 info: Received shutdown notice from 'mynode1'. heartbeat[17386]: 2006/04/06_15:12:36 info: Resource takeover cancelled - shutdown in progress.
heartbeat[30312]: 2006/04/06_15:12:36 info: Giving up all HA resources.
ResourceManager[30322]: 2006/04/06_15:12:36 info: Releasing resource group: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddisk Filesystem::/dev/drbd0::/home::xfs::defaults,noatime
heartbeat[17386]: 2006/04/06_15:12:59 info: Link mynode1:eth1 dead.
heartbeat[17386]: 2006/04/06_15:13:01 info: Heartbeat restart on node mynode1
heartbeat[17386]: 2006/04/06_15:13:01 info: Link mynode1:eth1 up.
heartbeat[17386]: 2006/04/06_15:13:01 info: Status update for node mynode1: status init heartbeat[17386]: 2006/04/06_15:13:01 info: Status update for node mynode1: status up heartbeat[17386]: 2006/04/06_15:13:02 info: Status update for node mynode1: status active heartbeat[17386]: 2006/04/06_15:13:13 info: remote resource transition completed. ResourceManager[30322]: 2006/04/06_15:14:41 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs defaults,noatime stop
Filesystem[30507]:      2006/04/06_15:14:41 INFO: Filesystem Success
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running /etc/ha.d/resource.d/drbddisk stop ResourceManager[30322]: 2006/04/06_15:14:41 info: Running /etc/ha.d/resource.d/IPsrcaddr 10.0.0.1 stop
IPsrcaddr[30637]:       2006/04/06_15:14:41 INFO: IPsrcaddr Success
ResourceManager[30322]: 2006/04/06_15:14:41 info: Running /etc/ha.d/resource.d/IPaddr 10.0.0.1 stop
IPaddr[30791]:  2006/04/06_15:14:41 INFO: /sbin/route -n del -host 10.0.0.1
IPaddr[30791]: 2006/04/06_15:14:41 INFO: /sbin/ifconfig eth0:0 10.0.0.1 down
IPaddr[30791]:  2006/04/06_15:14:41 INFO: IP Address 10.0.0.1 released
IPaddr[30721]:  2006/04/06_15:14:41 INFO: IPaddr Success
heartbeat[30312]: 2006/04/06_15:14:41 info: All HA resources relinquished.
harc[30838]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status status harc[30848]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status status harc[30858]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/status status heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17391 with signal 15 heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process 17392 with signal 15 heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17393 with signal 15 heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBFIFO process 17389 with signal 15 heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process 17390 with signal 15 heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17389 exited. 5 remaining heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17392 exited. 4 remaining heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17393 exited. 3 remaining heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17390 exited. 2 remaining heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17391 exited. 1 remaining
heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat shutdown complete.
heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat restart triggered.
heartbeat[17386]: 2006/04/06_15:14:43 info: Restarting heartbeat.
heartbeat[17386]: 2006/04/06_15:14:43 info: Performing heartbeat restart exec. heartbeat[17386]: 2006/04/06_15:15:04 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[17386]: 2006/04/06_15:15:04 info: **************************
heartbeat[17386]: 2006/04/06_15:15:04 info: Configuration validated. Starting heartbeat 2.0.3
heartbeat[30868]: 2006/04/06_15:15:04 info: heartbeat: version 2.0.3
heartbeat[30868]: 2006/04/06_15:15:05 info: Heartbeat generation: 72
heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[30868]: 2006/04/06_15:15:05 info: Removing /var/run/heartbeat/rsctmp failed, recreating. heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound send socket to device: eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receive socket to device: eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port 694 interface eth1 to 192.168.0.71 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound send socket to device: eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receive socket to device: eth1 heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port 694 interface eth1 to 192.168.0.72 heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[30868]: 2006/04/06_15:15:05 info: Local status now set to: 'up'
heartbeat[30868]: 2006/04/06_15:15:05 info: Exiting write_hostcachedata process 30875 returned rc 0.
heartbeat[30868]: 2006/04/06_15:15:06 info: Link mynode1:eth1 up.
heartbeat[30868]: 2006/04/06_15:15:06 info: Status update for node mynode1: status active heartbeat[30868]: 2006/04/06_15:15:06 info: Exiting write_hostcachedata process 30877 returned rc 0. harc[30876]: 2006/04/06_15:15:06 info: Running /etc/ha.d/rc.d/status status heartbeat[30868]: 2006/04/06_15:15:07 info: Comm_now_up(): updating status to active heartbeat[30868]: 2006/04/06_15:15:07 info: Local status now set to: 'active' heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transition completed. heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transition completed. heartbeat[30868]: 2006/04/06_15:15:07 info: Local Resource acquisition completed. (none) heartbeat[30868]: 2006/04/06_15:15:07 info: Initial resource acquisition complete (T_RESOURCES(them))

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Reply via email to