[Linux-ha-dev] Late heartbeat / split brain after restart node

Markus Marquardt Thu, 06 Apr 2006 09:08:26 -0700

Hello,

i've set up a two node cluster on FC4 with latest updates. Whenrestarting the active node i get some strange effects when it's comingup again. In the ha-log of the rebooted node there are some messages


heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1:

heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_twrapped around (uptime).

(see logs)

Later on heartbeat restarts two times with a split brain situation andthen is everything fine again.

So what could be wrong here? I can't see that the machine is under heavyload when heartbeat is coming up. It's a dual xeon 3.0 ghz dell server.


Any suggestions?

Regards,
Markus

========== File ha.cf

logfacility local0
logfile /var/log/ha-log
debugfile /var/log/ha-debug

auto_failback off
crm false
deadping 20
deadtime 20
initdead 60
keepalive 2
warntime 10

udpport 694
ucast eth1 192.168.0.71
ucast eth1 192.168.0.72

node mynode1 mynode2

traditional_compression false

==========

192.168.0.71 is internal IP mynode1
192.168.0.72 is internal IP mynode2 (crossover cable)


========== ha-log on mynode1

-----> Reboot...

heartbeat[8812]: 2006/04/06_15:01:57 info: Heartbeat shutdown inprogress. (8812)heartbeat[30362]: 2006/04/06_15:01:57 info: Giving up all HA resources.ResourceManager[30372]: 2006/04/06_15:01:57 info: Releasing resourcegroup: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddiskFilesystem::/dev/drbd0::/home::xfs::defaults,noatime

...
heartbeat[8812]: 2006/04/06_15:04:35 info: Heartbeat shutdown complete.

-----> So far, so good. Resources are now running on node2,
-----> heartbeat is coming up on mynode1 again...

heartbeat[2022]: 2006/04/06_15:07:35 info: **************************

heartbeat[2022]: 2006/04/06_15:07:35 info: Configuration validated.Starting heartbeat 2.0.3

heartbeat[2023]: 2006/04/06_15:07:35 info: heartbeat: version 2.0.3
heartbeat[2023]: 2006/04/06_15:07:35 info: Heartbeat generation: 97

heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler:Added signal manual handlerheartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_TriggerHandler:Added signal manual handlerheartbeat[2023]: 2006/04/06_15:07:35 info: Removing/var/run/heartbeat/rsctmp failed, recreating.heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socketpriority set to IPTOS_LOWDELAY on eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound sendsocketto device: eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receivesocket todevice: eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port694 interface eth1 to 192.168.0.71heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: write socketpriority set to IPTOS_LOWDELAY on eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound sendsocket to device: eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: bound receivesocket to device: eth1heartbeat[2023]: 2006/04/06_15:07:35 info: glib: ucast: started on port694 interface eth1 to 192.168.0.72heartbeat[2023]: 2006/04/06_15:07:35 info: G_main_add_SignalHandler:Added signal handler for signal 17

heartbeat[2023]: 2006/04/06_15:07:35 info: Local status now set to: 'up'

heartbeat[2023]: 2006/04/06_15:07:35 info: Exiting write_hostcachedataprocess 2038 returned rc 0.

heartbeat[2023]: 2006/04/06_15:07:37 info: Link mynode2:eth1 up.

heartbeat[2023]: 2006/04/06_15:07:37 info: Status update for nodemynode2: status activeheartbeat[2023]: 2006/04/06_15:07:37 info: Exiting write_hostcachedataprocess 2107 returned rc 0.harc[2106]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[2023]: 2006/04/06_15:07:38 info: Comm_now_up(): updatingstatus to active

heartbeat[2023]: 2006/04/06_15:07:38 info: Local status now set to: 'active'

heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transitioncompleted.heartbeat[2023]: 2006/04/06_15:07:38 info: remote resource transitioncompleted.heartbeat[2023]: 2006/04/06_15:07:38 info: Local Resource acquisitioncompleted. (none)heartbeat[2023]: 2006/04/06_15:07:38 info: Initial resource acquisitioncomplete (T_RESOURCES(them))


-----> Now it's going strange...

heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:Dispatch function for send local status was delayed 39430 ms beforebeing called (GSource: 0x81486b8)heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:started at 4294967295 should have started at 4294963352heartbeat[2023]: 2006/04/06_15:11:31 WARN: Late heartbeat: Node mynode1:interval 41430 ms

heartbeat[2023]: 2006/04/06_15:11:31 WARN: node mynode2: is dead
heartbeat[2023]: 2006/04/06_15:11:31 WARN: No STONITH device configured.
heartbeat[2023]: 2006/04/06_15:11:31 WARN: Shared disks are not protected.

heartbeat[2023]: 2006/04/06_15:11:31 info: Resources being acquired frommynode2.

heartbeat[2023]: 2006/04/06_15:11:31 info: Link mynode2:eth1 dead.

heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:Dispatch function for check for signals was delayed 39430 ms beforebeing called (GSource: 0x81489c0)heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:started at 4294967295 should have started at 4294963352heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:Dispatch function for update msgfree count was delayed 40930 ms beforebeing called (GSource: 0x8148a90)heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:started at 4294967295 should have started at 4294963202heartbeat[2023]: 2006/04/06_15:11:31 WARN: Gmain_timeout_dispatch:Dispatch function for client audit was delayed 33320 ms before beingcalled (GSource: 0x81488f0)heartbeat[2023]: 2006/04/06_15:11:31 info: Gmain_timeout_dispatch:started at 4294967295 should have started at 4294963963harc[2477]: 2006/04/06_15:11:31 info: Running /etc/ha.d/rc.d/statusstatusmach_down[2498]: 2006/04/06_15:11:31 info:/usr/lib/heartbeat/mach_down:

nice_failback: foreign resources acquired

mach_down[2498]: 2006/04/06_15:11:31 info: mach_down takeovercomplete for node mynode2.

heartbeat[2023]: 2006/04/06_15:11:31 info: mach_down takeover complete.
...
-----> After this heartbeat is running into a split brain situation with

-----> mynode2 holding the resources and mynode1 trying to get the-----> resources:

...

heartbeat[2023]: 2006/04/06_15:11:33 CRIT: Cluster node mynode2returning after partition.heartbeat[2023]: 2006/04/06_15:11:33 info: For information on clusterpartitions, See URL: http://linux-ha.org/SplitBrain

heartbeat[2023]: 2006/04/06_15:11:33 WARN: Deadtime value may be too small.
...
< Now everything is stopped on mynode1 >
...
heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!

heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreignresources!ResourceManager[2655]: 2006/04/06_15:11:54 ERROR: Return code 1 from/etc/ha.d/resource.d/sshd_cluster

heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own our resources!

heartbeat[2023]: 2006/04/06_15:11:54 ERROR: Both machines own foreignresources!

...

-----> Strange messages again...

heartbeat[2023]: 2006/04/06_15:12:12 info: time_longclock: clock_twrapped around (uptime).heartbeat[2033]: 2006/04/06_15:12:13 info: time_longclock: clock_twrapped around (uptime).heartbeat[2037]: 2006/04/06_15:12:13 info: time_longclock: clock_twrapped around (uptime).heartbeat[2034]: 2006/04/06_15:12:14 info: time_longclock: clock_twrapped around (uptime).heartbeat[2023]: 2006/04/06_15:12:14 info: Heartbeat shutdown inprogress. (2023)heartbeat[2036]: 2006/04/06_15:12:14 info: time_longclock: clock_twrapped around (uptime).

heartbeat[4962]: 2006/04/06_15:12:14 info: Giving up all HA resources.
...


========== ha-log on mynode2

-----> Takeover when rebooting mynode1

mach_down[29198]: 2006/04/06_15:04:49 info: mach_down takeovercomplete for node mynode11.

heartbeat[17386]: 2006/04/06_15:04:49 info: mach_down takeover complete.
heartbeat[17386]: 2006/04/06_15:04:56 WARN: node mynode1: is dead

heartbeat[17386]: 2006/04/06_15:04:56 info: Dead node mynode1 gave upresources.

heartbeat[17386]: 2006/04/06_15:04:56 info: Link mynode1:eth1 dead.

heartbeat[17386]: 2006/04/06_15:07:37 info: Heartbeat restart on nodemynode1

heartbeat[17386]: 2006/04/06_15:07:37 info: Link mynode1:eth1 up.

heartbeat[17386]: 2006/04/06_15:07:37 info: Status update for nodemynode1: status initheartbeat[17386]: 2006/04/06_15:07:37 info: Status update for nodemynode1: status upharc[29744]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/statusstatusharc[29754]: 2006/04/06_15:07:37 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[17386]: 2006/04/06_15:07:38 info: Status update for nodemynode1: status activeharc[29764]: 2006/04/06_15:07:38 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[17386]: 2006/04/06_15:07:39 info: remote resource transitioncompleted.heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own ourresources!heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own ourresources!heartbeat[17386]: 2006/04/06_15:11:31 ERROR: Both machines own foreignresources!

heartbeat[17386]: 2006/04/06_15:11:53 WARN: node mynode1: is dead
heartbeat[17386]: 2006/04/06_15:11:53 WARN: No STONITH device configured.
heartbeat[17386]: 2006/04/06_15:11:53 WARN: Shared disks are not protected.

heartbeat[17386]: 2006/04/06_15:11:53 info: Resources being acquiredfrom mynode1.harc[29787]: 2006/04/06_15:11:53 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[29788]: 2006/04/06_15:11:53 info: No local resources[/usr/lib/heartbeat/ResourceManager listkeys mynode2] to acquire.mach_down[29807]: 2006/04/06_15:11:53 info: Taking over resourcegroup 10.0.0.1ResourceManager[29827]: 2006/04/06_15:11:53 info: Acquiring resourcegroup: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddiskFilesystem::/dev/drbd0::/home::xfs::defaults,noatime

IPaddr[29851]:  2006/04/06_15:11:54 INFO: IPaddr Running OK
IPsrcaddr[29958]:       2006/04/06_15:11:54 INFO: IPsrcaddr Running OK

ResourceManager[29827]: 2006/04/06_15:11:54 info: Running/etc/ha.d/resource.d/drbddisk start

Filesystem[30125]:      2006/04/06_15:11:54 INFO: /home is mounted (running)
Filesystem[30061]:      2006/04/06_15:11:54 INFO: Filesystem Running OK
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired

mach_down[29807]: 2006/04/06_15:11:54 info: mach_down takeovercomplete for node mynode1.

heartbeat[17386]: 2006/04/06_15:11:54 info: mach_down takeover complete.

heartbeat[17386]: 2006/04/06_15:12:14 CRIT: Cluster node mynode1returning after partition.heartbeat[17386]: 2006/04/06_15:12:14 info: For information on clusterpartitions, See URL: http://linux-ha.org/SplitBrain

heartbeat[17386]: 2006/04/06_15:12:14 WARN: Deadtime value may be too small.

heartbeat[17386]: 2006/04/06_15:12:14 info: See FAQ for information ontuning deadtime.heartbeat[17386]: 2006/04/06_15:12:14 info: URL:http://linux-ha.org/FAQ#heavy_loadheartbeat[17386]: 2006/04/06_15:12:14 WARN: Late heartbeat: Nodemynode1: interval 41310 msheartbeat[17386]: 2006/04/06_15:12:14 info: Status update for nodemynode1: status activeharc[30300]: 2006/04/06_15:12:14 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[17386]: 2006/04/06_15:12:16 WARN: Shutdown delayed untilcurrent resource activity finishes.heartbeat[17386]: 2006/04/06_15:12:30 WARN: standby message [me] frommynode1 ignored. Other side is in flux.heartbeat[17386]: 2006/04/06_15:12:36 info: Heartbeat shutdown inprogress. (17386)heartbeat[17386]: 2006/04/06_15:12:36 info: Received shutdown noticefrom 'mynode1'.heartbeat[17386]: 2006/04/06_15:12:36 info: Resource takeover cancelled- shutdown in progress.

heartbeat[30312]: 2006/04/06_15:12:36 info: Giving up all HA resources.

ResourceManager[30322]: 2006/04/06_15:12:36 info: Releasing resourcegroup: mynode1 10.0.0.1 IPsrcaddr::10.0.0.1 drbddiskFilesystem::/dev/drbd0::/home::xfs::defaults,noatime

heartbeat[17386]: 2006/04/06_15:12:59 info: Link mynode1:eth1 dead.

heartbeat[17386]: 2006/04/06_15:13:01 info: Heartbeat restart on nodemynode1

heartbeat[17386]: 2006/04/06_15:13:01 info: Link mynode1:eth1 up.

heartbeat[17386]: 2006/04/06_15:13:01 info: Status update for nodemynode1: status initheartbeat[17386]: 2006/04/06_15:13:01 info: Status update for nodemynode1: status upheartbeat[17386]: 2006/04/06_15:13:02 info: Status update for nodemynode1: status activeheartbeat[17386]: 2006/04/06_15:13:13 info: remote resource transitioncompleted.ResourceManager[30322]: 2006/04/06_15:14:41 info: Running/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs defaults,noatime stop

Filesystem[30507]:      2006/04/06_15:14:41 INFO: Filesystem Success

ResourceManager[30322]: 2006/04/06_15:14:41 info: Running/etc/ha.d/resource.d/drbddisk stopResourceManager[30322]: 2006/04/06_15:14:41 info: Running/etc/ha.d/resource.d/IPsrcaddr 10.0.0.1 stop

IPsrcaddr[30637]:       2006/04/06_15:14:41 INFO: IPsrcaddr Success

ResourceManager[30322]: 2006/04/06_15:14:41 info: Running/etc/ha.d/resource.d/IPaddr 10.0.0.1 stop

IPaddr[30791]:  2006/04/06_15:14:41 INFO: /sbin/route -n del -host 10.0.0.1

IPaddr[30791]: 2006/04/06_15:14:41 INFO: /sbin/ifconfig eth0:0 10.0.0.1down

IPaddr[30791]:  2006/04/06_15:14:41 INFO: IP Address 10.0.0.1 released
IPaddr[30721]:  2006/04/06_15:14:41 INFO: IPaddr Success
heartbeat[30312]: 2006/04/06_15:14:41 info: All HA resources relinquished.

harc[30838]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/statusstatusharc[30848]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/statusstatusharc[30858]: 2006/04/06_15:14:41 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17391with signal 15heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process17392 with signal 15heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBREAD process 17393with signal 15heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBFIFO process 17389with signal 15heartbeat[17386]: 2006/04/06_15:14:43 info: killing HBWRITE process17390 with signal 15heartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17389 exited. 5remainingheartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17392 exited. 4remainingheartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17393 exited. 3remainingheartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17390 exited. 2remainingheartbeat[17386]: 2006/04/06_15:14:43 info: Core process 17391 exited. 1remaining

heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat shutdown complete.
heartbeat[17386]: 2006/04/06_15:14:43 info: Heartbeat restart triggered.
heartbeat[17386]: 2006/04/06_15:14:43 info: Restarting heartbeat.

heartbeat[17386]: 2006/04/06_15:14:43 info: Performing heartbeat restartexec.heartbeat[17386]: 2006/04/06_15:15:04 WARN: Logging daemon is disabled--enabling logging daemon is recommended

heartbeat[17386]: 2006/04/06_15:15:04 info: **************************

heartbeat[17386]: 2006/04/06_15:15:04 info: Configuration validated.Starting heartbeat 2.0.3

heartbeat[30868]: 2006/04/06_15:15:04 info: heartbeat: version 2.0.3
heartbeat[30868]: 2006/04/06_15:15:05 info: Heartbeat generation: 72

heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler:Added signal manual handlerheartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_TriggerHandler:Added signal manual handlerheartbeat[30868]: 2006/04/06_15:15:05 info: Removing/var/run/heartbeat/rsctmp failed, recreating.heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socketpriority set to IPTOS_LOWDELAY on eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound sendsocket to device: eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receivesocket to device: eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port694 interface eth1 to 192.168.0.71heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: write socketpriority set to IPTOS_LOWDELAY on eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound sendsocket to device: eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: bound receivesocket to device: eth1heartbeat[30868]: 2006/04/06_15:15:05 info: glib: ucast: started on port694 interface eth1 to 192.168.0.72heartbeat[30868]: 2006/04/06_15:15:05 info: G_main_add_SignalHandler:Added signal handler for signal 17

heartbeat[30868]: 2006/04/06_15:15:05 info: Local status now set to: 'up'

heartbeat[30868]: 2006/04/06_15:15:05 info: Exiting write_hostcachedataprocess 30875 returned rc 0.

heartbeat[30868]: 2006/04/06_15:15:06 info: Link mynode1:eth1 up.

heartbeat[30868]: 2006/04/06_15:15:06 info: Status update for nodemynode1: status activeheartbeat[30868]: 2006/04/06_15:15:06 info: Exiting write_hostcachedataprocess 30877 returned rc 0.harc[30876]: 2006/04/06_15:15:06 info: Running /etc/ha.d/rc.d/statusstatusheartbeat[30868]: 2006/04/06_15:15:07 info: Comm_now_up(): updatingstatus to activeheartbeat[30868]: 2006/04/06_15:15:07 info: Local status now set to:'active'heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transitioncompleted.heartbeat[30868]: 2006/04/06_15:15:07 info: remote resource transitioncompleted.heartbeat[30868]: 2006/04/06_15:15:07 info: Local Resource acquisitioncompleted. (none)heartbeat[30868]: 2006/04/06_15:15:07 info: Initial resource acquisitioncomplete (T_RESOURCES(them))


_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

[Linux-ha-dev] Late heartbeat / split brain after restart node

Reply via email to