Message: 3
Date: Wed, 8 Apr 2009 17:55:34 +0200
From: Dejan Muhamedagic <[email protected]>
Subject: Re: [Linux-HA] Stranges "dead link" and "late heartbeat" on
sunny Sunday.
To: General Linux-HA mailing list <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii
Hi,
On Tue, Apr 07, 2009 at 04:25:03PM +0200, julien WICQUART wrote:
Hi,
Sunday the 5th of April, i've got a strange surprise.
7 of ours 2 master/slave nodes heartbeat clusters on 2
differents datacenters have done the same "dance" at differents
times of the day.
We used :
- Different DELL servers 1850 SC1425 2950
- GNU/Linux Debian Etch kernel 2.6.18-6-686
- heartbeat 1.2.5-3
/etc/ha.d/ha.cf :
logfacility daemon
ucast eth0 10.10.0.92
ucast eth1 10.2.0.2
ucast eth2 192.168.0.92
auto_failback off
watchdog /dev/watchdog
node serveur1
node serveur2
ping 192.168.0.100 192.168.0.101 192.168.0.102 192.168.0.200
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth cl_status uid=nobody,root
/etc/ha.d/haresources :
serveur2 ressource1 ressource2 ...
The systems were not heavily loaded and i have looked in old
logs, i have no "late heartbeat" for a long time, so no problem
like "How to tune Heartbeat on heavily loaded system to avoid
split-brain?".
The more strange is that heartbeat says it own system is dead.
The link on eth1 is a crossover cable so no physical device between the 2
servers.
Here is a typical log of this problem :
in this case the master was serveur1.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur1: is dead
Apr 5 12:48:56 serveur1 heartbeat[9884]: ERROR: No local heartbeat. Forcing
restart.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Heartbeat shutdown in progress.
(9884)
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur2: is dead
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.100: is dead
Apr 5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node serveur2 now
has status dead
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled -
shutdown in progress.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.101: is dead
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled -
shutdown in progress.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.102: is dead
Apr 5 12:48:56 serveur1 heartbeat[23266]: info: Giving up all HA resources.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled -
shutdown in progress.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.200: is dead
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled -
shutdown in progress.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link
192.168.0.100:192.168.0.100 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link
192.168.0.101:192.168.0.101 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link
192.168.0.102:192.168.0.102 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link
192.168.0.200:192.168.0.200 dead.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node serveur1:
interval 41770 ms
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Cluster node serveur2 returning
after partition.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Deadtime value may be too small.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: See documentation for
information on tuning deadtime.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 up.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node serveur2:
interval 41890 ms
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node
serveur2: status active
Apr 5 12:48:56 serveur1 heartbeat[9884]: debug: StartNextRemoteRscReq(): child
count 1
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 up.
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 up.
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node
192.168.0.200: interval 41770 ms
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node
192.168.0.200: status ping
Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link
192.168.0.200:192.168.0.200 up.
Apr 5 12:48:56 serveur1 ipfail[9911]: info: NS: We are dead. :<
Apr 5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node 192.168.0.100
now has status dead
Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200!
Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node
192.168.0.102: interval 41760 ms
The heartbeat was late for about 42 seconds.
According to the timestamps, looks like the whole machine was
"hanging". At the same time, heartbeat realized that there was no
heartbeat (local and from other nodes) and that there's again
heartbeat, albeit late. Looks like a hardware problem. SCSI
cables? Do you see any other interesting messages in system logs?
There used to be a kernel (I think on RedHat) which would
sometimes forget to schedule processes.
Thanks,
Dejan
Hi,
the servers don't seem to hang. I've got log 1 second before and after
in other log files.
The more strange is that in one day, the 5th of April, seven of ours
clusters do the same thing. We didn't have this kind of problem before
and never more since this day.
There is 5 clusters in 1 datacenter and 2 others in an other datacenter.
The only thing similar to these servers are ntp servers and DNS servers.
So i have looked at ntp log, i don't see anything strange.
Thanks,
--
Julien WICQUART NEWTECH INTERACTIVE
Responsable Exploitation 3 ch du Pigeonnier de la Cépière
Tel: +33 5 61 43 14 80 31081 TOULOUSE
Gsm: +33 6 88 17 16 30 http://www.newtech.fr/
Fax: +33 5 61 43 20 11
[email protected]
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems