Hi,

On Tue, Apr 07, 2009 at 04:25:03PM +0200, julien WICQUART wrote:
> Hi,
> 
> Sunday the 5th of April, i've got a strange surprise.
> 7 of ours 2 master/slave nodes heartbeat clusters on 2
> differents datacenters have done the same "dance" at differents
> times of the day.
> 
> We used : 
> - Different DELL servers 1850 SC1425 2950
> - GNU/Linux Debian Etch kernel 2.6.18-6-686
> - heartbeat 1.2.5-3
> 
> /etc/ha.d/ha.cf :
> logfacility daemon
> ucast eth0 10.10.0.92
> ucast eth1 10.2.0.2
> ucast eth2 192.168.0.92
> auto_failback off
> watchdog /dev/watchdog
> node serveur1
> node serveur2
> ping 192.168.0.100 192.168.0.101 192.168.0.102 192.168.0.200
> respawn hacluster /usr/lib/heartbeat/ipfail
> apiauth cl_status uid=nobody,root
> 
> /etc/ha.d/haresources :
> serveur2 ressource1 ressource2 ...
> 
> The systems were not heavily loaded and i have looked in old
> logs, i have no "late heartbeat" for a long time, so no problem
> like "How to tune Heartbeat on heavily loaded system to avoid
> split-brain?".
> 
> 
> The more strange is that heartbeat says it own system is dead.
> The link on eth1 is a crossover cable so no physical device between the 2 
> servers.
> 
> 
> Here is a typical log of this problem :
> 
> in this case the master was serveur1.
> 
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur1: is dead
> Apr  5 12:48:56 serveur1 heartbeat[9884]: ERROR: No local heartbeat. Forcing 
> restart.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Heartbeat shutdown in 
> progress. (9884)
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur2: is dead
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.100: is dead
> Apr  5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node serveur2 now 
> has status dead
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - 
> shutdown in progress.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.101: is dead
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - 
> shutdown in progress.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.102: is dead
> Apr  5 12:48:56 serveur1 heartbeat[23266]: info: Giving up all HA resources.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - 
> shutdown in progress.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.200: is dead
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - 
> shutdown in progress.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link 
> 192.168.0.100:192.168.0.100 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link 
> 192.168.0.101:192.168.0.101 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link 
> 192.168.0.102:192.168.0.102 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link 
> 192.168.0.200:192.168.0.200 dead.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 
> serveur1: interval 41770 ms
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Cluster node serveur2 
> returning after partition.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Deadtime value may be too 
> small.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: See documentation for 
> information on tuning deadtime.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 up.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 
> serveur2: interval 41890 ms
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node 
> serveur2: status active
> Apr  5 12:48:56 serveur1 heartbeat[9884]: debug: StartNextRemoteRscReq(): 
> child count 1
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 up.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 up.
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 
> 192.168.0.200: interval 41770 ms
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node 
> 192.168.0.200: status ping
> Apr  5 12:48:56 serveur1 heartbeat[9884]: info: Link 
> 192.168.0.200:192.168.0.200 up.
> Apr  5 12:48:56 serveur1 ipfail[9911]: info: NS: We are dead. :<
> Apr  5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node 
> 192.168.0.100 now has status dead
> Apr  5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200!
> Apr  5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 
> 192.168.0.102: interval 41760 ms

The heartbeat was late for about 42 seconds.

According to the timestamps, looks like the whole machine was
"hanging". At the same time, heartbeat realized that there was no
heartbeat (local and from other nodes) and that there's again
heartbeat, albeit late. Looks like a hardware problem. SCSI
cables? Do you see any other interesting messages in system logs?
There used to be a kernel (I think on RedHat) which would
sometimes forget to schedule processes.

Thanks,

Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to