Hi, Sunday the 5th of April, i've got a strange surprise. 7 of ours 2 master/slave nodes heartbeat clusters on 2 differents datacenters have done the same "dance" at differents times of the day.
We used : - Different DELL servers 1850 SC1425 2950 - GNU/Linux Debian Etch kernel 2.6.18-6-686 - heartbeat 1.2.5-3 /etc/ha.d/ha.cf : logfacility daemon ucast eth0 10.10.0.92 ucast eth1 10.2.0.2 ucast eth2 192.168.0.92 auto_failback off watchdog /dev/watchdog node serveur1 node serveur2 ping 192.168.0.100 192.168.0.101 192.168.0.102 192.168.0.200 respawn hacluster /usr/lib/heartbeat/ipfail apiauth cl_status uid=nobody,root /etc/ha.d/haresources : serveur2 ressource1 ressource2 ... The systems were not heavily loaded and i have looked in old logs, i have no "late heartbeat" for a long time, so no problem like "How to tune Heartbeat on heavily loaded system to avoid split-brain?". The more strange is that heartbeat says it own system is dead. The link on eth1 is a crossover cable so no physical device between the 2 servers. Here is a typical log of this problem : in this case the master was serveur1. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur1: is dead Apr 5 12:48:56 serveur1 heartbeat[9884]: ERROR: No local heartbeat. Forcing restart. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Heartbeat shutdown in progress. (9884) Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node serveur2: is dead Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.100: is dead Apr 5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node serveur2 now has status dead Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - shutdown in progress. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.101: is dead Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - shutdown in progress. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.102: is dead Apr 5 12:48:56 serveur1 heartbeat[23266]: info: Giving up all HA resources. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - shutdown in progress. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: node 192.168.0.200: is dead Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Resource takeover cancelled - shutdown in progress. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link 192.168.0.100:192.168.0.100 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link 192.168.0.101:192.168.0.101 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link 192.168.0.102:192.168.0.102 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link 192.168.0.200:192.168.0.200 dead. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node serveur1: interval 41770 ms Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Cluster node serveur2 returning after partition. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Deadtime value may be too small. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: See documentation for information on tuning deadtime. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth0 up. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node serveur2: interval 41890 ms Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node serveur2: status active Apr 5 12:48:56 serveur1 heartbeat[9884]: debug: StartNextRemoteRscReq(): child count 1 Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth1 up. Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link serveur2:eth2 up. Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 192.168.0.200: interval 41770 ms Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Status update for node 192.168.0.200: status ping Apr 5 12:48:56 serveur1 heartbeat[9884]: info: Link 192.168.0.200:192.168.0.200 up. Apr 5 12:48:56 serveur1 ipfail[9911]: info: NS: We are dead. :< Apr 5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node 192.168.0.100 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 heartbeat[9884]: WARN: Late heartbeat: Node 192.168.0.102: interval 41760 ms Apr 5 12:48:56 serveur1 heartbeat: info: Releasing resource group: serveur1 ... Apr 5 12:48:56 serveur1 ipfail[9911]: info: NS: We are still alive! Apr 5 12:48:56 serveur1 ipfail[9911]: info: Status update: Node 192.168.0.200 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.102! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.101! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.100! Apr 5 12:48:56 serveur1 ipfail[9911]: info: NS: We are still alive! Apr 5 12:48:56 serveur1 ipfail[9911]: info: Link Status update: Link serveur2/eth0 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.102! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.101! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.100! Apr 5 12:48:56 serveur1 ipfail[9911]: info: Asking other side for ping node count. Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Message [num_ping] sent. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Checking remote count of ping nodes. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Link Status update: Link serveur2/eth1 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.102! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.101! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.100! Apr 5 12:48:56 serveur1 ipfail[9911]: info: Asking other side for ping node count. Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Message [num_ping] sent. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Checking remote count of ping nodes. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Link Status update: Link serveur2/eth2 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.102! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.101! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.100! Apr 5 12:48:56 serveur1 ipfail[9911]: info: Asking other side for ping node count. Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Message [num_ping] sent. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Checking remote count of ping nodes. Apr 5 12:48:56 serveur1 ipfail[9911]: info: Link Status update: Link 192.168.0.100/192.168.0.100 now has status dead Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.200! Apr 5 12:48:56 serveur1 ipfail[9911]: debug: Found ping node 192.168.0.102! Thanks for your help or informations. Regards, -- -- Julien WICQUART NEWTECH INTERACTIVE Responsable Exploitation 3 ch du Pigeonnier de la Cépière Tel: +33 5 61 43 14 80 31081 TOULOUSE Gsm: +33 6 88 17 16 30 http://www.newtech.fr/ Fax: +33 5 61 43 20 11 [email protected] _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
