> Hello list,
>
> I've got a problem with a 2-node active/passive setup, running with heartbeat
> 2.0.7 on Novell SuSE 10.1.
> The cluster is running in R1-style configuration, with a drbd-disk,
> IPaddress, IPsrcaddr, nmb, smb, nfs and one proprietary
> application as highly available services.
> As heartbeat-media the nodes use both eth0 (via switch) and eth1
> (direct-link) network-interfaces (unicast), and a direct serial
> connection. The netspeed is 1Gbit/s full duplex for both interfaces, the
> serial line works at 57600 baud.
> Furthermore, I've declared three IP addresses as ping nodes.
>
> Starting the cluster and running the resources is all fine, manual switching
> between nodes with hb_takeover and hb_standby
> works as expected, with all resources being started or stopped as
> they should.
>
> But the cluster shows some rather weird runtime behaviour. On a regular
> basis, both nodes report their eth0-network interfaces
> being down, therefore reporting their ping-group as dead. As this doesn't
> happen exactly synchronized, it sometimes provokes a
> resource failover, depending on which node declared itself dead in the first
> place.
> This state lasts for about three seconds, after that time both nodes fire
> their eth0-interfaces back up and resume working as if
> nothing happened. The network-interfaces use hardware from Marvell and are
> driven by the sk98lin-driver.
>
> I hope someone has an idea about this.
>
> Regards,
> Ronald

As there are no replies to this problem yet, I want to add some information
about the current configuration.

logfiles:

node 1 (the passive one):

<last entry more than a minute ago...>
heartbeat[3836]: 2007/06/18_06:51:05 info: node-1 wants to go standby [all]
heartbeat[30193]: 2007/06/18_06:51:05 info: Checking status of STONITH device [Suicide STONITH device]
heartbeat[3836]: 2007/06/18_06:51:05 info: Exiting STONITH-stat process 30193 returned rc 0.
heartbeat[3836]: 2007/06/18_06:51:05 info: standby: node-2 can take our all resources
heartbeat[30194]: 2007/06/18_06:51:05 info: give up all HA resources (standby).
ResourceManager[30204]: 2007/06/18_06:51:05 info: Releasing resource group: node-1 172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs nfsserver nmb smb myservice_v1
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/myservice_v1 stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/smb stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/nmb stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/nfsserver stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/clusterdisk reiserfs stop
Filesystem[30377]: 2007/06/18_06:51:05 INFO: Running stop for /dev/drbd0 on /mnt/clusterdisk
Filesystem[30313]: 2007/06/18_06:51:05 INFO: Filesystem Success
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/drbddisk BR stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/IPsrcaddr 172.16.17.161 stop
IPsrcaddr[30476]: 2007/06/18_06:51:05 INFO: No preferred source address defined, nothing to stop
IPsrcaddr[30439]: 2007/06/18_06:51:05 INFO: IPsrcaddr Success
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/IPaddr 172.16.17.161/24/eth0:0 stop
IPaddr[30523]: 2007/06/18_06:51:05 INFO: IPaddr Success
heartbeat[30194]: 2007/06/18_06:51:05 info: all HA resource release completed (standby).
heartbeat[3836]: 2007/06/18_06:51:05 info: Local standby process completed [all].
heartbeat[3836]: 2007/06/18_06:51:07 WARN: 1 lost packet(s) for [node-2] [1046581:1046583]
heartbeat[3836]: 2007/06/18_06:51:07 info: remote resource transition completed.
heartbeat[3836]: 2007/06/18_06:51:07 info: No pkts missing from node-2!
heartbeat[3836]: 2007/06/18_06:51:07 info: Other node completed standby takeover of all resources.
heartbeat[3836]: 2007/06/18_06:51:37 info: Link node-2:eth0 up.
heartbeat[3836]: 2007/06/18_06:51:38 info: Link group1:group1 up.
heartbeat[3836]: 2007/06/18_06:51:38 WARN: Late heartbeat: Node group1: interval 43000 ms
heartbeat[3836]: 2007/06/18_06:51:38 info: Status update for node group1: status ping
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-2:eth0 dead.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-2:eth0 up.
<next entry more than a minute away...>

node 2 (the active one):

<last entry more than a minute ago...>
heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.
heartbeat[3836]: 2007/06/18_06:51:05 info: node-1 wants to go standby [all]
heartbeat[3836]: 2007/06/18_06:51:06 info: standby: acquire [all] resources from node-1
heartbeat[30758]: 2007/06/18_06:51:06 info: acquire all HA resources (standby).
ResourceManager[30768]: 2007/06/18_06:51:06 info: Acquiring resource group: node-1 172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs nfsserver nmb smb myservice_v1
IPaddr[30791]: 2007/06/18_06:51:06 INFO: IPaddr Running OK
IPsrcaddr[30899]: 2007/06/18_06:51:06 INFO: IPsrcaddr Running OK
Filesystem[31046]: 2007/06/18_06:51:07 INFO: Running status for /dev/drbd0 on /mnt/clusterdisk
Filesystem[31046]: 2007/06/18_06:51:07 INFO: /mnt/clusterdisk is mounted (running)
Filesystem[30982]: 2007/06/18_06:51:07 INFO: Filesystem Running OK
heartbeat[30758]: 2007/06/18_06:51:07 info: all HA resource acquisition completed (standby).
heartbeat[3836]: 2007/06/18_06:51:07 info: Standby resource acquisition done [all].
heartbeat[3836]: 2007/06/18_06:51:07 info: remote resource transition completed.
heartbeat[3836]: 2007/06/18_06:51:38 info: Link node-1:eth0 up.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 dead.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 up.
<next entry more than a minute away...>

I've omitted the log entries apart from those related to my problem, as I
don't see any connection between them.

In addition to the network problem, I was wondering if there is a way to
prevent failover of inactive resources, as this only clutters up the logfiles.
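
To make the link flapping easier to spot in a longer ha-log, the "Link ... dead" and "Link ... up" entries can be paired up and timed. The following is only a minimal sketch in Python, assuming the log line format shown in the excerpts above; the function name link_outages and the sample list are made up for illustration and are not part of heartbeat.

```python
import re
from datetime import datetime

# Matches heartbeat link events in the format seen in the excerpts, e.g.
#   heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.
LINK_RE = re.compile(
    r"heartbeat\[\d+\]: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) info: "
    r"Link (\S+) (dead|up)\."
)


def link_outages(lines):
    """Pair each 'dead' event with the next 'up' event for the same link.

    Returns a list of (link, went_dead, came_up, seconds) tuples.
    'up' events without a preceding 'dead' event are ignored.
    """
    down_since = {}
    outages = []
    for line in lines:
        match = LINK_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S")
        link, state = match.group(2), match.group(3)
        if state == "dead":
            # Remember only the first 'dead' report of an outage.
            down_since.setdefault(link, stamp)
        elif link in down_since:
            start = down_since.pop(link)
            outages.append((link, start, stamp, (stamp - start).total_seconds()))
    return outages


# Link events copied from the node 2 excerpt above.
sample = [
    "heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.",
    "heartbeat[3836]: 2007/06/18_06:51:38 info: Link node-1:eth0 up.",
    "heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 dead.",
    "heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 up.",
]

for link, start, end, secs in link_outages(sample):
    print(f"{link}: dead {start:%H:%M:%S}, up {end:%H:%M:%S} ({secs:.0f} s)")
```

To run it against a real log, pass an open file instead of the sample list, for example link_outages(open("/var/log/ha-log")). The timestamps only have one-second resolution, so an outage shorter than a second shows up as 0 s.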

Finally my haresources

node-1 IPaddr::172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 \
       drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs \
       nfsserver nmb smb myservice_v1

and my ha.cf

debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 500ms
deadtime 30
warntime 15
initdead 90
udpport 694
baud 57600
serial /dev/ttyS0
ucast eth1 10.0.0.2    on node 1 and
ucast eth1 10.0.0.1    on node 2
auto_failback off
stonith_host * suicide node-1 node-2
watchdog /dev/watchdog
node node-1 node-2
ping_group group1 172.16.17.240 172.16.17.155 172.16.17.180
apiauth ipfail gid=haclient uid=hacluster
respawn hacluster /usr/lib/heartbeat/ipfail
deadping 5

Any help would be greatly appreciated.

Regards,
Ronald
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
