> Hello list,
>
> I've got a problem with a 2-node active/passive setup running heartbeat
> 2.0.7 on Novell SuSE 10.1.
> The cluster runs in an R1-style configuration, with a drbd disk,
> IPaddress, IPsrcaddr, nmb, smb, nfs and one proprietary application as
> highly available services.
> As heartbeat media the nodes use both the eth0 (via switch) and eth1
> (direct link) network interfaces (unicast), and a direct serial
> connection. The network speed is 1 Gbit/s full duplex for both interfaces;
> the serial line runs at 57600 baud.
> Furthermore, I've declared three IP addresses as ping nodes.
>
> Starting the cluster and running the resources works fine. Manual switching
> between nodes with hb_takeover and hb_standby works as expected, with all
> resources being started or stopped as they should.
>
> But the cluster shows some rather odd runtime behaviour. Regularly, both
> nodes report their eth0 network interfaces as down and therefore report
> their ping group as dead. As this doesn't happen at exactly the same time
> on both nodes, it sometimes triggers a resource failover, depending on
> which node declared itself dead first.
> This state lasts for about three seconds; after that, both nodes bring
> their eth0 interfaces back up and resume working as if nothing had
> happened. The network interfaces use Marvell hardware and are driven by
> the sk98lin driver.
>
> I hope someone has an idea about this.
>
> Regards,
> Ronald

As there are no replies to this problem yet, I want to add some information 
about the current configuration.

logfiles:
node 1 (the passive one):
<last entry more than a minute ago...>
heartbeat[3836]: 2007/06/18_06:51:05 info: node-1 wants to go standby [all]
heartbeat[30193]: 2007/06/18_06:51:05 info: Checking status of STONITH device [Suicide STONITH device]
heartbeat[3836]: 2007/06/18_06:51:05 info: Exiting STONITH-stat process 30193 returned rc 0.
heartbeat[3836]: 2007/06/18_06:51:05 info: standby: node-2 can take our all resources
heartbeat[30194]: 2007/06/18_06:51:05 info: give up all HA resources (standby).
ResourceManager[30204]: 2007/06/18_06:51:05 info: Releasing resource group: node-1 172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs nfsserver nmb smb myservice_v1
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/myservice_v1 stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/smb  stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/nmb  stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/init.d/nfsserver  stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/clusterdisk reiserfs stop
Filesystem[30377]:      2007/06/18_06:51:05 INFO: Running stop for /dev/drbd0 on /mnt/clusterdisk
Filesystem[30313]:      2007/06/18_06:51:05 INFO: Filesystem Success
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/drbddisk BR stop
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/IPsrcaddr 172.16.17.161 stop
IPsrcaddr[30476]:       2007/06/18_06:51:05 INFO: No preferred source address defined, nothing to stop
IPsrcaddr[30439]:       2007/06/18_06:51:05 INFO: IPsrcaddr Success
ResourceManager[30204]: 2007/06/18_06:51:05 info: Running /etc/ha.d/resource.d/IPaddr 172.16.17.161/24/eth0:0 stop
IPaddr[30523]:  2007/06/18_06:51:05 INFO: IPaddr Success
heartbeat[30194]: 2007/06/18_06:51:05 info: all HA resource release completed (standby).
heartbeat[3836]: 2007/06/18_06:51:05 info: Local standby process completed [all].
heartbeat[3836]: 2007/06/18_06:51:07 WARN: 1 lost packet(s) for [node-2] [1046581:1046583]
heartbeat[3836]: 2007/06/18_06:51:07 info: remote resource transition completed.
heartbeat[3836]: 2007/06/18_06:51:07 info: No pkts missing from node-2!
heartbeat[3836]: 2007/06/18_06:51:07 info: Other node completed standby takeover of all resources.
heartbeat[3836]: 2007/06/18_06:51:37 info: Link node-2:eth0 up.
heartbeat[3836]: 2007/06/18_06:51:38 info: Link group1:group1 up.
heartbeat[3836]: 2007/06/18_06:51:38 WARN: Late heartbeat: Node group1: interval 43000 ms
heartbeat[3836]: 2007/06/18_06:51:38 info: Status update for node group1: status ping
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-2:eth0 dead.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-2:eth0 up.
<next entry more than a minute away...>

node 2 (the active one):
<last entry more than a minute ago...>
heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.
heartbeat[3836]: 2007/06/18_06:51:05 info: node-1 wants to go standby [all]
heartbeat[3836]: 2007/06/18_06:51:06 info: standby: acquire [all] resources from node-1
heartbeat[30758]: 2007/06/18_06:51:06 info: acquire all HA resources (standby).
ResourceManager[30768]: 2007/06/18_06:51:06 info: Acquiring resource group: node-1 172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs nfsserver nmb smb myservice_v1
IPaddr[30791]:  2007/06/18_06:51:06 INFO: IPaddr Running OK
IPsrcaddr[30899]:       2007/06/18_06:51:06 INFO: IPsrcaddr Running OK
Filesystem[31046]:      2007/06/18_06:51:07 INFO: Running status for /dev/drbd0 on /mnt/clusterdisk
Filesystem[31046]:      2007/06/18_06:51:07 INFO: /mnt/clusterdisk is mounted (running)
Filesystem[30982]:      2007/06/18_06:51:07 INFO: Filesystem Running OK
heartbeat[30758]: 2007/06/18_06:51:07 info: all HA resource acquisition completed (standby).
heartbeat[3836]: 2007/06/18_06:51:07 info: Standby resource acquisition done [all].
heartbeat[3836]: 2007/06/18_06:51:07 info: remote resource transition completed.
heartbeat[3836]: 2007/06/18_06:51:38 info: Link node-1:eth0 up.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 dead.
heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 up.
<next entry more than a minute away...>
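
To line up the link events from both nodes, the "Link ... dead" and "Link ... up" entries can be pulled out of ha-log and paired. The following is only a minimal Python sketch, assuming the log line format shown in the excerpts above; the helper names `link_events` and `outages` are made up for illustration and are not part of heartbeat.

```python
import re
from datetime import datetime

# Matches lines such as:
#   heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.
LINK_RE = re.compile(
    r"heartbeat\[\d+\]:\s+(\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+info:\s+"
    r"Link\s+(\S+):(\S+)\s+(up|dead)\."
)


def link_events(lines):
    """Yield (timestamp, node, interface, state) for each link status line."""
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
            yield ts, m.group(2), m.group(3), m.group(4)


def outages(events):
    """Pair each 'dead' event with the next 'up' event for the same link."""
    down_since = {}
    result = []
    for ts, node, iface, state in events:
        key = (node, iface)
        if state == "dead":
            down_since[key] = ts
        elif key in down_since:
            start = down_since.pop(key)
            result.append((node, iface, start, (ts - start).total_seconds()))
    return result


# Link lines copied from the node 2 excerpt above.
NODE2_LOG = [
    "heartbeat[3836]: 2007/06/18_06:51:00 info: Link node-1:eth0 dead.",
    "heartbeat[3836]: 2007/06/18_06:51:38 info: Link node-1:eth0 up.",
    "heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 dead.",
    "heartbeat[3836]: 2007/06/18_06:56:06 info: Link node-1:eth0 up.",
]

for node, iface, start, seconds in outages(link_events(NODE2_LOG)):
    print(f"{node}:{iface} down at {start} for {seconds:.0f} s")
```

On the node 2 excerpt this reports one outage of 38 seconds and one that began and ended within the same second. Running the same script over the full ha-log of both nodes would show whether the outages recur at a fixed interval.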

I've omitted the log entries other than those related to my problem, as I
don't see any connection between them.
In addition to the network problem, I was wondering if there is a way to
prevent failover of inactive resources, as this only clutters up the
logfiles.

Finally, my haresources:
node-1           IPaddr::172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 \
                         drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs \
                         nfsserver nmb smb myservice_v1
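
The order of resources on this line matters: node 1's log above shows them being stopped in the reverse of the order listed here. As a rough illustration, the Python sketch below splits the haresources entry into resource names and their `::`-separated arguments and prints both orders. The function name `parse_haresources` is hypothetical, and the parsing is deliberately simplified (it does not cover every haresources shorthand).

```python
def parse_haresources(text):
    """Split one haresources entry into (node, [(resource, [args]), ...])."""
    # Join backslash-continued lines into a single logical line.
    logical = text.replace("\\\n", " ")
    fields = logical.split()
    node, resources = fields[0], fields[1:]
    parsed = []
    for res in resources:
        # "Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs" becomes
        # name "Filesystem" with three arguments.
        name, *args = res.split("::")
        parsed.append((name, args))
    return node, parsed


# The entry as posted above.
HARESOURCES = """\
node-1 IPaddr::172.16.17.161/24/eth0:0 IPsrcaddr::172.16.17.161 \\
    drbddisk::BR Filesystem::/dev/drbd0::/mnt/clusterdisk::reiserfs \\
    nfsserver nmb smb myservice_v1
"""

node, resources = parse_haresources(HARESOURCES)
start_order = [name for name, _ in resources]
stop_order = list(reversed(start_order))

print("preferred node:", node)
print("start order:   ", " ".join(start_order))
print("stop order:    ", " ".join(stop_order))
```

The computed stop order (myservice_v1, smb, nmb, nfsserver, Filesystem, drbddisk, IPsrcaddr, IPaddr) matches the sequence of "Running ... stop" entries in node 1's log.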

and my ha.cf:
debugfile /var/log/ha-debug
logfile          /var/log/ha-log
keepalive 500ms
deadtime 30
warntime 15
initdead 90
udpport          694
baud             57600
serial           /dev/ttyS0   
ucast eth1 10.0.0.2          (on node 1; node 2 has: ucast eth1 10.0.0.1)
auto_failback off
stonith_host * suicide node-1 node-2
watchdog /dev/watchdog
node             node-1 node-2
ping_group group1 172.16.17.240 172.16.17.155 172.16.17.180
apiauth ipfail gid=haclient uid=hacluster
respawn hacluster /usr/lib/heartbeat/ipfail
deadping 5
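
To compare the timing directives above on a common scale, they can be converted to milliseconds. The Python sketch below does this under one assumption: a bare number in ha.cf means seconds and an `ms` suffix means milliseconds. The helper names `to_ms` and `timing_directives` are illustrative only.

```python
TIMING_KEYS = ("keepalive", "warntime", "deadtime", "deadping", "initdead")


def to_ms(value):
    """Convert an ha.cf time value to milliseconds.

    Assumption: a bare number is in seconds, an 'ms' suffix is milliseconds.
    """
    if value.endswith("ms"):
        return int(value[:-2])
    return int(float(value) * 1000)


def timing_directives(text):
    """Return {directive: milliseconds} for the timing lines in an ha.cf."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in TIMING_KEYS:
            result[parts[0]] = to_ms(parts[1])
    return result


# Timing lines copied from the ha.cf above.
HA_CF = """\
keepalive 500ms
deadtime 30
warntime 15
initdead 90
deadping 5
"""

timing = timing_directives(HA_CF)
for name in ("warntime", "deadtime", "deadping", "initdead"):
    missed = timing[name] // timing["keepalive"]
    print(f"{name:9s} {timing[name]:6d} ms = {missed} keepalive intervals")
```

With these values, deadping corresponds to 10 keepalive intervals and deadtime to 60, so the ping group is declared dead far sooner than a cluster node would be.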

Any help would be greatly appreciated.

Regards,
Ronald

