Hello,

I'm currently investigating problems on a site with qdisk time-outs. The
SAN is slow (additional hardware has been ordered), which is what causes the
time-outs. At the moment almost all clusters are managing to stay on-line,
mostly thanks to huge time-out values (150-300 seconds). We've seen disk
time-outs of more than 100 seconds, but they've now dropped to a maximum of
about 15 seconds.

However, there is one cluster that just keeps on killing its primary node,
and I'm unable to find the reason why. All that's being logged on this
cluster are the lines below:

Dec 15 00:15:10 node2 last message repeated 2 times
Dec 15 00:16:59 node2 qdiskd[3073]: <notice> Writing eviction notice for node 1
Dec 15 00:17:02 node2 qdiskd[3073]: <notice> Node 1 evicted
Dec 15 00:17:54 node2 openais[3040]: [TOTEM] The token was lost in the 
OPERATIONAL state.

Dec 15 00:15:10 node1 last message repeated 2 times
Dec 15 00:16:59 node1 openais[3297]: [CMAN ] cman killed by node 2 because we 
were killed by cman_tool or other application
Dec 15 00:17:00 node1 openais[3297]: [SERV ] Unloading all openais components

The quorumd has an interval of 3 and a tko of 50 (giving a 150-second window,
though we keep receiving warnings for the > 3 second delays).
The quorum_dev_poll is set to 300000, as is the totem token value.
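For reference, the timings above would correspond to a cluster.conf fragment
roughly like the one below. This is a sketch, not our actual file: the cluster
name, device path, and vote counts are placeholders, only the timing
attributes reflect the values described.

```xml
<cluster name="example" config_version="1">
  <!-- interval * tko = 3 s * 50 = 150 s before qdiskd declares a node dead -->
  <quorumd interval="3" tko="50" votes="1" device="/dev/mapper/qdisk"/>
  <!-- quorum_dev_poll in ms: 300000 ms = 300 s, matching the totem token -->
  <cman quorum_dev_poll="300000" expected_votes="3"/>
  <totem token="300000"/>
</cluster>
```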

I can't find any differences from the other clusters, which are now working
fine, except for one big difference: there are no resources configured. The
set-up exists only to get GFS working. Both nodes have their own IP, no
resources are shared (other than the GFS file systems, which are concurrently
available) and there are no checks other than qdisk availability and totem
tokens.

Is this behaviour to be expected when running a cluster without resources
configured?

Greetings,

Jan Huijsmans

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster