Hi all,

Short introduction: My name is Harry, I work in Helsinki, Finland, and I have 
used RHCS from the beginning. We currently have 7 clusters, mainly running 
MySQL/Oracle databases.

I thought I had some kind of knowledge of this clustering software, and 
everything seemed fine until version 5. I don't have problems or serious bugs 
in any of my RHEL 4 clusters.

But....

I tried to move to 5.1 on 64-bit HP blades. The cluster just won't work, or it 
works but I don't have any trust in it anymore. I have tried about 20 
different scenarios and there are far too many problems; a couple of them will 
prevent me from using this at all. I have opened 3 tickets with RH support, and 
it seems to me they don't even know the little that I know. Twice I have had to 
tell them to read the f...g manual, because they have spoken directly against 
the qdisk man page. They just don't know how it should work... hard to believe, 
but true.

First, I asked how to change cman's deadnode_timeout in version 5, because it 
is no longer in /proc and the parameter didn't work in my tests. Support said 
"you can't tune the timeout at all". I asked how I could use qdisk at all then, 
since the man page says cman's timeout must be greater than qdisk's eviction 
timeout... and told them to read the man page. In the end I found the correct 
parameter myself: the totem token timeout.
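For anyone else looking for it: in RHEL 5 this lives in cluster.conf instead of 
/proc. A minimal sketch of the relevant pieces (cluster name and all values are 
illustrative, not from my actual config):

```xml
<!-- Sketch only; example values. The totem token timeout (in milliseconds)
     replaces the old /proc deadnode_timeout tunable. -->
<cluster name="example" config_version="1">
  <totem token="21000"/>
  <!-- qdisk eviction time is roughly interval * tko seconds (here 10 s);
       per qdisk(5), it must stay below the totem token timeout above. -->
  <quorumd interval="1" tko="10" votes="1" label="example-qdisk"/>
</cluster>
```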

The second time, they said that in my 2-node cluster I made a mistake by giving 
1 vote to the quorum disk... but the man page again tells you to do exactly 
that, and of course it is correct in a 2-node cluster...
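For the record, the two-node-plus-qdisk setup the man page describes looks 
roughly like this (a sketch with example values, not my actual config):

```xml
<!-- Two-node cluster with a quorum disk: the qdisk gets 1 vote so
     expected_votes becomes 3, and a single surviving node plus the qdisk
     (2 votes) keeps quorum. two_node stays at 0 because the qdisk acts
     as the tie-breaker instead. -->
<cman two_node="0" expected_votes="3"/>
<quorumd votes="1" label="example-qdisk"/>
```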

So, this is my sad history with version 5. Do you use 64-bit version 5, and 
what is your experience?

My problems this time are:

1. 2-node cluster: I can't start just one node to bring the cluster services 
up. It hangs in fencing and waits until I start the second node; immediately 
after that, when both nodes are starting cman, the cluster comes up. So if I 
have lost one node and have to restart the working node for some reason, I 
can't get the cluster up at all. It should work like before: both nodes are 
down, I start one, it fences the other and comes up. Now it just waits... the 
log says:

ccsd[25272]: Error while processing connect: Connection refused

This error message is so common that it tells me nothing...

2. qdisk doesn't work. 2-node cluster: I start both nodes at the same time to 
bring it up. Everything works fine: qdisk works, the heuristic works. But if I 
stop the cluster daemons on one node, that node can't rejoin the cluster 
without a complete reboot. It joins, the other node says OK, the node itself 
says OK, quorum is registered and the heuristic is up, but the node's quorum 
disk stays offline and the other node reports this node as offline. If I 
reboot the machine, it joins the cluster fine.

3. A funny thing: the heuristic ping didn't work at all in the beginning, and 
support gave me a "ping script" that made it work... which describes quite 
well how experimental this cluster software is nowadays...
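For comparison, a plain inline ping heuristic in cluster.conf looks something 
like this (a sketch; the address and timings are placeholders) — this is the 
form that didn't work for me without support's wrapper script:

```xml
<!-- Example heuristic: ping a gateway address; the score counts toward
     quorum only while the ping succeeds. Address and timings are
     placeholders, not from my config. -->
<quorumd interval="1" tko="10" votes="1" label="example-qdisk">
  <heuristic program="ping -c1 -w1 10.0.0.1" score="1" interval="2"/>
</quorumd>
```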

I have to stress that the basics are fine: fencing works in normal situations, 
I have no typos, the configs are in sync, everything is OK, but these problems 
still exist.

I have sent sosreports etc. to RH support twice. They have spent 3 weeks and 
still can't say what's wrong...


If somebody has anything in mind that might help...

Thanks,

-hjp


--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
