Hello,

Was wondering if anyone else has ever run into this.  We have a three-node RHCS 
cluster:

Three HP ProLiant DL380 G6s, 48 GB memory
Dual network, dual power, QLogic HBAs for redundancy
EMC SAN
RHEL 5.5, kernel 2.6.18-194.el5

All three nodes are in a single RHCS cluster running 12 Oracle database 
services.  The cluster itself runs fine under normal conditions, and all 
failovers function as expected.  There is only one failover domain configured, 
and all three nodes are members of that domain.  Four of the Oracle database 
services contain GFS2 file systems; the rest are ext3.

The problem occurs when we attempt a controlled shutdown of the current master 
node.  We have tested the following scenarios:

1.  Node 1 is the current master and not running any services.  Node 2 is also 
not running any services.  Node 3 is running all 12 services.  We hard-fail 
node 1 (by logging into the iLO and clicking "Reset" in power management), 
and node 2 immediately takes over the master role; the services stay where 
they are and continue to function.  I believe this is the expected behavior.

2.  Node 1 is the current master and not running any services.  Three services 
are on node 2, and node 3 is running the rest.  Again, we hard-fail node 1 as 
described above; node 2 assumes the master role, and the services stay where 
they are and continue to function.

3.  Same setup as above: node 1 is the master and not running any services, 
three services on node 2 and the rest on node 3.  This time we perform a 
controlled shutdown of node 1 to "properly" remove it from the cluster (say 
we're doing a rolling OS patch across the nodes), with the following steps on 
the master node:
 - Unmount any GFS file systems.
 - service rgmanager stop; service gfs2 stop; service gfs stop
   (clustat shows node 1 Online but no rgmanager, as expected)
 - fence_tool leave
   (removes node 1 from the fence group, in the hope that the other nodes 
   won't try to fence it while it reboots)
 - service clvmd stop
 - cman_tool leave remove
 - service qdiskd stop
 - shutdown
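For reference, here is the whole sequence as one script.  The run() wrapper 
and DRY_RUN guard are my own additions (dry-run by default, so it only prints 
each command); everything else mirrors the steps above:

```shell
#!/bin/bash
# Controlled removal of a node from the RHCS cluster before a reboot.
# Sketch only: with DRY_RUN=1 (the default) each step is merely printed.
# Run with DRY_RUN=0 on the actual node to execute the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@" || { echo "FAILED: $*" >&2; return 1; }
    fi
}

controlled_shutdown() {
    # Unmount GFS/GFS2 file systems first
    run umount -a -t gfs2
    run umount -a -t gfs

    # Stop the cluster stack top-down
    run service rgmanager stop
    run service gfs2 stop
    run service gfs stop
    run fence_tool leave         # leave the fence group so peers don't fence us
    run service clvmd stop
    run cman_tool leave remove   # 'remove' also reduces expected votes
    run service qdiskd stop
    run shutdown -h now
}

controlled_shutdown
```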
Everything appears normal until we execute 'cman_tool leave remove'.  At that 
point the cluster log on nodes 2 and 3 shows "Lost contact with quorum 
device" (which we expect) but also shows "Emergency stop of services" for all 
12 services.  While access to the quorum device is restored almost immediately 
(node 2 takes over the master role), rgmanager is temporarily unavailable on 
nodes 2 and 3 while the cluster reconfigures itself, restarting all 12 
services.  Eventually all 12 services restart properly (not necessarily on 
the nodes they were on), and when node 1 finishes rebooting it properly 
rejoins the cluster.  Node 2 remains the master.

If I do the same tests as above and reboot a node that is NOT the master, the 
services remain where they are and the cluster does not reconfigure itself or 
restart any services.
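For anyone trying to reproduce this, we watch membership, quorum, and service 
state on the surviving nodes during the test with a simple loop (my own 
convenience wrapper; clustat and cman_tool are the stock RHCS tools, and the 
five-second interval is arbitrary):

```shell
#!/bin/bash
# Poll cluster membership, quorum, and service state while a node is
# rebooted.  Convenience wrapper only; clustat/cman_tool are stock RHCS.
watch_cluster() {
    local interval=${1:-5}
    while true; do
        date
        clustat -l                        # members, quorum disk, services
        cman_tool status | grep -i votes  # expected vs. total votes
        echo "----"
        sleep "$interval"
    done
}

# Usage (on node 2 or 3 during the test):  watch_cluster 5
```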

My questions are: Why does the cluster reconfigure itself and restart ALL 
services, regardless of which node they are on, when I do a controlled 
shutdown of the current master node?  Do I have to hard-reset the master node 
in an RHCS cluster so the remaining services don't get restarted?  Why does 
the cluster completely reconfigure itself when the master node is 'properly' 
removed?

Thanks for your help, and any suggestions would be appreciated.

Greg Charles
Mid Range Systems

[email protected]


--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
