[Linux-cluster] ccs/ricci cluster operation design

Etsuji Nakai Mon, 08 Aug 2011 17:51:23 -0700

Let me know your thoughts on the ccs/ricci cluster operation design. 

The bottom line is that it's a bad design to have a failed node rejoin the 
cluster automatically, and I think ccs/ricci should have options (in addition 
to --start/--stop) that just start/stop the services without changing their 
chkconfig status.


Here are the details of the problem:

You can start/stop the cluster with ccs --start/--stop, but my customer cannot 
adopt it for the following reason.

In the customer's cluster:

- They start/stop the cluster by starting/stopping the services directly. (They 
are not using the ccs/ricci interface at the moment.)
- They set chkconfig off for the cluster services (cman, rgmanager, etc.).
- They force-reboot the failed node with the fence device.

In this setting, when a node is force-rebooted because of a problem such as a 
kernel panic, the node doesn't automatically join the cluster. The customer 
then logs in to the node and investigates the problem. When they are sure that 
the problem is resolved, they start the cluster services on this node again.
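
As a rough illustration of the manual procedure described above, here is a 
minimal Python sketch that only builds the command lines involved. It assumes 
that cman and rgmanager are the only cluster services in use and that cman must 
start before rgmanager; the function names are hypothetical and not part of 
ccs/ricci. By default nothing is executed, the commands are just printed.

```python
import subprocess

# Assumption: only cman and rgmanager are in use. cman is started first,
# and the services are stopped in the reverse order.
CLUSTER_SERVICES = ["cman", "rgmanager"]


def disable_at_boot(services=CLUSTER_SERVICES):
    """Commands that keep the services from starting after a reboot."""
    return [["chkconfig", svc, "off"] for svc in services]


def start_commands(services=CLUSTER_SERVICES):
    """Commands that start the services without touching chkconfig."""
    return [["service", svc, "start"] for svc in services]


def stop_commands(services=CLUSTER_SERVICES):
    """Commands that stop the services in reverse order."""
    return [["service", svc, "stop"] for svc in reversed(services)]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(disable_at_boot())   # done once, so a fenced node stays out
    run(start_commands())    # done by hand after the node is checked
```

The point of the sketch is that starting the services and enabling them at boot 
are separate steps here, which is what allows the fenced node to stay out of 
the cluster until someone has looked at it.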

Now, the problem is that this customer cannot adopt the ccs tool for cluster 
operation. Under ccs operation, when the failed node is force-rebooted, it 
automatically tries to rejoin the cluster because chkconfig is on, even though 
the underlying problem has not yet been investigated and resolved by the 
customer.

Here's the related discussion on bugzilla: 
https://bugzilla.redhat.com/show_bug.cgi?id=728041

-- Etsuji

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
