CVSROOT:        /cvs/cluster
Module name:    cluster
Branch:         RHEL5
Changes by:     [EMAIL PROTECTED]       2007-11-08 09:39:10

Modified files:
        cman/man       : cman_tool.8 

Log message:
        add an explanation of the "cman_tool nodes" states and some detail 
about the
        "disallowed" state.
        bz#323931

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/cman_tool.8.diff?cvsroot=cluster&only_with_tag=RHEL5&r1=1.9.2.4&r2=1.9.2.5

--- cluster/cman/man/cman_tool.8        2007/10/12 18:53:57     1.9.2.4
+++ cluster/cman/man/cman_tool.8        2007/11/08 09:39:10     1.9.2.5
@@ -1,4 +1,4 @@
-.TH CMAN_TOOL 8 "Nov 23 2004" "Cluster utilities"
+.TH CMAN_TOOL 8 "Nov 8 2007" "Cluster utilities"
 
 .SH NAME
 cman_tool \- Cluster Management Tool
@@ -267,3 +267,69 @@
 .br
 16 Interaction with OpenAIS
 .br
+.SH NOTES
+.br
+The
+.B nodes
+subcommand shows a list of nodes known to cman. The state is one of the following:
+.br
+M      The node is a member of the cluster
+.br
+X      The node is not a member of the cluster
+.br
+d      The node is known to the cluster but is disallowed access to it.
+.br
+.SH DISALLOWED NODES
+Occasionally (but very infrequently, I hope) you may see nodes marked as "Disallowed" in cman_tool status, or "d" in cman_tool nodes. This is a bit of a nasty hack to get around a mismatch between what the upper layers expect of the cluster manager and OpenAIS.
+.P
+If a node experiences a momentary lack of connectivity, but one that is long enough to trigger the token timeouts, then it will be removed from the cluster. When connectivity is restored, OpenAIS will happily let it rejoin the cluster with no fuss. Sadly, the upper layers don't like this very much. They may (indeed, probably will) have changed their internal state while the other node was away, and there is no straightforward way to bring the rejoined node up-to-date with that state. When this happens the node is marked "Disallowed" and is not permitted to take part in cman operations.
+.P
+If the remainder of the cluster is quorate, the node will be sent a kill message and will be forced to leave the cluster that way. Note that fencing should kick in and remove the node permanently anyway, but it may take longer than the network outage for this to complete.
+
+If the remainder of the cluster is inquorate then we have a problem. The 
likelihood is that we will have two (or more) partitioned clusters and we 
cannot decide which is the "right" one. In this case we need to defer to the 
system administrator to kill an appropriate selection of nodes to restore the 
cluster to sensible operation.
+
+The latter scenario should be very rare and may indicate a bug somewhere in 
the code. If the local network is very flaky or busy it may be necessary to 
increase some of the protocol timeouts for OpenAIS. We are trying to think of 
better solutions to this problem.
+
+Recovering from this state can, unfortunately, be complicated. Fortunately, in the majority of cases, fencing will do the job for you, and the disallowed state will only be temporary. If it persists, the recommended approach is to run cman_tool nodes on all systems in the cluster and determine the largest common subset of nodes that are valid members to each other. Then reboot the others and let them rejoin correctly. In the case of a single-node disconnection this should be straightforward; with a large cluster that has experienced a network partition it could get very complicated!
+
+Example:
+
+In this example we have a five-node cluster that has experienced a network partition. Here is the output of cman_tool nodes from all systems:
+.nf
+Node  Sts   Inc   Joined               Name
+   1   M   2372   2007-11-05 02:58:55  node-01.example.com
+   2   d   2376   2007-11-05 02:58:56  node-02.example.com
+   3   d   2376   2007-11-05 02:58:56  node-03.example.com
+   4   M   2376   2007-11-05 02:58:56  node-04.example.com
+   5   M   2376   2007-11-05 02:58:56  node-05.example.com
+
+Node  Sts   Inc   Joined               Name
+   1   d   2372   2007-11-05 02:58:55  node-01.example.com
+   2   M   2376   2007-11-05 02:58:56  node-02.example.com
+   3   M   2376   2007-11-05 02:58:56  node-03.example.com
+   4   d   2376   2007-11-05 02:58:56  node-04.example.com
+   5   d   2376   2007-11-05 02:58:56  node-05.example.com
+
+Node  Sts   Inc   Joined               Name
+   1   d   2372   2007-11-05 02:58:55  node-01.example.com
+   2   M   2376   2007-11-05 02:58:56  node-02.example.com
+   3   M   2376   2007-11-05 02:58:56  node-03.example.com
+   4   d   2376   2007-11-05 02:58:56  node-04.example.com
+   5   d   2376   2007-11-05 02:58:56  node-05.example.com
+
+Node  Sts   Inc   Joined               Name
+   1   M   2372   2007-11-05 02:58:55  node-01.example.com
+   2   d   2376   2007-11-05 02:58:56  node-02.example.com
+   3   d   2376   2007-11-05 02:58:56  node-03.example.com
+   4   M   2376   2007-11-05 02:58:56  node-04.example.com
+   5   M   2376   2007-11-05 02:58:56  node-05.example.com
+
+Node  Sts   Inc   Joined               Name
+   1   M   2372   2007-11-05 02:58:55  node-01.example.com
+   2   d   2376   2007-11-05 02:58:56  node-02.example.com
+   3   d   2376   2007-11-05 02:58:56  node-03.example.com
+   4   M   2376   2007-11-05 02:58:56  node-04.example.com
+   5   M   2376   2007-11-05 02:58:56  node-05.example.com
+.fi
+In this scenario we should kill nodes node-02 and node-03. Of course, the three-node cluster of node-01, node-04 & node-05 should remain quorate and be able to fence the two rejoined nodes anyway, but it is possible that the cluster has a qdisk setup that precludes this.
+
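The recovery procedure described in the patch (comparing "cman_tool nodes" output across systems) amounts to filtering each system's node list down to the entries in state "M". A minimal sketch of that filtering step, using awk on saved output; the here-document reproduces the first output block from the man page example, and this script is an illustration, not part of cman:

```shell
# Print only the nodes this system sees as full members (state "M").
# Field 2 of each data row is the Sts column; the last field is the name.
# The header row ("Node Sts Inc Joined Name") is skipped automatically
# because its second field is "Sts", not "M".
awk '$2 == "M" {print $NF}' <<'EOF'
Node  Sts   Inc   Joined               Name
   1   M   2372   2007-11-05 02:58:55  node-01.example.com
   2   d   2376   2007-11-05 02:58:56  node-02.example.com
   3   d   2376   2007-11-05 02:58:56  node-03.example.com
   4   M   2376   2007-11-05 02:58:56  node-04.example.com
   5   M   2376   2007-11-05 02:58:56  node-05.example.com
EOF
```

Running the same filter against each system's output and intersecting the results identifies the largest common subset of mutually valid members; the remaining nodes are the ones to reboot.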
