Hi everyone,

Please review the changes in CMM and CCR that support
the new "weaker" form of membership being introduced
with Project Colorado. This "weaker" form of membership
allows multiple partitions to survive if a split brain
situation develops.

You can access the webrev at:
http://cr.opensolaris.org/~samnayak/colorado-I-CMM-CCR/


*********
Here is a summary of the CMM and CCR changes.
Please refer to the Requirements Document for Project Colorado
to understand what these changes in CMM and CCR are trying to achieve.


                SUMMARY OF CMM CHANGES
                ----------------------
1. cmm_impl::read_membership_info_from_ccr() reads membership properties
from CCR and stores them into the CMM 'conf' structure
(structure declaration in cmm_config.h).

The properties read from infrastructure table in CCR are
        a. multiple_partitions : true/false which indicates whether
        cluster allows multiple partitions to survive
        b. ping_targets : comma separated list of IP addresses
        that the cluster nodes ping to check their own health
        during CMM reconfiguration. This ping is done only
        if cluster allows multiple partitions to survive
        (weaker form of membership)

This 'read' is done when the cluster node first boots,
and again whenever the infrastructure table (which stores these
properties) is modified.


2. CMM has methods register_cmm_for_infr_callback() and
unregister_cmm_for_infr_callback() to register/unregister with CCR
for infrastructure table update callbacks.
When infrastructure table changes, CCR delivers callbacks to CMM
and CMM reads in its required membership information.

The callback object registered by CMM is of type infr_cb_impl_for_cmm
(introduced in this set of changes).


3. cmm_config.h declares the 'conf' structure that holds
properties read in by CMM from CCR.
We add a boolean value called multiple_partitions and
a list of strings called ping_targets that stores the IP addresses.
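
As a rough illustration of items 1 and 3 (not the code in the webrev),
the two properties could be held and parsed along these lines.
The names membership_conf, split_ping_targets and parse_membership
are hypothetical, and the assumption that the CCR values arrive as
plain strings ("true"/"false" and a comma separated list) is ours.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the two new fields of the CMM 'conf' structure.
struct membership_conf {
    bool multiple_partitions = false;
    std::vector<std::string> ping_targets;
};

// Split a comma separated list of IP addresses, trimming blanks
// and skipping empty entries.
std::vector<std::string> split_ping_targets(const std::string &csv)
{
    std::vector<std::string> out;
    std::string::size_type start = 0;
    while (start <= csv.size()) {
        std::string::size_type end = csv.find(',', start);
        if (end == std::string::npos) {
            end = csv.size();
        }
        const std::string item = csv.substr(start, end - start);
        const std::string::size_type first = item.find_first_not_of(" \t");
        if (first != std::string::npos) {
            const std::string::size_type last = item.find_last_not_of(" \t");
            out.push_back(item.substr(first, last - first + 1));
        }
        start = end + 1;
    }
    return out;
}

// Build the configuration from the two raw property values
// (assumed here to be strings as stored in the infrastructure table).
membership_conf parse_membership(const std::string &multiple_partitions,
                                 const std::string &ping_targets)
{
    membership_conf conf;
    conf.multiple_partitions = (multiple_partitions == "true");
    conf.ping_targets = split_ping_targets(ping_targets);
    return conf;
}
```

For example, parse_membership("true", "10.0.0.1, 10.0.0.2") would yield
multiple_partitions set and two ping targets.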

4. During CMM reconfiguration, the CMM automaton does the ping check
if cluster is configured to allow multiple partitions to survive.
This functionality is implemented in automaton_impl::ping_health_check().
It does a door upcall to the userland qd_userd daemon
to execute the ping.
If the door upcall cannot be performed or the ping fails,
the automaton treats the ping check as failed and panics the node.

5. For the weaker form of membership that allows multiple partitions
to survive, we alter the definition of required quorum votes.
For the usual strong form of membership, the definition is:
Q = (V/2) + 1, where Q is the required quorum votes and V is the total votes.
For the new weaker form of membership, the definition is:
Q = 1, if V = 1 or 2;
and we do not allow the weaker form of membership if V > 2.
If the cluster has more than 2 nodes, strong form of membership is used.
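
The quorum rule above can be sketched as a small function.
This is only an illustration: required_quorum_votes is a hypothetical
name, V/2 is taken to be integer division, and falling back to the
strong formula when V > 2 is our reading of the rule.

```cpp
#include <cassert>

// Hypothetical helper, not taken from the webrev.
// total_votes is V; the return value is Q.
unsigned required_quorum_votes(unsigned total_votes, bool multiple_partitions)
{
    assert(total_votes >= 1);
    // Weaker membership is only permitted when V is 1 or 2.
    if (multiple_partitions && total_votes <= 2) {
        return 1;
    }
    // Strong membership: Q = (V/2) + 1, using integer division.
    return total_votes / 2 + 1;
}
```

With V = 2, the strong form requires Q = 2 (both votes), while the
weaker form requires Q = 1, so each side of a split can survive.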

6. When a node tries to join another node, a path is first formed
before CMM knows that a remote node is reachable and starts reconfiguration.
When a node receives a request to form path,
path_manager::update_node_incarnation() does checks based on
incarnation numbers before allowing the path formation.
An additional check is introduced here to see if the local node
has any CCR changes done during split-brain that are still unresolved.
If so, then the node will reject the request of path formation.
Thus, a node that has unresolved post-split-brain CCR changes 
will not allow other nodes to join it.
The CCR changes have to be resolved (nodes have to be marked winner/loser)
before the nodes can join.
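
A minimal sketch of the decision in item 6 is shown here.
The names path_decision and decide_path_request are hypothetical,
the existing incarnation checks are reduced to a single boolean,
and the order of the two checks is an assumption.

```cpp
#include <cassert>

// Possible outcomes of a path formation request (illustrative only).
enum class path_decision {
    accept,
    reject_incarnation,
    reject_unresolved_ccr
};

// incarnation_checks_pass: result of the existing incarnation number checks.
// unresolved_split_brain_ccr: true if the local node still has CCR changes
// from a split brain that have not been marked winner/loser.
path_decision decide_path_request(bool incarnation_checks_pass,
                                  bool unresolved_split_brain_ccr)
{
    if (!incarnation_checks_pass) {
        return path_decision::reject_incarnation;
    }
    if (unresolved_split_brain_ccr) {
        return path_decision::reject_unresolved_ccr;
    }
    return path_decision::accept;
}
```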


                SUMMARY OF CCR CHANGES
                ----------------------

1. Flag To Indicate CCR Change During Split-Brain

      Existence of a file "/etc/cluster/.split_brain_ccr_change"
      on a cluster node serves as an indicator that CCR changes
      were done during split-brain on that node.

      The file is created (if it doesn't already exist) upon
      successful completion of a CCR transaction during split-brain.

      The file is defined as SPLIT_BRAIN_CHANGE_FILE:

      #define SPLIT_BRAIN_CHANGE_FILE "/etc/cluster/.split_brain_ccr_change"

      It is created with 600 permission bits, owned by root.

      The split-brain change file will be created at the end of
      updatable_copy_impl::commit_transaction(), if all the previous
      steps of the commit were successful.
      
2. New convenience functions in os class

        os::file_create() to create files.
        The creation mode will be O_EXCL | O_CREAT.

        os::file_exists() will be used to query whether or not
        there is a CCR change during split-brain on this node.
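
        For illustration only, the behaviour described for the flag
        file (create with O_EXCL | O_CREAT and 600 permissions, no
        error if it already exists, existence check) can be sketched
        with plain POSIX calls. The function names below are
        hypothetical and are not the os:: methods from the webrev.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Create the flag file with mode 0600 if it does not already exist.
// Returns 0 on success (including "already exists"), else an errno value.
int create_flag_file(const std::string &path)
{
    const int fd = open(path.c_str(), O_CREAT | O_EXCL | O_WRONLY,
                        S_IRUSR | S_IWUSR);
    if (fd < 0) {
        // An existing flag file already records the split-brain change.
        return (errno == EEXIST) ? 0 : errno;
    }
    close(fd);
    return 0;
}

// Report whether the flag file is present.
bool flag_file_exists(const std::string &path)
{
    struct stat st;
    return stat(path.c_str(), &st) == 0;
}
```

        O_EXCL makes the create fail with EEXIST rather than silently
        reopening an existing file, which the sketch treats as success.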

3. How To Indicate A Table Was Changed During Split-Brain

        No special indication is present in the changed table itself.
        The generation number increases normally whenever the table
        is changed. The presence of the SPLIT_BRAIN_CHANGE_FILE
        indicates that some change exists.

4. How To Select The Winning CCR Copy

        The administrator will run a CLI utility to mark the CCR
        on a cluster node as the truth copy.
        (The CLI is not part of the present set of changes,
        but the CCR interfaces that it will use are present.)

        The CCR interface will delete the SPLIT_BRAIN_CHANGE_FILE.
        The CCR tables themselves are not modified.

5. How To Select The Loser CCR Copy

        The administrator will run a CLI utility to mark the CCR
        on a cluster node as the losing copy.
        (The CLI is not part of the present set of changes,
        but the CCR interfaces that it will use are present.)

        The CCR interface will delete the SPLIT_BRAIN_CHANGE_FILE
        and then mark every table on the losing node as invalid
        (generation number -1). Marking every table with
        generation number -1 ensures that when this node joins
        the winner node to form a cluster, it updates its CCR copy
        from the winner node's CCR. It also means that the loser
        node cannot form a cluster on its own.

        Note that this command to mark the loser has to be run
        in non-cluster mode. The node must then be rebooted
        into cluster mode.
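
        The effect of marking a winner or a loser can be modelled
        in memory as follows. This is only a sketch: ccr_node_state,
        mark_as_winner and mark_as_loser are hypothetical names, and
        the real interfaces act on the flag file and the on-disk
        CCR tables rather than on a struct.

```cpp
#include <cassert>
#include <map>
#include <string>

// Generation number used to mark a table as invalid.
const long INVALID_GENNUM = -1;

// In-memory model of one node's CCR state (illustrative only).
struct ccr_node_state {
    bool split_brain_change_flag = false;       // models SPLIT_BRAIN_CHANGE_FILE
    std::map<std::string, long> table_gennum;   // table name -> generation number
};

// Winner: clear the flag, leave the tables untouched.
void mark_as_winner(ccr_node_state &node)
{
    node.split_brain_change_flag = false;
}

// Loser: clear the flag and invalidate every table, so the node
// takes its CCR from the winner when it joins.
void mark_as_loser(ccr_node_state &node)
{
    node.split_brain_change_flag = false;
    for (auto &entry : node.table_gennum) {
        entry.second = INVALID_GENNUM;
    }
}
```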

6. CCR Interfaces Provided

        a. Interface to query if this node has split-brain CCR change(s).
           Exists both in kernel and userland.
           Implementation : Check if SPLIT_BRAIN_CHANGE_FILE exists.

         bool ccrlib::split_brain_ccr_change_exists()
        
        b. Interface to mark this node's CCR as winning copy.
           Removes SPLIT_BRAIN_CHANGE_FILE.
           Exists in userland.

         int ccrlib::mark_ccr_as_winner()
        
        c. Interface to mark this node's CCR as invalid until cluster join;
           in other words, mark this node's CCR as loser copy.
           Removes SPLIT_BRAIN_CHANGE_FILE.
           Sets the gennum of every CCR table to -1.
           Exists in userland.

         int ccrlib::mark_ccr_as_loser()

        
7. Internal utility - ccradm - is modified to provide the following
features until the CLI is available.
        a. ccradm -q : check whether split-brain CCR changes exist
        b. ccradm -m <"winner"|"loser"> :
                mark a node as the "winner" or "loser"


***********


Apologies for the short notice;
we are targeting a putback to Colorado staging gate on Dec 17
if there are no major objections,
and hence would request quick reviews. :)



Thanks & Regards,
Zoram, Sambit



