Hi everyone,

Please review the changes in CMM and CCR that support the new "weaker" form of membership being introduced with Project Colorado. This "weaker" form of membership allows multiple partitions to survive if a split-brain situation develops.

You can access the webrev at:
http://cr.opensolaris.org/~samnayak/colorado-I-CMM-CCR/

*********

Here is a summary of the CMM and CCR changes. Please refer to the Requirements Document for Project Colorado to understand what these changes in CMM and CCR are trying to achieve.

SUMMARY OF CMM CHANGES
----------------------

1. cmm_impl::read_membership_info_from_ccr() reads membership properties from CCR and stores them into the CMM 'conf' structure (structure declaration in cmm_config.h). The properties read from the infrastructure table in CCR are:

   a. multiple_partitions : true/false, which indicates whether the cluster allows multiple partitions to survive
   b. ping_targets : comma-separated list of IP addresses that the cluster nodes ping to check their own health during CMM reconfiguration. This ping is done only if the cluster allows multiple partitions to survive (weaker form of membership).

   This 'read' is done when the cluster node first boots up, and also when the infrastructure table (that stores these properties) is modified.

2. CMM has methods register_cmm_for_infr_callback() and unregister_cmm_for_infr_callback() to register/unregister with CCR for infrastructure table update callbacks. When the infrastructure table changes, CCR delivers callbacks to CMM and CMM reads in its required membership information. The callback object registered by CMM is of type infr_cb_impl_for_cmm (introduced in this set of changes).

3. cmm_config.h declares the 'conf' structure that holds properties read in by CMM from CCR. We add a boolean value called multiple_partitions and a list of strings called ping_targets that stores the IP addresses.

4. During CMM reconfiguration, the CMM automaton does the ping check if the cluster is configured to allow multiple partitions to survive. This functionality is implemented in automaton_impl::ping_health_check(). It essentially does a door upcall to the userland qd_userd daemon in order to execute the ping.
   If the door upcall cannot be performed or the ping fails, then the node is panicked by the automaton, considering that the ping check failed.

5. For the weaker form of membership that allows multiple partitions to survive, we alter the definition of required quorum votes.
   For the usual strong form of membership, the definition was:
       Q = (V/2) + 1
   where Q is the required quorum votes and V is the total votes.
   For the new weaker form of membership, the definition is:
       Q = 1, if V = 1 or 2
   and we do not allow the weaker form of membership if V > 2. If the cluster has more than 2 nodes, the strong form of membership is used.

6. When a node tries to join another node, a path is first formed before CMM knows that a remote node is reachable and starts reconfiguration. When a node receives a request to form a path, path_manager::update_node_incarnation() does checks based on incarnation numbers before allowing the path formation. An additional check is introduced here to see if the local node has any CCR changes done during split-brain that are still unresolved. If so, then the node will reject the path formation request. Thus, a node that has unresolved post-split-brain CCR changes will not allow other nodes to join it. The CCR changes have to be resolved (nodes have to be marked winner/loser) before the nodes can join.

SUMMARY OF CCR CHANGES
----------------------

1. Flag To Indicate CCR Change During Split-Brain

   Existence of a file "/etc/cluster/.split_brain_ccr_change" on a cluster node serves as an indicator that CCR changes were done during split-brain on that node. The file is created (if it doesn't already exist) upon successful completion of a CCR transaction during split-brain. The file is defined as SPLIT_BRAIN_CHANGE_FILE:

       #define SPLIT_BRAIN_CHANGE_FILE "/etc/cluster/.split_brain_ccr_change"

   It is created with 600 permission bits, owned by root.
   The split-brain change file will be created at the end of updatable_copy_impl::commit_transaction(), if all the previous steps of the commit were successful.

2. New convenience functions in os class

   os::file_create() to create files. The creation mode will be O_EXCL | O_CREAT.
   os::file_exists() will be used to query whether or not there is a CCR change during split-brain on this node.

3. How To Indicate A Table Was Changed During Split-Brain

   No special indication is present in the changed table itself. The generation number increases normally whenever the table is changed. The presence of the SPLIT_BRAIN_CHANGE_FILE will indicate that some change exists.

4. How To Select The Winning CCR Copy

   The administrator will run a CLI utility to mark the CCR on a cluster node as the truth copy. (The CLI is not part of the present set of changes, but the CCR interfaces that it will use are present.) The CCR interface will delete the SPLIT_BRAIN_CHANGE_FILE. The CCR tables themselves are not modified.

5. How To Select The Loser CCR Copy

   The administrator will run a CLI utility to mark the CCR on a cluster node as the losing copy. (The CLI is not part of the present set of changes, but the CCR interfaces that it will use are present.) The CCR interface will delete the SPLIT_BRAIN_CHANGE_FILE and then mark every table on the losing node as invalid (generation number -1). Marking every table with generation number -1 ensures that when this node joins the winner node to form a cluster, it will update its CCR copy using the winner node's CCR. The loser node also cannot form a cluster of its own.
   Note that this command to mark the loser has to be run in non-cluster mode. The node must then be rebooted into cluster mode.

6. CCR Interfaces Provided

   a. Interface to query if this node has split-brain CCR change(s). Exists both in kernel and userland. Implementation: check if SPLIT_BRAIN_CHANGE_FILE exists.
          bool ccrlib::split_brain_ccr_change_exists()

   b. Interface to mark this node's CCR as the winning copy. Removes SPLIT_BRAIN_CHANGE_FILE. Exists in userland.
          int ccrlib::mark_ccr_as_winner()

   c. Interface to mark this node's CCR as invalid until cluster join; in other words, mark this node's CCR as the loser copy. Removes SPLIT_BRAIN_CHANGE_FILE and sets the gennum of every CCR table to -1. Exists in userland.
          int ccrlib::mark_ccr_as_loser()

7. The internal utility ccradm is modified to provide the following features until the CLI is available.

   a. ccradm -q : check whether split-brain CCR changes exist
   b. ccradm -m <"winner"|"loser"> : mark a node as the "winner" or "loser"

***********

Apologies for the short notice; we are targeting a putback to the Colorado staging gate on Dec 17 if there are no major objections, and hence would request quick reviews. :)

Thanks & Regards,
Zoram, Sambit
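
P.S. For reviewers who want the quorum rule from item 5 of the CMM summary in one place, here is a minimal sketch. It is not code from the webrev: the function name required_quorum_votes() and its parameters are hypothetical, it assumes integer division in Q = (V/2) + 1, and it assumes that a cluster with V > 2 simply falls back to the strong rule even when multiple_partitions is set.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the webrev.
//   total_votes         : V, the total quorum votes
//   multiple_partitions : value of the CCR property of the same name
// Returns Q, the required quorum votes.
std::uint32_t required_quorum_votes(std::uint32_t total_votes,
                                    bool multiple_partitions)
{
    // Weaker membership is only honoured when V is 1 or 2; then Q = 1.
    if (multiple_partitions && total_votes >= 1 && total_votes <= 2) {
        return 1;
    }
    // Strong membership: Q = (V/2) + 1, with integer division (assumed).
    return total_votes / 2 + 1;
}
```

With two total votes, the strong rule requires both votes (Q = 2), while the weaker rule lets each one-vote partition survive on its own (Q = 1), which is why the ping health check is needed in that mode.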