- Description has changed:
Diff:
~~~~
--- old
+++ new
@@ -1,25 +1,17 @@
In the event of a split network that separates the nodes of a cluster into
multiple partitions, each partition may have one SC or none. When the network
merges back, both SCs self-fence and reboot (current OpenSAF behavior), leaving
multiple partitions as headless clusters. Likewise, if the SC in each partition
shuts down before the network merges back, the partitions are also left
headless. Once an SC comes back, multiple headless clusters join into a single
cluster, and these headless clusters will conflict in terms of IMM data and AMF
assignments.
-In order to address this problem, this ticket introduces partition selection
in CLM, in which CLM is responsible for selecting only one partition among
others to join the cluster.
-Some changes:
-- Node Join request will add two fields:
- (1) previous_director_id: The node id of the director that this node
previously belonged to
- (2) previous_active_duration: The duration for which this node was previously
in contact with CLMD
-- Once an SC starts from multiple headless clusters, the clmna of each node
will send a node_join_request carrying the two new fields
-- CLMD will collect all node_join_requests within a partition selection timer.
The partition selection routine starts either when all node_join_requests have
been collected or when the partition selection timer expires.
- . How to start the partition selection timer: when a node_join_request
arrives, the timer is started (first time) or restarted (second request
onwards) until all requests are received. The timer should be relatively small,
perhaps 5 seconds, and configurable. This way, the timer value does not need to
change if the cluster scales up to many nodes; it effectively just waits for
the next node_join_request to arrive.
-- Before the partition selection routine finishes, all CLM Initialize API
calls will return TRY_AGAIN
+In order to address this problem, this ticket introduces partition selection
in IMM, in which IMM is responsible for selecting the payloads of only one
partition to stay alive; the others will be rebooted when all nodes join into a
single cluster.
+
+To do that, each IMMND will hold an additional piece of cluster-wide
information: the node id at which its active IMMD is located, and a unique id
sent by that active IMMD. Based on these two pieces of information, IMM can
distinguish whether an IMMND used to be in the same partition or in another
one. When an SC comes up from headless, one of these veteran IMMNDs is elected
to be the coord, and any IMMND whose data differs from the coord's will order a
reboot of its local node upon receiving the intro response from the active
IMMD.
For example:
Normal cluster: SC1, SC2, PL3, PL4, PL5, PL6, PL7, PL8
Split network first time:
- P#1: PL3, PL4 (previously has SC1 as active SC), active duration: 100 secs
+ P#1: PL3, PL4 (previously has SC1 as active SC, and unique id: 1111)
P#2: SC1, SC2, PL5, PL6, PL7, PL8
Split network second time:
- P#1: PL3, PL4 (previously has SC1 as active SC), active duration: 100 secs
- P#2: SC1, PL5, PL6 (active duration should be greater than 100 secs, say
200 secs)
- P#3: SC2, PL7, PL8 (active duration should be greater than 100 secs, say
200 secs)
-After a network merge (both SCs reboot), or a shutdown of both SCs, SC1 is
then started and the partition selection routine only allows PL5 and PL6 to
join the cluster. PL7 and PL8 are rebooted because they belong to the smaller
cluster, which had SC2 as active. PL3 and PL4 are rebooted because they have a
much shorter active duration, even though they belong to SC1's cluster.
-
-The outcome of the partition selection routine is that it only allows one
partition to join; it may choose one among several even if all satisfy the
same criteria.
+ P#1: PL3, PL4 (previously has SC1 as active SC, and unique id: 1111)
+ P#2: SC1, PL5, PL6 (has SC1 as active, and unique id: 2222)
+ P#3: SC2, PL7, PL8 (has SC2 as active, and unique id: 3333)
+After a network merge (both SCs reboot), or a shutdown of both SCs, SC1 then
takes the active role and elects the IMMND on PL5 to be the coord. When the
IMMNDs on PL3, PL4, PL7, PL8 request to sync data, they are rejected by the
active IMMD, and these nodes are rebooted afterward.
~~~~
---
** [tickets:#2936] imm: Select one from multiple headless partitioned cluster
to join into one cluster**
**Status:** assigned
**Milestone:** 5.18.12
**Created:** Thu Oct 04, 2018 08:47 AM UTC by Minh Hon Chau
**Last Updated:** Mon Dec 10, 2018 03:07 AM UTC
**Owner:** Vu Minh Nguyen
In the event of a split network that separates the nodes of a cluster into
multiple partitions, each partition may have one SC or none. When the network
merges back, both SCs self-fence and reboot (current OpenSAF behavior), leaving
multiple partitions as headless clusters. Likewise, if the SC in each partition
shuts down before the network merges back, the partitions are also left
headless. Once an SC comes back, multiple headless clusters join into a single
cluster, and these headless clusters will conflict in terms of IMM data and AMF
assignments.
In order to address this problem, this ticket introduces partition selection in
IMM, in which IMM is responsible for selecting the payloads of only one
partition to stay alive; the others will be rebooted when all nodes join into a
single cluster.
To do that, each IMMND will hold an additional piece of cluster-wide
information: the node id at which its active IMMD is located, and a unique id
sent by that active IMMD. Based on these two pieces of information, IMM can
distinguish whether an IMMND used to be in the same partition or in another
one. When an SC comes up from headless, one of these veteran IMMNDs is elected
to be the coord, and any IMMND whose data differs from the coord's will order a
reboot of its local node upon receiving the intro response from the active
IMMD.
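The check described above can be sketched as follows. This is a minimal
illustration only; the names `PartitionInfo` and `must_reboot` are hypothetical
and do not correspond to actual OpenSAF identifiers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionInfo:
    """Cluster-wide data each veteran IMMND remembers from before headless."""
    active_immd_node_id: int  # node id where its active IMMD was located
    unique_id: int            # unique id issued by that active IMMD


def must_reboot(local: PartitionInfo, coord: PartitionInfo) -> bool:
    """An IMMND whose remembered data differs from the elected coord's
    belonged to another former partition and must reboot its local node."""
    return local != coord
```

For instance, with the coord elected in a partition whose unique id is 2222,
an IMMND that remembers unique id 3333 (or a different active IMMD node id)
would order a reboot of its node.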
For example:
Normal cluster: SC1, SC2, PL3, PL4, PL5, PL6, PL7, PL8
Split network first time:
P#1: PL3, PL4 (previously has SC1 as active SC, and unique id: 1111)
P#2: SC1, SC2, PL5, PL6, PL7, PL8
Split network second time:
P#1: PL3, PL4 (previously has SC1 as active SC, and unique id: 1111)
P#2: SC1, PL5, PL6 (has SC1 as active, and unique id: 2222)
P#3: SC2, PL7, PL8 (has SC2 as active, and unique id: 3333)
After a network merge (both SCs reboot), or a shutdown of both SCs, SC1 then
takes the active role and elects the IMMND on PL5 to be the coord. When the
IMMNDs on PL3, PL4, PL7, PL8 request to sync data, they are rejected by the
active IMMD, and these nodes are rebooted afterward.
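The outcome of this example can be replayed in a short sketch. The
(node id, unique id) pairs below are taken from the scenario above; the
variable names are hypothetical:

```python
# Each veteran IMMND remembers the pair (active IMMD node id, unique id).
# Once the IMMND on PL5 is elected coord, every other IMMND's pair is
# compared against the coord's; a mismatch means a different former
# partition, so that node is rebooted.
nodes = {
    "PL3": (1, 1111), "PL4": (1, 1111),  # old SC1 partition, stale unique id
    "PL5": (1, 2222), "PL6": (1, 2222),  # coord's partition: survives
    "PL7": (2, 3333), "PL8": (2, 3333),  # SC2's partition
}
coord_data = nodes["PL5"]
rebooted = sorted(name for name, data in nodes.items() if data != coord_data)
# rebooted -> ["PL3", "PL4", "PL7", "PL8"]
```

Note that PL3 and PL4 are rejected even though they also name SC1's node as
their active IMMD: their unique id (1111) is stale, which is why the node id
alone is not sufficient and the unique id is needed as well.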