Hi Gary,
I have a few questions:
1. Do we really want to reboot both nodes in case of a conflict?
2. Even if we want to send the reboot to only one node, which node should 
receive it: the one that was part of the smaller cluster?
3. If we could tell that the conflict happened because of a re-merge, would a 
susi_delete message be enough instead of a reboot (here, too, we would need to 
decide which SU's susi to delete)? Rebooting would be a little too harsh for 
other applications running on the nodes; this is just my understanding.
4. In general, what do we assume when the partitions merge? The applications 
will certainly be out of sync, so is deleting the susi enough, or do we 
definitely need to reboot? I ask because I am not very familiar with the actual 
application-level impact (in terms of the database, its behavior, etc.).
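
To make question 2 concrete, here is a rough sketch of the selection rule I 
have in mind. Everything in it is hypothetical: NodeInfo, partition_size and 
pick_nodes_to_reboot are illustration names, not existing AMF code, and it 
assumes amfd can learn how many nodes each side could see during the split, 
which may not be available today.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-node data amfd would need.
// This is NOT an OpenSAF type.
struct NodeInfo {
  std::string name;
  std::size_t partition_size;  // assumed: nodes visible to this node during the split
};

// Returns the names of the nodes to reboot when two nodes report
// conflicting 2N assignments. The node from the smaller partition is
// chosen; on a tie we fall back to rebooting both, as the patch does.
std::vector<std::string> pick_nodes_to_reboot(const NodeInfo &a,
                                              const NodeInfo &b) {
  if (a.name == b.name) return {a.name};  // same node, one reboot is enough
  if (a.partition_size < b.partition_size) return {a.name};
  if (b.partition_size < a.partition_size) return {b.name};
  return {a.name, b.name};  // cannot decide, reboot both
}
```

If the partition sizes cannot be known after the merge, this rule does not 
apply and rebooting both nodes may be the only safe option.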
 
Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services
 
 
--------- Original Message ---------
Subject: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee" <gary....@dektech.com.au>
Date: 8/31/18 12:17 pm
To: hans.nordeb...@ericsson.com, minh.c...@dektech.com.au, 
nagen...@hasolutions.in
Cc: opensaf-devel@lists.sourceforge.net, "Gary Lee" <gary....@dektech.com.au>

After a split network event, both SCs can reboot endlessly,
 due to this assertion:
 
 2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596:
 avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
 2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. 
Rebooting node
 
 During the network split, a SC could assign another SU to be active,
 if the node hosting the old active 2N assignment is not reachable.
 
 The assert occurs after the network is merged. SC absence must be
 enabled.
 
 For now, we can aid recovery of the cluster by rebooting
 both of the PLs in place of the assertion.
 ---
 src/amf/amfd/sg_2n_fsm.cc | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)
 
 diff --git a/src/amf/amfd/sg_2n_fsm.cc b/src/amf/amfd/sg_2n_fsm.cc
 index c7d584473..3ba1dc6c8 100644
 --- a/src/amf/amfd/sg_2n_fsm.cc
 +++ b/src/amf/amfd/sg_2n_fsm.cc
 @@ -593,8 +593,39 @@ static AVD_SU_SI_REL *avd_sg_2n_act_susi(AVD_CL_CB *cb, AVD_SG *sg,
 osafassert(a_susi_1->su == s_susi_2->su);
 osafassert(a_susi_2->su == s_susi_1->su);
 } else {
 - osafassert(a_susi_1->su == a_susi_2->su);
 - osafassert(s_susi_1->su == s_susi_2->su);
 + if (a_susi_1->su != a_susi_2->su) {
 + // Duplicate 2N active assignments found, probably after split brain
 + // Reboot both nodes hosting the SUs to recover
 +
 + LOG_EM("Duplicate 2N active assignments in '%s' and '%s'",
 + a_susi_1->su->name.c_str(), a_susi_2->su->name.c_str());
 +
 + LOG_EM("Sending node reboot order to '%s'",
 + a_susi_1->su->su_on_node->name.c_str());
 + avd_send_reboot_msg_directly(a_susi_1->su->su_on_node);
 +
 + if (a_susi_1->su->su_on_node != a_susi_2->su->su_on_node) {
 + LOG_EM("Sending node reboot order to '%s'",
 + a_susi_2->su->su_on_node->name.c_str());
 + avd_send_reboot_msg_directly(a_susi_2->su->su_on_node);
 + }
 + } else if (s_susi_1->su != s_susi_2->su) {
 + // Duplicate 2N standby assignments found
 + // Reboot both nodes hosting the SUs to recover
 +
 + LOG_EM("Duplicate 2N standby assignments in '%s' and '%s'",
 + s_susi_1->su->name.c_str(), s_susi_2->su->name.c_str());
 +
 + LOG_EM("Sending node reboot order to '%s'",
 + s_susi_1->su->su_on_node->name.c_str());
 + avd_send_reboot_msg_directly(s_susi_1->su->su_on_node);
 +
 + if (s_susi_1->su->su_on_node != s_susi_2->su->su_on_node) {
 + LOG_EM("Sending node reboot order to '%s'",
 + s_susi_2->su->su_on_node->name.c_str());
 + avd_send_reboot_msg_directly(s_susi_2->su->su_on_node);
 + }
 + }
 }
 a_susi = a_susi_1;
 s_susi = s_susi_1;
 -- 
 2.17.1
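
The target selection in the patch above (reboot the node hosting the first SU, 
then the node hosting the second SU only if it is a different node) can be 
modelled in isolation as below. This is a minimal sketch, not the amfd code: 
Node and Su are simplified, hypothetical stand-ins for the real AMF types, and 
nodes_to_reboot is an illustration name that returns the targets instead of 
calling avd_send_reboot_msg_directly().

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical stand-ins for the AMF node and SU objects.
struct Node {
  std::string name;
};
struct Su {
  std::string name;
  Node *su_on_node;  // node hosting this SU
};

// Returns the names of the nodes that would receive a reboot order when
// two assignments with the same 2N role are found. Empty when both
// assignments belong to the same SU, i.e. there is no conflict.
std::vector<std::string> nodes_to_reboot(const Su &su_1, const Su &su_2) {
  std::vector<std::string> targets;
  if (&su_1 == &su_2) return targets;  // same SU, nothing to recover
  targets.push_back(su_1.su_on_node->name);
  // Avoid sending a second order when both SUs live on the same node.
  if (su_1.su_on_node != su_2.su_on_node) {
    targets.push_back(su_2.su_on_node->name);
  }
  return targets;
}
```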