After a network split, both SCs can end up in an endless reboot loop
due to this assertion:

2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596:
  avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. Rebooting node

During the network split, an SC may assign the active role to another SU
if the node hosting the old active 2N assignment is unreachable.

The assertion is hit after the network merges again. SC absence must be
enabled for this to happen.
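
As an illustration only, the following C++ sketch models the 2N invariant
behind the failing assertion: all active assignments in a 2N SG are expected
to belong to a single SU (and likewise for standby). The types `HaState` and
`Assignment`, the helpers `same_su_for_state()` and
`merged_view_after_split()`, and the SU/SI names are hypothetical
simplifications, not OpenSAF's `AVD_SU_SI_REL` structures. The merged view
simply assumes that each side of the split left a different SU holding an
active assignment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical HA state of an SU-SI assignment (not the SaAmfHAStateT enum).
enum class HaState { kActive, kStandby };

// Hypothetical, simplified stand-in for AVD_SU_SI_REL.
struct Assignment {
  std::string si;  // service instance name
  std::string su;  // service unit holding the assignment
  HaState state;   // HA state of this assignment
};

// Returns true if every assignment in the given HA state belongs to the same
// SU. A generalized form of osafassert(a_susi_1->su == a_susi_2->su).
bool same_su_for_state(const std::vector<Assignment>& assignments,
                       HaState state) {
  const std::string* first_su = nullptr;
  for (const Assignment& a : assignments) {
    if (a.state != state) continue;
    if (first_su == nullptr) {
      first_su = &a.su;
    } else if (*first_su != a.su) {
      return false;  // two different SUs hold this HA state
    }
  }
  return true;
}

// Illustrative merged view: each side of the split left a different SU
// holding an active assignment.
std::vector<Assignment> merged_view_after_split() {
  return {
      {"safSi=SI1", "safSu=SU1", HaState::kActive},
      {"safSi=SI2", "safSu=SU2", HaState::kActive},
  };
}
```

Here `same_su_for_state(merged_view_after_split(), HaState::kActive)` returns
`false`, which corresponds to the state in which amfd previously asserted.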

For now, aid recovery of the cluster by replacing the assertion with
reboot orders to the nodes (PLs) hosting the two conflicting SUs, or to
the single node if both SUs are hosted on it.
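
A hypothetical, simplified model of the recovery decision in the patched
`else` branch might look like the sketch below. `SuInfo`, `distinct_nodes()`
and `recovery_nodes()` are illustrative names, not AMF APIs; SUs and nodes
are compared by name here, whereas the patch compares the `su` and
`su_on_node` pointers. The sketch does not model the preceding branch that
accepts swapped active/standby SUs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an SU and the node hosting it (not AVD_SU).
struct SuInfo {
  std::string name;  // SU name
  std::string node;  // name of the hosting node
};

// Returns the hosting nodes of two conflicting SUs, listing a node only once
// when both SUs are hosted on it.
std::vector<std::string> distinct_nodes(const SuInfo& x, const SuInfo& y) {
  std::vector<std::string> nodes{x.node};
  if (y.node != x.node) {
    nodes.push_back(y.node);
  }
  return nodes;
}

// Models the patched else-branch: duplicate actives are handled first, then
// duplicate standbys; a consistent 2N state yields no reboot orders.
std::vector<std::string> recovery_nodes(const SuInfo& a1, const SuInfo& a2,
                                        const SuInfo& s1, const SuInfo& s2) {
  if (a1.name != a2.name) return distinct_nodes(a1, a2);
  if (s1.name != s2.name) return distinct_nodes(s1, s2);
  return {};
}
```

Duplicate actives take precedence, matching the `else if` ordering, and a
node is listed only once when both conflicting SUs run on it, which is why
the patch checks `su_on_node` before sending the second reboot order.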
---
 src/amf/amfd/sg_2n_fsm.cc | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/amf/amfd/sg_2n_fsm.cc b/src/amf/amfd/sg_2n_fsm.cc
index c7d584473..3ba1dc6c8 100644
--- a/src/amf/amfd/sg_2n_fsm.cc
+++ b/src/amf/amfd/sg_2n_fsm.cc
@@ -593,8 +593,39 @@ static AVD_SU_SI_REL *avd_sg_2n_act_susi(AVD_CL_CB *cb, AVD_SG *sg,
       osafassert(a_susi_1->su == s_susi_2->su);
       osafassert(a_susi_2->su == s_susi_1->su);
     } else {
-      osafassert(a_susi_1->su == a_susi_2->su);
-      osafassert(s_susi_1->su == s_susi_2->su);
+      if (a_susi_1->su != a_susi_2->su) {
+        // Duplicate 2N active assignments found, probably after split brain
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N active assignments in '%s' and '%s'",
+          a_susi_1->su->name.c_str(), a_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+          a_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(a_susi_1->su->su_on_node);
+
+        if (a_susi_1->su->su_on_node != a_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+            a_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(a_susi_2->su->su_on_node);
+        }
+      } else if (s_susi_1->su != s_susi_2->su) {
+        // Duplicate 2N standby assignments found
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N standby assignments in '%s' and '%s'",
+          s_susi_1->su->name.c_str(), s_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+          s_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(s_susi_1->su->su_on_node);
+
+        if (s_susi_1->su->su_on_node != s_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+            s_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(s_susi_2->su->su_on_node);
+        }
+      }
     }
     a_susi = a_susi_1;
     s_susi = s_susi_1;
-- 
2.17.1


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to