- Description has changed:

Diff:

~~~~

--- old
+++ new
@@ -14,6 +14,8 @@
 
 After the network merges, SC-1 and SC-2 may both reboot after they detect 
split brain.
 
+`2018-08-30 13:30:15.010 SC-1 osafrded[178]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343, 
SupervisionTime = 60`
+
 `2018-08-30 13:30:15.008 SC-2 osafrded[178]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599, 
SupervisionTime = 60`
 
 Then PL-3 and PL-4 will sync these duplicated active assignments to AMFD, and 
cause the assertion in AMFD.

~~~~




---

**[tickets:#2920] amfd: cyclic SC reboot after split network**

**Status:** unassigned
**Milestone:** 5.18.08
**Created:** Thu Aug 30, 2018 03:33 AM UTC by Gary Lee
**Last Updated:** Thu Aug 30, 2018 03:36 AM UTC
**Owner:** nobody


After a split network event, both SCs can reboot endlessly, due to this 
assertion:

`2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596: 
avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.`

`2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. 
Rebooting node`

To reproduce, enable SC absence, and split a network into two partitions.

Partition 1 contains SC-1, PL-3
Partition 2 contains SC-2, PL-4, PL-5

Before the split, PL-3 is active for a 2N SG. PL-4 is standby.

During the split, SC-2 may assign PL-4 to be active.

After the network merges, SC-1 and SC-2 may both reboot after they detect split 
brain.

`2018-08-30 13:30:15.010 SC-1 osafrded[178]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343, 
SupervisionTime = 60`

`2018-08-30 13:30:15.008 SC-2 osafrded[178]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599, 
SupervisionTime = 60`

Then PL-3 and PL-4 will sync these duplicated active assignments to AMFD, and 
cause the assertion in AMFD.
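
The condition behind the assertion can be illustrated with a minimal sketch. 
This is not the OpenSAF code: `Assignment`, `ActiveSus` and 
`HasDuplicateActive` are hypothetical names, and the SU and SI names used with 
them are made up. It only shows that a 2N SG expects at most one SU to report 
an active assignment for an SI, and that the state synced from PL-3 and PL-4 
after the merge breaks that expectation.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for AMF assignment state.
// These are NOT the real OpenSAF types.
enum class HaState { Active, Standby };

struct Assignment {
  std::string node;  // node reporting the assignment, e.g. "PL-3"
  std::string su;    // service unit holding the assignment
  std::string si;    // service instance being assigned
  HaState ha_state;
};

// Collects the SUs that report an active assignment for the given SI.
// In a healthy 2N SG the result has at most one element.
std::set<std::string> ActiveSus(const std::vector<Assignment>& synced,
                                const std::string& si) {
  std::set<std::string> sus;
  for (const auto& a : synced) {
    if (a.si == si && a.ha_state == HaState::Active) sus.insert(a.su);
  }
  return sus;
}

// True when more than one SU is active for the SI, i.e. the kind of
// state that 'a_susi_1->su == a_susi_2->su' rejects.
bool HasDuplicateActive(const std::vector<Assignment>& synced,
                        const std::string& si) {
  return ActiveSus(synced, si).size() > 1;
}
```

With one active assignment synced from PL-3 and another from PL-4 for the same 
SI, `HasDuplicateActive` returns true; with PL-4 still standby it returns false.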

The user must then recover the cluster manually, either by rebooting the whole 
cluster or by rebooting one of PL-3 / PL-4.

[#2918] addresses issues such as this, but for now, we can aid recovery of the 
cluster by rebooting one of the PLs in place of the assertion. 
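
A rough sketch of what "reboot one of the PLs in place of the assertion" could 
look like follows. It is an illustration under stated assumptions, not the 
actual fix: `ActiveSu` and `NodeToReboot` are hypothetical names, and the 
tie-break (reboot the node whose name sorts last) is an arbitrary choice made 
for the example, since the ticket does not say which PL should be rebooted.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record of an SU that reports an active assignment.
// NOT a real OpenSAF type.
struct ActiveSu {
  std::string su;    // service unit name
  std::string node;  // node hosting the SU, e.g. "PL-4"
};

// Returns the name of the node to reboot, or an empty string when all
// active assignments belong to the same SU and no recovery is needed.
std::string NodeToReboot(const std::vector<ActiveSu>& actives) {
  if (actives.empty()) return "";

  const std::string& first_su = actives.front().su;
  const bool same_su =
      std::all_of(actives.begin(), actives.end(),
                  [&](const ActiveSu& a) { return a.su == first_su; });
  if (same_su) return "";  // consistent state, nothing to do

  // Assumed policy for this sketch only: reboot the node whose name
  // sorts last, so the other node keeps its active assignment.
  const auto victim = std::max_element(
      actives.begin(), actives.end(),
      [](const ActiveSu& x, const ActiveSu& y) { return x.node < y.node; });
  return victim->node;
}
```

The point of the design is that an inconsistent state yields a recovery action 
on a single payload node, whereas the assertion brings down AMFD on the SC and 
triggers a node reboot there.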



