I will try to reproduce it and will let you know whether it is reproducible on 
the latest.
I have tested such scenarios before, though, and did not run into this situation.
Thanks
Mohan
High Availability Solutions(www.GetHighAvailability.com)


---

** [tickets:#2074] amfd asserted on rebooted controllers continuously after 
split brain scenario (headless)**

**Status:** accepted
**Milestone:** future
**Created:** Tue Sep 27, 2016 12:14 PM UTC by Srikanth R
**Last Updated:** Mon Oct 10, 2022 12:04 PM UTC
**Owner:** Mohan  Kanakam


Setup : 
SLES 11 Physical machine
Changeset :7997 5.1 FC
2 controllers and 2 payloads with headless feature enabled.
2N application with 3 SUs. (AmfDemo).

Issue :

amfd asserted on the controllers continuously, on every reboot, after the 
initial split brain scenario was observed.


Steps performed :

-> Initially brought up four nodes and all the nodes joined the cluster.

-> Brought up the 2N application, with SUs hosted on SC-1 ,SC-2 and PL-3 
successfully.

-> Performed some operations on the AMF objects, after which the cluster was 
left in an idle state.

-> After a gap of 2 weeks, a momentary unplugging of the cable(s) generated an 
MDS down event on both controllers, which resulted in a split brain scenario.


Sep 24 21:36:40 SLES-SLOT1 osafimmd[2729]: NO MDS event from svc_id 25 
(change:3, dest:565214187380752)
Sep 24 21:36:40 SLES-SLOT1 kernel: [1297950.833811] TIPC: Established link 
<1.1.1:em1-1.1.2:em1> on network plane A
Sep 24 21:36:40 SLES-SLOT1 osafrded[2710]: Rebooting OpenSAF NodeId = 0 EE Name 
= No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343, 
SupervisionTime = 60


Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: NO Got peer info request from node 
0x2010f with role ACTIVE
Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: Rebooting OpenSAF NodeId = 0 EE Name 
= No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599, 
SupervisionTime = 60


-> As the headless feature is enabled, the payloads did not reboot.

-> Once the controllers joined the payloads, amfd asserted on the rebooted 
controller and the controllers went for reboot.
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: NO Received node_up from 2010f: 
msg_id 1
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: siass.cc:953: avd_susi_recreate: 
Assertion 'su' failed.
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: Rebooting OpenSAF NodeId = 131343 
EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, 
OwnNodeId = 131343, SupervisionTime = 60


Below is the backtrace :

#0  0x00007f1d28510b55 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f1d28512131 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f1d2a397197 in __osafassert_fail (__file=0x517c15 "siass.cc", 
__line=953, __func=0x518250 
<avd_susi_recreate(avsv_n2d_nd_sisu_state_msg_info_tag*)::__FUNCTION__> 
"avd_susi_recreate", 
    __assertion=0x517d01 "su") at sysf_def.c:281
No locals.
#3  0x00000000004c56a5 in avd_susi_recreate (info=0x7f1d20008ec8) at 
siass.cc:953
        su = 0x0
        __FUNCTION__ = "avd_susi_recreate"
        susi = 0x0
        node = 0x7bfdf0
        susi_state = 0x0
        su_state = 0x7f1d200055a0
        __PRETTY_FUNCTION__ = "SaAisErrorT 
avd_susi_recreate(AVSV_N2D_ND_SISU_STATE_MSG_INFO*)"
#4  0x0000000000459943 in avd_process_state_info_queue (cb=0x75cba0 
<_control_block>) at ndfsm.cc:78
        n2d_msg = 0x7f1d20008ec0
        i = 0
        queue_size = 4
        queue_evt = 0x7a9b60
        act_amfnd_node_up_count = 1
        found_state_info = true
        __FUNCTION__ = "avd_process_state_info_queue"
#5  0x000000000045a50f in avd_node_up_evh (cb=0x75cba0 <_control_block>, 
evt=0x7f1d20008880) at ndfsm.cc:363
        avnd = 0x7bf380
        n2d_msg = 0x7f1d20004b30
        rc = 1
        sync_nd_size = 4
        act_nd = true
        __FUNCTION__ = "avd_node_up_evh"
#6  0x0000000000453d78 in process_event (cb_now=0x75cba0 <_control_block>, 
evt=0x7f1d20008880) at main.cc:768
        __FUNCTION__ = "process_event"
#7  0x0000000000453a9b in main_loop () at main.cc:689
        pollretval = 1
        cb = 0x75cba0 <_control_block>
        evt = 0x7f1d20008880
        mbx_fd = {raise_obj = 11, rmv_obj = 12}
        error = SA_AIS_OK
        polltmo = -1
        term_fd = 17
        __FUNCTION__ = "main_loop"
#8  0x0000000000454017 in main (argc=2, argv=0x7fff50cd9958) at main.cc:841


Suggested recovery :

 During a split brain scenario, payloads should be ordered to reboot even when 
the headless feature is enabled.

