I will try to reproduce this on the latest version and will let you know
whether it is reproducible.
I have tested such scenarios before, though, and did not run into this situation.
Thanks
Mohan
High Availability Solutions(www.GetHighAvailability.com)
---
**[tickets:#2074] amfd asserted on rebooted controllers continuously after
split brain scenario (headless)**
**Status:** accepted
**Milestone:** future
**Created:** Tue Sep 27, 2016 12:14 PM UTC by Srikanth R
**Last Updated:** Mon Oct 10, 2022 12:04 PM UTC
**Owner:** Mohan Kanakam
Setup :
SLES 11 Physical machine
Changeset : 7997 5.1 FC
2 controllers and 2 payloads with the headless feature enabled.
2N application (AmfDemo) with 3 SUs.
Issue :
amfd asserts on the controllers continuously, on every reboot, after an initial
split-brain scenario is observed.
Steps performed :
-> Brought up four nodes; all of them joined the cluster.
-> Brought up the 2N application successfully, with SUs hosted on SC-1, SC-2
and PL-3.
-> Performed some operations on the AMF objects, then left the cluster idle.
-> After about 2 weeks, a momentary unplugging of the cable(s) generated an MDS
down event on both controllers, which resulted in a split-brain scenario.
Sep 24 21:36:40 SLES-SLOT1 osafimmd[2729]: NO MDS event from svc_id 25
(change:3, dest:565214187380752)
Sep 24 21:36:40 SLES-SLOT1 kernel: [1297950.833811] TIPC: Established link
<1.1.1:em1-1.1.2:em1> on network plane A
Sep 24 21:36:40 SLES-SLOT1 osafrded[2710]: Rebooting OpenSAF NodeId = 0 EE Name
= No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343,
SupervisionTime = 60
Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: NO Got peer info request from node
0x2010f with role ACTIVE
Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: Rebooting OpenSAF NodeId = 0 EE Name
= No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599,
SupervisionTime = 60
-> Because the headless feature is enabled, the payloads did not reboot.
-> Once the rebooted controllers rejoined the payloads, amfd asserted on the
rebooted controller and the controllers rebooted again.
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: NO Received node_up from 2010f:
msg_id 1
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: siass.cc:953: avd_susi_recreate:
Assertion 'su' failed.
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: Rebooting OpenSAF NodeId = 131343
EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received,
OwnNodeId = 131343, SupervisionTime = 60
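The logs above use two notations for node IDs: decimal in the reboot messages
(OwnNodeId = 131343, 131599) and hexadecimal in the node_up and peer info
messages (2010f, 0x2010f). A minimal Python sketch for correlating them follows.
The helper names are illustrative only, and the idea that the upper 32 bits of
an MDS destination carry the node ID is an assumption, not something stated in
this ticket.

```python
def node_id_hex(node_id: int) -> str:
    """Format a decimal node ID the way the node_up / peer info logs print it."""
    return f"0x{node_id:x}"


def node_id_from_mds_dest(dest: int) -> int:
    """ASSUMPTION: the node ID is the upper 32 bits of the 64-bit MDS dest."""
    return dest >> 32


print(node_id_hex(131343))  # 0x2010f (OwnNodeId logged on SLES-SLOT1)
print(node_id_hex(131599))  # 0x2020f (OwnNodeId logged on SLES-SLOT2)
print(node_id_hex(node_id_from_mds_dest(565214187380752)))  # 0x2020f
```

Read this way, the node_up that precedes the assertion on SLES-SLOT1 comes from
SLES-SLOT1 itself (2010f). If the assumption about MDS destinations holds, the
MDS event logged on SLES-SLOT1 refers to the peer controller (0x2020f).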
Below is the backtrace :
#0 0x00007f1d28510b55 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f1d28512131 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f1d2a397197 in __osafassert_fail (__file=0x517c15 "siass.cc",
__line=953, __func=0x518250
<avd_susi_recreate(avsv_n2d_nd_sisu_state_msg_info_tag*)::__FUNCTION__>
"avd_susi_recreate",
__assertion=0x517d01 "su") at sysf_def.c:281
No locals.
#3 0x00000000004c56a5 in avd_susi_recreate (info=0x7f1d20008ec8) at
siass.cc:953
su = 0x0
__FUNCTION__ = "avd_susi_recreate"
susi = 0x0
node = 0x7bfdf0
susi_state = 0x0
su_state = 0x7f1d200055a0
__PRETTY_FUNCTION__ = "SaAisErrorT
avd_susi_recreate(AVSV_N2D_ND_SISU_STATE_MSG_INFO*)"
#4 0x0000000000459943 in avd_process_state_info_queue (cb=0x75cba0
<_control_block>) at ndfsm.cc:78
n2d_msg = 0x7f1d20008ec0
i = 0
queue_size = 4
queue_evt = 0x7a9b60
act_amfnd_node_up_count = 1
found_state_info = true
__FUNCTION__ = "avd_process_state_info_queue"
#5 0x000000000045a50f in avd_node_up_evh (cb=0x75cba0 <_control_block>,
evt=0x7f1d20008880) at ndfsm.cc:363
avnd = 0x7bf380
n2d_msg = 0x7f1d20004b30
rc = 1
sync_nd_size = 4
act_nd = true
__FUNCTION__ = "avd_node_up_evh"
#6 0x0000000000453d78 in process_event (cb_now=0x75cba0 <_control_block>,
evt=0x7f1d20008880) at main.cc:768
__FUNCTION__ = "process_event"
#7 0x0000000000453a9b in main_loop () at main.cc:689
pollretval = 1
cb = 0x75cba0 <_control_block>
evt = 0x7f1d20008880
mbx_fd = {raise_obj = 11, rmv_obj = 12}
error = SA_AIS_OK
polltmo = -1
term_fd = 17
__FUNCTION__ = "main_loop"
#8 0x0000000000454017 in main (argc=2, argv=0x7fff50cd9958) at main.cc:841
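Frame #2 of the backtrace and the osafamfd syslog line report the same site:
siass.cc:953 in avd_susi_recreate, with su = 0x0. Since the assertion is
reported on every reboot, counting such lines per host is one way to confirm
the pattern from collected syslogs. The sketch below is a hypothetical helper,
not an OpenSAF tool, and it assumes the line format seen in the single
assertion message quoted in this report.

```python
import re
from collections import Counter

# Pattern modelled on the one osafamfd assertion line quoted in this ticket.
ASSERT_RE = re.compile(
    r"(?P<host>\S+) osafamfd\[\d+\]: (?P<file>[\w.]+):(?P<line>\d+): "
    r"(?P<func>\w+):\s+Assertion '(?P<expr>[^']*)' failed\."
)


def count_assertions(log_text: str) -> Counter:
    """Count assertion failures keyed by (host, file, line, function, expr)."""
    # Mail clients wrap long syslog lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return Counter(
        (m["host"], m["file"], int(m["line"]), m["func"], m["expr"])
        for m in ASSERT_RE.finditer(flat)
    )


sample = """Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: siass.cc:953: avd_susi_recreate:
Assertion 'su' failed."""

for key, hits in count_assertions(sample).items():
    print(key, hits)
```

Running the same function over the syslogs of both controllers would show
whether every reboot hits the same file, line and expression.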
Suggested recovery :
During a split-brain scenario, payloads should be ordered to reboot even when
the headless feature is enabled.