Hi Mohan,

1.  Yes, it is an expected behavior. The active SCs rebooted because the active 
SC saw other active SC in a cluster. PLs rebooted because it lost connection 
with both active SC and standby SC.
To enhance, OpenSAF introduced the SC absence feature documented in 
"src/imm/README.SC_ABSENCE".  If the SC absence feature is enabled, the PL will 
not reboot immediately.  After SCs rebooted and one of them became active,
all PLs which is not in the same partition with the new active SC will be 
rebooted.
For example:
-- We split cluster into 2 partitions [SC-1, PL-1] [SC-2, PL-2, PL-3].
-- After merging the partitions, SC-1 and SC-2 reboot.
-- SC-1 becomes an active SC and SC-2 becomes a standby SC.
-- SC-1 requires PL-2 and PL-3 to reboot. The PL-1 will survive after merging 
network.

2. We are not support this, but it is a promising  approach. To implement it, 
we need a strategy to choose a SC among the active SCs, reassign sus in PLs, 
synchronize IMM data in PLs etc... The survived SC must verify information of 
all unknown PLs and resolve conflicts between PLs. However, I think it's easier 
if we restart PLs.

Best regards,
Hieu


---

** [tickets:#3317] amfnd: two NEW_ACTIVE amfd in split-brain scenario**

**Status:** accepted
**Milestone:** 5.22.11
**Created:** Fri Jun 10, 2022 04:00 AM UTC by Hieu Hong Hoang
**Last Updated:** Fri Jul 29, 2022 05:16 PM UTC
**Owner:** Hieu Hong Hoang


This issue happen when we test the system in a split-brain scenario. We split 
the cluster into partitions as follow: [[SC-1(ACT), SC-2(STB), SC-3], [ 
SC-4(ACT), SC-5(STB), SC-6],[ SC-7, SC-8, SC-9(STB), SC-10(ATC)]] then merge 
all nodes back. The quiesced SC-3 detected active nodes in other partitions up 
while the active SC-1 in the same partition was still alive, therefore no 
service events were raised for the active nodes in other partitions. When the 
SC-1 was down, one of the other active was notified as new active. After the 
new active SC went down, other active was notified. Finally, the SC-3 "amfnd" 
detected two NEW_ACTIVE amfd and rebooted.

Log analysis:

* SC-3 detected active amfd in other partitions up:
~~~
<143>1 2022-05-31T05:34:56.169467+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25315"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.169469+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25316"] MCM:API: LOCAL SVC INFO  : svc_id = AVND(13) | PWE id = 1 | 
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.16947+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25317"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 | 
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.169472+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25318"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169474+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25319"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169476+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25320"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169477+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25321"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169479+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25322"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.16948+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25323"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.169482+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25324"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169484+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25325"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.169486+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25326"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169487+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25327"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169489+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25328"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169491+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25329"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169493+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25330"] MDS:DB: adest_details: <rem_node[0x20a0f]:dest_pid[441]> 
<143>1 2022-05-31T05:34:56.169494+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25331"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.169496+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25332"] MDS:DB: sub_adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169498+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25333"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169499+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25334"] << mds_mcm_svc_up
...

<143>1 2022-05-31T05:34:56.175867+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25497"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.175869+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25498"] MCM:API: LOCAL SVC INFO  : svc_id = AVND(13) | PWE id = 1 | 
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.17587+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25499"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 | 
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.175872+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25500"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175874+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25501"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175875+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25502"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175877+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25503"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175879+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25504"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175881+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25505"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175882+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25506"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175885+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25507"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.175887+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25508"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175888+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25509"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.17589+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25510"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.175891+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25511"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.175893+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25512"] MDS:DB: adest_details: <rem_node[0x2040f]:dest_pid[441]> 
<143>1 2022-05-31T05:34:56.175895+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25513"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.175897+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25514"] MDS:DB: sub_adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175898+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25515"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.1759+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25516"] << mds_mcm_svc_up
~~~
* SC-1 went down and SC-4 was notified as new active:
~~~
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535 
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got 
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-4 went down and SC-10 was notified as new active:
~~~
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535 
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got 
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-3 rebooted because it detected two active amfd:
~~~
2022-05-23 14:41:16.878 SC-3 osafamfnd[454]: Rebooting OpenSAF NodeId = 2030f 
EE Name = , Reason: AVD already up, OwnNodeId = 2030f, SupervisionTime = 60
2022-05-23 14:41:16.890 SC-3 opensaf_reboot: Rebooting local node; timeout=60
~~~


---

Sent from sourceforge.net because [email protected] is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

Reply via email to