Hi Mohan,
1. Yes, this is expected behavior. The active SCs rebooted because each active
SC saw another active SC in the cluster. The PLs rebooted because they lost
connection with both the active SC and the standby SC.
To improve on this, OpenSAF introduced the SC absence feature, documented in
"src/imm/README.SC_ABSENCE". If the SC absence feature is enabled, the PLs will
not reboot immediately. After the SCs have rebooted and one of them has become
active, all PLs that were not in the same partition as the new active SC will
be rebooted.
For example:
-- We split the cluster into 2 partitions: [SC-1, PL-1] and [SC-2, PL-2, PL-3].
-- After the partitions are merged, SC-1 and SC-2 reboot.
-- SC-1 becomes the active SC and SC-2 becomes the standby SC.
-- SC-1 requires PL-2 and PL-3 to reboot. PL-1 survives the network merge.
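
The rule in the example above can be written down as a small sketch. This is
only an illustration of the description in this mail, not OpenSAF code: the
function name `pls_to_reboot`, the list-of-lists partition layout, and the
"PL-" name prefix check are assumptions made for the example.

```python
def pls_to_reboot(partitions, new_active_sc):
    """Return the PLs that were outside the new active SC's partition.

    partitions: list of lists of node names as they were before the merge
    (hypothetical layout, chosen for this sketch).
    new_active_sc: name of the SC that became active after the SCs rebooted.
    """
    # Find the pre-merge partition that contained the new active SC.
    home = next((p for p in partitions if new_active_sc in p), None)
    if home is None:
        raise ValueError(f"{new_active_sc} is not in any partition")

    # Every payload node from any other partition has to reboot.
    return sorted(
        node
        for p in partitions
        if p is not home
        for node in p
        if node.startswith("PL-")
    )


partitions = [["SC-1", "PL-1"], ["SC-2", "PL-2", "PL-3"]]
print(pls_to_reboot(partitions, "SC-1"))  # ['PL-2', 'PL-3']
```

If SC-2 had become active instead, the same call with "SC-2" would return only
PL-1.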
2. We do not support this, but it is a promising approach. To implement it,
we would need a strategy to choose one SC among the active SCs, reassign SUs on
the PLs, synchronize IMM data on the PLs, etc. The surviving SC must verify the
information of all unknown PLs and resolve conflicts between PLs. However, I
think it is easier to restart the PLs.
Best regards,
Hieu
---
**[tickets:#3317] amfnd: two NEW_ACTIVE amfd in split-brain scenario**
**Status:** accepted
**Milestone:** 5.22.11
**Created:** Fri Jun 10, 2022 04:00 AM UTC by Hieu Hong Hoang
**Last Updated:** Fri Jul 29, 2022 05:16 PM UTC
**Owner:** Hieu Hong Hoang
This issue happens when we test the system in a split-brain scenario. We split
the cluster into partitions as follows: [[SC-1(ACT), SC-2(STB), SC-3],
[SC-4(ACT), SC-5(STB), SC-6], [SC-7, SC-8, SC-9(STB), SC-10(ACT)]], then merge
all nodes back. The quiesced SC-3 detected the active nodes in the other
partitions coming up while the active SC-1 in its own partition was still
alive, so no service events were raised for the active nodes in the other
partitions. When SC-1 went down, one of the other active nodes was notified as
the new active. After that new active SC went down, the other active node was
notified. Finally, "amfnd" on SC-3 detected two NEW_ACTIVE amfd and rebooted.
Log analysis:
* SC-3 detected active amfd in other partitions up:
~~~
<143>1 2022-05-31T05:34:56.169467+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25315"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.169469+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25316"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.16947+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25317"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.169472+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25318"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169474+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25319"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169476+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25320"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169477+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25321"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169479+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25322"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.16948+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25323"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.169482+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25324"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169484+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25325"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.169486+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25326"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169487+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25327"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169489+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25328"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169491+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25329"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169493+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25330"] MDS:DB: adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169494+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25331"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.169496+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25332"] MDS:DB: sub_adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169498+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25333"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169499+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25334"] << mds_mcm_svc_up
...
<143>1 2022-05-31T05:34:56.175867+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25497"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.175869+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25498"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.17587+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25499"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.175872+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25500"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175874+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25501"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175875+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25502"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175877+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25503"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175879+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25504"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175881+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25505"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175882+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25506"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175885+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25507"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.175887+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25508"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175888+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25509"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.17589+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25510"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.175891+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25511"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.175893+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25512"] MDS:DB: adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175895+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25513"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.175897+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25514"] MDS:DB: sub_adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175898+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25515"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.1759+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25516"] << mds_mcm_svc_up
~~~
* SC-1 went down and SC-4 was notified as new active:
~~~
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-4 went down and SC-10 was notified as new active:
~~~
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-3 rebooted because it detected two active amfd:
~~~
2022-05-23 14:41:16.878 SC-3 osafamfnd[454]: Rebooting OpenSAF NodeId = 2030f
EE Name = , Reason: AVD already up, OwnNodeId = 2030f, SupervisionTime = 60
2022-05-23 14:41:16.890 SC-3 opensaf_reboot: Rebooting local node; timeout=60
~~~
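
The sequence in the log analysis above can be checked mechanically. The
following is a rough sketch, not an OpenSAF tool: the helper names
`avd_events` and `new_active_nodes` and the regular expression are assumptions
for this example, written against the wrapped mds.log excerpts quoted in this
ticket only.

```python
import re

# Matches "got NCSMDS_<EVENT> for svc_id = AVD(12) ... rem_node[0x...]".
# The pattern is an assumption based on the excerpts in this ticket.
EVENT_RE = re.compile(
    r"got\s+(NCSMDS_\w+)\s+for\s+svc_id\s*=\s*AVD\(12\).*?"
    r"rem_node\[(0x[0-9a-fA-F]+)\]",
    re.DOTALL,
)

# Excerpts copied from the log analysis in this ticket (wrapped as mailed).
SAMPLE = """
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
"""


def avd_events(log_text):
    """Return (event, remote node id) pairs for AVD events, in log order."""
    # The mail wraps each log entry over several lines, so flatten it first.
    flat = " ".join(log_text.split())
    return [(m.group(1), m.group(2)) for m in EVENT_RE.finditer(flat)]


def new_active_nodes(events):
    """Return the remote node ids that were announced as NEW_ACTIVE."""
    return [node for kind, node in events if kind == "NCSMDS_NEW_ACTIVE"]


events = avd_events(SAMPLE)
print(new_active_nodes(events))  # ['0x2040f', '0x20a0f']
```

Two different node ids in the result match the analysis above: amfnd on SC-3
received NCSMDS_NEW_ACTIVE first for 0x2040f and then for 0x20a0f.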