commit 203ff838c39910902eb16965e964a580b0cffe3b (HEAD -> develop,
origin/develop, ticket-3281)
Author: hieu.h.hoang <[email protected]>
Date: Mon Oct 18 08:59:03 2021 +0700
mds: Resolve active MxN VDEST conflict in split brain [#3281]
If split brain happens and network merges back, one active SC is possible
to stay alive due to other active SCs reboot too fast. Although alive SC
doesn't detect the split brain, it still detects some services are up in active
rebooting SCs.
In the ticket description, the status of cluster before network merges is
[[SC-1(ACT), SC-2(STB)], [SC-3(ACT), SC-4, SC-5(STB), SC-6], [SC-7(STB),
SC-8(ATC), SC-9], [SC-10(ATC)]]. On SC-1, the ntfa received wrong
NCSMDS_NO_ACTIVE in the below scenario:
- Detect ntfs on active SC-3 is up.
- Because ntfs on SC-1 is active, there are not SVC up event.
- Update the active vdest to new ntfs(on SC-3).
- Detect ntfs on active SC-3 is down.
- The active vdest is removed. A NCSMDS_NO_ACTIVE is generated, although
ntfs on SC-1 is still active.
Solution: don't replace current active vdest by a new active vdest. When
current active vdest is down, replace it by other active vdest if any.
---
** [tickets:#3281] mds: Sending wrong NO_ACTIVE after split brain detection**
**Status:** review
**Milestone:** 5.21.12
**Created:** Mon Sep 06, 2021 01:47 AM UTC by Hieu Hong Hoang
**Last Updated:** Mon Oct 18, 2021 07:58 AM UTC
**Owner:** Hieu Hong Hoang
**Attachments:**
-
[dump_mds_5.21.06.patch](https://sourceforge.net/p/opensaf/tickets/3281/attachment/dump_mds_5.21.06.patch)
(3.5 kB; application/octet-stream)
-
[mds.log.1](https://sourceforge.net/p/opensaf/tickets/3281/attachment/mds.log.1)
(12.0 MB; application/octet-stream)
Configuration:
- Cluster with 10 SCs, allow sc absence.
- Split cluster to four partitions: [[SC-1, SC-2], [SC-3, SC-4, SC-5, SC-6],
[SC-7, SC-8, SC-9], [SC-10]]
- Role of SCs after network splits: [[SC-1(ACT), SC-2(STB)], [SC-3(ACT), SC-4,
SC-5(STB), SC-6], [SC-7(STB), SC-8(ATC), SC-9], [SC-10(ATC)]]
- Network merges
Observation:
- All active and standby SCs rebooted due to split brain detected except SCs in
partition 1. The SCs in partition 1 don't reboot because the active and standby
SCs in other partitions rebooted too fast.
- Ntf agent in SC1 fails to send notification to ntf server and it will not
recover.
~~~
<143>1 2021-09-04T07:11:56.225221+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136348"] 439:amf/amfd/imm.cc:419 >> execute
<143>1 2021-09-04T07:11:56.225223+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136349"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent
status:0
<143>1 2021-09-04T07:11:56.225227+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136350"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf
Type:3000, sent status:0
<143>1 2021-09-04T07:11:56.225231+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136351"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend
<143>1 2021-09-04T07:11:56.225235+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136352"] 439:ntf/agent/ntfa_api.c:62 TR NTFS server is unavailable
<143>1 2021-09-04T07:11:56.225238+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136353"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend
<143>1 2021-09-04T07:11:56.225241+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136354"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful
TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:11:56.225243+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136355"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification
<143>1 2021-09-04T07:11:56.225246+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136356"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:11:56.225249+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136357"] 439:amf/amfd/ntf.cc:822 << exec
<143>1 2021-09-04T07:11:56.225252+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="136358"] 439:amf/amfd/imm.cc:427 << execute: 2
...
<143>1 2021-09-04T07:26:00.185418+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191040"] 439:amf/amfd/imm.cc:419 >> execute
<143>1 2021-09-04T07:26:00.185465+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191041"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent
status:0
<143>1 2021-09-04T07:26:00.185475+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191042"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf
Type:3000, sent status:0
<143>1 2021-09-04T07:26:00.185485+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191043"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend
<143>1 2021-09-04T07:26:00.185497+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191044"] 439:ntf/agent/ntfa_api.c:59 TR NTFS server is down
<143>1 2021-09-04T07:26:00.185505+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191045"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend
<143>1 2021-09-04T07:26:00.185513+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191046"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful
TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:26:00.18552+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191047"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification
<143>1 2021-09-04T07:26:00.185528+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191048"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:26:00.185536+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191049"] 439:amf/amfd/ntf.cc:822 << exec
<143>1 2021-09-04T07:26:00.185543+02:00 SC-1 osafamfd 439 osafamfd [meta
sequenceId="191050"] 439:amf/amfd/imm.cc:427 << execute: 2
~~~
Reason:
- Before the active SC-3 rebooted, SC-1 still had enough time to connect to the
ntf server in SC-3. When SC-3 rebooted, the ntf agent in SC-1 received a
NO_ACTIVE message of NTFS service. Actually, the ntf server in SC-1 is still in
active state.
- The following mds log is generated by the opensaf code which applied the
patch dump_mds_5.21.06.patch.
+ The ntf agent in SC-1 detects the ntf server in SC-3 is up. Mds updates the
active destination of NTFS service to SC-3.
~~~
<142>1 2021-09-04T07:10:34.384094+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307776"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.3841+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307777"] MDTM: Received SVC event
<142>1 2021-09-04T07:10:34.384102+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307778"] MDTM: No entry in ref tbl so dropping the recd event
<142>1 2021-09-04T07:10:34.384105+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307779"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.384111+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307780"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.384113+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307781"] >> mds_mcm_svc_up
<143>1 2021-09-04T07:10:34.384115+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307782"] MCM:API: LOCAL SVC INFO : svc_id = NTFA(29) | PWE id = 1
| VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.384117+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307783"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1
| VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.38412+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307784"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384122+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307785"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384127+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307786"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.38413+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307787"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.384132+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307788"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384135+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307789"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384138+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307790"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384141+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307791"] MDS:DB: Subscription Result not present
<143>1 2021-09-04T07:10:34.384144+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307792"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384146+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307793"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384149+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307794"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384152+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307795"] >> mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384155+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307796"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.384267+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307831"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest
2010f0000019d
<143>1 2021-09-04T07:10:34.384269+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307832"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384272+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307833"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.384287+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307838"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.38429+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307839"] MDS:DB: adest_details: <rem_node[0x2030f]:dest_pid[398]>
<143>1 2021-09-04T07:10:34.384298+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307840"] << get_subtn_adest_details
<143>1 2021-09-04T07:10:34.384301+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307841"] MDS:DB: sub_adest_details:
<rem_node[0x2030f]:dest_pid[398]>
<143>1 2021-09-04T07:10:34.384304+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307842"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.384462+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307877"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest
2030f0000018e
<143>1 2021-09-04T07:10:34.384465+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307878"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384468+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307879"] sub_svc_id 28 adest 2030f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384471+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307880"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.38449+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307885"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.384493+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307886"] << mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384496+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="307887"] << mds_mcm_svc_up
~~~
+ The ntf agent in SC-1 detected the ntf server in SC-3 was down.
~~~
<142>1 2021-09-04T07:10:34.396969+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308753"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.396975+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308754"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.396979+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308755"] >> mds_mcm_svc_down
<143>1 2021-09-04T07:10:34.396983+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308756"] MCM:API: LOCAL SVC INFO : svc_id = NTFA(29) | PWE id = 1
| VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.396986+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308757"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1
| VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.396989+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308758"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396992+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308759"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396995+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308760"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.396998+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308761"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.397001+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308762"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397003+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308763"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397006+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308764"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.39701+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308765"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.397014+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308766"] >> mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397016+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308767"] << mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397019+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308768"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397022+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308769"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397025+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308770"] >> mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397029+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308771"] >> dump_subt_res_tree...
<143>1 2021-09-04T07:10:34.397157+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308805"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest
2030f0000018e
<143>1 2021-09-04T07:10:34.397159+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308806"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397162+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308807"] sub_svc_id 28 adest 2030f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397165+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308808"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.397186+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308813"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397189+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308814"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.397314+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308848"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest
2030f0000018e
<143>1 2021-09-04T07:10:34.397321+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308849"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397325+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308850"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.39735+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308855"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397353+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308856"] << mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397356+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308857"] >> mds_subtn_res_tbl_getnext_any
<143>1 2021-09-04T07:10:34.397365+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308858"] << mds_subtn_res_tbl_getnext_any
~~~
+ Mds removed the active destination of NTFS service although the NTFS service
in SC-1 was still active.
~~~
<143>1 2021-09-04T07:10:34.39737+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308859"] >> mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397373+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308860"]
MCM_DB:RemoveActive:TimerStart:AwaitActiveTmri:Hdl=0xfc300003:SvcHdl=562945658454045:sbscr-svcid=NTFS(28),vdest=14
<143>1 2021-09-04T07:10:34.397377+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308861"] << mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397379+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308862"] >> mds_mcm_user_event_callback
<143>1 2021-09-04T07:10:34.397382+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308863"] >> mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397385+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308864"] << mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397387+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308865"] >> mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397395+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308866"] << mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397423+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308867"] << mds_mcm_user_event_callback
<142>1 2021-09-04T07:10:34.397425+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308868"] MCM:API: svc_down : svc_id = NTFA(29) on DEST id = 65535
got NO_ACTIVE for svc_id = NTFS(28) on Vdest id = 14 Adest =
<rem_node[0x2030f]:dest_pid[398]>, rem_svc_pvt_ver=1
<143>1 2021-09-04T07:10:34.397429+02:00 SC-1 osafamfd 439 mds.log [meta
sequenceId="308869"] << mds_mcm_svc_down
~~~
---
Sent from sourceforge.net because [email protected] is
subscribed to https://sourceforge.net/p/opensaf/tickets/
To unsubscribe from further messages, a project admin can change settings at
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a
mailing list, you can unsubscribe from the mailing list._______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets