- **status**: review --> fixed
- **Comment**:
commit f7e9ed4cee2d95490a3d5c05676dc6c512d08b9a (HEAD -> develop,
origin/develop, ticket-3309)
Author: thang.d.nguyen <[email protected]>
Date: Fri Mar 4 14:57:19 2022 +0700
amf: reboot to recovery PL in split-brain [#3309]
The connection between the standby SC and a PL was dropped while that PL
remained connected to the active SC. As a result, the standby SC
considered the PL absent even after the connection was re-established.
During failover, the standby SC notifies that all recorded absent nodes
have left the cluster, so the PL leaves the cluster from the AMF point of
view while it is still connected to the active SC.
This scenario is a kind of split-brain, and amfd should order the PL to
reboot to recover from it.
---
** [tickets:#3309] amf: the payload node unexpectedly left cluster right after
failover**
**Status:** fixed
**Milestone:** 5.22.04
**Created:** Thu Feb 24, 2022 03:57 AM UTC by Hieu Hong Hoang
**Last Updated:** Mon Mar 07, 2022 04:27 AM UTC
**Owner:** Thang Duc Nguyen
After the active SC rebooted, the standby SC failed over to active. The new
active SC reported that a PL had left the cluster, although that PL was still
in the cluster. The reason is that the connection between the standby SC and
that PL had been dropped earlier while the PL remained connected to the active
SC. The standby SC therefore considered the PL absent, even though the
connection was re-established afterwards. The standby SC only changes the PL
state when it receives a checkpoint from the active SC. However, the active SC
does not send that checkpoint because it never lost its connection to the PL.
During failover, the standby SC reports every node it has recorded as absent
as having left the cluster.
<pre>
                        absent nodes:PL-3       absent nodes:PL-3
SC-1(Act)----SC-2(Stb)  SC-1(Act)----SC-2(Stb)  SC-1(Act)----SC-2(Stb)
     \        /              \                       \        /
      \      /                \                       \      /
        PL-3                    PL-3                    PL-3

absent nodes:PL-3,SC-1
SC-1(Down)   SC-2(Stb)  SC-1(Stb)----SC-2(Act)
              /              \        /
             /                \      /
        PL-3                    PL-3
</pre>
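The sequence in the diagram can be replayed with a small toy model. This is only a sketch and not amfd code: the class `StandbyView`, its methods, and the `reboot_on_split_brain` flag are hypothetical names, and the recovery condition (a node recorded as absent while its link is currently up) is an assumption about what the fix checks, based on the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyView:
    """Toy model of the standby SC's view of other nodes (hypothetical)."""
    connected: set = field(default_factory=set)  # nodes with a live link to this SC
    absent: set = field(default_factory=set)     # nodes recorded as absent

    def link_down(self, node):
        self.connected.discard(node)
        self.absent.add(node)

    def link_up(self, node):
        # Reconnecting does not clear the absent record. Only a checkpoint
        # from the active SC would, and the active SC never sends one
        # because it never lost the node.
        self.connected.add(node)

    def checkpoint_node_up(self, node):
        self.absent.discard(node)

    def failover(self, reboot_on_split_brain):
        """Return (left, rebooted) node lists decided when becoming active."""
        left, rebooted = [], []
        for node in sorted(self.absent):
            # Assumed recovery rule: absent on record but link is up.
            if reboot_on_split_brain and node in self.connected:
                rebooted.append(node)
            else:
                left.append(node)
        self.absent.clear()
        return left, rebooted


def replay(reboot_on_split_brain):
    """Replay the ticket scenario from SC-2's point of view."""
    sc2 = StandbyView(connected={"SC-1", "PL-3"})
    sc2.link_down("PL-3")   # SC-2 loses PL-3
    sc2.link_up("PL-3")     # SC-2 reconnects, no checkpoint arrives
    sc2.link_down("SC-1")   # active SC-1 reboots
    return sc2.failover(reboot_on_split_brain)


if __name__ == "__main__":
    print("reported behaviour:", replay(False))
    print("with reboot recovery:", replay(True))
```

With the flag off, PL-3 is reported as having left the cluster together with SC-1, which matches the ticket. With the flag on, PL-3 is ordered to reboot instead, so its state is resynchronised when it rejoins.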
Log analysis:
* SC-2 (standby SC) lost contact with PL-3
2022-02-23 09:03:24.114 SC-2 osafdtmd[320]: NO Lost contact with 'PL-3'
* SC-2 (standby SC) re-established contact with PL-3
2022-02-23 09:03:24.513 SC-2 osafdtmd[320]: NO Established contact with 'PL-3'
* SC-2 finished the failover:
2022-02-23 09:03:25.582 SC-2 osafamfd[422]: NO FAILOVER StandBy --> Active DONE!
* SC-2 reported that PL-3 left the cluster:
2022-02-23 09:03:25.679 SC-2 osafamfd[422]: NO Node 'PL-3' left the cluster
* State of nodes:
safAmfNode=PL-3,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=DISABLED(2)
safAmfNode=PL-4,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=PL-5,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=SC-1,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
safAmfNode=SC-2,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
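The pattern in the log lines above (contact lost, contact re-established, node then reported as having left) can be searched for mechanically. The following is a minimal sketch that assumes the log format is exactly as quoted in this ticket; the function name `suspicious_departures` and the regular expressions are illustrative and not part of OpenSAF.

```python
import re

# Assumed line layout: "<date> <time> <host> <proc>[<pid>]: NO <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<host>\S+) (?P<proc>\w+)\[\d+\]: NO (?P<msg>.*)$"
)
EVENTS = [
    ("lost", re.compile(r"^Lost contact with '(?P<node>[^']+)'$")),
    ("established", re.compile(r"^Established contact with '(?P<node>[^']+)'$")),
    ("left", re.compile(r"^Node '(?P<node>[^']+)' left the cluster$")),
]


def suspicious_departures(lines):
    """Return (timestamp, host, node) for every 'left the cluster' message
    logged by a host whose most recent contact event for that node was
    'Established contact'."""
    last_contact = {}  # (host, node) -> "lost" or "established"
    flagged = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue
        for kind, pattern in EVENTS:
            e = pattern.match(m["msg"])
            if not e:
                continue
            key = (m["host"], e["node"])
            if kind == "left":
                if last_contact.get(key) == "established":
                    flagged.append((m["ts"], m["host"], e["node"]))
            else:
                last_contact[key] = kind
    return flagged


SAMPLE = [
    "2022-02-23 09:03:24.114 SC-2 osafdtmd[320]: NO Lost contact with 'PL-3'",
    "2022-02-23 09:03:24.513 SC-2 osafdtmd[320]: NO Established contact with 'PL-3'",
    "2022-02-23 09:03:25.582 SC-2 osafamfd[422]: NO FAILOVER StandBy --> Active DONE!",
    "2022-02-23 09:03:25.679 SC-2 osafamfd[422]: NO Node 'PL-3' left the cluster",
]

if __name__ == "__main__":
    for ts, host, node in suspicious_departures(SAMPLE):
        print(f"{ts} {host}: '{node}' reported as left while contact was established")
```

Run against the four lines quoted in this ticket, the check flags the PL-3 departure logged by SC-2.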
Steps to reproduce:
1. Drop the connection between the standby SC-2 and PL-3
2. Reconnect SC-2 with PL-3
3. Execute "immdump" on a node. (immd on the standby SC-2 will remove
PL-3 from the list of detached nodes)
4. Reboot the active SC-1
5. Execute "amf-state node" on a node