Hi Gary,
Thanks for reminding me. My guess is that if there was a node leave at SC-1, then
there should have been a node leave at PL-16 as well, and that needs to be debugged.
The patch looks OK to me as well, except that it will result in sending multiple
reboot commands to PL-16 if more messages arrive from PL-16.
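One possible way to avoid the repeated reboot commands, as a rough sketch only:
remember which nodes already have a reboot order outstanding and skip the order
for later stale messages. The names below (RebootGuard, OrderRebootOnce, Clear)
are hypothetical and not existing AMFD code; the callback stands in for whatever
actually sends the order, e.g. opensaf_reboot() or avd_send_reboot_msg_directly()
in the patch.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hypothetical helper, not part of AMFD: remembers which nodes already
// have a reboot order outstanding, so that repeated stale messages from
// the same node do not trigger one reboot command each.
class RebootGuard {
 public:
  // Calls send_reboot(node_id) only the first time a node is seen.
  // Returns true if the order was sent, false if it was suppressed.
  bool OrderRebootOnce(uint32_t node_id,
                       const std::function<void(uint32_t)> &send_reboot) {
    if (!pending_.insert(node_id).second) return false;  // already ordered
    send_reboot(node_id);
    return true;
  }

  // Meant to be called once the node has rejoined, so that a later
  // failure of the same node can be recovered in the same way.
  void Clear(uint32_t node_id) { pending_.erase(node_id); }

 private:
  std::unordered_set<uint32_t> pending_;
};
```

In the patch this would roughly mean wrapping the fencing/reboot branch in such a
check and clearing the entry when the node joins again. A per-node flag on
AVD_AVND would do the same job; I have not checked which fits better.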
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF Support and Services
--------- Original Message ---------
Subject: Re: [PATCH 1/1] amf: Recover node that disconnect from active AMFD [#2880]
From: "Gary Lee" <[email protected]>
Date: 7/24/18 10:01 am
To: "thuan.tran" <[email protected]>, [email protected]
Cc: [email protected], [email protected]
Hi Nagu
Do you have any comments on this? It seems OK to me, but I know you've
worked on similar scenarios with TIPC flickering before, where reboot is
issued from the PL side.
Thanks
Gary
On 09/07/18 16:37, thuan.tran wrote:
> There is an abnormal state in which AMFND on a remote node keeps sending
> messages to the active AMFD, but the active AMFD sees that node as already left.
> The expected msg_id does not match, and the remote node stays stuck,
> out of the control of the active AMFD.
> In this case, the active AMFD can trigger remote fencing for that node
> if possible; otherwise it sends a reboot order directly.
> ---
> src/amf/amfd/ndfsm.cc | 2 --
> src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
> 2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
> index 9d54df13d..2d407be12 100644
> --- a/src/amf/amfd/ndfsm.cc
> +++ b/src/amf/amfd/ndfsm.cc
> @@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
> */
> node->node_state = AVD_AVND_STATE_ABSENT;
> node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
> - node->adest = 0;
> node->rcv_msg_id = 0;
> node->snd_msg_id = 0;
> node->recvr_fail_sw = false;
> @@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
>
> LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
>
> - node->adest = 0;
> node->rcv_msg_id = 0;
> node->snd_msg_id = 0;
> node->recvr_fail_sw = false;
> diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
> index 428c26085..31d2263d2 100644
> --- a/src/amf/amfd/ndproc.cc
> +++ b/src/amf/amfd/ndproc.cc
> @@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
> LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
> __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
> node->rcv_msg_id + 1);
> + if (node->rcv_msg_id == 0) {
> + /* Active AMFD see node left but node still see active AMFD
> + and keep sending messages with msg_id increment */
> + LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
> + Consensus consensus_service;
> + if (consensus_service.IsRemoteFencingEnabled() == true) {
> + std::string host_name =
> + osaf_extended_name_borrow(&node->node_info.nodeName);
> + int first = host_name.find_first_of("=") + 1;
> + int end = host_name.find_first_of(",");
> + host_name = host_name.substr(first, end-first);
> + opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
> + } else {
> + avd_send_reboot_msg_directly(node);
> + }
> + }
> return nullptr;
> }
>
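A side note on the host name extraction in the hunk above: std::string::find_first_of()
returns std::string::size_type and yields std::string::npos when nothing is found,
while the patch stores the positions in int. A minimal sketch of the same extraction
with those cases handled explicitly is below. FirstRdnValue is a hypothetical helper
name, and the DN used in the comment is only an assumed example of what nodeName
may look like.

```cpp
#include <string>

// Hypothetical helper: returns the value of the first RDN in a DN,
// e.g. "safNode=PL-16,safCluster=myClmCluster" -> "PL-16" (assumed input).
// Returns an empty string if the input contains no '='.
std::string FirstRdnValue(const std::string &dn) {
  const std::string::size_type eq = dn.find('=');
  if (eq == std::string::npos) return "";
  const std::string::size_type first = eq + 1;
  const std::string::size_type comma = dn.find(',', first);
  // Without a comma, take everything after the '='.
  return comma == std::string::npos ? dn.substr(first)
                                    : dn.substr(first, comma - first);
}
```

This is only meant to illustrate the string handling, not to replace the code
in the patch.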