Hi Thuan,

In my tests I often see the MDS down event arrive before the CLM callback; the problem in this ticket is that those events arrive in the reverse order.

<143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
<143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
<143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >> avd_node_mark_absent
...
<143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh

<143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >> saClmDispatch
...
<143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
<143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR numberOfMembers:'4', numberOfItems:'1'
<143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0, node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
<142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb: CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF cluster member; MDS down received
<143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb
...
<143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 << saClmDispatch


Resetting the counters can avoid the message ID mismatch, but I'm not sure it is a complete fix, since the MDS down event is still not handled properly. If the MDS down event were handled properly in both cases, would this mismatch problem occur at all?

Thanks,
Minh

On 09/07/18 11:11, Tran Thuan wrote:
Hi Minh,

If I understand correctly, the CLM callback always comes before the AMFND down/up event.
The problem is simply that AMFND is not yet up when the CLM callback sees the node's
MDS down and deletes the node.
After that, the AMFND down event cannot find the node, which I guess is
expected.
So we only need to reset the counters to make things right. What do you think?

<143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
<142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb: CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF cluster member; MDS down received
<143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
<143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, (nil)
<143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh

<143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
<143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
<143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR amfnd on 2030f is up
<143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
<143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh

Best Regards,
Thuan


-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 7:15 AM
To: thuan.tran <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message 
ID mismatch [#2891]

Hi,

In the normal situation, when I stop OpenSAF or reboot the node, I often get MDS
down first, then the CLM track callback. [1]

The problem in this ticket is that AMFD gets the CLM track callback before
MDS down. [2]

We need to see it as a race condition rather than a missing counter reset, so
that AMFD's function for handling MDS down is still called. That would make [2]
equivalent to [1], which currently works.

With this patch, avd_mds_avnd_down_evh() is still not called, and that
function does a lot of processing for the MDS down event
(e.g. @nodes_exit_cnt, ...).

I think node_info.member should be set TRUE or FALSE in the CLM
callback, since it represents CLM membership. We can use @member together with
one of the variables indicating MDS down so that the MDS down and CLM callback
handlers trap each other; in the end, avd_mds_avnd_down_evh() is called and the
node ID is removed from node_id_db in both [1] and [2].

Thanks,
Minh
On 06/07/18 20:36, thuan.tran wrote:
There is a case where, after AMFD sends a reboot order due to an “out of sync window”,
AMFD receives a CLM track callback while the node is not yet a member, and deletes the node.
The later AMFND MDS down then cannot find the node and does not reset the msg_id counters.
When the node comes back up, AMFD keeps using the current msg_id counter when sending to
AMFND, causing a message ID mismatch in AMFND, whereupon AMFND orders a reboot of its own node.
---
   src/amf/amfd/clm.cc | 7 ++++++-
   1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index e113a65f9..4a15d5ad7 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -316,9 +316,14 @@ static void clm_track_cb(
                      __FUNCTION__, node_name.c_str());
               goto done;
             } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
-            LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
+            LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
                      __FUNCTION__, node_name.c_str());
               avd_node_delete_nodeid(node);
+            /* Reset msg_id because AVND MDS down may come later
+              and cannot find node to reset these, cause message ID mismatch. */
+            node->rcv_msg_id = 0;
+            node->snd_msg_id = 0;
+            node->node_info.member = SA_FALSE;
               goto done;
             }
             TRACE(" Node Left: rootCauseEntity %s for node %u",



