Re: [devel] [PATCH 1/1] amfnd: Ignore second NCSMDS_DOWN [#2436]

A V Mahesh Wed, 03 May 2017 01:02:28 -0700

Hi Surya,

This is the current workaround in AMF, which is similar to how other 
services handle the case. Soon we will revisit and tune the MDS 
functionality to match the RED down/timer for HEADLESS functionality.


Hi Minh,

This patch protects against the case, so it is OK with me as well.

-AVM

On 5/3/2017 12:16 PM, Suryanarayana.Garlapati wrote:
> Hi Nagendra,
>
> I guess it would be good if the root cause were fixed, not the side 
> effects.
>
>
> Regards
>
> Surya
>
>
> On Wednesday 03 May 2017 12:12 PM, Nagendra Kumar wrote:
>> Hi Minh,
>>         The patch looks ok to me.
>>
>> Thanks
>> -Nagu
>>
>>> -----Original Message-----
>>> From: minh chau [mailto:[email protected]]
>>> Sent: 28 April 2017 15:24
>>> To: A V Mahesh; Suryanarayana.Garlapati; [email protected];
>>> Nagendra Kumar; [email protected]; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: Re: [devel] [PATCH 1/1] amfnd: Ignore second NCSMDS_DOWN
>>> [#2436]
>>>
>>> Hi AMF maintainers,
>>>
>>> While waiting for Mahesh to check whether another NCSMDS_DOWN(Vdest)
>>> should come 3 mins after headless, can we have a look at this patch?
>>> I think we need it to make AMFND safe.
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 27/04/17 12:26, A V Mahesh wrote:
>>>> Hi Minh chau,
>>>>
>>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>>> Vdest
>>>> I don't see any unnatural events from MDS, as amfnd might have
>>>> subscribed for them.
>>>> Currently the transport (MDS) does not deliver events differently
>>>> for headless and non-headless; the distinction is completely
>>>> invisible to MDS.
>>>>
>>>> I will go through this AMF case and will get back to you.
>>>>
>>>> -AVM
>>>>
>>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>>> Hi Mahesh,
>>>>>
>>>>> The sequence is going like this:
>>>>>
>>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>>> Vdest. I guess at this point MDS tells that both standby and active
>>>>> amfd are down?
>>>>>      2017-04-26 21:13:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>>
>>>>> - Leave cluster in headless about 3 mins, amfnd receives another
>>>>> NCSMDS_DOWN with Vdest, so MDS is telling no active amfd again?
>>>>>      syslog:
>>>>>      2017-04-26 21:16:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>>
>>>>>      mds log:
>>>>>      <143>1 2017-04-26T21:16:52.873168+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9881"] >> mds_mcm_await_active_tmr_expiry
>>>>>      <142>1 2017-04-26T21:16:52.873183+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9882"] MCM:API: await_active_tmr expired for svc_id = AVND(13) Subscribed to svc_id = AVD(12) on VDEST id = 1
>>>>>      <143>1 2017-04-26T21:16:52.9453+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="938"] >> mds_mcm_await_active_tmr_expiry
>>>>>      <142>1 2017-04-26T21:16:52.945309+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="939"] MCM:API: await_active_tmr expired for svc_id = CLMNA(36) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>>      <142>1 2017-04-26T21:16:52.945452+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="620"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = SMFND(31) on DEST id = 65535 got DOWN for svc_id = SMFD(30) on VDEST id = 15
>>>>>      <143>1 2017-04-26T21:16:52.945462+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="621"] << mds_mcm_await_active_tmr_expiry
>>>>>      <143>1 2017-04-26T21:16:52.945938+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1547"] >> mds_mcm_await_active_tmr_expiry
>>>>>      <142>1 2017-04-26T21:16:52.945947+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1548"] MCM:API: await_active_tmr expired for svc_id = CPND(17) Subscribed to svc_id = CPD(16) on VDEST id = 9
>>>>>      <142>1 2017-04-26T21:16:52.946064+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1558"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CPND(17) on DEST id = 65535 got DOWN for svc_id = CPD(16) on VDEST id = 9
>>>>>      <143>1 2017-04-26T21:16:52.946074+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1559"] << mds_mcm_await_active_tmr_expiry
>>>>>      <143>1 2017-04-26T21:16:52.94611+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1562"] >> mds_mcm_await_active_tmr_expiry
>>>>>      <142>1 2017-04-26T21:16:52.946118+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1563"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>>      <143>1 2017-04-26T21:16:52.955692+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30048"] >> mds_mcm_await_active_tmr_expiry
>>>>>      <142>1 2017-04-26T21:16:52.955698+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30049"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>>      <142>1 2017-04-26T21:16:52.955765+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30059"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CLMA(35) on DEST id = 65535 got DOWN for svc_id = CLMS(34) on VDEST id = 16
>>>>>      <143>1 2017-04-26T21:16:52.955775+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30060"] << mds_mcm_await_active_tmr_expiry
>>>>>
>>>>> I guess the other node-director services also receive the 2nd
>>>>> NCSMDS_DOWN(Vdest), but those services have no problem because of
>>>>> their own logic (ckptnd, for example, likely checks
>>>>> cb->is_cpd_up == true), so I thought it was an AMF problem until I
>>>>> saw the points from Suryanarayana. So is the await_active_tmr
>>>>> working as expected?
>>>>>
>>>>> thanks,
>>>>> Minh
>>>>>
>>>>> On 26/04/17 17:11, A V Mahesh wrote:
>>>>>> Hi Minh Chau,
>>>>>>
>>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>>> amfnd will receive another NCSMDS_DOWN
>>>>>> Do you mean amfnd is receiving NCSMDS_DOWN for the same amfd twice,
>>>>>> or is amfnd receiving NCSMDS_DOWN for both the active amfd and the
>>>>>> standby amfd?
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>>> @Suryanarayana: I think this fix makes AMFND a bit defensive, but
>>>>>>> let's see Mahesh's comments
>>>>>>> @Mahesh: If NCSMDS_DOWN has been received, then there is no active
>>>>>>> to wait for, so should MDS stop this timer?
>>>>>>>
>>>>>>>
>>>>>>> On 26/04/17 15:45, Suryanarayana.Garlapati wrote:
>>>>>>>> I guess this fix might need to be done at the MDS level, not at
>>>>>>>> the AMFND, taking into consideration that the cluster has only
>>>>>>>> two controllers.
>>>>>>>>
>>>>>>>> The timer that is started at MDS should not be started (or, if
>>>>>>>> already started, should be stopped) when a down is received for
>>>>>>>> both of the amfds.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wednesday 26 April 2017 10:53 AM, Minh Chau wrote:
>>>>>>>>> If the cluster goes into the headless stage and waits up to 3 mins,
>>>>>>>>> which is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL, amfnd
>>>>>>>>> will receive another NCSMDS_DOWN and then delete all buffered
>>>>>>>>> messages. As a result, headless recovery is impossible because
>>>>>>>>> these buffered messages are deleted.
>>>>>>>>>
>>>>>>>>> Patch ignores the second NCSMDS_DOWN.
>>>>>>>>> ---
>>>>>>>>>    src/amf/amfnd/di.cc | 7 +++++++
>>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>>>> index 627b31853..e06b9260d 100644
>>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>>>> @@ -638,6 +638,13 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, AVND_EVT *evt) {
>>>>>>>>>        }
>>>>>>>>>      }
>>>>>>>>> +  // Ignore the second NCSMDS_DOWN which comes from timeout of
>>>>>>>>> +  // MDS_AWAIT_ACTIVE_TMR_VAL
>>>>>>>>> +  if (cb->is_avd_down == true) {
>>>>>>>>> +    TRACE_LEAVE();
>>>>>>>>> +    return rc;
>>>>>>>>> +  }
>>>>>>>>> +
>>>>>>>>>      m_AVND_CB_AVD_UP_RESET(cb);
>>>>>>>>>      cb->active_avd_adest = 0;
>>>>>>>>
>>>>>>
>>>>
>
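
The guard discussed in this thread can be illustrated outside of OpenSAF
with a minimal sketch. It is not the amfnd code: NodeDirectorCb,
HandleDirectorDown, BufferForDirector and
BufferedMessagesSurviveRepeatDown are hypothetical names, and only the
is_avd_down flag mirrors a field named in the patch. The sketch assumes
that the work done by the down handler includes dropping buffered
messages, as described in the patch comment, and shows that returning
early on a repeated down keeps the messages buffered during headless.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the amfnd control block. Only is_avd_down
// corresponds to a field that appears in the patch.
struct NodeDirectorCb {
  bool is_avd_down = false;
  std::vector<std::string> buffered_msgs;  // kept for headless recovery
};

// Hypothetical handler modelled on the patched down handler.
// Returns true if the down event was acted on, false if it was ignored
// as a repeat (for example one caused by a timer expiry).
bool HandleDirectorDown(NodeDirectorCb &cb) {
  if (cb.is_avd_down) {
    return false;  // repeat down: leave buffered messages untouched
  }
  cb.is_avd_down = true;
  // Assumption: the reset performed on a down drops queued messages.
  cb.buffered_msgs.clear();
  return true;
}

// Hypothetical helper: queue a message while no director is reachable.
void BufferForDirector(NodeDirectorCb &cb, const std::string &msg) {
  cb.buffered_msgs.push_back(msg);
}

// Replays the sequence from the thread: first down, messages buffered
// during headless, then a second down. Returns true if the buffered
// messages are still present afterwards.
bool BufferedMessagesSurviveRepeatDown() {
  NodeDirectorCb cb;
  HandleDirectorDown(cb);               // both SCs stopped
  BufferForDirector(cb, "state-update-1");
  BufferForDirector(cb, "state-update-2");
  HandleDirectorDown(cb);               // second down, about 3 mins later
  return cb.buffered_msgs.size() == 2;
}
```

The same early return would also absorb any other duplicate down event,
which is why the fix is described above as making AMFND defensive rather
than addressing the timer behaviour in MDS.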


_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
