Hi Mahesh,

The sequence goes like this:

- Stop both SCs: amfnd receives two NCSMDS_DOWN events, one for the Adest and
one for the Vdest. I guess at this point MDS is telling us that both the
standby and the active amfd are down?
     2017-04-26 21:13:52 PL-4 osafamfnd[413]: WA AMF director 
unexpectedly crashed

- Leave the cluster headless for about 3 minutes: amfnd receives another
NCSMDS_DOWN for the Vdest, so MDS is again telling us that there is no active
amfd? (See the sketch after the mds log below for how I read this timer
behaviour.)
     syslog:
     2017-04-26 21:16:52 PL-4 osafamfnd[413]: WA AMF director 
unexpectedly crashed

     mds log:
     <143>1 2017-04-26T21:16:52.873168+10:00 PL-4 osafamfnd 413 mds.log 
[meta sequenceId="9881"] >> mds_mcm_await_active_tmr_expiry
     <142>1 2017-04-26T21:16:52.873183+10:00 PL-4 osafamfnd 413 mds.log 
[meta sequenceId="9882"] MCM:API: await_active_tmr expired for svc_id = 
AVND(13) Subscribed to svc_id = AVD(12) on VDEST id = 1
     <143>1 2017-04-26T21:16:52.9453+10:00 PL-4 osafclmna 405 mds.log 
[meta sequenceId="938"] >> mds_mcm_await_active_tmr_expiry
     <142>1 2017-04-26T21:16:52.945309+10:00 PL-4 osafclmna 405 mds.log 
[meta sequenceId="939"] MCM:API: await_active_tmr expired for svc_id = 
CLMNA(36) Subscribed to svc_id = CLMS(34) on VDEST id = 16
     <142>1 2017-04-26T21:16:52.945452+10:00 PL-4 osafsmfnd 454 mds.log 
[meta sequenceId="620"] MCM:API: svc_down : await_active_tmr_expiry : 
svc_id = SMFND(31) on DEST id = 65535 got DOWN for svc_id = SMFD(30) on 
VDEST id = 15
     <143>1 2017-04-26T21:16:52.945462+10:00 PL-4 osafsmfnd 454 mds.log 
[meta sequenceId="621"] << mds_mcm_await_active_tmr_expiry
     <143>1 2017-04-26T21:16:52.945938+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1547"] >> mds_mcm_await_active_tmr_expiry
     <142>1 2017-04-26T21:16:52.945947+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1548"] MCM:API: await_active_tmr expired for svc_id = 
CPND(17) Subscribed to svc_id = CPD(16) on VDEST id = 9
     <142>1 2017-04-26T21:16:52.946064+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1558"] MCM:API: svc_down : await_active_tmr_expiry : 
svc_id = CPND(17) on DEST id = 65535 got DOWN for svc_id = CPD(16) on 
VDEST id = 9
     <143>1 2017-04-26T21:16:52.946074+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1559"] << mds_mcm_await_active_tmr_expiry
     <143>1 2017-04-26T21:16:52.94611+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1562"] >> mds_mcm_await_active_tmr_expiry
     <142>1 2017-04-26T21:16:52.946118+10:00 PL-4 osafckptnd 432 mds.log 
[meta sequenceId="1563"] MCM:API: await_active_tmr expired for svc_id = 
CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
     <143>1 2017-04-26T21:16:52.955692+10:00 PL-4 osafimmnd 395 mds.log 
[meta sequenceId="30048"] >> mds_mcm_await_active_tmr_expiry
     <142>1 2017-04-26T21:16:52.955698+10:00 PL-4 osafimmnd 395 mds.log 
[meta sequenceId="30049"] MCM:API: await_active_tmr expired for svc_id = 
CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
     <142>1 2017-04-26T21:16:52.955765+10:00 PL-4 osafimmnd 395 mds.log 
[meta sequenceId="30059"] MCM:API: svc_down : await_active_tmr_expiry : 
svc_id = CLMA(35) on DEST id = 65535 got DOWN for svc_id = CLMS(34) on 
VDEST id = 16
     <143>1 2017-04-26T21:16:52.955775+10:00 PL-4 osafimmnd 395 mds.log 
[meta sequenceId="30060"] << mds_mcm_await_active_tmr_expiry

I guess the other node-director services also receive the second
NCSMDS_DOWN (Vdest), but those services have no problem with it because of
their own logic (most likely ckptnd checks cb->is_cpd_up == true), so I
thought it was an AMF problem, until I saw the points from Suryanarayana. So
the await_active_tmr is working as expected? The sketch below shows roughly
the kind of guard I have in mind.
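
This is hypothetical code (the names are mine, not the actual ckptnd/amfnd
source), just to illustrate the guard: the subscriber remembers that the
director is already down, so the timer-generated second DOWN is ignored and
the buffered messages survive for headless recovery:

    // Hypothetical sketch, not the actual ckptnd/amfnd code: a node director
    // keeps a flag so the delayed, timer-generated DOWN for the same VDEST
    // is ignored and the messages buffered for headless recovery survive.
    #include <deque>
    #include <string>

    struct DirectorSubscriberCb {
      bool director_up = true;           // analogous to cb->is_cpd_up
      std::deque<std::string> buffered;  // messages queued while headless
    };

    // Handle an NCSMDS_DOWN-style event for the director's VDEST.
    void OnDirectorDown(DirectorSubscriberCb *cb) {
      if (!cb->director_up) {
        // Second DOWN (await_active_tmr expiry): nothing new to clean up;
        // flushing here would throw away the messages needed for recovery.
        return;
      }
      cb->director_up = false;
      // First DOWN: mark the director absent but keep the buffer intact so
      // it can be replayed once an active director comes back.
    }

    // A new active director appeared: replay the buffer, then clear it.
    void OnDirectorUp(DirectorSubscriberCb *cb) {
      cb->director_up = true;
      // ... resend everything in cb->buffered, then cb->buffered.clear();
    }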

thanks,
Minh

On 26/04/17 17:11, A V Mahesh wrote:
> Hi Minh Chau,
>
> On 4/26/2017 12:05 PM, minh chau wrote:
>> amfnd will receive another NCSMDS_DOWN
>
> You mean amfnd is receiving NCSMDS_DOWN for the same amfd twice,
> or amfnd is receiving NCSMDS_DOWN for both the active amfd and the
> standby amfd?
>
> -AVM
>
> On 4/26/2017 12:05 PM, minh chau wrote:
>>
>> @Suryanarayana: I think this fix makes AMFND a bit defensive, but 
>> let's see Mahesh's comments
>> @Mahesh: If we are getting NCSMDS_DOWN, then there is no active to wait
>> for, so should MDS stop this timer?
>>
>>
>> On 26/04/17 15:45, Suryanarayana.Garlapati wrote:
>>> I guess this fix might need to be done at the MDS level rather than in
>>> AMFND, taking into consideration that the cluster has only two
>>> controllers.
>>>
>>> The timer which is started at MDS should not be started (or, if already
>>> started, should be stopped) in case the down is received for both of
>>> the amfds.
>>>
>>>
>>>
>>> On Wednesday 26 April 2017 10:53 AM, Minh Chau wrote:
>>>> If the cluster goes headless and stays that way for up to 3 minutes,
>>>> which is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL,
>>>> amfnd will receive another NCSMDS_DOWN and then delete
>>>> all buffered messages. As a result, headless recovery
>>>> is impossible because these buffered messages are deleted.
>>>>
>>>> The patch ignores the second NCSMDS_DOWN.
>>>> ---
>>>>   src/amf/amfnd/di.cc | 7 +++++++
>>>>   1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>> index 627b31853..e06b9260d 100644
>>>> --- a/src/amf/amfnd/di.cc
>>>> +++ b/src/amf/amfnd/di.cc
>>>> @@ -638,6 +638,13 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, 
>>>> AVND_EVT *evt) {
>>>>       }
>>>>     }
>>>>   +  // Ignore the second NCSMDS_DOWN which comes from timeout of
>>>> +  // MDS_AWAIT_ACTIVE_TMR_VAL
>>>> +  if (cb->is_avd_down == true) {
>>>> +    TRACE_LEAVE();
>>>> +    return rc;
>>>> +  }
>>>> +
>>>>     m_AVND_CB_AVD_UP_RESET(cb);
>>>>     cb->active_avd_adest = 0;
>>>
>>>
>>
>
>

