Hi Minh Chau,

On 4/26/2017 5:43 PM, minh chau wrote:
>
> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is Vdest

I don't see any unnatural events from MDS, as amfnd might have subscribed
for them.
Currently the transport (MDS) functionality doesn't deliver events
differently for headless and non-headless operation; the headless state is
completely invisible to MDS.

I will go through this AMF case and will get back to you.

-AVM

On 4/26/2017 5:43 PM, minh chau wrote:
> Hi Mahesh,
>
> The sequence is going like this:
>
> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is 
> Vdest. I guess at this point MDS tells that both standby and active 
> amfd are down?
>     2017-04-26 21:13:52 PL-4 osafamfnd[413]: WA AMF director 
> unexpectedly crashed
>
> - Leave the cluster headless for about 3 mins; amfnd receives another 
> NCSMDS_DOWN with Vdest, so MDS is telling us again that there is no active amfd?
>     syslog:
>     2017-04-26 21:16:52 PL-4 osafamfnd[413]: WA AMF director 
> unexpectedly crashed
>
>     mds log:
>     <143>1 2017-04-26T21:16:52.873168+10:00 PL-4 osafamfnd 413 mds.log 
> [meta sequenceId="9881"] >> mds_mcm_await_active_tmr_expiry
>     <142>1 2017-04-26T21:16:52.873183+10:00 PL-4 osafamfnd 413 mds.log 
> [meta sequenceId="9882"] MCM:API: await_active_tmr expired for svc_id 
> = AVND(13) Subscribed to svc_id = AVD(12) on VDEST id = 1
>     <143>1 2017-04-26T21:16:52.9453+10:00 PL-4 osafclmna 405 mds.log 
> [meta sequenceId="938"] >> mds_mcm_await_active_tmr_expiry
>     <142>1 2017-04-26T21:16:52.945309+10:00 PL-4 osafclmna 405 mds.log 
> [meta sequenceId="939"] MCM:API: await_active_tmr expired for svc_id = 
> CLMNA(36) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>     <142>1 2017-04-26T21:16:52.945452+10:00 PL-4 osafsmfnd 454 mds.log 
> [meta sequenceId="620"] MCM:API: svc_down : await_active_tmr_expiry : 
> svc_id = SMFND(31) on DEST id = 65535 got DOWN for svc_id = SMFD(30) 
> on VDEST id = 15
>     <143>1 2017-04-26T21:16:52.945462+10:00 PL-4 osafsmfnd 454 mds.log 
> [meta sequenceId="621"] << mds_mcm_await_active_tmr_expiry
>     <143>1 2017-04-26T21:16:52.945938+10:00 PL-4 osafckptnd 432 
> mds.log [meta sequenceId="1547"] >> mds_mcm_await_active_tmr_expiry
>     <142>1 2017-04-26T21:16:52.945947+10:00 PL-4 osafckptnd 432 
> mds.log [meta sequenceId="1548"] MCM:API: await_active_tmr expired for 
> svc_id = CPND(17) Subscribed to svc_id = CPD(16) on VDEST id = 9
>     <142>1 2017-04-26T21:16:52.946064+10:00 PL-4 osafckptnd 432 
> mds.log [meta sequenceId="1558"] MCM:API: svc_down : 
> await_active_tmr_expiry : svc_id = CPND(17) on DEST id = 65535 got 
> DOWN for svc_id = CPD(16) on VDEST id = 9
>     <143>1 2017-04-26T21:16:52.946074+10:00 PL-4 osafckptnd 432 
> mds.log [meta sequenceId="1559"] << mds_mcm_await_active_tmr_expiry
>     <143>1 2017-04-26T21:16:52.94611+10:00 PL-4 osafckptnd 432 mds.log 
> [meta sequenceId="1562"] >> mds_mcm_await_active_tmr_expiry
>     <142>1 2017-04-26T21:16:52.946118+10:00 PL-4 osafckptnd 432 
> mds.log [meta sequenceId="1563"] MCM:API: await_active_tmr expired for 
> svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>     <143>1 2017-04-26T21:16:52.955692+10:00 PL-4 osafimmnd 395 mds.log 
> [meta sequenceId="30048"] >> mds_mcm_await_active_tmr_expiry
>     <142>1 2017-04-26T21:16:52.955698+10:00 PL-4 osafimmnd 395 mds.log 
> [meta sequenceId="30049"] MCM:API: await_active_tmr expired for svc_id 
> = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>     <142>1 2017-04-26T21:16:52.955765+10:00 PL-4 osafimmnd 395 mds.log 
> [meta sequenceId="30059"] MCM:API: svc_down : await_active_tmr_expiry 
> : svc_id = CLMA(35) on DEST id = 65535 got DOWN for svc_id = CLMS(34) 
> on VDEST id = 16
>     <143>1 2017-04-26T21:16:52.955775+10:00 PL-4 osafimmnd 395 mds.log 
> [meta sequenceId="30060"] << mds_mcm_await_active_tmr_expiry
>
> I guess the other node-director services also receive the 2nd 
> NCSMDS_DOWN (Vdest), but they have no problem because of their own 
> logic (ckptnd, for example, likely checks cb->is_cpd_up == true), so I 
> thought it was an AMF problem until I saw the points from 
> Suryanarayana. So is the await_active_tmr working as expected?
>
> thanks,
> Minh
>
> On 26/04/17 17:11, A V Mahesh wrote:
>> Hi Minh Chau,
>>
>> On 4/26/2017 12:05 PM, minh chau wrote:
>>> amfnd will receive another NCSMDS_DOWN
>>
>> Do you mean amfnd is receiving NCSMDS_DOWN for the same amfd twice,
>> or that amfnd is receiving NCSMDS_DOWN for both the active amfd and the
>> standby amfd?
>>
>> -AVM
>>
>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>
>>> @Suryanarayana: I think this fix makes AMFND a bit defensive, but 
>>> let's see Mahesh's comments
>>> @Mahesh: If we get NCSMDS_DOWN, then there's no active to wait for, so 
>>> MDS should stop this timer?
>>>
>>>
>>> On 26/04/17 15:45, Suryanarayana.Garlapati wrote:
>>>> I guess this fix needs to be done at the MDS level rather than in 
>>>> AMFND, taking into consideration that the cluster has only two 
>>>> controllers.
>>>>
>>>> The timer that MDS starts should not be started (or, if already 
>>>> started, should be stopped) when DOWN is received for both amfds.
>>>>
>>>>
>>>>
>>>> On Wednesday 26 April 2017 10:53 AM, Minh Chau wrote:
>>>>> If the cluster goes into the headless state and waits up to 3 mins,
>>>>> which is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL,
>>>>> amfnd will receive another NCSMDS_DOWN, and then delete
>>>>> all buffered messages. As a result, the headless recovery
>>>>> is impossible because these buffered messages are deleted.
>>>>>
>>>>> Patch ignores the second NCSMDS_DOWN.
>>>>> ---
>>>>>   src/amf/amfnd/di.cc | 7 +++++++
>>>>>   1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>> index 627b31853..e06b9260d 100644
>>>>> --- a/src/amf/amfnd/di.cc
>>>>> +++ b/src/amf/amfnd/di.cc
>>>>> @@ -638,6 +638,13 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, AVND_EVT *evt) {
>>>>>       }
>>>>>     }
>>>>> +  // Ignore the second NCSMDS_DOWN which comes from timeout of
>>>>> +  // MDS_AWAIT_ACTIVE_TMR_VAL
>>>>> +  if (cb->is_avd_down == true) {
>>>>> +    TRACE_LEAVE();
>>>>> +    return rc;
>>>>> +  }
>>>>> +
>>>>>     m_AVND_CB_AVD_UP_RESET(cb);
>>>>>     cb->active_avd_adest = 0;
>>>>
>>>>
>>>
>>
>>
>
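
For reference, the effect of the guard in the patch above can be sketched
in isolation. This is only a simplified model, not the amfnd code: the
NodeDirector struct, HandleDirectorDown(), RunHeadlessSequence() and the
message names are hypothetical, and it assumes, as the patch description
says, that handling a DOWN event drops the buffered messages.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the amfnd control block (not the real AVND_CB).
struct NodeDirector {
  bool is_avd_down = false;           // plays the role of cb->is_avd_down
  std::vector<std::string> buffered;  // messages held for headless recovery
};

// Simplified DOWN handler. With guard == true it behaves like the patch:
// a DOWN that arrives while the director is already marked down is ignored.
// Returns true if the event was processed, false if it was ignored.
bool HandleDirectorDown(NodeDirector& node, bool guard) {
  if (guard && node.is_avd_down) {
    return false;  // second DOWN: keep the buffered messages
  }
  node.is_avd_down = true;
  node.buffered.clear();  // assumption: DOWN handling drops the buffer
  return true;
}

// Replays the sequence from the thread: DOWN, messages buffered while
// headless, then a second DOWN after the await-active timeout.
std::size_t RunHeadlessSequence(bool guard) {
  NodeDirector node;
  HandleDirectorDown(node, guard);
  assert(node.is_avd_down);
  node.buffered.push_back("su_oper_state");   // hypothetical message names
  node.buffered.push_back("si_assign_resp");
  HandleDirectorDown(node, guard);  // timeout-driven DOWN
  return node.buffered.size();
}
```

With the guard enabled the two messages buffered during headless survive
the second DOWN; without it the buffer is emptied, which is the situation
the patch description reports.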

