Hi Nagendra,

I guess it would be good if the root cause is fixed, not the side effects.
Regards
Surya

On Wednesday 03 May 2017 12:12 PM, Nagendra Kumar wrote:
> Hi Minh,
> The patch looks ok to me.
>
> Thanks
> -Nagu
>
>> -----Original Message-----
>> From: minh chau [mailto:[email protected]]
>> Sent: 28 April 2017 15:24
>> To: A V Mahesh; Suryanarayana.Garlapati; [email protected]; Nagendra Kumar; [email protected]; Praveen Malviya
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 1/1] amfnd: Ignore second NCSMDS_DOWN [#2436]
>>
>> Hi AMF maintainers,
>>
>> While waiting for Mahesh to check whether another NCSMDS_DOWN (Vdest)
>> should come 3 mins after headless, can we have a look at this patch?
>> I think we need it to make AMFND safe.
>>
>> Thanks,
>> Minh
>>
>> On 27/04/17 12:26, A V Mahesh wrote:
>>> Hi Minh chau,
>>>
>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>> Vdest
>>> I don't see unnatural events from MDS, as amfnd might have subscribed
>>> for them.
>>> Currently the transport (MDS) functionality doesn't provide events
>>> differently for headless or non-headless, and headless is completely
>>> invisible to MDS.
>>>
>>> I will go through this AMF case and will get back to you.
>>>
>>> -AVM
>>>
>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>> Hi Mahesh,
>>>>
>>>> The sequence goes like this:
>>>>
>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>> Vdest. I guess at this point MDS is telling us that both the standby
>>>> and the active amfd are down?
>>>>
>>>> 2017-04-26 21:13:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>
>>>> - Leave the cluster headless for about 3 mins, amfnd receives another
>>>> NCSMDS_DOWN with the Vdest, so MDS is telling us there is no active
>>>> amfd again?
>>>>
>>>> syslog:
>>>> 2017-04-26 21:16:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>
>>>> mds log:
>>>> <143>1 2017-04-26T21:16:52.873168+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9881"] >> mds_mcm_await_active_tmr_expiry
>>>> <142>1 2017-04-26T21:16:52.873183+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9882"] MCM:API: await_active_tmr expired for svc_id = AVND(13) Subscribed to svc_id = AVD(12) on VDEST id = 1
>>>> <143>1 2017-04-26T21:16:52.9453+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="938"] >> mds_mcm_await_active_tmr_expiry
>>>> <142>1 2017-04-26T21:16:52.945309+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="939"] MCM:API: await_active_tmr expired for svc_id = CLMNA(36) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>> <142>1 2017-04-26T21:16:52.945452+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="620"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = SMFND(31) on DEST id = 65535 got DOWN for svc_id = SMFD(30) on VDEST id = 15
>>>> <143>1 2017-04-26T21:16:52.945462+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="621"] << mds_mcm_await_active_tmr_expiry
>>>> <143>1 2017-04-26T21:16:52.945938+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1547"] >> mds_mcm_await_active_tmr_expiry
>>>> <142>1 2017-04-26T21:16:52.945947+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1548"] MCM:API: await_active_tmr expired for svc_id = CPND(17) Subscribed to svc_id = CPD(16) on VDEST id = 9
>>>> <142>1 2017-04-26T21:16:52.946064+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1558"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CPND(17) on DEST id = 65535 got DOWN for svc_id = CPD(16) on VDEST id = 9
>>>> <143>1 2017-04-26T21:16:52.946074+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1559"] << mds_mcm_await_active_tmr_expiry
>>>> <143>1 2017-04-26T21:16:52.94611+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1562"] >> mds_mcm_await_active_tmr_expiry
>>>> <142>1 2017-04-26T21:16:52.946118+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1563"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>> <143>1 2017-04-26T21:16:52.955692+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30048"] >> mds_mcm_await_active_tmr_expiry
>>>> <142>1 2017-04-26T21:16:52.955698+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30049"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>> <142>1 2017-04-26T21:16:52.955765+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30059"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CLMA(35) on DEST id = 65535 got DOWN for svc_id = CLMS(34) on VDEST id = 16
>>>> <143>1 2017-04-26T21:16:52.955775+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30060"] << mds_mcm_await_active_tmr_expiry
>>>>
>>>> I guess the other node-director services also receive the 2nd
>>>> NCSMDS_DOWN (Vdest), but those services have no problem because of
>>>> their own logic (e.g. ckptnd likely checks cb->is_cpd_up == true), so
>>>> I thought it was an AMF problem until I saw the points from
>>>> Suryanarayana. So the await_active_tmr is working as expected?
>>>>
>>>> thanks,
>>>> Minh
>>>>
>>>> On 26/04/17 17:11, A V Mahesh wrote:
>>>>> Hi Minh Chau,
>>>>>
>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>> amfnd will receive another NCSMDS_DOWN
>>>>> Do you mean amfnd is receiving NCSMDS_DOWN for the same amfd twice,
>>>>> or is amfnd receiving NCSMDS_DOWN for both the active amfd & standby
>>>>> amfd?
>>>>>
>>>>> -AVM
>>>>>
>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>> @Suryanarayana: I think this fix makes AMFND a bit defensive, but
>>>>>> let's see Mahesh's comments.
>>>>>> @Mahesh: If we are getting NCSMDS_DOWN, then there is no active to
>>>>>> wait for, so MDS should stop this timer?
>>>>>>
>>>>>> On 26/04/17 15:45, Suryanarayana.Garlapati wrote:
>>>>>>> I guess this fix might need to be done at the MDS level, not in
>>>>>>> AMFND, taking into consideration that the cluster has only two
>>>>>>> controllers.
>>>>>>>
>>>>>>> The timer that gets started at MDS should not be started (and
>>>>>>> should be stopped if already started) when the down is received
>>>>>>> for both of the amfd's.
>>>>>>>
>>>>>>> On Wednesday 26 April 2017 10:53 AM, Minh Chau wrote:
>>>>>>>> If the cluster goes into the headless stage and stays there for up
>>>>>>>> to 3 mins, which is currently the timeout of
>>>>>>>> MDS_AWAIT_ACTIVE_TMR_VAL, amfnd will receive another NCSMDS_DOWN
>>>>>>>> and then delete all buffered messages. As a result, headless
>>>>>>>> recovery becomes impossible because these buffered messages are
>>>>>>>> deleted.
>>>>>>>>
>>>>>>>> The patch ignores the second NCSMDS_DOWN.
>>>>>>>> ---
>>>>>>>>  src/amf/amfnd/di.cc | 7 +++++++
>>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>>> index 627b31853..e06b9260d 100644
>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>>> @@ -638,6 +638,13 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, AVND_EVT *evt) {
>>>>>>>>      }
>>>>>>>>    }
>>>>>>>> +  // Ignore the second NCSMDS_DOWN which comes from timeout of
>>>>>>>> +  // MDS_AWAIT_ACTIVE_TMR_VAL
>>>>>>>> +  if (cb->is_avd_down == true) {
>>>>>>>> +    TRACE_LEAVE();
>>>>>>>> +    return rc;
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>>    m_AVND_CB_AVD_UP_RESET(cb);
>>>>>>>>    cb->active_avd_adest = 0;
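
For readers following the thread, below is a minimal, self-contained C++ sketch (not OpenSAF code) of the behaviour the patch relies on: the first NCSMDS_DOWN marks the director as down, and the second DOWN, generated when the MDS await-active timer expires roughly 3 minutes into headless, is ignored so the messages buffered for headless recovery are not deleted. The names used here (DirectorState, handle_director_down, buffered_msgs, the sample message string) are illustrative assumptions and do not exist in the OpenSAF tree; only cb->is_avd_down in the patch above is real.

// director_down_sketch.cc -- illustrative only, not part of OpenSAF.
// Models the duplicate NCSMDS_DOWN handling discussed in this thread.

#include <iostream>
#include <string>
#include <vector>

struct DirectorState {
  bool is_avd_down = false;                // mirrors cb->is_avd_down
  std::vector<std::string> buffered_msgs;  // messages kept for replay
};

// Handler invoked for every NCSMDS_DOWN reported against the AMF director.
void handle_director_down(DirectorState *cb, const std::string &reason) {
  if (cb->is_avd_down) {
    // Second (or later) DOWN: the director is already known to be gone,
    // keep the buffered messages for replay after headless recovery.
    std::cout << "ignoring duplicate DOWN: " << reason << "\n";
    return;
  }
  cb->is_avd_down = true;
  std::cout << "first DOWN: " << reason << ", " << cb->buffered_msgs.size()
            << " buffered message(s) preserved for headless recovery\n";
}

int main() {
  DirectorState cb;
  cb.buffered_msgs.push_back("pending assignment response");

  // Sequence from the logs above: DOWN when both SCs stop, then another
  // DOWN (Vdest) when MDS_AWAIT_ACTIVE_TMR_VAL expires ~3 minutes later.
  handle_director_down(&cb, "both SCs stopped (headless)");
  handle_director_down(&cb, "await_active_tmr expired after ~3 min");
  return 0;
}

If, as Suryanarayana suggests, the fix were instead made at the MDS level, the equivalent idea would be to not start (or to stop) the await-active timer once DOWN has been seen for both directors, so the second NCSMDS_DOWN is never generated in the first place.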
