Hi Surya,

This is the current workaround in AMF, which is similar to how other services handle this case. We will soon revisit and tune the MDS functionality to match the RED down/timer for HEADLESS functionality.

Hi Minh,

This patch protects against this case, so it is OK with me as well.

-AVM

On 5/3/2017 12:16 PM, Suryanarayana.Garlapati wrote:
> Hi Nagendra,
>
> I guess it would be good, if the root cause is fixed not the side
> effects.
>
> Regards
>
> Surya
>
> On Wednesday 03 May 2017 12:12 PM, Nagendra Kumar wrote:
>> Hi Minh,
>> The patch looks ok to me.
>>
>> Thanks
>> -Nagu
>>
>>> -----Original Message-----
>>> From: minh chau [mailto:[email protected]]
>>> Sent: 28 April 2017 15:24
>>> To: A V Mahesh; Suryanarayana.Garlapati; [email protected];
>>> Nagendra Kumar; [email protected]; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: Re: [devel] [PATCH 1/1] amfnd: Ignore second NCSMDS_DOWN
>>> [#2436]
>>>
>>> Hi AMF maintainers,
>>>
>>> While waiting Mahesh checks whether another NCSMDS_DOWN(Vdest)
>>> should come 3 mins after headless, can we have a look at this patch?
>>> I think we need it to make AMFND safe.
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 27/04/17 12:26, A V Mahesh wrote:
>>>> Hi Minh chau,
>>>>
>>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>>> Vdest
>>>> I don't seen unnatural events from MDS, as amfnd might have subsided
>>>> for them.
>>>> Currently transport (MDS) functionality doesn't provide event
>>>> differently for headless or non-headless and it is completely
>>>> invisible to MDS.
>>>>
>>>> I will go through this AMF case and will get back to you.
>>>>
>>>> -AVM
>>>>
>>>> On 4/26/2017 5:43 PM, minh chau wrote:
>>>>> Hi Mahesh,
>>>>>
>>>>> The sequence is going like this:
>>>>>
>>>>> - Stop both SCs, amfnd receives 2 NCSMDS_DOWN, one is Adest, one is
>>>>> Vdest. I guess at this point MDS tells that both standby and active
>>>>> amfd are down?
>>>>> 2017-04-26 21:13:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>>
>>>>> - Leave cluster in headless about 3 mins, amfnd receives another
>>>>> NCSMDS_DOWN with Vdest, so MDS is telling no active amfd again?
>>>>> syslog:
>>>>> 2017-04-26 21:16:52 PL-4 osafamfnd[413]: WA AMF director unexpectedly crashed
>>>>>
>>>>> mds log:
>>>>> <143>1 2017-04-26T21:16:52.873168+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9881"] >> mds_mcm_await_active_tmr_expiry
>>>>> <142>1 2017-04-26T21:16:52.873183+10:00 PL-4 osafamfnd 413 mds.log [meta sequenceId="9882"] MCM:API: await_active_tmr expired for svc_id = AVND(13) Subscribed to svc_id = AVD(12) on VDEST id = 1
>>>>> <143>1 2017-04-26T21:16:52.9453+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="938"] >> mds_mcm_await_active_tmr_expiry
>>>>> <142>1 2017-04-26T21:16:52.945309+10:00 PL-4 osafclmna 405 mds.log [meta sequenceId="939"] MCM:API: await_active_tmr expired for svc_id = CLMNA(36) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>> <142>1 2017-04-26T21:16:52.945452+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="620"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = SMFND(31) on DEST id = 65535 got DOWN for svc_id = SMFD(30) on VDEST id = 15
>>>>> <143>1 2017-04-26T21:16:52.945462+10:00 PL-4 osafsmfnd 454 mds.log [meta sequenceId="621"] << mds_mcm_await_active_tmr_expiry
>>>>> <143>1 2017-04-26T21:16:52.945938+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1547"] >> mds_mcm_await_active_tmr_expiry
>>>>> <142>1 2017-04-26T21:16:52.945947+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1548"] MCM:API: await_active_tmr expired for svc_id = CPND(17) Subscribed to svc_id = CPD(16) on VDEST id = 9
>>>>> <142>1 2017-04-26T21:16:52.946064+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1558"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CPND(17) on DEST id = 65535 got DOWN for svc_id = CPD(16) on VDEST id = 9
>>>>> <143>1 2017-04-26T21:16:52.946074+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1559"] << mds_mcm_await_active_tmr_expiry
>>>>> <143>1 2017-04-26T21:16:52.94611+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1562"] >> mds_mcm_await_active_tmr_expiry
>>>>> <142>1 2017-04-26T21:16:52.946118+10:00 PL-4 osafckptnd 432 mds.log [meta sequenceId="1563"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>> <143>1 2017-04-26T21:16:52.955692+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30048"] >> mds_mcm_await_active_tmr_expiry
>>>>> <142>1 2017-04-26T21:16:52.955698+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30049"] MCM:API: await_active_tmr expired for svc_id = CLMA(35) Subscribed to svc_id = CLMS(34) on VDEST id = 16
>>>>> <142>1 2017-04-26T21:16:52.955765+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30059"] MCM:API: svc_down : await_active_tmr_expiry : svc_id = CLMA(35) on DEST id = 65535 got DOWN for svc_id = CLMS(34) on VDEST id = 16
>>>>> <143>1 2017-04-26T21:16:52.955775+10:00 PL-4 osafimmnd 395 mds.log [meta sequenceId="30060"] << mds_mcm_await_active_tmr_expiry
>>>>> I guess the other node-director services also receive the 2nd
>>>>> NCSMDS_DOWN(Vdest), but those services have no problem because of
>>>>> service's logic (or likely ckptnd checks cb->is_cpd_up == true), so I
>>>>> thought it would be AMF problem, until I see the points from
>>>>> Suryanarayana. So the await_active_tmr is working as expected?
>>>>>
>>>>> thanks,
>>>>> Minh
>>>>>
>>>>> On 26/04/17 17:11, A V Mahesh wrote:
>>>>>> Hi Minh Chau,
>>>>>>
>>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>>> amfnd will receive another NCSMDS_DOWN
>>>>>> you mean amfnd is receiving NCSMDS_DOWN for same amfd twice ?
>>>>>> or amfnd is receiving NCSMDS_DOWN for both active amfd & standby
>>>>>> amfd ?
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 4/26/2017 12:05 PM, minh chau wrote:
>>>>>>> @Suryanarayana: I think this fix makes AMFND a bit defensive, but
>>>>>>> let's see Mahesh's comments
>>>>>>> @Mahesh: If getting NCSMDS_DOWN, then there's no active to wait, so
>>>>>>> MDS should stop this timer?
>>>>>>>
>>>>>>> On 26/04/17 15:45, Suryanarayana.Garlapati wrote:
>>>>>>>> Might be i guess this fix needs to be done at the MDS level, not
>>>>>>>> at the AMFND, taking into consideration that the cluster
>>>>>>>> has only two Controllers.
>>>>>>>>
>>>>>>>> Timer which is getting started at MDS should not be started(if
>>>>>>>> started should be stopped) in case of getting the down for both of
>>>>>>>> the amfd's.
>>>>>>>>
>>>>>>>> On Wednesday 26 April 2017 10:53 AM, Minh Chau wrote:
>>>>>>>>> If cluster goes into headless stage and wait up to 3 mins which
>>>>>>>>> is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL, amfnd will
>>>>>>>>> receive another NCSMDS_DOWN, and then delete all buffered
>>>>>>>>> messages. As a result, the headless recovery is impossible
>>>>>>>>> because these buffered messages are deleted.
>>>>>>>>>
>>>>>>>>> Patch ignores the second NCSMDS_DOWN.
>>>>>>>>> ---
>>>>>>>>>  src/amf/amfnd/di.cc | 7 +++++++
>>>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>>>> index 627b31853..e06b9260d 100644
>>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>>>> @@ -638,6 +638,13 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB *cb, AVND_EVT *evt) {
>>>>>>>>>      }
>>>>>>>>>    }
>>>>>>>>> +  // Ignore the second NCSMDS_DOWN which comes from timeout of
>>>>>>>>> +  // MDS_AWAIT_ACTIVE_TMR_VAL
>>>>>>>>> +  if (cb->is_avd_down == true) {
>>>>>>>>> +    TRACE_LEAVE();
>>>>>>>>> +    return rc;
>>>>>>>>> +  }
>>>>>>>>> +
>>>>>>>>>    m_AVND_CB_AVD_UP_RESET(cb);
>>>>>>>>>    cb->active_avd_adest = 0;

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
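
For anyone reading the thread later, the guard discussed above can be illustrated with a small standalone model. This is only a sketch and not the amfnd code: `ToyNodeDirector`, `buffer_message`, `on_director_down`, `on_director_up` and `replay_headless_sequence` are hypothetical names, and only the `is_avd_down` flag is taken from the patch. It assumes, as the commit message describes, that handling a director-down event discards the buffered messages.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical toy model of a node director; not the real AVND_CB.
struct ToyNodeDirector {
  bool is_avd_down = false;          // flag name borrowed from the patch
  std::deque<std::string> buffered;  // messages held while headless

  // Returns true if the down event was processed, false if it was ignored.
  bool on_director_down() {
    if (is_avd_down) {
      // Second down (e.g. from the await-active timer expiry): ignore it,
      // so the messages buffered during headless survive.
      return false;
    }
    buffered.clear();  // assumed behaviour of the first down handling
    is_avd_down = true;
    return true;
  }

  // Messages are only buffered while the director is down.
  void buffer_message(std::string msg) {
    if (is_avd_down) buffered.push_back(std::move(msg));
  }

  void on_director_up() { is_avd_down = false; }
};

// Replays the sequence from the thread and returns how many buffered
// messages are left after the second down event.
std::size_t replay_headless_sequence() {
  ToyNodeDirector nd;
  nd.on_director_down();       // both SCs stopped
  assert(nd.is_avd_down);
  nd.buffer_message("msg-1");  // buffered during headless
  nd.buffer_message("msg-2");
  nd.on_director_down();       // second down, about 3 minutes later
  return nd.buffered.size();
}
```

In this model `replay_headless_sequence()` returns 2, because the second down event is ignored. Without the early return the buffer would be cleared again, which is the failure described in the commit message.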
