Hi Nagu, The patch has already had a variable to memorize the track start/stop status. Wouldn't it be nice to avoid a node reboot (as consequence of AMFD's exit) due to TRY_AGAIN/TIMEOUT which could be classified as temporary service denial? And retry to stop tracking until coming back to as active one.
Thanks, Minh On 30/11/16 17:25, Nagendra Kumar wrote: > Hi Minh, > I think the patch could be made simpler by again try to stop CLM > track(keep a variable to indicate it wasn't stopped before) when AMFD becomes > Standby. If it doesn't get stopped again, then exit Amfd as it is kind of > double fault. > > Thanks > -Nagu > >> -----Original Message----- >> From: minh chau [mailto:[email protected]] >> Sent: 30 November 2016 10:40 >> To: praveen malviya;[email protected]; Nagendra Kumar; >> [email protected];[email protected] >> Cc:[email protected] >> Subject: Re: [PATCH 1 of 1] AMFD: Do not let non-active AMFD receives CLM >> track callback [#2141] >> >> Hi Praveen, >> >> I have seen a limitation of using the same job queue now, that it will >> increase >> the possibility of loss of IMM update job which is causing incorrect headless >> recovery. The timer is one alternative. We had spawned a thread to deal with >> TRY_AGAIN/TIMEOUT in case of start cluster tracking, so stop cluster >> tracking should be handled in a similar way. >> I will create another ticket and assign it to myself to introduce a thread >> handler for using CLM service in AMF. Some background/similar >> tickets: #1609, #1531. For the scope of this ticket, the floated patch + a >> correction from your comment that standby AMFD responds CLM callback, >> they should be enough to avoid the reported problem in #2141. Do you >> agree? >> >> Thanks, >> Minh >> >> >> On 21/11/16 22:37, praveen malviya wrote: >>> Hi Minh, >>> >>> If the intention is to use same job queue then, I think, one problem >>> will come: >>> "Currently AMFD empties job queue when AMFD becomes quiesced in >>> avd_mds_qsd_role_evh(). Most probably before emptying, AMFD will try >>> to stop CLM tracking and now if the API again fails then this failure >>> cannot be handled as AMFD cannot be blocked in avd_mds_qsd_role_evh() >>> for long time. And then AMFD will empty the job queue." >>> >>> Can we think of some way to avoid above problem? One solution, without >>> using job queue, could be: When CLM track fails, run a timer of 1 >>> seconds. When this timer expires, check role of AMFD. If the role is >>> still standby and Track stop is pending then now AMFD will give a try. >>> If it again fails then rerun the timer based on the value of return >>> code. We can decide how many times this can be tried. >>> >>> One more thing. on standby controller if clm track has failed then it >>> should respond for any track callback that it receives without taking >>> any action. This needs to be done irrespective of the approach of the >>> solution. >>> >>> Thanks, >>> Praveen >>> >>> On 21-Nov-16 4:38 PM, minh chau wrote: >>>> Hi Praveen, >>>> >>>> I am thinking if creating a CLM job (likely IMM update, NTF send job) in >>>> order to stop tracking at non-active AMFD. >>>> Should it be a better version than this patch that AMFD will still have >>>> chance to stop tracking until being promoted back to active role? >>>> >>>> Thanks, >>>> Minh >>>> >>>> On 18/11/16 11:38, minh chau wrote: >>>>> Hi Praveen, >>>>> >>>>> It has been seen once in swapping 2N Opensaf SI. It was just an error >>>>> indicating that AMFD failed to stop track callback, after that the >>>>> si-swap also succeeded. CLM service could be busy during switch-over >>>>> but AMF (as well as other services) is supposed to handle that return >>>>> code properly as AMFD has done in track start. Further testing, it >>>>> exposed a problem due to clm track callback received on standby AMFD. >>>>> >>>>> AMFD can not be too busy just try to stop clm track. First option, >>>>> AMFD can start another thread which just tries to stop track callback, >>>>> but that's likely to generate more problems where this standby AMFD is >>>>> promoted back to active in the meantime that thread of stopping track >>>>> is on going. At that time, main thread will be starting track and >>>>> another thread is trying to stop track. >>>>> >>>>> Second option as in the patch, AMFD retries to stop track where >>>>> problem happens in callback. At the time standby AMFD switching back >>>>> to active, we stop and start track again, but it seems no point of >>>>> doing this. I hope it's less impact on existing behavior, do you see >>>>> any problems? >>>>> >>>>> Regarding standby controller with locked/disabled CLM, I think it is >>>>> mentioned (in #1828 or some where else?) that this matter requires >>>>> more thoughts as whether we want Ha role assigned to AMFD in non >>>>> member node. It seems to beyond this ticket. >>>>> >>>>> Thanks, >>>>> Minh >>>>> >>>>> On 17/11/16 22:03, praveen malviya wrote: >>>>>> Hi Minh, >>>>>> >>>>>> We have not seen this issue earlier. Why CLMA is returning timeout >>>>>> where it is busy? >>>>>> If we are allowing Standby controller to continue tracking the Node, >>>>>> then I think (though not fully sure), it will increase one more >>>>>> client for validation step and this client does not exist locally. >>>>>> Also we need to evaluate the situation where Standby controller is >>>>>> CLM locked or disabled. >>>>>> >>>>>> Thanks, >>>>>> Praveen >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 17-Nov-16 3:44 AM, minh chau wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> Has anyone had chance to review this patch? >>>>>>> >>>>>>> Thanks, >>>>>>> Minh >>>>>>> >>>>>>> On 14/11/16 15:27, Minh Hon Chau wrote: >>>>>>>> osaf/services/saf/amf/amfd/clm.cc | 37 >>>>>>>> +++++++++++++++++++++++--------- >>>>>>>> osaf/services/saf/amf/amfd/include/cb.h | 1 + >>>>>>>> osaf/services/saf/amf/amfd/role.cc | 16 +++++++------- >>>>>>>> 3 files changed, 35 insertions(+), 19 deletions(-) >>>>>>>> >>>>>>>> >>>>>>>> In controller failover/switchover, sometimes active AMFD fails to >>>>>>>> stop >>>>>>>> CLM track callback. Therefore, when this AMFD become standby, >>>>>>>> AMFD can >>>>>>>> continue receiving CLM track callback and trigger the operations >>>>>>>> which >>>>>>>> should only be executed in active AMFD. >>>>>>>> >>>>>>>> diff --git a/osaf/services/saf/amf/amfd/clm.cc >>>>>>>> b/osaf/services/saf/amf/amfd/clm.cc >>>>>>>> --- a/osaf/services/saf/amf/amfd/clm.cc >>>>>>>> +++ b/osaf/services/saf/amf/amfd/clm.cc >>>>>>>> @@ -219,7 +219,13 @@ static void clm_track_cb(const SaClmClus >>>>>>>> LOG_ER("ClmTrackCallback received in error"); >>>>>>>> goto done; >>>>>>>> } >>>>>>>> - >>>>>>>> + if (avd_cb->avail_state_avd != SA_AMF_HA_ACTIVE) { >>>>>>>> + if (avd_cb->is_clm_track_started == true) { >>>>>>>> + LOG_NO("Retry to stop clm track with AMFD state(%d)", >>>>>>>> avd_cb->avail_state_avd); >>>>>>>> + avd_clm_track_stop(); >>>>>>>> + } >>>>>>>> + goto done; >>>>>>>> + } >>>>>>>> /* >>>>>>>> ** The CLM cluster can be larger than the AMF cluster thus >>>>>>>> it is >>>>>>>> not an >>>>>>>> ** error if the corresponding AMF node cannot be found. >>>>>>>> @@ -394,6 +400,7 @@ SaAisErrorT avd_clm_init(AVD_CL_CB* cb) >>>>>>>> cb->clmHandle = 0; >>>>>>>> cb->clm_sel_obj = 0; >>>>>>>> + cb->is_clm_track_started = false; >>>>>>>> TRACE_ENTER(); >>>>>>>> /* >>>>>>>> * TODO: This CLM initialization thread can be re-factored >>>>>>>> @@ -453,6 +460,8 @@ SaAisErrorT avd_clm_track_start(void) >>>>>>>> } else { >>>>>>>> LOG_ER("Failed to start cluster tracking %u", error); >>>>>>>> } >>>>>>>> + } else { >>>>>>>> + avd_cb->is_clm_track_started = true; >>>>>>>> } >>>>>>>> TRACE_LEAVE(); >>>>>>>> return error; >>>>>>>> @@ -460,17 +469,23 @@ SaAisErrorT avd_clm_track_start(void) >>>>>>>> SaAisErrorT avd_clm_track_stop(void) >>>>>>>> { >>>>>>>> - SaAisErrorT error = SA_AIS_OK; >>>>>>>> + SaAisErrorT error = SA_AIS_OK; >>>>>>>> + TRACE_ENTER(); >>>>>>>> + error = saClmClusterTrackStop(avd_cb->clmHandle); >>>>>>>> + if (error != SA_AIS_OK) { >>>>>>>> + if (error == SA_AIS_ERR_TRY_AGAIN || error == >>>>>>>> SA_AIS_ERR_TIMEOUT || >>>>>>>> + error == SA_AIS_ERR_UNAVAILABLE) { >>>>>>>> + LOG_WA("Failed to stop cluster tracking %u", error); >>>>>>>> + } else { >>>>>>>> + LOG_ER("Failed to stop cluster tracking %u", error); >>>>>>>> + } >>>>>>>> + } else { >>>>>>>> + TRACE("Sucessfully stops cluster tracking"); >>>>>>>> + avd_cb->is_clm_track_started = false; >>>>>>>> + } >>>>>>>> - TRACE_ENTER(); >>>>>>>> - error = saClmClusterTrackStop(avd_cb->clmHandle); >>>>>>>> - if (SA_AIS_OK != error) >>>>>>>> - LOG_ER("Failed to stop cluster tracking %u", >>>>>>>> error); >>>>>>>> - else >>>>>>>> - TRACE("Sucessfully stops cluster tracking"); >>>>>>>> - >>>>>>>> - TRACE_LEAVE(); >>>>>>>> - return error; >>>>>>>> + TRACE_LEAVE(); >>>>>>>> + return error; >>>>>>>> } >>>>>>>> void clm_node_terminate(AVD_AVND *node) >>>>>>>> diff --git a/osaf/services/saf/amf/amfd/include/cb.h >>>>>>>> b/osaf/services/saf/amf/amfd/include/cb.h >>>>>>>> --- a/osaf/services/saf/amf/amfd/include/cb.h >>>>>>>> +++ b/osaf/services/saf/amf/amfd/include/cb.h >>>>>>>> @@ -210,6 +210,7 @@ typedef struct cl_cb_tag { >>>>>>>> /* Clm stuff */ >>>>>>>> std::atomic<SaClmHandleT> clmHandle; >>>>>>>> std::atomic<SaSelectionObjectT> clm_sel_obj; >>>>>>>> + bool is_clm_track_started; >>>>>>>> bool fully_initialized; >>>>>>>> bool swap_switch; /* true - In middle of role switch. */ >>>>>>>> diff --git a/osaf/services/saf/amf/amfd/role.cc >>>>>>>> b/osaf/services/saf/amf/amfd/role.cc >>>>>>>> --- a/osaf/services/saf/amf/amfd/role.cc >>>>>>>> +++ b/osaf/services/saf/amf/amfd/role.cc >>>>>>>> @@ -1050,9 +1050,7 @@ uint32_t >> amfd_switch_actv_qsd(AVD_CL_CB >>>>>>>> /* Mark AVD as Quiesced. */ >>>>>>>> cb->avail_state_avd = SA_AMF_HA_QUIESCED; >>>>>>>> >>>>>>>> - if (avd_clm_track_stop() != SA_AIS_OK) { >>>>>>>> - LOG_ER("ClmTrack stop failed"); >>>>>>>> - } >>>>>>>> + avd_clm_track_stop(); >>>>>>>> /* Go ahead and set mds role as already the NCS SU has been >>>>>>>> switched */ >>>>>>>> if (NCSCC_RC_SUCCESS != (rc = avd_mds_set_vdest_role(cb, >>>>>>>> SA_AMF_HA_QUIESCED))) { >>>>>>>> @@ -1260,11 +1258,13 @@ uint32_t >> amfd_switch_stdby_actv(AVD_CL_C >>>>>>>> if (NCSCC_RC_SUCCESS != >> avd_rde_set_role(SA_AMF_HA_ACTIVE)) { >>>>>>>> LOG_ER("rde role change failed from stdy -> Active"); >>>>>>>> } >>>>>>>> - >>>>>>>> - if(avd_clm_track_start() != SA_AIS_OK) { >>>>>>>> - LOG_ER("Switch Standby --> Active, clm track start >>>>>>>> failed"); >>>>>>>> - avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, >>>>>>>> SA_AMF_HA_ACTIVE); >>>>>>>> - return NCSCC_RC_FAILURE; >>>>>>>> + // reuse clm track start >>>>>>>> + if (avd_cb->is_clm_track_started == false) { >>>>>>>> + if(avd_clm_track_start() != SA_AIS_OK) { >>>>>>>> + LOG_ER("Switch Standby --> Active, clm track start >>>>>>>> failed"); >>>>>>>> + avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, >>>>>>>> SA_AMF_HA_ACTIVE); >>>>>>>> + return NCSCC_RC_FAILURE; >>>>>>>> + } >>>>>>>> } >>>>>>>> /* Send the message to other avd for role change rsp as >>>>>>>> success */ >>>>>>>> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
