Hi Minh,
I think the patch could be made simpler by again try to stop CLM
track(keep a variable to indicate it wasn't stopped before) when AMFD becomes
Standby. If it doesn't get stopped again, then exit Amfd as it is kind of
double fault.
Thanks
-Nagu
> -----Original Message-----
> From: minh chau [mailto:[email protected]]
> Sent: 30 November 2016 10:40
> To: praveen malviya; [email protected]; Nagendra Kumar;
> [email protected]; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 1 of 1] AMFD: Do not let non-active AMFD receives CLM
> track callback [#2141]
>
> Hi Praveen,
>
> I have seen a limitation of using the same job queue now, that it will
> increase
> the possibility of loss of IMM update job which is causing incorrect headless
> recovery. The timer is one alternative. We had spawned a thread to deal with
> TRY_AGAIN/TIMEOUT in case of start cluster tracking, so stop cluster
> tracking should be handled in a similar way.
> I will create another ticket and assign it to myself to introduce a thread
> handler for using CLM service in AMF. Some background/similar
> tickets: #1609, #1531. For the scope of this ticket, the floated patch + a
> correction from your comment that standby AMFD responds CLM callback,
> they should be enough to avoid the reported problem in #2141. Do you
> agree?
>
> Thanks,
> Minh
>
>
> On 21/11/16 22:37, praveen malviya wrote:
> > Hi Minh,
> >
> > If the intention is to use same job queue then, I think, one problem
> > will come:
> > "Currently AMFD empties job queue when AMFD becomes quiesced in
> > avd_mds_qsd_role_evh(). Most probably before emptying, AMFD will try
> > to stop CLM tracking and now if the API again fails then this failure
> > cannot be handled as AMFD cannot be blocked in avd_mds_qsd_role_evh()
> > for long time. And then AMFD will empty the job queue."
> >
> > Can we think of some way to avoid above problem? One solution, without
> > using job queue, could be: When CLM track fails, run a timer of 1
> > seconds. When this timer expires, check role of AMFD. If the role is
> > still standby and Track stop is pending then now AMFD will give a try.
> > If it again fails then rerun the timer based on the value of return
> > code. We can decide how many times this can be tried.
> >
> > One more thing. on standby controller if clm track has failed then it
> > should respond for any track callback that it receives without taking
> > any action. This needs to be done irrespective of the approach of the
> > solution.
> >
> > Thanks,
> > Praveen
> >
> > On 21-Nov-16 4:38 PM, minh chau wrote:
> >> Hi Praveen,
> >>
> >> I am thinking if creating a CLM job (likely IMM update, NTF send job) in
> >> order to stop tracking at non-active AMFD.
> >> Should it be a better version than this patch that AMFD will still have
> >> chance to stop tracking until being promoted back to active role?
> >>
> >> Thanks,
> >> Minh
> >>
> >> On 18/11/16 11:38, minh chau wrote:
> >>> Hi Praveen,
> >>>
> >>> It has been seen once in swapping 2N Opensaf SI. It was just an error
> >>> indicating that AMFD failed to stop track callback, after that the
> >>> si-swap also succeeded. CLM service could be busy during switch-over
> >>> but AMF (as well as other services) is supposed to handle that return
> >>> code properly as AMFD has done in track start. Further testing, it
> >>> exposed a problem due to clm track callback received on standby AMFD.
> >>>
> >>> AMFD can not be too busy just try to stop clm track. First option,
> >>> AMFD can start another thread which just tries to stop track callback,
> >>> but that's likely to generate more problems where this standby AMFD is
> >>> promoted back to active in the meantime that thread of stopping track
> >>> is on going. At that time, main thread will be starting track and
> >>> another thread is trying to stop track.
> >>>
> >>> Second option as in the patch, AMFD retries to stop track where
> >>> problem happens in callback. At the time standby AMFD switching back
> >>> to active, we stop and start track again, but it seems no point of
> >>> doing this. I hope it's less impact on existing behavior, do you see
> >>> any problems?
> >>>
> >>> Regarding standby controller with locked/disabled CLM, I think it is
> >>> mentioned (in #1828 or some where else?) that this matter requires
> >>> more thoughts as whether we want Ha role assigned to AMFD in non
> >>> member node. It seems to beyond this ticket.
> >>>
> >>> Thanks,
> >>> Minh
> >>>
> >>> On 17/11/16 22:03, praveen malviya wrote:
> >>>> Hi Minh,
> >>>>
> >>>> We have not seen this issue earlier. Why CLMA is returning timeout
> >>>> where it is busy?
> >>>> If we are allowing Standby controller to continue tracking the Node,
> >>>> then I think (though not fully sure), it will increase one more
> >>>> client for validation step and this client does not exist locally.
> >>>> Also we need to evaluate the situation where Standby controller is
> >>>> CLM locked or disabled.
> >>>>
> >>>> Thanks,
> >>>> Praveen
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 17-Nov-16 3:44 AM, minh chau wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Has anyone had chance to review this patch?
> >>>>>
> >>>>> Thanks,
> >>>>> Minh
> >>>>>
> >>>>> On 14/11/16 15:27, Minh Hon Chau wrote:
> >>>>>> osaf/services/saf/amf/amfd/clm.cc | 37
> >>>>>> +++++++++++++++++++++++---------
> >>>>>> osaf/services/saf/amf/amfd/include/cb.h | 1 +
> >>>>>> osaf/services/saf/amf/amfd/role.cc | 16 +++++++-------
> >>>>>> 3 files changed, 35 insertions(+), 19 deletions(-)
> >>>>>>
> >>>>>>
> >>>>>> In controller failover/switchover, sometimes active AMFD fails to
> >>>>>> stop
> >>>>>> CLM track callback. Therefore, when this AMFD become standby,
> >>>>>> AMFD can
> >>>>>> continue receiving CLM track callback and trigger the operations
> >>>>>> which
> >>>>>> should only be executed in active AMFD.
> >>>>>>
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> b/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> --- a/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> +++ b/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> @@ -219,7 +219,13 @@ static void clm_track_cb(const SaClmClus
> >>>>>> LOG_ER("ClmTrackCallback received in error");
> >>>>>> goto done;
> >>>>>> }
> >>>>>> -
> >>>>>> + if (avd_cb->avail_state_avd != SA_AMF_HA_ACTIVE) {
> >>>>>> + if (avd_cb->is_clm_track_started == true) {
> >>>>>> + LOG_NO("Retry to stop clm track with AMFD state(%d)",
> >>>>>> avd_cb->avail_state_avd);
> >>>>>> + avd_clm_track_stop();
> >>>>>> + }
> >>>>>> + goto done;
> >>>>>> + }
> >>>>>> /*
> >>>>>> ** The CLM cluster can be larger than the AMF cluster thus
> >>>>>> it is
> >>>>>> not an
> >>>>>> ** error if the corresponding AMF node cannot be found.
> >>>>>> @@ -394,6 +400,7 @@ SaAisErrorT avd_clm_init(AVD_CL_CB* cb)
> >>>>>> cb->clmHandle = 0;
> >>>>>> cb->clm_sel_obj = 0;
> >>>>>> + cb->is_clm_track_started = false;
> >>>>>> TRACE_ENTER();
> >>>>>> /*
> >>>>>> * TODO: This CLM initialization thread can be re-factored
> >>>>>> @@ -453,6 +460,8 @@ SaAisErrorT avd_clm_track_start(void)
> >>>>>> } else {
> >>>>>> LOG_ER("Failed to start cluster tracking %u", error);
> >>>>>> }
> >>>>>> + } else {
> >>>>>> + avd_cb->is_clm_track_started = true;
> >>>>>> }
> >>>>>> TRACE_LEAVE();
> >>>>>> return error;
> >>>>>> @@ -460,17 +469,23 @@ SaAisErrorT avd_clm_track_start(void)
> >>>>>> SaAisErrorT avd_clm_track_stop(void)
> >>>>>> {
> >>>>>> - SaAisErrorT error = SA_AIS_OK;
> >>>>>> + SaAisErrorT error = SA_AIS_OK;
> >>>>>> + TRACE_ENTER();
> >>>>>> + error = saClmClusterTrackStop(avd_cb->clmHandle);
> >>>>>> + if (error != SA_AIS_OK) {
> >>>>>> + if (error == SA_AIS_ERR_TRY_AGAIN || error ==
> >>>>>> SA_AIS_ERR_TIMEOUT ||
> >>>>>> + error == SA_AIS_ERR_UNAVAILABLE) {
> >>>>>> + LOG_WA("Failed to stop cluster tracking %u", error);
> >>>>>> + } else {
> >>>>>> + LOG_ER("Failed to stop cluster tracking %u", error);
> >>>>>> + }
> >>>>>> + } else {
> >>>>>> + TRACE("Sucessfully stops cluster tracking");
> >>>>>> + avd_cb->is_clm_track_started = false;
> >>>>>> + }
> >>>>>> - TRACE_ENTER();
> >>>>>> - error = saClmClusterTrackStop(avd_cb->clmHandle);
> >>>>>> - if (SA_AIS_OK != error)
> >>>>>> - LOG_ER("Failed to stop cluster tracking %u",
> >>>>>> error);
> >>>>>> - else
> >>>>>> - TRACE("Sucessfully stops cluster tracking");
> >>>>>> -
> >>>>>> - TRACE_LEAVE();
> >>>>>> - return error;
> >>>>>> + TRACE_LEAVE();
> >>>>>> + return error;
> >>>>>> }
> >>>>>> void clm_node_terminate(AVD_AVND *node)
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> b/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> --- a/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> +++ b/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> @@ -210,6 +210,7 @@ typedef struct cl_cb_tag {
> >>>>>> /* Clm stuff */
> >>>>>> std::atomic<SaClmHandleT> clmHandle;
> >>>>>> std::atomic<SaSelectionObjectT> clm_sel_obj;
> >>>>>> + bool is_clm_track_started;
> >>>>>> bool fully_initialized;
> >>>>>> bool swap_switch; /* true - In middle of role switch. */
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/role.cc
> >>>>>> b/osaf/services/saf/amf/amfd/role.cc
> >>>>>> --- a/osaf/services/saf/amf/amfd/role.cc
> >>>>>> +++ b/osaf/services/saf/amf/amfd/role.cc
> >>>>>> @@ -1050,9 +1050,7 @@ uint32_t
> amfd_switch_actv_qsd(AVD_CL_CB
> >>>>>> /* Mark AVD as Quiesced. */
> >>>>>> cb->avail_state_avd = SA_AMF_HA_QUIESCED;
> >>>>>>
> >>>>>> - if (avd_clm_track_stop() != SA_AIS_OK) {
> >>>>>> - LOG_ER("ClmTrack stop failed");
> >>>>>> - }
> >>>>>> + avd_clm_track_stop();
> >>>>>> /* Go ahead and set mds role as already the NCS SU has been
> >>>>>> switched */
> >>>>>> if (NCSCC_RC_SUCCESS != (rc = avd_mds_set_vdest_role(cb,
> >>>>>> SA_AMF_HA_QUIESCED))) {
> >>>>>> @@ -1260,11 +1258,13 @@ uint32_t
> amfd_switch_stdby_actv(AVD_CL_C
> >>>>>> if (NCSCC_RC_SUCCESS !=
> avd_rde_set_role(SA_AMF_HA_ACTIVE)) {
> >>>>>> LOG_ER("rde role change failed from stdy -> Active");
> >>>>>> }
> >>>>>> -
> >>>>>> - if(avd_clm_track_start() != SA_AIS_OK) {
> >>>>>> - LOG_ER("Switch Standby --> Active, clm track start
> >>>>>> failed");
> >>>>>> - avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE,
> >>>>>> SA_AMF_HA_ACTIVE);
> >>>>>> - return NCSCC_RC_FAILURE;
> >>>>>> + // reuse clm track start
> >>>>>> + if (avd_cb->is_clm_track_started == false) {
> >>>>>> + if(avd_clm_track_start() != SA_AIS_OK) {
> >>>>>> + LOG_ER("Switch Standby --> Active, clm track start
> >>>>>> failed");
> >>>>>> + avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE,
> >>>>>> SA_AMF_HA_ACTIVE);
> >>>>>> + return NCSCC_RC_FAILURE;
> >>>>>> + }
> >>>>>> }
> >>>>>> /* Send the message to other avd for role change rsp as
> >>>>>> success */
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel