Re: [devel] [PATCH 1 of 1] AMFD: Do not let non-active AMFD receives CLM track callback [#2141]

Nagendra Kumar Tue, 29 Nov 2016 22:23:56 -0800

Hi Minh,
        I think the patch could be made simpler by again try to stop CLM 
track(keep a variable to indicate it wasn't stopped before) when AMFD becomes 
Standby. If it doesn't get stopped again, then exit Amfd as it is kind of 
double fault.


Thanks
-Nagu

> -----Original Message-----
> From: minh chau [mailto:[email protected]]
> Sent: 30 November 2016 10:40
> To: praveen malviya; [email protected]; Nagendra Kumar;
> [email protected]; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 1 of 1] AMFD: Do not let non-active AMFD receives CLM
> track callback [#2141]
> 
> Hi Praveen,
> 
> I have seen a limitation of using the same job queue now, that it will 
> increase
> the possibility of loss of IMM update job which is causing incorrect headless
> recovery. The timer is one alternative. We had spawned a thread to deal with
> TRY_AGAIN/TIMEOUT in case of start cluster tracking, so stop cluster
> tracking should be handled in a similar way.
> I will create another ticket and assign it to myself to introduce a thread
> handler for using CLM service in AMF. Some background/similar
> tickets: #1609, #1531. For the scope of this ticket, the floated patch + a
> correction from your comment that standby AMFD responds CLM callback,
> they should be enough to avoid the reported problem in #2141. Do you
> agree?
> 
> Thanks,
> Minh
> 
> 
> On 21/11/16 22:37, praveen malviya wrote:
> > Hi Minh,
> >
> > If the intention is to use same job queue then, I think, one problem
> > will come:
> >  "Currently AMFD empties job queue when AMFD becomes quiesced in
> > avd_mds_qsd_role_evh(). Most probably before emptying, AMFD will try
> > to stop CLM tracking and now if the API again fails then this failure
> > cannot be handled as AMFD cannot be blocked in avd_mds_qsd_role_evh()
> > for long time. And then AMFD will empty the job queue."
> >
> > Can we think of some way to avoid above problem? One solution, without
> > using job queue, could be: When CLM track fails, run a timer of 1
> > seconds. When this timer expires, check role of AMFD. If the role is
> > still standby and Track stop is pending then now AMFD will give a try.
> > If it again fails then rerun the timer based on the value of return
> > code. We can decide how many times this can be tried.
> >
> > One more thing. on standby controller if clm track has failed then it
> > should respond for any track callback that it receives without taking
> > any action. This needs to be done irrespective of the approach of the
> > solution.
> >
> > Thanks,
> > Praveen
> >
> > On 21-Nov-16 4:38 PM, minh chau wrote:
> >> Hi Praveen,
> >>
> >> I am thinking if creating a CLM job (likely IMM update, NTF send job) in
> >> order to stop tracking at non-active AMFD.
> >> Should it be a better version than this patch that AMFD will still have
> >> chance to stop tracking until being promoted back to active role?
> >>
> >> Thanks,
> >> Minh
> >>
> >> On 18/11/16 11:38, minh chau wrote:
> >>> Hi Praveen,
> >>>
> >>> It has been seen once in swapping 2N Opensaf SI. It was just an error
> >>> indicating that AMFD failed to stop track callback, after that the
> >>> si-swap also succeeded. CLM service could be busy during switch-over
> >>> but AMF (as well as other services) is supposed to handle that return
> >>> code properly as AMFD has done in track start. Further testing, it
> >>> exposed a problem due to clm track callback received on standby AMFD.
> >>>
> >>> AMFD can not be too busy just try to stop clm track. First option,
> >>> AMFD can start another thread which just tries to stop track callback,
> >>> but that's likely to generate more problems where this standby AMFD is
> >>> promoted back to active in the meantime that thread of stopping track
> >>> is on going. At that time, main thread will be starting track and
> >>> another thread is trying to stop track.
> >>>
> >>> Second option as in the patch, AMFD retries to stop track where
> >>> problem happens in callback. At the time standby AMFD switching back
> >>> to active, we stop and start track again, but it seems no point of
> >>> doing this. I hope it's less impact on existing behavior, do you see
> >>> any problems?
> >>>
> >>> Regarding standby controller with locked/disabled CLM, I think it is
> >>> mentioned (in #1828 or some where else?) that this matter requires
> >>> more thoughts as whether we want Ha role assigned to AMFD in non
> >>> member node. It seems to beyond this ticket.
> >>>
> >>> Thanks,
> >>> Minh
> >>>
> >>> On 17/11/16 22:03, praveen malviya wrote:
> >>>> Hi Minh,
> >>>>
> >>>> We have not seen this issue earlier. Why CLMA is returning timeout
> >>>> where it is busy?
> >>>> If we are allowing Standby controller to continue tracking the Node,
> >>>> then I think (though not fully sure), it will increase one more
> >>>> client for validation step and this client does not exist locally.
> >>>> Also we need to evaluate the situation where Standby controller is
> >>>> CLM locked or disabled.
> >>>>
> >>>> Thanks,
> >>>> Praveen
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 17-Nov-16 3:44 AM, minh chau wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Has anyone had chance to review this patch?
> >>>>>
> >>>>> Thanks,
> >>>>> Minh
> >>>>>
> >>>>> On 14/11/16 15:27, Minh Hon Chau wrote:
> >>>>>> osaf/services/saf/amf/amfd/clm.cc |  37
> >>>>>> +++++++++++++++++++++++---------
> >>>>>>   osaf/services/saf/amf/amfd/include/cb.h |   1 +
> >>>>>>   osaf/services/saf/amf/amfd/role.cc      |  16 +++++++-------
> >>>>>>   3 files changed, 35 insertions(+), 19 deletions(-)
> >>>>>>
> >>>>>>
> >>>>>> In controller failover/switchover, sometimes active AMFD fails to
> >>>>>> stop
> >>>>>> CLM track callback. Therefore, when this AMFD become standby,
> >>>>>> AMFD can
> >>>>>> continue receiving CLM track callback and trigger the operations
> >>>>>> which
> >>>>>> should only be executed in active AMFD.
> >>>>>>
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> b/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> --- a/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> +++ b/osaf/services/saf/amf/amfd/clm.cc
> >>>>>> @@ -219,7 +219,13 @@ static void clm_track_cb(const SaClmClus
> >>>>>>           LOG_ER("ClmTrackCallback received in error");
> >>>>>>           goto done;
> >>>>>>       }
> >>>>>> -
> >>>>>> +    if (avd_cb->avail_state_avd != SA_AMF_HA_ACTIVE) {
> >>>>>> +        if (avd_cb->is_clm_track_started == true) {
> >>>>>> +            LOG_NO("Retry to stop clm track with AMFD state(%d)",
> >>>>>> avd_cb->avail_state_avd);
> >>>>>> +            avd_clm_track_stop();
> >>>>>> +        }
> >>>>>> +        goto done;
> >>>>>> +    }
> >>>>>>       /*
> >>>>>>       ** The CLM cluster can be larger than the AMF cluster thus
> >>>>>> it is
> >>>>>> not an
> >>>>>>       ** error if the corresponding AMF node cannot be found.
> >>>>>> @@ -394,6 +400,7 @@ SaAisErrorT avd_clm_init(AVD_CL_CB* cb)
> >>>>>>         cb->clmHandle = 0;
> >>>>>>       cb->clm_sel_obj = 0;
> >>>>>> +    cb->is_clm_track_started = false;
> >>>>>>       TRACE_ENTER();
> >>>>>>       /*
> >>>>>>        * TODO: This CLM initialization thread can be re-factored
> >>>>>> @@ -453,6 +460,8 @@ SaAisErrorT avd_clm_track_start(void)
> >>>>>>           } else {
> >>>>>>               LOG_ER("Failed to start cluster tracking %u", error);
> >>>>>>           }
> >>>>>> +    } else {
> >>>>>> +        avd_cb->is_clm_track_started = true;
> >>>>>>       }
> >>>>>>       TRACE_LEAVE();
> >>>>>>       return error;
> >>>>>> @@ -460,17 +469,23 @@ SaAisErrorT avd_clm_track_start(void)
> >>>>>>     SaAisErrorT avd_clm_track_stop(void)
> >>>>>>   {
> >>>>>> -        SaAisErrorT error = SA_AIS_OK;
> >>>>>> +    SaAisErrorT error = SA_AIS_OK;
> >>>>>> +    TRACE_ENTER();
> >>>>>> +    error = saClmClusterTrackStop(avd_cb->clmHandle);
> >>>>>> +    if (error != SA_AIS_OK) {
> >>>>>> +        if (error == SA_AIS_ERR_TRY_AGAIN || error ==
> >>>>>> SA_AIS_ERR_TIMEOUT ||
> >>>>>> +                error == SA_AIS_ERR_UNAVAILABLE) {
> >>>>>> +            LOG_WA("Failed to stop cluster tracking %u", error);
> >>>>>> +        } else {
> >>>>>> +            LOG_ER("Failed to stop cluster tracking %u", error);
> >>>>>> +        }
> >>>>>> +    } else {
> >>>>>> +        TRACE("Sucessfully stops cluster tracking");
> >>>>>> +        avd_cb->is_clm_track_started = false;
> >>>>>> +    }
> >>>>>>   -        TRACE_ENTER();
> >>>>>> -    error = saClmClusterTrackStop(avd_cb->clmHandle);
> >>>>>> -        if (SA_AIS_OK != error)
> >>>>>> -                LOG_ER("Failed to stop cluster tracking %u",
> >>>>>> error);
> >>>>>> -    else
> >>>>>> -        TRACE("Sucessfully stops cluster tracking");
> >>>>>> -
> >>>>>> -        TRACE_LEAVE();
> >>>>>> -        return error;
> >>>>>> +    TRACE_LEAVE();
> >>>>>> +    return error;
> >>>>>>   }
> >>>>>>     void clm_node_terminate(AVD_AVND *node)
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> b/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> --- a/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> +++ b/osaf/services/saf/amf/amfd/include/cb.h
> >>>>>> @@ -210,6 +210,7 @@ typedef struct cl_cb_tag {
> >>>>>>       /* Clm stuff */
> >>>>>>       std::atomic<SaClmHandleT> clmHandle;
> >>>>>>       std::atomic<SaSelectionObjectT> clm_sel_obj;
> >>>>>> +    bool is_clm_track_started;
> >>>>>>         bool fully_initialized;
> >>>>>>       bool swap_switch; /* true - In middle of role switch. */
> >>>>>> diff --git a/osaf/services/saf/amf/amfd/role.cc
> >>>>>> b/osaf/services/saf/amf/amfd/role.cc
> >>>>>> --- a/osaf/services/saf/amf/amfd/role.cc
> >>>>>> +++ b/osaf/services/saf/amf/amfd/role.cc
> >>>>>> @@ -1050,9 +1050,7 @@ uint32_t
> amfd_switch_actv_qsd(AVD_CL_CB
> >>>>>>       /*  Mark AVD as Quiesced. */
> >>>>>>       cb->avail_state_avd = SA_AMF_HA_QUIESCED;
> >>>>>>
> >>>>>> -    if (avd_clm_track_stop() != SA_AIS_OK) {
> >>>>>> -        LOG_ER("ClmTrack stop failed");
> >>>>>> -    }
> >>>>>> +    avd_clm_track_stop();
> >>>>>>         /* Go ahead and set mds role as already the NCS SU has been
> >>>>>> switched */
> >>>>>>       if (NCSCC_RC_SUCCESS != (rc = avd_mds_set_vdest_role(cb,
> >>>>>> SA_AMF_HA_QUIESCED))) {
> >>>>>> @@ -1260,11 +1258,13 @@ uint32_t
> amfd_switch_stdby_actv(AVD_CL_C
> >>>>>>       if (NCSCC_RC_SUCCESS !=
> avd_rde_set_role(SA_AMF_HA_ACTIVE)) {
> >>>>>>           LOG_ER("rde role change failed from stdy -> Active");
> >>>>>>       }
> >>>>>> -
> >>>>>> -    if(avd_clm_track_start() != SA_AIS_OK) {
> >>>>>> -        LOG_ER("Switch Standby --> Active, clm track start
> >>>>>> failed");
> >>>>>> -        avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE,
> >>>>>> SA_AMF_HA_ACTIVE);
> >>>>>> -        return NCSCC_RC_FAILURE;
> >>>>>> +    // reuse clm track start
> >>>>>> +    if (avd_cb->is_clm_track_started == false) {
> >>>>>> +        if(avd_clm_track_start() != SA_AIS_OK) {
> >>>>>> +            LOG_ER("Switch Standby --> Active, clm track start
> >>>>>> failed");
> >>>>>> +            avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE,
> >>>>>> SA_AMF_HA_ACTIVE);
> >>>>>> +            return NCSCC_RC_FAILURE;
> >>>>>> +        }
> >>>>>>       }
> >>>>>>         /* Send the message to other avd for role change rsp as
> >>>>>> success */
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> 

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] AMFD: Do not let non-active AMFD receives CLM track callback [#2141]

Reply via email to