Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

Nagendra Kumar Wed, 22 Feb 2017 00:31:40 -0800

>> Since in spec there is no specific discussion for comp-failover recovery for 
>> an unassigned comp, I will encourage other maintainers also to provide 
>> inputs.


I do agree for not instantiating failed component before recovery, this keeps 
the approach similar to SU failover also.

@Minh: If you don't mind, we can take su oper state changes in an enhancement. 
What do you say ?

Thanks
-Nagu

> -----Original Message-----
> From: praveen malviya
> Sent: 21 February 2017 11:15
> To: minh chau
> Cc: [email protected]; Nagendra Kumar;
> [email protected]; [email protected]; opensaf-
> [email protected]
> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
> synchronizes with component failover sequence [#2233]
> 
> Hi Minh,
> 
> Please find my response inline with [Praveen].
> 
> Thanks,
> Praveen
> 
> On 20-Feb-17 6:58 AM, minh chau wrote:
> > Hi Praveen,
> >
> > Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
> > and #1902, it also can fix the problem.
> > Here we have 2 solutions:
> > - The one I sent for review is letting the failed component to be
> > instantiated, I think it is current behavior. But one change is that
> > amfnd will not report su operational message to amfd until amfnd
> > finishes removing the assignment of (faulty) su which contains the
> > failed component
> > - The V2 patch postpones the instantiation of failed component. amfnd
> > will instantiate the failed component (via avnd_err_su_repair) after
> > amfnd finishes removing the assignment of faulty su.
> >
> > So basically the difference is the time that the failed component should
> > be instantiated.
> >
> > Still in item 3.11.1.3.2:
> > "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
> > other-wise, the number of active service units would be higher than what
> > is allowed by the redundancy model. However, in an Nway redundancy
> > model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
> > configuration attribute of the ser-vice unit is set to SA_FALSE), and a
> > repair of C2 should be attempted by reinstantiating it. If the attempt
> > to instantiate C2 fails, the service unit becomes disabled, and SI2 must
> > be switched-over; however, if the attempt to instantiate C2 is
> > successful, SI2 shall remain assigned to SU1, and based on other
> > configuration parameters and N-way redundancy model semantics, even
> SI1
> > might get reassigned to SU1."
> >
> > My comment on V2:
> >
> > The configuration in #2233 is different from the example in
> > specification, but it sounds to me the attempt to instantiate failed
> > component should be done as soon as possible.
> > The check in V2 patch means the failed component won't be instantiated
> > if its SU still has any assignment. It should be true to 2N and N+M, but
> > it's not for other SG. (As the example in specification, S2 does not
> > have any CSI assigned to failed component C2).
> [Praveen]As of now we have documented in the PR doc (conformance table
> section 3.11.1.3 Recovery) that if a component faults with comp-failover
> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active
> and
> N+M models also. This is just to highlight about other red models. But
> this documentation is not clear for an unassigned comp.
> But from the beginning, comp-failover is working this way only. At-least
> from clean up perspective we have fixed the problem of parallelism in
> the past in the ticket #474.
> 
> One more thing I have noted, proxy-proxied implementation is based on
> B.01.01. As per B.01.01, proxy will register himself and its proxied as
> soon as it gets instantiated. In a configuration containing both proxy
> and proxied comp, if the proxy does not get any CSI and it faults with
> comp-failover recovery then in instantiation phase it may again register
> its proxy. I think proxy in other SU should register its proxied. I
> guess, from deployment perspective such a configuration in which a user
> configures proxy without any CSI may not exists and only possibility is
> an application modeling a legacy code in NoRed model. However, in the
> later version of spec B.01.02, proxied was supposed to mention the name
> of proxy CSI and thus proxy should register only when its get proxy CSI.
> 
> One more point to be noted comp-failover can also be done as a part of
> escalation also. If a component is instantiated before the completion of
> comp-failover recovery and if faults again then it may escalate to
> node-failover before completion of comp-failover recovery.
> 
> Since in spec there is no specific discussion for comp-failover recovery
> for an unassigned comp, I will encourage other maintainers also to
> provide inputs.
> 
> 
> Thanks,
> Praveen
> 
> 
>   Moreover, in the clc.cc,
> > amfnd does not check any of si_list.n_nodes, this probably is the logic
> > that has being done so far.
> >
> > Thanks,
> > Minh
> >
> > On 17/02/17 23:16, praveen malviya wrote:
> >> Hi Minh,
> >>
> >> I think we should see this problem from fault management perspective
> >> also. Here repair of failed component is performed before the
> >> completion of recovery.In the problem, component faulted with
> >> comp-failover recovery and it was successfully repaired(instantiated)
> >> when SU switch-over was still pending.
> >>
> >> Now the question is: Why it was never observed earlier? The reason is
> >> generally all components are assigned at least one CSI. In the present
> >> configuration failed component was not assigned any CSI. When this
> >> component was cleaned up and marked UNINSTANTIATED, AMFND sent
> >> comp-failover recovery request to AMFD. But after sending recovery
> >> request, it instantiated failed comp when SU has still assignments to
> >> be switch-overed. The code related to this assumes that comp will have
> >> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
> >> TERMINATING to UNINSTANTIATED if block). For normal sequence of
> >> comp-failover, su is repaired after removal of assignment in
> >> avnd_su_si_oper_done() by calling avnd_err_su_repair().
> >>
> >> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page
> >> 195) about switch-overing all the SIs of failed SU in case of
> >> comp-failed recovery and not for other models. In current OpenSAF
> >> implementation we are following this for all models.
> >>
> >> I think as a fix we should stop failed comp to get instantiated before
> >> removal of assignments. For this the check in clc.cc can be hardened
> >> to consider non-assigned comp failures.
> >> Attached is the patch (2233_v2.patch) based on this idea/approach.
> >>
> >> Thanks,
> >> Praveen
> >>
> >>
> >> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
> >>> Hi Praveen,
> >>>
> >>> Yes, you are right, I will update the description.
> >>>
> >>> Thanks, Minh
> >>>
> >>> Quoting praveen malviya <[email protected]>:
> >>>
> >>>> Hi Minh,
> >>>>
> >>>> One quick question:
> >>>> Ticket description says:
> >>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1
> depends
> >>>> safSi=AmfDemoTwon"
> >>>> But logs are related to without SIdep. Also in the configuration
> >>>> app3_twon3su3si.xml, SI dep classes are commented.
> >>>> I think ticket description needs correction as problem is without SI
> >>>> dep.
> >>>> Please confirm.
> >>>>
> >>>> Thanks,
> >>>> Praveen
> >>>>
> >>>>
> >>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
> >>>>> Hi Minh,
> >>>>>
> >>>>> I have started reviewing this patch.
> >>>>>
> >>>>> Thanks,
> >>>>> Praveen
> >>>>>
> >>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Have you had time to review this patch?
> >>>>>> It changes the component failover sequence, so I think we need
> more
> >>>>>> time
> >>>>>> to look at it.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Minh
> >>>>>>
> >>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
> >>>>>>>  src/amf/amfnd/avnd_su.h |   1 +
> >>>>>>>  src/amf/amfnd/clc.cc    |   3 ---
> >>>>>>>  src/amf/amfnd/di.cc     |  12 +++++++++++-
> >>>>>>>  src/amf/amfnd/susm.cc   |  32
> +++++++++++++++++++++++++++++---
> >>>>>>>  4 files changed, 41 insertions(+), 7 deletions(-)
> >>>>>>>
> >>>>>>>
> >>>>>>> In case component failover, faulty component will be terminated.
> >>>>>>> When
> >>>>>>> the reinstantiation
> >>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which
> is
> >>>>>>> running along with
> >>>>>>> component failover. In the reported problem, if su_oper_message
> >>>>>>> (enabled) comes to amfd
> >>>>>>> before the quiesced assignment response (as part of component
> >>>>>>> failover
> >>>>>>> sequence) comes to
> >>>>>>> amfd, then this quiesced assignment response is ignored, thus
> >>>>>>> component failover will not
> >>>>>>> finish.
> >>>>>>>
> >>>>>>> The problem is in function susi_success_sg_realign with act=5,
> >>>>>>> state=3, amfd always assumes
> >>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is
> >>>>>>> true
> >>>>>>> in most of the time
> >>>>>>> when su_oper_message (enabled) comes a little later than quiesced
> >>>>>>> assignment response. In fact
> >>>>>>> the su_oper_message (enabled) is not designed as part of
> component
> >>>>>>> failover sequence, thus it
> >>>>>>> can come any time during the failover. If amfd is getting a bit
> >>>>>>> busier
> >>>>>>> with RTA update then
> >>>>>>> the faulty component has enough to reinstiantiate so that amfnd
> >>>>>>> sends
> >>>>>>> su_oper_message (enabled)
> >>>>>>> before quiesced assignment response, the reported problem will be
> >>>>>>> seen.
> >>>>>>>
> >>>>>>> This patch hardens the component failover sequence by ensuring
> the
> >>>>>>> su_oper_message (enabled) to
> >>>>>>> be sent after su completes to remove assignment. This approach
> comes
> >>>>>>> from the similarity in
> >>>>>>> su failover, where the su_oper_message (enabled) is sent in repair
> >>>>>>> phase.
> >>>>>>>
> >>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
> >>>>>>> --- a/src/amf/amfnd/avnd_su.h
> >>>>>>> +++ b/src/amf/amfnd/avnd_su.h
> >>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
> >>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
> >>>>>>> avnd_su_si_rec *);
> >>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getlast(void);
> >>>>>>>  extern bool sufailover_in_progress(const AVND_SU *su);
> >>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
> >>>>>>>  extern bool sufailover_during_nodeswitchover(const AVND_SU
> *su);
> >>>>>>>  extern bool all_csis_in_removed_state(const AVND_SU *su);
> >>>>>>>  extern void su_reset_restart_count_in_comps(const struct
> >>>>>>> avnd_cb_tag
> >>>>>>> *cb, const AVND_SU *su);
> >>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
> >>>>>>> --- a/src/amf/amfnd/clc.cc
> >>>>>>> +++ b/src/amf/amfnd/clc.cc
> >>>>>>> @@ -2381,9 +2381,6 @@ uint32_t
> avnd_comp_clc_terming_cleansucc
> >>>>>>>              (m_AVND_SU_IS_FAILOVER(su))) {
> >>>>>>>          /* yes, request director to orchestrate component
> >>>>>>> failover */
> >>>>>>>          rc = avnd_di_oper_send(cb, su,
> SA_AMF_COMPONENT_FAILOVER);
> >>>>>>> -
> >>>>>>> -        //Reset component-failover here. SU failover is reset as
> >>>>>>> part
> >>>>>>> of REPAIRED admin op.
> >>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
> >>>>>>>      }
> >>>>>>>        /*
> >>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
> >>>>>>> --- a/src/amf/amfnd/di.cc
> >>>>>>> +++ b/src/amf/amfnd/di.cc
> >>>>>>> @@ -894,7 +894,17 @@ uint32_t
> avnd_di_susi_resp_send(AVND_CB
> >>>>>>>              }
> >>>>>>>              m_AVND_SU_ALL_SI_RESET(su);
> >>>>>>>          }
> >>>>>>> -
> >>>>>>> +        if (componentfailover_in_progress(su)) {
> >>>>>>> +            if (all_csis_in_removed_state(su) == true) {
> >>>>>>> +                bool is_en;
> >>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
> >>>>>>> +                if (is_en) {
> >>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
> >>>>>>> NCSCC_RC_SUCCESS) {
> >>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
> >>>>>>> +                    }
> >>>>>>> +                }
> >>>>>>> +            }
> >>>>>>> +        }
> >>>>>>>      /* free the contents of avnd message */
> >>>>>>>      avnd_msg_content_free(cb, &msg);
> >>>>>>>  diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
> >>>>>>> --- a/src/amf/amfnd/susm.cc
> >>>>>>> +++ b/src/amf/amfnd/susm.cc
> >>>>>>> @@ -1633,10 +1633,22 @@ uint32_t
> avnd_su_pres_st_chng_prc(AVND_C
> >>>>>>>              m_AVND_SU_IS_ENABLED(su, is_en);
> >>>>>>>              if (true == is_en) {
> >>>>>>>                  TRACE("SU oper state is enabled");
> >>>>>>> +                // do not send su_oper state if component
> >>>>>>> failover is
> >>>>>>> in progress
> >>>>>>>                  m_AVND_SU_OPER_STATE_SET(su,
> >>>>>>> SA_AMF_OPERATIONAL_ENABLED);
> >>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
> >>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
> >>>>>>> -                    goto done;
> >>>>>>> +                if (componentfailover_in_progress(su) == true) {
> >>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
> >>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
> >>>>>>> +                    if (si == nullptr ||
> >>>>>>> all_csis_in_removed_state(su)) {
> >>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
> >>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
> >>>>>>> +                            goto done;
> >>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
> >>>>>>> +                    }
> >>>>>>> +                } else {
> >>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
> >>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
> >>>>>>> +                        goto done;
> >>>>>>> +                }
> >>>>>>>              }
> >>>>>>>              else
> >>>>>>>                  TRACE("SU oper state is disabled");
> >>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const
> AVND_S
> >>>>>>>  }
> >>>>>>>    /**
> >>>>>>> + * This function checks if the componentfailover is going on.
> >>>>>>> + * @param su: ptr to the SU .
> >>>>>>> + *
> >>>>>>> + * @return true/false.
> >>>>>>> + */
> >>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
> >>>>>>> +    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su))
> &&
> >>>>>>> +            (avnd_cb->oper_state !=
> SA_AMF_OPERATIONAL_DISABLED) &&
> >>>>>>> (!su->is_ncs) &&
> >>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
> >>>>>>> +        return true;
> >>>>>>> +    return false;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/**
> >>>>>>>   * This function checks if the sufailover and node switchover are
> >>>>>>> going on.
> >>>>>>>   * @param su: ptr to the SU .
> >>>>>>>   *
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------------------
> >>>>>
> >>>>>
> >>>>> Check out the vibrant tech community on one of the world's most
> >>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> >>>>> _______________________________________________
> >>>>> Opensaf-devel mailing list
> >>>>> [email protected]
> >>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> >>>>>
> >>>
> >>>
> >

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

Reply via email to