>> Since in spec there is no specific discussion for comp-failover recovery for >> an unassigned comp, I will encourage other maintainers also to provide >> inputs.
I do agree for not instantiating failed component before recovery, this keeps the approach similar to SU failover also. @Minh: If you don't mind, we can take su oper state changes in an enhancement. What do you say ? Thanks -Nagu > -----Original Message----- > From: praveen malviya > Sent: 21 February 2017 11:15 > To: minh chau > Cc: hans.nordeb...@ericsson.com; Nagendra Kumar; > gary....@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf- > de...@lists.sourceforge.net > Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message > synchronizes with component failover sequence [#2233] > > Hi Minh, > > Please find my response inline with [Praveen]. > > Thanks, > Praveen > > On 20-Feb-17 6:58 AM, minh chau wrote: > > Hi Praveen, > > > > Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233 > > and #1902, it also can fix the problem. > > Here we have 2 solutions: > > - The one I sent for review is letting the failed component to be > > instantiated, I think it is current behavior. But one change is that > > amfnd will not report su operational message to amfd until amfnd > > finishes removing the assignment of (faulty) su which contains the > > failed component > > - The V2 patch postpones the instantiation of failed component. amfnd > > will instantiate the failed component (via avnd_err_su_repair) after > > amfnd finishes removing the assignment of faulty su. > > > > So basically the difference is the time that the failed component should > > be instantiated. > > > > Still in item 3.11.1.3.2: > > "In a 2N or N+M redundancy model, SI2 also needs to be switched over; > > other-wise, the number of active service units would be higher than what > > is allowed by the redundancy model. However, in an Nway redundancy > > model, SI2 could be left assigned to SU1 (if the saAmfSUFailover > > configuration attribute of the ser-vice unit is set to SA_FALSE), and a > > repair of C2 should be attempted by reinstantiating it. If the attempt > > to instantiate C2 fails, the service unit becomes disabled, and SI2 must > > be switched-over; however, if the attempt to instantiate C2 is > > successful, SI2 shall remain assigned to SU1, and based on other > > configuration parameters and N-way redundancy model semantics, even > SI1 > > might get reassigned to SU1." > > > > My comment on V2: > > > > The configuration in #2233 is different from the example in > > specification, but it sounds to me the attempt to instantiate failed > > component should be done as soon as possible. > > The check in V2 patch means the failed component won't be instantiated > > if its SU still has any assignment. It should be true to 2N and N+M, but > > it's not for other SG. (As the example in specification, S2 does not > > have any CSI assigned to failed component C2). > [Praveen]As of now we have documented in the PR doc (conformance table > section 3.11.1.3 Recovery) that if a component faults with comp-failover > recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active > and > N+M models also. This is just to highlight about other red models. But > this documentation is not clear for an unassigned comp. > But from the beginning, comp-failover is working this way only. At-least > from clean up perspective we have fixed the problem of parallelism in > the past in the ticket #474. > > One more thing I have noted, proxy-proxied implementation is based on > B.01.01. As per B.01.01, proxy will register himself and its proxied as > soon as it gets instantiated. In a configuration containing both proxy > and proxied comp, if the proxy does not get any CSI and it faults with > comp-failover recovery then in instantiation phase it may again register > its proxy. I think proxy in other SU should register its proxied. I > guess, from deployment perspective such a configuration in which a user > configures proxy without any CSI may not exists and only possibility is > an application modeling a legacy code in NoRed model. However, in the > later version of spec B.01.02, proxied was supposed to mention the name > of proxy CSI and thus proxy should register only when its get proxy CSI. > > One more point to be noted comp-failover can also be done as a part of > escalation also. If a component is instantiated before the completion of > comp-failover recovery and if faults again then it may escalate to > node-failover before completion of comp-failover recovery. > > Since in spec there is no specific discussion for comp-failover recovery > for an unassigned comp, I will encourage other maintainers also to > provide inputs. > > > Thanks, > Praveen > > > Moreover, in the clc.cc, > > amfnd does not check any of si_list.n_nodes, this probably is the logic > > that has being done so far. > > > > Thanks, > > Minh > > > > On 17/02/17 23:16, praveen malviya wrote: > >> Hi Minh, > >> > >> I think we should see this problem from fault management perspective > >> also. Here repair of failed component is performed before the > >> completion of recovery.In the problem, component faulted with > >> comp-failover recovery and it was successfully repaired(instantiated) > >> when SU switch-over was still pending. > >> > >> Now the question is: Why it was never observed earlier? The reason is > >> generally all components are assigned at least one CSI. In the present > >> configuration failed component was not assigned any CSI. When this > >> component was cleaned up and marked UNINSTANTIATED, AMFND sent > >> comp-failover recovery request to AMFD. But after sending recovery > >> request, it instantiated failed comp when SU has still assignments to > >> be switch-overed. The code related to this assumes that comp will have > >> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), > >> TERMINATING to UNINSTANTIATED if block). For normal sequence of > >> comp-failover, su is repaired after removal of assignment in > >> avnd_su_si_oper_done() by calling avnd_err_su_repair(). > >> > >> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page > >> 195) about switch-overing all the SIs of failed SU in case of > >> comp-failed recovery and not for other models. In current OpenSAF > >> implementation we are following this for all models. > >> > >> I think as a fix we should stop failed comp to get instantiated before > >> removal of assignments. For this the check in clc.cc can be hardened > >> to consider non-assigned comp failures. > >> Attached is the patch (2233_v2.patch) based on this idea/approach. > >> > >> Thanks, > >> Praveen > >> > >> > >> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote: > >>> Hi Praveen, > >>> > >>> Yes, you are right, I will update the description. > >>> > >>> Thanks, Minh > >>> > >>> Quoting praveen malviya <praveen.malv...@oracle.com>: > >>> > >>>> Hi Minh, > >>>> > >>>> One quick question: > >>>> Ticket description says: > >>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 > depends > >>>> safSi=AmfDemoTwon" > >>>> But logs are related to without SIdep. Also in the configuration > >>>> app3_twon3su3si.xml, SI dep classes are commented. > >>>> I think ticket description needs correction as problem is without SI > >>>> dep. > >>>> Please confirm. > >>>> > >>>> Thanks, > >>>> Praveen > >>>> > >>>> > >>>> On 17-Feb-17 10:58 AM, praveen malviya wrote: > >>>>> Hi Minh, > >>>>> > >>>>> I have started reviewing this patch. > >>>>> > >>>>> Thanks, > >>>>> Praveen > >>>>> > >>>>> On 15-Feb-17 9:22 AM, minh chau wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> Have you had time to review this patch? > >>>>>> It changes the component failover sequence, so I think we need > more > >>>>>> time > >>>>>> to look at it. > >>>>>> > >>>>>> Thanks, > >>>>>> Minh > >>>>>> > >>>>>> On 23/01/17 12:28, Minh Hon Chau wrote: > >>>>>>> src/amf/amfnd/avnd_su.h | 1 + > >>>>>>> src/amf/amfnd/clc.cc | 3 --- > >>>>>>> src/amf/amfnd/di.cc | 12 +++++++++++- > >>>>>>> src/amf/amfnd/susm.cc | 32 > +++++++++++++++++++++++++++++--- > >>>>>>> 4 files changed, 41 insertions(+), 7 deletions(-) > >>>>>>> > >>>>>>> > >>>>>>> In case component failover, faulty component will be terminated. > >>>>>>> When > >>>>>>> the reinstantiation > >>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which > is > >>>>>>> running along with > >>>>>>> component failover. In the reported problem, if su_oper_message > >>>>>>> (enabled) comes to amfd > >>>>>>> before the quiesced assignment response (as part of component > >>>>>>> failover > >>>>>>> sequence) comes to > >>>>>>> amfd, then this quiesced assignment response is ignored, thus > >>>>>>> component failover will not > >>>>>>> finish. > >>>>>>> > >>>>>>> The problem is in function susi_success_sg_realign with act=5, > >>>>>>> state=3, amfd always assumes > >>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is > >>>>>>> true > >>>>>>> in most of the time > >>>>>>> when su_oper_message (enabled) comes a little later than quiesced > >>>>>>> assignment response. In fact > >>>>>>> the su_oper_message (enabled) is not designed as part of > component > >>>>>>> failover sequence, thus it > >>>>>>> can come any time during the failover. If amfd is getting a bit > >>>>>>> busier > >>>>>>> with RTA update then > >>>>>>> the faulty component has enough to reinstiantiate so that amfnd > >>>>>>> sends > >>>>>>> su_oper_message (enabled) > >>>>>>> before quiesced assignment response, the reported problem will be > >>>>>>> seen. > >>>>>>> > >>>>>>> This patch hardens the component failover sequence by ensuring > the > >>>>>>> su_oper_message (enabled) to > >>>>>>> be sent after su completes to remove assignment. This approach > comes > >>>>>>> from the similarity in > >>>>>>> su failover, where the su_oper_message (enabled) is sent in repair > >>>>>>> phase. > >>>>>>> > >>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h > >>>>>>> --- a/src/amf/amfnd/avnd_su.h > >>>>>>> +++ b/src/amf/amfnd/avnd_su.h > >>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis > >>>>>>> extern struct avnd_su_si_rec *avnd_silist_getprev(const struct > >>>>>>> avnd_su_si_rec *); > >>>>>>> extern struct avnd_su_si_rec *avnd_silist_getlast(void); > >>>>>>> extern bool sufailover_in_progress(const AVND_SU *su); > >>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su); > >>>>>>> extern bool sufailover_during_nodeswitchover(const AVND_SU > *su); > >>>>>>> extern bool all_csis_in_removed_state(const AVND_SU *su); > >>>>>>> extern void su_reset_restart_count_in_comps(const struct > >>>>>>> avnd_cb_tag > >>>>>>> *cb, const AVND_SU *su); > >>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc > >>>>>>> --- a/src/amf/amfnd/clc.cc > >>>>>>> +++ b/src/amf/amfnd/clc.cc > >>>>>>> @@ -2381,9 +2381,6 @@ uint32_t > avnd_comp_clc_terming_cleansucc > >>>>>>> (m_AVND_SU_IS_FAILOVER(su))) { > >>>>>>> /* yes, request director to orchestrate component > >>>>>>> failover */ > >>>>>>> rc = avnd_di_oper_send(cb, su, > SA_AMF_COMPONENT_FAILOVER); > >>>>>>> - > >>>>>>> - //Reset component-failover here. SU failover is reset as > >>>>>>> part > >>>>>>> of REPAIRED admin op. > >>>>>>> - m_AVND_SU_FAILOVER_RESET(su); > >>>>>>> } > >>>>>>> /* > >>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc > >>>>>>> --- a/src/amf/amfnd/di.cc > >>>>>>> +++ b/src/amf/amfnd/di.cc > >>>>>>> @@ -894,7 +894,17 @@ uint32_t > avnd_di_susi_resp_send(AVND_CB > >>>>>>> } > >>>>>>> m_AVND_SU_ALL_SI_RESET(su); > >>>>>>> } > >>>>>>> - > >>>>>>> + if (componentfailover_in_progress(su)) { > >>>>>>> + if (all_csis_in_removed_state(su) == true) { > >>>>>>> + bool is_en; > >>>>>>> + m_AVND_SU_IS_ENABLED(su, is_en); > >>>>>>> + if (is_en) { > >>>>>>> + if (avnd_di_oper_send(cb, su, 0) == > >>>>>>> NCSCC_RC_SUCCESS) { > >>>>>>> + m_AVND_SU_FAILOVER_RESET(su); > >>>>>>> + } > >>>>>>> + } > >>>>>>> + } > >>>>>>> + } > >>>>>>> /* free the contents of avnd message */ > >>>>>>> avnd_msg_content_free(cb, &msg); > >>>>>>> diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc > >>>>>>> --- a/src/amf/amfnd/susm.cc > >>>>>>> +++ b/src/amf/amfnd/susm.cc > >>>>>>> @@ -1633,10 +1633,22 @@ uint32_t > avnd_su_pres_st_chng_prc(AVND_C > >>>>>>> m_AVND_SU_IS_ENABLED(su, is_en); > >>>>>>> if (true == is_en) { > >>>>>>> TRACE("SU oper state is enabled"); > >>>>>>> + // do not send su_oper state if component > >>>>>>> failover is > >>>>>>> in progress > >>>>>>> m_AVND_SU_OPER_STATE_SET(su, > >>>>>>> SA_AMF_OPERATIONAL_ENABLED); > >>>>>>> - rc = avnd_di_oper_send(cb, su, 0); > >>>>>>> - if (NCSCC_RC_SUCCESS != rc) > >>>>>>> - goto done; > >>>>>>> + if (componentfailover_in_progress(su) == true) { > >>>>>>> + si = reinterpret_cast<AVND_SU_SI_REC*> > >>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list)); > >>>>>>> + if (si == nullptr || > >>>>>>> all_csis_in_removed_state(su)) { > >>>>>>> + rc = avnd_di_oper_send(cb, su, 0); > >>>>>>> + if (rc != NCSCC_RC_SUCCESS) > >>>>>>> + goto done; > >>>>>>> + m_AVND_SU_FAILOVER_RESET(su); > >>>>>>> + } > >>>>>>> + } else { > >>>>>>> + rc = avnd_di_oper_send(cb, su, 0); > >>>>>>> + if (NCSCC_RC_SUCCESS != rc) > >>>>>>> + goto done; > >>>>>>> + } > >>>>>>> } > >>>>>>> else > >>>>>>> TRACE("SU oper state is disabled"); > >>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const > AVND_S > >>>>>>> } > >>>>>>> /** > >>>>>>> + * This function checks if the componentfailover is going on. > >>>>>>> + * @param su: ptr to the SU . > >>>>>>> + * > >>>>>>> + * @return true/false. > >>>>>>> + */ > >>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) { > >>>>>>> + if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) > && > >>>>>>> + (avnd_cb->oper_state != > SA_AMF_OPERATIONAL_DISABLED) && > >>>>>>> (!su->is_ncs) && > >>>>>>> + m_AVND_SU_IS_FAILOVER(su)) > >>>>>>> + return true; > >>>>>>> + return false; > >>>>>>> +} > >>>>>>> + > >>>>>>> +/** > >>>>>>> * This function checks if the sufailover and node switchover are > >>>>>>> going on. > >>>>>>> * @param su: ptr to the SU . > >>>>>>> * > >>>>>>> > >>>>>> > >>>>> > >>>>> ------------------------------------------------------------------------------ > >>>>> > >>>>> > >>>>> Check out the vibrant tech community on one of the world's most > >>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot > >>>>> _______________________________________________ > >>>>> Opensaf-devel mailing list > >>>>> Opensaf-devel@lists.sourceforge.net > >>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel > >>>>> > >>> > >>> > > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel