Hi Praveen, yes, I think the opensafd script should be updated as mentioned below. It is related to the problem described in #1630 and is obviously not correct. /Thanks HansN
On 12/21/2015 10:05 AM, praveen malviya wrote: > > > On 21-Dec-15 2:06 PM, Hans Nordebäck wrote: >> Hi Praveen, >> >> Ack for the patch, code review only. One comment, the opensafd >> script should also be updated > I did not get the comment here. > Is that what is mentioned below: > "if amfnd do not exit in time, OPENSAF_TERMTIMEOUT?, then do a pkill-9 > osafamfnd, thereafter pkill -9 osaf* or > final_clean should be fine for the remaining services." ? > > Thanks, > Praveen > ./Thanks HansN >> >> -----Original Message----- >> From: Hans Nordebäck [mailto:[email protected]] >> Sent: den 21 december 2015 08:33 >> To: praveen malviya; [email protected]; Gary Lee >> Cc: [email protected] >> Subject: Re: [devel] [PATCH 1 of 1] amfnd: fix removal of assignments >> from faulty NPI SU during shutdown phase [#1630] >> >> Hi Praveen, >> >> If the opensafd script is changed to: >> if amfnd do not exit in time, OPENSAF_TERMTIMEOUT?, then do a >> pkill-9 osafamfnd, thereafter pkill -9 osaf* or final_clean should >> be fine for the remaining services. >> /Thanks HansN >> >> -----Original Message----- >> From: praveen malviya [mailto:[email protected]] >> Sent: den 21 december 2015 07:21 >> To: Hans Nordebäck; [email protected]; Gary Lee >> Cc: [email protected] >> Subject: Re: [PATCH 1 of 1] amfnd: fix removal of assignments from >> faulty NPI SU during shutdown phase [#1630] >> >> >> >> On 18-Dec-15 6:27 PM, Hans Nordebäck wrote: >>> Hi Praveen, >>> >>> A question, the opensafd script, in stop function, amfnd is sent a >>> SIGTERM signal and starts to terminate its components. >>> Despite the timeout retry logic, the final_clean seems to be called >>> even if amfnd has not finished. >>> Shouldn't there be a pkill -9 osaf* in the case where amfnd has not >>> finished in time? As it now looks the amfnd is running when the >>> final_clean is called to run all the clc_cli scripts, e.g. killing >>> amfd. >>> amfnd should not be running at this time and leads to the "ER AMF >>> director unexpectedly crashed". >> I agree on the explanation above. But calling "pkill -9 osaf*", in >> case amfnd does not exit in time, will also cause node reboot most of >> the time because amfnd will detect "avaDown" of director components >> and will do nodefailfast before itself getting killed. >> >> I an not aware of any historical reason of dependency on amfnd for >> opensaf stop other than these two advantages: >> -a user can configure OPENSAF_TERMTIMEOUT in nid.conf depending upon >> the callback and cleanup scripts timeouts of application >> components.In this case graceful exit of amfnd will help in faster >> stop of OpenSAF. >> -On graceful exit, cleanup script of all the middleware components >> will be called to further release any resources without any node reboot. >> >> Thanks, >> Praveen >> >> >>> >>> /Thanks HansN >>> >>> >>> On 12/16/2015 12:53 PM, [email protected] wrote: >>>> osaf/services/saf/amf/amfnd/clc.cc | 12 ++++++ >>>> osaf/services/saf/amf/amfnd/sidb.cc | 8 ++++ >>>> osaf/services/saf/amf/amfnd/susm.cc | 72 >>>> +++++++++++++++++++++++++++++++++++++ >>>> 3 files changed, 92 insertions(+), 0 deletions(-) >>>> >>>> >>>> In the reported problem, opensaf shutdown got stuck when one of the >>>> components of NPI su faulted. >>>> >>>> During opnesaf shutdown, amfnd started removing assignment from lower >>>> rank >>>> SI2 assigned to SU2. During this time, comp of a NPI SU1 having a csi >>>> from a higher rank SI1 faulted. AMFND successfully cleaned up the >>>> failed component. Now when all the assignments from SU2 got removed, >>>> amfnd started removing assignments from SU1. >>>> Since SU1 had only one >>>> component which was moved to UNINSTANTIATED state after clean up, no >>>> further clean up was required. Since no further comp was cleaned up >>>> in SU1 got stuck in TERMINATING state. >>>> >>>> In healthy condition when last CSI is removed from a NPI SU, SU will >>>> move to UNINSTANTIATED state. After this if no further lower rank SIs >>>> are available for application SUs, amfnd will launch clean up of all >>>> the components. >>>> But since SU1 got stuck in TERMINATING state AMFND could not launch >>>> clean up of all the comps and after 60 sceconds NID rebooted the >>>> node. >>>> >>>> Patch fixes the problem by resuming the SU FSM of failed component >>>> and removes further CSIs. If no CSI is available it will mark the SU >>>> UNINSTANTIATED and will go with the normal sequence of removal and >>>> then clean up of all comps. >>>> >>>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc >>>> b/osaf/services/saf/amf/amfnd/clc.cc >>>> --- a/osaf/services/saf/amf/amfnd/clc.cc >>>> +++ b/osaf/services/saf/amf/amfnd/clc.cc >>>> @@ -2204,6 +2204,18 @@ uint32_t avnd_comp_clc_terming_cleansucc >>>> } >>>> } >>>> + if ((!comp->su->is_ncs) && (comp->csi_list.n_nodes > 0) && >>>> + (!m_AVND_SU_IS_PREINSTANTIABLE(comp->su))) { >>>> + AVND_COMP_CSI_REC *csi = nullptr; >>>> + /* >>>> + Explantion written above for PI SU case is valid here >>>> also. >>>> + However for a NPI comp in NPI SU, mark it REMOVED >>>> instead of >>>> + generating remove done indication. >>>> + */ >>>> + csi = >>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&comp-> >>>> csi_list)); >>>> >>>> + if (csi != nullptr) >>>> + m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi, >>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED); >>>> + } >>>> if (all_comps_terminated()) { >>>> LOG_NO("Terminated all AMF components"); >>>> LOG_NO("Shutdown completed, exiting"); diff --git >>>> a/osaf/services/saf/amf/amfnd/sidb.cc >>>> b/osaf/services/saf/amf/amfnd/sidb.cc >>>> --- a/osaf/services/saf/amf/amfnd/sidb.cc >>>> +++ b/osaf/services/saf/amf/amfnd/sidb.cc >>>> @@ -185,6 +185,7 @@ AVND_SU_SI_REC *avnd_su_si_rec_add(AVND_ >>>> /* >>>> * Update the rest of the parameters with default values. >>>> */ >>>> + TRACE("Marking curr assigned state of '%s' >>>> unassigned.",si_rec->name.value); >>>> m_AVND_SU_SI_CURR_ASSIGN_STATE_SET(si_rec, >>>> AVND_SU_SI_ASSIGN_STATE_UNASSIGNED); >>>> /* >>>> @@ -394,6 +395,7 @@ AVND_COMP_CSI_REC *avnd_su_si_csi_rec_ad >>>> /* >>>> * Update the rest of the parameters with default values. >>>> */ >>>> + TRACE("Marking curr assigned state of '%s' >>>> unassigned.",csi_rec->name.value); >>>> m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi_rec, >>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED); >>>> m_AVND_COMP_CSI_PRV_ASSIGN_STATE_SET(csi_rec, >>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED); >>>> @@ -477,6 +479,7 @@ AVND_SU_SI_REC *avnd_su_si_rec_modify(AV >>>> /* store the prv assign-state & update the new assign-state */ >>>> si_rec->prv_assign_state = si_rec->curr_assign_state; >>>> + TRACE_1("Marking curr assigned state of '%s' >>>> unassigned.",si_rec->name.value); >>>> m_AVND_SU_SI_CURR_ASSIGN_STATE_SET(si_rec, >>>> AVND_SU_SI_ASSIGN_STATE_UNASSIGNED); >>>> m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb, si_rec, >>>> AVND_CKPT_SU_SI_REC); @@ -518,6 +521,7 @@ uint32_t >>>> avnd_su_si_csi_rec_modify(AVND_ >>>> TRACE_ENTER2("%p", param); >>>> /* pick up all the csis belonging to the si & modify them */ >>>> if (!param) { >>>> + TRACE_1("Marking curr assigned state of all CSIs of '%s' >>>> unassigned.",si_rec->name.value); >>>> for (curr_csi = (AVND_COMP_CSI_REC >>>> *)m_NCS_DBLIST_FIND_FIRST(&si_rec->csi_list); >>>> curr_csi; curr_csi = (AVND_COMP_CSI_REC >>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_csi->si_dll_node)) { >>>> /* store the prv assign-state & update the new >>>> assign-state */ @@ -544,6 +548,7 @@ uint32_t >>>> avnd_su_si_csi_rec_modify(AVND_ >>>> /* store the prv assign-state & update the new >>>> assign-state */ >>>> curr_csi->prv_assign_state = curr_csi->curr_assign_state; >>>> + TRACE("Marking curr assigned state of '%s' >>>> unassigned.",curr_csi->name.value); >>>> m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(curr_csi, >>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED); >>>> m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb, curr_csi, >>>> AVND_CKPT_CSI_REC); >>>> } /* for */ >>>> @@ -574,6 +579,7 @@ uint32_t avnd_su_si_all_modify(AVND_CB * >>>> TRACE_ENTER2(); >>>> /* modify all the si records */ >>>> + TRACE("Marking curr assigned state all SIs in '%s' >>>> unassigned.",su->name.value); >>>> for (curr_si = (AVND_SU_SI_REC >>>> *)m_NCS_DBLIST_FIND_FIRST(&su->si_list); >>>> curr_si; curr_si = (AVND_SU_SI_REC >>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_si->su_dll_node)) { >>>> /* store the prv state & update the new state */ @@ -624,6 >>>> +630,7 @@ uint32_t avnd_su_si_csi_all_modify(AVND_ >>>> TRACE_ENTER2("%p", param); >>>> /* pick up all the csis belonging to all the sis & modify >>>> them */ >>>> if (!param) { >>>> + TRACE("Marking curr assigned state all CSIs in SIs of '%s' >>>> unassigned.",su->name.value); >>>> for (curr_si = (AVND_SU_SI_REC >>>> *)m_NCS_DBLIST_FIND_FIRST(&su->si_list); >>>> curr_si; curr_si = (AVND_SU_SI_REC >>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_si->su_dll_node)) { >>>> for (curr_csi = (AVND_COMP_CSI_REC >>>> *)m_NCS_DBLIST_FIND_FIRST(&curr_si->csi_list); >>>> @@ -658,6 +665,7 @@ uint32_t avnd_su_si_csi_all_modify(AVND_ >>>> } >>>> if (false == curr_comp->assigned_flag) { >>>> /* modify all the csi-records */ >>>> + TRACE("Marking curr assigned state all CSIs assigned to >>>> '%s' unassigned.",curr_comp->name.value); >>>> for (curr_csi = >>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&curr_c >>>> omp->csi_list)); >>>> >>>> curr_csi; >>>> curr_csi = >>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&curr_cs >>>> i->comp_dll_node))) >>>> >>>> diff --git a/osaf/services/saf/amf/amfnd/susm.cc >>>> b/osaf/services/saf/amf/amfnd/susm.cc >>>> --- a/osaf/services/saf/amf/amfnd/susm.cc >>>> +++ b/osaf/services/saf/amf/amfnd/susm.cc >>>> @@ -1917,6 +1917,19 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C >>>> */ >>>> avnd_di_uns32_upd_send(AVSV_SA_AMF_SU, >>>> saAmfSUOperState_ID, &su->name, su->oper); >>>> } >>>> + >>>> + if ((prv_st == SA_AMF_PRESENCE_INSTANTIATED) && >>>> + (final_st == >>>> SA_AMF_PRESENCE_UNINSTANTIATED) && >>>> + (cb->term_state == >>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) { >>>> + /* >>>> + During shutdown phase, all comps of SU may fault. In >>>> that case, >>>> + SU FSM marks SU in TERMINAIING state and finally >>>> + moves >>>> it to >>>> + UNINSTANTIATED state. So generated the assignment >>>> + done >>>> indication >>>> + so that removal of lower rank SI can proceed. >>>> + */ >>>> + rc = avnd_su_si_oper_done(cb, su, si); >>>> + m_AVND_SU_ALL_SI_RESET(su); >>>> + } >>>> } >>>> done: >>>> @@ -2264,6 +2277,21 @@ uint32_t avnd_su_pres_insting_compinstfa >>>> return rc; >>>> } >>>> +/** >>>> + * @brief Returns first assigned csi traversing from end. >>>> + * @return Ptr to csi_rec. >>>> + */ >>>> +static AVND_COMP_CSI_REC *get_next_assigned_csi_from_end(const >>>> AVND_SU_SI_REC *si) >>>> +{ >>>> + for (AVND_COMP_CSI_REC *csi = (AVND_COMP_CSI_REC >>>> *)m_NCS_DBLIST_FIND_LAST(&si->csi_list); >>>> + (csi != nullptr); >>>> + csi = (AVND_COMP_CSI_REC >>>> *)m_NCS_DBLIST_FIND_PREV(&csi->si_dll_node)) { >>>> + if (m_AVND_COMP_CSI_CURR_ASSIGN_STATE_IS_ASSIGNED(csi) && >>>> ((csi->comp != nullptr) >>>> + && (csi->comp->pres == >>>> SA_AMF_PRESENCE_INSTANTIATED))) >>>> + return csi; >>>> + } >>>> + return nullptr; >>>> +} >>>> >>>> /******************************************************************** >>>> ******** >>>> >>>> Name : avnd_su_pres_inst_suterm_hdler >>>> @@ -2334,6 +2362,32 @@ uint32_t avnd_su_pres_inst_suterm_hdler( >>>> AVND_COMP_CLC_PRES_FSM_EV_CLEANUP : >>>> AVND_COMP_CLC_PRES_FSM_EV_TERM); >>>> if (NCSCC_RC_SUCCESS != rc) >>>> goto done; >>>> + >>>> + /* >>>> + During shutdown phase if a component faults, it will be >>>> cleaned up by AMFND >>>> + irrespective of recovery policy. This component will move >>>> to UNINSTANTIATED >>>> + after successful clean up. When amfnd starts removing SI >>>> from SU of this comp, >>>> + it will have to skip the CSI of cleaned up component. >>>> + */ >>>> + if ((csi->comp->pres == SA_AMF_PRESENCE_UNINSTANTIATED) && >>>> + (cb->term_state == >>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) { >>>> + m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi, >>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED); >>>> + avnd_su_pres_state_set(su, SA_AMF_PRESENCE_TERMINATING); >>>> + AVND_COMP_CSI_REC *assigned_csi = >>>> get_next_assigned_csi_from_end(si); >>>> + if (assigned_csi == nullptr) { >>>> + //Components of all the CSIs in SI are cleaned up. >>>> + avnd_su_pres_state_set(su, >>>> SA_AMF_PRESENCE_UNINSTANTIATED); >>>> + goto done; >>>> + } else { >>>> + //One CSI is still assigned. >>>> + m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(assigned_csi, >>>> + AVND_COMP_CSI_ASSIGN_STATE_REMOVING); >>>> + rc = avnd_comp_clc_fsm_trigger(cb, >>>> + assigned_csi->comp, >>>> + >>>> (m_AVND_COMP_IS_FAILED(assigned_csi->comp)) ? >>>> + >>>> AVND_COMP_CLC_PRES_FSM_EV_CLEANUP : >>>> + AVND_COMP_CLC_PRES_FSM_EV_TERM); >>>> + } >>>> + } >>>> } >>>> /* transition to terminating state */ @@ -2937,6 +2991,24 @@ >>>> uint32_t avnd_su_pres_terming_compuninst >>>> if (all_csis_in_assigned_state(su) || >>>> all_csis_in_removed_state(su)) { >>>> TRACE("SI Assignment done"); >>>> avnd_su_pres_state_set(su, >>>> SA_AMF_PRESENCE_UNINSTANTIATED); >>>> + goto done; >>>> + } >>>> + >>>> + /* >>>> + During shutdown phase if a component faults, it will be >>>> cleaned up by AMFND >>>> + irrespective of recovery policy. This component will move >>>> to UNINSTANTIATED >>>> + after successful clean up. When amfnd starts removing SI >>>> from SU of this comp, >>>> + it will have to skip the CSI of cleaned up component. >>>> + */ >>>> + if ((curr_csi != NULL) && (curr_csi->comp->pres == >>>> SA_AMF_PRESENCE_UNINSTANTIATED) && >>>> + (cb->term_state == >>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) { >>>> + m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(curr_csi, >>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED); >>>> + AVND_COMP_CSI_REC *assigned_csi = >>>> get_next_assigned_csi_from_end(curr_csi->si); >>>> + m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(assigned_csi, >>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVING); >>>> + rc = avnd_comp_clc_fsm_trigger(cb, assigned_csi->comp, >>>> + (m_AVND_COMP_IS_FAILED(assigned_csi->comp)) ? >>>> + AVND_COMP_CLC_PRES_FSM_EV_CLEANUP : >>>> + AVND_COMP_CLC_PRES_FSM_EV_TERM); >>>> } >>>> } >>> >> >> ------------------------------------------------------------------------------ >> >> >> _______________________________________________ >> Opensaf-devel mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/opensaf-devel >> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
