On 21-Dec-15 2:06 PM, Hans Nordebäck wrote:
> Hi Praveen,
>
> Ack for the patch, code review only. One comment,  the opensafd script should 
> also be updated
I did not get the comment here.
Is that what is mentioned  below:
"if amfnd do not exit in time, OPENSAF_TERMTIMEOUT?, then do a pkill-9 
osafamfnd,  thereafter pkill -9 osaf* or
final_clean should be fine for the remaining services." ?

Thanks,
Praveen
./Thanks HansN
>
> -----Original Message-----
> From: Hans Nordebäck [mailto:[email protected]]
> Sent: den 21 december 2015 08:33
> To: praveen malviya; [email protected]; Gary Lee
> Cc: [email protected]
> Subject: Re: [devel] [PATCH 1 of 1] amfnd: fix removal of assignments from 
> faulty NPI SU during shutdown phase [#1630]
>
> Hi Praveen,
>
> If the opensafd script is changed to:
> if amfnd do not exit in time, OPENSAF_TERMTIMEOUT?, then do a pkill-9  
> osafamfnd,  thereafter pkill -9 osaf* or final_clean should be fine for the 
> remaining services.
>   /Thanks HansN
>
> -----Original Message-----
> From: praveen malviya [mailto:[email protected]]
> Sent: den 21 december 2015 07:21
> To: Hans Nordebäck; [email protected]; Gary Lee
> Cc: [email protected]
> Subject: Re: [PATCH 1 of 1] amfnd: fix removal of assignments from faulty NPI 
> SU during shutdown phase [#1630]
>
>
>
> On 18-Dec-15 6:27 PM, Hans Nordebäck wrote:
>> Hi Praveen,
>>
>> A question, the opensafd script, in stop function, amfnd is sent a
>> SIGTERM signal and starts to terminate its components.
>> Despite the timeout retry logic, the final_clean seems to be called
>> even if amfnd has not finished.
>> Shouldn't there be a pkill -9 osaf* in the case where amfnd has not
>> finished in time? As it now looks the amfnd is running when the
>> final_clean is called to run all the clc_cli scripts, e.g. killing
>> amfd.
>> amfnd should not be running at this time and leads to the "ER AMF
>> director unexpectedly crashed".
> I agree on the explanation above. But calling "pkill -9 osaf*", in case amfnd 
> does not exit in time, will also cause node reboot most of the time because 
> amfnd will detect "avaDown" of director components and will do nodefailfast 
> before itself getting killed.
>
> I an not aware of any historical reason of dependency on amfnd for opensaf 
> stop other than these two advantages:
> -a user can configure OPENSAF_TERMTIMEOUT in nid.conf depending upon the 
> callback and cleanup scripts timeouts of application components.In this case 
> graceful exit of amfnd will help in faster stop of OpenSAF.
> -On graceful exit, cleanup script of all the middleware components will be 
> called to further release any resources without any node reboot.
>
> Thanks,
> Praveen
>
>
>>
>> /Thanks  HansN
>>
>>
>> On 12/16/2015 12:53 PM, [email protected] wrote:
>>>    osaf/services/saf/amf/amfnd/clc.cc  |  12 ++++++
>>>    osaf/services/saf/amf/amfnd/sidb.cc |   8 ++++
>>>    osaf/services/saf/amf/amfnd/susm.cc |  72
>>> +++++++++++++++++++++++++++++++++++++
>>>    3 files changed, 92 insertions(+), 0 deletions(-)
>>>
>>>
>>> In the reported problem, opensaf shutdown got stuck when one of the
>>> components of NPI su faulted.
>>>
>>> During opnesaf shutdown, amfnd started removing assignment from lower
>>> rank
>>> SI2 assigned to SU2. During this time, comp of a NPI SU1 having a csi
>>> from a higher rank SI1 faulted. AMFND successfully cleaned up the
>>> failed component. Now when all the assignments from SU2 got removed,
>>> amfnd started removing assignments from SU1.
>>> Since SU1 had only one
>>> component which was moved to UNINSTANTIATED state after clean up, no
>>> further clean up was required. Since no further comp was cleaned up
>>> in SU1 got stuck in TERMINATING state.
>>>
>>> In healthy condition when last CSI is removed from a NPI SU, SU will
>>> move to UNINSTANTIATED state. After this if no further lower rank SIs
>>> are available for application SUs, amfnd will launch clean up of all
>>> the components.
>>> But since SU1 got stuck in TERMINATING state AMFND could not launch
>>> clean up of all the comps and after 60 sceconds NID rebooted the
>>> node.
>>>
>>> Patch fixes the problem by resuming the SU FSM of failed component
>>> and removes further CSIs. If no CSI is available it will mark the SU
>>> UNINSTANTIATED and will go with the normal sequence of removal and
>>> then clean up of all comps.
>>>
>>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc
>>> b/osaf/services/saf/amf/amfnd/clc.cc
>>> --- a/osaf/services/saf/amf/amfnd/clc.cc
>>> +++ b/osaf/services/saf/amf/amfnd/clc.cc
>>> @@ -2204,6 +2204,18 @@ uint32_t avnd_comp_clc_terming_cleansucc
>>>                }
>>>            }
>>> +        if ((!comp->su->is_ncs) && (comp->csi_list.n_nodes > 0) &&
>>> +                (!m_AVND_SU_IS_PREINSTANTIABLE(comp->su))) {
>>> +            AVND_COMP_CSI_REC *csi = nullptr;
>>> +            /*
>>> +               Explantion written above for PI SU case is valid here
>>> also.
>>> +               However for a NPI comp in NPI SU, mark it REMOVED
>>> instead of
>>> +               generating remove done indication.
>>> +             */
>>> +            csi =
>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&comp->
>>> csi_list));
>>>
>>> +            if (csi != nullptr)
>>> +                m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi,
>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED);
>>> +        }
>>>            if (all_comps_terminated()) {
>>>                LOG_NO("Terminated all AMF components");
>>>                LOG_NO("Shutdown completed, exiting"); diff --git
>>> a/osaf/services/saf/amf/amfnd/sidb.cc
>>> b/osaf/services/saf/amf/amfnd/sidb.cc
>>> --- a/osaf/services/saf/amf/amfnd/sidb.cc
>>> +++ b/osaf/services/saf/amf/amfnd/sidb.cc
>>> @@ -185,6 +185,7 @@ AVND_SU_SI_REC *avnd_su_si_rec_add(AVND_
>>>        /*
>>>         * Update the rest of the parameters with default values.
>>>         */
>>> +    TRACE("Marking curr assigned state of '%s'
>>> unassigned.",si_rec->name.value);
>>>        m_AVND_SU_SI_CURR_ASSIGN_STATE_SET(si_rec,
>>> AVND_SU_SI_ASSIGN_STATE_UNASSIGNED);
>>>        /*
>>> @@ -394,6 +395,7 @@ AVND_COMP_CSI_REC *avnd_su_si_csi_rec_ad
>>>        /*
>>>         * Update the rest of the parameters with default values.
>>>         */
>>> +    TRACE("Marking curr assigned state of '%s'
>>> unassigned.",csi_rec->name.value);
>>>        m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi_rec,
>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED);
>>>        m_AVND_COMP_CSI_PRV_ASSIGN_STATE_SET(csi_rec,
>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED);
>>> @@ -477,6 +479,7 @@ AVND_SU_SI_REC *avnd_su_si_rec_modify(AV
>>>        /* store the prv assign-state & update the new assign-state */
>>>        si_rec->prv_assign_state = si_rec->curr_assign_state;
>>> +    TRACE_1("Marking curr assigned state of '%s'
>>> unassigned.",si_rec->name.value);
>>>        m_AVND_SU_SI_CURR_ASSIGN_STATE_SET(si_rec,
>>> AVND_SU_SI_ASSIGN_STATE_UNASSIGNED);
>>>        m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb, si_rec,
>>> AVND_CKPT_SU_SI_REC); @@ -518,6 +521,7 @@ uint32_t 
>>> avnd_su_si_csi_rec_modify(AVND_
>>>        TRACE_ENTER2("%p", param);
>>>        /* pick up all the csis belonging to the si & modify them */
>>>        if (!param) {
>>> +        TRACE_1("Marking curr assigned state of all CSIs of '%s'
>>> unassigned.",si_rec->name.value);
>>>            for (curr_csi = (AVND_COMP_CSI_REC
>>> *)m_NCS_DBLIST_FIND_FIRST(&si_rec->csi_list);
>>>                 curr_csi; curr_csi = (AVND_COMP_CSI_REC
>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_csi->si_dll_node)) {
>>>                /* store the prv assign-state & update the new
>>> assign-state */ @@ -544,6 +548,7 @@ uint32_t
>>> avnd_su_si_csi_rec_modify(AVND_
>>>            /* store the prv assign-state & update the new assign-state */
>>>            curr_csi->prv_assign_state = curr_csi->curr_assign_state;
>>> +        TRACE("Marking curr assigned state of '%s'
>>> unassigned.",curr_csi->name.value);
>>>            m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(curr_csi,
>>> AVND_COMP_CSI_ASSIGN_STATE_UNASSIGNED);
>>>            m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb, curr_csi,
>>> AVND_CKPT_CSI_REC);
>>>        }            /* for */
>>> @@ -574,6 +579,7 @@ uint32_t avnd_su_si_all_modify(AVND_CB *
>>>        TRACE_ENTER2();
>>>        /* modify all the si records */
>>> +    TRACE("Marking curr assigned state all SIs in '%s'
>>> unassigned.",su->name.value);
>>>        for (curr_si = (AVND_SU_SI_REC
>>> *)m_NCS_DBLIST_FIND_FIRST(&su->si_list);
>>>             curr_si; curr_si = (AVND_SU_SI_REC
>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_si->su_dll_node)) {
>>>            /* store the prv state & update the new state */ @@ -624,6
>>> +630,7 @@ uint32_t avnd_su_si_csi_all_modify(AVND_
>>>        TRACE_ENTER2("%p", param);
>>>        /* pick up all the csis belonging to all the sis & modify them */
>>>        if (!param) {
>>> +        TRACE("Marking curr assigned state all CSIs in SIs of '%s'
>>> unassigned.",su->name.value);
>>>            for (curr_si = (AVND_SU_SI_REC
>>> *)m_NCS_DBLIST_FIND_FIRST(&su->si_list);
>>>                 curr_si; curr_si = (AVND_SU_SI_REC
>>> *)m_NCS_DBLIST_FIND_NEXT(&curr_si->su_dll_node)) {
>>>                for (curr_csi = (AVND_COMP_CSI_REC
>>> *)m_NCS_DBLIST_FIND_FIRST(&curr_si->csi_list);
>>> @@ -658,6 +665,7 @@ uint32_t avnd_su_si_csi_all_modify(AVND_
>>>            }
>>>            if (false == curr_comp->assigned_flag) {
>>>                /* modify all the csi-records */
>>> +            TRACE("Marking curr assigned state all CSIs assigned to
>>> '%s' unassigned.",curr_comp->name.value);
>>>                for (curr_csi =
>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&curr_c
>>> omp->csi_list));
>>>
>>>                        curr_csi;
>>>                        curr_csi =
>>> m_AVND_CSI_REC_FROM_COMP_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&curr_cs
>>> i->comp_dll_node)))
>>>
>>> diff --git a/osaf/services/saf/amf/amfnd/susm.cc
>>> b/osaf/services/saf/amf/amfnd/susm.cc
>>> --- a/osaf/services/saf/amf/amfnd/susm.cc
>>> +++ b/osaf/services/saf/amf/amfnd/susm.cc
>>> @@ -1917,6 +1917,19 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
>>>                 */
>>>                avnd_di_uns32_upd_send(AVSV_SA_AMF_SU,
>>> saAmfSUOperState_ID, &su->name, su->oper);
>>>            }
>>> +
>>> +        if ((prv_st == SA_AMF_PRESENCE_INSTANTIATED) &&
>>> +                                (final_st ==
>>> SA_AMF_PRESENCE_UNINSTANTIATED) &&
>>> +                (cb->term_state ==
>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) {
>>> +            /*
>>> +               During shutdown phase, all comps of SU may fault. In
>>> that case,
>>> +               SU FSM marks SU in TERMINAIING state and finally
>>> + moves
>>> it to
>>> +               UNINSTANTIATED state. So generated the assignment
>>> + done
>>> indication
>>> +               so that removal of lower rank SI can proceed.
>>> +             */
>>> +            rc = avnd_su_si_oper_done(cb, su, si);
>>> +            m_AVND_SU_ALL_SI_RESET(su);
>>> +        }
>>>        }
>>>     done:
>>> @@ -2264,6 +2277,21 @@ uint32_t avnd_su_pres_insting_compinstfa
>>>        return rc;
>>>    }
>>> +/**
>>> + * @brief  Returns first assigned csi traversing from end.
>>> + * @return Ptr to csi_rec.
>>> + */
>>> +static AVND_COMP_CSI_REC *get_next_assigned_csi_from_end(const
>>> AVND_SU_SI_REC *si)
>>> +{
>>> +    for (AVND_COMP_CSI_REC *csi = (AVND_COMP_CSI_REC
>>> *)m_NCS_DBLIST_FIND_LAST(&si->csi_list);
>>> +            (csi != nullptr);
>>> +            csi = (AVND_COMP_CSI_REC
>>> *)m_NCS_DBLIST_FIND_PREV(&csi->si_dll_node)) {
>>> +        if (m_AVND_COMP_CSI_CURR_ASSIGN_STATE_IS_ASSIGNED(csi) &&
>>> ((csi->comp != nullptr)
>>> +                    && (csi->comp->pres ==
>>> SA_AMF_PRESENCE_INSTANTIATED)))
>>> +            return csi;
>>> +    }
>>> +    return nullptr;
>>> +}
>>>
>>> /********************************************************************
>>> ********
>>>
>>>      Name          : avnd_su_pres_inst_suterm_hdler
>>> @@ -2334,6 +2362,32 @@ uint32_t avnd_su_pres_inst_suterm_hdler(
>>>                           AVND_COMP_CLC_PRES_FSM_EV_CLEANUP :
>>> AVND_COMP_CLC_PRES_FSM_EV_TERM);
>>>            if (NCSCC_RC_SUCCESS != rc)
>>>                goto done;
>>> +
>>> +        /*
>>> +           During shutdown phase if a component faults, it will be
>>> cleaned up by AMFND
>>> +           irrespective of recovery policy. This component will move
>>> to UNINSTANTIATED
>>> +           after successful clean up. When amfnd starts removing SI
>>> from SU of this comp,
>>> +           it will have to skip the CSI of cleaned up component.
>>> +        */
>>> +        if ((csi->comp->pres == SA_AMF_PRESENCE_UNINSTANTIATED) &&
>>> +            (cb->term_state ==
>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) {
>>> +            m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi,
>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED);
>>> +            avnd_su_pres_state_set(su, SA_AMF_PRESENCE_TERMINATING);
>>> +            AVND_COMP_CSI_REC *assigned_csi =
>>> get_next_assigned_csi_from_end(si);
>>> +            if (assigned_csi == nullptr) {
>>> +                //Components of all the CSIs in SI are cleaned up.
>>> +                avnd_su_pres_state_set(su,
>>> SA_AMF_PRESENCE_UNINSTANTIATED);
>>> +                goto done;
>>> +            } else {
>>> +                //One CSI is still assigned.
>>> +                m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(assigned_csi,
>>> +                        AVND_COMP_CSI_ASSIGN_STATE_REMOVING);
>>> +                rc = avnd_comp_clc_fsm_trigger(cb,
>>> + assigned_csi->comp,
>>> +
>>> (m_AVND_COMP_IS_FAILED(assigned_csi->comp)) ?
>>> +
>>> AVND_COMP_CLC_PRES_FSM_EV_CLEANUP :
>>> +                                        AVND_COMP_CLC_PRES_FSM_EV_TERM);
>>> +            }
>>> +        }
>>>        }
>>>        /* transition to terminating state */ @@ -2937,6 +2991,24 @@
>>> uint32_t avnd_su_pres_terming_compuninst
>>>            if (all_csis_in_assigned_state(su) ||
>>> all_csis_in_removed_state(su)) {
>>>                TRACE("SI Assignment done");
>>>                avnd_su_pres_state_set(su,
>>> SA_AMF_PRESENCE_UNINSTANTIATED);
>>> +            goto done;
>>> +        }
>>> +
>>> +        /*
>>> +           During shutdown phase if a component faults, it will be
>>> cleaned up by AMFND
>>> +           irrespective of recovery policy. This component will move
>>> to UNINSTANTIATED
>>> +           after successful clean up. When amfnd starts removing SI
>>> from SU of this comp,
>>> +           it will have to skip the CSI of cleaned up component.
>>> +        */
>>> +        if ((curr_csi != NULL) && (curr_csi->comp->pres ==
>>> SA_AMF_PRESENCE_UNINSTANTIATED) &&
>>> +                (cb->term_state ==
>>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)) {
>>> +            m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(curr_csi,
>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVED);
>>> +            AVND_COMP_CSI_REC *assigned_csi =
>>> get_next_assigned_csi_from_end(curr_csi->si);
>>> +            m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(assigned_csi,
>>> AVND_COMP_CSI_ASSIGN_STATE_REMOVING);
>>> +            rc = avnd_comp_clc_fsm_trigger(cb, assigned_csi->comp,
>>> +                    (m_AVND_COMP_IS_FAILED(assigned_csi->comp)) ?
>>> +                    AVND_COMP_CLC_PRES_FSM_EV_CLEANUP :
>>> +                    AVND_COMP_CLC_PRES_FSM_EV_TERM);
>>>            }
>>>        }
>>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to