Hi Praveen,
   I have added the "check" to your patch, it works fine now
   diff            --git            a/osaf/services/saf/amf/amfnd/susm.cc
   b/osaf/services/saf/amf/amfnd/susm.cc
   --- a/osaf/services/saf/amf/amfnd/susm.cc
   +++ b/osaf/services/saf/amf/amfnd/susm.cc
   @@ -2637,12 +2637,14 @@ uint32_t avnd_su_pres_restart_compinst_h
                   /* get the next csi */
                            curr_csi         =         (AVND_COMP_CSI_REC
   *)m_NCS_DBLIST_FIND_NEXT(&curr_csi->si_dll_node);
   -               if (curr_csi) {
   +               if (curr_csi &&
   +
   m_AVND_COMP_CSI_CURR_ASSIGN_STATE_IS_UNASSIGNED(curr_csi)  && // avoid
   restarting the already-restarted comp
   +                       curr_csi->comp->pres != SA_AMF_PRESENCE_RESTARTING)
   { // avoid restarting the component which is in progress of componentRestart
   recovery
                           /* we have another csi. trigger the comp fsm with
   RestartEv */
                           rc = avnd_comp_clc_fsm_trigger(cb, curr_csi->comp,
   AVND_COMP_CLC_PRES_FSM_EV_RESTART);
                           if (NCSCC_RC_SUCCESS != rc)
                                   goto done;
   -               } else {
   +               } else if (all_csis_in_assigned_state(su)) {
                           /* => si assignment done */
                                               avnd_su_pres_state_set(su,
   SA_AMF_PRESENCE_INSTANTIATED);
                   }
   It's now using UNASSIGNED csi (rather than FAILED flag of comp as in my
   patch) to avoid the duplication of component restart, and it also can reuse
   all_csis_in_assigned_state()
   What do you think?
   Thanks/
   Minh
   On 5/14/2014 2:37 PM, praveen malviya wrote:

     On 13-May-14 6:38 PM, minhchau wrote:

     Hi Praveen,
     My patch has problem about the way npi component is retrieved, it should
     be retrieved from csi as currently.
     I just load your patch and test it. I can see components are incurred many
     extra restarts than expected.

     This will happen because of previous recovery policy "component restart" a
     component is restarted. But due to surestart policy every component
     restarted again and
     some components may get restarted again and again. But a component needs
     to be restarted in the context of surestart atleast once even if it has
     been already restarted in component restart recovery. The reason is there
     may be dependency among components.
     So we are getting all the components instantiated at the expense of
     multiple restart of some components and this is due to parallel execution
     of recovery policies component restart and surestart.  ideally the fix
     should differentiate if the component is restarted in the context of
     component restart policy
     or surestart policy. In this way when cleanup of component because of comp
     restart recovery succeeds and AMFND sees that already surestart is going
     on then
     it should not instantiate this component again as it will be taken care in
     surestart. At the same time Surestart should use this successful cleanup
     to  just instantiate the component instead of doing further restart
     (cleanup + instantiated) because in both the cases cleanup will be done
     same way.
     I will also see if this can be achieved.

     Let's look at this case:
     kill comp #15 #10 #8 #2 -> componentRestart
     kill comp #5 -> suRestart
     If restart comp #2 finishes, it will continue restart #1
     If restart comp #5 finishes, it will restart the #4, and then #3, then #2
     (again) because presence state of comp #2 is INSTANTIATED now
     Similarly, the other restart iterations:
     #15 -> #14 -> #13 -> #12 -> #11 -> #10 -> #9 ... -> #1
     #10 -> #9 -> #8 -> ... -> #1
     #8 -> #7 ... -> #1
     So many components have to be restarted many times because it is at the
     end  of  restart iteration which originally is initiated from other
     components.
     I probably keep my approach which marks the comp FAILED (or any better
     flag?) (in order that it restarts only one time per component), revise the
     way of looping through npi components and send it out again.

     I could not get much benefit of marking all the components failed.
     Please add some comment/explanation  for making all components failed when
     patch will be floated again.
     Thanks,
     Praveen

     Thanks/
     Minh
     On 5/13/2014 7:34 PM, praveen malviya wrote:

     Please see inline some comments and  at the end a simple patch that fixes
     the problem with less code.
     Thanks
     Praveen
     On 13-May-14 1:27 PM, Minh Hon Chau wrote:

       osaf/services/saf/amf/amfnd/err.cc  | 13 ++++++++++---
       osaf/services/saf/amf/amfnd/susm.cc |  24 ++++++++++++++++++------
       2 files changed, 28 insertions(+), 9 deletions(-)
     Problem: In case of npi su restart recovery, the condition to change su
     presence
     state is not sufficient. If the component of the last csi in csi_list has
     been
     restarted first due to componentRestart (before suRestart), amfnd can not
     find
     any next csi and changes the su presence state to INSTANTIATED, this will
     miss
     the restart for the rest of components.
     Change: Mark all components are FAILED during suRestart, and su presence
     state
     is changed only if all components are unmarked FAILED.
     diff           --git           a/osaf/services/saf/amf/amfnd/err.cc
     b/osaf/services/saf/amf/amfnd/err.cc
     --- a/osaf/services/saf/amf/amfnd/err.cc
     +++ b/osaf/services/saf/amf/amfnd/err.cc
     @@ -616,13 +616,20 @@ static uint32_t avnd_err_rcvr_comp_resta
       uint32_t avnd_err_rcvr_su_restart(AVND_CB *cb, AVND_SU *su, AVND_COMP
     *failed_comp)
       {
           uint32_t rc = NCSCC_RC_SUCCESS;
     +    AVND_COMP *comp = NULL;
           TRACE_ENTER();
       -    /* mark the su & comp failed */
     +    /* mark the su & its all comps are failed if not marked */
           m_AVND_SU_FAILED_SET(su);
           m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb, su, AVND_CKPT_SU_FLAG_CHANGE);
     -    m_AVND_COMP_FAILED_SET(failed_comp);
     -           m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb,       failed_comp,
     AVND_CKPT_COMP_FLAG_CHANGE);
     +
     +    for (comp =
     m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
     +         comp; comp =
     m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&comp->su_dll_node
     ))) {
     +        if (!m_AVND_COMP_IS_FAILED(comp)) {
     +            m_AVND_COMP_FAILED_SET(comp);
     +                  m_AVND_SEND_CKPT_UPDT_ASYNC_UPDT(cb,       comp,
     AVND_CKPT_COMP_FLAG_CHANGE);
     +        }
     +    }

     Why there is no differentiation between a PI and NPI SU in the loop. The
     problem is reported in NPI SU case only and does not exist in PI SU case.

           /* delete su current info */
           rc = avnd_su_curr_info_del(cb, su);
     diff           --git          a/osaf/services/saf/amf/amfnd/susm.cc
     b/osaf/services/saf/amf/amfnd/susm.cc
     --- a/osaf/services/saf/amf/amfnd/susm.cc
     +++ b/osaf/services/saf/amf/amfnd/susm.cc
     @@ -2638,12 +2638,24 @@ uint32_t avnd_su_pres_restart_compinst_h
               /* get the next csi */
               curr_csi = (AVND_COMP_CSI_REC
     *)m_NCS_DBLIST_FIND_NEXT(&curr_csi->si_dll_node);
               if (curr_csi) {
     -            /* we have another csi. trigger the comp fsm with RestartEv
     */
     -             rc  =  avnd_comp_clc_fsm_trigger(cb,  curr_csi->comp,
     AVND_COMP_CLC_PRES_FSM_EV_RESTART);
     -            if (NCSCC_RC_SUCCESS != rc)
     -                goto done;
     -        } else {
     -            /* => si assignment done */
     +            /* we have another csi. trigger the comp fsm with RestartEv
     if this comp is marked failed*/
     +            if (curr_csi->comp->pres == SA_AMF_PRESENCE_INSTANTIATED &&
     m_AVND_COMP_IS_FAILED(curr_csi->comp)) {
     +                rc = avnd_comp_clc_fsm_trigger(cb, curr_csi->comp,
     AVND_COMP_CLC_PRES_FSM_EV_RESTART);
     +                if (NCSCC_RC_SUCCESS != rc)
     +                    goto done;
     +            }
     +        }
     +
     +        /* check whether all comp's are instantiated */
     +        for (curr_comp =
     m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
     +             curr_comp && all_inst == true;
     +             curr_comp =
     m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&curr_comp->su_dll
     _node))) {
     +            if (curr_comp->pres != SA_AMF_PRESENCE_INSTANTIATED ||
     m_AVND_COMP_IS_FAILED(curr_comp))
     +                all_inst = false;
     +        }
     +
     +        /* OK, all are instantiated */
     +        if (all_inst == true) {
                   avnd_su_pres_state_set(su, SA_AMF_PRESENCE_INSTANTIATED);
               }

     In case of NPI SU, loops cannot be iterated over components as a user can
     configure  spare components also. In that case SU will get stuck in
     Restarting state as this check will never pass.
     There is much simpler approach to solve this issue.It takes care of this
     spare components problem also. Here is the simple patch :
     diff           --git          a/osaf/services/saf/amf/amfnd/susm.cc
     b/osaf/services/saf/amf/amfnd/susm.cc
     --- a/osaf/services/saf/amf/amfnd/susm.cc
     +++ b/osaf/services/saf/amf/amfnd/susm.cc
     @@ -2642,8 +2642,7 @@ uint32_t avnd_su_pres_restart_compinst_h
                             rc = avnd_comp_clc_fsm_trigger(cb, curr_csi->comp,
     AVND_COMP_CLC_PRES_FSM_EV_RESTART);
                             if (NCSCC_RC_SUCCESS != rc)
                                     goto done;
     -               } else {
     -                       /* => si assignment done */
     +               } else if (all_csis_in_assigned_state(su)) {
                                              avnd_su_pres_state_set(su,
     SA_AMF_PRESENCE_INSTANTIATED);
                     }
             }
     What do you think about this approach?
      Isn't  simple and much less code.
     I have tested it for 2N model and it works fine with 15 csi configuration.

           }
------------------------------------------------------------------------------
"Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
Instantly run your Selenium tests across 300+ browser/OS combos.
Get unparalleled scalability from the best Selenium testing platform available
Simple to use. Nothing to install. Get started now for free."
http://p.sf.net/sfu/SauceLabs
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to