Hi Praveen,

I'm also trying to reproduce #309. When the fault that escalates to 
nodeFailover occurs during the standby assignment, the SU on the other node has 
already completed its active assignment. So, as far as I can see, the code 
inside "if (all_assignments_done(a_susi->su)) {" always runs:

-----------------
                        /* the SI relationships to the SU is quiesced assigned and the
                         * other SU is being modified to Active. If this
                         * SU admin is shutdown change to LOCK. If this SU switch state
                         * is true change to false. Remove the SU from operation list.
                         * Add that SU to the operation list . Change state to
                         * SG_realign state. Free all the SI assignments to this SU.
                         */

                        if (all_assignments_done(a_susi->su)) {
                                /* Since Act assignment is completely done, so
                                   we don't expect any response from Act su. */
                                avd_sg_su_oper_list_del(cb, su, false);
                                su->delete_all_susis();
                                su->set_su_switch(AVSV_SI_TOGGLE_STABLE);
                                m_AVD_SET_SG_FSM(cb, (su->sg_of_su), AVD_SG_FSM_STABLE);
                                /*As sg is stable, screen for si dependencies and take action on whole sg*/
                                avd_sidep_update_si_dep_state_for_all_sis(su->sg_of_su);
                                avd_sidep_sg_take_action(su->sg_of_su);
                        } else {
----------------------

If I understand correctly, the active SU assignment on the other node must be 
done before amfd starts sending the standby assignment to the quiesced SU. Can 
you share the steps to reproduce #309?

Thanks,
Minh


---

**[tickets:#1312] AMF: NodeFailover during SiSwap leaves SG UnStable**

**Status:** assigned
**Milestone:** 4.4.2
**Created:** Fri Apr 10, 2015 10:57 AM UTC by Minh Hon Chau
**Last Updated:** Tue Apr 14, 2015 11:20 AM UTC
**Owner:** Minh Hon Chau

* Configuration:

Two SUs (SU1, SU2) in a 2N SG, hosted on the SCs
One sponsor SI (AGENT) and several dependent SIs (MTZ, ACA, CQH, AFD, HDF, NSF, 
SGS, CLH, DBO)
A single componentRestart escalates to nodeFailover

* Steps and analysis

All SIs are assigned ACTIVE to SU1, STANDBY to SU2

1) Swap SI safSi=AFD,safApp=TEST_APP
Apr 10 11:00:49 SC-1 osafamfd[491]: NO safSi=AFD,safApp=TEST_APP Swap initiated

2) Swapping a 2N SI leads to an SU switchover
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigning 'safSi=ACA,safApp=TEST_APP' 
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigned 'safSi=ACA,safApp=TEST_APP' 
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigning 'safSi=AGENT,safApp=TEST_APP' 
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigned 'safSi=AGENT,safApp=TEST_APP' 
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'

3) Assign sponsor SI ACTIVE to SU2
Apr 10 11:00:49 SC-2 osafamfnd[488]: NO Assigning 'safSi=AGENT,safApp=TEST_APP' 
ACTIVE to 'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
(But AGENT on SC-2 has not yet responded to AMFND)

4) The CQH binary is corrupted after its QUIESCED response to AMF; recovery 
escalates to nodeFailover
Apr 10 11:00:50 SC-1 osafamfnd[500]: NO 
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' recovery action 
escalated from 'componentRestart' to 'nodeFailover'
Apr 10 11:00:50 SC-1 osafamfnd[500]: NO 
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' faulted due to 
'avaDown' : Recovery is 'nodeFailover'

5) SC-1 reboots, SC-2 becomes ACTIVE
Apr 10 11:00:50 SC-2 osafamfd[479]: NO FAILOVER StandBy --> Active

6) AMFD-SC2 starts node_failover procedure
Apr 10 11:00:50.731489 osafamfd [479:ndproc.cc:0923] >> avd_node_failover: 
'safAmfNode=SC-1,safAmfCluster=myAmfCluster'
...
Apr 10 11:00:50.737048 osafamfd [479:sg_nored_fsm.cc:0793] >> node_fail: 
safSu=SC-1,safSg=NoRed,safApp=OpenSAF, sg_fsm_state=0
Apr 10 11:00:50.745536 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail: 
'safSu=SC-1,safSg=2N,safApp=OpenSAF', 0
Apr 10 11:00:50.748579 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail: 
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP', 2

7) While node_fail_su_oper runs for TEST_SG_2N (because of the swap), the SG 
state is set to STABLE
Apr 10 11:00:50.748584 osafamfd [479:sg_2n_fsm.cc:2865] >> node_fail_su_oper 
...
Apr 10 11:00:50.749197 osafamfd [479:sg.cc:1635] TR 
safSg=TEST_SG_2N,safApp=TEST_APP sg_fsm_state 2 => 0
...
Apr 10 11:00:50.749217 osafamfd [479:sg_2n_fsm.cc:3099] << node_fail_su_oper 

8) Now on SC-2, AGENT responds to AMFND for the ACTIVE csiSetCallback, and AMFD 
receives this su_si event from AMFND.
But the SG is STABLE, so there is no pending operation for the su_si modify 
(act:5)
Apr 10 11:00:59.280465 osafamfnd [488:susm.cc:0954] NO Assigned 
'safSi=AGENT,safApp=TEST_APP' ACTIVE to 
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:59.280681 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh: 
id:120, node:2020f, act:5, 'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP', 
'safSi=AGENT,safApp=TEST_APP', ha:1, err:1, single:0
...
Apr 10 11:00:59.280737 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success: 
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP' act=5, hastate=1, sg_fsm_state=0
Apr 10 11:00:59.280749 osafamfd [479:sg_2n_fsm.cc:2376] EM sg_2n_fsm.cc:2376: 
safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP (42)
Apr 10 11:00:59.280752 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:00:59.280755 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh 

9) SC-1 comes up, all SIs are assigned STANDBY
Apr 10 11:01:21 SC-1 opensafd: Starting OpenSAF Services (Using TCP)
...
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigning 'safSi=DBO,safApp=TEST_APP' 
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigned 'safSi=DBO,safApp=TEST_APP' 
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigning 'safSi=AGENT,safApp=TEST_APP' 
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigned 'safSi=AGENT,safApp=TEST_APP' 
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'

10) AMFD on SC-2 is informed of SU1's STANDBY assignment
After susi_success(), the SG state is still REALIGN
Apr 10 11:01:24.345208 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh: 
id:115, node:2010f, act:2, 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP', 
'safSi=AGENT,safApp=TEST_APP', ha:2, err:1, single:0
...
Apr 10 11:01:24.345666 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success: 
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2, hastate=2, sg_fsm_state=1
Apr 10 11:01:24.345669 osafamfd [479:sg_2n_fsm.cc:1446] >> 
susi_success_sg_realign: 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2, 
state=2
Apr 10 11:01:24.345672 osafamfd [479:sg_2n_fsm.cc:1865] << 
susi_success_sg_realign: rc:1
Apr 10 11:01:24.345674 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:01:24.345678 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh 

11) Finally, a second swap attempt fails
Apr 10 11:03:23.304988 osafamfd [479:si.cc:0821] >> si_admin_op_cb: 
safSi=AFD,safApp=TEST_APP op=7
Apr 10 11:03:23.304997 osafamfd [479:sg_2n_fsm.cc:0757] >> si_swap: 
'safSi=AFD,safApp=TEST_APP' sg_fsm_state=1
Apr 10 11:03:23.305011 osafamfd [479:sg_2n_fsm.cc:0775] ER 
safSi=AFD,safApp=TEST_APP SWAP failed - SG not stable (1)
Apr 10 11:03:23.305013 osafamfd [479:sg_2n_fsm.cc:0857] << si_swap: 
sg_fsm_state=1
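
* Toy replay of steps 7-11

The state transitions in the trace can be replayed with a small stand-alone 
model. This is only a sketch for following the analysis, not amfd code: ToySg, 
Outcome, Replay and all function names are made up for illustration. The enum 
values are the sg_fsm_state numbers printed in the trace (0 = STABLE, 
1 = REALIGN); reading 2 as the SU-operation state is an assumption based on 
node_fail_su_oper being called. The model records that the SG is in REALIGN 
after step 10 but does not try to explain why it stays there.

```cpp
#include <cassert>

// Toy model only. Names are illustrative and do not exist in amfd.
// Values follow the sg_fsm_state numbers printed in the trace.
enum class SgFsm { Stable = 0, SgRealign = 1, SuOper = 2 };

enum class Outcome { Processed, Unexpected };

struct ToySg {
  // Step 6: the swap is still in progress when node_fail is entered.
  SgFsm state = SgFsm::SuOper;

  // Step 7: node_fail_su_oper moves the SG straight to STABLE
  // (trace: "sg_fsm_state 2 => 0").
  void node_fail_su_oper() { state = SgFsm::Stable; }

  // Step 8: the late ACTIVE response (act:5) arrives. A STABLE SG has no
  // pending operation to match it against, so it is only flagged.
  Outcome active_response() const {
    return state == SgFsm::Stable ? Outcome::Unexpected : Outcome::Processed;
  }

  // Steps 9-10: the trace shows sg_fsm_state=1 when SU1's STANDBY response
  // is handled and afterwards. Recorded as observed; the cause is not modeled.
  void standby_assigned_to_rebooted_su() { state = SgFsm::SgRealign; }

  // Step 11: si_swap is refused unless the SG is STABLE.
  bool si_swap() const { return state == SgFsm::Stable; }
};

struct Replay {
  Outcome late_active;
  SgFsm final_state;
  bool swap_accepted;
};

// Replays steps 7-11 in the order given in the ticket.
Replay replay_ticket_1312() {
  ToySg sg;
  sg.node_fail_su_oper();
  assert(sg.state == SgFsm::Stable);
  Outcome late = sg.active_response();
  sg.standby_assigned_to_rebooted_su();
  return {late, sg.state, sg.si_swap()};
}
```

Calling replay_ticket_1312() gives an unexpected late ACTIVE response, a final 
state of REALIGN, and a rejected swap, which matches steps 8, 10 and 11.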




