Hi Praveen,
I'm also trying to reproduce #309. When the fault happens during the standby
assignment and escalates to nodeFailover, the SU on the other node has already
completed its active assignment. So in my runs the code inside "if
(all_assignments_done(a_susi->su)) {" always executes:
-----------------
/* the SI relationships to the SU is quiesced assigned and the
 * other SU is being modified to Active. If this
 * SU admin is shutdown change to LOCK. If this SU switch state
 * is true change to false. Remove the SU from operation list.
 * Add that SU to the operation list . Change state to
 * SG_realign state. Free all the SI assignments to this SU.
 */
if (all_assignments_done(a_susi->su)) {
    /* Since Act assignment is completely done, so
       we don't expect any response from Act su. */
    avd_sg_su_oper_list_del(cb, su, false);
    su->delete_all_susis();
    su->set_su_switch(AVSV_SI_TOGGLE_STABLE);
    m_AVD_SET_SG_FSM(cb, (su->sg_of_su), AVD_SG_FSM_STABLE);
    /*As sg is stable, screen for si dependencies
      and take action on whole sg*/
    avd_sidep_update_si_dep_state_for_all_sis(su->sg_of_su);
    avd_sidep_sg_take_action(su->sg_of_su);
} else {
----------------------
If I understand correctly, the active SU assignment on the other node must be
done before amfd starts sending the standby assignment to the quiesced SU. Can
you share the steps to reproduce #309?
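
To make the question concrete, here is a minimal sketch of the decision the
quoted branch makes. The types and on_quiesced_su_node_fail() are hypothetical
stand-ins, not the real amfd classes, and the all_assignments_done() body is
only my assumption of what the check does (no assignment still waiting for a
response). The else outcome follows the comment in the quoted code ("Change
state to SG_realign state").

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for amfd types (not the OpenSAF classes).
enum class SgFsm { Stable, SgRealign };
enum class SusiFsm { Assigned, Modifying };

struct Susi { SusiFsm fsm; };
struct Su { std::vector<Susi> susis; };

// Assumed meaning of the check: true only when no assignment on the SU is
// still waiting for a response from the node director.
bool all_assignments_done(const Su& su) {
  for (const Susi& s : su.susis) {
    if (s.fsm != SusiFsm::Assigned) return false;
  }
  return true;
}

// SG state chosen when the node hosting the quiesced SU fails.
// Stable: the active side is fully assigned, nothing more is expected.
// SgRealign: a response from the active side is still outstanding.
SgFsm on_quiesced_su_node_fail(const Su& active_su) {
  return all_assignments_done(active_su) ? SgFsm::Stable : SgFsm::SgRealign;
}
```

With this reading, an SU that still has one assignment in Modifying takes the
else path, which is the case I have not been able to hit so far.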
Thanks,
Minh
---
** [tickets:#1312] AMF: NodeFailover during SiSwap leaves SG UnStable**
**Status:** assigned
**Milestone:** 4.4.2
**Created:** Fri Apr 10, 2015 10:57 AM UTC by Minh Hon Chau
**Last Updated:** Tue Apr 14, 2015 11:20 AM UTC
**Owner:** Minh Hon Chau
* Configuration:
Two SUs (SU1, SU2) in a 2N SG, hosted on the SCs
1 sponsor SI (AGENT) and some dependent SIs (MTZ, ACA, CQH, AFD, HDF, NSF,
SGS, CLH, DBO)
A single componentRestart escalates to nodeFailover
* Steps and analysis
All SIs are assigned ACTIVE to SU1, STANDBY to SU2
1) Swap SI safSi=AFD,safApp=TEST_APP
Apr 10 11:00:49 SC-1 osafamfd[491]: NO safSi=AFD,safApp=TEST_APP Swap initiated
2) Swap 2N SI will lead to SU switch over
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigning 'safSi=ACA,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigned 'safSi=ACA,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigning 'safSi=AGENT,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd[500]: NO Assigned 'safSi=AGENT,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
3) Assign sponsor SI ACTIVE to SU2
Apr 10 11:00:49 SC-2 osafamfnd[488]: NO Assigning 'safSi=AGENT,safApp=TEST_APP'
ACTIVE to 'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
(AGENT on SC-2 has not yet responded to AMFND)
4) The CQH binary is corrupted after its QUIESCED response to AMF; the fault
escalates to nodeFailover
Apr 10 11:00:50 SC-1 osafamfnd[500]: NO
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' recovery action
escalated from 'componentRestart' to 'nodeFailover'
Apr 10 11:00:50 SC-1 osafamfnd[500]: NO
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' faulted due to
'avaDown' : Recovery is 'nodeFailover'
5) SC-1 reboots, SC-2 becomes ACTIVE
Apr 10 11:00:50 SC-2 osafamfd[479]: NO FAILOVER StandBy --> Active
6) AMFD-SC2 starts node_failover procedure
Apr 10 11:00:50.731489 osafamfd [479:ndproc.cc:0923] >> avd_node_failover:
'safAmfNode=SC-1,safAmfCluster=myAmfCluster'
...
Apr 10 11:00:50.737048 osafamfd [479:sg_nored_fsm.cc:0793] >> node_fail:
safSu=SC-1,safSg=NoRed,safApp=OpenSAF, sg_fsm_state=0
Apr 10 11:00:50.745536 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail:
'safSu=SC-1,safSg=2N,safApp=OpenSAF', 0
Apr 10 11:00:50.748579 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail:
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP', 2
7) While running node_fail_su_oper for TEST_SG_2N (due to the swap), the SG
state is set to STABLE
Apr 10 11:00:50.748584 osafamfd [479:sg_2n_fsm.cc:2865] >> node_fail_su_oper
...
Apr 10 11:00:50.749197 osafamfd [479:sg.cc:1635] TR
safSg=TEST_SG_2N,safApp=TEST_APP sg_fsm_state 2 => 0
...
Apr 10 11:00:50.749217 osafamfd [479:sg_2n_fsm.cc:3099] << node_fail_su_oper
8) On SC-2, AGENT now responds to AMFND for the ACTIVE csiSetCallback, and AMFD
receives the resulting su_si event from AMFND.
But the SG is STABLE, so no operation is pending for the su_si modify (act:5)
Apr 10 11:00:59.280465 osafamfnd [488:susm.cc:0954] NO Assigned
'safSi=AGENT,safApp=TEST_APP' ACTIVE to
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:59.280681 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh:
id:120, node:2020f, act:5, 'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP',
'safSi=AGENT,safApp=TEST_APP', ha:1, err:1, single:0
...
Apr 10 11:00:59.280737 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success:
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP' act=5, hastate=1, sg_fsm_state=0
Apr 10 11:00:59.280749 osafamfd [479:sg_2n_fsm.cc:2376] EM sg_2n_fsm.cc:2376:
safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP (42)
Apr 10 11:00:59.280752 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:00:59.280755 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh
9) SC-1 comes up, all SIs are assigned STANDBY
Apr 10 11:01:21 SC-1 opensafd: Starting OpenSAF Services (Using TCP)
...
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigning 'safSi=DBO,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigned 'safSi=DBO,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigning 'safSi=AGENT,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd[490]: NO Assigned 'safSi=AGENT,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
10) AMFD-SC2 is informed of SU1's STANDBY assignment
After susi_success(), the SG state is still REALIGN
Apr 10 11:01:24.345208 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh:
id:115, node:2010f, act:2, 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP',
'safSi=AGENT,safApp=TEST_APP', ha:2, err:1, single:0
...
Apr 10 11:01:24.345666 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success:
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2, hastate=2, sg_fsm_state=1
Apr 10 11:01:24.345669 osafamfd [479:sg_2n_fsm.cc:1446] >>
susi_success_sg_realign: 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2,
state=2
Apr 10 11:01:24.345672 osafamfd [479:sg_2n_fsm.cc:1865] <<
susi_success_sg_realign: rc:1
Apr 10 11:01:24.345674 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:01:24.345678 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh
11) Finally, a new swap attempt fails
Apr 10 11:03:23.304988 osafamfd [479:si.cc:0821] >> si_admin_op_cb:
safSi=AFD,safApp=TEST_APP op=7
Apr 10 11:03:23.304997 osafamfd [479:sg_2n_fsm.cc:0757] >> si_swap:
'safSi=AFD,safApp=TEST_APP' sg_fsm_state=1
Apr 10 11:03:23.305011 osafamfd [479:sg_2n_fsm.cc:0775] ER
safSi=AFD,safApp=TEST_APP SWAP failed - SG not stable (1)
Apr 10 11:03:23.305013 osafamfd [479:sg_2n_fsm.cc:0857] << si_swap:
sg_fsm_state=1
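
The sequence in steps 2 to 11 can be replayed with a toy state model. This is
only a sketch for discussion, not amfd code: ToySg, its methods and replay()
are hypothetical names. The state numbers follow the traces above (0 is
STABLE, 1 is REALIGN); reading 2 as SU_OPER is an assumption based on the
node_fail_su_oper call. The model also assumes that the SG stays in REALIGN in
step 10 because the ACTIVE assignment dropped in step 8 is still marked as
pending, which the traces do not confirm. node_fail_guarded() is a
hypothetical alternative used for comparison, not a proposed patch.

```cpp
// Toy model of the SG FSM around the swap. All names here are hypothetical.
enum class SgFsm { Stable = 0, SgRealign = 1, SuOper = 2 };

struct ToySg {
  SgFsm state = SgFsm::Stable;
  bool active_pending = false;  // ACTIVE sent to SU2, response not yet seen

  void start_swap() { state = SgFsm::SuOper; }   // steps 1-2
  void send_active() { active_pending = true; }  // step 3

  // Step 7 as observed: SG goes to STABLE although ACTIVE is outstanding.
  void node_fail_observed() { state = SgFsm::Stable; }

  // Hypothetical alternative: stay in REALIGN while a response is awaited.
  void node_fail_guarded() {
    state = active_pending ? SgFsm::SgRealign : SgFsm::Stable;
  }

  // Step 8: the late ACTIVE response. In STABLE nothing handles it, so the
  // pending mark is never cleared (assumption of this model).
  bool active_response() {
    if (state == SgFsm::Stable) return false;
    active_pending = false;
    state = SgFsm::Stable;
    return true;
  }

  void assign_standby() { state = SgFsm::SgRealign; }  // step 9
  void standby_response() {                            // step 10
    if (!active_pending) state = SgFsm::Stable;
  }

  bool swap_allowed() const { return state == SgFsm::Stable; }  // step 11
};

// Replays steps 2-10 and returns the final SG state.
SgFsm replay(bool guarded) {
  ToySg sg;
  sg.start_swap();
  sg.send_active();
  if (guarded) sg.node_fail_guarded(); else sg.node_fail_observed();
  sg.active_response();
  sg.assign_standby();
  sg.standby_response();
  return sg.state;
}
```

In this model replay(false) ends in SgRealign, so a later swap is rejected as
in step 11, while replay(true) ends in Stable.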