----- Original Message -----
From: [email protected]
To: [email protected]
Sent: Thursday, April 16, 2015 9:30:59 AM GMT +05:30 Chennai, Kolkata, Mumbai,
New Delhi
Subject: [opensaf:tickets] #1312 AMF: NodeFailover during SiSwap leaves SG
UnStable
I guess your suggestion will look like below
/* the SI relationships to the SU is quiesced assigned and the * other SU is
being modified to Active. If this * SU admin is shutdown change to LOCK. If
this SU switch state * is true change to false. Remove the SU from operation
list. * Add that SU to the operation list . Change state to * SG_realign state.
Free all the SI assignments to this SU. */ if ( all_assignments_done ( a_susi
-> su )) { // Patch of #708 } else { avd_sg_su_oper_list_del ( cb , su , false
); su -> delete_all_susis (); su_node_ptr = su -> get_node_ptr (); /* the admin
state of the SU is shutdown change it to lock. */ if ( su -> saAmfSUAdminState
== SA_AMF_ADMIN_SHUTTING_DOWN ) { ... } else if ( su_node_ptr ->
saAmfNodeAdminState == SA_AMF_ADMIN_SHUTTING_DOWN ) { ... } else if ( su ->
su_switch == AVSV_SI_TOGGLE_SWITCH ) { /* During si-swap while standby
assignment is going on, if Nodefailover or SU failover got escalated then
toggle SU switch state and make SG stable. After SG becomes stable, spare SU
will be instantiated, if available, or same SU will get standby assignment
after repair. */ if ( all_assignments_done ( a_susi -> su )) { //#309 su ->
set_su_switch ( AVSV_SI_TOGGLE_STABLE ); m_AVD_SET_SG_FSM ( cb , ( su ->
sg_of_su ), AVD_SG_FSM_STABLE ); complete_siswap ( a_susi -> su , SA_AIS_OK );
} else if ( a_susi -> su -> any_susi_fsm_in_modify () == true ){ /* Do nothing
let the response come from AMFND for a_susi->su. When such a response will be
received SG FSM will take care of it as it is still in su oper state */ LOG_NO
( "SG remain in unstable state" ); } else { avd_sg_su_oper_list_add ( cb ,
a_susi -> su , false ); m_AVD_SET_SG_FSM ( cb , ( su -> sg_of_su ),
AVD_SG_FSM_SG_REALIGN ); } }
The patch for #309 will never be run because "all_assignment_done(a_susi->su)"
always returns false, since we are at "else" case of patch #708
changeset: 5560:0e6771c09786
user: Nagendra Kumar [email protected]
date: Tue Aug 12 20:32:48 2014 +0530
summary: amfd: start clm track and mark sg stable after node failover [#708]
changeset: 5557:b1702cf32d6e
user: Praveen [email protected]
date: Tue Aug 12 20:17:16 2014 +0530
summary: amfd : fix node failover while assigning standby state during si-swap
[#309]
I think both #309 and #708 separately can fix the issue of "NodeFailover
happens on node receiving standby assignment during si-swap", but when those
patches are merged into branch, a part of patch #309 for this particular issue
will never be run since the fix of #708 has been run first.
I think we can just back out the part of #309 related to this issue and add
patch for #1312. Does it sound ok to you?
[Praveen] Below patch looks ok to me. Please float it officially.
Only one minor comment below.
diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
--- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
+++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
@@ -2967,14 +2967,10 @@ void SG_2N::node_fail_su_oper(AVD_SU su
avd_sg_su_oper_list_add(cb, a_susi->su, false);
m_AVD_SET_SG_FSM(cb, (su->sg_of_su), AVD_SG_FSM_SG_REALIGN);
} else if (su->su_switch == AVSV_SI_TOGGLE_SWITCH) {
- / During si-swap while standby assignment is going on, if Nodefailover
- or SU failover got escalated then toggle SU switch state and make SG
- stable. After SG becomes stable, spare SU will be instantiated,
- if available, or same SU will get standby assignment after repair.
- */
[Praveen]Above comment can be moved in the if part (#708/#309 part), so that in
future anybody can know the context.
- su->set_su_switch(AVSV_SI_TOGGLE_STABLE);
- m_AVD_SET_SG_FSM(cb, (su->sg_of_su), AVD_SG_FSM_STABLE);
- complete_siswap(a_susi->su, SA_AIS_OK);
+ // Patch for #1312
+ // if (a_susi->su->any_susi_fsm_in_modify() == true){
+ // ....
+ // }
} else {
avd_sg_su_oper_list_add(cb, a_susi->su, false);
m_AVD_SET_SG_FSM(cb, (su->sg_of_su), AVD_SG_FSM_SG_REALIGN);
Thanks,
Minh
[tickets:#1312] AMF: NodeFailover during SiSwap leaves SG UnStable
Status: assigned
Milestone: 4.4.2
Created: Fri Apr 10, 2015 10:57 AM UTC by Minh Hon Chau
Last Updated: Tue Apr 14, 2015 11:20 AM UTC
Owner: Minh Hon Chau
• Configuration:
2 2N SU1, SU2 hosted in SCs
1 sponsored SI (AGENT) and some dependent SIs (MTZ, ACA, CQH, AFD, HDF, NSF,
SGS, CLH, DBO)
Only one componentRestart will escalate to nodeFailover
• Steps and analysis
All SIs are assigned ACTIVE to SU1, STANDBY to SU2
1) Swap SI safSi=AFD,safApp=TEST_APP
Apr 10 11:00:49 SC-1 osafamfd [491] : NO safSi=AFD,safApp=TEST_APP Swap
initiated
2) Swap 2N SI will lead to SU switch over
Apr 10 11:00:49 SC-1 osafamfnd [500] : NO Assigning 'safSi=ACA,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd [500] : NO Assigned 'safSi=ACA,safApp=TEST_APP'
QUIESCED to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:00:49 SC-1 osafamfnd [500] : NO Assigning
'safSi=AGENT,safApp=TEST_APP' QUIESCED to
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:49 SC-1 osafamfnd [500] : NO Assigned
'safSi=AGENT,safApp=TEST_APP' QUIESCED to
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
3) Assign sponsor SI ACTIVE to SU2
Apr 10 11:00:49 SC-2 osafamfnd [488] : NO Assigning
'safSi=AGENT,safApp=TEST_APP' ACTIVE to
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
(But AGENT in SC-2 has not responded to AMFND)
4) Binary of CQH is corrupted after QUIESCED response to AMF , escalate to
nodeFailover
Apr 10 11:00:50 SC-1 osafamfnd [500] : NO
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' recovery action
escalated from 'componentRestart' to 'nodeFailover'
Apr 10 11:00:50 SC-1 osafamfnd [500] : NO
'safComp=CQH,safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' faulted due to
'avaDown' : Recovery is 'nodeFailover'
5) SC-1 is going reboot, SC-2 becomes ACTIVE
Apr 10 11:00:50 SC-2 osafamfd [479] : NO FAILOVER StandBy --> Active
6) AMFD-SC2 starts node_failover procedure
Apr 10 11:00:50.731489 osafamfd [479:ndproc.cc:0923] >> avd_node_failover:
'safAmfNode=SC-1,safAmfCluster=myAmfCluster'
...
Apr 10 11:00:50.737048 osafamfd [479:sg_nored_fsm.cc:0793] >> node_fail:
safSu=SC-1,safSg=NoRed,safApp=OpenSAF, sg_fsm_state=0
Apr 10 11:00:50.745536 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail:
'safSu=SC-1,safSg=2N,safApp=OpenSAF', 0
Apr 10 11:00:50.748579 osafamfd [479:sg_2n_fsm.cc:3262] >> node_fail:
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP', 2
7) During running node_fail_su_oper for TEST_SG_2N (due to swap), SG state set
to STABLE
Apr 10 11:00:50.748584 osafamfd [479:sg_2n_fsm.cc:2865] >> node_fail_su_oper
...
Apr 10 11:00:50.749197 osafamfd [479:sg.cc:1635] TR
safSg=TEST_SG_2N,safApp=TEST_APP sg_fsm_state 2 => 0
...
Apr 10 11:00:50.749217 osafamfd [479:sg_2n_fsm.cc:3099] << node_fail_su_oper
8) Now in SC-2, AGENT responded to AMFND for ACTIVE csiSetCallback, AMFD
receives this su_si event from AMFND.
But SG is STABLE, and no operation for su_si modify (act:5)
Apr 10 11:00:59.280465 osafamfnd [488:susm.cc:0954] NO Assigned
'safSi=AGENT,safApp=TEST_APP' ACTIVE to
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:00:59.280681 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh:
id:120, node:2020f, act:5, 'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP',
'safSi=AGENT,safApp=TEST_APP', ha:1, err:1, single:0
...
Apr 10 11:00:59.280737 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success:
'safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP' act=5, hastate=1, sg_fsm_state=0
Apr 10 11:00:59.280749 osafamfd [479:sg_2n_fsm.cc:2376] EM sg_2n_fsm.cc:2376:
safSu=SU2,safSg=TEST_SG_2N,safApp=TEST_APP (42)
Apr 10 11:00:59.280752 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:00:59.280755 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh
9) SC-1 comes up, all SIs are assigned STANDBY
Apr 10 11:01:21 SC-1 opensafd: Starting OpenSAF Services (Using TCP)
...
Apr 10 11:01:24 SC-1 osafamfnd [490] : NO Assigning 'safSi=DBO,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd [490] : NO Assigned 'safSi=DBO,safApp=TEST_APP'
STANDBY to 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
...
Apr 10 11:01:24 SC-1 osafamfnd [490] : NO Assigning
'safSi=AGENT,safApp=TEST_APP' STANDBY to
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
Apr 10 11:01:24 SC-1 osafamfnd [490] : NO Assigned
'safSi=AGENT,safApp=TEST_APP' STANDBY to
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP'
10) AMFD-SC2 is informed the SU1's STANDBY assignment
After susi_success(), SG state is still REALIGN
Apr 10 11:01:24.345208 osafamfd [479:sgproc.cc:0889] >> avd_su_si_assign_evh:
id:115, node:2010f, act:2, 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP',
'safSi=AGENT,safApp=TEST_APP', ha:2, err:1, single:0
...
Apr 10 11:01:24.345666 osafamfd [479:sg_2n_fsm.cc:2361] >> susi_success:
'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2, hastate=2, sg_fsm_state=1
Apr 10 11:01:24.345669 osafamfd [479:sg_2n_fsm.cc:1446] >>
susi_success_sg_realign: 'safSu=SU1,safSg=TEST_SG_2N,safApp=TEST_APP' act=2,
state=2
Apr 10 11:01:24.345672 osafamfd [479:sg_2n_fsm.cc:1865] <<
susi_success_sg_realign: rc:1
Apr 10 11:01:24.345674 osafamfd [479:sg_2n_fsm.cc:2562] << susi_success: rc:1
Apr 10 11:01:24.345678 osafamfd [479:sgproc.cc:1405] << avd_su_si_assign_evh
11) Finally, failed to swap again
Apr 10 11:03:23.304988 osafamfd [479:si.cc:0821] >> si_admin_op_cb:
safSi=AFD,safApp=TEST_APP op=7
Apr 10 11:03:23.304997 osafamfd [479:sg_2n_fsm.cc:0757] >> si_swap:
'safSi=AFD,safApp=TEST_APP' sg_fsm_state=1
Apr 10 11:03:23.305011 osafamfd [479:sg_2n_fsm.cc:0775] ER
safSi=AFD,safApp=TEST_APP SWAP failed - SG not stable (1)
Apr 10 11:03:23.305013 osafamfd [479:sg_2n_fsm.cc:0857] << si_swap:
sg_fsm_state=1
Sent from sourceforge.net because you indicated interest in
https://sourceforge.net/p/opensaf/tickets/1312/
To unsubscribe from further messages, please visit
https://sourceforge.net/auth/subscriptions/------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets