The instability issue in the component failover test case has been reproduced.

Test case 01:
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, spare SU5B. SI 
dependencies: safSi=AmfDemoTwon2 depends on safSi=AmfDemoTwon1, which depends 
on safSi=AmfDemoTwon
2- Bring up the 2N app: SU4 has the active assignment, SU5 has the standby 
assignment
3- Lock PL4, delay the active CSI callback in SU5
4- Stop both SCs
5- Release the active CSI callback
6- Trigger component failover on a component of SU5
7- Restart SC1, SC2
8- SG unstable

Analysis:
- After step 7 (restart of SC1), there are 3 buffered messages at amfnd 
waiting to be sent to amfd:
> 2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
> su_si_assign msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> SI:'', ha_state:'1', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'
> 2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
> oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> su_oper_state:'2', node_oper_state:'1', recovery:'3'
> 2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
> oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> su_oper_state:'1', node_oper_state:'1', recovery:'0'

When amfd receives the 2nd buffered message (su_oper_state=DISABLED), which 
triggers component failover, SU5's readiness state is OUT_OF_SERVICE. 
While the component failover recovery is running, the 3rd buffered message 
(su_oper_state=ENABLED) arrives and sets SU5's readiness state back to 
IN_SERVICE, which causes the component failover sequence to go off its normal 
track.
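The replay race can be sketched as follows. This is a minimal illustrative model, not actual OpenSAF code; the handler and field names are assumptions made for clarity:

```python
# Hypothetical model of the buffered-message replay race: amfd applies each
# replayed oper_state message to SU5's readiness state unconditionally, even
# while a component-failover sequence triggered by an earlier message is
# still in flight.

OUT_OF_SERVICE, IN_SERVICE = "OUT_OF_SERVICE", "IN_SERVICE"

class SU:
    def __init__(self):
        self.readiness = IN_SERVICE
        self.failover_running = False

def on_oper_state(su, enabled):
    """Sketch of the amfd-side handler for a replayed oper_state message."""
    if not enabled:
        su.readiness = OUT_OF_SERVICE
        su.failover_running = True   # DISABLED kicks off component failover
    else:
        su.readiness = IN_SERVICE    # ENABLED arriving mid-recovery -> race

su5 = SU()
on_oper_state(su5, enabled=False)  # 2nd buffered msg: su_oper_state=DISABLED
on_oper_state(su5, enabled=True)   # 3rd buffered msg: su_oper_state=ENABLED
# The failover sequence is still running, yet SU5 is already back IN_SERVICE:
# the inconsistent state that derails the sequence.
print(su5.failover_running, su5.readiness)
```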

The trace below is from the sg fsm function that receives the QUIESCED 
assignment response of SU5 as part of the component failover sequence. Since 
SU5 has become IN_SERVICE (it is expected to be OUT_OF_SERVICE), nothing 
happens next.
> Dec 19 10:54:20.647940 osafamfd [478:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:5, node:2050f, act:5, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> '', ha:3, err:1, single:0
> ...
> Dec 19 10:54:20.648641 osafamfd [478:sg_2n_fsm.cc:1445] >> 
> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' 
> act=5, state=3
> Dec 19 10:54:20.648644 osafamfd [478:sg.cc:1756] TR 
> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in 
> safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 19 10:54:20.648646 osafamfd [478:sg_2n_fsm.cc:0477] >> 
> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 19 10:54:20.648652 osafamfd [478:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 19 10:54:20.648655 osafamfd [478:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 19 10:54:20.648657 osafamfd [478:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 19 10:54:20.648659 osafamfd [478:sg_2n_fsm.cc:0501] TR 
> su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 19 10:54:20.648662 osafamfd [478:sg_2n_fsm.cc:0555] << 
> avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> stdby: '(null)'
> Dec 19 10:54:20.648664 osafamfd [478:sg_2n_fsm.cc:1859] << 
> susi_success_sg_realign: rc:1
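The stall at the end of that trace can be condensed into a small sketch (illustrative only; the function name and return values are assumptions, not the real sg fsm code):

```python
# Why "nothing happens next": the failover step that handles the QUIESCED
# response proceeds to assignment removal only while the SU is still
# OUT_OF_SERVICE, so the premature ENABLED message turns it into a no-op.

OUT_OF_SERVICE, IN_SERVICE = "OUT_OF_SERVICE", "IN_SERVICE"

def on_quiesced_response(su_readiness):
    """Returns the next action the failover sequence would take."""
    if su_readiness == OUT_OF_SERVICE:
        return "remove_assignments"   # normal failover track
    return "no_action"                # SU looks healthy again -> sequence stalls

# Normal recovery: the SU is still disabled when the response arrives.
print(on_quiesced_response(OUT_OF_SERVICE))
# Test case 01: the 3rd buffered message re-enabled SU5 first.
print(on_quiesced_response(IN_SERVICE))
```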


Test case 02: Repeat the same test in non-headless mode without restarting the 
SCs (compf_ok.tgz). The message su_oper_state=ENABLED comes from a clc event 
and is currently not synchronized with the component failover assignment, but 
most of the time it arrives after SU5's assignment has been removed, so no 
issue occurs.
> Dec 19 10:58:46.531900 osafamfd [488:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:78, node:2050f, act:5, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> '', ha:3, err:1, single:0
> Dec 19 10:58:46.561116 osafamfd [488:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:80, node:2050f, act:4, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> '', ha:3, err:1, single:0
> Dec 19 10:58:46.577882 osafamfd [488:sgproc.cc:0656] >> 
> avd_su_oper_state_evh: id:85, node:2050f, 
> 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1


Test case 03: In another successful headless case (compf_ok.tgz), the (3rd) 
su_oper_state=ENABLED message that enables SU5 arrives after SU5 completes the 
removal of its assignment. In this test case, instantiation of the failed 
component takes a bit longer and happens after SC1 is restarted.
> Dec 19 10:59:31.294679 osafamfd [476:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:4, node:2050f, act:5, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> '', ha:3, err:1, single:0
> Dec 19 10:59:31.299693 osafamfd [476:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:5, node:2050f, act:4, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> '', ha:3, err:1, single:0
> Dec 19 10:59:31.316863 osafamfd [476:sgproc.cc:1047] >> avd_su_si_assign_evh: 
> id:9, node:2050f, act:2, 'safSu=SU5B,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> 'safSi=AmfDemoTwon,safApp=AmfDemoTwon', ha:1, err:1, single:0
> Dec 19 10:59:31.316944 osafamfd [476:sg_2n_fsm.cc:2354] >> susi_success: 
> 'safSu=SU5B,safSg=AmfDemoTwon,safApp=AmfDemoTwon' act=2, hastate=1, 
> sg_fsm_state=1
> Dec 19 10:59:31.329765 osafamfd [476:sgproc.cc:0656] >> 
> avd_su_oper_state_evh: id:14, node:2050f, 
> 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1
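The contrast between the three test cases comes down to message ordering, which can be sketched as follows (an illustrative model assuming events are processed strictly in arrival order; event names are made up for the sketch):

```python
# Whether the SG stays on the normal failover track depends only on whether
# su_oper_state=ENABLED arrives before or after SU5's assignments are removed.

def run(events):
    removed = False
    enabled_early = False
    for ev in events:
        if ev == "remove_assignments":
            removed = True
        elif ev == "oper_state_enabled" and not removed:
            enabled_early = True   # re-enables SU5 mid-failover
    return "unstable" if enabled_early else "ok"

# Test case 01: ENABLED is replayed from the buffer before removal completes.
print(run(["oper_state_enabled", "remove_assignments"]))
# Test cases 02 and 03: ENABLED arrives after the assignments are removed.
print(run(["remove_assignments", "oper_state_enabled"]))
```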

Attached: trace of test case 01 in compf_not_ok.tgz; traces of test cases 02 
and 03 in compf_ok.tgz.



---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 04:47 AM UTC
**Owner:** Minh Hon Chau


If a comp/su failover occurs during headless, amfnd will escalate to a node 
reboot. This unexpectedly impacts other comps/SUs that are up and running when 
no node failover escalation is configured for the faulty comp/su.

2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safComp=AmfDemo2,safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' faulted due 
to 'avaDown' : Recovery is 'suFailover'
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO Terminating components of 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon'(abruptly & unordered)
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State INSTANTIATED => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: Rebooting OpenSAF NodeId = 132111 EE 
Name = , Reason: Can't perform recovery while controllers are down. Recovery is 
node failfast., OwnNodeId = 132111, SupervisionTime = 60
2016-06-29 21:30:07 PL-4 opensaf_reboot: Rebooting local node; timeout=60

This ticket will remove the unexpected reboot due to failover during headless, 
which is mentioned as a limitation in the OpenSAF AMF documentation.
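The escalation visible in the log above ("Can't perform recovery while controllers are down. Recovery is node failfast.") can be summarized in a small sketch. This is an illustrative model of the decision this ticket changes, not the actual amfnd code; the function and recovery names are assumptions:

```python
# Current behaviour being fixed: with both SCs down ("headless"), a requested
# comp/su failover cannot be performed, so amfnd escalates it to a node
# failfast, i.e. a reboot of the local node.

def choose_recovery(requested, controllers_up):
    if requested in ("compFailover", "suFailover") and not controllers_up:
        return "nodeFailfast"   # escalation that causes the unexpected reboot
    return requested            # normal case: honour the configured recovery

print(choose_recovery("suFailover", controllers_up=True))
print(choose_recovery("suFailover", controllers_up=False))
```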


