Hi Praveen,
In ticket #1902, a component failover problem during headless was found; see:
https://sourceforge.net/p/opensaf/tickets/1902/#8990
Relevant logs:
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered
su_si_assign msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
SI:'', ha_state:'1', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered
oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
su_oper_state:'2', node_oper_state:'1', recovery:'3'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered
oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
su_oper_state:'1', node_oper_state:'1', recovery:'0'
After headless, amfnd sends two su_oper_msg messages to amfd. The first one
(recovery:'3') triggers the component failover sequence, so amfd sends a
QUIESCED su_si assignment to amfnd, and amfnd should then respond to this
QUIESCED su_si assignment. The problem is that by the time amfd processes
that response, the SU's readiness state at amfd has already been changed to
IN_SERVICE by the second su_oper_msg (recovery:'0').
SG_2N::susi_success_sg_realign does not expect an IN_SERVICE SU in this
situation.
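To make the ordering problem concrete, here is a minimal sketch of the race. It is a toy model, not OpenSAF code: `SuModel`, `on_su_oper_state`, `realign_can_proceed` and `replay` are hypothetical names, and the event strings are invented for illustration. The only behaviour taken from the description above is that an ENABLED oper_state message puts the SU back IN_SERVICE and that the QUIESCED response handling expects an OUT_OF_SERVICE SU.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model; none of these names are OpenSAF types.
enum class Readiness { InService, OutOfService };

struct SuModel {
  Readiness readiness = Readiness::InService;
};

// Effect of an oper_state message on the SU's readiness, as seen by amfd.
// enabled == false models su_oper_state:'2' (recovery:'3');
// enabled == true  models su_oper_state:'1' (recovery:'0').
void on_su_oper_state(SuModel& su, bool enabled) {
  su.readiness = enabled ? Readiness::InService : Readiness::OutOfService;
}

// Stand-in for the expectation made when the QUIESCED response is handled
// in REALIGN state: the SU must still be OUT_OF_SERVICE.
bool realign_can_proceed(const SuModel& su) {
  return su.readiness == Readiness::OutOfService;
}

// Replays a sequence of events and reports whether the QUIESCED response
// found the SU in the expected state.
bool replay(const std::vector<std::string>& events) {
  SuModel su;
  bool progressed = false;
  for (const std::string& e : events) {
    if (e == "oper_disabled") {
      on_su_oper_state(su, false);
    } else if (e == "oper_enabled") {
      on_su_oper_state(su, true);
    } else if (e == "quiesced_resp") {
      progressed = realign_can_proceed(su);
    }
  }
  return progressed;
}
```

With this model, `replay({"oper_disabled", "quiesced_resp", "oper_enabled"})` reports progress, while `replay({"oper_disabled", "oper_enabled", "quiesced_resp"})` does not, which corresponds to the ordering seen in the logs.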
The same problem can be seen in a non-headless scenario, which is the subject
of this ticket, #2233. If amfd receives su_oper_msg (recovery:'3') and is then
busy for a while with RT updates, the faulty component has enough time to
instantiate and amfnd sends su_oper_msg (recovery:'0') early. We then see the
same problem in SG_2N::susi_success_sg_realign as in #1902.
I think that, basically, the entire component failover sequence between amfd
and amfnd is not designed to accommodate su_oper_msg (recovery:'0'). This
message can arrive at any unexpected point in the sequence.
I have attached a patch for this. It is modeled on the su failover scenario,
in which su_oper_msg (recovery:'0') arrives only after the su failover
sequence (in the su repair phase). I'm still testing, but any comments are
welcome.
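As a rough illustration of the idea only (this is not the content of 2233_V2.diff, and where the real patch holds the message back is not shown in this mail), the sketch below defers the effect of an ENABLED oper_state message until the failover sequence has finished. `SuFailoverModel` and all of its members are hypothetical names.

```cpp
#include <cassert>

// Hypothetical model of "apply su_oper_msg(recovery:'0') only after the
// failover sequence"; not taken from the OpenSAF sources or the patch.
struct SuFailoverModel {
  bool out_of_service = false;
  bool failover_in_progress = false;
  bool enabled_msg_deferred = false;

  // su_oper_msg with recovery:'3': SU goes OUT_OF_SERVICE, failover starts.
  void on_oper_disabled_with_failover() {
    out_of_service = true;
    failover_in_progress = true;
  }

  // su_oper_msg with recovery:'0': remember it if failover is still running,
  // otherwise apply it immediately.
  void on_oper_enabled() {
    if (failover_in_progress) {
      enabled_msg_deferred = true;
      return;
    }
    out_of_service = false;
  }

  // The QUIESCED response is only handled correctly while the SU is still
  // OUT_OF_SERVICE; returns whether that expectation holds.
  bool on_quiesced_response() const { return out_of_service; }

  // End of the failover sequence: apply a deferred ENABLED message, if any.
  void on_failover_complete() {
    failover_in_progress = false;
    if (enabled_msg_deferred) {
      enabled_msg_deferred = false;
      out_of_service = false;
    }
  }
};
```

In this model an early ENABLED message no longer disturbs the QUIESCED response handling: the SU stays OUT_OF_SERVICE until `on_failover_complete()` runs, and only then returns to service.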
Thanks,
Minh
Attachments:
- [2233_V2.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/cdb8/attachment/2233_V2.diff) (3.4 kB; text/x-patch)
---
**[tickets:#2233] AMF: SG is unstable after component failover recovery**
**Status:** accepted
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 23, 2016 02:10 AM UTC
**Owner:** Minh Hon Chau
This issue occurs during component failover recovery in the context of locking
a node.
**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. SI
dependencies: safSi=AmfDemoTwon2 depends on safSi=AmfDemoTwon1, which depends
on safSi=AmfDemoTwon
2- Bring up the 2N app; SU4 has the active assignment, SU5 has the standby
assignment
3- Lock PL4
4- Set a delay of a few seconds in the csi remove callback in the component of
SU4
5- Set a delay of a few seconds in the quiesced csi set callback in the
component of SU5
6- When SU5 finishes its active assignment, SU4 receives the assignment
removal from amfd. In the meantime, a component failover report is triggered
by the component of SU5.
7- Now SU5 receives the quiesced csi set callback from amfd
8- Release both callbacks from steps 4 and 5
**Observation:**
The SG is unstable; it is not possible to repair the failed SU (SU5) or to
lock/unlock any entities.
When amfd processes the quiesced assignment response in REALIGN state, it
takes no action:
> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >>
> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR
> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in
> safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >>
> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon',
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon',
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon',
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR
> su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] <<
> avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
> stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] <<
> susi_success_sg_realign: rc:1
In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but it is
currently IN_SERVICE.
SU5 is first reported as OUT_OF_SERVICE by the su_oper_state[DISABLED] message
sent as part of the component failover report:
Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh:
id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2
The failed component is then instantiated again, which generates another
message, su_oper_state[ENABLED], that sets SU5 back to IN_SERVICE:
Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh:
id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1
SU5 should be OUT_OF_SERVICE while amfd orchestrates component failover
recovery, which begins with the QUIESCED assignment of SU5. If re-instantiation
of the failed component completes sooner, as in this test, the SG FSM ends up
in an unexpected sequence.