- **summary**: AMF : amfd crashed on both controllers, after opensafd is
stopped on appl hosted payloads --> CLM : amfd crashed on both controllers,
after opensafd is stopped on appl hosted payloads
- **status**: assigned --> duplicate
- Attachments has changed:
Diff:
~~~~
--- old
+++ new
@@ -1 +1,3 @@
1794.tgz (3.8 MB; application/x-compressed-tar)
+osafamfd_repr_on_5.0FC (156.9 kB; application/octet-stream)
+osafamfd_verified_on_5.0_GA (1.6 MB; application/octet-stream)
~~~~
- **Component**: amf --> clm
- **Comment**:
Further analysis shows that the root cause of the problem is related to #1762
and #1738 (CLM issues). The only difference is that this is the switchover
case. The standby AMFD receives old CLM events for PL-3 and PL-4 leaving the
cluster once all the other director components have become active and the AMFD
role switch is about to happen.
It can be reproduced on the 5.0 FC changeset without using any application.
It is not observed on the GA changeset.
Attached are the AMFD traces from the reproduction on 5.0 FC and the successful
verification on 5.0 GA.
Steps performed to reproduce:
1) Bring 4 nodes up with SC-1 active.
2) Perform a switchover so that SC-2 becomes standby.
3) Bring PL-3 up and down.
4) Perform a switchover again. AMFD gets the old event of PL-3 leaving the
cluster, as:
mfSINumCurrActiveAssignments=1, si_updt:32
May 2 17:01:36.209169 osafamfd [485:ckpt_dec.cc:1728] >>
dec_si_su_curr_stby
May 2 17:01:36.209222 osafamfd [485:ckpt_dec.cc:1738] <<
dec_si_su_curr_stby: 'safSi=SC-2N,safApp=OpenSAF',
saAmfSINumCurrStandbyAssignments=0, si_updt:33
May 2 17:01:36.209337 osafamfd [485:ckpt_dec.cc:1000] >>
dec_node_snd_msg_id
May 2 17:01:36.209440 osafamfd [485:ckpt_dec.cc:1018] <<
dec_node_snd_msg_id
May 2 17:01:36.209528 osafamfd [485:mbcsv_act.c:0412] <<
ncs_mbscv_rcv_decode
May 2 17:01:36.324189 osafamfd [485:clma_util.c:0042] <<
clma_validate_version
May 2 17:01:36.324405 osafamfd [485:clma_util.c:0407] >>
clma_hdl_cbk_rec_prc: callback type: 0
May 2 17:01:36.324584 osafamfd [485:clma_util.c:0036] >>
clma_validate_version
May 2 17:01:36.324707 osafamfd [485:clma_util.c:0042] <<
clma_validate_version
May 2 17:01:36.324864 osafamfd [485:clm.cc:0216] >> clm_track_cb: '0' '4'
'1'
May 2 17:01:36.325074 osafamfd [485:clm.cc:0280] TR Node Left:
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855
May 2 17:01:36.325225 osafamfd [485:clm.cc:0185] >>
clm_node_exit_complete: 2030f
May 2 17:01:36.325458 osafamfd [485:ndproc.cc:1145] >> avd_node_failover:
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
May 2 17:01:37.537104 osafamfd [485:mbcsv_api.c:0435] <<
mbcsv_process_dispatch_request: retval: 1
May 2 17:01:37.537896 osafamfd [485:dmsg.cc:0187] TR amfd role change req
stdby -> actv, posting to mail box
May 2 17:01:37.537988 osafamfd [485:main.cc:0757] >> process_event:
evt->rcv_evt 23
May 2 17:01:37.538017 osafamfd [485:role.cc:0078] >> avd_role_change_evh:
cause=0, role=1, current_role=2
May 2 17:01:37.538043 osafamfd [485:role.cc:0176] >>
initialize_for_assignment: ha_state = 1
May 2 17:01:37.538068 osafamfd [485:role.cc:0243] <<
initialize_for_assignment: rc = 1
May 2 17:01:37.538104 osafamfd [485:role.cc:1172] >> amfd_switch_stdby_actv
May 2 17:01:37.538296 osafamfd [485:role.cc:1174] NO Switching StandBy -->
Active State
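A trace like the one above can be checked mechanically for "Node Left" events that reach AMFD before it switches from standby to active. The following is only a minimal sketch: the helper names `unwrap` and `node_left_before_switch` are hypothetical, and the timestamp pattern is assumed from the entries quoted in this mail. It cannot tell by itself whether an event is stale; it only lists the nodes reported as left before the role switch.

```python
import re

# Trace entries start with a timestamp such as "May 2 17:01:36.325074"
# (format assumed from the entries quoted in this mail).
STAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d+ ")
NODE_LEFT = re.compile(r"TR Node Left: rootCauseEntity safNode=([^,]+),")
SWITCH = "Switching StandBy --> Active State"


def unwrap(lines):
    """Rejoin entries that were wrapped across lines, as in this mail."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def node_left_before_switch(lines):
    """Return the nodes reported as left before the standby -> active switch.

    If the trace contains no role switch, every reported node is returned.
    """
    nodes = []
    for entry in unwrap(lines):
        if SWITCH in entry:
            break
        match = NODE_LEFT.search(entry)
        if match:
            nodes.append(match.group(1))
    return nodes


sample = """\
May 2 17:01:36.325074 osafamfd [485:clm.cc:0280] TR Node Left:
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855
May 2 17:01:37.538296 osafamfd [485:role.cc:1174] NO Switching StandBy -->
Active State
""".splitlines()

print(node_left_before_switch(sample))  # ['PL-3']
```
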
Here is the detailed analysis from the logs of the reported issue:
1) Initially SC-1 was brought up as active. When the application is brought up:
SU1: active and SU2: standby.
2) Now a controller switchover is performed and SC-2 becomes active.
3) Now payload PL-3 is stopped and rejoins, which changes the application
assignments to:
SU2-->active.
SU3-->standby.
4) Now a controller switchover is initiated, and while the SC-1 director
components were processing the csi_set callback, PL-4 was brought down, which
changed the application assignments to:
SU3-->active.
SU1-->standby.
Both of these SUs are on PL-3.
5) After this, the switchover of the director components other than AMFD
completes. Since CLMS becomes active on SC-1, it gives callbacks to SC-1 for
PL-3 and PL-4 (old events because of #1762).
Because of this, the standby node deletes all the SUSIs of the application on
PL-3.
Apr 28 14:20:15.350085 osafamfd [10395:clma_util.c:0042] <<
clma_validate_version
Apr 28 14:20:15.350090 osafamfd [10395:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Apr 28 14:20:15.350098 osafamfd [10395:clm.cc:0280] TR Node Left:
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for
node 131855
Apr 28 14:20:15.350103 osafamfd [10395:clm.cc:0185] >> clm_node_exit_complete:
2030f
Apr 28 14:20:15.350108 osafamfd [10395:ndproc.cc:1145] >> avd_node_failover:
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Apr 28 14:20:15.350112 osafamfd [10395:ndfsm.cc:0958] >> avd_node_mark_absent
Apr 28 14:20:15.497204 osafamfd [10395:clma_util.c:0042] <<
clma_validate_version
Apr 28 14:20:15.497211 osafamfd [10395:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Apr 28 14:20:15.497222 osafamfd [10395:clm.cc:0280] TR Node Left:
rootCauseEntity safNode=PL-4,safCluster=myClmCluster for
node 132111
Apr 28 14:20:15.497232 osafamfd [10395:clm.cc:0185] >> clm_node_exit_complete:
2040f
Apr 28 14:20:15.497240 osafamfd [10395:ndproc.cc:1145] >> avd_node_failover:
'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
Apr 28 14:20:15.497247 osafamfd [10395:ndfsm.cc:0958] >> avd_node_mark_absent
Apr 28 14:20:16.746773 osafamfd [10395:role.cc:0243] <<
initialize_for_assignment: rc = 1
This deletion does not happen on the active SC-2 because SC-1, as standby,
cannot checkpoint it.
6) Now the AMFD role changes to active on SC-1:
Apr 28 14:20:16.746777 osafamfd [10395:role.cc:1172] >> amfd_switch_stdby_actv
Apr 28 14:20:16.746801 osafamfd [10395:role.cc:1174] NO Switching StandBy -->
Active State.
7) PL-4 joins the cluster.
SU2 becomes a fresh active on PL-4. Since SC-2 is standby, it will also create
SUSIs for SU2.
But SC-2 still has SU1 and SU3 as standby and active, respectively,
so SC-2 has the following states:
SU1 standby.
SU2 active.
SU3 active.
8) After this, SC-1 went down and SC-2 became active.
So this is the same state from which the steps to reproduce the issue were
performed and the issue was reported.
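The divergence described in steps 4 to 7 can be illustrated with a toy model in which each director keeps its own view of the SU HA states. This is only a sketch under stated assumptions: the dictionaries and the helpers `node_left` and `active_sus` are hypothetical and do not mirror the real AMFD data structures or checkpoint logic. Only the SU placement and the sequence of events are taken from the analysis above.

```python
# Toy model: each director keeps its own view {SU: HA state}.
# Placement is taken from the ticket: SU1 and SU3 on PL-3, SU2 on PL-4.
HOSTED_ON = {"SU1": "PL-3", "SU2": "PL-4", "SU3": "PL-3"}


def node_left(view, node):
    """Return the view without assignments of SUs hosted on the lost node."""
    return {su: ha for su, ha in view.items() if HOSTED_ON[su] != node}


def active_sus(view):
    """Return the sorted list of SUs that the view considers active."""
    return sorted(su for su, ha in view.items() if ha == "active")


# Step 4: both directors agree after PL-4 went down.
sc1 = {"SU3": "active", "SU1": "standby"}
sc2 = dict(sc1)

# Step 5: only SC-1 (AMFD still standby) processes the old CLM events,
# and as standby it cannot checkpoint the deletion to SC-2.
for node in ("PL-3", "PL-4"):
    sc1 = node_left(sc1, node)

# Steps 6-7: SC-1 is now active, sees no assignment and makes SU2 active.
# The new assignment is checkpointed to SC-2, which keeps its old entries.
sc1["SU2"] = "active"
sc2["SU2"] = "active"

print(active_sus(sc1))  # ['SU2']
print(active_sus(sc2))  # ['SU2', 'SU3']
```

In this model SC-2 ends up with two active SUs in a 2N service group, which matches the SU1 standby, SU2 active, SU3 active state listed in step 7.
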
---
** [tickets:#1794] CLM : amfd crashed on both controllers, after opensafd is
stopped on appl hosted payloads **
**Status:** duplicate
**Milestone:** 5.0.RC2
**Created:** Fri Apr 29, 2016 06:48 AM UTC by Srikanth R
**Last Updated:** Mon May 02, 2016 06:48 AM UTC
**Owner:** Praveen
**Attachments:**
-
[1794.tgz](https://sourceforge.net/p/opensaf/tickets/1794/attachment/1794.tgz)
(3.8 MB; application/x-compressed-tar)
-
[osafamfd_repr_on_5.0FC](https://sourceforge.net/p/opensaf/tickets/1794/attachment/osafamfd_repr_on_5.0FC)
(156.9 kB; application/octet-stream)
-
[osafamfd_verified_on_5.0_GA](https://sourceforge.net/p/opensaf/tickets/1794/attachment/osafamfd_verified_on_5.0_GA)
(1.6 MB; application/octet-stream)
Changeset : 7436 5.0.FC
Setup : 5-node cluster with 3 payloads.
Application : 2n red model , 3 SUs with 4 SIs ( si-si dep configured )
PL-3 is hosting SU1 and SU3 and PL-4 is hosting SU2.
Issue : AMFD on both controllers crashed , after opensafd is stopped on
application hosted payloads.
Steps performed :
-> After deploying the application, a lot of AMF-related operations were
performed.
-> After that, the opensafd status was as follows, where SU1 deployed on PL-3
is standby and SU2 deployed on PL-4 is active.
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed6,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=PL-5\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed5,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
-> Now opensafd was stopped on the payloads PL-5 and PL-4, one after another.
-> Amfd on the active controller crashed after opensafd was stopped on PL-4.
Apr 28 16:47:54 CONTROLLER-2 osafamfd[12188]: NO Node 'PL-4' left the cluster
Apr 28 16:47:54 CONTROLLER-2 osafamfd[12188]: sg_2n_fsm.cc:534:
avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
Apr 28 16:47:54 CONTROLLER-2 osafamfnd[12198]: WA AMF director unexpectedly
crashed
Note, this issue is not reproducible just by bringing up the application and
performing the above steps.
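The SISU status listing above can be checked for SIs that are ACTIVE on more than one SU, which a 2N service group should not allow. The following is a minimal sketch: the function name `active_sus_per_si` is hypothetical, and the input format is assumed to be the alternating `safSISU=` / `saAmfSISUHAState=` lines exactly as pasted in this ticket. The sample uses three entries copied from that listing.

```python
import re
from collections import defaultdict

# DN line: the SU name ends at the first escaped comma "\,",
# the SI name follows the unescaped ",safSi=".
SISU = re.compile(r"^safSISU=safSu=([^\\,]+)\\,.*?,safSi=([^,]+),")
STATE = re.compile(r"^saAmfSISUHAState=([A-Z]+)\(\d+\)")


def active_sus_per_si(lines):
    """Map each SI to the SUs that hold it in ACTIVE state."""
    result = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        match = SISU.match(line)
        if match:
            current = (match.group(1), match.group(2))
            continue
        match = STATE.match(line)
        if match and current:
            su, si = current
            if match.group(1) == "ACTIVE":
                result[si].append(su)
            current = None
    return dict(result)


sample = r"""
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
saAmfSISUHAState=ACTIVE(1)
""".splitlines()

doubly_active = {
    si: sus for si, sus in active_sus_per_si(sample).items() if len(sus) > 1
}
print(doubly_active)  # {'TestApp_SI1': ['TestApp_SU3', 'TestApp_SU2']}
```
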