- **summary**: AMF : amfd crashed on both controllers, after opensafd is 
stopped on appl hosted  payloads  --> CLM : amfd crashed on both controllers, 
after opensafd is stopped on appl hosted  payloads 
- **status**: assigned --> duplicate
- Attachments has changed:

Diff:

~~~~

--- old
+++ new
@@ -1 +1,3 @@
 1794.tgz (3.8 MB; application/x-compressed-tar)
+osafamfd_repr_on_5.0FC (156.9 kB; application/octet-stream)
+osafamfd_verified_on_5.0_GA (1.6 MB; application/octet-stream)

~~~~

- **Component**: amf --> clm
- **Comment**:

After further analysis, the root cause of the problem was found to be related to #1762 and
#1738 (CLM issues). The only difference is that this is a switchover case: the standby
AMFD receives old CLM events for PL-3 and PL-4 leaving the cluster when all the other
director components have already become active and the AMFD role switch is about to happen.
It can be reproduced on the 5.0 FC changeset without using any application.
It is not observed on the GA changeset.

Attached are the AMFD traces from the reproduction on 5.0 FC and from the successful
verification on 5.0 GA.

Steps performed to reproduce:
   1) Bring 4 nodes up with SC-1 active.
   2) Perform a switchover so that SC-2 becomes standby.
   3) Bring PL-3 up and down.
   4) Perform a switchover again. AMFD gets the old event of PL-3 leaving the
cluster, as:
       
    mfSINumCurrActiveAssignments=1, si_updt:32
    May  2 17:01:36.209169 osafamfd [485:ckpt_dec.cc:1728] >> 
dec_si_su_curr_stby
    May  2 17:01:36.209222 osafamfd [485:ckpt_dec.cc:1738] << 
dec_si_su_curr_stby: 'safSi=SC-2N,safApp=OpenSAF',     
saAmfSINumCurrStandbyAssignments=0, si_updt:33
    May  2 17:01:36.209337 osafamfd [485:ckpt_dec.cc:1000] >> 
dec_node_snd_msg_id
    May  2 17:01:36.209440 osafamfd [485:ckpt_dec.cc:1018] << 
dec_node_snd_msg_id
    May  2 17:01:36.209528 osafamfd [485:mbcsv_act.c:0412] << 
ncs_mbscv_rcv_decode
    May  2 17:01:36.324189 osafamfd [485:clma_util.c:0042] << 
clma_validate_version
    May  2 17:01:36.324405 osafamfd [485:clma_util.c:0407] >> 
clma_hdl_cbk_rec_prc: callback type: 0
    May  2 17:01:36.324584 osafamfd [485:clma_util.c:0036] >> 
clma_validate_version
    May  2 17:01:36.324707 osafamfd [485:clma_util.c:0042] << 
clma_validate_version
    May  2 17:01:36.324864 osafamfd [485:clm.cc:0216] >> clm_track_cb: '0' '4' 
'1'
    May  2 17:01:36.325074 osafamfd [485:clm.cc:0280] TR  Node Left: 
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for     node 131855
    May  2 17:01:36.325225 osafamfd [485:clm.cc:0185] >> 
clm_node_exit_complete: 2030f
    May  2 17:01:36.325458 osafamfd [485:ndproc.cc:1145] >> avd_node_failover: 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'

    May  2 17:01:37.537104 osafamfd [485:mbcsv_api.c:0435] << 
mbcsv_process_dispatch_request: retval: 1
    May  2 17:01:37.537896 osafamfd [485:dmsg.cc:0187] TR amfd role change req 
stdby -> actv, posting to mail box
    May  2 17:01:37.537988 osafamfd [485:main.cc:0757] >> process_event: 
evt->rcv_evt 23
    May  2 17:01:37.538017 osafamfd [485:role.cc:0078] >> avd_role_change_evh: 
cause=0, role=1, current_role=2
    May  2 17:01:37.538043 osafamfd [485:role.cc:0176] >> 
initialize_for_assignment: ha_state = 1
    May  2 17:01:37.538068 osafamfd [485:role.cc:0243] << 
initialize_for_assignment: rc = 1
    May  2 17:01:37.538104 osafamfd [485:role.cc:1172] >> amfd_switch_stdby_actv
    May  2 17:01:37.538296 osafamfd [485:role.cc:1174] NO Switching StandBy --> 
Active State

Here is the detailed analysis, from the logs, of the reported issue:
1) Initially SC-1 was brought up as active. When the application is brought up:
    SU1: active and SU2: standby.
2) Now a controller switchover is performed and SC-2 becomes active.
3) Now payload PL-3 is stopped and rejoined, which changed the application
assignments as:
   SU2 --> active.
   SU3 --> standby.
4) Now a controller switchover is initiated, and while the SC-1 director
components were processing the csi_set callback, PL-4 was brought down, which
changed the application assignments as:
   SU3 --> active.
   SU1 --> standby.
   Both of these SUs are on PL-3.

5) After this, the switchover of the director components other than AMFD
completes. Since CLMS becomes active on SC-1, it gives callbacks to SC-1 for
PL-3 and PL-4 (old events because of #1762). Because of this, the standby node
deletes all the SUSIs of the application on PL-3.

Apr 28 14:20:15.350085 osafamfd [10395:clma_util.c:0042] << 
clma_validate_version
Apr 28 14:20:15.350090 osafamfd [10395:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Apr 28 14:20:15.350098 osafamfd [10395:clm.cc:0280] TR  Node Left: 
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for 
node 131855
Apr 28 14:20:15.350103 osafamfd [10395:clm.cc:0185] >> clm_node_exit_complete: 
2030f
Apr 28 14:20:15.350108 osafamfd [10395:ndproc.cc:1145] >> avd_node_failover: 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Apr 28 14:20:15.350112 osafamfd [10395:ndfsm.cc:0958] >> avd_node_mark_absent
Apr 28 14:20:15.497204 osafamfd [10395:clma_util.c:0042] << 
clma_validate_version
Apr 28 14:20:15.497211 osafamfd [10395:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Apr 28 14:20:15.497222 osafamfd [10395:clm.cc:0280] TR  Node Left: 
rootCauseEntity safNode=PL-4,safCluster=myClmCluster for 
node 132111
Apr 28 14:20:15.497232 osafamfd [10395:clm.cc:0185] >> clm_node_exit_complete: 
2040f
Apr 28 14:20:15.497240 osafamfd [10395:ndproc.cc:1145] >> avd_node_failover: 
'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
Apr 28 14:20:15.497247 osafamfd [10395:ndfsm.cc:0958] >> avd_node_mark_absent
Apr 28 14:20:16.746773 osafamfd [10395:role.cc:0243] << 
initialize_for_assignment: rc = 1

This deletion will not happen on the active SC-2, because SC-1, as standby,
cannot checkpoint it.

6) Now the AMFD role changes to active on SC-1:
Apr 28 14:20:16.746777 osafamfd [10395:role.cc:1172] >> amfd_switch_stdby_actv
Apr 28 14:20:16.746801 osafamfd [10395:role.cc:1174] NO Switching StandBy --> 
Active State.

7) PL-4 joins the cluster.
  SU2 gets a fresh active assignment on PL-4. Since SC-2 is standby, it will
also create SUSIs for SU2.
  But SC-2 still has SU1 as standby and SU3 as active.
  So SC-2 has the following states:
   SU1 standby.
   SU2 active.
   SU3 active.
8) After this, SC-1 went down and SC-2 became active.
  So this is the same state from which the steps to reproduce were performed
and from which this issue was reported.

---

** [tickets:#1794] CLM : amfd crashed on both controllers, after opensafd is 
stopped on appl hosted  payloads **

**Status:** duplicate
**Milestone:** 5.0.RC2
**Created:** Fri Apr 29, 2016 06:48 AM UTC by Srikanth R
**Last Updated:** Mon May 02, 2016 06:48 AM UTC
**Owner:** Praveen
**Attachments:**

- 
[1794.tgz](https://sourceforge.net/p/opensaf/tickets/1794/attachment/1794.tgz) 
(3.8 MB; application/x-compressed-tar)
- 
[osafamfd_repr_on_5.0FC](https://sourceforge.net/p/opensaf/tickets/1794/attachment/osafamfd_repr_on_5.0FC)
 (156.9 kB; application/octet-stream)
- 
[osafamfd_verified_on_5.0_GA](https://sourceforge.net/p/opensaf/tickets/1794/attachment/osafamfd_verified_on_5.0_GA)
 (1.6 MB; application/octet-stream)


Changeset: 7436, 5.0 FC
Setup: 5-node cluster with 3 payloads.
Application: 2N redundancy model, 3 SUs with 4 SIs (SI-SI dependency configured).
PL-3 is hosting SU1 and SU3, and PL-4 is hosting SU2.

Issue: AMFD on both controllers crashed after opensafd was stopped on the
application-hosting payloads.

Steps performed :

-> After deploying the application, a lot of AMF-related operations were
performed.

-> After that, the following is the opensafd status, where SU1, deployed on
PL-3, is standby and SU2, deployed on PL-4, is active.

safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed6,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU1\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=PL-5\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed5,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU3\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI1,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI2,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI3,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=TestApp_SU2\,safSg=TestApp_SG1\,safApp=TestApp_TwoN,safSi=TestApp_SI4,safApp=TestApp_TwoN
        saAmfSISUHAState=ACTIVE(1)


-> Now opensafd was stopped on the payloads PL-5 and PL-4, one after another.

-> AMFD on the active controller crashed after opensafd was stopped on PL-4.

Apr 28 16:47:54 CONTROLLER-2 osafamfd[12188]: NO Node 'PL-4' left the cluster
Apr 28 16:47:54 CONTROLLER-2 osafamfd[12188]: sg_2n_fsm.cc:534: 
avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
Apr 28 16:47:54 CONTROLLER-2 osafamfnd[12198]: WA AMF director unexpectedly 
crashed

Note: this issue is not reproducible just by bringing up the application and
performing the above steps.

