---

**[tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 02, 2016 04:54 AM UTC
**Owner:** nobody
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) 
(548.6 kB; application/x-compressed)


An amfnd coredump was seen on PL-5, with the backtrace below, while the cluster was shutting down:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0  __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1  0x0000000000449cc9 in avsv_dblist_sastring_cmp (key1=<optimized out>, 
key2=<optimized out>) at util.c:361
        i = 0
        str1 = <optimized out>
        str2 = <optimized out>
#2  0x00007f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0, 
key=0x656d6e6769737361 <error: Cannot access memory
 at address 0x656d6e6769737361>) at ncsdlib.c:169
        start_ptr = 0x1ee3168
#3  0x0000000000416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940 
<_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
        curr = 0x1ee8060
        prv = 0x1ee3150
        __FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4  0x000000000040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200, 
su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc
.cc:3161
        rc = <optimized out>
#5  avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>, 
comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=
SA_AMF_PRESENCE_RESTARTING, 
final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
        csi = 0x0
        __FUNCTION__ = "avnd_comp_clc_st_chng_prc"
        ev = AVND_SU_PRES_FSM_EV_MAX
        is_en = <optimized out>
        rc = 1
#6  0x000000000040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 
<_avnd_cb>, comp=comp@entry=0x1ee8200, ev=
AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
        prv_st = <optimized out>
        final_st = <optimized out>
        rc = 1
        __FUNCTION__ = "avnd_comp_clc_fsm_run"
#7  0x000000000040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>, 
evt=0x7f92900008c0) at clc.cc:414
        __FUNCTION__ = "avnd_evt_clc_resp_evh"
        ev = <optimized out>
        clc_evt = 0x7f92900008e0
        comp = 0x1ee8200
        rc = 1
#8  0x000000000042676f in avnd_evt_process (evt=0x7f92900008c0) at main.cc:626
        cb = 0x666940 <_avnd_cb>
        rc = 1
#9  avnd_main_process () at main.cc:577
        ret = <optimized out>
        fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, 
revents = 0}, {fd = 14, events = 1, revents = 
0}, {fd = 0, events = 0, revents = 0}}
        evt = 0x7f92900008c0
        __FUNCTION__ = "avnd_main_process"
        result = <optimized out>
        rc = <optimized out>
#10 0x00000000004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
        error = 0
1358    ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
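
One observation on the backtrace, based on my reading of the core (not confirmed): the key value in frame #2, 0x656d6e6769737361, is not a valid pointer. Read as little-endian bytes it is the ASCII text "assignme", which suggests the CSI record list node being compared points into memory that has already been freed or overwritten with string data. A small standalone snippet (not OpenSAF code) showing the decoding:

~~~
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // Faulting "key" argument from frame #2 of the backtrace.
  const std::uint64_t key = 0x656d6e6769737361ULL;

  // On a little-endian host, the raw bytes of this value are the
  // characters that were sitting where a pointer was expected.
  char text[sizeof(key) + 1] = {};
  std::memcpy(text, &key, sizeof(key));

  std::printf("%s\n", text);  // prints: assignme
  return 0;
}
~~~

Built with g++ and run on x86_64, this prints "assignme", i.e. the first eight characters of some assignment-related string.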
In the syslog of PL-5:

~~~
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' 
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' 
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' 
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 
'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 
'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to 
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' 
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request 
'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request 
'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request 
'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1' 
from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for 
nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20 22:02:13 PL-5 opensafd: Stopping OpenSAF Services
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Shutdown initiated
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' 
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' 
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' 
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' 
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Cleanup of 
'safComp=A,safSu=3,safSg=1,safApp=npm_1' failed
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Reason:'Script did not exit within 
time'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: WA 
'safComp=A,safSu=3,safSg=1,safApp=npm_1' Presence State RESTARTING => 
TERMINATION_FAILED
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Removed 'safSi=A2,safApp=npm_1' 
from 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' 
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigning 'safSi=A1,safApp=npm_1' 
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigned 'safSi=A1,safApp=npm_1' 
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' 
(state 4)
2016-11-20 22:02:21 PL-5 A[722]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 B[671]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 A[665]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 A[629]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 A[557]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 A[593]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 osafckptnd[443]: AL AMF Node Director is down, 
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: AL AMF Node Director is down, 
terminate this process
2016-11-20 22:02:21 PL-5 A[521]: AL AMF Node Director is down, terminate this 
process
2016-11-20 22:02:21 PL-5 osafamfwd[452]: Rebooting OpenSAF NodeId = 0 EE Name = 
No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 132367, 
SupervisionTime = 60
2016-11-20 22:02:21 PL-5 osafimmnd[394]: AL AMF Node Director is down, 
terminate this process
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: AL AMF Node Director is down, 
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafckptnd[443]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafimmnd[394]: exiting for shutdown
2016-11-20 22:02:21 PL-5 opensaf_reboot: Rebooting local node; timeout=60
~~~

Observations from the syslog:
- Cluster shutdown order: PL-3, PL-4, PL-5, then the SCs.
- While PL-5 was shutting down, a component timed out on the csiRemove 
callback and then failed its cleanup script. As a result, the component moved 
to TERMINATION_FAILED, but the SU was never seen moving to TERMINATION_FAILED 
in the syslog.
- A similar pattern occurred while PL-3 and PL-4 were shutting down. While 
PL-5 was struggling to shut down, the component/SU received a new active 
assignment before the SU had moved to TERMINATION_FAILED (a possible guard 
against this is sketched after these notes).

The syslog is attached (see log.tgz above).
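
Not a proposed patch, only a sketch of the kind of guard that appears to be missing, based on the observation above: once a component (or its SU) is in TERMINATION_FAILED during shutdown, a new SI/CSI assignment should arguably be rejected or deferred rather than applied. The enum values below mirror the SA Forum presence states; the function and its name are hypothetical and do not exist in OpenSAF:

~~~
// Hypothetical sketch only (not OpenSAF code). Enum values mirror
// SaAmfPresenceStateT from the AMF specification.
enum class Presence {
  kUninstantiated = 1,
  kInstantiating = 2,
  kInstantiated = 3,
  kTerminating = 4,
  kRestarting = 5,
  kInstantiationFailed = 6,
  kTerminationFailed = 7
};

// Assumed guard: once a component or its SU has entered
// TERMINATION_FAILED (or INSTANTIATION_FAILED), a new assignment is
// refused instead of applied, so the CSI record list is not walked
// while it is being torn down, as in the backtrace above.
inline bool can_accept_new_assignment(Presence comp_state, Presence su_state) {
  auto failed = [](Presence p) {
    return p == Presence::kTerminationFailed ||
           p == Presence::kInstantiationFailed;
  };
  return !failed(comp_state) && !failed(su_state);
}
~~~

Whether such a check belongs in the assignment handling on the AMF node director side, or whether the SU should simply be moved to TERMINATION_FAILED together with the component, is left to the analysis of this ticket.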


