- **status**: review --> fixed
- **Comment**:

changeset:   8656:a56101161326
branch:      opensaf-5.0.x
parent:      8651:a90faf589254
user:        Nagendra Kumar<nagendr...@oracle.com>
date:        Tue Mar 07 13:18:45 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

changeset:   8657:a203318fb21e
branch:      opensaf-5.1.x
parent:      8652:a7c62f1de1a3
user:        Nagendra Kumar<nagendr...@oracle.com>
date:        Tue Mar 07 13:19:02 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

changeset:   8658:136a8f432da6
tag:         tip
parent:      8655:45be1e612ab6
user:        Nagendra Kumar<nagendr...@oracle.com>
date:        Tue Mar 07 13:19:16 2017 +0530
summary:     amfnd: avoid null pointer access [#2213]

[staging:a56101]
[staging:a20331]
[staging:136a8f]




---

**[tickets:#2213] AMFND: Coredump if suFailover while shutting down**

**Status:** fixed
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Tue Mar 07, 2017 07:21 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz) (548.6 kB; application/x-compressed)


An amfnd core dump was seen on PL5 while the cluster was shutting down. Backtrace:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0  __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1  0x0000000000449cc9 in avsv_dblist_sastring_cmp (key1=<optimized out>, key2=<optimized out>) at util.c:361
        i = 0
        str1 = <optimized out>
        str2 = <optimized out>
#2  0x00007f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0, key=0x656d6e6769737361 <error: Cannot access memory at address 0x656d6e6769737361>) at ncsdlib.c:169
        start_ptr = 0x1ee3168
#3  0x0000000000416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940 <_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
        curr = 0x1ee8060
        prv = 0x1ee3150
        __FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4  0x000000000040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200, su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc.cc:3161
        rc = <optimized out>
#5  avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING, final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
        csi = 0x0
        __FUNCTION__ = "avnd_comp_clc_st_chng_prc"
        ev = AVND_SU_PRES_FSM_EV_MAX
        is_en = <optimized out>
        rc = 1
#6  0x000000000040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 <_avnd_cb>, comp=comp@entry=0x1ee8200, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
        prv_st = <optimized out>
        final_st = <optimized out>
        rc = 1
        __FUNCTION__ = "avnd_comp_clc_fsm_run"
#7  0x000000000040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>, evt=0x7f92900008c0) at clc.cc:414
        __FUNCTION__ = "avnd_evt_clc_resp_evh"
        ev = <optimized out>
        clc_evt = 0x7f92900008e0
        comp = 0x1ee8200
        rc = 1
#8  0x000000000042676f in avnd_evt_process (evt=0x7f92900008c0) at main.cc:626
        cb = 0x666940 <_avnd_cb>
        rc = 1
#9  avnd_main_process () at main.cc:577
        ret = <optimized out>
        fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, revents = 0}, {fd = 14, events = 1, revents = 0}, {fd = 0, events = 0, revents = 0}}
        evt = 0x7f92900008c0
        __FUNCTION__ = "avnd_main_process"
        result = <optimized out>
        rc = <optimized out>
#10 0x00000000004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
        error = 0
1358    ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
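
One detail in frame #2 is worth a second look: the `key` value `0x656d6e6769737361` does not resemble the other heap pointers in the trace (`0x1ee....`). Read as eight little-endian bytes it is entirely printable ASCII, which usually suggests the pointer was loaded from memory that had already been overwritten with string data, for example a record that was freed and reused. The helper below is a small standalone sketch for checking such values; `decode_pointer_as_ascii` is a hypothetical name and not part of OpenSAF.

```cpp
#include <cstdint>
#include <string>

// Interpret a 64-bit value as eight little-endian bytes (x86_64 byte order)
// and return them as text; non-printable bytes are shown as '.'.
// Hypothetical debugging helper, not OpenSAF code.
std::string decode_pointer_as_ascii(std::uint64_t value) {
  std::string out;
  for (int i = 0; i < 8; ++i) {
    // Lowest-order byte first, matching how the value sits in memory.
    const unsigned char byte =
        static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
    const bool printable = (byte >= 0x20 && byte < 0x7F);
    out.push_back(printable ? static_cast<char>(byte) : '.');
  }
  return out;
}
```

Calling `decode_pointer_as_ascii(0x656d6e6769737361ULL)` yields `assignme`, so the lookup key was most likely read from a record whose memory no longer held a valid pointer.
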
In syslog of PL5:

2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of 'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to 'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request 'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request 'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request 'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1' from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20 22:02:13 PL-5 opensafd: Stopping OpenSAF Services
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Shutdown initiated
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' (state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1' Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1' Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2' Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Cleanup of 'safComp=A,safSu=3,safSg=1,safApp=npm_1' failed
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Reason:'Script did not exit within time'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: WA 'safComp=A,safSu=3,safSg=1,safApp=npm_1' Presence State RESTARTING => TERMINATION_FAILED
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Removed 'safSi=A2,safApp=npm_1' from 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' (state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigning 'safSi=A1,safApp=npm_1' ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigned 'safSi=A1,safApp=npm_1' ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1' (state 4)
2016-11-20 22:02:21 PL-5 A[722]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 B[671]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 A[665]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 A[629]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 A[557]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 A[593]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 osafckptnd[443]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 A[521]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 osafamfwd[452]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 132367, SupervisionTime = 60
2016-11-20 22:02:21 PL-5 osafimmnd[394]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: AL AMF Node Director is down, terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafckptnd[443]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafimmnd[394]: exiting for shutdown
2016-11-20 22:02:21 PL-5 opensaf_reboot: Rebooting local node; timeout=60
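
The timestamps in this excerpt are consistent with the logged timeout value: 60000000000 ns is 60 s, so a probation timer started at 22:01:21 would expire at 22:02:21, the same second in which the cleanup failure and the crash are logged. A minimal sketch of that arithmetic follows; `time_of_day`, `kProbationTimeout` and `expires_at` are illustrative names, not OpenSAF code.

```cpp
#include <chrono>

// Seconds since midnight for an HH:MM:SS syslog timestamp.
constexpr std::chrono::seconds time_of_day(int h, int m, int s) {
  return std::chrono::hours(h) + std::chrono::minutes(m) +
         std::chrono::seconds(s);
}

// Timeout value as printed in the syslog line ("timeout: 60000000000 ns").
constexpr std::chrono::nanoseconds kProbationTimeout{60000000000LL};

// True if a timer started at `start` with the given timeout fires at `expected`.
constexpr bool expires_at(std::chrono::seconds start,
                          std::chrono::nanoseconds timeout,
                          std::chrono::seconds expected) {
  return start + std::chrono::duration_cast<std::chrono::seconds>(timeout) ==
         expected;
}
```

With these helpers, `expires_at(time_of_day(22, 1, 21), kProbationTimeout, time_of_day(22, 2, 21))` evaluates to true.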

Observations from syslog:
- Cluster shutdown order: PL3, PL4, PL5, SCs
- While PL5 was shutting down, the component timed out on the csiRemove callback and its cleanup script then failed. As a result, the component moved to TERM_FAILED, but the syslog does not show the SU moving to TERM_FAILED.
- A similar thing happened while PL3 and PL4 were shutting down. While PL5 was struggling to shut down, the component/SU received a new active assignment before the SU moved to TERM_FAILED.
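
The sequence above (records being removed while a new assignment is still being processed) matches a common hazard: walking a list while the per-record handler may remove entries from it, leaving the walker with a stale pointer. The sketch below is a generic illustration of one defensive pattern, not the patch in the changesets listed at the top of this ticket. `CsiRec` and `complete_all_pending` are hypothetical names, and `std::list` stands in for the real list type.

```cpp
#include <algorithm>
#include <functional>
#include <list>
#include <string>

// Hypothetical stand-in for a per-component CSI record.
struct CsiRec {
  std::string name;
  bool pending;
};

using CompleteFn =
    std::function<void(std::list<CsiRec>&, const std::string&)>;

// Completes every pending record. `complete` is allowed to erase any records
// from `recs` (as an SU failover might), so no iterator or pointer into the
// list is kept across the call; the scan restarts from the head each time.
int complete_all_pending(std::list<CsiRec>& recs, const CompleteFn& complete) {
  int completed = 0;
  for (;;) {
    auto it = std::find_if(recs.begin(), recs.end(),
                           [](const CsiRec& r) { return r.pending; });
    if (it == recs.end()) break;        // nothing left to complete
    it->pending = false;                // guarantees the loop terminates
    const std::string name = it->name;  // copy the key before the callback
    complete(recs, name);               // may invalidate `it`
    ++completed;
  }
  return completed;
}
```

Restarting the scan costs extra traversal, but it avoids relying on a saved "previous" node that the handler may already have freed.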

Syslog is attached.


