- **status**: review --> fixed
- **Comment**:
changeset: 8656:a56101161326
branch: opensaf-5.0.x
parent: 8651:a90faf589254
user: Nagendra Kumar<nagendr...@oracle.com>
date: Tue Mar 07 13:18:45 2017 +0530
summary: amfnd: avoid null pointer access [#2213]
changeset: 8657:a203318fb21e
branch: opensaf-5.1.x
parent: 8652:a7c62f1de1a3
user: Nagendra Kumar<nagendr...@oracle.com>
date: Tue Mar 07 13:19:02 2017 +0530
summary: amfnd: avoid null pointer access [#2213]
changeset: 8658:136a8f432da6
tag: tip
parent: 8655:45be1e612ab6
user: Nagendra Kumar<nagendr...@oracle.com>
date: Tue Mar 07 13:19:16 2017 +0530
summary: amfnd: avoid null pointer access [#2213]
[staging:a56101]
[staging:a20331]
[staging:136a8f]
---
**[tickets:#2213] AMFND: Coredump if suFailover while shutting down**
**Status:** fixed
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Tue Mar 07, 2017 07:21 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**
- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz)
(548.6 kB; application/x-compressed)
amfnd dumped core on PL5, with the backtrace below, while the cluster was
shutting down:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0 __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1 0x0000000000449cc9 in avsv_dblist_sastring_cmp (key1=<optimized out>,
key2=<optimized out>) at util.c:361
i = 0
str1 = <optimized out>
str2 = <optimized out>
#2 0x00007f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0,
key=0x656d6e6769737361 <error: Cannot access memory at address 0x656d6e6769737361>)
at ncsdlib.c:169
start_ptr = 0x1ee3168
#3 0x0000000000416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940
<_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
curr = 0x1ee8060
prv = 0x1ee3150
__FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4 0x000000000040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200,
su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc.cc:3161
rc = <optimized out>
#5 avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>,
comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING,
final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
csi = 0x0
__FUNCTION__ = "avnd_comp_clc_st_chng_prc"
ev = AVND_SU_PRES_FSM_EV_MAX
is_en = <optimized out>
rc = 1
#6 0x000000000040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940 <_avnd_cb>,
comp=comp@entry=0x1ee8200, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
prv_st = <optimized out>
final_st = <optimized out>
rc = 1
__FUNCTION__ = "avnd_comp_clc_fsm_run"
#7 0x000000000040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>,
evt=0x7f92900008c0) at clc.cc:414
__FUNCTION__ = "avnd_evt_clc_resp_evh"
ev = <optimized out>
clc_evt = 0x7f92900008e0
comp = 0x1ee8200
rc = 1
#8 0x000000000042676f in avnd_evt_process (evt=0x7f92900008c0) at main.cc:626
cb = 0x666940 <_avnd_cb>
rc = 1
#9 avnd_main_process () at main.cc:577
ret = <optimized out>
fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1, revents = 0},
{fd = 14, events = 1, revents = 0}, {fd = 0, events = 0, revents = 0}}
evt = 0x7f92900008c0
__FUNCTION__ = "avnd_main_process"
result = <optimized out>
rc = <optimized out>
#10 0x00000000004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
error = 0
1358 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
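A note on frame #2: the `key` value 0x656d6e6769737361 does not look like a
heap address (compare `list_ptr=0x1ee89f0`); every byte of it falls in the
printable ASCII range. The sketch below decodes such a value the way it would
sit in memory on little-endian x86_64. `pointer_as_ascii` is a hypothetical
helper written for this note, not part of OpenSAF. For the value above it
yields "assignme", which may indicate that the memory the key was read from had
already been freed and reused for string data. That is an inference from the
backtrace, not something confirmed elsewhere in the ticket.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical debugging helper (not an OpenSAF function).
// Treats a 64-bit value as eight bytes in little-endian order, as stored in
// memory on x86_64, and returns them as text. Non-printable bytes become '.'.
std::string pointer_as_ascii(std::uint64_t value) {
  std::string out;
  for (int i = 0; i < 8; ++i) {
    // Byte i is the i-th lowest-order byte, i.e. the i-th byte in memory.
    unsigned char byte = static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    out += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
  }
  return out;
}
```

Calling `pointer_as_ascii(0x656d6e6769737361ULL)` gives the eight characters
`assignme`.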
In syslog of PL5:
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request
'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request
'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request
'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1'
from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for
nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20 22:02:13 PL-5 opensafd: Stopping OpenSAF Services
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Shutdown initiated
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Cleanup of
'safComp=A,safSu=3,safSg=1,safApp=npm_1' failed
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Reason:'Script did not exit within
time'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: WA
'safComp=A,safSu=3,safSg=1,safApp=npm_1' Presence State RESTARTING =>
TERMINATION_FAILED
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Removed 'safSi=A2,safApp=npm_1'
from 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigning 'safSi=A1,safApp=npm_1'
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigned 'safSi=A1,safApp=npm_1'
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 A[722]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 B[671]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[665]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[629]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[557]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[593]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 osafckptnd[443]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 A[521]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 osafamfwd[452]: Rebooting OpenSAF NodeId = 0 EE Name =
No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 132367,
SupervisionTime = 60
2016-11-20 22:02:21 PL-5 osafimmnd[394]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafckptnd[443]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafimmnd[394]: exiting for shutdown
2016-11-20 22:02:21 PL-5 opensaf_reboot: Rebooting local node; timeout=60
Observations from syslog:
- Cluster shutdown order: PL3, PL4, PL5, SCs.
- While PL5 was shutting down, a component timed out on the csiRemove callback
and its cleanup script then failed. As a result, the component moved to
TERM_FAILED, but the syslog does not show the SU moving to TERM_FAILED.
- A similar thing happened while PL3 and PL4 were shutting down. While PL5 was
struggling to shut down, the component/SU received a new active assignment
before the SU moved to TERM_FAILED.
Syslog attached.
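
For readers unfamiliar with this class of crash: frame #3 shows
`avnd_comp_cmplete_all_csi_rec` walking a record list with `curr` and `prv`
pointers when the lookup received an invalid key. A general hazard in such
loops is that processing one record can delete records from the list, leaving a
saved pointer dangling. The sketch below illustrates one defensive pattern
under that assumption. It is not the patch from the changesets above, and
`CsiRec` and `complete_all` are hypothetical names; the real amfnd code uses
its own list type.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical stand-in for a CSI record; not the amfnd structure.
struct CsiRec {
  std::string name;
  bool pending;
};

// Calls complete(list, it) on every pending record and returns how many were
// processed. complete() is allowed to erase the record it is given, so the
// iterator to the following element is taken before the call. For std::list,
// erasing one element invalidates only iterators to that element.
// Assumption: complete() never erases any record other than the one passed in.
template <typename F>
int complete_all(std::list<CsiRec>& recs, F complete) {
  int done = 0;
  for (auto it = recs.begin(); it != recs.end();) {
    auto next = std::next(it);  // saved before the record can disappear
    if (it->pending) {
      complete(recs, it);
      ++done;
    }
    it = next;
  }
  return done;
}
```

If the callback could also remove neighbouring records, saving `next` would not
be enough, and the loop would need to restart from the head or re-validate its
position after each call.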