Hi All,
I am new to OpenSAF and need your help.
My OpenSAF setup is described below.
I am using OpenSAF version 4.4.2, and this is the output of the status command:
atcafs-n10s2:~# /etc/init.d/opensafd status
safSISU=safSu=n10s2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed10,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=n10s2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SU-n10s2\,safSg=HenbGw-SG\,safApp=HenbGwApp,safSi=HenbGw,safApp=HenbGwApp
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=n10s1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=n10s1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=SU-n10s1\,safSg=HenbGw-SG\,safApp=HenbGwApp,safSi=HenbGw,safApp=HenbGwApp
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=SU-n10s5\,safSg=HenbGw-SG\,safApp=HenbGwApp_PL_n10s5,safSi=HenbGw,safApp=HenbGwApp_PL_n10s5
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SU-n10s4\,safSg=HenbGw-SG\,safApp=HenbGwApp_PL_n10s4,safSi=HenbGw,safApp=HenbGwApp_PL_n10s4
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=n10s5\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=n10s4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
atcafs-n10s2:~#
Here n10s1 and n10s2 are my controllers, and n10s4 and n10s5 are payloads.
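To read the status output at a glance, each pair of lines can be condensed to one line per assignment. This is only a sketch: summarize_ha_state is a hypothetical helper name, not an OpenSAF tool, and it assumes the output keeps the two-line layout shown above (a safSISU line followed by its saAmfSISUHAState line).

```shell
#!/bin/sh
# summarize_ha_state: hypothetical helper, not part of OpenSAF.
# Reads status text on stdin and prints "<SU> <SG> <HA state>" per assignment.
# Assumes each safSISU line is followed by its saAmfSISUHAState line.
summarize_ha_state() {
  awk '
    /^safSISU=/ {
      # The SU name sits between "safSu=" and the first escaped comma "\,".
      su = $0
      sub(/^safSISU=safSu=/, "", su)
      sub(/\\,.*/, "", su)
      # The SG name sits between "safSg=" and the next escaped comma.
      sg = $0
      sub(/^.*safSg=/, "", sg)
      sub(/\\,.*/, "", sg)
      next
    }
    /saAmfSISUHAState=/ {
      # Keep only the state name, dropping the numeric code in parentheses.
      state = $0
      sub(/.*saAmfSISUHAState=/, "", state)
      sub(/\(.*/, "", state)
      print su, sg, state
    }
  '
}
```

It would be used as "/etc/init.d/opensafd status | summarize_ha_state"; for the first pair of lines above it prints "n10s2 NoRed ACTIVE".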
The following applications are running on the payloads:
atcafs-n10s4:~# ps -aef | grep ins
root 3379 1 21 11:34 ? 00:21:36 /hegw/gsw/bin/hms instantiate
root 3396 1 11 11:34 ? 00:11:49 /hegw/gsw/bin/mms instantiate
root 3410 1 2 11:34 ? 00:02:05 /hegw/gsw/bin/dra instantiate
root 3424 1 2 11:34 ? 00:02:15 /hegw/gsw/bin/bcm instantiate
Problem Detail:
When I kill the application (hms) with signal 11 ("kill -11 3379"), it
generates a core file of about 7 GB. OpenSAF tries to restart the process
in 60 s, but by then the process is still busy writing the core, so its PID
is still active. As a result, OpenSAF fails with the error below:
Aug 29 13:26:12 localhost kernel: grsec: From 172.16.10.1: signal 11 sent to
/hegw/gsw/bin/hms[hms:11902] uid/euid:0/0 gid/egid:0/0, parent
/sbin/init[init:1] uid/euid:0/0 gid/egid:0/0 by /bin/bash[bash:10442]
uid/euid:0/0 gid/egid:0/0, parent /bin/login[login:10441] uid/euid:0/0
gid/egid:0/0
Aug 29 13:26:27 localhost osafamfnd[11779]:
'safComp=HMSComp_n10s4,safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
faulted due to 'healthCheckcallbackTimeout' : Recovery is 'componentRestart'
Aug 29 13:26:27 localhost AMF_DEMO: CMD=cleanup
Aug 29 13:26:27 localhost AMF_DEMO_VAR: AMF_DEMO_VAR4=COMP1_VALUE4
Aug 29 13:26:27 localhost AMF_DEMO_VAR: AMF_DEMO_VAR1=CT_VALUE1
Aug 29 13:26:27 localhost AMF_DEMO_VAR: AMF_DEMO_VAR2=COMP1_OVERLOAD_VALUE2
Aug 29 13:26:27 localhost AMF_DEMO_VAR: AMF_DEMO_VAR3=COMP1_VALUE3
Aug 29 13:26:37 localhost osafamfnd[11779]: Cleanup of
'safComp=HMSComp_n10s4,safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
failed
Aug 29 13:26:37 localhost osafamfnd[11779]: Reason:'Script did not exit within
time'
Aug 29 13:26:37 localhost osafamfnd[11779]: SU Failover trigerred for
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4': Failed component:
'safComp=HMSComp_n10s4,safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
Aug 29 13:26:37 localhost osafamfnd[11779]:
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4' Presence State
INSTANTIATED => TERMINATION_FAILED
Aug 29 13:26:37 localhost osafamfnd[11779]: Assigning
'safSi=HenbGw,safApp=HenbGwApp_PL_n10s4' QUIESCED to
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
Aug 29 13:26:37 localhost osafamfnd[11779]: Assigned
'safSi=HenbGw,safApp=HenbGwApp_PL_n10s4' QUIESCED to
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
Aug 29 13:26:37 localhost osafamfnd[11779]: Removing
'safSi=HenbGw,safApp=HenbGwApp_PL_n10s4' from
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
Aug 29 13:26:37 localhost osafamfnd[11779]: Removed
'safSi=HenbGw,safApp=HenbGwApp_PL_n10s4' from
'safSu=SU-n10s4,safSg=HenbGw-SG,safApp=HenbGwApp_PL_n10s4'
I have tried setting "OPENSAF_TERMTIMEOUT=1000" in the nid.conf file.
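To make the timing issue concrete: in the log above, the cleanup command is reported as failed 10 seconds after it started (13:26:27 to 13:26:37), while the process is still writing its core. Below is a minimal sketch of the kind of bounded wait a cleanup script would need if it has to wait for the old PID to disappear. wait_for_exit is a hypothetical helper, not an OpenSAF function, and the sketch only helps if the wait fits inside whatever cleanup timeout AMF applies to the component, which is an assumption here.

```shell
#!/bin/sh
# wait_for_exit PID TIMEOUT_SECONDS   (hypothetical helper, not part of OpenSAF)
# Polls about once per second until PID no longer exists or the timeout elapses.
# Returns 0 if the process is gone, 1 if it is still present at the deadline.
wait_for_exit() {
  pid=$1
  remaining=$2
  # "kill -0" delivers no signal; it only checks that the PID still exists
  # and can be signalled by the caller (run as root, as in the setup above).
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$remaining" -le 0 ]; then
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}
```

For example, "wait_for_exit 3379 120" would poll for up to about two minutes. A process that is still dumping core counts as present, which matches the situation described above.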