Hi Alex,

Agreed, adding a comment in nid.conf and 00-README.conf is good. The backtrace 
below looks normal. Can you share the syslogs?

/BR HansN

From: Alex Jones [mailto:[email protected]]
Sent: 2 May 2018 15:43
To: Hans Nordebäck <[email protected]>; Anders Widell 
<[email protected]>
Cc: [email protected]
Subject: Re: SV: [PATCH 1/1] nid: restart opensafd on failure when systemd 
enabled [#2839]


Hi Hans,

    I was finally able to get back to this.

    Having "Restart=on-failure" set works with REBOOT_ON_FAIL_TIMEOUT as long 
as RestartSec=xxx is also set in the service file to something greater than 
REBOOT_ON_FAIL_TIMEOUT. Maybe we could put a comment in nid.conf saying that, 
if you use systemd, you also need to set RestartSec to a value greater than 
REBOOT_ON_FAIL_TIMEOUT?
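
    As a minimal sketch of that combination, assuming the 
REBOOT_ON_FAIL_TIMEOUT=30 used in the earlier test and an illustrative 
RestartSec of 35 seconds (not a tested value), the [Service] section of the 
opensafd service file might contain:

```ini
[Service]
# Restart opensafd when it exits with a failure.
Restart=on-failure
# Assumption: REBOOT_ON_FAIL_TIMEOUT is 30 seconds, so systemd should wait
# longer than that before restarting. 35 is only an example value.
RestartSec=35
```

    The only point of the example is the ordering: RestartSec has to be 
larger than REBOOT_ON_FAIL_TIMEOUT.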

    Regarding "systemctl start opensafd; sleep 1; pkill -ABRT immnd": in my 
setup it does not restart after the nid phase. If I increase the sleep time to 
3 seconds, it starts to work. Here is the backtrace. Nothing looks suspicious.

(gdb) thread apply all bt

Thread 4 (Thread 0x7fbf852e9b00 (LWP 5123)):
#0  0x00007fbf839b906d in poll () from /lib64/libc.so.6
#1  0x00007fbf8462a370 in poll (__timeout=20000, __nfds=2, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2  mdtm_process_recv_events_tcp () at src/mds/mds_dt_trans.c:986
#3  0x00007fbf83c910db in start_thread () from /lib64/libpthread.so.0
#4  0x00007fbf839c1e3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fbf85309b00 (LWP 5122)):
#0  0x00007fbf839b906d in poll () from /lib64/libc.so.6
#1  0x00007fbf84601641 in poll (__timeout=4900, __nfds=1, __fds=0x7fbf85309260) at /usr/include/bits/poll2.h:46
#2  osaf_ppoll (io_fds=io_fds@entry=0x7fbf85309260, i_nfds=i_nfds@entry=1, i_timeout_ts=0x7fbf85309280, i_sigmask=i_sigmask@entry=0x0) at src/base/osaf_poll.c:108
#3  0x00007fbf84608c2f in ncs_tmr_wait () at src/base/sysf_tmr.c:463
#4  0x00007fbf83c910db in start_thread () from /lib64/libpthread.so.0
#5  0x00007fbf839c1e3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fbf82787700 (LWP 5121)):
#0  0x00007fbf839b906d in poll () from /lib64/libc.so.6
#1  0x00007fbf84601560 in poll (__timeout=-1, __nfds=1, __fds=0x7fbf82786e30) at /usr/include/bits/poll2.h:46
#2  osaf_poll_no_timeout (io_fds=0x7fbf82786e30, i_nfds=1) at src/base/osaf_poll.c:31
#3  0x00007fbf846017e5 in osaf_poll (io_fds=io_fds@entry=0x7fbf82786e30, i_nfds=i_nfds@entry=1, i_timeout=i_timeout@entry=-1) at src/base/osaf_poll.c:44
#4  0x00007fbf8460197c in auth_server_main (_fd=<optimized out>) at src/base/osaf_secutil.c:176
#5  0x00007fbf83c910db in start_thread () from /lib64/libpthread.so.0
#6  0x00007fbf839c1e3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fbf85341740 (LWP 5120)):
#0  0x00007fbf839b906d in poll () from /lib64/libc.so.6
#1  0x00007fbf850cc3b8 in poll (__timeout=<optimized out>, __nfds=5, __fds=0x7ffdb1e02590) at /usr/include/bits/poll2.h:46
#2  main (argc=<optimized out>, argv=<optimized out>) at src/imm/immnd/immnd_main.c:358
(gdb)

Alex



On 04/26/2018 03:38 AM, Hans Nordeback wrote:
________________________________
NOTICE: This email was received from an EXTERNAL sender
________________________________


Hi Alex,

I tested this: immnd gets restarted and systemd reports opensafd.service as 
active (running), so it works as expected. In your case, is immnd never 
restarted after the nid phase, or does it work if you increase the sleep time? 
One thing you can check is to send an ABRT instead of the KILL and check the 
core dump to see, e.g., at which address the signal was received. Perhaps you 
have found a "window" where immnd is not monitored?

/Regards HansN

On 04/25/2018 03:23 PM, Alex Jones wrote:

Hi Hans,

    I understand. But, what if it doesn't fail in the nid phase?

    If you run this command in your setup: "systemctl start opensafd; sleep 2; 
pkill -KILL immnd", does immnd get restarted? And does opensafd successfully 
come up according to systemd?

Alex

On 04/25/2018 09:19 AM, Hans Nordebäck wrote:

Hi Alex,

The reboot should only happen if REBOOT_ON_FAIL_TIMEOUT is set (i.e. not 0).
I checked the latest version: the reboot works fine if e.g. immnd fails in the 
nid phase and REBOOT_ON_FAIL_TIMEOUT is set.

/Thanks HansN

From: Alex Jones [mailto:[email protected]]
Sent: 25 April 2018 15:05
To: Hans Nordebäck <[email protected]>; Anders Widell <[email protected]>
Cc: [email protected]
Subject: Re: SV: [PATCH 1/1] nid: restart opensafd on failure when systemd 
enabled [#2839]


Hi Hans,



    There must be a hole here, then, because in our setup, if dtmd or immnd 
crashes early in the startup process, the node doesn't reboot and the 
executables are not restarted. If I set "Restart=on-failure" it works fine.



    Can you test this in your setup to see if you see the same thing?



Alex

On 04/24/2018 05:04 AM, Hans Nordeback wrote:


Hi Alex,



please see comment below.



/Thanks HansN

On 04/23/2018 03:56 PM, Alex Jones wrote:

Hi Hans,



    I just did some tests. Maybe there is a bug in nid, but when I do not have 
"Restart=on-failure", the node does not reboot when I run the command 
"systemctl start opensafd; sleep 3; pkill -KILL immnd", and opensafd times out 
and fails, with REBOOT_ON_FAIL_TIMEOUT=30.
[HansN] Isn't the nid phase finished before the sleep 3 command? It is only 
during the nid phase that REBOOT_ON_FAIL_TIMEOUT is used. After the nid phase 
opensaf enters "normal" operation, and no reboot will be performed as immnd is 
restartable. Instead of the sleep 3, you can edit the nodeinit.conf.controller 
file and change the immnd line to e.g. 
"/usr/local/lib/opensaf/clc-cli/osaf-immndx:IMMND ... "; then nid should fail 
to start and REBOOT_ON_FAIL_TIMEOUT should work.






    But, opensafd restarts every time when I run that command with 
"Restart=on-failure" set.



Alex

On 04/19/2018 04:02 PM, Hans Nordebäck wrote:


Hi Alex,



A question: if opensafd fails (assert or non-zero exit code), a reboot of the 
node will be performed if REBOOT_ON_FAIL_TIMEOUT is configured. I have not 
checked, but how does systemd handle the reboot request if Restart=on-failure 
is set?



/BR HansN

________________________________
From: Alex Jones <[email protected]>
Sent: 19 April 2018 17:27:27
To: Hans Nordebäck; Anders Widell
Cc: [email protected]; Alex Jones
Subject: [PATCH 1/1] nid: restart opensafd on failure when systemd enabled [#2839]

Under certain circumstances opensafd fails to start (immnd or dtmd crashes,
etc).

Apr 19 15:07:31 ams-idsp-46-novnfm osafdtmd[3315]: 
src/dtm/dtmnd/dtm_intra_svc.cc:1778: dtm_process_internode_service_up_msg: 
Assertion '0' failed.

We can tell systemd to restart opensafd if it fails to start.
---
 src/nid/opensafd.service.in | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/nid/opensafd.service.in b/src/nid/opensafd.service.in
index 7f4d75ee3..6050f5e88 100644
--- a/src/nid/opensafd.service.in
+++ b/src/nid/opensafd.service.in
@@ -12,5 +12,7 @@ ControlGroup=cpu:/
 TimeoutStartSec=3hours
 KillMode=none
 @systemdtasksmax@
+Restart=on-failure
+
 [Install]
 WantedBy=multi-user.target
--
2.13.6






_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
