Re: [RFC] smsbox dead-lock on shutdown

Stipe Tolj Thu, 05 Nov 2020 02:58:25 -0800

On 05.11.20, 10:41, Stipe Tolj wrote:

Hi all,


I came across an issue in the shutdown sequence of Kannel smsbox that
seems to me like a potential dead-lock situation during the shutdown phase.

On a loaded system bearerbox was SIGHUP'ed and hence instructed its
connected smsbox to go down too.

Bearerbox didn't shut down cleanly, so I forced a 'kill -9' to get it down.
The smsbox, though, still kept running, so I looked into the gdb
backtrace of the process a bit more.

What I see is this: (BTW, the line numbers don't match with the svn trunk).

#1 0x000000000044596b in gwthread_join_every (func=0x41ba40
<obey_request_thread>) at gwlib/gwthread-pthread.c:744
#2 0x00000000004142c8 in main (argc=<value optimized out>,
argv=0x7fff05d24428) at gw/smsbox.c:3872

so main() was blocking in the gwthread_join_every for the
obey_request_thread()s.

They themselves were blocked in:

#0 0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>)
at gw/smsbox.c:1346

in the semaphore_down(max_pending_requests); all before a
http_start_request().

We know that the semaphore_up() is performed in the
url_result_thread() when we get the response via
http_receive_result_real(), but that itself was blocked in:

#0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at
gwlib/list.c:478
#2 0x000000000044840c in http_receive_result_real (caller=0x1498e84,
status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x000000000041a98e in url_result_thread (arg=<value optimized out>)
at gw/smsbox.c:1105

so in the gwlist_consume() on the HTTPCaller *caller.
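
The interplay between the two backtraces can be modeled with a plain
POSIX counting semaphore. This is only a sketch and not Kannel code: the
pending_limit type and its functions are hypothetical names, and the
semaphore stands in for max_pending_requests. A request takes a slot
before it starts and gives it back once its result has arrived, so when
every slot is held by a request that is still waiting for its result, the
next "down" has nothing to take.

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical model, not Kannel code: 'slots' plays the role of
 * max_pending_requests. */
typedef struct {
    sem_t slots;
} pending_limit;

/* Allow at most 'max' requests in flight. Returns 0 on success. */
int pending_limit_init(pending_limit *p, unsigned max)
{
    return sem_init(&p->slots, 0, max);
}

/* Non-blocking variant of the "down" done before starting a request.
 * Returns 1 if a slot was taken, 0 if all slots are in use (a blocking
 * sem_wait() would sleep at this point, as in the backtrace above),
 * and -1 on any other error. */
int pending_limit_try_start(pending_limit *p)
{
    if (sem_trywait(&p->slots) == 0)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}

/* The "up" performed once the result of a request has been received. */
int pending_limit_finish(pending_limit *p)
{
    return sem_post(&p->slots);
}
```

With a limit of 2, two calls to pending_limit_try_start() succeed and a
third one returns 0 until pending_limit_finish() has been called for one
of the pending requests.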

Now, checking the shutdown sequence in main() we see that we do:

...
gwthread_join_every(obey_request_thread);
http_caller_signal_shutdown(caller);
gwthread_join_every(url_result_thread);
...

so we remove the producer on HTTPCaller *caller AFTER we join the
obey_request_thread()s, which are performing the semaphore_down.

This ends up in a dead-lock situation IMO.

Resolution should be simply to move the http_caller_signal_shutdown()
before gwthread_join_every(obey_request_thread) in the shutdown sequence.

Any comments, reviews are highly welcome.

ok, this is NOT blocking fully. It does block for any HTTP requests that
are performed against "bogus IP ranges", i.e. unrouted C-class 10.x.x.x
ranges, and blocks while we have our client timeout running, which is
240 seconds by default.


If we set

  group = smsbox
  ...
  http-timeout = 10

then it gets unblocked and shuts down cleanly.

So, forget about the dead-lock claim I made. The only thing that we MAY
want here is to have a more realistic TCP connection timeout?
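
For reference, one common way to bound only the TCP connect phase is a
non-blocking connect() followed by poll(). The following is a minimal
sketch under that assumption. It is not how gwlib handles this today,
and connect_with_timeout() is a hypothetical helper name, not an
existing Kannel function.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>   /* struct sockaddr_in for IPv4 callers */
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of gwlib: connect with an upper bound
 * on how long the TCP handshake may take. Returns the connected socket
 * in blocking mode, or -1 with errno set (ETIMEDOUT if the deadline
 * passed). */
int connect_with_timeout(const struct sockaddr *addr, socklen_t len,
                         int timeout_ms)
{
    int fd = socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking mode makes connect() return immediately. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, addr, len) < 0) {
        if (errno != EINPROGRESS)
            goto fail;

        /* Wait until the socket is writable or the timeout expires.
         * A signal restarts the wait with the full timeout. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int rc;
        do {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);
        if (rc == 0)
            errno = ETIMEDOUT;
        if (rc <= 0)
            goto fail;

        /* Writable does not mean connected: check the pending error. */
        int err = 0;
        socklen_t errlen = sizeof err;
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
            goto fail;
        if (err != 0) {
            errno = err;
            goto fail;
        }
    }

    /* Restore the original (blocking) mode for the caller. */
    if (fcntl(fd, F_SETFL, flags) < 0)
        goto fail;
    return fd;

fail:
    {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
    }
    return -1;
}
```

A caller would pass a filled-in struct sockaddr_in and a connect timeout
of a few seconds, so that an unrouted destination fails quickly, and
keep the longer http-timeout for the response itself.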


Stipe

--
Best Regards,
Stipe Tolj

-------------------------------------------------------------------
Düsseldorf, NRW, Germany

Kannel Foundation                 tolj.org system architecture
http://www.kannel.org/            http://www.tolj.org/

st...@kannel.org                  s...@tolj.org
-------------------------------------------------------------------
