On 05.11.20 at 10:41, Stipe Tolj wrote:
Hi all,

I came across an issue in the shutdown sequence of Kannel smsbox that
seems to me like a potential dead-lock situation during the shutdown phase.

On a loaded system bearerbox was SIGHUP'ed and hence instructed its
connected smsbox to go down too.

Bearerbox didn't shut down cleanly, so I forced a 'kill -9' to get it down.
The smsbox, though, still kept running, so I looked into the gdb
backtrace of the process a bit more.

What I see is this: (BTW, the line numbers don't match with the svn trunk).

#1 0x000000000044596b in gwthread_join_every (func=0x41ba40
<obey_request_thread>) at gwlib/gwthread-pthread.c:744
#2 0x00000000004142c8 in main (argc=<value optimized out>,
argv=0x7fff05d24428) at gw/smsbox.c:3872

so main() was blocking in the gwthread_join_every for the
obey_request_thread()s.

These threads were themselves blocked in:

#0 0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>)
at gw/smsbox.c:1346

that is, in the semaphore_down(max_pending_requests) call, right before an
http_start_request().

We know that the semaphore_up() is performed in the
url_result_thread() when we get the response via
http_receive_result_real(), but that thread was itself blocked in:

#0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at
gwlib/list.c:478
#2 0x000000000044840c in http_receive_result_real (caller=0x1498e84,
status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x000000000041a98e in url_result_thread (arg=<value optimized out>)
at gw/smsbox.c:1105

so in the gwlist_consume() on the HTTPCaller *caller.
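
To make the dependency between the two thread types concrete, here is a minimal sketch of the slot accounting described above, using plain POSIX semaphores. It is not the smsbox code: pending_init(), pending_try_start() and pending_finish() are hypothetical names, and the sketch uses sem_trywait() so that a full limit is reported instead of blocking, whereas the backtrace shows smsbox blocking in sem_wait(). It also assumes that every finished request, including one that fails or times out, is reported as a result and releases its slot.

```c
#include <semaphore.h>

/* Hypothetical stand-in for smsbox's max_pending_requests counter. */
static sem_t pending_slots;

/* Set up 'limit' free slots. Returns 0 on success, -1 on error. */
int pending_init(unsigned int limit)
{
    return sem_init(&pending_slots, 0, limit);
}

/*
 * Request side: take a slot before starting an HTTP request.
 * Returns 1 if a slot was taken, 0 if none could be taken (normally
 * because all slots are in use). The real code would block at this
 * point until a slot is released.
 */
int pending_try_start(void)
{
    return sem_trywait(&pending_slots) == 0;
}

/*
 * Result side: release one slot. Assumed to be called once per
 * finished request, whether it succeeded, failed or timed out.
 */
void pending_finish(void)
{
    sem_post(&pending_slots);
}
```

With a limit of 2, two requests can start, a third cannot until pending_finish() has been called once. A request thread therefore stays blocked for as long as no in-flight request delivers a result.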

Now, checking the shutdown sequence in main() we see that we do:

...
gwthread_join_every(obey_request_thread);
http_caller_signal_shutdown(caller);
gwthread_join_every(url_result_thread);
...

so we remove the producer on HTTPCaller *caller AFTER we join the
obey_request_thread()s, which are performing the semaphore_down.

This ends up in a dead-lock situation IMO.

Resolution should be simply to move the http_caller_signal_shutdown()
before gwthread_join_every(obey_request_thread) in the shutdown sequence.

Any comments, reviews are highly welcome.

OK, this is NOT blocking fully. It does block for any HTTP requests that are performed against "bogus IP ranges", i.e. unrouted private 10.x.x.x ranges, and it blocks for as long as our client timeout is running, which is 240 seconds by default.

If we set

  group = smsbox
  ...
  http-timeout = 10

then we get it unblocked and shutdown cleanly.

So, forget about the dead-lock claim I made. The only thing that we MAY want here is to have a more realistic TCP connection timeout?

Stipe

--
Best Regards,
Stipe Tolj

-------------------------------------------------------------------
Düsseldorf, NRW, Germany

Kannel Foundation                 tolj.org system architecture
http://www.kannel.org/            http://www.tolj.org/

st...@kannel.org                  s...@tolj.org
-------------------------------------------------------------------
