Re: [RFC] smsbox dead-lock on shutdown

2020-11-05 Thread Stipe Tolj

Am 05.11.20, 12:33, schrieb Alexander Malysh:

Hi,

agree that 240 sec timeout is too high. I think max 30 sec , should be
enough.


ok, I compared a couple of values here. First of all HTTP/1.1 spec DOES 
not define a timeout value as default.


Some that are used are:

- php: 60 sec.
- Microsoft .net: 120 sec.
- Firefox: 30 sec.
- node.js: 60 sec.

so, 30s seems fine to me. I was almost in favor of going lower. But that 
may be too low. Let's stick with 30s.


Changing and committing to svn trunk.

--
Best Regards,
Stipe Tolj

---
Düsseldorf, NRW, Germany

Kannel Foundation tolj.org system architecture
http://www.kannel.org/http://www.tolj.org/

st...@kannel.org  s...@tolj.org
---



Re: [RFC] smsbox dead-lock on shutdown

2020-11-05 Thread Alexander Malysh
Hi,

agree that 240 sec timeout is too high. I think max 30 sec , should be enough.

Alex
Am 5. Nov. 2020, 11:58 +0100 schrieb Stipe Tolj :
> Am 05.11.20, 10:41, schrieb Stipe Tolj:
> > Hi all,
> >
> > I got across and issue on the shutdown sequence of Kannel smsbox, that
> > seems to me like a potential dead-lock situation while shutdown phase.
> >
> > On a loaded system bearerbox was SIGHUP'ed and hence instructed it's
> > connected smsbox to go down too.
> >
> > Bearerbox didn't shutdown cleanly, so forced a 'kill -9' to get it down.
> > Through the smsbox still maintained running, and I looked into the gdb
> > backtrace of the process a bit more.
> >
> > What I see is this: (BTW, the line numbers don't match with the svn trunk).
> >
> > #1 0x0044596b in gwthread_join_every (func=0x41ba40
> > ) at gwlib/gwthread-pthread.c:744
> > #2 0x004142c8 in main (argc=,
> > argv=0x7fff05d24428) at gw/smsbox.c:3872
> >
> > so main() was blocking in the gwthread_join_every for the
> > obey_request_thread()s.
> >
> > They itself blocked in:
> >
> > #0 0x7f809e117bd1 in sem_wait () from /lib/libpthread.so.0
> > #1 0x0041bdcb in obey_request_thread (arg=)
> > at gw/smsbox.c:1346
> >
> > in the semaphore_down(max_pending_requests); all before a
> > http_start_request().
> >
> > Since we know that the semaphore_up() is performed in the
> > url_result_thread() when we got the response via
> > http_receive_result_real(), but that itself blocked in:
> >
> > #0 0x7f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > /lib/libpthread.so.0
> > #1 0x0044e098 in gwlist_consume (list=0x1498e50) at
> > gwlib/list.c:478
> > #2 0x0044840c in http_receive_result_real (caller=0x1498e84,
> > status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
> > body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
> > #3 0x0041a98e in url_result_thread (arg=)
> > at gw/smsbox.c:1105
> >
> > so in the gwlist_consume() on the HTTPCaller *caller.
> >
> > Now, checking the the shutdown sequence in main() we see that we do:
> >
> > ...
> > gwthread_join_every(obey_request_thread);
> > http_caller_signal_shutdown(caller);
> > gwthread_join_every(url_result_thread);
> > ...
> >
> > so we remove the producer on HTTPCaller *caller AFTER we join the
> > obey_request_thread()s, which are performing the semaphore_down.
> >
> > This ends up in a dead-lock situation IMO.
> >
> > Resolution should be simply to move the http_caller_signal_shutdown()
> > before gwthread_join_every(obey_request_thread) in the shutdown sequence.
> >
> > Any comments, reviews are highly welcome.
>
> ok, this is NOT blocking fully. It does block for any HTTP requests that
> are performed against "bogus IP ranges", i.e. unrouted C-class 10.x.x.x
> ranges, and blocks while we have out client timeout running, which is
> 240 seconds by default.
>
> If we set
>
> group = smsbox
> ...
> http-timeout = 10
>
> then we get it unblocked and shutdown cleanly.
>
> So, forget about the dead-lock claim I made. The only thing that we MAY
> want here is to have a more realistic TCP connection timeout?
>
> Stipe
>
> --
> Best Regards,
> Stipe Tolj
>
> ---
> Düsseldorf, NRW, Germany
>
> Kannel Foundation tolj.org system architecture
> http://www.kannel.org/ http://www.tolj.org/
>
> st...@kannel.org s...@tolj.org
> ---
>


Re: [RFC] smsbox dead-lock on shutdown

2020-11-05 Thread Stipe Tolj

Am 05.11.20, 10:41, schrieb Stipe Tolj:

Hi all,

I got across and issue on the shutdown sequence of Kannel smsbox, that
seems to me like a potential dead-lock situation while shutdown phase.

On a loaded system bearerbox was SIGHUP'ed and hence instructed it's
connected smsbox to go down too.

Bearerbox didn't shutdown cleanly, so forced a 'kill -9' to get it down.
Through the smsbox still maintained running, and I looked into the gdb
backtrace of the process a bit more.

What I see is this: (BTW, the line numbers don't match with the svn trunk).

#1 0x0044596b in gwthread_join_every (func=0x41ba40
) at gwlib/gwthread-pthread.c:744
#2 0x004142c8 in main (argc=,
argv=0x7fff05d24428) at gw/smsbox.c:3872

so main() was blocking in the gwthread_join_every for the
obey_request_thread()s.

They itself blocked in:

#0 0x7f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x0041bdcb in obey_request_thread (arg=)
at gw/smsbox.c:1346

in the semaphore_down(max_pending_requests); all before a
http_start_request().

Since we know that the semaphore_up() is performed in the
url_result_thread() when we got the response via
http_receive_result_real(), but that itself blocked in:

#0 0x7f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1 0x0044e098 in gwlist_consume (list=0x1498e50) at
gwlib/list.c:478
#2 0x0044840c in http_receive_result_real (caller=0x1498e84,
status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x0041a98e in url_result_thread (arg=)
at gw/smsbox.c:1105

so in the gwlist_consume() on the HTTPCaller *caller.

Now, checking the the shutdown sequence in main() we see that we do:

...
gwthread_join_every(obey_request_thread);
http_caller_signal_shutdown(caller);
gwthread_join_every(url_result_thread);
...

so we remove the producer on HTTPCaller *caller AFTER we join the
obey_request_thread()s, which are performing the semaphore_down.

This ends up in a dead-lock situation IMO.

Resolution should be simply to move the http_caller_signal_shutdown()
before gwthread_join_every(obey_request_thread) in the shutdown sequence.

Any comments, reviews are highly welcome.


ok, this is NOT blocking fully. It does block for any HTTP requests that 
are performed against "bogus IP ranges", i.e. unrouted C-class 10.x.x.x 
ranges, and blocks while we have out client timeout running, which is 
240 seconds by default.


If we set

  group = smsbox
  ...
  http-timeout = 10

then we get it unblocked and shutdown cleanly.

So, forget about the dead-lock claim I made. The only thing that we MAY 
want here is to have a more realistic TCP connection timeout?


Stipe

--
Best Regards,
Stipe Tolj

---
Düsseldorf, NRW, Germany

Kannel Foundation tolj.org system architecture
http://www.kannel.org/http://www.tolj.org/

st...@kannel.org  s...@tolj.org
---



[RFC] smsbox dead-lock on shutdown

2020-11-05 Thread Stipe Tolj

Hi all,

I got across and issue on the shutdown sequence of Kannel smsbox, that 
seems to me like a potential dead-lock situation while shutdown phase.


On a loaded system bearerbox was SIGHUP'ed and hence instructed it's 
connected smsbox to go down too.


Bearerbox didn't shutdown cleanly, so forced a 'kill -9' to get it down. 
Through the smsbox still maintained running, and I looked into the gdb 
backtrace of the process a bit more.


What I see is this: (BTW, the line numbers don't match with the svn trunk).

#1  0x0044596b in gwthread_join_every (func=0x41ba40 
) at gwlib/gwthread-pthread.c:744
#2  0x004142c8 in main (argc=, 
argv=0x7fff05d24428) at gw/smsbox.c:3872


so main() was blocking in the gwthread_join_every for the 
obey_request_thread()s.


They itself blocked in:

#0  0x7f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1  0x0041bdcb in obey_request_thread (arg=out>) at gw/smsbox.c:1346


in the semaphore_down(max_pending_requests); all before a 
http_start_request().


Since we know that the semaphore_up() is performed in the 
url_result_thread() when we got the response via 
http_receive_result_real(), but that itself blocked in:


#0  0x7f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x0044e098 in gwlist_consume (list=0x1498e50) at 
gwlib/list.c:478
#2  0x0044840c in http_receive_result_real (caller=0x1498e84, 
status=0x44485054, final_url=0x44485018, headers=0x44484ff8, 
body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3  0x0041a98e in url_result_thread (arg=) 
at gw/smsbox.c:1105


so in the gwlist_consume() on the HTTPCaller *caller.

Now, checking the the shutdown sequence in main() we see that we do:

...
gwthread_join_every(obey_request_thread);
http_caller_signal_shutdown(caller);
gwthread_join_every(url_result_thread);
...

so we remove the producer on HTTPCaller *caller AFTER we join the 
obey_request_thread()s, which are performing the semaphore_down.


This ends up in a dead-lock situation IMO.

Resolution should be simply to move the http_caller_signal_shutdown() 
before gwthread_join_every(obey_request_thread) in the shutdown sequence.


Any comments, reviews are highly welcome.

Stay safe all,
Stipe


--
Best Regards,
Stipe Tolj

---
Düsseldorf, NRW, Germany

Kannel Foundation tolj.org system architecture
http://www.kannel.org/http://www.tolj.org/

st...@kannel.org  s...@tolj.org
---