Thanks for your reply Stipe - much appreciated!
On 19/08/2006, at 2:08 AM, Stipe Tolj wrote:
Hi Giulio,
Giulio Harding wrote:
We're looking for some help/advice with a Kannel problem we've
just encountered.
- We're using Kannel CVS (up-to-date as of today, 28/07/2006)
- Kannel is built with --prefix=/opt/kannel --enable-start-stop-daemon
- Kannel is running on an Intel Xeon 2.8 GHz w/ 2 GB RAM, running
Centos 4.3 with all the latest updates
- We are using two instances of OpenSMPP SMSC simulator running on
two separate machines (similar specs to above) generating a total
of 200000 messages as fast as they can.
- Kannel is delivering to a stock httpd running on a separate
machine, which returns a small (6 byte) response body.
Our problem is this: when injecting messages we get a throughput
in excess of 600 inbound SMPP MO messages per second. Shortly
after starting to inject the messages, we see a large number of
errors in smsbox.log as follows:
2006-07-28 16:55:23 [18715] [4] INFO: Starting to service <testXXX> from <61432123123> to <1234>
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't create new socket.
2006-07-28 16:55:23 [18715] [9] ERROR: System error 24: Too many open files
2006-07-28 16:55:23 [18715] [9] ERROR: error connecting to server `10.100.123.20' at port `80'
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't send request to <http://10.100.123.20/test.html?msgData=testXXX&sourceAddr=61432123123&channel=test&destinationAddr=1234>
ok, this seems to me like you're generating a very large stream of MO
messages (using that Logica tool) over 2 SMPP connections. But you
may be using only *one* smsbox connected to the bearerbox, right?
That's right.
In such a case you have an unbalanced situation. Of course HTTP is
a more "overhead-heavy" protocol for transporting your SMS payload than
SMPP is. So you have 2 connections inbound that stream very fast, but
only one smsbox that calls HTTP to the application layer.
For this purpose Kannel's architecture can be used to load-balance
the HTTP side, by connecting more smsbox instances to bearerbox.
Actually you can connect 1-n instances, as long as you use a
different sendsms HTTP port in each config for every smsbox.
This will cause bearerbox to round-robin the MO messages (actually
obeying the heartbeat load indication a connected smsbox gives)
to the n connected smsbox instances.
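For illustration, a minimal sketch of what two such smsbox configs might
look like (only the smsbox groups are shown; hosts and ports are made-up
example values, adjust to your own setup):

    # smsbox-1.conf
    group = smsbox
    bearerbox-host = localhost
    sendsms-port = 13013

    # smsbox-2.conf
    group = smsbox
    bearerbox-host = localhost
    sendsms-port = 13014

Each instance is started with its own config file; both connect to the
same bearerbox, which then spreads the MO load across them.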
Using 'lsof' and tweaking the 'nofile' limit in
'/etc/security/limits.conf' confirms that Kannel is hitting the
system's limit on the number of open files per process when trying to
create connections to the HTTP server. The errors are generated in the
function 'static Connection *get_connection(HTTPServer *trans)' in
gwlib/http.c. As a result of these errors, even though the Kannel
inbound counter indicates that all the incoming messages have been
received, many of them are not successfully delivered to the HTTP
server. They are not retransmitted, and are effectively lost.
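(For reference, a sketch of the limit tweak - assuming the standard PAM
limits file and that the boxes run as the 'kannel' user; the value is
only an example:)

    # /etc/security/limits.conf
    kannel  soft  nofile  65536
    kannel  hard  nofile  65536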
usually, if we can't reach the HTTP server, smsbox will retry.
Retries can be configured via 2 specific config directives, see the
user's guide. In case we get a hard system error, like in this case
where we can't acquire a socket for TCP transport, we fail the
transport and the message is lost.
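(For reference, the two directives meant here appear to be
http-request-retry and http-queue-delay in the smsbox group - please
double-check against the user's guide; the values below are purely
illustrative:)

    group = smsbox
    bearerbox-host = localhost
    # retry a failed HTTP call up to 3 times, waiting 10 seconds between attempts
    http-request-retry = 3
    http-queue-delay = 10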
The message is still considered "received", since bearerbox has
received the message. That's no guarantee that the application
layer is getting it. In the worst case you could configure smsbox to
discard all messages, which would still mean that physically they have
been received by the core transport layer, bearerbox.
That's why smsbox and bearerbox have their own access-log files.
We think there are some problems cleaning up connections to the
HTTP server after they have been used - lsof shows that it can
take some time (up to a minute or two) for the number of open file
descriptors for the Kannel processes to drop after traffic has
stopped passing through Kannel. Also, when this problem is
encountered, lsof shows a large number of strange sockets, e.g.:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
...
smsbox 9614 kannel 29u sock 0,4 6497455 can't identify protocol
smsbox 9614 kannel 30u sock 0,4 6497456 can't identify protocol
smsbox 9614 kannel 31u sock 0,4 6497457 can't identify protocol
ok, a 1-2 minute backoff could simply be the result of us using
HTTP/1.1 with keep-alive. So if you acquire a socket to the HTTP server,
issue the request and get the response, and nothing more is
requested over the still-active TCP connection, the server will
wait 1-2 minutes before cutting off the keep-alive connection.
On the other hand, if traffic starts again, we will pull the
still-open connection from the connection pool and re-use it. So
the behaviour seems ok to me.
Hmmm, Kannel's behaviour still seems incorrect to me. Smsbox is
opening too many HTTP connections, and is thus running out of file
descriptors, and falling over. It may be reusing connections in its
pool, but the pool size seems to be unbounded. This is not a good
thing...
Under load, lsof on the server running Kannel shows a large number of
open TCP connections (986 at the time the excerpt below was taken, not
counting the 'strange' unidentifiable sockets that Kannel also has
open) - too many, since it reaches the OS limit for open files:
...
smsbox 18683 kannel 168u IPv4 2067802 TCP 10.100.123.233:47198->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 169u IPv4 2069629 TCP 10.100.123.233:47576->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 170u IPv4 2069664 TCP 10.100.123.233:47581->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 171u IPv4 2067803 TCP 10.100.123.233:47199->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 172u IPv4 2067804 TCP 10.100.123.233:47200->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 173u IPv4 2069665 TCP 10.100.123.233:47582->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 174u IPv4 2069666 TCP 10.100.123.233:47583->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 175u IPv4 2067805 TCP 10.100.123.233:47201->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 176u IPv4 2069667 TCP 10.100.123.233:47584->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 177u IPv4 2067806 TCP 10.100.123.233:47202->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 178u IPv4 2067807 TCP 10.100.123.233:47203->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 179u IPv4 2067808 TCP 10.100.123.233:47204->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 180u IPv4 2069668 TCP 10.100.123.233:47585->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 181u IPv4 2069669 TCP 10.100.123.233:47586->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 182u IPv4 2067809 TCP 10.100.123.233:47205->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 183u IPv4 2067810 TCP 10.100.123.233:47206->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 184u IPv4 2067811 TCP 10.100.123.233:47207->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 185u IPv4 2067812 TCP 10.100.123.233:47208->10.100.123.20:http (SYN_SENT)
...
On the application Apache HTTP server (with its maximum number of
connections set to 5 for testing purposes), lsof only ever shows 5
connections established at a time:
httpd 31403 root 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 8u IPv6 183249 TCP cookiemonster:http->10.100.123.233:42685 (ESTABLISHED)
httpd 31406 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31406 apache 8u IPv6 183251 TCP cookiemonster:http->10.100.123.233:42687 (ESTABLISHED)
httpd 31408 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31408 apache 8u IPv6 183252 TCP cookiemonster:http->10.100.123.233:42811 (ESTABLISHED)
httpd 31409 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31409 apache 8u IPv6 183250 TCP cookiemonster:http->10.100.123.233:42686 (ESTABLISHED)
httpd 31410 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31410 apache 8u IPv6 183253 TCP cookiemonster:http->10.100.123.233:42812 (ESTABLISHED)
The disparity between the two sides of the TCP connections is
probably due to backlog - only 5 TCP connections are ever established
with the HTTP server at a time, but the rest of the connections from
Kannel are 'pending' as per the OS/Apache TCP backlog limits. These
connections won't receive a 'connection refused'; they'll simply wait
until resources are available, whereupon they'll be fully established.
Because these connections aren't explicitly failing, and because of
the rate of incoming MO messages, smsbox appears to be simply
creating HTTP connections faster than they are being used. The only
solution is to forcibly reduce the rate of message delivery, or to
limit the number of HTTP connections to the application HTTP server
(as mentioned by Alexander Malysh in his reply to my original email).
Also, those 'strange' sockets that Kannel has look like they've been
closed on the other side (by the application HTTP server) and are
thus useless, and simply haven't been cleaned up. When an HTTP
connection is closed by the server, it should be cleaned up on
Kannel's side as soon as possible - certainly before attempting to
create new HTTP connections.
Limiting the number of concurrent connections that the http server
allows doesn't help - Kannel always seems to have more open file
descriptors than connections to the http server.
???
connections should be re-used from the connection pool. Can you
give us more details on this, please?
Sorry, I wasn't very clear - we tried limiting the number of
concurrent HTTP connections that the HTTP server would allow from
Kannel, to see if Kannel would reduce its connection/socket usage
accordingly (i.e. if Kannel got a 'connection refused' from the
HTTP server). As it turns out, the TCP backlog means that Kannel
can't really tell whether the application HTTP server can keep up
with its HTTP connections, so my comment above is redundant.
So: can someone confirm whether this really is a problem with
Kannel (I would be very surprised if it's been in use, in
production all this time with this kind of behaviour under load)?
Is our configuration wrong, perhaps? How should we be configuring
Kannel to deal with this kind of situation?
Kannel is in production use at numerous high-load sites, so this
remains strange to me.
Also, we noticed some comments in gwlib/http.c,
/* XXX re-implement socket pools, with idle connection killing to save sockets */
now, this should only have an impact if we constantly have requests
to different hosts. If it's the same host, connections should be
re-used.
Ok.
/* XXX set maximum number of concurrent connections to same host, total? */
this one is a TODO still, yes.
I think this is the crux of the problem: this TODO, and our issue,
would be addressed by the patch that Alexander has offered. Given
that he's indicated it is a fairly simple patch, and the issue
seems (to me) quite serious, I think this is worth focussing on ASAP.
These look like they may be directly related to the problem we're
experiencing - is anyone working on these tasks, and if so, is
there an ETA to implementation?
can you please re-try, spreading the MO load over, let's say, 2-4
smsbox connections....
It's still curious why sockets are kept open that way...
Ok, I'll try this out, with multiple smsboxes on the same machine,
and possibly multiple smsboxes on multiple machines (if I can source
some) - I'll also try fiddling with the priority of the bearerbox
process (using nice) to see if I can get the SMPP and HTTP processing
better balanced...
However, none of these workarounds will guarantee correct behaviour
under load, as there is nothing preventing Kannel from opening too
many sockets. As I mentioned before, I think this is a bug.
Stipe
-------------------------------------------------------------------
Kölner Landstrasse 419
40589 Düsseldorf, NRW, Germany
tolj.org system architecture Kannel Software Foundation (KSF)
http://www.tolj.org/ http://www.kannel.org/
mailto:st_{at}_tolj.org mailto:stolj_{at}_kannel.org
-------------------------------------------------------------------
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com