Thanks for your reply, Stipe - much appreciated!

On 19/08/2006, at 2:08 AM, Stipe Tolj wrote:

Hi Giulio,

Giulio Harding wrote:

We're looking for some help/advice with a Kannel problem we've just encountered.
- We're using Kannel CVS (up-to-date as of today, 28/07/2006)
- Kannel is built with --prefix=/opt/kannel --enable-start-stop-daemon
- Kannel is running on an Intel Xeon 2.8 GHz w/ 2 GB RAM, running CentOS 4.3 with all the latest updates
- We are using two instances of the OpenSMPP SMSC simulator, running on two separate machines (similar specs to the above), generating a total of 200000 messages as fast as they can
- Kannel is delivering to a stock httpd running on a separate machine, which returns a small (6-byte) response body

Our problem is this: when injecting messages we get a throughput in excess of 600 inbound SMPP MO messages per second. Shortly after starting to inject the messages, we see a large number of errors in smsbox.log, as follows:

2006-07-28 16:55:23 [18715] [4] INFO: Starting to service <testXXX> from <61432123123> to <1234>
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't create new socket.
2006-07-28 16:55:23 [18715] [9] ERROR: System error 24: Too many open files
2006-07-28 16:55:23 [18715] [9] ERROR: error connecting to server `10.100.123.20' at port `80'
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't send request to <http://10.100.123.20/test.html?msgData=testXXX&sourceAddr=61432123123&channel=test&destinationAddr=1234>

Ok, it seems to me that you are generating a very large stream of MO messages (using that Logica tool) over 2 SMPP connections, but you may be using only *one* smsbox connected to the bearerbox, right?

That's right.

In such a case you have an unbalanced situation. HTTP is of course a higher-overhead protocol for transporting your SMS payload than SMPP is. So you have 2 inbound connections that stream very fast, but only one smsbox making HTTP calls to the application layer.

For this purpose, Kannel's architecture can be used to load-balance the HTTP side, by connecting more smsbox instances to the bearerbox. You can actually connect 1-n instances, as long as each smsbox's config uses a different sendsms HTTP port.

This will cause bearerbox to round-robin the MO messages (actually obeying the load indication each connected smsbox gives via its heartbeat) across the n connected smsbox instances.
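For reference, a minimal sketch of what two such configs could look like (the group and directive names are from the Kannel user's guide; the ports and log paths here are made up for illustration):

```
# smsbox-1.conf -- first smsbox instance
group = smsbox
bearerbox-host = localhost
sendsms-port = 13013
log-file = "/var/log/kannel/smsbox1.log"

# smsbox-2.conf -- second instance; only the sendsms port and log differ
group = smsbox
bearerbox-host = localhost
sendsms-port = 13014
log-file = "/var/log/kannel/smsbox2.log"
```

Each smsbox is then started separately against the same bearerbox.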

Using 'lsof' and tweaking the 'nofiles' parameter in '/etc/security/limits' confirms that Kannel is hitting the system's limit on open files per process when trying to create connections to the HTTP server. The errors are generated in the function 'static Connection *get_connection(HTTPServer *trans)' in gwlib/http.c. As a result of these errors, even though the Kannel inbound counter indicates that all the incoming messages have been received, many of them are not successfully delivered to the HTTP server. They are not retransmitted, and are effectively lost.

Usually, if we can't reach the HTTP server, smsbox will retry. Retries can be configured via 2 specific config directives; see the user's guide. But when we get a hard system error like this one, where we can't acquire a socket for the TCP transport, we fail the transfer and the message is lost.
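If I read the user's guide correctly, the two directives meant here are the retry settings in the smsbox group; a sketch, with illustrative values only:

```
group = smsbox
bearerbox-host = localhost
http-request-retry = 5    # retry a failed HTTP request up to 5 times
http-queue-delay = 10     # seconds to wait between retry attempts
```

As noted above, though, these retries cover unreachable-server cases, not hard system errors like running out of file descriptors.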

The message is still considered "received", since bearerbox has received the message. That is no guarantee that the application layer gets it. In the worst case you could configure smsbox to discard all messages, which would still mean they had physically been received by the core transport layer, bearerbox.

That's why smsbox and bearerbox have their own access-log file.

We think there are some problems cleaning up connections to the HTTP server after they have been used - lsof shows that it can take some time (up to a minute or two) for the number of open file descriptors for the Kannel processes to drop after traffic has stopped passing through Kannel. Also, when this problem is encountered, lsof shows a large number of strange sockets, e.g.:

COMMAND PID  USER   FD  TYPE DEVICE SIZE NODE    NAME
...
smsbox  9614 kannel 29u sock 0,4         6497455 can't identify protocol
smsbox  9614 kannel 30u sock 0,4         6497456 can't identify protocol
smsbox  9614 kannel 31u sock 0,4         6497457 can't identify protocol

Ok, the 1-2 minute delay could simply be a result of our using HTTP/1.1 with keep-alive. If you acquire a socket to the HTTP server, issue the request and get the response, and nothing more is requested over the still-active TCP connection, the server will wait 1-2 minutes before cutting off the keep-alive connection.

On the other hand, if traffic re-starts again, we will pull the still open connection from the connection pool and re-use it. So the behaviour seems ok to me.


Hmmm, Kannel's behaviour still seems incorrect to me. Smsbox is opening too many HTTP connections, and is thus running out of file descriptors, and falling over. It may be reusing connections in its pool, but the pool size seems to be unbounded. This is not a good thing...

Under load, lsof on the server running Kannel shows a large number of TCP connections open (in the example below, at the time lsof was run, 986 - this doesn't include the 'strange' unidentifiable sockets that Kannel has open) - too many, since it reaches the OS limit for open files:

...
smsbox 18683 kannel 168u IPv4 2067802 TCP 10.100.123.233:47198->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 169u IPv4 2069629 TCP 10.100.123.233:47576->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 170u IPv4 2069664 TCP 10.100.123.233:47581->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 171u IPv4 2067803 TCP 10.100.123.233:47199->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 172u IPv4 2067804 TCP 10.100.123.233:47200->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 173u IPv4 2069665 TCP 10.100.123.233:47582->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 174u IPv4 2069666 TCP 10.100.123.233:47583->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 175u IPv4 2067805 TCP 10.100.123.233:47201->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 176u IPv4 2069667 TCP 10.100.123.233:47584->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 177u IPv4 2067806 TCP 10.100.123.233:47202->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 178u IPv4 2067807 TCP 10.100.123.233:47203->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 179u IPv4 2067808 TCP 10.100.123.233:47204->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 180u IPv4 2069668 TCP 10.100.123.233:47585->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 181u IPv4 2069669 TCP 10.100.123.233:47586->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 182u IPv4 2067809 TCP 10.100.123.233:47205->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 183u IPv4 2067810 TCP 10.100.123.233:47206->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 184u IPv4 2067811 TCP 10.100.123.233:47207->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 185u IPv4 2067812 TCP 10.100.123.233:47208->10.100.123.20:http (SYN_SENT)
...

On the application Apache HTTP server (with its maximum number of connections set to 5 for testing purposes), lsof only ever shows 5 connections established at a time:

httpd 31403 root   3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 8u IPv6 183249 TCP cookiemonster:http->10.100.123.233:42685 (ESTABLISHED)
httpd 31406 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31406 apache 8u IPv6 183251 TCP cookiemonster:http->10.100.123.233:42687 (ESTABLISHED)
httpd 31408 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31408 apache 8u IPv6 183252 TCP cookiemonster:http->10.100.123.233:42811 (ESTABLISHED)
httpd 31409 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31409 apache 8u IPv6 183250 TCP cookiemonster:http->10.100.123.233:42686 (ESTABLISHED)
httpd 31410 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31410 apache 8u IPv6 183253 TCP cookiemonster:http->10.100.123.233:42812 (ESTABLISHED)

The disparity between the two sides of the TCP connections is probably due to backlog - only 5 TCP connections are ever established with the HTTP server at a time, but the rest of the connections from Kannel are 'pending' as per OS/apache TCP backlog limits. These connections won't receive a 'connection refused', they'll simply wait until resources are available, whereupon they'll be fully established.
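The backlog effect described above can be demonstrated with a short, self-contained Python sketch (a hypothetical illustration, not Kannel code): a listener that never calls accept() still lets a handful of connect()s complete, while the rest hang in SYN_SENT instead of being refused.

```python
import socket

# Illustrates the TCP backlog behaviour described above: a server that
# never calls accept() still allows clients to complete connect() until
# its listen backlog fills; further attempts hang in SYN_SENT (here they
# hit the client-side timeout) rather than getting 'connection refused'.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(5)                # kernel queues roughly 5 completed connections
port = server.getsockname()[1]

clients, completed = [], 0
for _ in range(20):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)
    try:
        c.connect(("127.0.0.1", port))   # succeeds while the backlog has room
        completed += 1
    except OSError:
        pass                             # stuck in SYN_SENT, like smsbox's sockets
    clients.append(c)                    # keep sockets alive, as smsbox does

print(f"{completed} of 20 connects completed without any accept()")
```

On Linux, a few connects complete (roughly backlog + 1) and the rest time out - no 'connection refused' ever reaches the client, which is exactly why limiting Apache's connection count didn't push back on Kannel.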

Because these connections aren't explicitly failing, and because of the rate of incoming MO messages, smsbox appears to be simply creating HTTP connections faster than they are being used. The only solution is to forcibly reduce the rate of message delivery, or to limit the number of HTTP connections to the application HTTP server (as mentioned by Alexander Malysh in his reply to my original email).

Also, those 'strange' sockets that Kannel has look like they've been closed on the other side (by the application HTTP server) and are thus useless, and simply haven't been cleaned up. When a HTTP connection is closed by the server, it should be cleaned up on Kannel's side as soon as possible. They should really be cleaned up before attempting to create new HTTP connections.
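One way a pool can weed out server-closed keep-alive sockets before reuse is a cheap liveness probe. This is a minimal Python sketch of the idea only (a hypothetical helper; Kannel itself is C, and this is not its actual code):

```python
import socket

def peer_closed(conn: socket.socket) -> bool:
    """Return True if the remote end has closed this pooled connection.

    A non-blocking MSG_PEEK recv() on a socket the server has closed
    returns b'' immediately; on a healthy idle keep-alive connection it
    raises BlockingIOError, because no data is pending.
    """
    conn.setblocking(False)
    try:
        return conn.recv(1, socket.MSG_PEEK) == b""   # b'' => orderly close
    except BlockingIOError:
        return False                                  # still open, just idle
    finally:
        conn.setblocking(True)
```

A pool could run such a check when handing out an idle connection, discarding dead sockets immediately instead of leaving them to show up in lsof as "can't identify protocol".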


Limiting the number of concurrent connections that the http server allows doesn't help - Kannel always seems to have more open file descriptors than connections to the http server.

???

Connections should be re-used from the connection pool. Can you give us more details on this, please?

Sorry, I wasn't very clear - we tried limiting the number of concurrent HTTP connections that the HTTP server would allow from Kannel, to see whether Kannel would reduce its connection/socket usage accordingly (i.e. if it got a 'connection refused' from the HTTP server). As it turns out, the TCP backlog means that Kannel can't really tell whether the application HTTP server can keep up with its HTTP connections, so my comment above is redundant.

So: can someone confirm whether this really is a problem with Kannel (I would be very surprised if it's been in use, in production all this time with this kind of behaviour under load)? Is our configuration wrong, perhaps? How should we be configuring Kannel to deal with this kind of situation?

Kannel is in production use at numerous high-load sites, so this remains strange to me.

Also, we noticed some comments in gwlib/http.c,
/* XXX re-implement socket pools, with idle connection killing to save sockets */

Now, this should only have an impact if we constantly have requests to different hosts. If it's the same host, we should have re-use of connections.

Ok.

/* XXX set maximum number of concurrent connections to same host, total? */

this one is a TODO still, yes.

I think this is the crux of the problem: this TODO. Our issue would be addressed by the patch that Alexander has offered - given that he's indicated it's a fairly simple patch, and the issue seems (to me) quite serious, I think it's worth focusing on ASAP.

These look like they may be directly related to the problem we're experiencing - is anyone working on these tasks, and if so, is there an ETA to implementation?

Can you please retry, spreading the MO load over, let's say, 2-4 smsbox connections...

It's still curious why sockets are kept open that way...

Ok, I'll try this out, with multiple smsboxes on the same machine, and possibly multiple smsboxes on multiple machines (if I can source some) - I'll also try fiddling with the priority of the bearerbox process (using nice) to see if I can get the SMPP and HTTP processing better balanced...

However, none of these workarounds will guarantee correct behaviour under load, as there is nothing preventing Kannel from opening too many sockets. As I mentioned before, I think this is a bug.

Stipe

-------------------------------------------------------------------
Kölner Landstrasse 419
40589 Düsseldorf, NRW, Germany

tolj.org system architecture      Kannel Software Foundation (KSF)
http://www.tolj.org/              http://www.kannel.org/

mailto:st_{at}_tolj.org           mailto:stolj_{at}_kannel.org
-------------------------------------------------------------------

--
Giulio Harding
Systems Administrator

m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com




