Hi Giulio,

Giulio Harding wrote:

We're looking for some help/advice with a Kannel problem we've just encountered.

- We're using Kannel CVS (up-to-date as of today, 28/07/2006)
- Kannel is built with --prefix=/opt/kannel --enable-start-stop-daemon
- Kannel is running on an Intel Xeon 2.8 GHz w/ 2 GB RAM, running Centos 4.3 with all the latest updates
- We are using two instances of OpenSMPP SMSC simulator running on two separate machines (similar specs to above) generating a total of 200000 messages as fast as they can.
- Kannel is delivering to a stock httpd running on a separate machine, which returns a small (6 byte) response body.

Our problem is this: when injecting messages we get a throughput in excess of 600 inbound SMPP MO messages per second. Shortly after starting to inject the messages, we see a large number of errors in smsbox.log as follows:

2006-07-28 16:55:23 [18715] [4] INFO: Starting to service <testXXX> from <61432123123> to <1234>
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't create new socket.
2006-07-28 16:55:23 [18715] [9] ERROR: System error 24: Too many open files
2006-07-28 16:55:23 [18715] [9] ERROR: error connecting to server `10.100.123.20' at port `80'
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't send request to <http://10.100.123.20/test.html?msgData=testXXX&sourceAddr=61432123123&channel=test&destinationAddr=1234>

OK, it seems to me that you are generating a very large stream of MO messages (using that Logica tool) over 2 SMPP connections. But you may be using only *one* smsbox connected to the bearerbox, right?

In that case you have an unbalanced setup. HTTP carries more overhead than SMPP for transporting the SMS payload. So you have 2 inbound connections streaming very fast, but only one smsbox making the HTTP calls to the application layer.

For this purpose Kannel's architecture lets you balance the load on the HTTP side by connecting more smsbox instances to bearerbox. You can connect 1 to n instances, as long as each smsbox config uses a different sendsms HTTP port.

Bearerbox will then distribute the MO messages round-robin across the n connected smsbox instances (actually taking into account the load indication each connected smsbox reports in its heartbeat).
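To illustrate the idea of load-aware distribution, here is a minimal Python sketch. It is not Kannel's actual routing code: the `SmsBox` class, the `load` field and the "pick the least loaded box" rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SmsBox:
    """Hypothetical stand-in for one smsbox connected to bearerbox."""
    name: str
    load: int = 0  # assumed meaning: messages outstanding, as reported via heartbeat


def pick_box(boxes):
    """Return the box with the lowest reported load (first one wins on a tie)."""
    return min(boxes, key=lambda b: b.load)


def dispatch(messages, boxes):
    """Hand each message to the least loaded box; return per-box message counts."""
    assigned = {b.name: 0 for b in boxes}
    for _ in messages:
        box = pick_box(boxes)
        box.load += 1
        assigned[box.name] += 1
    return assigned
```

With equally loaded boxes this degenerates into plain round-robin; a box that is already busy gets skipped until the others catch up.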

Using 'lsof' and tweaking the 'nofiles' parameter in '/etc/security/limits' confirms that Kannel is hitting the system's limit of number of open files per process when trying to create connections to the http server. The errors are generated in the function 'static Connection *get_connection(HTTPServer *trans)' in gwlib/http.c . As a result of these errors even though the Kannel inbound counter indicates that all the incoming messages have been received, many of them are not successfully delivered to the http server. They are not retransmitted, and are effectively lost.

Usually, if we can't reach the HTTP server, smsbox retries. Retries can be configured via 2 specific config directives, see the user's guide. In the case of a hard system error like this one, where we can't acquire a socket for the TCP transport, we fail the transport and the message is lost.
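The difference between the two failure classes can be sketched in Python as follows. This is only an illustration of the distinction, not how smsbox is implemented; the `classify` function and the two errno sets are assumptions. "System error 24" from the log above corresponds to `EMFILE`.

```python
import errno

# Assumed grouping: failures where the remote side is the problem and a
# retry may succeed later.
RETRYABLE = {errno.ECONNREFUSED, errno.ETIMEDOUT,
             errno.EHOSTUNREACH, errno.ECONNRESET}

# Failures where the local process or system has run out of descriptors;
# no socket can be created at all.
LOCAL_RESOURCE = {errno.EMFILE, errno.ENFILE}


def classify(err: OSError) -> str:
    """Map an OSError from socket creation/connect to a coarse failure class."""
    if err.errno in LOCAL_RESOURCE:
        return "local-resource"
    if err.errno in RETRYABLE:
        return "retryable"
    return "other"
```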

The message is still considered "received", since bearerbox has received the message. That is no guarantee that the application layer gets it. In the worst case you could configure smsbox to discard all messages, which would still mean they have physically been received by the core transport layer, bearerbox.

That's why smsbox and bearerbox have their own access-log file.

We think there are some problems cleaning up connections to the http server after they have been used - lsof shows that it can take some time (up to a minute or two) for the number of open file descriptors for the Kannel processes to drop, after traffic has stopped passing through Kannel. Also, when this problem is encountered, lsof shows a large number of strange sockets, e.g:

COMMAND     PID           USER   FD      TYPE     DEVICE     SIZE NODE NAME
...
smsbox 9614 kannel 29u sock 0,4 6497455 can't identify protocol
smsbox 9614 kannel 30u sock 0,4 6497456 can't identify protocol
smsbox 9614 kannel 31u sock 0,4 6497457 can't

OK, the 1-2 minute delay could simply be the result of our using HTTP/1.1 with keep-alive. If you acquire a socket to the HTTP server, issue the request and get the response, and nothing more is requested over the still-active TCP connection, the server waits 1-2 minutes before cutting off the keep-alive connection.

On the other hand, if traffic starts again, we will pull the still-open connection from the connection pool and re-use it. So the behaviour seems OK to me.
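Roughly, the pooling idea looks like the Python sketch below. It is a simplified model and not the code in gwlib/http.c; the `ConnectionPool` class, its method names and the `connect` callback are hypothetical.

```python
from collections import defaultdict, deque


class ConnectionPool:
    """Toy keep-alive pool: idle connections are kept per (host, port)."""

    def __init__(self, connect):
        self._connect = connect           # callable (host, port) -> connection
        self._idle = defaultdict(deque)   # (host, port) -> idle connections
        self.opened = 0                   # how many new connections were created

    def get(self, host, port):
        """Re-use an idle connection if one exists, otherwise open a new one."""
        idle = self._idle[(host, port)]
        if idle:
            return idle.popleft()
        self.opened += 1
        return self._connect(host, port)

    def put(self, host, port, conn):
        """Return a connection to the pool once its response has been read."""
        self._idle[(host, port)].append(conn)
```

Note that in this model re-use only helps once a connection has been handed back. If many requests are in flight at the same moment, each one needs its own socket, so the descriptor count follows the number of concurrent requests and not the number of hosts.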

Limiting the number of concurrent connections that the http server allows doesn't help - Kannel always seems to have more open file descriptors than connections to the http server.

???

Connections should be re-used from the connection pool. Can you give us more details on this, please?

So: can someone confirm whether this really is a problem with Kannel (I would be very surprised if it's been in use, in production all this time with this kind of behaviour under load)? Is our configuration wrong, perhaps? How should we be configuring Kannel to deal with this kind of situation?

Kannel is in production use at numerous high-load sites, so this still seems strange to me.

Also, we noticed some comments in gwlib/http.c,

/* XXX re-implement socket pools, with idle connection killing to save sockets */

Now, this should only have an impact if we constantly send requests to different hosts. If it's the same host, connections should be re-used.

/* XXX set maximum number of concurrent connections to same host, total? */

This one is still a TODO, yes.

These look like they may be directly related to the problem we're experiencing - is anyone working on these tasks, and if so, is there an ETA to implementation?

Can you please retry, spreading the MO load over, let's say, 2-4 smsbox connections?

It's still curious why sockets are kept open that way...
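For watching the descriptor count against the limit during such a test run, a small Python helper along these lines might be handy. It is a sketch that assumes Linux with /proc mounted; the function names are made up, and `nofile_limits` only reports the limits of the process that runs it, not those of smsbox.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only); None if not readable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None
```

Calling `open_fd_count(9614)` once per second (with the smsbox PID from the lsof listing, and sufficient permissions) would show whether the count keeps climbing or levels off once traffic stops.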

Stipe

-------------------------------------------------------------------
Kölner Landstrasse 419
40589 Düsseldorf, NRW, Germany

tolj.org system architecture      Kannel Software Foundation (KSF)
http://www.tolj.org/              http://www.kannel.org/

mailto:st_{at}_tolj.org           mailto:stolj_{at}_kannel.org
-------------------------------------------------------------------
