Hi Giulio,
Giulio Harding wrote:
We're looking for some help/advice with a Kannel problem we've just
encountered.
- We're using Kannel CVS (up-to-date as of today, 28/07/2006)
- Kannel is built with --prefix=/opt/kannel --enable-start-stop-daemon
- Kannel is running on an Intel Xeon 2.8 GHz w/ 2 GB RAM, running CentOS
4.3 with all the latest updates
- We are using two instances of the OpenSMPP SMSC simulator running on two
separate machines (similar specs to above) generating a total of 200000
messages as fast as they can.
- Kannel is delivering to a stock httpd running on a separate machine,
which returns a small (6-byte) response body.
Our problem is this: when injecting messages we get a throughput in
excess of 600 inbound SMPP MO messages per second. Shortly after
starting to inject the messages, we see a large number of errors in
smsbox.log as follows:
2006-07-28 16:55:23 [18715] [4] INFO: Starting to service <testXXX> from
<61432123123> to <1234>
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't create new socket.
2006-07-28 16:55:23 [18715] [9] ERROR: System error 24: Too many open files
2006-07-28 16:55:23 [18715] [9] ERROR: error connecting to server
`10.100.123.20' at port `80'
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't send request to
<http://10.100.123.20/test.html?msgData=testXXX&sourceAddr=61432123123&channel=test&destinationAddr=1234>
ok, it seems to me that you are generating a very large stream of MO messages
(using that Logica tool) over 2 SMPP connections. But you may be using only
*one* smsbox connected to the bearerbox, right?
In such a case you have an unbalanced situation. Of course HTTP is a protocol
with more "overhead" for transporting your SMS payload than SMPP is. So you
have 2 inbound connections streaming very fast, but only one smsbox calling
HTTP to the application layer.
For this purpose Kannel's architecture can be used to balance the load on the
HTTP side, by connecting more smsbox instances to bearerbox. Actually you can
connect 1-n instances, as long as you use a different sendsms HTTP port in the
config of each smsbox.
This will cause bearerbox to round-robin the MO messages to the n connected
smsbox instances (actually obeying the load indication each connected smsbox
gives via its heartbeat), as sketched below.
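Something like this in each smsbox's config file should do; file names, ports
and paths here are only illustrative:

  # smsbox-1.conf
  group = smsbox
  bearerbox-host = localhost
  sendsms-port = 13013
  log-file = "/var/log/kannel/smsbox-1.log"

  # smsbox-2.conf (identical, except for the sendsms port and log file)
  group = smsbox
  bearerbox-host = localhost
  sendsms-port = 13014
  log-file = "/var/log/kannel/smsbox-2.log"

Both boxes connect to the same bearerbox (via the smsbox-port given in the
core group), so nothing changes on the bearerbox side.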
Using 'lsof' and tweaking the 'nofile' parameter in
'/etc/security/limits.conf' confirms that Kannel is hitting the system's
limit on the number of open files per process when trying to create
connections to the http server. The errors are generated in the function
'static Connection *get_connection(HTTPServer *trans)' in gwlib/http.c.
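(For reference, the limit was raised via pam_limits along these lines; the
user name and value are illustrative:

  # /etc/security/limits.conf
  kannel  soft  nofile  65535
  kannel  hard  nofile  65535

and confirmed with 'ulimit -n' in the shell that starts Kannel.)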
As a result of these errors, even though the Kannel inbound counter
indicates that all the incoming messages have been received, many of
them are not successfully delivered to the http server. They are not
retransmitted, and are effectively lost.
usually, if we can't reach the HTTP server, smsbox would retry. Retries can be
configured via 2 specific config directives, see the user's guide. In case we
get a hard system error, as here, where we can't acquire a socket for the TCP
transport, we fail the transport and the message is lost.
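For reference, the two directives in question go into the smsbox group;
assuming current CVS, something like this (values are illustrative):

  group = smsbox
  ...
  # retry failed HTTP requests of sms-services up to 3 times,
  # waiting 10 seconds between attempts
  http-request-retry = 3
  http-queue-delay = 10

Note that, as said above, a hard system error like running out of file
descriptors still loses the message.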
The message is still considered "received", since bearerbox has received the
message. That's no guarantee that the application layer gets it. In the worst
case you could configure smsbox to discard all messages, which would still
mean they have been physically received by the core transport layer,
bearerbox. That's why smsbox and bearerbox each have their own access-log file.
We think there are some problems cleaning up connections to the http
server after they have been used - lsof shows that it can take some time
(up to a minute or two) for the number of open file descriptors for the
Kannel processes to drop, after traffic has stopped passing through
Kannel. Also, when this problem is encountered, lsof shows a large
number of strange sockets, e.g.:
COMMAND   PID  USER    FD   TYPE  DEVICE  SIZE  NODE     NAME
...
smsbox   9614  kannel  29u  sock  0,4           6497455  can't identify protocol
smsbox   9614  kannel  30u  sock  0,4           6497456  can't identify protocol
smsbox   9614  kannel  31u  sock  0,4           6497457  can't identify protocol
ok, the 1-2 minute backoff could simply be the result of us using HTTP/1.1
with keep-alive. So if you acquire a socket to the HTTP server, issue the
request and get the response, and nothing more is requested over the
still-active TCP connection, the server will wait 1-2 minutes before cutting
off the keep-alive connection. On the other hand, if traffic starts again, we
will pull the still-open connection from the connection pool and re-use it. So
the behaviour seems ok to me.
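You can watch this from the outside, e.g. like this (the IP and port are taken
from your log; 'pidof' assumes a single smsbox process):

  # sockets from the smsbox process to the HTTP server
  lsof -a -p `pidof smsbox` -i TCP@10.100.123.20:80

  # count sockets per TCP state towards the server
  netstat -tn | awk '$5 == "10.100.123.20:80" {print $6}' | sort | uniq -c

With keep-alive at work you should see ESTABLISHED sockets lingering for the
server's keep-alive timeout and then disappearing.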
Limiting the number of concurrent connections that the http server
allows doesn't help - Kannel always seems to have more open file
descriptors than connections to the http server.
???
connections should be re-used from the connection pool. Can you give us more
details on this, please?
So: can someone confirm whether this really is a problem with Kannel (I
would be very surprised if it's been in use in production all this time
with this kind of behaviour under load)? Is our configuration wrong,
perhaps? How should we be configuring Kannel to deal with this kind of
situation?
Kannel is in production use at numerous high-load sites. So this remains
strange to me.
Also, we noticed some comments in gwlib/http.c,
/* XXX re-implement socket pools, with idle connection killing to save
sockets */
now, this should only have an impact if we constantly have requests to
different hosts. If it's the same host, connections should be re-used.
/* XXX set maximum number of concurrent connections to same host, total? */
this one is a TODO still, yes.
These look like they may be directly related to the problem we're
experiencing - is anyone working on these tasks, and if so, is there an
ETA to implementation?
can you please re-try, spreading the MO load over, let's say, 2-4 smsbox
connections, along the lines below...
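i.e. roughly (binary location per your --prefix; config file names are
illustrative):

  # start additional smsbox instances, each with its own config
  /opt/kannel/sbin/smsbox /opt/kannel/etc/smsbox-2.conf &
  /opt/kannel/sbin/smsbox /opt/kannel/etc/smsbox-3.conf &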
It's still curious why sockets are kept open that way...
Stipe
-------------------------------------------------------------------
Kölner Landstrasse 419
40589 Düsseldorf, NRW, Germany
tolj.org system architecture Kannel Software Foundation (KSF)
http://www.tolj.org/ http://www.kannel.org/
mailto:st_{at}_tolj.org mailto:stolj_{at}_kannel.org
-------------------------------------------------------------------