Thanks for your reply Stipe - much appreciated!
On 19/08/2006, at 2:08 AM, Stipe Tolj wrote:
Hi Giulio,
Giulio Harding wrote:
We're looking for some help/advice with a Kannel problem we've
just encountered.
- We're using Kannel CVS (up-to-date as of today, 28/07/2006)
- Kannel is built with --prefix=/opt/kannel --enable-start-stop-daemon
- Kannel is running on an Intel Xeon 2.8 GHz w/ 2 GB RAM, running
Centos 4.3 with all the latest updates
- We are using two instances of OpenSMPP SMSC simulator running on
two separate machines (similar specs to above) generating a total
of 200000 messages as fast as they can.
- Kannel is delivering to a stock httpd running on a separate
machine, which returns a small (6 byte) response body.
Our problem is this: when injecting messages we get a throughput
in excess of 600 inbound SMPP MO messages per second. Shortly
after starting to inject the messages, we see a large number of
errors in smsbox.log as follows:
2006-07-28 16:55:23 [18715] [4] INFO: Starting to service <testXXX> from <61432123123> to <1234>
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't create new socket.
2006-07-28 16:55:23 [18715] [9] ERROR: System error 24: Too many open files
2006-07-28 16:55:23 [18715] [9] ERROR: error connecting to server `10.100.123.20' at port `80'
2006-07-28 16:55:23 [18715] [9] ERROR: Couldn't send request to <http://10.100.123.20/test.html?msgData=testXXX&sourceAddr=61432123123&channel=test&destinationAddr=1234>
ok, this seems to me like you're generating a very large stream of MO
messages (using that Logica tool) over 2 SMPP connections. But you
may be using only *one* smsbox connected to the bearerbox, right?
That's right.
In such a case you have an unbalanced situation. Of course HTTP is
a more "overhead-heavy" protocol for transporting your SMS payload than
SMPP is. So you have 2 connections inbound that stream very fast, but
only one smsbox that calls HTTP to the application layer.
For this purpose Kannel's architecture can be used to load-balance
the HTTP side, by connecting more smsbox instances to bearerbox.
Actually you can connect 1-n instances, as long as you use a
different sendsms HTTP port in each config for every smsbox.
This will cause bearerbox to round-robin the MO messages (actually
obeying the heartbeat load indication a connected smsbox gives)
to the n connected smsbox instances.
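For illustration, a minimal sketch of what two such smsbox configs might
look like (only the smsbox groups are shown; hosts and ports are made-up
example values, adjust to your own setup):

    # smsbox-1.conf
    group = smsbox
    bearerbox-host = localhost
    sendsms-port = 13013

    # smsbox-2.conf
    group = smsbox
    bearerbox-host = localhost
    sendsms-port = 13014

Each instance is started with its own config file; both connect to the
same bearerbox, which then spreads the MO load across them.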
Using 'lsof' and tweaking the 'nofile' limit in
'/etc/security/limits.conf' confirms that Kannel is hitting the
system's limit on the number of open files per process when trying to
create connections to the HTTP server. The errors are generated in the
function 'static Connection *get_connection(HTTPServer *trans)' in
gwlib/http.c. As a result of these errors, even though the Kannel
inbound counter indicates that all the incoming messages have been
received, many of them are not successfully delivered to the HTTP
server. They are not retransmitted, and are effectively lost.
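(For reference, a sketch of the limit tweak - assuming the standard PAM
limits file and that the boxes run as the 'kannel' user; the value is
only an example:)

    # /etc/security/limits.conf
    kannel  soft  nofile  65536
    kannel  hard  nofile  65536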
usually, if we can't reach the HTTP server, smsbox will retry.
Retries can be configured via 2 specific config directives, see the
user's guide. In case we get a hard system error, like in this case
where we can't acquire a socket for TCP transport, we fail the
transport and the message is lost.
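(For reference, the two directives meant here appear to be
http-request-retry and http-queue-delay in the smsbox group - please
double-check against the user's guide; the values below are purely
illustrative:)

    group = smsbox
    bearerbox-host = localhost
    # retry a failed HTTP call up to 3 times, waiting 10 seconds between attempts
    http-request-retry = 3
    http-queue-delay = 10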
The message is still considered "received", since bearerbox has
received the message. That's no guarantee that the application
layer is getting it. In the worst case you could configure smsbox to
discard all messages, which would still mean that physically they have
been received by the core transport layer, bearerbox.
That's why smsbox and bearerbox have their own access-log files.
We think there are some problems cleaning up connections to the
HTTP server after they have been used - lsof shows that it can
take some time (up to a minute or two) for the number of open file
descriptors for the Kannel processes to drop after traffic has
stopped passing through Kannel. Also, when this problem is
encountered, lsof shows a large number of strange sockets, e.g.:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
...
smsbox 9614 kannel 29u sock 0,4 6497455 can't identify protocol
smsbox 9614 kannel 30u sock 0,4 6497456 can't identify protocol
smsbox 9614 kannel 31u sock 0,4 6497457 can't identify protocol
ok, a 1-2 minute backoff could simply be the result of us using
HTTP/1.1 with keep-alive. So if you acquire a socket to the HTTP server,
issue the request and get the response, and nothing more is
requested over the still-active TCP connection, the server will
wait 1-2 minutes before cutting off the keep-alive connection.
On the other hand, if traffic starts again, we will pull the
still-open connection from the connection pool and re-use it. So
the behaviour seems ok to me.
Hmmm, Kannel's behaviour still seems incorrect to me. Smsbox is
opening too many HTTP connections, and is thus running out of file
descriptors, and falling over. It may be reusing connections in its
pool, but the pool size seems to be unbounded. This is not a good
thing...
Under load, lsof on the server running Kannel shows a large number of
open TCP connections (986 at the time the excerpt below was taken, not
counting the 'strange' unidentifiable sockets that Kannel also has
open) - too many, since it reaches the OS limit for open files:
...
smsbox 18683 kannel 168u IPv4 2067802 TCP 10.100.123.233:47198->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 169u IPv4 2069629 TCP 10.100.123.233:47576->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 170u IPv4 2069664 TCP 10.100.123.233:47581->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 171u IPv4 2067803 TCP 10.100.123.233:47199->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 172u IPv4 2067804 TCP 10.100.123.233:47200->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 173u IPv4 2069665 TCP 10.100.123.233:47582->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 174u IPv4 2069666 TCP 10.100.123.233:47583->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 175u IPv4 2067805 TCP 10.100.123.233:47201->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 176u IPv4 2069667 TCP 10.100.123.233:47584->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 177u IPv4 2067806 TCP 10.100.123.233:47202->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 178u IPv4 2067807 TCP 10.100.123.233:47203->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 179u IPv4 2067808 TCP 10.100.123.233:47204->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 180u IPv4 2069668 TCP 10.100.123.233:47585->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 181u IPv4 2069669 TCP 10.100.123.233:47586->10.100.123.20:http (ESTABLISHED)
smsbox 18683 kannel 182u IPv4 2067809 TCP 10.100.123.233:47205->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 183u IPv4 2067810 TCP 10.100.123.233:47206->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 184u IPv4 2067811 TCP 10.100.123.233:47207->10.100.123.20:http (SYN_SENT)
smsbox 18683 kannel 185u IPv4 2067812 TCP 10.100.123.233:47208->10.100.123.20:http (SYN_SENT)
...
On the application Apache HTTP server (with its maximum number of
connections set to 5 for testing purposes), lsof only ever shows 5
connections established at a time:
httpd 31403 root 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31404 apache 8u IPv6 183249 TCP cookiemonster:http->10.100.123.233:42685 (ESTABLISHED)
httpd 31406 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31406 apache 8u IPv6 183251 TCP cookiemonster:http->10.100.123.233:42687 (ESTABLISHED)
httpd 31408 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31408 apache 8u IPv6 183252 TCP cookiemonster:http->10.100.123.233:42811 (ESTABLISHED)
httpd 31409 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31409 apache 8u IPv6 183250 TCP cookiemonster:http->10.100.123.233:42686 (ESTABLISHED)
httpd 31410 apache 3u IPv6 182516 TCP *:http (LISTEN)
httpd 31410 apache 8u IPv6 183253 TCP cookiemonster:http->10.100.123.233:42812 (ESTABLISHED)
The disparity between the two sides of the TCP connections is
probably due to backlog - only 5 TCP connections are ever established
with the HTTP server at a time, but the rest of the connections from
Kannel are 'pending' as per the OS/Apache TCP backlog limits. These
connections won't receive a 'connection refused'; they'll simply wait
until resources are available, whereupon they'll be fully established.
Because these connections aren't explicitly failing, and because of
the rate of incoming MO messages, smsbox appears to be simply
creating HTTP connections faster than they are being used. The only
solution is to forcibly reduce the rate of message delivery, or to
limit the number of HTTP connections to the application HTTP server
(as mentioned by Alexander Malysh in his reply to my original email).
Also, those 'strange' sockets that Kannel has look like they've been
closed on the other side (by the application HTTP server) and are
thus useless, and simply haven't been cleaned up. When an HTTP
connection is closed by the server, it should be cleaned up on
Kannel's side as soon as possible - certainly before attempting to
create new HTTP connections.
Limiting the number of concurrent connections that the http server
allows doesn't help - Kannel always seems to have more open file
descriptors than connections to the http server.
???
connections should be re-used from the connection pool. Can you
give us more details on this, please?
Sorry, I wasn't very clear - we tried limiting the number of
concurrent HTTP connections that the HTTP server would allow from
Kannel, to see if Kannel would reduce its connection/socket usage
accordingly (i.e. if Kannel got a 'connection refused' from the
HTTP server). As it turns out, the TCP backlog means that Kannel
can't really tell whether the application HTTP server can keep up
with its HTTP connections, so my comment above is redundant.
So: can someone confirm whether this really is a problem with
Kannel (I would be very surprised if it's been in use, in
production all this time with this kind of behaviour under load)?
Is our configuration wrong, perhaps? How should we be configuring
Kannel to deal with this kind of situation?
Kannel is in production use at numerous high-load sites, so this
remains strange to me.
Also, we noticed some comments in gwlib/http.c,
/* XXX re-implement socket pools, with idle connection killing to save sockets */
now, this should only have an impact if we constantly have requests
to different hosts. If it's the same host, connections should be
re-used.
Ok.
/* XXX set maximum number of concurrent connections to same host, total? */
this one is a TODO still, yes.
I think this is the crux of the problem: this TODO, and our issue,
would be addressed by the patch that Alexander has offered. Given
that he's indicated it is a fairly simple patch, and the issue
seems (to me) quite serious, I think this is worth focussing on ASAP.
These look like they may be directly related to the problem we're
experiencing - is anyone working on these tasks, and if so, is
there an ETA to implementation?
can you please re-try, spreading the MO load over, let's say, 2-4
smsbox connections....
It's still curious why sockets are kept open that way...
Ok, I'll try this out, with multiple smsboxes on the same machine,
and possibly multiple smsboxes on multiple machines (if I can source
some) - I'll also try fiddling with the priority of the bearerbox
process (using nice) to see if I can get the SMPP and HTTP processing
better balanced...
However, none of these workarounds will guarantee correct behaviour
under load, as there is nothing preventing Kannel from opening too
many sockets. As I mentioned before, I think this is a bug.
Stipe
-------------------------------------------------------------------
Kölner Landstrasse 419
40589 Düsseldorf, NRW, Germany
tolj.org system architecture Kannel Software Foundation (KSF)
http://www.tolj.org/ http://www.kannel.org/
mailto:st_{at}_tolj.org mailto:stolj_{at}_kannel.org
-------------------------------------------------------------------
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com