Those are all valid points, and all observed here as well. The store-file read at startup is a bit startling, but very efficient. The solution is to limit the MT queue size, done by some handshaking at the smsbox: instead of the normal "Sent" reply, also have a "Queue full" reply; when that happens, the client program waits some time before trying again. BTW, our app has its own queue, so it is not completely dependent on queuing in Kannel. For MO messages, make sure you receive/process them from smsbox as quickly as possible. I find that with the current implementation (and our hardware, of course), once the store file reaches ~50,000 messages, the overheads start becoming intolerable.
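To illustrate the handshake I mean, here is a minimal client-side sketch (Python). This assumes a modified smsbox that answers "Queue full" instead of the normal "Sent" acknowledgement once its MT queue limit is reached -- that reply string, and the `send` callable standing in for whatever HTTP call your app makes to smsbox, are assumptions for illustration, not Kannel's stock sendsms interface:

```python
import time

def submit_mt(send, message, retry_delay=5.0, max_attempts=10,
              sleep=time.sleep):
    """Submit one MT, backing off while smsbox reports a full queue.

    `send(message)` returns smsbox's textual reply. On the assumed
    "Queue full" reply we wait and retry; any other reply means the
    message was accepted. Returns the number of attempts used, or
    raises if the queue never drains (our app then keeps the message
    in its own application-side queue).
    """
    for attempt in range(1, max_attempts + 1):
        if send(message) != "Queue full":
            return attempt          # accepted ("Sent" or similar)
        sleep(retry_delay)          # queue full: wait, then retry
    raise RuntimeError("smsbox queue stayed full")
```

The point is simply that backpressure moves to the client instead of piling MTs up in the bearerbox store.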
----- Original Message -----
From: "Giulio Harding" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, August 31, 2006 5:24 PM
Subject: Problematic Kannel behaviour with large message stores

> We've been doing some more stress-testing of Kannel, this time under
> high load over long periods of time, and we've encountered some
> problems when the message store grows large, of the order of hundreds
> of thousands of messages (in our case, because we're delivering MTs
> slower than we are receiving MOs, and due to the use of synchronous
> responses to MO HTTP requests, the MTs are queuing at the bearerbox).
> We understand that this is an extreme situation, and unlikely to be
> encountered in normal usage, but we want to cover all our bases
> (worst-case scenarios) and make sure that Kannel behaves
> correctly/predictably in bad situations.
>
> - It seems that reading from the message store during Kannel startup
> is single-threaded and CPU-bound - for large stores (several hundred
> thousand messages), this can result in very long start-up times,
> since Kannel seems to not process any messages in the store until it
> has finished reading them *all* into memory from disk.
>
> - Why can't they simply be processed from disk? (i.e. read message
> from disk, process message, read next message from disk) Given that
> the messages are in a persistent store, why do they need to be in
> memory as well?
>
> - The bearerbox memory usage grows by around 100MB per hundred
> thousand messages in the store, as the messages are read from the
> store, which leads to VM thrashing under extreme circumstances, as
> the bearerbox footprint grows beyond the RAM size of the server...
> For one test, after the initial loading of ~634,000 MTs from the
> store, bearerbox was using ~780M of RAM. 1+ KB per message seems
> a lot, given that messages seem to be occupying more space in memory
> than on disk (that number of MTs was consuming about 255MB on disk).
>
> - Once messages are read into memory, they seem to be copied into
> another queue (this exacerbates the problem above) - why is this?
>
> > [from kannel admin page]
> > ...
> > Status: running, uptime 0d 0h 12m 39s
> > WDP: received 0 (0 queued), sent 0 (0 queued)
> >                 v-------------- *** this number has stopped growing ***
> > SMS: received 0 (329153 queued), sent 4 (0 queued), store size 634247
> >                                                     ^--------------- *** this number is growing ***
> > SMS: inbound 0.00 msg/sec, outbound 0.01 msg/sec
> > ...
>
> (the queue sizes for all the SMPP binds are also growing, so I
> assume their queues are related to this main queue somehow?)
>
> At this point, bearerbox's memory footprint is *still* growing
> (eventually reaching 1264M virtual, 932M resident) and the server is
> spending 60+ percent of CPU cycles on disk IO, as the VM thrashes.
>
> - Finally, Kannel's queue sizes etc. seem to stop growing:
>
> > [kannel admin page]
> > Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51',
> > compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release
> > 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006,
> > machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16.
> > Using native malloc.
> >
> > Status: running, uptime 0d 0h 25m 5s
> > WDP: received 0 (0 queued), sent 0 (0 queued)
> > SMS: received 0 (531104 queued), sent 4 (0 queued), store size 634247
> > SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec
> > DLR: 0 queued, using internal storage
> > No boxes connected
> > SMSC connections:
> >     test    SMPP:10.100.123.20:7011/7011:test:smpp (online 178s, rcvd 0, sent 1, failed 0, queued 25776 msgs)
> >     test    SMPP:10.100.123.20:7012/7012:test:smpp (online 177s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
> >     test    SMPP:10.100.123.20:7013/7013:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
> >     test    SMPP:10.100.123.20:7014/7014:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25776 msgs)
>
> BUT, the queue size doesn't match the store size, and Kannel has NOT
> resumed sending those messages (apart from 1 per SMSC, not sure why).
>
> The bearerbox still seems to be busy doing something, even though it
> doesn't seem to be loading messages from store into queues, or
> sending them:
>
> > [top output]
> > top - 01:46:39 up 1 day, 2:15, 4 users, load average: 1.68, 2.03, 1.81
> > Tasks: 82 total, 2 running, 80 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 16.5% us, 1.3% sy, 0.0% ni, 49.8% id, 32.1% wa, 0.0% hi, 0.3% si
> > Mem: 1024580k total, 1009916k used, 14664k free, 332k buffers
> > Swap: 2104496k total, 937960k used, 1166536k free, 68740k cached
> >
> >   PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> > 18021 kannel  16  0 1265m 812m  924 D 35.3 81.2 18:58.47 bearerbox
> >    56 root    15  0     0    0    0 S  0.7  0.0  4:51.60 kswapd0
>
> - At this point, it seems that Kannel is somewhat broken... and of
> course, restarting it simply goes through this process all over again
> (reading messages from store, consuming memory, eventually stopping
> and doing nothing)
>
> ADDENDUM:
>
> After waiting a while, the 'SMS' queue counter at the top of the page
> dropped to zero (store size was still ~640k) and the smsbox (which
> hadn't actually connected to the bearerbox until the bearerbox had
> finished all its loading of messages from the store) was finally
> connected, and its queue size had grown to a similar size as the
> bearerbox's queue:
>
> > [kannel admin page]
> > Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51',
> > compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release
> > 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006,
> > machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16.
> > Using native malloc.
> >
> > Status: running, uptime 0d 0h 33m 18s
> > WDP: received 0 (0 queued), sent 0 (0 queued)
> > SMS: received 0 (0 queued), sent 8 (0 queued), store size 634243
> > SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec
> > DLR: 0 queued, using internal storage
> > Box connections:
> >     smsbox:(none), IP 127.0.0.1 (531104 queued), (on-line 0d 0h 6m 45s)
> >         <----- *** this is a recent change (originally, while
> >         bearerbox was busy, no other boxes were connected) ***
> > SMSC connections:
> >     test    SMPP:10.100.123.20:7011/7011:test:smpp (online 671s, rcvd 0, sent 2, failed 0, queued 25775 msgs)
> >     test    SMPP:10.100.123.20:7012/7012:test:smpp (online 670s, rcvd 0, sent 2, failed 0, queued 25774 msgs)
> >     test    SMPP:10.100.123.20:7013/7013:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25776 msgs)
> >     test    SMPP:10.100.123.20:7014/7014:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25775 msgs)
>
> At this point, 4 more messages have been delivered, but there is no
> other activity, and bearerbox is still chewing up CPU cycles and memory:
>
> > [top output]
> > top - 01:55:12 up 1 day, 2:23, 4 users, load average: 1.15, 1.84, 1.88
> > Tasks: 82 total, 1 running, 81 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 15.5% us, 1.2% sy, 0.0% ni, 50.1% id, 32.9% wa, 0.0% hi, 0.3% si
> > Mem: 1024580k total, 1009892k used, 14688k free, 688k buffers
> > Swap: 2104496k total, 1042352k used, 1062144k free, 58272k cached
> >
> >   PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> > 18021 kannel  16  0 1297m 770m 1020 S 33.0 77.0 19:48.88 bearerbox
> >    56 root    15  0     0    0    0 S  0.3  0.0  5:20.49 kswapd0
>
> ... after waiting a while further, the server is becoming
> unresponsive, and bearerbox and the VM are still thrashing:
>
> > [top output]
> > top - 02:02:01 up 1 day, 2:30, 4 users, load average: 6.15, 3.51, 2.45
> > Tasks: 82 total, 1 running, 81 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 0.1% us, 0.4% sy, 0.0% ni, 47.5% id, 51.4% wa, 0.1% hi, 0.4% si
> > Mem: 1024580k total, 1009572k used, 15008k free, 236k buffers
> > Swap: 2104496k total, 1038460k used, 1066036k free, 11292k cached
> >
> >   PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> >    56 root      15  0     0    0    0 S  8.4  0.0  5:56.76 kswapd0
> >  3011 root      15  0 18824 1436  512 D  6.0  0.1  0:57.46 hald
> > 18248 giulio.h  15  0  6148  848  668 R  3.5  0.1  0:07.41 top
> >  2443 root      15  0  2548  272  196 S  3.2  0.0  0:02.76 irqbalance
> >  2920 root      15  0 34996  744  632 S  3.2  0.1  0:24.00 sendmail
> > 16918 giulio.h  16  0 53736  376  356 S  3.2  0.0  0:03.72 tail
> > 18021 kannel    16  0 1276m 936m 1028 S  0.5 93.6 20:30.30 bearerbox
>
> So:
>
> - Any comments on Kannel's behaviour under these circumstances?
> - Apart from stopping Kannel, moving the message store file aside,
> and restarting Kannel, I can't think of any other way of gracefully
> recovering from such a situation - this makes disaster recovery
> difficult...
> - Could Kannel be modified so that it processes stored messages more
> efficiently? (read messages directly from disk, reduce size of
> messages in memory, parallelize processing of stored messages on
> startup?)
>
> Thanks,
>
> --
> Giulio Harding
> Systems Administrator
>
> m.Net Corporation
> Level 13, 99 Gawler Place
> Adelaide SA 5000, Australia
>
> Tel: +61 8 8210 2041
> Fax: +61 8 8211 9620
> Mobile: 0432 876 733
> MSN: [EMAIL PROTECTED]
>
> http://www.mnetcorporation.com
