We've been doing some more stress-testing of Kannel, this time under
high-load over long periods of time, an we've encountered some
problems when the message store grows large, of the order of hundreds
of thousands of messages (in our case, because we're delivering MTs
slower than we are receiving MOs, and due to the use of synchronous
responses to MO HTTP requests, the MTs are queuing at the bearerbox).
We understand that this is an extreme situation, and unlikely to be
encountered in normal usage, but we want to cover all our bases
(worst-case scenarios) and make sure that Kannel behaves correctly/
predictably in bad situations.
- It seems that reading from the message store during Kannel startup
is single-threaded, and CPU bound - for large stores (several hundred
thousand messages), this can result in very long start-up times,
since Kannel seems to not process any messages in the store until it
has finished reading them *all* into memory from disk.
- Why can't they simply be processed from disk? (i.e. read message
from disk, process message, read next message from disk) Given that
the messages are in a persistent store, why do they need to be in
memory as well?
- The bearerbox memory usage grows by around 100MB per hundred
thousand messages in the store, as the messages are read from the
store, which leads to VM thrashing under extreme circumstances, as
the bearerbox footprint grows beyond the RAM size of the server...
For one test, after the initial loading of ~634,000 MTs from the
store, and bearerbox was using ~780M of RAM. 1+ KB per message seems
a lot, given that messages seems to be occupying more space in memory
than on disk (that number of MTs was consuming about 255MB on disc)
- Once messages are read into memory, they seem to be copied into
another queue (this exacerbates the problem above) - why is this?
[from kannel admin page]
...
Status: running, uptime 0d 0h 12m 39s
WDP: received 0 (0 queued), sent 0 (0 queued)
v-------------- *** this number has stopped growing ***
SMS: received 0 (329153 queued), sent 4 (0 queued), store size
634247
^--------------- *** this number is growing ***
SMS: inbound 0.00 msg/sec, outbound 0.01 msg/sec
...
(the queue sizes for all the SMPP binds are also growing, so I
assume their queues are related to this main queue somehow?)
At this point, bearerbox's memory footprint is *still* growing
(eventually reaching 1264M virtual, 932M resident) and the server is
spending 60+ percent of CPU cycles on disk IO, as the VM thrashes.
- Finally, Kannel's queue sizes etc. seem to stop growing:
[kannel admin page]
Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006
22:56:51',
compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux,
release
2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006,
machine
x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16.
Using native
malloc.
Status: running, uptime 0d 0h 25m 5s
WDP: received 0 (0 queued), sent 0 (0 queued)
SMS: received 0 (531104 queued), sent 4 (0 queued), store size
634247
SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec
DLR: 0 queued, using internal storage
No boxes connected
SMSC connections:
test SMPP:10.100.123.20:7011/7011:test:smpp (online 178s,
rcvd 0,
sent 1, failed 0, queued 25776 msgs)
test SMPP:10.100.123.20:7012/7012:test:smpp (online 177s,
rcvd 0,
sent 1, failed 0, queued 25775 msgs)
test SMPP:10.100.123.20:7013/7013:test:smpp (online 1505s,
rcvd 0,
sent 1, failed 0, queued 25775 msgs)
test SMPP:10.100.123.20:7014/7014:test:smpp (online 1505s,
rcvd 0,
sent 1, failed 0, queued 25776 msgs)
BUT, the queue size doesn't match the store size, and Kannel has NOT
resumed sending those messages (apart from 1 per SMSC, not sure why)
The bearerbox still seems to be busy doing something, even though it
doesn't seem to be loading messages from store into queues, or
sending them:
[top output]
Mem: 1024580k total, 1009916k used, 14664k free, 248k
buffers
top - 01:46:39 up 1 day, 2:15, 4 users, load average: 1.68, 2.03,
1.81
Tasks: 82 total, 2 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 16.5% us, 1.3% sy, 0.0% ni, 49.8% id, 32.1% wa, 0.0% hi,
0.3% si
Mem: 1024580k total, 1009916k used, 14664k free, 332k
buffers
Swap: 2104496k total, 937960k used, 1166536k free, 68740k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18021 kannel 16 0 1265m 812m 924 D 35.3 81.2 18:58.47 bearerbox
56 root 15 0 0 0 0 S 0.7 0.0 4:51.60 kswapd0
- At this point, it seems that Kannel is somewhat broken... and of
course, restarting it simply goes through this process all over again
(reading messages from store, consuming memory, eventually stopping
and doing nothing)
ADDENDUM:
After waiting a while, the 'SMS' queue counter at the top of the page
dropped to zero (store size was still ~640k) and the smsbox (which
hadn't actually connected to the bearerbox until the bearerbox had
finished all its loading of messages from the store) was finally
connected, and its queue size had grown to a similar size as the
bearerbox's queue:
[kannel admin page]
Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006
22:56:51',
compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux,
release
2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006,
machine
x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16.
Using native
malloc.
Status: running, uptime 0d 0h 33m 18s
WDP: received 0 (0 queued), sent 0 (0 queued)
SMS: received 0 (0 queued), sent 8 (0 queued), store size 634243
SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec
DLR: 0 queued, using internal storage
Box connections:
smsbox:(none), IP 127.0.0.1 (531104 queued), (on-line 0d 0h
6m 45s) <----- *** this is a recent change (originally, while
bearerbox was busy, no other boxes were connected) ***
SMSC connections:
test SMPP:10.100.123.20:7011/7011:test:smpp (online 671s,
rcvd 0,
sent 2, failed 0, queued 25775 msgs)
test SMPP:10.100.123.20:7012/7012:test:smpp (online 670s,
rcvd 0,
sent 2, failed 0, queued 25774 msgs)
test SMPP:10.100.123.20:7013/7013:test:smpp (online 1998s,
rcvd 0,
sent 2, failed 0, queued 25776 msgs)
test SMPP:10.100.123.20:7014/7014:test:smpp (online 1998s,
rcvd 0,
sent 2, failed 0, queued 25775 msgs)
At this point, 4 more messages have been delivered, but there is no
other activity, and bearerbox is still chewing up CPU cycles and
memory:
[top output]
Mem: 1024580k total, 1009916k used, 14664k free, 248k
buffers
top - 01:55:12 up 1 day, 2:23, 4 users, load average: 1.15, 1.84,
1.88
Tasks: 82 total, 1 running, 81 sleeping, 0 stopped, 0 zombie
Cpu(s): 15.5% us, 1.2% sy, 0.0% ni, 50.1% id, 32.9% wa, 0.0% hi,
0.3% si
Mem: 1024580k total, 1009892k used, 14688k free, 688k
buffers
Swap: 2104496k total, 1042352k used, 1062144k free, 58272k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18021 kannel 16 0 1297m 770m 1020 S 33.0 77.0 19:48.88 bearerbox
56 root 15 0 0 0 0 S 0.3 0.0 5:20.49 kswapd
... after waiting a while further, the server is becoming
unresponsive, and bearerbox and the VM are still thrashing:
[top output]
Mem: 1024580k total, 1009916k used, 14664k free, 248k
buffers
top - 02:02:01 up 1 day, 2:30, 4 users, load average: 6.15, 3.51,
2.45
Tasks: 82 total, 1 running, 81 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1% us, 0.4% sy, 0.0% ni, 47.5% id, 51.4% wa, 0.1% hi,
0.4% si
Mem: 1024580k total, 1009572k used, 15008k free, 236k
buffers
Swap: 2104496k total, 1038460k used, 1066036k free, 11292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56 root 15 0 0 0 0 S 8.4 0.0 5:56.76 kswapd0
3011 root 15 0 18824 1436 512 D 6.0 0.1 0:57.46 hald
18248 giulio.h 15 0 6148 848 668 R 3.5 0.1 0:07.41 top
2443 root 15 0 2548 272 196 S 3.2 0.0 0:02.76
irqbalance
2920 root 15 0 34996 744 632 S 3.2 0.1 0:24.00 sendmail
16918 giulio.h 16 0 53736 376 356 S 3.2 0.0 0:03.72 tail
18021 kannel 16 0 1276m 936m 1028 S 0.5 93.6 20:30.30 bearerbox
So:
- Any comments on Kannel's behaviour under these circumstances?
- Apart from stopping Kannel, moving the message store file aside,
and restarting Kannel, I can't think of any other way of gracefully
recovering from such a situation - this makes disaster recovery
difficult...
- Could Kannel be modified so that it processes stored messages more
efficiently? (read messages direct from disk, reduce size of messages
in memory, paralellize processing of stored messages on startup?)
Thanks,
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com