We've been doing some more stress-testing of Kannel, this time under high load over long periods of time, and we've encountered some problems when the message store grows large, on the order of hundreds of thousands of messages. (In our case this happens because we're delivering MTs more slowly than we're receiving MOs, and because we use synchronous responses to MO HTTP requests, so the MTs queue up at the bearerbox.) We understand that this is an extreme situation, unlikely to be encountered in normal usage, but we want to cover all our bases (worst-case scenarios) and make sure that Kannel behaves correctly and predictably in bad situations.

- It seems that reading from the message store during Kannel startup is single-threaded and CPU-bound - for large stores (several hundred thousand messages), this can result in very long start-up times, since Kannel appears not to process any messages in the store until it has finished reading them *all* into memory from disk.

- Why can't they simply be processed from disk (i.e. read a message from disk, process it, read the next one)? Given that the messages are in a persistent store, why do they need to be held in memory as well? (A sketch of the streaming approach we have in mind follows this list.)

- The bearerbox's memory usage grows by around 100MB per hundred thousand messages read from the store, which leads to VM thrashing under extreme circumstances, as the bearerbox footprint grows beyond the RAM size of the server... In one test, after the initial load of ~634,000 MTs from the store, the bearerbox was using ~780M of RAM. That's roughly 1.2KB per message in memory, versus roughly 0.4KB per message on disk (the same MTs occupied about 255MB in the store file) - so messages seem to take around three times as much space in memory as on disk, which seems a lot.

- Once messages are read into memory, they seem to be copied into another queue, which exacerbates the memory problem above - why is this?
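
To make the streaming idea concrete, here's a minimal C sketch of the pattern we have in mind: read one message, hand it off, free it, read the next, so resident memory is bounded by one message rather than by the store size. The length-prefixed record format, the store path, and the route_message() hand-off are all hypothetical placeholders for illustration - this is not Kannel's actual store format or API:

[illustrative C sketch]

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk record: a 4-byte big-endian length prefix
 * followed by a packed message body. Kannel's real store format
 * differs; this only illustrates the one-at-a-time reading pattern. */
static int read_one_message(FILE *store, char **buf, size_t *len)
{
    unsigned char hdr[4];

    if (fread(hdr, 1, 4, store) != 4)
        return -1;                        /* EOF or truncated record */
    *len = (size_t) hdr[0] << 24 | (size_t) hdr[1] << 16
         | (size_t) hdr[2] << 8  | (size_t) hdr[3];
    *buf = malloc(*len);
    if (*buf == NULL)
        return -1;
    if (fread(*buf, 1, *len, store) != *len) {
        free(*buf);
        return -1;
    }
    return 0;
}

int main(void)
{
    FILE *store = fopen("kannel.store", "rb");  /* path is illustrative */
    char *msg;
    size_t len;

    if (store == NULL)
        return 1;

    /* Process each message as it is read, instead of loading the
     * whole store into memory before doing anything with it. */
    while (read_one_message(store, &msg, &len) == 0) {
        /* route_message(msg, len);  hypothetical hand-off to a sender */
        free(msg);                 /* nothing accumulates in memory */
    }

    fclose(store);
    return 0;
}

The point being: the persistent store already holds every message, so the in-memory working set only needs to cover what's actually in flight.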


[from kannel admin page]

...

   Status: running, uptime 0d 0h 12m 39s

   WDP: received 0 (0 queued), sent 0 (0 queued)
                                         v-- *** this number has stopped growing ***
   SMS: received 0 (329153 queued), sent 4 (0 queued), store size 634247
                    ^--------------- *** this number is growing ***
   SMS: inbound 0.00 msg/sec, outbound 0.01 msg/sec

...


(the queue sizes for all the SMPP binds are also growing, so I assume their queues are related to this main queue somehow?)

At this point, the bearerbox's memory footprint is *still* growing (eventually reaching 1264M virtual, 932M resident), and the server is spending 60+ percent of its CPU cycles waiting on disk I/O as the VM thrashes.

- Finally, Kannel's queue sizes etc. seem to stop growing:


[kannel admin page]

Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51', compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006, machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16. Using native malloc.

   Status: running, uptime 0d 0h 25m 5s

   WDP: received 0 (0 queued), sent 0 (0 queued)

   SMS: received 0 (531104 queued), sent 4 (0 queued), store size 634247

   SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec

   DLR: 0 queued, using internal storage

   No boxes connected

   SMSC connections:
   test SMPP:10.100.123.20:7011/7011:test:smpp (online 178s, rcvd 0, sent 1, failed 0, queued 25776 msgs)
   test SMPP:10.100.123.20:7012/7012:test:smpp (online 177s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7013/7013:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7014/7014:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25776 msgs)


BUT the queue size doesn't match the store size, and Kannel has NOT resumed sending those messages (apart from 1 per SMSC - not sure why).

The bearerbox still seems to be busy doing something, even though it doesn't seem to be loading messages from store into queues, or sending them:


[top output]

top - 01:46:39 up 1 day, 2:15, 4 users, load average: 1.68, 2.03, 1.81
Tasks:  82 total,   2 running,  80 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.5% us, 1.3% sy, 0.0% ni, 49.8% id, 32.1% wa, 0.0% hi, 0.3% si
Mem:   1024580k total,  1009916k used,    14664k free,      332k buffers
Swap:  2104496k total,   937960k used,  1166536k free,    68740k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18021 kannel    16   0 1265m 812m  924 D 35.3 81.2  18:58.47 bearerbox
   56 root      15   0     0    0    0 S  0.7  0.0   4:51.60 kswapd0


- At this point, Kannel seems to be somewhat broken... and of course, restarting it simply goes through this whole process again (reading messages from the store, consuming memory, and eventually stalling, doing nothing).

ADDENDUM:

After waiting a while, the 'SMS' queue counter at the top of the page dropped to zero (the store size was still ~640k). The smsbox, which hadn't been able to connect to the bearerbox until the bearerbox had finished loading messages from the store, finally connected, and its queue had grown to a similar size to the bearerbox's queue:


[kannel admin page]

Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51', compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006, machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16. Using native malloc.

   Status: running, uptime 0d 0h 33m 18s

   WDP: received 0 (0 queued), sent 0 (0 queued)

   SMS: received 0 (0 queued), sent 8 (0 queued), store size 634243

   SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec

   DLR: 0 queued, using internal storage

   Box connections:
   smsbox:(none), IP 127.0.0.1 (531104 queued), (on-line 0d 0h 6m 45s) <----- *** this is a recent change (originally, while bearerbox was busy, no other boxes were connected) ***

   SMSC connections:
   test SMPP:10.100.123.20:7011/7011:test:smpp (online 671s, rcvd 0, sent 2, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7012/7012:test:smpp (online 670s, rcvd 0, sent 2, failed 0, queued 25774 msgs)
   test SMPP:10.100.123.20:7013/7013:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25776 msgs)
   test SMPP:10.100.123.20:7014/7014:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25775 msgs)


At this point, 4 more messages have been delivered, but there is no other activity, and bearerbox is still chewing up CPU cycles and memory:


[top output]

top - 01:55:12 up 1 day, 2:23, 4 users, load average: 1.15, 1.84, 1.88
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.5% us, 1.2% sy, 0.0% ni, 50.1% id, 32.9% wa, 0.0% hi, 0.3% si
Mem:   1024580k total,  1009892k used,    14688k free,      688k buffers
Swap:  2104496k total,  1042352k used,  1062144k free,    58272k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18021 kannel    16   0 1297m 770m 1020 S 33.0 77.0  19:48.88 bearerbox
   56 root      15   0     0    0    0 S  0.3  0.0   5:20.49 kswapd


... after waiting a while longer, the server is becoming unresponsive, and the bearerbox and the VM are still thrashing:

[top output]

top - 02:02:01 up 1 day, 2:30, 4 users, load average: 6.15, 3.51, 2.45
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.1% us, 0.4% sy, 0.0% ni, 47.5% id, 51.4% wa, 0.1% hi, 0.4% si
Mem:   1024580k total,  1009572k used,    15008k free,      236k buffers
Swap:  2104496k total,  1038460k used,  1066036k free,    11292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   56 root      15   0     0    0    0 S  8.4  0.0   5:56.76 kswapd0
 3011 root      15   0 18824 1436  512 D  6.0  0.1   0:57.46 hald
18248 giulio.h  15   0  6148  848  668 R  3.5  0.1   0:07.41 top
 2443 root      15   0  2548  272  196 S  3.2  0.0   0:02.76 irqbalance
 2920 root      15   0 34996  744  632 S  3.2  0.1   0:24.00 sendmail
16918 giulio.h  16   0 53736  376  356 S  3.2  0.0   0:03.72 tail
18021 kannel    16   0 1276m 936m 1028 S  0.5 93.6  20:30.30 bearerbox


So:

- Any comments on Kannel's behaviour under these circumstances?
- Apart from stopping Kannel, moving the message store file aside, and restarting Kannel, I can't think of any other way of gracefully recovering from such a situation, which makes disaster recovery difficult.
- Could Kannel be modified so that it processes stored messages more efficiently? (e.g. read messages directly from disk, reduce the in-memory size of messages, or parallelize the processing of stored messages on startup - a sketch of the bounded-loading idea follows below)
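
On that last point, here's a minimal sketch of the bounded-loading idea: the store loader blocks once a fixed number of messages are in memory and only continues as the sender drains the queue, so the startup footprint is capped regardless of store size. All the names, the queue, and the sizes here are invented for the illustration - this is not Kannel's actual threading code:

[illustrative C sketch]

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX_IN_FLIGHT 4   /* cap on stored messages held in RAM */
#define STORE_SIZE    20  /* stand-in for the number of stored msgs */

static sem_t free_slots;  /* loader waits here when the queue is full */
static sem_t used_slots;  /* sender waits here when the queue is empty */
static int queue[MAX_IN_FLIGHT];
static int head, tail;

/* Loader: reads messages from the store one at a time, but blocks as
 * soon as MAX_IN_FLIGHT are in memory, instead of reading the whole
 * store before any sending starts. */
static void *loader(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < STORE_SIZE; i++) {
        sem_wait(&free_slots);
        queue[tail] = i;                   /* "read message i from disk" */
        tail = (tail + 1) % MAX_IN_FLIGHT;
        sem_post(&used_slots);
    }
    return NULL;
}

/* Sender: drains the queue; each completed send frees a slot, which
 * unblocks the loader. */
static void *sender(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < STORE_SIZE; i++) {
        sem_wait(&used_slots);
        printf("sent message %d\n", queue[head]);
        head = (head + 1) % MAX_IN_FLIGHT;
        sem_post(&free_slots);
    }
    return NULL;
}

int main(void)
{
    pthread_t lt, st;

    sem_init(&free_slots, 0, MAX_IN_FLIGHT);
    sem_init(&used_slots, 0, 0);
    pthread_create(&lt, NULL, loader, NULL);
    pthread_create(&st, NULL, sender, NULL);
    pthread_join(lt, NULL);
    pthread_join(st, NULL);
    return 0;
}

With this shape, the bearerbox could presumably start accepting box connections and sending immediately, and its startup memory would be bounded by MAX_IN_FLIGHT messages rather than by the whole store.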

Thanks,

--
Giulio Harding
Systems Administrator

m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com




