We've been doing some more stress-testing of Kannel, this time under high load over long periods of time, and we've encountered some problems when the message store grows large, on the order of hundreds of thousands of messages. (In our case this happens because we're delivering MTs more slowly than we're receiving MOs, and because we use synchronous responses to MO HTTP requests, so the MTs queue up at the bearerbox.) We understand that this is an extreme situation, unlikely to be encountered in normal usage, but we want to cover all our bases (worst-case scenarios) and make sure that Kannel behaves correctly and predictably in bad situations.

- It seems that reading from the message store during Kannel startup is single-threaded and CPU-bound - for large stores (several hundred thousand messages), this can result in very long start-up times, since Kannel appears not to process any messages in the store until it has finished reading them *all* into memory from disk.

- Why can't they simply be processed from disk (i.e. read a message from disk, process it, read the next one)? Given that the messages are in a persistent store, why do they need to be held in memory as well? (A sketch of the streaming approach we have in mind follows this list.)

- The bearerbox's memory usage grows by around 100MB per hundred thousand messages read from the store, which leads to VM thrashing under extreme circumstances, as the bearerbox footprint grows beyond the RAM size of the server... In one test, after the initial load of ~634,000 MTs from the store, the bearerbox was using ~780M of RAM. That's roughly 1.2KB per message in memory, versus roughly 0.4KB per message on disk (the same MTs occupied about 255MB in the store file) - so messages seem to take around three times as much space in memory as on disk, which seems a lot.

- Once messages are read into memory, they seem to be copied into another queue, which exacerbates the memory problem above - why is this?
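
To make the streaming idea concrete, here's a minimal C sketch of the pattern we have in mind: read one message, hand it off, free it, read the next, so resident memory is bounded by one message rather than by the store size. The length-prefixed record format, the store path, and the route_message() hand-off are all hypothetical placeholders for illustration - this is not Kannel's actual store format or API:

[illustrative C sketch]

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk record: a 4-byte big-endian length prefix
 * followed by a packed message body. Kannel's real store format
 * differs; this only illustrates the one-at-a-time reading pattern. */
static int read_one_message(FILE *store, char **buf, size_t *len)
{
    unsigned char hdr[4];

    if (fread(hdr, 1, 4, store) != 4)
        return -1;                        /* EOF or truncated record */
    *len = (size_t) hdr[0] << 24 | (size_t) hdr[1] << 16
         | (size_t) hdr[2] << 8  | (size_t) hdr[3];
    *buf = malloc(*len);
    if (*buf == NULL)
        return -1;
    if (fread(*buf, 1, *len, store) != *len) {
        free(*buf);
        return -1;
    }
    return 0;
}

int main(void)
{
    FILE *store = fopen("kannel.store", "rb");  /* path is illustrative */
    char *msg;
    size_t len;

    if (store == NULL)
        return 1;

    /* Process each message as it is read, instead of loading the
     * whole store into memory before doing anything with it. */
    while (read_one_message(store, &msg, &len) == 0) {
        /* route_message(msg, len);  hypothetical hand-off to a sender */
        free(msg);                 /* nothing accumulates in memory */
    }

    fclose(store);
    return 0;
}

The point being: the persistent store already holds every message, so the in-memory working set only needs to cover what's actually in flight.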


[from kannel admin page]

...

   Status: running, uptime 0d 0h 12m 39s

   WDP: received 0 (0 queued), sent 0 (0 queued)
                                         v-- *** this number has stopped growing ***
   SMS: received 0 (329153 queued), sent 4 (0 queued), store size 634247
                    ^--------------- *** this number is growing ***
   SMS: inbound 0.00 msg/sec, outbound 0.01 msg/sec

...


(the queue sizes for all the SMPP binds are also growing, so I assume their queues are related to this main queue somehow?)

At this point, the bearerbox's memory footprint is *still* growing (eventually reaching 1264M virtual, 932M resident), and the server is spending 60+ percent of its CPU cycles waiting on disk I/O as the VM thrashes.

- Finally, Kannel's queue sizes etc. seem to stop growing:


[kannel admin page]

Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51', compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006, machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16. Using native malloc.

   Status: running, uptime 0d 0h 25m 5s

   WDP: received 0 (0 queued), sent 0 (0 queued)

   SMS: received 0 (531104 queued), sent 4 (0 queued), store size 634247

   SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec

   DLR: 0 queued, using internal storage

   No boxes connected

   SMSC connections:
   test SMPP:10.100.123.20:7011/7011:test:smpp (online 178s, rcvd 0, sent 1, failed 0, queued 25776 msgs)
   test SMPP:10.100.123.20:7012/7012:test:smpp (online 177s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7013/7013:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7014/7014:test:smpp (online 1505s, rcvd 0, sent 1, failed 0, queued 25776 msgs)


BUT the queue size doesn't match the store size, and Kannel has NOT resumed sending those messages (apart from 1 per SMSC - not sure why).

The bearerbox still seems to be busy doing something, even though it doesn't seem to be loading messages from store into queues, or sending them:


[top output]

top - 01:46:39 up 1 day, 2:15, 4 users, load average: 1.68, 2.03, 1.81
Tasks:  82 total,   2 running,  80 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.5% us, 1.3% sy, 0.0% ni, 49.8% id, 32.1% wa, 0.0% hi, 0.3% si
Mem:   1024580k total,  1009916k used,    14664k free,      332k buffers
Swap:  2104496k total,   937960k used,  1166536k free,    68740k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18021 kannel    16   0 1265m 812m  924 D 35.3 81.2  18:58.47 bearerbox
   56 root      15   0     0    0    0 S  0.7  0.0   4:51.60 kswapd0


- At this point, Kannel seems to be somewhat broken... and of course, restarting it simply goes through this whole process again (reading messages from the store, consuming memory, and eventually stalling, doing nothing).

ADDENDUM:

After waiting a while, the 'SMS' queue counter at the top of the page dropped to zero (the store size was still ~640k). The smsbox, which hadn't been able to connect to the bearerbox until the bearerbox had finished loading messages from the store, finally connected, and its queue had grown to a similar size to the bearerbox's queue:


[kannel admin page]

Kannel bearerbox version `cvs-20060830'. Build `Aug 31 2006 22:56:51', compiler `3.4.5 20051201 (Red Hat 3.4.5-2)'. System Linux, release 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006, machine x86_64. Hostname grover, IP 127.0.0.1. Libxml version 2.6.16. Using native malloc.

   Status: running, uptime 0d 0h 33m 18s

   WDP: received 0 (0 queued), sent 0 (0 queued)

   SMS: received 0 (0 queued), sent 8 (0 queued), store size 634243

   SMS: inbound 0.00 msg/sec, outbound 0.00 msg/sec

   DLR: 0 queued, using internal storage

   Box connections:
   smsbox:(none), IP 127.0.0.1 (531104 queued), (on-line 0d 0h 6m 45s) <----- *** this is a recent change (originally, while bearerbox was busy, no other boxes were connected) ***

   SMSC connections:
   test SMPP:10.100.123.20:7011/7011:test:smpp (online 671s, rcvd 0, sent 2, failed 0, queued 25775 msgs)
   test SMPP:10.100.123.20:7012/7012:test:smpp (online 670s, rcvd 0, sent 2, failed 0, queued 25774 msgs)
   test SMPP:10.100.123.20:7013/7013:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25776 msgs)
   test SMPP:10.100.123.20:7014/7014:test:smpp (online 1998s, rcvd 0, sent 2, failed 0, queued 25775 msgs)


At this point, 4 more messages have been delivered, but there is no other activity, and bearerbox is still chewing up CPU cycles and memory:


[top output]

top - 01:55:12 up 1 day, 2:23, 4 users, load average: 1.15, 1.84, 1.88
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.5% us, 1.2% sy, 0.0% ni, 50.1% id, 32.9% wa, 0.0% hi, 0.3% si
Mem:   1024580k total,  1009892k used,    14688k free,      688k buffers
Swap:  2104496k total,  1042352k used,  1062144k free,    58272k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18021 kannel    16   0 1297m 770m 1020 S 33.0 77.0  19:48.88 bearerbox
   56 root      15   0     0    0    0 S  0.3  0.0   5:20.49 kswapd


... after waiting a while longer, the server is becoming unresponsive, and the bearerbox and the VM are still thrashing:

[top output]

top - 02:02:01 up 1 day, 2:30, 4 users, load average: 6.15, 3.51, 2.45
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.1% us, 0.4% sy, 0.0% ni, 47.5% id, 51.4% wa, 0.1% hi, 0.4% si
Mem:   1024580k total,  1009572k used,    15008k free,      236k buffers
Swap:  2104496k total,  1038460k used,  1066036k free,    11292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   56 root      15   0     0    0    0 S  8.4  0.0   5:56.76 kswapd0
 3011 root      15   0 18824 1436  512 D  6.0  0.1   0:57.46 hald
18248 giulio.h  15   0  6148  848  668 R  3.5  0.1   0:07.41 top
 2443 root      15   0  2548  272  196 S  3.2  0.0   0:02.76 irqbalance
 2920 root      15   0 34996  744  632 S  3.2  0.1   0:24.00 sendmail
16918 giulio.h  16   0 53736  376  356 S  3.2  0.0   0:03.72 tail
18021 kannel    16   0 1276m 936m 1028 S  0.5 93.6  20:30.30 bearerbox


So:

- Any comments on Kannel's behaviour under these circumstances?
- Apart from stopping Kannel, moving the message store file aside, and restarting Kannel, I can't think of any other way of gracefully recovering from such a situation, which makes disaster recovery difficult.
- Could Kannel be modified so that it processes stored messages more efficiently? (e.g. read messages directly from disk, reduce the in-memory size of messages, or parallelize the processing of stored messages on startup - a sketch of the bounded-loading idea follows below)
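
On that last point, here's a minimal sketch of the bounded-loading idea: the store loader blocks once a fixed number of messages are in memory and only continues as the sender drains the queue, so the startup footprint is capped regardless of store size. All the names, the queue, and the sizes here are invented for the illustration - this is not Kannel's actual threading code:

[illustrative C sketch]

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX_IN_FLIGHT 4   /* cap on stored messages held in RAM */
#define STORE_SIZE    20  /* stand-in for the number of stored msgs */

static sem_t free_slots;  /* loader waits here when the queue is full */
static sem_t used_slots;  /* sender waits here when the queue is empty */
static int queue[MAX_IN_FLIGHT];
static int head, tail;

/* Loader: reads messages from the store one at a time, but blocks as
 * soon as MAX_IN_FLIGHT are in memory, instead of reading the whole
 * store before any sending starts. */
static void *loader(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < STORE_SIZE; i++) {
        sem_wait(&free_slots);
        queue[tail] = i;                   /* "read message i from disk" */
        tail = (tail + 1) % MAX_IN_FLIGHT;
        sem_post(&used_slots);
    }
    return NULL;
}

/* Sender: drains the queue; each completed send frees a slot, which
 * unblocks the loader. */
static void *sender(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < STORE_SIZE; i++) {
        sem_wait(&used_slots);
        printf("sent message %d\n", queue[head]);
        head = (head + 1) % MAX_IN_FLIGHT;
        sem_post(&free_slots);
    }
    return NULL;
}

int main(void)
{
    pthread_t lt, st;

    sem_init(&free_slots, 0, MAX_IN_FLIGHT);
    sem_init(&used_slots, 0, 0);
    pthread_create(&lt, NULL, loader, NULL);
    pthread_create(&st, NULL, sender, NULL);
    pthread_join(lt, NULL);
    pthread_join(st, NULL);
    return 0;
}

With this shape, the bearerbox could presumably start accepting box connections and sending immediately, and its startup memory would be bounded by MAX_IN_FLIGHT messages rather than by the whole store.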

Thanks,

--
Giulio Harding
Systems Administrator

m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com




