For a while now I have been trying to hunt down the reason why the processor load on our Vicidial systems spikes frequently. It normally stays at around 2 on a 4-core server, but it can jump as high as 45, which is not good for call quality. Upon investigation I figured out that it is the Asterisk process (version 1.2.21.1) that is doing this.
For those who do not know what Vicidial is or how it works: Vicidial is a call center dialing application that uses Asterisk. Basically, a bunch of call center agents sit in their own respective meetmes waiting for calls. Vicidial dials a number of lines per available agent. When a call picks up, Vicidial determines which agent to hand the call to and drops the call into that agent's meetme. When an agent wants to transfer a call to another agent, the new agent is pulled from his meetme and placed in the first agent's so that they can both talk to the person on the phone. Once the first agent is done talking to the second, he is pulled from that meetme and placed into a free one. In total there are about 150 meetmes per box, and generally I have only about 30-40 agents on these boxes.

To hunt down what was causing the load spikes, I compiled Asterisk with full debugging and ran Vicidial as normal. I found a correlation between lots of calls picking up and the load spikes. During these periods app_meetme also complains of deadlocks. Thinking that I had a deadlocking issue on my hands, I pored over the code and was less than thrilled with what I found. The problem has nothing to do with deadlocks. (We actually did find one, but it cannot occur during normal Vicidial operation, because it only happens when dynamically creating new conferences and Vicidial does not create them dynamically.) I have also looked somewhat at the version of app_meetme in 1.4 as well as the svn trunk; some of these problems have been fixed there, but others have not, and I have seen the same problems in a number of other places throughout Asterisk. The problems are as follows:

1. There is only one lock in all of app_meetme (version 1.2), and that is on the linked list that holds the configuration for each conference. When something needs to be changed in a single conference, such as adding another channel, all of the conferences get locked.
This is not that big of a deal when you have 1 or 2 conferences, but with 10 to 15 it becomes somewhat of a problem, and with 150, with channels being added and removed all the time, it is a big problem. Sure, a single lock does a great job of helping to prevent deadlocks, but it would be far better to use thread-safe programming techniques that do not need one: for example, a thread-safe linked list that allows reading and traversing the list without locking and only requires locks for insertion and deletion, plus a lock on each individual conference for when something needs to be changed on it. Better still would be to convert the linked list to a thread-safe hash table, which would allow much faster access with a high number of meetmes.

2. There are locks around file I/O, mainly writing to the log file. File I/O is very slow, and locking and unlocking should happen as quickly as possible. You do not want to hold a lock while doing anything that may have to wait, especially not something that could itself block on another lock, like concurrent access to a file. Locking, performing file I/O on the locked data, then unlocking should be avoided whenever possible. It is far better to lock, make a copy of the data, unlock, and then perform the file I/O with the copy, especially for something like writing to a log file: making a copy in RAM is far faster than file I/O. This gets even worse when doing things like throwing manager events or writing to the CLI. Here is where we have to start talking about deadlocks. This appears even in the svn version.

3. Last we checked, Asterisk locks were non-reentrant, meaning a single thread cannot take the same lock twice. This seems to be being worked on, but I am not aware of it having been fixed yet. It means we need to be careful when locking.
A deadlock we fixed was caused by a function taking the one lock and then calling another function that takes the same lock. Now, I consider much of this to be pretty simple to understand, but based on the pervasiveness of these problems throughout Asterisk, I think a good locking practices section should be added to the Asterisk coding guidelines. It should address not only these issues but others as well, such as:

1. Using finer-grained locks rather than monster locks that block all functionality of a module for each tiny operation.

2. Correct ordering of locking and unlocking (lock a, b, c then unlock c, b, a) to avoid cross-locks:

   thread 1 locking a -> ? -> b
   thread 2 locking b -> ? -> a

can cause both threads to wait on each other with no chance of releasing their locks.

3. Making sure the locking order is kept, to avoid deadlocks even when it seems pointless to do so. For instance, it is wise to lock not only a and c but b as well, even though b does not need to be locked; not doing this can sometimes lead to deadlocks. In general, chains of dependent locks always have to be taken in the same order. If partial chains are used in some places, you have to make sure the calling order does not create crosses over multiple functions that use the same locks (circular lock dependencies):

   thread 1 locking a -> ? -> b
   thread 2 locking b -> ? -> c
   thread 3 locking c -> ? -> a

4. Good ways of thinking through the locking that is going on. Sometimes this can be very difficult, especially when dealing with many locks of different types. Drawing nesting diagrams can help greatly here.

5. Above all, stressing that Asterisk is a very large multi-threaded application, and that poorly locked code can have far-reaching effects in areas of Asterisk where you would not expect it. This can make debugging problems extremely difficult and aggravating, and it can have a massive impact on system performance, as in the case of app_meetme.
I am more than willing to help write such a section, but it will have to wait till I fix the issues with the 1.2, 1.4, and svn versions of app_meetme.

Michael Cargile
Explido Software USA Inc.
www.explido.us

_______________________________________________
--Bandwidth and Colocation Provided by http://www.api-digital.com--
asterisk-dev mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-dev
