Hi Luigi, Thank you for letting us know about this issue. It looks like you have hit a threading bug in Bono. The log message you picked out shows that one of Bono's threads is attempting to release a lock that it does not currently own.
We have run stress recently, and didn't see this problem, so I would like to find out what the difference is between your system and ours. That should help use to get a reproduction scenario and debug the problem. The first thing to check is how many CPUs and how many worker threads you are using. You can find out how many CPUs there are on your server by running `grep processor /proc/cpuinfo | wc -l`. You can find out many worker threads you are using by looking in /etc/clearwater/user_settings. If there is a num_worker_threads option, that will set the number of threads. Otherwise, there will be one per core. If you have more than 1 CPU core on your Bono node, you may be able to reduce the risk of hitting this issue by moving to a node with only 1 core. I would also like to take a look at some debug logs from your system. Can you please enable debug logging<http://clearwater.readthedocs.io/en/stable/Troubleshooting_and_Recovery.html#bono> (log level 5), and send over a log from when this happens? Yours, Chris From: Clearwater [mailto:[email protected]] On Behalf Of Luigi De Simone Sent: 13 July 2016 11:06 To: [email protected] Subject: [Project Clearwater] [Clearwater Docker] SIPp stress testing on clearwater docker deploy Hi all, after deploying clearwater with docker, I've just tried sipp stress test and there are some message timeouts. In particular from /var/log/bono/bono_current.txt there are some sporadic messages like this: UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this() Other nodes do not exhibit problems. The sipp error log is below: sipp: The following events occured: 2016-07-13 09:57:35:333 1468403855.333438: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:57:52:946 1468403872.946879: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:58:19:863 1468403899.863728: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:58:20:820 1468403900.820186: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:02:504 1468403942.504959: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:06:724 1468403946.724817: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:09:466 1468403949.466635: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:09:921 1468403949.921984: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:26:948 1468403966.948038: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:55:369 1468403995.369205: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. and sipp.out is: Call limit reached (-m 50), 0.000 s period 0 ms scheduler resolution^M 0 calls (limit 50) Peak was 50 calls, after 0 s^M 0 Running, 15 Paused, 0 Woken up^M 0 dead call msg (discarded) 0 out-of-call msg (discarded) ^M 1 open sockets ^M ^M Messages Retrans Timeout Unexpected-Msg^M Pause [ 0ms/2:30] 50 0 ^M REGISTER ----------> 50 0 ^M 401 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 401 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M Pause [ 1:00] 50 0 ^M REGISTER ----------> B-RTD1 50 0 ^M 200 <---------- E-RTD1 50 0 0 0 ^M REGISTER ----------> B-RTD1 50 0 ^M 200 <---------- E-RTD1 50 0 0 0 ^M Pause [$pre_call_delay] 50 0 ^M INVITE ----------> B-RTD2 50 0 ^M 100 <---------- 50 0 0 0 ^M INVITE <---------- 0 0 0 0 ^M 100 <---------- 0 0 0 0 ^M INVITE <---------- 50 0 0 0 ^M 100 ----------> 50 0 ^M 180 ----------> 50 0 ^M 180 <---------- 50 0 0 0 ^M Pause [ 2000ms] 50 0 ^M 200 ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M ACK ----------> 50 0 ^M ACK <---------- 40 0 10 0 ^M UPDATE ----------> 40 0 ^M UPDATE <---------- 40 0 0 0 ^M 200 ----------> 40 0 ^M 200 <---------- E-RTD2 40 0 0 0 ^M Pause [ 6000ms] 40 0 ^M BYE ----------> B-RTD3 40 0 ^M BYE <---------- 40 0 0 0 ^M 200 ----------> 40 0 ^M 200 <---------- E-RTD3 40 0 0 0 ^M ------------------------------ Test Terminated --------------------------------^M The messages with trigger timeouts are always in that point of the scenario. What is the problem? Any help? Luigi
_______________________________________________ Clearwater mailing list [email protected] http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
