Hi Luigi,

Thank you for letting us know about this issue. It looks like you have hit a 
threading bug in Bono. The log message you picked out shows that one of Bono's 
threads is attempting to release a lock that it does not currently own.

We have run stress recently, and didn't see this problem, so I would like to 
find out what the difference is between your system and ours. That should help 
use to get a reproduction scenario and debug the problem.

The first thing to check is how many CPUs and how many worker threads you are 
using. You can find out how many CPUs there are on your server by running `grep 
processor /proc/cpuinfo | wc -l`. You can find out many worker threads you are 
using by looking in /etc/clearwater/user_settings. If there is a 
num_worker_threads option, that will set the number of threads. Otherwise, 
there will be one per core. If you have more than 1 CPU core on your Bono node, 
you may be able to reduce the risk of hitting this issue by moving to a node 
with only 1 core.

I would also like to take a look at some debug logs from your system. Can you 
please enable debug 
logging<http://clearwater.readthedocs.io/en/stable/Troubleshooting_and_Recovery.html#bono>
 (log level 5), and send over a log from when this happens?

Yours,

Chris

From: Clearwater [mailto:[email protected]] On 
Behalf Of Luigi De Simone
Sent: 13 July 2016 11:06
To: [email protected]
Subject: [Project Clearwater] [Clearwater Docker] SIPp stress testing on 
clearwater docker deploy

Hi all, after deploying clearwater with docker, I've just tried sipp stress 
test and there are some message timeouts. In particular from
/var/log/bono/bono_current.txt there are some sporadic messages like this:

UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == 
pj_thread_this()

Other nodes do not exhibit problems.
The sipp error log is below:

sipp: The following events occured:
2016-07-13      09:57:35:333    1468403855.333438: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:57:52:946    1468403872.946879: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:58:19:863    1468403899.863728: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:58:20:820    1468403900.820186: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:02:504    1468403942.504959: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:06:724    1468403946.724817: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-13      09:59:09:466    1468403949.466635: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:09:921    1468403949.921984: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:26:948    1468403966.948038: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-13      09:59:55:369    1468403995.369205: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.

and sipp.out is:

  Call limit reached (-m 50), 0.000 s period  0 ms scheduler resolution^M
  0 calls (limit 50)                     Peak was 50 calls, after 0 s^M
  0 Running, 15 Paused, 0 Woken up^M
  0 dead call msg (discarded)            0 out-of-call msg (discarded)        ^M
  1 open sockets                        ^M
^M
                                 Messages  Retrans   Timeout   Unexpected-Msg^M
       Pause [ 0ms/2:30]         50                            0        ^M
    REGISTER ---------->         50        0                            ^M
         401 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         401 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
       Pause [     1:00]         50                            0        ^M
    REGISTER ---------->  B-RTD1 50        0                            ^M
         200 <----------  E-RTD1 50        0         0         0        ^M
    REGISTER ---------->  B-RTD1 50        0                            ^M
         200 <----------  E-RTD1 50        0         0         0        ^M
       Pause [$pre_call_delay]   50                            0        ^M
      INVITE ---------->  B-RTD2 50        0                            ^M
         100 <----------         50        0         0         0        ^M
      INVITE <----------         0         0         0         0        ^M
         100 <----------         0         0         0         0        ^M
      INVITE <----------         50        0         0         0        ^M
         100 ---------->         50        0                            ^M
         180 ---------->         50        0                            ^M
         180 <----------         50        0         0         0        ^M
       Pause [   2000ms]         50                            0        ^M
         200 ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
         ACK ---------->         50        0                            ^M
         ACK <----------         40        0         10        0        ^M
      UPDATE ---------->         40        0                            ^M
      UPDATE <----------         40        0         0         0        ^M
         200 ---------->         40        0                            ^M
         200 <----------  E-RTD2 40        0         0         0        ^M
       Pause [   6000ms]         40                            0        ^M
         BYE ---------->  B-RTD3 40        0                            ^M
         BYE <----------         40        0         0         0        ^M
         200 ---------->         40        0                            ^M
         200 <----------  E-RTD3 40        0         0         0        ^M
------------------------------ Test Terminated 
--------------------------------^M

The messages with trigger timeouts are always in that point of the scenario.
What is the problem? Any help?

Luigi
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org

Reply via email to