Luigi,

Chris is a bit busy at the moment, so I've picked this up.  (BTW, it's worth 
copying in the 
[email protected]<mailto:[email protected]>
 email address so that others can see your emails.)

Please can you give a bit more context in the logs - I'm not sure if there are 
important logs before or after the section you've extracted?  (In particular, 
we trace messages both when we send and when we receive them - I think the 
first message is from receiving, but I can't tell for sure.)  It would also be 
really useful to gather the corresponding logs from Sprout.

Thanks,

Matt


From: Luigi De Simone [mailto:[email protected]]
Sent: 18 July 2016 17:57
To: Chris Elford <[email protected]>
Subject: Re: [Project Clearwater] [Clearwater Docker] SIPp stress testing on 
clearwater docker deploy

Hi Chris, unfortunately the problem still persists, also if I set the host node 
with 1vCPU, and force to use one worker thread
(by setting /etc/clearwater/user_settings with num_worker_threads=1) by bono 
service.
Below is the bono log in debug mode, specifically for a message that experience 
timeout:

....
--start msg--

ACK sip:[email protected]:18906;transport=TCP;ob SIP/2.0^M
Via: SIP/2.0/TCP 
192.168.1.107:18906;rport;branch=z9hG4bK-2010000088-6-1.000000-1^M
Route:  <sip:[email protected]:5060;transport=TCP;lr>, 
<sip:192.168.2.121:5058;transport=TCP;lr>, 
<sip:scscf.sprout:5054;transport=tcp;lr;service=scscf;billing-r
ole=charge-orig>, 
<sip:scscf.sprout:5054;transport=tcp;lr;service=scscf;billing-role=charge-term>,
 <sip:192.168.2.121:5058;transport=TCP;lr>,<sip:[email protected]
21:5060;transport=TCP;lr>^M
Max-Forwards: 70^M
From: 
sip:[email protected];tag=1931SIPpTag0061234^M<sip:[email protected];tag=1931SIPpTag0061234%5eM>
To: 
sip:[email protected];tag=1931SIPpTag0064321^M<sip:[email protected];tag=1931SIPpTag0064321%5eM>
Call-ID: 
2010000088-1.000000///[email protected]<mailto:2010000088-1.000000///[email protected]>^M
CSeq: 7 ACK^M
Content-Length: 0^M
^M

--end msg--
18-07-2016 16:49:28.548 UTC Debug pjutils.cpp:1660: Logging SAS Call-ID marker, 
Call-ID 
2010000088-1.000000///[email protected]<mailto:2010000088-1.000000///[email protected]>
18-07-2016 16:49:28.548 UTC Debug thread_dispatcher.cpp:253: Queuing cloned 
received message 0x7fa14c12e748 for worker threads
18-07-2016 16:49:28.548 UTC Debug pjsip: tsx0x7fa1500ad Timeout timer event
18-07-2016 16:49:28.548 UTC Debug pjsip: tsx0x7fa1500ad State changed from 
Terminated to Destroyed, event=TIMER
18-07-2016 16:49:28.548 UTC Debug bono.cpp:382: tsx0x7fa1500ad6e8 - 
tu_on_tsx_state UAS, TSX_STATE TIMER state=Destroyed
18-07-2016 16:49:28.548 UTC Debug bono.cpp:2260: tsx0x7fa1500ad6e8 - UAS tsx 
destroyed
18-07-2016 16:49:28.548 UTC Debug bono.cpp:2575: tsx0x7fa1500ad6e8 - Cancel 0 
pending UAC transactions
18-07-2016 16:49:28.548 UTC Debug bono.cpp:2584: tsx0x7fa1500ad6e8 - Check 
target 0, UAC data = (nil), UAC tsx = (nil)
18-07-2016 16:49:28.548 UTC Debug bono.cpp:1815: UASTransaction destructor 
(0x7fa150002fc0)
18-07-2016 16:49:28.548 UTC Debug bono.cpp:1835: Disconnect UAC transactions 
from UAS transaction
18-07-2016 16:49:28.548 UTC Debug bono.cpp:1881: Upstream ACR = 0x7fa1500091b0, 
Downstream ACR = 0x7fa1500091b0
18-07-2016 16:49:28.548 UTC Verbose acr.cpp:643: Sending P-CSCF Ralf ACR 
(0x7fa1500091b0)
18-07-2016 16:49:28.548 UTC Debug acr.cpp:673: Building message
18-07-2016 16:49:28.548 UTC Debug acr.cpp:690: Adding peers meta-data, 1 ccfs, 
0 ecfs
18-07-2016 16:49:28.548 UTC Debug acr.cpp:729: Building event
18-07-2016 16:49:28.548 UTC Debug acr.cpp:734: Adding Account-Record-Type AVP 2
18-07-2016 16:49:28.549 UTC Debug acr.cpp:754: Adding Service-Information AVP 
group
18-07-2016 16:49:28.549 UTC Debug acr.cpp:764: Adding 0 Subscription-Id AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:790: Adding IMS-Information AVP group
18-07-2016 16:49:28.549 UTC Debug acr.cpp:795: Adding Event-Type AVP group
18-07-2016 16:49:28.549 UTC Debug acr.cpp:824: Adding 0 Calling-Party-Address 
AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:844: Adding Called-Party-Address AVP
18-07-2016 16:49:28.549 UTC Debug acr.cpp:865: Adding 0 
Called-Asserted-Identity AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:886: Adding 0 Associated-URI AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:905: Adding Time-Stamps AVP group
18-07-2016 16:49:28.549 UTC Debug acr.cpp:995: Adding 0 Transit-IOI-List AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1071: Adding 0 
Early-Media-Description AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1094: Adding Media AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1297: Adding SDP-Session-Description 
AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1323: Adding media AVPs for offer
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1330: Adding media AVPs for answer
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1099: Adding 0 Message-Body AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1189: Adding 0 Reason-Header AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1207: Adding 0 
Access-Network-Information AVPs
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1225: Adding From-Address AVP
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1240: Adding Route-Header-Received AVP
18-07-2016 16:49:28.549 UTC Debug acr.cpp:1247: Adding Route-Header-Transmitted 
AVP
18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:149: Worker thread 
dequeue message 0x7fa14c12e748
18-07-2016 16:49:28.549 UTC Debug pjsip: sip_endpoint.c Distributing rdata to 
modules: Request msg ACK/cseq=7 (rdata0x7fa14c12e748)
18-07-2016 16:49:28.549 UTC Debug uri_classifier.cpp:169: home domain: false, 
local_to_node: false, is_gruu: false, enforce_user_phone: false, prefer_sip: 
true, treat_n
umber_as_phone: false
18-07-2016 16:49:28.549 UTC Debug uri_classifier.cpp:199: Classified URI as 5
18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500ad Incoming Request msg 
ACK/cseq=7 (rdata0x7fa14c12e748) in state Destroyed
18-07-2016 16:49:28.549 UTC Debug httpresolver.cpp:71: HttpResolver::resolve 
for host ralf, port 10888, family 2
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:514: Attempt to parse ralf 
as IP address
18-07-2016 16:49:28.549 UTC Verbose dnscachedresolver.cpp:245: Check cache for 
ralf type 1
18-07-2016 16:49:28.549 UTC Debug dnscachedresolver.cpp:347: Pulling 1 records 
from cache for ralf A
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:362: Found 1 A/AAAA records, 
randomizing
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:504: 192.168.2.114:10888 
transport 6 is not blacklisted
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:383: Added a server, now 
have 1 of 5
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:421: Adding 0 servers from 
blacklist
18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:514: Attempt to parse 
192.168.2.114 as IP address
18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:623: Sending HTTP request 
: http://ralf:10888/call-id/2010000088-1.000000%2F%2F%2F6-1931%40192.168.1.107 
(trying 19
2.168.2.114)
18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header 
http/1.1200ok with value
18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header 
content-length with value 0
18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header  with 
value
18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:638: Received HTTP 
response: status=200, doc=
18-07-2016 16:49:28.549 UTC Debug acr.cpp:54: Destroyed ACR (0x7fa1500091b0)
18-07-2016 16:49:28.549 UTC Debug bono.cpp:1906: Free original request 
(0x7fa1500ac248)
18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa1500a Destroying txdata 
Request msg INVITE/cseq=7 (tdta0x7fa1500ac1a0)
18-07-2016 16:49:28.549 UTC Debug bono.cpp:1922: UASTransaction destructor 
completed
18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa1500c Destroying txdata 
Response msg 200/INVITE/cseq=7 (tdta0x7fa1500c8810)
18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 Timeout timer event
18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 State changed from 
Terminated to Destroyed, event=TIMER
18-07-2016 16:49:28.549 UTC Debug bono.cpp:382: tsx0x7fa1500a0da8 - 
tu_on_tsx_state UAC, TSX_STATE TIMER state=Destroyed
18-07-2016 16:49:28.549 UTC Debug bono.cpp:2946: tsx0x7fa1500a0da8 - uac_data = 
0x7fa150058860, uas_data = (nil)
18-07-2016 16:49:28.549 UTC Debug bono.cpp:3033: tsx0x7fa1500a0da8 - UAC tsx 
destroyed
18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa15000 Destroying txdata 
Request msg INVITE/cseq=7 (tdta0x7fa150005d20)
18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500ad Transaction destroyed!
18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 Transaction destroyed!
18-07-2016 16:49:28.549 UTC Error pjsip: Assert failed: 
../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this()
18-07-2016 16:49:28.549 UTC Error pjsip: Assert failed: 
../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this()
18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:193: Worker thread 
completed processing message 0x7fa14c12e748
18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:199: Request latency = 
730us
18-07-2016 16:49:30.003 UTC Debug pjsip: sip_endpoint.c Processing incoming 
message: Request msg INVITE/cseq=7 (rdata0x7fa14c0d77c0)
18-07-2016 16:49:30.003 UTC Verbose common_sip_processing.cpp:120: RX 1799 
bytes Request msg INVITE/cseq=7 (rdata0x7fa14c0d77c0) from TCP 
192.168.1.107:18897:

.....

sipp.out:

------------------------------ Scenario Screen -------- [1-9]: Change Screen --
     Users (length)   Port   Total-time  Total-calls  Remote-host
          50 (0 ms)   5060     264.83 s           50  192.168.2.121:5060(TCP)

  Call limit reached (-m 50), 0.373 s period  1 ms scheduler resolution
  0 calls (limit 50)                     Peak was 50 calls, after 0 s
  0 Running, 10 Paused, 2 Woken up
  0 dead call msg (discarded)            0 out-of-call msg (discarded)
  1 open sockets

                                 Messages  Retrans   Timeout   Unexpected-Msg
       Pause [ 0ms/2:30]         50                            0
    REGISTER ---------->         50        0
         401 <----------         50        0         0         0
    REGISTER ---------->         50        0
         200 <----------         50        0         0         0
    REGISTER ---------->         50        0
         401 <----------         50        0         0         0
    REGISTER ---------->         50        0
         200 <----------         50        0         0         0
       Pause [     1:00]         50                            0
    REGISTER ---------->  B-RTD1 50        0
         200 <----------  E-RTD1 50        0         0         0
    REGISTER ---------->  B-RTD1 50        0
         200 <----------  E-RTD1 50        0         0         0
       Pause [$pre_call_delay]   50                            0
      INVITE ---------->  B-RTD2 50        0
         100 <----------         50        0         0         0
      INVITE <----------         0         0         0         0
         100 <----------         0         0         0         0
      INVITE <----------         50        0         0         0
         100 ---------->         50        0
         180 ---------->         50        0
         180 <----------         50        0         0         0
       Pause [   2000ms]         50                            0
         200 ---------->         50        0
         200 <----------         50        0         0         0
         ACK ---------->         50        0
         ACK <----------         41        0         9         0
      UPDATE ---------->         41        0
      UPDATE <----------         41        0         0         0
         200 ---------->         41        0
         200 <----------  E-RTD2 41        0         0         0
       Pause [   6000ms]         41                            0
         BYE ---------->  B-RTD3 41        0
         BYE <----------         41        0         0         0
         200 ---------->         41        0
         200 <----------  E-RTD3 41        0         0         0
------------------------------ Test Terminated --------------------------------

call_load error messages:

sipp: The following events occured:
2016-07-18      18:48:03:968    1468860483.968101: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:48:22:723    1468860502.723670: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:48:38:690    1468860518.690775: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:48:54:571    1468860534.571992: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:49:12:397    1468860552.397434: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:49:20:446    1468860560.446247: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:49:26:442    1468860566.442759: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:49:52:635    1468860592.635282: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-18      18:49:55:495    1468860595.495591: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.

Il giorno 15 lug 2016, alle ore 12:41, Chris Elford 
<[email protected]<mailto:[email protected]>> ha scritto:

Thanks Luigi,

I’m glad to hear that fixed the problem.

I’ve raised a GitHub issue to cover the call failures that you were 
seeing:https://github.com/Metaswitch/sprout/issues/1488.

It would be great if you could try to hit this issue again with debug logs 
turned on, and send them over. That way, we may be able to use Bono reliably 
with multiple cores in the future.

Chris

From: Luigi De Simone [mailto:[email protected]]
Sent: 14 July 2016 18:19
To: Chris Elford (projectclearwater.org<http://projectclearwater.org>) 
<[email protected]<mailto:[email protected]>>
Subject: RE: [Project Clearwater] [Clearwater Docker] SIPp stress testing on 
clearwater docker deploy

Hi Chris, I would inform you that I've just found a solution in the morning by 
reducing cpus to 1 exactly :)
So, this is the bug!
Cheerio,
Luigi
Il 14 lug 2016 6:18 PM, "Chris Elford 
(projectclearwater.org<http://projectclearwater.org>)" 
<[email protected]<mailto:[email protected]>> ha scritto:
Hi Luigi,

Thank you for letting us know about this issue. It looks like you have hit a 
threading bug in Bono. The log message you picked out shows that one of Bono’s 
threads is attempting to release a lock that it does not currently own.

We have run stress recently, and didn’t see this problem, so I would like to 
find out what the difference is between your system and ours. That should help 
use to get a reproduction scenario and debug the problem.

The first thing to check is how many CPUs and how many worker threads you are 
using. You can find out how many CPUs there are on your server by running `grep 
processor /proc/cpuinfo | wc -l`. You can find out many worker threads you are 
using by looking in /etc/clearwater/user_settings. If there is a 
num_worker_threads option, that will set the number of threads. Otherwise, 
there will be one per core. If you have more than 1 CPU core on your Bono node, 
you may be able to reduce the risk of hitting this issue by moving to a node 
with only 1 core.

I would also like to take a look at some debug logs from your system. Can you 
please enable debug 
logging<http://clearwater.readthedocs.io/en/stable/Troubleshooting_and_Recovery.html#bono>(log
 level 5), and send over a log from when this happens?

Yours,

Chris

From: Clearwater [mailto:[email protected]] On 
Behalf Of Luigi De Simone
Sent: 13 July 2016 11:06
To: 
[email protected]<mailto:[email protected]>
Subject: [Project Clearwater] [Clearwater Docker] SIPp stress testing on 
clearwater docker deploy

Hi all, after deploying clearwater with docker, I've just tried sipp stress 
test and there are some message timeouts. In particular from
/var/log/bono/bono_current.txt there are some sporadic messages like this:

UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == 
pj_thread_this()

Other nodes do not exhibit problems.
The sipp error log is below:

sipp: The following events occured:
2016-07-13      09:57:35:333    1468403855.333438: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:57:52:946    1468403872.946879: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:58:19:863    1468403899.863728: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:58:20:820    1468403900.820186: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:02:504    1468403942.504959: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:06:724    1468403946.724817: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-13      09:59:09:466    1468403949.466635: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:09:921    1468403949.921984: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.
2016-07-13      09:59:26:948    1468403966.948038: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on message 
Call Load Test:31 without label to jump to (ontimeout attribute): aborting call.
2016-07-13      09:59:55:369    1468403995.369205: Call-Id: 
[email protected]<mailto:[email protected]>, receive timeout on 
message Call Load Test:31 without label to jump to (ontimeout attribute): 
aborting call.

and sipp.out is:

  Call limit reached (-m 50), 0.000 s period  0 ms scheduler resolution^M
  0 calls (limit 50)                     Peak was 50 calls, after 0 s^M
  0 Running, 15 Paused, 0 Woken up^M
  0 dead call msg (discarded)            0 out-of-call msg (discarded)        ^M
  1 open sockets                        ^M
^M
                                 Messages  Retrans   Timeout   Unexpected-Msg^M
       Pause [ 0ms/2:30]         50                            0        ^M
    REGISTER ---------->         50        0                            ^M
         401 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         401 <----------         50        0         0         0        ^M
    REGISTER ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
       Pause [     1:00]         50                            0        ^M
    REGISTER ---------->  B-RTD1 50        0                            ^M
         200 <----------  E-RTD1 50        0         0         0        ^M
    REGISTER ---------->  B-RTD1 50        0                            ^M
         200 <----------  E-RTD1 50        0         0         0        ^M
       Pause [$pre_call_delay]   50                            0        ^M
      INVITE ---------->  B-RTD2 50        0                            ^M
         100 <----------         50        0         0         0        ^M
      INVITE <----------         0         0         0         0        ^M
         100 <----------         0         0         0         0        ^M
      INVITE <----------         50        0         0         0        ^M
         100 ---------->         50        0                            ^M
         180 ---------->         50        0                            ^M
         180 <----------         50        0         0         0        ^M
       Pause [   2000ms]         50                            0        ^M
         200 ---------->         50        0                            ^M
         200 <----------         50        0         0         0        ^M
         ACK ---------->         50        0                            ^M
         ACK <----------         40        0         10        0        ^M
      UPDATE ---------->         40        0                            ^M
      UPDATE <----------         40        0         0         0        ^M
         200 ---------->         40        0                            ^M
         200 <----------  E-RTD2 40        0         0         0        ^M
       Pause [   6000ms]         40                            0        ^M
         BYE ---------->  B-RTD3 40        0                            ^M
         BYE <----------         40        0         0         0        ^M
         200 ---------->         40        0                            ^M
         200 <----------  E-RTD3 40        0         0         0        ^M
------------------------------ Test Terminated 
--------------------------------^M

The messages with trigger timeouts are always in that point of the scenario.
What is the problem? Any help?

Luigi
_______________________________________________
Clearwater mailing list
[email protected]<mailto:[email protected]>
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org

_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org

Reply via email to