Luigi, Chris is a bit busy at the moment, so I've picked this up. (BTW, it's worth copying in the [email protected]<mailto:[email protected]> email address so that others can see your emails.)
Please can you give a bit more context in the logs - I'm not sure if there are important logs before or after the section you've extracted? (In particular, we trace messages both when we send and when we receive them - I think the first message is from receiving, but I can't tell for sure.) It would also be really useful to gather the corresponding logs from Sprout. Thanks, Matt From: Luigi De Simone [mailto:[email protected]] Sent: 18 July 2016 17:57 To: Chris Elford <[email protected]> Subject: Re: [Project Clearwater] [Clearwater Docker] SIPp stress testing on clearwater docker deploy Hi Chris, unfortunately the problem still persists, also if I set the host node with 1vCPU, and force to use one worker thread (by setting /etc/clearwater/user_settings with num_worker_threads=1) by bono service. Below is the bono log in debug mode, specifically for a message that experience timeout: .... --start msg-- ACK sip:[email protected]:18906;transport=TCP;ob SIP/2.0^M Via: SIP/2.0/TCP 192.168.1.107:18906;rport;branch=z9hG4bK-2010000088-6-1.000000-1^M Route: <sip:[email protected]:5060;transport=TCP;lr>, <sip:192.168.2.121:5058;transport=TCP;lr>, <sip:scscf.sprout:5054;transport=tcp;lr;service=scscf;billing-r ole=charge-orig>, <sip:scscf.sprout:5054;transport=tcp;lr;service=scscf;billing-role=charge-term>, <sip:192.168.2.121:5058;transport=TCP;lr>,<sip:[email protected] 21:5060;transport=TCP;lr>^M Max-Forwards: 70^M From: sip:[email protected];tag=1931SIPpTag0061234^M<sip:[email protected];tag=1931SIPpTag0061234%5eM> To: sip:[email protected];tag=1931SIPpTag0064321^M<sip:[email protected];tag=1931SIPpTag0064321%5eM> Call-ID: 2010000088-1.000000///[email protected]<mailto:2010000088-1.000000///[email protected]>^M CSeq: 7 ACK^M Content-Length: 0^M ^M --end msg-- 18-07-2016 16:49:28.548 UTC Debug pjutils.cpp:1660: Logging SAS Call-ID marker, Call-ID 2010000088-1.000000///[email protected]<mailto:2010000088-1.000000///[email protected]> 18-07-2016 16:49:28.548 UTC Debug thread_dispatcher.cpp:253: Queuing cloned received message 0x7fa14c12e748 for worker threads 18-07-2016 16:49:28.548 UTC Debug pjsip: tsx0x7fa1500ad Timeout timer event 18-07-2016 16:49:28.548 UTC Debug pjsip: tsx0x7fa1500ad State changed from Terminated to Destroyed, event=TIMER 18-07-2016 16:49:28.548 UTC Debug bono.cpp:382: tsx0x7fa1500ad6e8 - tu_on_tsx_state UAS, TSX_STATE TIMER state=Destroyed 18-07-2016 16:49:28.548 UTC Debug bono.cpp:2260: tsx0x7fa1500ad6e8 - UAS tsx destroyed 18-07-2016 16:49:28.548 UTC Debug bono.cpp:2575: tsx0x7fa1500ad6e8 - Cancel 0 pending UAC transactions 18-07-2016 16:49:28.548 UTC Debug bono.cpp:2584: tsx0x7fa1500ad6e8 - Check target 0, UAC data = (nil), UAC tsx = (nil) 18-07-2016 16:49:28.548 UTC Debug bono.cpp:1815: UASTransaction destructor (0x7fa150002fc0) 18-07-2016 16:49:28.548 UTC Debug bono.cpp:1835: Disconnect UAC transactions from UAS transaction 18-07-2016 16:49:28.548 UTC Debug bono.cpp:1881: Upstream ACR = 0x7fa1500091b0, Downstream ACR = 0x7fa1500091b0 18-07-2016 16:49:28.548 UTC Verbose acr.cpp:643: Sending P-CSCF Ralf ACR (0x7fa1500091b0) 18-07-2016 16:49:28.548 UTC Debug acr.cpp:673: Building message 18-07-2016 16:49:28.548 UTC Debug acr.cpp:690: Adding peers meta-data, 1 ccfs, 0 ecfs 18-07-2016 16:49:28.548 UTC Debug acr.cpp:729: Building event 18-07-2016 16:49:28.548 UTC Debug acr.cpp:734: Adding Account-Record-Type AVP 2 18-07-2016 16:49:28.549 UTC Debug acr.cpp:754: Adding Service-Information AVP group 18-07-2016 16:49:28.549 UTC Debug acr.cpp:764: Adding 0 Subscription-Id AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:790: Adding IMS-Information AVP group 18-07-2016 16:49:28.549 UTC Debug acr.cpp:795: Adding Event-Type AVP group 18-07-2016 16:49:28.549 UTC Debug acr.cpp:824: Adding 0 Calling-Party-Address AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:844: Adding Called-Party-Address AVP 18-07-2016 16:49:28.549 UTC Debug acr.cpp:865: Adding 0 Called-Asserted-Identity AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:886: Adding 0 Associated-URI AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:905: Adding Time-Stamps AVP group 18-07-2016 16:49:28.549 UTC Debug acr.cpp:995: Adding 0 Transit-IOI-List AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1071: Adding 0 Early-Media-Description AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1094: Adding Media AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1297: Adding SDP-Session-Description AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1323: Adding media AVPs for offer 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1330: Adding media AVPs for answer 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1099: Adding 0 Message-Body AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1189: Adding 0 Reason-Header AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1207: Adding 0 Access-Network-Information AVPs 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1225: Adding From-Address AVP 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1240: Adding Route-Header-Received AVP 18-07-2016 16:49:28.549 UTC Debug acr.cpp:1247: Adding Route-Header-Transmitted AVP 18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:149: Worker thread dequeue message 0x7fa14c12e748 18-07-2016 16:49:28.549 UTC Debug pjsip: sip_endpoint.c Distributing rdata to modules: Request msg ACK/cseq=7 (rdata0x7fa14c12e748) 18-07-2016 16:49:28.549 UTC Debug uri_classifier.cpp:169: home domain: false, local_to_node: false, is_gruu: false, enforce_user_phone: false, prefer_sip: true, treat_n umber_as_phone: false 18-07-2016 16:49:28.549 UTC Debug uri_classifier.cpp:199: Classified URI as 5 18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500ad Incoming Request msg ACK/cseq=7 (rdata0x7fa14c12e748) in state Destroyed 18-07-2016 16:49:28.549 UTC Debug httpresolver.cpp:71: HttpResolver::resolve for host ralf, port 10888, family 2 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:514: Attempt to parse ralf as IP address 18-07-2016 16:49:28.549 UTC Verbose dnscachedresolver.cpp:245: Check cache for ralf type 1 18-07-2016 16:49:28.549 UTC Debug dnscachedresolver.cpp:347: Pulling 1 records from cache for ralf A 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:362: Found 1 A/AAAA records, randomizing 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:504: 192.168.2.114:10888 transport 6 is not blacklisted 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:383: Added a server, now have 1 of 5 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:421: Adding 0 servers from blacklist 18-07-2016 16:49:28.549 UTC Debug baseresolver.cpp:514: Attempt to parse 192.168.2.114 as IP address 18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:623: Sending HTTP request : http://ralf:10888/call-id/2010000088-1.000000%2F%2F%2F6-1931%40192.168.1.107 (trying 19 2.168.2.114) 18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header http/1.1200ok with value 18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header content-length with value 0 18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:915: Received header with value 18-07-2016 16:49:28.549 UTC Debug httpconnection.cpp:638: Received HTTP response: status=200, doc= 18-07-2016 16:49:28.549 UTC Debug acr.cpp:54: Destroyed ACR (0x7fa1500091b0) 18-07-2016 16:49:28.549 UTC Debug bono.cpp:1906: Free original request (0x7fa1500ac248) 18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa1500a Destroying txdata Request msg INVITE/cseq=7 (tdta0x7fa1500ac1a0) 18-07-2016 16:49:28.549 UTC Debug bono.cpp:1922: UASTransaction destructor completed 18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa1500c Destroying txdata Response msg 200/INVITE/cseq=7 (tdta0x7fa1500c8810) 18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 Timeout timer event 18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 State changed from Terminated to Destroyed, event=TIMER 18-07-2016 16:49:28.549 UTC Debug bono.cpp:382: tsx0x7fa1500a0da8 - tu_on_tsx_state UAC, TSX_STATE TIMER state=Destroyed 18-07-2016 16:49:28.549 UTC Debug bono.cpp:2946: tsx0x7fa1500a0da8 - uac_data = 0x7fa150058860, uas_data = (nil) 18-07-2016 16:49:28.549 UTC Debug bono.cpp:3033: tsx0x7fa1500a0da8 - UAC tsx destroyed 18-07-2016 16:49:28.549 UTC Debug pjsip: tdta0x7fa15000 Destroying txdata Request msg INVITE/cseq=7 (tdta0x7fa150005d20) 18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500ad Transaction destroyed! 18-07-2016 16:49:28.549 UTC Debug pjsip: tsx0x7fa1500a0 Transaction destroyed! 18-07-2016 16:49:28.549 UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this() 18-07-2016 16:49:28.549 UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this() 18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:193: Worker thread completed processing message 0x7fa14c12e748 18-07-2016 16:49:28.549 UTC Debug thread_dispatcher.cpp:199: Request latency = 730us 18-07-2016 16:49:30.003 UTC Debug pjsip: sip_endpoint.c Processing incoming message: Request msg INVITE/cseq=7 (rdata0x7fa14c0d77c0) 18-07-2016 16:49:30.003 UTC Verbose common_sip_processing.cpp:120: RX 1799 bytes Request msg INVITE/cseq=7 (rdata0x7fa14c0d77c0) from TCP 192.168.1.107:18897: ..... sipp.out: ------------------------------ Scenario Screen -------- [1-9]: Change Screen -- Users (length) Port Total-time Total-calls Remote-host 50 (0 ms) 5060 264.83 s 50 192.168.2.121:5060(TCP) Call limit reached (-m 50), 0.373 s period 1 ms scheduler resolution 0 calls (limit 50) Peak was 50 calls, after 0 s 0 Running, 10 Paused, 2 Woken up 0 dead call msg (discarded) 0 out-of-call msg (discarded) 1 open sockets Messages Retrans Timeout Unexpected-Msg Pause [ 0ms/2:30] 50 0 REGISTER ----------> 50 0 401 <---------- 50 0 0 0 REGISTER ----------> 50 0 200 <---------- 50 0 0 0 REGISTER ----------> 50 0 401 <---------- 50 0 0 0 REGISTER ----------> 50 0 200 <---------- 50 0 0 0 Pause [ 1:00] 50 0 REGISTER ----------> B-RTD1 50 0 200 <---------- E-RTD1 50 0 0 0 REGISTER ----------> B-RTD1 50 0 200 <---------- E-RTD1 50 0 0 0 Pause [$pre_call_delay] 50 0 INVITE ----------> B-RTD2 50 0 100 <---------- 50 0 0 0 INVITE <---------- 0 0 0 0 100 <---------- 0 0 0 0 INVITE <---------- 50 0 0 0 100 ----------> 50 0 180 ----------> 50 0 180 <---------- 50 0 0 0 Pause [ 2000ms] 50 0 200 ----------> 50 0 200 <---------- 50 0 0 0 ACK ----------> 50 0 ACK <---------- 41 0 9 0 UPDATE ----------> 41 0 UPDATE <---------- 41 0 0 0 200 ----------> 41 0 200 <---------- E-RTD2 41 0 0 0 Pause [ 6000ms] 41 0 BYE ----------> B-RTD3 41 0 BYE <---------- 41 0 0 0 200 ----------> 41 0 200 <---------- E-RTD3 41 0 0 0 ------------------------------ Test Terminated -------------------------------- call_load error messages: sipp: The following events occured: 2016-07-18 18:48:03:968 1468860483.968101: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:48:22:723 1468860502.723670: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:48:38:690 1468860518.690775: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:48:54:571 1468860534.571992: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:49:12:397 1468860552.397434: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:49:20:446 1468860560.446247: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:49:26:442 1468860566.442759: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:49:52:635 1468860592.635282: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-18 18:49:55:495 1468860595.495591: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. Il giorno 15 lug 2016, alle ore 12:41, Chris Elford <[email protected]<mailto:[email protected]>> ha scritto: Thanks Luigi, I’m glad to hear that fixed the problem. I’ve raised a GitHub issue to cover the call failures that you were seeing:https://github.com/Metaswitch/sprout/issues/1488. It would be great if you could try to hit this issue again with debug logs turned on, and send them over. That way, we may be able to use Bono reliably with multiple cores in the future. Chris From: Luigi De Simone [mailto:[email protected]] Sent: 14 July 2016 18:19 To: Chris Elford (projectclearwater.org<http://projectclearwater.org>) <[email protected]<mailto:[email protected]>> Subject: RE: [Project Clearwater] [Clearwater Docker] SIPp stress testing on clearwater docker deploy Hi Chris, I would inform you that I've just found a solution in the morning by reducing cpus to 1 exactly :) So, this is the bug! Cheerio, Luigi Il 14 lug 2016 6:18 PM, "Chris Elford (projectclearwater.org<http://projectclearwater.org>)" <[email protected]<mailto:[email protected]>> ha scritto: Hi Luigi, Thank you for letting us know about this issue. It looks like you have hit a threading bug in Bono. The log message you picked out shows that one of Bono’s threads is attempting to release a lock that it does not currently own. We have run stress recently, and didn’t see this problem, so I would like to find out what the difference is between your system and ours. That should help use to get a reproduction scenario and debug the problem. The first thing to check is how many CPUs and how many worker threads you are using. You can find out how many CPUs there are on your server by running `grep processor /proc/cpuinfo | wc -l`. You can find out many worker threads you are using by looking in /etc/clearwater/user_settings. If there is a num_worker_threads option, that will set the number of threads. Otherwise, there will be one per core. If you have more than 1 CPU core on your Bono node, you may be able to reduce the risk of hitting this issue by moving to a node with only 1 core. I would also like to take a look at some debug logs from your system. Can you please enable debug logging<http://clearwater.readthedocs.io/en/stable/Troubleshooting_and_Recovery.html#bono>(log level 5), and send over a log from when this happens? Yours, Chris From: Clearwater [mailto:[email protected]] On Behalf Of Luigi De Simone Sent: 13 July 2016 11:06 To: [email protected]<mailto:[email protected]> Subject: [Project Clearwater] [Clearwater Docker] SIPp stress testing on clearwater docker deploy Hi all, after deploying clearwater with docker, I've just tried sipp stress test and there are some message timeouts. In particular from /var/log/bono/bono_current.txt there are some sporadic messages like this: UTC Error pjsip: Assert failed: ../src/pj/os_core_unix.c:1287 mutex->owner == pj_thread_this() Other nodes do not exhibit problems. The sipp error log is below: sipp: The following events occured: 2016-07-13 09:57:35:333 1468403855.333438: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:57:52:946 1468403872.946879: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:58:19:863 1468403899.863728: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:58:20:820 1468403900.820186: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:02:504 1468403942.504959: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:06:724 1468403946.724817: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:09:466 1468403949.466635: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:09:921 1468403949.921984: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:26:948 1468403966.948038: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. 2016-07-13 09:59:55:369 1468403995.369205: Call-Id: [email protected]<mailto:[email protected]>, receive timeout on message Call Load Test:31 without label to jump to (ontimeout attribute): aborting call. and sipp.out is: Call limit reached (-m 50), 0.000 s period 0 ms scheduler resolution^M 0 calls (limit 50) Peak was 50 calls, after 0 s^M 0 Running, 15 Paused, 0 Woken up^M 0 dead call msg (discarded) 0 out-of-call msg (discarded) ^M 1 open sockets ^M ^M Messages Retrans Timeout Unexpected-Msg^M Pause [ 0ms/2:30] 50 0 ^M REGISTER ----------> 50 0 ^M 401 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 401 <---------- 50 0 0 0 ^M REGISTER ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M Pause [ 1:00] 50 0 ^M REGISTER ----------> B-RTD1 50 0 ^M 200 <---------- E-RTD1 50 0 0 0 ^M REGISTER ----------> B-RTD1 50 0 ^M 200 <---------- E-RTD1 50 0 0 0 ^M Pause [$pre_call_delay] 50 0 ^M INVITE ----------> B-RTD2 50 0 ^M 100 <---------- 50 0 0 0 ^M INVITE <---------- 0 0 0 0 ^M 100 <---------- 0 0 0 0 ^M INVITE <---------- 50 0 0 0 ^M 100 ----------> 50 0 ^M 180 ----------> 50 0 ^M 180 <---------- 50 0 0 0 ^M Pause [ 2000ms] 50 0 ^M 200 ----------> 50 0 ^M 200 <---------- 50 0 0 0 ^M ACK ----------> 50 0 ^M ACK <---------- 40 0 10 0 ^M UPDATE ----------> 40 0 ^M UPDATE <---------- 40 0 0 0 ^M 200 ----------> 40 0 ^M 200 <---------- E-RTD2 40 0 0 0 ^M Pause [ 6000ms] 40 0 ^M BYE ----------> B-RTD3 40 0 ^M BYE <---------- 40 0 0 0 ^M 200 ----------> 40 0 ^M 200 <---------- E-RTD3 40 0 0 0 ^M ------------------------------ Test Terminated --------------------------------^M The messages with trigger timeouts are always in that point of the scenario. What is the problem? Any help? Luigi _______________________________________________ Clearwater mailing list [email protected]<mailto:[email protected]> http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
_______________________________________________ Clearwater mailing list [email protected] http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
