Hi Michael,

It’s worth noting that Project Clearwater is designed to scale horizontally 
rather than vertically, so we would expect multiple less powerful Sprout nodes 
to outperform a single powerful one. However, that doesn’t mean that your 
Sprout node isn’t capable of handling the load you’re hitting it with.

We do expose latency measurements over SNMP – see 
http://clearwater.readthedocs.io/en/stable/Clearwater_SNMP_Statistics.html for 
more details. In particular, the Sprout statistics include latency for SIP 
requests and latency for requests to Homestead. There are a couple of other 
statistics that might be useful for determining where exactly your requests are 
failing – if the number of initial registration failures and/or the number of 
authentication failures is non-zero, this would indicate that the bottleneck is 
actually at Homestead.
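
As a quick way to pull those statistics off the node, something like the 
following should work (a sketch only – the community string and the OID root 
shown here are assumptions, so substitute the values from your own SNMP 
configuration and the statistics documentation):

  # Walk the Clearwater statistics subtree on the Sprout node (the community
  # string "clearwater" and the OID root are assumptions - check your
  # deployment's SNMP settings for the real values).
  snmpwalk -v2c -c clearwater <sprout-node-ip> .1.2.826.0.1.1578918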

It does sound as though Sprout is reporting itself as overloaded even though it 
could handle more requests. As I mentioned previously, Sprout will tweak its 
overload controls to rectify this, but it won’t be immediate, which might 
explain the failures. I know you’ve already tried tweaking the token controls, 
but it might be worth looking at them again. Over the course of the minute of 
your test, I think we expect to receive 60,000 REGISTERs (2 per subscriber), 
and they should be evenly distributed, so we’re expecting 1,000 requests per 
second. Have you tried setting init_token_rate to 1000? You’ll want to make 
sure this change is picked up on both Sprout and Homestead – you can do this by 
editing /etc/clearwater/shared_config on a single node and running 
/usr/share/clearwater/clearwater-config-manager/scripts/upload_shared_config. 
After a few minutes the change will have propagated around the deployment.
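
Concretely, something along these lines should do it (a sketch – this assumes 
init_token_rate isn’t already set in shared_config; if it is, just edit the 
existing line instead):

  # Append the new setting to the shared configuration on one node.
  echo "init_token_rate=1000" | sudo tee -a /etc/clearwater/shared_config
  # Push the change out to the rest of the deployment.
  sudo /usr/share/clearwater/clearwater-config-manager/scripts/upload_shared_config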

Thanks,
Graeme

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On 
Behalf Of Michael Katsoulis
Sent: 19 September 2016 08:42
To: clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] Performance limit measurement

Hi Graeme,

I created a simpler scenario compared to what the SIP Stress testing uses. In 
each scenario two subscribers just try to register to the IMS and do not make 
any call to each other. I ran this scenario for 15000 pairs of subscribers 
(30000 subscribers). The REGISTER requests are distributed over 1 minute. It 
seems that the Sprout node is the bottleneck. The return code of most of the 
failed messages is 503 (Service Unavailable), and of some of them 408 (Request 
Timeout). I have added resources to Sprout (4 CPUs and 8 GB memory), so I don't 
believe resources are the issue.

Does Sprout somehow expose the latency measurements that lead to the 
throttling? We would like to take a look at them.



Here is the XML file:


<scenario name="Call Load Test">

  <User variables="my_dn,peer_dn,call_repeat" />
  <nop hide="true">
    <action>
      <!-- Get my and peer's DN -->
      <assignstr assign_to="my_dn" value="[field0]" />
      <!-- field1 is my_auth, but we can't store it in a variable -->
      <assignstr assign_to="peer_dn" value="[field2]" />
      <!-- field3 is peer_auth, but we can't store it in a variable -->
      <assign assign_to="reg_repeat" value="0"/>
      <assign assign_to="call_repeat" value="0"/>
    </action>
  </nop>

  <pause distribution="uniform" min="0" max="60000" />

  <send>
    <![CDATA[

      REGISTER sip:[$my_dn]@[service] SIP/2.0
      Via: SIP/2.0/[transport] [local_ip]:[local_port];rport;branch=[branch]-[$my_dn]-[$reg_repeat]
      Route: <sip:[service];transport=[transport];lr>
      Max-Forwards: 70
      From: <sip:[$my_dn]@[service]>;tag=[pid]SIPpTag00[call_number]
      To: <sip:[$my_dn]@[service]>
      Call-ID: [$my_dn]///[call_id]
      CSeq: [cseq] REGISTER
      User-Agent: Accession 4.0.0.0
      Supported: outbound, path
      Contact: <sip:[$my_dn]@[local_ip]:[local_port];transport=[transport];ob>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-0000-0000-000000000001>"
      Expires: 3600
      Allow: PRACK, INVITE, ACK, BYE, CANCEL, UPDATE, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS
      Content-Length: 0

    ]]>
  </send>

  <recv response="401" auth="true">
    <action>
      <add assign_to="reg_repeat" value="1" />
    </action>
  </recv>

  <send>
    <![CDATA[

      REGISTER sip:[$my_dn]@[service] SIP/2.0
      Via: SIP/2.0/[transport] [local_ip]:[local_port];rport;branch=[branch]-[$my_dn]-[$reg_repeat]
      Route: <sip:[service];transport=[transport];lr>
      Max-Forwards: 70
      From: <sip:[$my_dn]@[service]>;tag=[pid]SIPpTag00[call_number]
      To: <sip:[$my_dn]@[service]>
      Call-ID: [$my_dn]///[call_id]
      CSeq: [cseq] REGISTER
      User-Agent: Accession 4.0.0.0
      Supported: outbound, path
      Contact: <sip:[$my_dn]@[local_ip]:[local_port];transport=[transport];ob>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-0000-0000-000000000001>"
      Expires: 3600
      [field1]
      Allow: PRACK, INVITE, ACK, BYE, CANCEL, UPDATE, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS
      Content-Length: 0

    ]]>
  </send>

  <recv response="200">
    <action>
      <ereg regexp="rport=([^;]*);.*received=([^;]*);" search_in="hdr" header="Via:" assign_to="dummy" />
      <add assign_to="reg_repeat" value="1" />
    </action>
  </recv>
  <Reference variables="dummy" />

  <send>
    <![CDATA[

      REGISTER sip:[$peer_dn]@[service] SIP/2.0
      Via: SIP/2.0/[transport] [local_ip]:[local_port];rport;branch=[branch]-[$peer_dn]-[$reg_repeat]
      Route: <sip:[service];transport=[transport];lr>
      Max-Forwards: 70
      From: <sip:[$peer_dn]@[service]>;tag=[pid]SIPpTag00[call_number]
      To: <sip:[$peer_dn]@[service]>
      Call-ID: [$peer_dn]///[call_id]
      CSeq: [cseq] REGISTER
      User-Agent: Accession 4.0.0.0
      Supported: outbound, path
      Contact: <sip:[$peer_dn]@[local_ip]:[local_port];transport=[transport];ob>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-0000-0000-000000000001>"
      Expires: 3600
      Allow: PRACK, INVITE, ACK, BYE, CANCEL, UPDATE, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS
      Content-Length: 0

    ]]>
  </send>

  <recv response="401" auth="true">
    <action>
      <add assign_to="reg_repeat" value="1" />
    </action>
  </recv>

  <send>
    <![CDATA[

      REGISTER sip:[$peer_dn]@[service] SIP/2.0
      Via: SIP/2.0/[transport] [local_ip]:[local_port];rport;branch=[branch]-[$peer_dn]-[$reg_repeat]
      Route: <sip:[service];transport=[transport];lr>
      Max-Forwards: 70
      From: <sip:[$peer_dn]@[service]>;tag=[pid]SIPpTag00[call_number]
      To: <sip:[$peer_dn]@[service]>
      Call-ID: [$peer_dn]///[call_id]
      CSeq: [cseq] REGISTER
      User-Agent: Accession 4.0.0.0
      Supported: outbound, path
      Contact: <sip:[$peer_dn]@[local_ip]:[local_port];transport=[transport];ob>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-0000-0000-000000000001>"
      Expires: 3600
      [field3]
      Allow: PRACK, INVITE, ACK, BYE, CANCEL, UPDATE, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS
      Content-Length: 0

    ]]>
  </send>

  <recv response="200">
    <action>
      <add assign_to="reg_repeat" value="1" />
    </action>
  </recv>

</scenario>
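
A rough sketch of how a scenario like this can be launched with SIPp (the file 
names, target address and CSV layout are illustrative rather than exactly what 
we use):

  # Start all 15000 scenario instances (one per subscriber pair) quickly; the
  # uniform pause at the top of the scenario then spreads the REGISTERs over
  # the first minute.
  sipp <p-cscf-ip>:5060 -sf register_pairs.xml -inf subscribers.csv \
       -t t1 -r 500 -m 15000 -l 15000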


Best Regards,
Michael Katsoulis



2016-09-16 21:25 GMT+03:00 Graeme Robertson (projectclearwater.org) 
<gra...@projectclearwater.org>:
Hi Michael,

Can you tell me more about your scenario? It sounds like you’re not using the 
clearwater-sip-stress package, or at least not in exactly the form we package 
it up. If you’re not using the clearwater-sip-stress package, could you please 
send details of your stress scenario?

Depending on how powerful your Sprout node is, I would expect 15000 calls per 
second to be towards the upper limit of its performance. However, if the CPU 
usage is not particularly high, that would suggest that Sprout’s throttling 
controls might require further tuning. Do you know what return code the 
“unexpected messages” have? 503s indicate that there is overload somewhere. 
Sprout does adjust its throttling controls to match the load it’s able to 
process, but that process is not immediate, and we recommend building stress up 
gradually rather than immediately firing 15000 calls per second into the system 
– for more information on that, see 
http://www.projectclearwater.org/clearwater-performance-and-our-load-monitor/.
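
If you’re driving the load with SIPp directly, its rate-increase options are 
one way to ramp up gradually – roughly along these lines (a sketch only; the 
scenario and CSV file names are placeholders, and it’s worth checking the exact 
flag names against your SIPp version):

  # Start at 100 calls/s and add 100 calls/s every 10 seconds, capping at 1000 calls/s.
  sipp <sprout-or-bono-ip>:5060 -sf stress_scenario.xml -inf subscribers.csv \
       -r 100 -fd 10 -rate_increase 100 -rate_max 1000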

One final thought I had was that the node you’re running stress on might itself 
be overloaded. If the stress node is not responding to messages in a timely 
fashion then that will generate timeouts and unexpected messages.

Thanks,
Graeme

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On 
Behalf Of Michael Katsoulis
Sent: 16 September 2016 15:16
To: clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] Performance limit measurement

Hi Graeme,

Thanks a lot for your response.

In our scenario we are using the Stress node to generate 15000 calls in 60 
seconds. The number of unsuccessful calls varies from ~500 to ~5000, even in 
subsequent repetitions of the same scenario. According to Wireshark, the 
failures happen because Sprout does not send the correct responses in time, so 
we get "time-outs" and "unexpected messages" on the Stress node. The Sprout 
node has sufficient CPU and memory resources. What could be the reason for this 
instability in our deployment?

Thank you in advance,
Michael Katsoulis

2016-09-16 16:14 GMT+03:00 Graeme Robertson (projectclearwater.org) 
<gra...@projectclearwater.org>:
Hi Michael,

How many successes and failures are you seeing? We primarily use the 
clearwater-sip-stress package to check we haven’t introduced crashes under 
load, and to check we haven’t significantly regressed the performance of 
Project Clearwater. Unfortunately clearwater-sip-stress is not reliable enough 
to generate completely accurate performance numbers for Project Clearwater (and 
we don’t accurately measure Project Clearwater performance or provide numbers). 
We tend to see around 1% failures when running clearwater-sip-stress. If your 
failure numbers are fluctuating at around 1% then this is probably down to the 
test scripts not being completely reliable, and you won’t have actually hit the 
deployment’s limit until you start seeing more failures than this.

Thanks,
Graeme


From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On 
Behalf Of Michael Katsoulis
Sent: 16 September 2016 10:17
To: Clearwater@lists.projectclearwater.org
Subject: [Project Clearwater] Performance limit measurement

Hi all,

We are running stress tests against our Clearwater deployment using the SIP 
Stress node. We have noticed that the results are not consistent, as the number 
of successful calls changes between repetitions of the same test scenario.

We have tried increasing the values of max_tokens, init_token_rate, 
min_token_rate and target_latency_us, but we did not observe any difference.

What is the proposed way to discover the deployment's limit on how many 
requests per second can be served?

Thanks in advance,
Michael Katsoulis

_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
