Hi Willy, 

Happy to follow up on the thread. Long story short: based on your
suggestions we ran further experiments with the setup, and the good news is
that things improved. Thank you. A short summary:
- CPU idle increased from 50% to 80%
- average system load decreased from 8 to 3
- software IRQ load dropped from ~10% to 0-1%
We think one node can now handle at least 60,000 HTTPS req/s. We have heard
of higher numbers, but I think this is quite an achievement already!

A little more detail below.

For the experiments we used three servers with identical hardware specs.
All tests were run against them with the same QPS, up to 45,000 HTTPS req/s.

First we tried removing the chaining of SSL to a single socket, as you
suggested in 3).
This did not visibly improve the software IRQ load, but it reduced the
average system load from 8 to 5 or less.

In the next stage we pinned ssl_termination, cleartext_http, and lb_backend
to specific processes and split the IRQ affinity: eth1 to CPUs 0 & 12 and
eth0 to CPUs 8 & 20.
This resulted in a noticeable improvement in all three areas mentioned above,
and it is now our production configuration.
What is fascinating about this configuration is that we applied more traffic
but the system load did not increase significantly: at 45,000 req/s the CPU
idle was still 75% and the average system load was 5.
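For reference, the masks written to /proc/irq/<N>/smp_affinity for this kind
of split can be computed rather than typed by hand. A minimal Python sketch
(the actual IRQ numbers have to be looked up in /proc/interrupts on each box;
nothing below is specific to our hardware):

```python
def affinity_mask(cpus):
    """Return the smp_affinity hex mask selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # one bit per CPU in the hex bitmap
    return format(mask, "x")

print(affinity_mask([0, 12]))     # eth1 -> CPUs 0 & 12  ->  1001
print(affinity_mask([8, 20]))     # eth0 -> CPUs 8 & 20  ->  100100
```

A mask is then applied with e.g. `echo 1001 > /proc/irq/<N>/smp_affinity`
for each of the NIC's IRQ lines.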

In the third stage we upgraded one of the nodes to Debian 8 to evaluate
whether SO_REUSEPORT from kernel 3.9+ would make any difference (we also
adjusted the haproxy config as suggested in 2).
We did not observe any visible improvement, though perhaps we simply did not
load the system enough. However, we noticed that the system load fluctuates
less under spiky/bursty inbound traffic, probably a sign of the system
becoming more stable. For now, we decided not to go with SO_REUSEPORT.
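As a side note, the kernel behaviour under test here can be demonstrated in
isolation: on Linux 3.9+ two sockets may bind the same address:port provided
both set SO_REUSEPORT before bind(), and the kernel then distributes incoming
connections across them. A minimal Python sketch of the option itself (just
an illustration, not haproxy internals):

```python
import socket

# First listener: SO_REUSEPORT must be set before bind().
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s1.bind(("127.0.0.1", 0))          # port 0: let the kernel pick a free port
port = s1.getsockname()[1]
s1.listen(128)

# Second listener on the *same* port: succeeds (no EADDRINUSE) because
# both sockets carry SO_REUSEPORT; the kernel load-balances accepts.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s2.bind(("127.0.0.1", port))
s2.listen(128)

print("both listening on port", port)
s1.close()
s2.close()
```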

We ran additional experiments, such as binding the NIC IRQs to a single
thread/physical CPU together with the backend, splitting NIC queue
affinities, etc., without any noticeable improvement.

We are not sure whether there is room left for TCP tweaks that we have not
tried changing during the experiments. Just in case, our final haproxy.cfg
looks as follows:

global 
daemon 
log 127.0.0.1 local0 
maxconn 100000 
nbproc 24 
cpu-map 1 0 
cpu-map 2 1 
cpu-map 3 2 
cpu-map 4 3 
cpu-map 5 4 
cpu-map 6 5 
cpu-map 7 6 
cpu-map 8 7 
cpu-map 9 8 
cpu-map 10 9 
cpu-map 11 10 
cpu-map 12 11 
cpu-map 13 12 
cpu-map 14 13 
cpu-map 15 14 
cpu-map 16 15 
cpu-map 17 16 
cpu-map 18 17 
cpu-map 19 18 
cpu-map 20 19 
cpu-map 21 20 
cpu-map 22 21 
cpu-map 23 22 
cpu-map 24 23 

tune.bufsize 16384 
spread-checks 4 
tune.maxrewrite 1024 
tune.maxpollevents 100 
tune.ssl.default-dh-param 2048 
pidfile /var/run/haproxy.pid 
stats socket 0.0.0.0:2001 process 1 
stats socket 0.0.0.0:2002 process 2 
stats socket 0.0.0.0:2003 process 3 
stats socket 0.0.0.0:2004 process 4 
stats socket 0.0.0.0:2005 process 5 
stats socket 0.0.0.0:2006 process 6 
stats socket 0.0.0.0:2007 process 7 
stats socket 0.0.0.0:2008 process 8 
stats socket 0.0.0.0:2009 process 9 
stats socket 0.0.0.0:2010 process 10 
stats socket 0.0.0.0:2011 process 11 
stats socket 0.0.0.0:2012 process 12 
stats socket 0.0.0.0:2013 process 13 
stats socket 0.0.0.0:2014 process 14 
stats socket 0.0.0.0:2015 process 15 
stats socket 0.0.0.0:2016 process 16 
stats socket 0.0.0.0:2017 process 17 
stats socket 0.0.0.0:2018 process 18 
stats socket 0.0.0.0:2019 process 19 
stats socket 0.0.0.0:2020 process 20 
stats socket 0.0.0.0:2021 process 21 
stats socket 0.0.0.0:2022 process 22 
stats socket 0.0.0.0:2023 process 23 
stats socket 0.0.0.0:2024 process 24 



defaults 
mode http 
timeout connect 30s 
timeout client 60s 
timeout server 30s 
timeout queue 60s 
timeout http-request 30s 
timeout http-keep-alive 30s 
option redispatch 
option tcplog 
option dontlog-normal 
option http-keep-alive 
option splice-auto 
option http-no-delay 
log global 

listen stats 
bind :4001 process 1 
bind :4002 process 2 
bind :4003 process 3 
bind :4004 process 4 
bind :4005 process 5 
bind :4006 process 6 
bind :4007 process 7 
bind :4008 process 8 
bind :4009 process 9 
bind :4010 process 10 
bind :4011 process 11 
bind :4012 process 12 
bind :4013 process 13 
bind :4014 process 14 
bind :4015 process 15 
bind :4016 process 16 
bind :4017 process 17 
bind :4018 process 18 
bind :4019 process 19 
bind :4020 process 20 
bind :4021 process 21 
bind :4022 process 22 
bind :4023 process 23 
bind :4024 process 24 
mode http 
stats enable 
stats hide-version 
stats realm Haproxy\ Statistics 
stats uri / 
stats auth someuser:somepass 


listen ssl_termination 
bind :443 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
bind-process 4 5 6 7 8 10 11 12 16 17 18 20 22 23 24 
default_backend lb_backend 

frontend cleartext_http 
bind :80 
default_backend lb_backend 
bind-process 2 3 14 15 

backend lb_backend 
mode http 
fullconn 100000 
option httpchk HEAD /lbcheck.jsp HTTP/1.0 
option accept-invalid-http-response 
option forwardfor 
balance roundrobin 
server node111 172.22.18.75:80 check 
server node112 172.22.18.76:80 check 
server node113 172.22.18.77:80 check 
server node114 172.22.18.78:80 check 
server node115 172.22.18.79:80 check 
server node116 172.22.18.80:80 check 
server node117 172.22.18.81:80 check 
server node118 172.22.18.82:80 check 
server node119 172.22.18.83:80 check 
server node120 172.22.18.84:80 check 
server node121 172.22.18.93:80 check 
server node122 172.22.18.94:80 check 
bind-process 2 3 14 15 




On Thursday, June 11, 2015 1:55 AM, Willy Tarreau <[email protected]> wrote:
Hi Eduard,

On Wed, Jun 10, 2015 at 04:56:31AM +0000, Eduard Rushanyan wrote:
> With few folks here we had some learning and already are experiencing quite
> good results with HAProxy. Wanted to first of all share that during the tests
> we achieved up to 45,000 requests per second on SSL on a single 1G box (with
> same setup/hw below). isn't that amazing? :) 

It's independent of the network connectivity. It also depends on
whether you're doing it in keep-alive, with close and TLS resume, or
with a new renegotiation on each request. Given the numbers, I'm
assuming that you're in TLS resume mode, because the numbers would
seem high for renegotiation (typically 500-1000 per core) and low
for requests (typically 100000 per core).

> Also wanted to ask for your opinion or advise on how we can possibly improve
> the setup further. It really feels like there is something more out there and
> we could tune up the setup further. 

I'm seeing room for improvement, as it's clear that you're not getting
the most out of your machine. We usually observe around 10000 conn/s
per core in TLS resume, so you're still far from this.

> Our use case is: 
> - high request per second traffic (very high PPS/packet per second) 
> - HTTPS 
> - hundreds of thousands of requests per second 
> - gigabytes of traffic /per second 
> - currently handled by hardware LoadBalancers --> aim to replace hardware
>   LoadBalancers with HAProxy 
> 
> What do we have currently in HAProxy: 
> Rate: 26,000 HTTPS requests per second, per single HAProxy server 
> CPU idle: 50% 
> System avg load: 8 
> Software IRQs %: ~10% 
> 
> What would be great to have: 
> - reduced system load 
> - more idle CPU 
> - ability to push more bandwidth or more requests per second 
> - no Software IRQs (or less), possibly less context switches/interrupts 

You'll have to pick 1 from the last 3 :-)

> Do you think it's possible to further improve current setup
> software/configuration wise? 

First I'm seeing a number of things you can change in your config.

1) all the stats instances can be simplified to a single one with
   all the individual ports, making it much simpler to declare and
   the config easier to read :

   listen stats
       bind :4001 process 1
       bind :4002 process 2
       ...
       bind :4024 process 24
       mode http
       stats enable
       stats hide-version
       stats realm Haproxy\ Statistics
       stats uri /
       stats auth someuser:somepass

2) you didn't specify any process binding in ssl_termination, so the
   kernel wakes all processes with incoming connections, and a few of
   them take some and the other ones go back to sleep. With a kernel
   3.9 or later, you can multiply the "bind" lines and bind each of them
   to a different process. The load will be much better distributed :

   listen ssl_termination
       bind 0.0.0.0:443 process 1 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
       bind 0.0.0.0:443 process 2 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
       ...

3) you're chaining the SSL instances to the clear-text instance,
   thus doubling the internal connection rate. In general this ensures
   that you have a single process which handles all the traffic, but in
   your case that's not true since all 24 processes can randomly receive
   the connection :

   listen ssl_termination
       bind 0.0.0.0:443 ssl crt /webapps/ssl/haproxy.new.crt ciphers AES-128-CBC:HIGH:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!AESGCM no-sslv3
       server cleartext_http abns@haproxy-clear-listener send-proxy-v2

   frontend cleartext_http
      bind 0.0.0.0:80
      bind abns@haproxy-clear-listener accept-proxy
      default_backend lb_backend

   I'd suggest that you either avoid this bouncing or limit the number
   of processes listening to clear text. If you're fine with running
   the backend and cleartext frontend on all processes, then you can
   simply put the "default_backend" rule in "ssl_termination" instead
   of "server".

4) as you can see in /proc/interrupts, the ethernet interrupts are
   spread over all threads of the first CPU socket, so the haproxy
   processes running on the same cores are competing with the softirq
   on the same threads, forcing the load to be unequal.
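The imbalance is easy to quantify by summing the per-CPU columns of
/proc/interrupts for the ethernet IRQ lines. A minimal Python sketch; the
sample lines below are invented to show the format, and a real run would
read /proc/interrupts instead:

```python
# Hypothetical /proc/interrupts excerpt, made up for illustration only.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  41:    1200000     950000    1100000     980000   PCI-MSI-edge   eth0-TxRx-0
  42:     300000    1250000     400000    1180000   PCI-MSI-edge   eth1-TxRx-0
"""

def eth_irqs_per_cpu(text):
    """Sum, per CPU column, the interrupts delivered for eth* IRQ lines."""
    lines = text.splitlines()
    cpus = lines[0].split()                    # header: ['CPU0', 'CPU1', ...]
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        if not any(f.startswith("eth") for f in fields):
            continue                           # not an ethernet IRQ line
        # fields[0] is "41:", then one count per CPU column
        for cpu, count in zip(cpus, fields[1:]):
            totals[cpu] += int(count)
    return totals

print(eth_irqs_per_cpu(SAMPLE))
```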

The last point is the trickiest to adjust and you'll have to experiment
a lot. In general, what is important to know :
  - avoid inter-CPU communications as much as possible, as they come
    with latency and cache flushes ;

  - SSL processing is still faster on multiple CPU sockets than what
    you save by avoiding such communications above.

Given that you're limited to 1 Gbps, the softirq load will remain very
low so in my opinion you should limit your IRQs to just a few cores.
Note that using threads for IRQs is interesting because it increases
cache locality and still provides a nice boost (I've observed about
20% perf increase by using 2 threads from the same core over just 1).

Also, avoid mixing haproxy and softirq on the same core or threads of
the same core. It reduces cache hit ratio.

Since you're communicating in clear text to the servers, you must
absolutely have the backend running on the same CPU socket as the
network IRQs.

My suggestion would be to try something like this (and then you can
adjust to experiment) :

- bind eth0 and eth1 IRQs to each thread of the first two cores of
  the first CPU socket. That should be 0, 1, 12, 13 I guess. For
  your load, 4 threads to deal with both NICs' IRQs should be far
  more than enough.

- bind the clear-text frontend+backend to all threads of one or
  two cores of the first CPU socket. Let's try with 2, 3, 14, 15.

- bind the SSL frontend to all other threads.

I suspect that you can easily remove two threads for network interrupts,
and that you can possibly do the same for the clear-text frontend,
except if you manage to reach more than 50-100k conn/s.

If you want to stay on 4 threads for interrupts, then you should be
able to slightly improve the results by binding eth0 to the second
CPU socket only and eth1 to the first one only. Indeed, eth1 will
exclusively be used for clear text while eth0 will exclusively be
used by SSL. So that can improve performance for half of the SSL
threads.

Thus I think you could end up with something like this :

  - threads 0, 12 : eth1 irqs
  - threads 1, 2, 13, 14 : haproxy clear
  - threads 8, 20 : eth0 irqs
  - all other threads : haproxy SSL (16 threads total)

I'd expect that you should reach roughly 80k conn/s with such a setup.
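As a sanity check, the proposed split accounts for every hardware thread on
a 24-thread box (assuming, as above, that sibling hyperthreads are n and
n+12):

```python
# Thread budget for the suggested layout on a 24-thread machine
# (2 sockets x 6 cores x 2 hyperthreads; sibling threads n and n+12).
all_threads = set(range(24))
eth1_irq = {0, 12}            # eth1 interrupts
clear    = {1, 2, 13, 14}     # haproxy clear-text frontend + backend
eth0_irq = {8, 20}            # eth0 interrupts
ssl = all_threads - eth1_irq - clear - eth0_irq
print(sorted(ssl), len(ssl))  # the 16 threads left for haproxy SSL
```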

> OpenSSL: 
> ./config --prefix=$LIBSSLBUILD no-shared no-ssl2 no-ssl3 -DOPENSSL_USE_IPV6=0 
> no-err enable-ec_nistp_64_gcc_128 zlib 

You should double-check if the line above enables arch-specific ASM
optimizations or not (look for .o files in some asm/ subdirs). I seem
to remember that it was required to explicitly specify x86_64 somewhere,
but I could be wrong.

> HAProxy: 
> make TARGET=linux2628 CPU=native USE_PCRE=1 USE_OPENSSL=1 USE_ZLIB=1 
> USE_TFO=1 ADDINC=-I$LIBSSLBUILD/include ADDLIB="-L$LIBSSLBUILD/lib -ldl" 

fine.


Willy
