Hi,

On Thu, May 21, 2015 at 11:31:52AM +0530, Krishna Kumar (Engineering) wrote:
> Hi all,
>
> I am getting a big performance hit with SSL termination for small I/O,
> and errors when testing with bigger I/O sizes (ab version is 2.3):
>
> 1. Non-SSL vs SSL for small I/O (128 bytes):
>    ab -k -n 1000000 -c 500 http://<HAPROXY>/128
>
>    RPS: 181763.65 vs 133611.69 - 27% drop
>    BW:  63546.28  vs 46711.90  - 27% drop
That's expected and it matches my measurements. I've found up to about a
30% drop in keep-alive request rate between clear and SSL, which is much
better than I'd have expected.

> 2. Non-SSL vs SSL for medium I/O (16 KB):
>    ab -k -n 1000000 -c 500 http://<HAPROXY>/16K
>
>    RPS: 62646.13   vs 21876.33  (fails mostly with 70007 error as below) - 65% drop
>    BW:  1016531.41 vs 354977.59 (fails mostly with 70007 error)          - 65% drop

I suspect the BW unit is kilobytes per second above, though I could be
wrong. It matches what you can expect from bidirectional TCP streams over
a single 10G port. Thus in the clear case you're at about 8 Gbps of
payload bitrate in each direction, and in practice the link is close to
10 Gbps due to packet headers and the ACKs flying back. The SSL case
works out to about 2.8 Gbps of SSL traffic. Here it will depend on the
algorithm and the CPU frequency of course; I remember seeing about
4.6 Gbps on a single core at 3.5 GHz using AES-NI.

> 3. Non-SSL vs SSL for large I/O (128 KB):
>    ab -k -n 100000 -c 500 http://<HAPROXY>/128K
>
>    RPS: 8476.99    vs "apr_poll: The timeout specified has expired (70007)"
>    BW:  1086983.11 vs same error, this happens after 90000 requests
>    (always reproducible).

Hmmm, would you be running from multiple load generators connected via a
switch with small buffers? I've already experienced a similar situation
caused by the inability of the switch to buffer all the data sent by
haproxy and distribute it fairly to all clients. You can have a similar
issue when injecting over the same port on a saturated link and/or
switch, with the return ACKs getting lost because the link is full.

I'm thinking about something else: could you retry with fewer or more
total objects in the /128 case (or the 16K case)? The thing is that "ab"
starts all the connections at the same time, leading to parallel key
generation, and will then keep them all for the rest of the test. On a
reasonable machine this will not cost more than one second, but it would
be interesting to know how much of the drop is caused by the negotiation
and how much by the rest.

> ----------------------------------- HAProxy Build info -------------------------------------
> HA-Proxy version 1.5.12 2015/05/02
> Copyright 2000-2015 Willy Tarreau <[email protected]>
>
> Build options :
>   TARGET  = linux2628
>   CPU     = native
>   CC      = gcc
>   CFLAGS  = -O3 -march=native -g -fno-strict-aliasing
>   OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>
> Encrypted password support via crypt(3): yes
> Built with zlib version : 1.2.8
> Compression algorithms supported : identity, deflate, gzip
> Built with OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
> Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015

You may want to try openssl-1.0.2a, which significantly improved
performance in several use cases (a rough build sketch follows after the
quoted output below).

> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports prefer-server-ciphers : yes
> Built with PCRE version : 8.35 2014-04-04
> PCRE library supports JIT : no (USE_PCRE_JIT not set)
> Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
>
> Available polling systems :
>       epoll : pref=300, test result OK
>        poll : pref=200, test result OK
>      select : pref=150, test result OK
> Total: 3 (3 usable), will use epoll.
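For what it's worth, here is a rough sketch of how such a rebuild against
a locally installed 1.0.2a could look, reusing the build options shown in
your -vv output above. The /opt prefix and the source directory names are
only examples, adapt them to your environment:

    # Build and install OpenSSL 1.0.2a under a private prefix (example path).
    cd openssl-1.0.2a
    ./config no-shared --prefix=/opt/openssl-1.0.2a
    make && make install_sw

    # Rebuild haproxy 1.5.12 against it, keeping your existing options.
    cd ../haproxy-1.5.12
    make clean
    make TARGET=linux2628 CPU=native USE_ZLIB=1 USE_PCRE=1 USE_TFO=1 \
         USE_OPENSSL=1 SSL_INC=/opt/openssl-1.0.2a/include \
         SSL_LIB=/opt/openssl-1.0.2a/lib ADDLIB=-ldl

    # Check that the new binary really links against 1.0.2a.
    ./haproxy -vv | grep -i openssl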
> ------- Config file - even cpu cores are on 1st socket on the mb, odd cpus are on 2nd --------
> global
>     daemon
>     maxconn 50000
>     quiet
>     nbproc 6
>     cpu-map 1 0
>     cpu-map 2 2
>     cpu-map 3 4
>     cpu-map 4 6
>     cpu-map 5 8
>     cpu-map 6 10
>     user haproxy
>     group haproxy
>     stats socket /var/run/haproxy.sock mode 600 level admin
>     stats timeout 2m
>     tune.bufsize 32768
>
> userlist stats-auth
>     group admin users admin
>     user admin insecure-password admin
>
> defaults
>     mode http
>     maxconn 50000
>     retries 3
>     option forwardfor
>     option redispatch
>     option prefer-last-server
>     option splice-auto
>
> frontend www-http
>     bind-process 1 2 3
>     bind *:80

Here you have an imbalance at the kernel level: the same socket is bound
to several processes, so any process that picks up a connection will
process it, and the initial rush of key generation can make this
imbalance quite visible. You should do this instead to have 3 distinct
sockets, each in its own process (but beware, this requires a kernel
>= 3.9):

        bind *:80 process 1
        bind *:80 process 2
        bind *:80 process 3

>     stats uri /stats
>     stats enable
>     acl AUTH http_auth(stats-auth)
>     acl AUTH_ADMIN http_auth(stats-auth) admin
>     stats http-request auth unless AUTH
>     default_backend www-backend
>
> frontend www-https
>     bind-process 4 5 6
>     bind *:443 ssl crt /etc/ssl/private/haproxy.pem

Same comment here.

>     reqadd X-Forwarded-Proto:\ https
>     default_backend www-backend-ssl
>
> backend www-backend
>     bind-process 1 2 3
>     mode http
>     balance roundrobin
>     cookie FKSID prefix indirect nocache
>     server nginx-1 172.20.232.122:80 maxconn 25000 check
>     server nginx-2 172.20.232.125:80 maxconn 25000 check
>
> backend www-backend-ssl
>     bind-process 4 5 6
>     mode http
>     balance roundrobin
>     cookie FKSID prefix indirect nocache
>     server nginx-1 172.20.232.122:80 maxconn 25000 check
>     server nginx-2 172.20.232.125:80 maxconn 25000 check
> ---------------------------------------------------------------------------------------------------------------

Your config looks correct. TCP splicing will not be used for SSL, so it
only serves to improve clear-text performance. I don't see anything else
that needs to be changed for now.

Another thing that can be done is to compare the setup above with 6
processes per frontend. You can even have everything in the same frontend
by the way:

    frontend www-front
        bind *:80 process 1
        bind *:80 process 2
        bind *:80 process 3
        bind *:80 process 4
        bind *:80 process 5
        bind *:80 process 6
        bind *:443 process 1 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 2 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 3 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 4 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 5 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 6 ssl crt /etc/ssl/private/haproxy.pem

> CPU is E5-2670, 48 core system,

I fail to see how this is possible: the Xeon E5-2670 is 8-core and
supports at most 2-CPU configurations, so that's 16 cores max in total.

> nic interrupts are pinned to correct cpu's, etc.

OK. What do you mean by "correct"? You mean "the same CPU package as the
one running haproxy, so as not to pass the data over QPI", right? A quick
way to double-check this is sketched below.

> Can someone suggest what change is required to get better results as
> well as fix the 70007 error, or share their config settings? The stats
> are also captured.
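Regarding that interrupt pinning, here is a quick way to check that the
NIC queues really sit on the same package as the SSL processes (which
your cpu-map pins to CPUs 6, 8 and 10). "eth0" is only an example name,
replace it with your interface:

    # Which IRQs belong to the NIC, and which CPUs they are allowed on.
    grep eth0 /proc/interrupts
    for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}'); do
        echo -n "IRQ $irq -> CPUs "
        cat /proc/irq/$irq/smp_affinity_list
    done

    # NUMA node the NIC is attached to, and the CPU <-> node mapping.
    cat /sys/class/net/eth0/device/numa_node
    lscpu | grep -i numa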
> For 128 byte, all 3 haproxy's are running, but for 16K, and for 128K,
> only the last haproxy is being used (and seen consistently):

This certainly is a side effect of the imbalance above, combined with
"ab" keeping the same connections from the beginning to the end of the
test.

(...)

Regards,
Willy

