Hi,

On Thu, May 21, 2015 at 11:31:52AM +0530, Krishna Kumar (Engineering) wrote:
> Hi all,
>
> I am getting a big performance hit with SSL termination for small I/O,
> and errors when testing with bigger I/O sizes (ab version is 2.3):
>
> 1. Non-SSL vs SSL for small I/O (128 bytes):
>    ab -k -n 1000000 -c 500 http://<HAPROXY>/128
>
>    RPS: 181763.65 vs 133611.69 - 27% drop
>    BW:  63546.28  vs 46711.90  - 27% drop
That's expected and it matches my measurements. I've found up to about a
30% drop in keep-alive request rate between clear and SSL, which is much
better than I'd have expected.

> 2. Non-SSL vs SSL for medium I/O (16 KB):
>    ab -k -n 1000000 -c 500 http://<HAPROXY>/16K
>
>    RPS: 62646.13   vs 21876.33  (fails mostly with 70007 error as below) - 65% drop
>    BW:  1016531.41 vs 354977.59 (fails mostly with 70007 error)          - 65% drop

I suspect the BW unit is kilobytes per second above, though I could be
wrong. It matches what you can expect from bidirectional TCP streams over
a single 10G port. Thus in the clear case you're at about 8 Gbps of
payload bitrate in each direction, and in practice the link is close to
10 Gbps due to packet headers and the ACKs flying back. The SSL case
works out to about 2.8 Gbps of SSL traffic. Here it will depend on the
algorithm and the CPU frequency of course; I remember seeing about
4.6 Gbps on a single core at 3.5 GHz using AES-NI.

> 3. Non-SSL vs SSL for large I/O (128 KB):
>    ab -k -n 100000 -c 500 http://<HAPROXY>/128K
>
>    RPS: 8476.99    vs "apr_poll: The timeout specified has expired (70007)"
>    BW:  1086983.11 vs same error, this happens after 90000 requests
>    (always reproducible).

Hmmm, would you be running from multiple load generators connected via a
switch with small buffers? I've already experienced a similar situation
caused by the inability of the switch to buffer all the data sent by
haproxy and distribute it fairly to all clients. You can have a similar
issue when injecting over the same port on a saturated link and/or
switch, with the return ACKs getting lost because the link is full.

I'm thinking about something else: could you retry with fewer or more
total objects in the /128 case (or the 16K case)? The thing is that "ab"
starts all the connections at the same time, leading to parallel key
generation, and will then keep them all for the rest of the test. On a
reasonable machine this will not cost more than one second, but it would
be interesting to know how much of the drop is caused by the negotiation
and how much by the rest.

> ----------------------------------- HAProxy Build info -------------------------------------
> HA-Proxy version 1.5.12 2015/05/02
> Copyright 2000-2015 Willy Tarreau <[email protected]>
>
> Build options :
>   TARGET  = linux2628
>   CPU     = native
>   CC      = gcc
>   CFLAGS  = -O3 -march=native -g -fno-strict-aliasing
>   OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>
> Encrypted password support via crypt(3): yes
> Built with zlib version : 1.2.8
> Compression algorithms supported : identity, deflate, gzip
> Built with OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
> Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015

You may want to try openssl-1.0.2a, which significantly improved
performance in several use cases (a rough build sketch follows after the
quoted output below).

> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports prefer-server-ciphers : yes
> Built with PCRE version : 8.35 2014-04-04
> PCRE library supports JIT : no (USE_PCRE_JIT not set)
> Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
>
> Available polling systems :
>       epoll : pref=300, test result OK
>        poll : pref=200, test result OK
>      select : pref=150, test result OK
> Total: 3 (3 usable), will use epoll.
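For what it's worth, here is a rough sketch of how such a rebuild against
a locally installed 1.0.2a could look, reusing the build options shown in
your -vv output above. The /opt prefix and the source directory names are
only examples, adapt them to your environment:

    # Build and install OpenSSL 1.0.2a under a private prefix (example path).
    cd openssl-1.0.2a
    ./config no-shared --prefix=/opt/openssl-1.0.2a
    make && make install_sw

    # Rebuild haproxy 1.5.12 against it, keeping your existing options.
    cd ../haproxy-1.5.12
    make clean
    make TARGET=linux2628 CPU=native USE_ZLIB=1 USE_PCRE=1 USE_TFO=1 \
         USE_OPENSSL=1 SSL_INC=/opt/openssl-1.0.2a/include \
         SSL_LIB=/opt/openssl-1.0.2a/lib ADDLIB=-ldl

    # Check that the new binary really links against 1.0.2a.
    ./haproxy -vv | grep -i openssl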
> ------- Config file - even cpu cores are on 1st socket on the mb, odd cpus are on 2nd --------
> global
>     daemon
>     maxconn 50000
>     quiet
>     nbproc 6
>     cpu-map 1 0
>     cpu-map 2 2
>     cpu-map 3 4
>     cpu-map 4 6
>     cpu-map 5 8
>     cpu-map 6 10
>     user haproxy
>     group haproxy
>     stats socket /var/run/haproxy.sock mode 600 level admin
>     stats timeout 2m
>     tune.bufsize 32768
>
> userlist stats-auth
>     group admin users admin
>     user admin insecure-password admin
>
> defaults
>     mode http
>     maxconn 50000
>     retries 3
>     option forwardfor
>     option redispatch
>     option prefer-last-server
>     option splice-auto
>
> frontend www-http
>     bind-process 1 2 3
>     bind *:80

Here you have an imbalance at the kernel level: the same socket is bound
to several processes, so any process that picks up a connection will
process it, and the initial rush of key generation can make this
imbalance quite visible. You should do this instead to have 3 distinct
sockets, each in its own process (but beware, this requires a kernel
>= 3.9):

        bind *:80 process 1
        bind *:80 process 2
        bind *:80 process 3

>     stats uri /stats
>     stats enable
>     acl AUTH http_auth(stats-auth)
>     acl AUTH_ADMIN http_auth(stats-auth) admin
>     stats http-request auth unless AUTH
>     default_backend www-backend
>
> frontend www-https
>     bind-process 4 5 6
>     bind *:443 ssl crt /etc/ssl/private/haproxy.pem

Same comment here.

>     reqadd X-Forwarded-Proto:\ https
>     default_backend www-backend-ssl
>
> backend www-backend
>     bind-process 1 2 3
>     mode http
>     balance roundrobin
>     cookie FKSID prefix indirect nocache
>     server nginx-1 172.20.232.122:80 maxconn 25000 check
>     server nginx-2 172.20.232.125:80 maxconn 25000 check
>
> backend www-backend-ssl
>     bind-process 4 5 6
>     mode http
>     balance roundrobin
>     cookie FKSID prefix indirect nocache
>     server nginx-1 172.20.232.122:80 maxconn 25000 check
>     server nginx-2 172.20.232.125:80 maxconn 25000 check
> ---------------------------------------------------------------------------------------------------------------

Your config looks correct. TCP splicing will not be used for SSL, so it
only serves to improve clear-text performance. I don't see anything else
that needs to be changed for now.

Another thing that can be done is to compare the setup above with 6
processes per frontend. You can even have everything in the same frontend
by the way:

    frontend www-front
        bind *:80 process 1
        bind *:80 process 2
        bind *:80 process 3
        bind *:80 process 4
        bind *:80 process 5
        bind *:80 process 6
        bind *:443 process 1 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 2 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 3 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 4 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 5 ssl crt /etc/ssl/private/haproxy.pem
        bind *:443 process 6 ssl crt /etc/ssl/private/haproxy.pem

> CPU is E5-2670, 48 core system,

I fail to see how this is possible: the Xeon E5-2670 is 8-core and
supports at most 2-CPU configurations, so that's 16 cores max in total.

> nic interrupts are pinned to correct cpu's, etc.

OK. What do you mean by "correct"? You mean "the same CPU package as the
one running haproxy, so as not to pass the data over QPI", right? A quick
way to double-check this is sketched below.

> Can someone suggest what change is required to get better results as
> well as fix the 70007 error, or share their config settings? The stats
> are also captured.
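Regarding that interrupt pinning, here is a quick way to check that the
NIC queues really sit on the same package as the SSL processes (which
your cpu-map pins to CPUs 6, 8 and 10). "eth0" is only an example name,
replace it with your interface:

    # Which IRQs belong to the NIC, and which CPUs they are allowed on.
    grep eth0 /proc/interrupts
    for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}'); do
        echo -n "IRQ $irq -> CPUs "
        cat /proc/irq/$irq/smp_affinity_list
    done

    # NUMA node the NIC is attached to, and the CPU <-> node mapping.
    cat /sys/class/net/eth0/device/numa_node
    lscpu | grep -i numa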
> For 128 byte, all 3 haproxy's are running, but for 16K, and for 128K,
> only the last haproxy is being used (and seen consistently):

This certainly is a side effect of the imbalance above, combined with
"ab" keeping the same connections from the beginning to the end of the
test.

(...)

Regards,
Willy

