Re: SSL/ECC and nbproc >1
On 2016-11-25 15:26, Willy Tarreau wrote:
> On Fri, Nov 25, 2016 at 02:44:35PM +0100, Christian Ruppert wrote:
>> I have a default bind for process 1 which is basically the http frontend
>> and the actual backend. RSA is bound to another, single process and ECC
>> is bound to all the rest. So in this case SSL (in particular ECC) is the
>> problem. The connections/handshakes should *actually* be using CPU 3 up
>> to NCPU.
>
> That's exactly what I'm talking about. Look, you have this:
>
>     frontend ECC
>         bind-process 3-36
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC
>         mode http
>         default_backend bk_ram
>
> It creates a single socket (hence a single queue) and shares it between
> all processes. Thus each incoming connection will wake up all processes
> not doing anything, and the first one capable of grabbing it will take it,
> as well as a few of the following ones if any. You end up with a very
> unbalanced load, making it hard to scale. Instead you can do this:
>
>     frontend ECC
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 3
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 4
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 5
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 6
>         ...
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 36
>         mode http
>         default_backend bk_ram
>
> You'll really have 34 listening sockets, all fairly balanced, each with
> its own queue. You can generally achieve higher loads this way, and with
> a lower average latency. Also, I tend to bind network IRQs to the same
> cores as those doing SSL, because you hardly have the two at once: SSL is
> not able to deal with traffic capable of saturating a NIC driver, so when
> SSL saturates the CPU you have little traffic, and when the NIC requires
> all the CPU for high traffic, you know there's little SSL.
>
> Cheers,
> Willy

Ah! Thanks! I had to remove the default "bind-process 1", or alternatively
set "bind-process 3-36" in the ECC frontend, though. I guess it amounts to
the same thing in the end. Anyway, the IRQ/NIC problem was still the same.
I'll set it up that way anyway if that's better, together with the Intel
affinity script or, as you said, with the IRQs bound to the same cores that
do SSL. Let's see how well that performs.

-- 
Regards,
Christian Ruppert
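For reference, a minimal sketch of the combined setup Christian describes,
assuming the same 36-process layout and certificate path as above (a
frontend-level "bind-process" overrides the one inherited from defaults):

    defaults
        # either drop "bind-process 1" here, or keep it and
        # override it per frontend as below

    frontend ECC
        bind-process 3-36
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 3
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 4
        # ... one bind line per process, up to:
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 36
        mode http
        default_backend bk_ram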
Re: SSL/ECC and nbproc >1
On Fri, Nov 25, 2016 at 02:44:35PM +0100, Christian Ruppert wrote:
> I have a default bind for process 1 which is basically the http frontend
> and the actual backend. RSA is bound to another, single process and ECC
> is bound to all the rest. So in this case SSL (in particular ECC) is the
> problem. The connections/handshakes should *actually* be using CPU 3 up
> to NCPU.

That's exactly what I'm talking about. Look, you have this:

    frontend ECC
        bind-process 3-36
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC
        mode http
        default_backend bk_ram

It creates a single socket (hence a single queue) and shares it between all
processes. Thus each incoming connection will wake up all processes not
doing anything, and the first one capable of grabbing it will take it, as
well as a few of the following ones if any. You end up with a very
unbalanced load, making it hard to scale. Instead you can do this:

    frontend ECC
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 3
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 4
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 5
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 6
        ...
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 36
        mode http
        default_backend bk_ram

You'll really have 34 listening sockets, all fairly balanced, each with its
own queue. You can generally achieve higher loads this way, and with a
lower average latency. Also, I tend to bind network IRQs to the same cores
as those doing SSL, because you hardly have the two at once: SSL is not
able to deal with traffic capable of saturating a NIC driver, so when SSL
saturates the CPU you have little traffic, and when the NIC requires all
the CPU for high traffic, you know there's little SSL.

Cheers,
Willy
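As a side note, the 34 near-identical bind lines lend themselves to being
generated rather than typed; a minimal shell sketch (not from the thread,
assuming the same port and certificate path) that prints them for pasting
into the frontend:

    # print one "bind ... process N" line for each of processes 3..36
    for p in $(seq 3 36); do
        printf '    bind :65420 ssl crt /etc/haproxy/test.pem-ECC process %d\n' "$p"
    done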
Re: SSL/ECC and nbproc >1
On 2016-11-25 14:44, Christian Ruppert wrote:
> Hi Willy,
>
> On 2016-11-25 14:30, Willy Tarreau wrote:
>> Hi Christian,
>>
>> On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
>>> I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
>>> make much of a difference so far. I also tried (in this case) disabling
>>> HT entirely and setting it to max. 36 procs. Basically the same as
>>> before.
>>
>> Also, you definitely need to split your bind lines, one per process, to
>> take advantage of the kernel's ability to load balance between multiple
>> queues. Otherwise the load is always unequal and many processes are
>> woken up for nothing.
>
> I have a default bind for process 1 which is basically the http frontend
> and the actual backend. RSA is bound to another, single process and ECC
> is bound to all the rest. So in this case SSL (in particular ECC) is the
> problem. The connections/handshakes should *actually* be using CPU 3 up
> to NCPU. The only shared part should be the backend, but that should
> actually be no problem for e.g. 5 parallel benchmarks, as a single HTTP
> benchmark can make >20k requests/s.
>
>     global
>         nbproc 36
>
>     defaults
>         bind-process 1
>
>     frontend http
>         bind :65410
>         mode http
>         default_backend bk_ram
>
>     frontend ECC
>         bind-process 3-36
>         bind :65420 ssl crt /etc/haproxy/test.pem-ECC
>         mode http
>         default_backend bk_ram
>
>     backend bk_ram
>         mode http
>         fullconn 75000
>         errorfile 503 /etc/haproxy/test.error
>
>> Regards,
>> Willy

It seems to be the NIC, or rather the driver/kernel. Using Intel's
set_irq_affinity script (set_irq_affinity -x local eth2 eth3) seems to do
the trick, at least at first glance.

-- 
Regards,
Christian Ruppert
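For readers without Intel's helper script: a rough sketch of what
set_irq_affinity does under the hood, using plain procfs (the IRQ number
123 below is purely hypothetical; read the real ones from /proc/interrupts
first):

    # list the NIC's per-queue IRQs (ixgbe creates one per RX/TX queue)
    grep eth2 /proc/interrupts
    # pin one of them (here the hypothetical IRQ 123) to CPU 4
    echo 4 > /proc/irq/123/smp_affinity_list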
Re: SSL/ECC and nbproc >1
Hi Willy,

On 2016-11-25 14:30, Willy Tarreau wrote:
> Hi Christian,
>
> On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
>> I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
>> make much of a difference so far. I also tried (in this case) disabling
>> HT entirely and setting it to max. 36 procs. Basically the same as
>> before.
>
> Also, you definitely need to split your bind lines, one per process, to
> take advantage of the kernel's ability to load balance between multiple
> queues. Otherwise the load is always unequal and many processes are woken
> up for nothing.

I have a default bind for process 1 which is basically the http frontend
and the actual backend. RSA is bound to another, single process and ECC is
bound to all the rest. So in this case SSL (in particular ECC) is the
problem. The connections/handshakes should *actually* be using CPU 3 up to
NCPU. The only shared part should be the backend, but that should actually
be no problem for e.g. 5 parallel benchmarks, as a single HTTP benchmark
can make >20k requests/s.

    global
        nbproc 36

    defaults
        bind-process 1

    frontend http
        bind :65410
        mode http
        default_backend bk_ram

    frontend ECC
        bind-process 3-36
        bind :65420 ssl crt /etc/haproxy/test.pem-ECC
        mode http
        default_backend bk_ram

    backend bk_ram
        mode http
        fullconn 75000
        errorfile 503 /etc/haproxy/test.error

> Regards,
> Willy

-- 
Regards,
Christian Ruppert
Re: SSL/ECC and nbproc >1
Hi Christian,

On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
> I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
> make much of a difference so far.
> I also tried (in this case) to disable HT entirely and set it to max. 36
> procs. Basically the same as before.

Also, you definitely need to split your bind lines, one per process, to
take advantage of the kernel's ability to load balance between multiple
queues. Otherwise the load is always unequal and many processes are woken
up for nothing.

Regards,
Willy
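A quick way to check that the split took effect, assuming a reasonably
recent iproute2 and the ECC port used in this thread: each bound process
should show up with its own listening socket on the port.

    # expected outcome: one listener line per bound haproxy process
    ss -ltnp 'sport = :65420'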
Re: SSL/ECC and nbproc >1
Hi Conrad,

On 2016-10-21 17:39, Conrad Hoffmann wrote:
> Hi,
>
> it's a lot of information, and I don't have time to go into all details
> right now, but from a quick read, here are the things I noticed:
>
> - Why nbproc 64? Your CPU has 18 cores (36 w/ HT), so more procs than
>   that will likely make performance rather worse. HT cores share the
>   cache, so using 18 might make the most sense (see also below). It's
>   best to experiment a little with that and measure the results, though.

I'll compare HT/no-HT afterwards. In my first tests it didn't seem to make
much of a difference so far. I also tried (in this case) disabling HT
entirely and setting it to max. 36 procs. Basically the same as before.

> - If you see ksoftirqd eating up a lot of one CPU, then your box is most
>   likely configured to process all IRQs on the first core. Most NICs
>   these days can be configured to use several IRQs, which you can then
>   distribute across all cores, smoothing the workload across cores
>   significantly.

I'll try to get a more recent distro (it's still a Debian Wheezy) with a
newer driver etc. They seem to have added some IRQ options in more recent
versions of ixgbe. The kernel could also be related.

So, disabling HT did not help. nginx seems to have a similar problem btw.,
so it's neither HAProxy nor nginx, I guess.

> - Consider using "bind-process" to lock the processes to a single core
>   (but make sure to leave out the HT cores, or disable HT altogether).
>   Less context switching might improve performance.
>
> Hope that helps,
> Conrad
>
> On 10/21/2016 04:47 PM, Christian Ruppert wrote:
>> [full quote of the original post snipped; it appears in full at the end
>> of this digest]
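Conrad's last point, pinning each process to one core, maps to haproxy's
global "cpu-map" directive (available in 1.5+). A minimal sketch, assuming
the 36-process layout used elsewhere in the thread and a simple one-to-one
process/core pairing:

    global
        nbproc 36
        cpu-map 1 0
        cpu-map 2 1
        cpu-map 3 2
        # ... one cpu-map line per process, pairing process N with core N-1
        cpu-map 36 35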
Re: SSL/ECC and nbproc >1
Hi,

it's a lot of information, and I don't have time to go into all details
right now, but from a quick read, here are the things I noticed:

- Why nbproc 64? Your CPU has 18 cores (36 w/ HT), so more procs than that
  will likely make performance rather worse. HT cores share the cache, so
  using 18 might make the most sense (see also below). It's best to
  experiment a little with that and measure the results, though.

- If you see ksoftirqd eating up a lot of one CPU, then your box is most
  likely configured to process all IRQs on the first core. Most NICs these
  days can be configured to use several IRQs, which you can then distribute
  across all cores, smoothing the workload across cores significantly.

- Consider using "bind-process" to lock the processes to a single core
  (but make sure to leave out the HT cores, or disable HT altogether).
  Less context switching might improve performance.

Hope that helps,
Conrad

On 10/21/2016 04:47 PM, Christian Ruppert wrote:
> [full quote of the original post snipped; it appears in full at the end
> of this digest]
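On Conrad's second point, whether the NIC actually exposes several queues
(and thus several IRQs) can be checked and changed with ethtool; a small
sketch, assuming an interface named eth2 as elsewhere in the thread and an
18-core target:

    ethtool -l eth2               # show supported/current channel counts
    ethtool -L eth2 combined 18   # one combined queue per physical core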
SSL/ECC and nbproc >1
Hi,

again a performance topic.
I did some further testing/benchmarks with ECC and nbproc >1. I was testing
on an "E5-2697 v4", and the first thing I noticed was that HAProxy has a
fixed limit of 64 for nbproc. So, the setup:

HAProxy server with the mentioned E5:

global
    user haproxy
    group haproxy
    maxconn 75000
    log 127.0.0.2 local0
    ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDH
    ssl-default-bind-options no-sslv3 no-tls-tickets
    tune.ssl.default-dh-param 1024

    nbproc 64

defaults
    timeout client 300s
    timeout server 300s
    timeout queue 60s
    timeout connect 7s
    timeout http-request 10s
    maxconn 75000

    bind-process 1

# HTTP
frontend haproxy_test_http
    bind :65410
    mode http
    option httplog
    option httpclose
    log global
    default_backend bk_ram

# ECC
frontend haproxy_test-ECC
    bind-process 3-64
    bind :65420 ssl crt /etc/haproxy/test.pem-ECC
    mode http
    option httplog
    option httpclose
    log global
    default_backend bk_ram

backend bk_ram
    mode http
    fullconn 75000 # Just in case the lower default limit will be reached...
    errorfile 503 /etc/haproxy/test.error

/etc/haproxy/test.error:

HTTP/1.0 200
Cache-Control: no-cache
Connection: close
Content-Type: text/plain

Test123456

The ECC key:

openssl ecparam -genkey -name prime256v1 -out /etc/haproxy/test.pem-ECC.key
openssl req -new -sha256 -key /etc/haproxy/test.pem-ECC.key -days 365 -nodes -x509 -sha256 -subj "/O=ECC Test/CN=test.example.com" -out /etc/haproxy/test.pem-ECC.crt
cat /etc/haproxy/test.pem-ECC.key /etc/haproxy/test.pem-ECC.crt > /etc/haproxy/test.pem-ECC

So then I tried a local "ab":

ab -n 5000 -c 250 https://127.0.0.1:65420/

Server Hostname:        127.0.0.1
Server Port:            65420
SSL/TLS Protocol:       TLSv1/SSLv3,ECDHE-ECDSA-AES128-GCM-SHA256,256,128

Document Path:          /
Document Length:        107 bytes

Concurrency Level:      250
Time taken for tests:   3.940 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Non-2xx responses:      5000
Total transferred:      106 bytes
HTML transferred:       535000 bytes
Requests per second:    1268.95 [#/sec] (mean)
Time per request:       197.013 [ms] (mean)
Time per request:       0.788 [ms] (mean, across all concurrent requests)
Transfer rate:          262.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       54  138   34.7    162    193
Processing:     8   51   34.8     24    157
Waiting:        3   40   31.6     18    113
Total:        177  189    7.5    188    333

Percentage of the requests served within a certain time (ms)
  50%    188
  66%    189
  75%    190
  80%    190
  90%    191
  95%    192
  98%    196
  99%    205
 100%    333 (longest request)

The same test with just nbproc 1 was about ~1500 requests/s. So 1.5k *
nbproc would have been what I expected, or at least somewhere near that
value.

Then I set up 61 EC2 instances, standard setup, t2.micro. They're somewhat
slower, with ~1k ECC requests per second, but that's OK for the test.
HTTP (one proc) via localhost was around 27-28k r/s, remote (EC2) ~4500.

So then I started "ab" in parallel from each, and it went down to about
~4xx requests/s for ECC on each node, which is far below the ~1500 (single
proc) or ~1300 (multi proc) -- a much bigger drop than I expected, tbh. I
thought it would scale much better up to nbproc and only get worse when
>nbproc. I did some basic checks to figure out the reason/bottleneck, and
to me it looks like a lot of context switches/epoll_wait calls. In (h)top
it shows that ksoftirqd plus one haproxy proc are burning 100% CPU of a
single core; it's not distributed across multiple cores.
I'm not sure yet whether it's related to the SSL part, HAProxy or some
kernel foo. HTTP performs better: ~27k total on localhost, ~5400 for a
single ab via EC2, and still ~2100 per EC2 instance with a total of 15
instances -- and the HTTP frontend runs on just a single proc!
So I wonder what the reason is for the single blocking core. Is that the
reason for the rather poor performance, because it has an impact on all of
those processes? Can we distribute that onto multiple cores/processes as
well? Any ideas?

Oh, and I was using 1.6.5:

HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_LIBCRYPT=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"),
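Two stock tools for confirming the single-core softirq hotspot described
above (a sketch; mpstat assumes the sysstat package is installed):

    # per-CPU utilisation, watch the %soft (softirq) column
    mpstat -P ALL 1
    # per-CPU NET_RX counters; one rapidly growing column = one hot core
    watch -d 'grep NET_RX /proc/softirqs'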