Hi Conrad,

On 2016-10-21 17:39, Conrad Hoffmann wrote:
Hi,

It's a lot of information and I don't have time to go into all the details
right now, but from a quick read, here are the things I noticed:

- Why nbproc 64? Your CPU has 18 cores (36 w/ HT), so more procs than that
will likely make performance worse rather than better. HT cores share the
cache, so using 18 might make the most sense (see also below). It's best to
experiment a little with that and measure the results, though. For example:
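
To check the actual topology (lscpu is part of util-linux; the grep
pattern is just a convenience):

lscpu | grep -E '^(Socket|Core|Thread)'
# "Core(s) per socket" x "Socket(s)" = physical cores, a sane upper
# bound for nbproc; "Thread(s) per core: 2" means HT is enabled.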

I'll compare HT/no-HT afterwards. In my first tests it didn't seem to make
much of a difference so far. I also tried disabling HT entirely, and capping
it at a maximum of 36 procs; the result was basically the same as before.
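
For reference, I took the HT siblings offline at runtime roughly like this;
the CPU numbering is only an example and needs to be verified first via
/sys/devices/system/cpu/cpu*/topology/thread_siblings_list:

# assuming CPUs 18-35 are the HT siblings of cores 0-17
for n in $(seq 18 35); do echo 0 > /sys/devices/system/cpu/cpu$n/online; done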


- If you see ksoftirqd eating up a lot of one CPU, then your box is most
likely configured to process all IRQs on the first core. Most NICs these
days can be configured to use several IRQs, which you can then distribute
across all cores, smoothing the workload across cores significantly.
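
Roughly along these lines; the interface name and IRQ number are only
examples, the real ones come from /proc/interrupts:

grep eth0 /proc/interrupts           # list the NIC's RX/TX queue IRQs
echo 2 > /proc/irq/123/smp_affinity  # pin IRQ 123 to CPU 1 (hex bitmask)

If I remember correctly, the ixgbe source tarball also ships a
set_irq_affinity script that automates exactly this.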

I'll try to get a more recent distro (it's still Debian Wheezy) with a newer
driver etc. They seem to have added some IRQ options in more recent versions
of ixgbe; the kernel could also play a role.
So disabling HT did not help.
nginx seems to have a similar problem btw., so I guess it's neither
HAProxy- nor nginx-specific.
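
For the record, the currently loaded driver and firmware version can be
checked with ethtool (assuming the interface is eth0):

ethtool -i eth0   # prints driver name, driver version and firmware version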


- Consider using "bind-process" to lock each process to a single core (but
make sure to leave out the HT cores, or disable HT altogether). Less
context switching might improve performance.
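
Something like this in the global section should do it (an untested sketch
assuming 18 physical cores; cpu-map pins each process to one core):

    nbproc 18
    cpu-map 1 0
    cpu-map 2 1
    # ... one line per process, up to:
    cpu-map 18 17

bind-process in the frontends then picks which of those pinned processes
serve which listener, as in your config below.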

Hope that helps,
Conrad



On 10/21/2016 04:47 PM, Christian Ruppert wrote:
Hi,

again a performance topic.
I did some further testing/benchmarks with ECC and nbproc >1. I was testing
on an "E5-2697 v4" and the first thing I noticed was that HAProxy has a
fixed limit of 64 for nbproc. So, the setup:

HAProxy server with the mentioned E5:
global
    user haproxy
    group haproxy
    maxconn 75000
    log 127.0.0.2 local0
    ssl-default-bind-ciphers
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDH

    ssl-default-bind-options no-sslv3 no-tls-tickets
    tune.ssl.default-dh-param 1024

    nbproc 64

defaults
    timeout client 300s
    timeout server 300s
    timeout queue 60s
    timeout connect 7s
    timeout http-request 10s
    maxconn 75000

    bind-process 1

# HTTP
frontend haproxy_test_http
    bind :65410
    mode http
    option httplog
    option httpclose
    log global
    default_backend bk_ram

# ECC
frontend haproxy_test-ECC
    bind-process 3-64
    bind :65420 ssl crt /etc/haproxy/test.pem-ECC
    mode http
    option httplog
    option httpclose
    log global
    default_backend bk_ram

backend bk_ram
    mode http
    fullconn 75000 # Just in case the lower default limit is reached...
    errorfile 503 /etc/haproxy/test.error



/etc/haproxy/test.error:
HTTP/1.0 200
Cache-Control: no-cache
Connection: close
Content-Type: text/plain

Test123456
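
The backend has no servers on purpose: every request ends up at the 503
errorfile, which actually carries a 200 response, so HAProxy serves the
static payload itself and nothing but HAProxy is measured. A quick sanity
check from the box itself, just to show the idea:

curl -i http://127.0.0.1:65410/
# should return "HTTP/1.0 200" with the body "Test123456"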


The ECC key:
openssl ecparam -genkey -name prime256v1 -out /etc/haproxy/test.pem-ECC.key
openssl req -new -x509 -sha256 -key /etc/haproxy/test.pem-ECC.key \
    -days 365 -nodes -subj "/O=ECC Test/CN=test.example.com" \
    -out /etc/haproxy/test.pem-ECC.crt
cat /etc/haproxy/test.pem-ECC.key /etc/haproxy/test.pem-ECC.crt > \
    /etc/haproxy/test.pem-ECC
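
To double-check that the resulting cert really uses the intended curve:

openssl x509 -in /etc/haproxy/test.pem-ECC.crt -noout -text | \
    grep -E 'Public Key Algorithm|ASN1 OID'
# should show "id-ecPublicKey" and "ASN1 OID: prime256v1"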


So then I tried a local "ab":
ab -n 5000 -c 250 https://127.0.0.1:65420/
Server Hostname:        127.0.0.1
Server Port:            65420
SSL/TLS Protocol:       TLSv1/SSLv3,ECDHE-ECDSA-AES128-GCM-SHA256,256,128

Document Path:          /
Document Length:        107 bytes

Concurrency Level:      250
Time taken for tests:   3.940 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Non-2xx responses:      5000
Total transferred:      1060000 bytes
HTML transferred:       535000 bytes
Requests per second:    1268.95 [#/sec] (mean)
Time per request:       197.013 [ms] (mean)
Time per request:       0.788 [ms] (mean, across all concurrent requests)
Transfer rate:          262.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       54  138  34.7    162     193
Processing:     8   51  34.8     24     157
Waiting:        3   40  31.6     18     113
Total:        177  189   7.5    188     333

Percentage of the requests served within a certain time (ms)
  50%    188
  66%    189
  75%    190
  80%    190
  90%    191
  95%    192
  98%    196
  99%    205
 100%    333 (longest request)

The same test with just nbproc 1 gave about ~1500 requests/s. So roughly
1.5k * nbproc is what I would have expected, i.e. with the 62 processes
serving the ECC frontend somewhere around 90k requests/s in total, or at
least somewhere near that value.

Then I set up 61 EC2 instances, standard t2.micro. They're somewhat slower
at ~1k ECC requests per second each, but that's ok for the test.
HTTP (one proc) via localhost was around 27-28k r/s, remote (EC2) ~4500.

So then I started "ab" in parallel on each instance and throughput went
down to about ~4xx requests/s for ECC on each node, which is far below the
~1500 (single proc) or ~1300 (multi proc) per-process figures; the drop is
much bigger than I expected, tbh. I thought it would scale reasonably well
up to nbproc parallel clients and only degrade beyond nbproc.

I did some basic checks to figure out the reason/bottleneck, and to me it
looks like a lot of context switches/epoll_wait. (h)top shows ksoftirqd
plus one haproxy proc burning 100% of a single core; the load is not
distributed across multiple cores. I'm not sure yet whether it's related
to the SSL part, to HAProxy or to some kernel foo. Plain HTTP performs
better: ~27k total on localhost, ~5400 with a single ab via EC2 and still
~2100 per EC2 instance with a total of 15 instances, and the HTTP frontend
is just a single proc!
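
For anyone who wants to reproduce the observation, this is roughly how I
watched the per-core breakdown (mpstat is part of the sysstat package):

mpstat -P ALL 1                 # per-CPU usage incl. %irq and %soft columns
watch -d 'cat /proc/softirqs'   # per-CPU softirq counters; NET_RX is the
                                # interesting row here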

So I wonder what the reason for that single saturated core is. Is it the
cause of the rather poor performance, because it has an impact on all of
those processes? Can we distribute that work across multiple
cores/processes as well? Any ideas?

Oh, and I was using 1.6.5:
HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau <[email protected]>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_LIBCRYPT=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1t 3 May 2016 (VERSIONS DIFFER!)
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

I actually thought I was already using 1.6.9 on that host, so I just
upgraded and ran some benchmarks again, but at first glance the results
look almost identical.


--
Regards,
Christian Ruppert
