Re: SSL/ECC and nbproc >1

2016-11-25 Thread Christian Ruppert

On 2016-11-25 15:26, Willy Tarreau wrote:

On Fri, Nov 25, 2016 at 02:44:35PM +0100, Christian Ruppert wrote:
I have a default bind for process 1 which is basically the http frontend
and the actual backend, RSA is bound to another, single process and ECC
is bound to all the rest. So in this case SSL (in particular ECC) is the
problem. The connections/handshakes should be *actually* using CPU+2
till NCPU.


That's exactly what I'm talking about, look, you have this :

  frontend ECC
      bind-process 3-36
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC
      mode http
      default_backend bk_ram

It creates a single socket (hence a single queue) and shares it between
all processes. Thus each incoming connection will wake up all processes
not doing anything, and the first one capable of grabbing it will take
it as well as a few following ones if any. You end up with a very
unbalanced load making it hard to scale.

Instead you can do this :

  frontend ECC
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 3
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 4
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 5
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 6
      ...
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 36
      mode http
      default_backend bk_ram

You'll really have 34 listening sockets all fairly balanced with their
own queue. You can generally achieve higher loads this way and with a
lower average latency.
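
With that many processes it can be easier to generate the repeated bind
lines than to maintain them by hand. A minimal sketch, assuming the
configuration is assembled from fragments by a deploy script (the
fragment path below is illustrative, not part of the setup above):

  # emit one ECC bind line per process 3..36 into a config fragment
  for p in $(seq 3 36); do
      echo "      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process $p"
  done > /etc/haproxy/ecc-binds.cfg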

Also, I tend to bind network IRQs to the same cores as those doing SSL
because you hardly have the two at once. SSL is not able to deal with
traffic capable of saturating a NIC driver, so when SSL saturates the
CPU you have little traffic and when the NIC requires all the CPU for
high traffic, you know there's little SSL.
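
The kernel exposes per-IRQ affinity files for this kind of pinning. A
minimal sketch, assuming an ixgbe interface named eth2, a kernel recent
enough to provide smp_affinity_list, and that cores 3-35 are the ones
running the SSL processes (all of those are assumptions, adjust to the
actual layout):

  # pin every IRQ belonging to eth2 onto the SSL cores (run as root)
  grep eth2 /proc/interrupts | awk '{print $1}' | tr -d ':' | \
  while read irq; do
      echo 3-35 > /proc/irq/$irq/smp_affinity_list
  done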

Cheers,
Willy


Ah! Thanks! I had to either remove the default "bind-process 1" or also
set "bind-process 3-36" in the ECC frontend, though. I guess it comes
down to the same thing in the end. Anyway, the IRQ/NIC problem was still
the same. I'll set it up that way anyway if that's better, together with
the Intel affinity script or, as you said, with the IRQs bound to the
cores that do the SSL work. Let's see how well that performs.


--
Regards,
Christian Ruppert



Re: SSL/ECC and nbproc >1

2016-11-25 Thread Willy Tarreau
On Fri, Nov 25, 2016 at 02:44:35PM +0100, Christian Ruppert wrote:
> I have a default bind for process 1 which is basically the http frontend and
> the actual backend, RSA is bound to another, single process and ECC is bound
> to all the rest. So in this case SSL (in particular ECC) is the problem. The
> connections/handshakes should be *actually* using CPU+2 till NCPU.

That's exactly what I'm talking about, look, you have this :

  frontend ECC
      bind-process 3-36
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC
      mode http
      default_backend bk_ram

It creates a single socket (hence a single queue) and shares it between
all processes. Thus each incoming connection will wake up all processes
not doing anything, and the first one capable of grabbing it will take
it as well as a few following ones if any. You end up with a very
unbalanced load making it hard to scale.

Instead you can do this :

  frontend ECC
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 3
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 4
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 5
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 6
      ...
      bind :65420 ssl crt /etc/haproxy/test.pem-ECC process 36
      mode http
      default_backend bk_ram

You'll really have 34 listening sockets all fairly balanced with their
own queue. You can generally achieve higher loads this way and with a
lower average latency.

Also, I tend to bind network IRQs to the same cores as those doing SSL
because you hardly have the two at once. SSL is not able to deal with
traffic capable of saturating a NIC driver, so when SSL saturates the
CPU you have little traffic and when the NIC requires all the CPU for
high traffic, you know there's little SSL.

Cheers,
Willy



Re: SSL/ECC and nbproc >1

2016-11-25 Thread Christian Ruppert

On 2016-11-25 14:44, Christian Ruppert wrote:

Hi Willy,

On 2016-11-25 14:30, Willy Tarreau wrote:

Hi Christian,

On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
make much of a difference so far.
I also tried (in this case) to disable HT entirely and set it to max. 36
procs. Basically the same as before.


Also you definitely need to split your bind lines, one per process, to
take advantage of the kernel's ability to load balance between multiple
queues. Otherwise the load is always unequal and many processes are
woken up for nothing.


I have a default bind for process 1 which is basically the http
frontend and the actual backend, RSA is bound to another, single
process and ECC is bound to all the rest. So in this case SSL (in
particular ECC) is the problem. The connections/handshakes should be
*actually* using CPU+2 till NCPU. The only shared part should be the
backend but that should be actually no problem for e.g. 5 parallel
benchmarks as a single HTTP benchmark can make >20k requests/s.

global
    nbproc 36

defaults
    bind-process 1

frontend http
    bind :65410
    mode http
    default_backend bk_ram

frontend ECC
    bind-process 3-36
    bind :65420 ssl crt /etc/haproxy/test.pem-ECC
    mode http
    default_backend bk_ram

backend bk_ram
    mode http
    fullconn 75000
    errorfile 503 /etc/haproxy/test.error




Regards,
Willy


It seems to be the NIC or rather the driver/kernel. Using Intel's
set_irq_affinity script (set_irq_affinity -x local eth2 eth3) seems to
do the trick, at least at first glance.
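
A quick way to confirm the spreading is to watch the per-CPU interrupt
counters while the benchmark runs (eth2 as above; purely a check, not a
fix):

  # each eth2 queue's counter should now grow in a different CPU column
  watch -n1 'grep eth2 /proc/interrupts'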


--
Regards,
Christian Ruppert



Re: SSL/ECC and nbproc >1

2016-11-25 Thread Christian Ruppert

Hi Willy,

On 2016-11-25 14:30, Willy Tarreau wrote:

Hi Christian,

On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
make much of a difference so far.
I also tried (in this case) to disable HT entirely and set it to max. 36
procs. Basically the same as before.


Also you definitely need to split your bind lines, one per process, to
take advantage of the kernel's ability to load balance between multiple
queues. Otherwise the load is always unequal and many processes are
woken up for nothing.


I have a default bind for process 1 which is basically the http frontend 
and the actual backend, RSA is bound to another, single process and ECC 
is bound to all the rest. So in this case SSL (in particular ECC) is the 
problem. The connections/handshakes should be *actually* using CPU+2 
till NCPU. The only shared part should be the backend but that should be 
actually no problem for e.g. 5 parallel benchmarks as a single HTTP 
benchmark can make >20k requests/s.


global
    nbproc 36

defaults
    bind-process 1

frontend http
    bind :65410
    mode http
    default_backend bk_ram

frontend ECC
    bind-process 3-36
    bind :65420 ssl crt /etc/haproxy/test.pem-ECC
    mode http
    default_backend bk_ram

backend bk_ram
    mode http
    fullconn 75000
    errorfile 503 /etc/haproxy/test.error




Regards,
Willy


--
Regards,
Christian Ruppert



Re: SSL/ECC and nbproc >1

2016-11-25 Thread Willy Tarreau
Hi Christian,

On Fri, Nov 25, 2016 at 12:12:06PM +0100, Christian Ruppert wrote:
> I'll compare HT/no-HT afterwards. In my first tests it didn't seem to make
> much of a difference so far.
> I also tried (in this case) to disable HT entirely and set it to max. 36
> procs. Basically the same as before.

Also you definitely need to split your bind lines, one per process, to
take advantage of the kernel's ability to load balance between multiple
queues. Otherwise the load is always unequal and many processes are woken
up for nothing.

Regards,
Willy



Re: SSL/ECC and nbproc >1

2016-11-25 Thread Christian Ruppert

Hi Conrad,

On 2016-10-21 17:39, Conrad Hoffmann wrote:

Hi,

it's a lot of information, and I don't have time to go into all details
right now, but from a quick read, here are the things I noticed:

- Why nbproc 64? Your CPU has 18 cores (36 w/ HT), so more procs than
that will likely make performance rather worse. HT cores share the
cache, so using 18 might make most sense (see also below). It's best to
experiment a little with that and measure the results, though.


I'll compare HT/no-HT afterwards. In my first tests it didn't seem to
make much of a difference so far.
I also tried (in this case) to disable HT entirely and set it to max. 36
procs. Basically the same as before.




- If you see ksoftirq eating up a lot of one CPU, then your box is most
likely configured to process all IRQs on the first core. Most NICs these
days can be configured to use several IRQs, which you can then
distribute across all cores, smoothing the workload across cores
significantly.


I'll try to get a more recent distro (it's still Debian Wheezy) with a
newer driver etc. They seem to have added some IRQ options in more
recent versions of ixgbe. The kernel could also be related.
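
For reference, the driver/firmware version and the number of RX/TX
queues the NIC currently exposes can be read with ethtool (assuming the
interface is eth2; just a way to see what the Wheezy-era ixgbe offers):

  # driver name, version and firmware of the interface
  ethtool -i eth2
  # how many combined channels (queues/IRQs) are available vs. in use,
  # if the driver supports the query
  ethtool -l eth2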

So disabling HT did not help.
nginx seems to have a similar problem btw., so I guess it's neither a
HAProxy nor an nginx issue.




- Consider using "bind-process" to lock the processes to a single core
(but make sure to leave out the HT cores, or disable HT altogether).
Less context switching might improve performance.

Hope that helps,
Conrad



On 10/21/2016 04:47 PM, Christian Ruppert wrote:

Hi,

again a performance topic.
I did some further testing/benchmarks with ECC and nbproc >1. I was
testing on a "E5-2697 v4" and the first thing I noticed was that HAProxy
has a fixed limit of 64 for nbproc. So the setup:

HAProxy server with the mentioned E5:
global
user haproxy
group haproxy
maxconn 75000
log 127.0.0.2 local0
ssl-default-bind-ciphers
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDH

ssl-default-bind-options no-sslv3 no-tls-tickets
tune.ssl.default-dh-param 1024

nbproc 64

defaults
timeout client 300s
timeout server 300s
timeout queue 60s
timeout connect 7s
timeout http-request 10s
maxconn 75000

bind-process 1

# HTTP
frontend haproxy_test_http
bind :65410
mode http
option httplog
option httpclose
log global
default_backend bk_ram

# ECC
frontend haproxy_test-ECC
bind-process 3-64
bind :65420 ssl crt /etc/haproxy/test.pem-ECC
mode http
option httplog
option httpclose
log global
default_backend bk_ram

backend bk_ram
mode http
fullconn 75000 # Just in case the lower default limit will be reached...
errorfile 503 /etc/haproxy/test.error



/etc/haproxy/test.error:
HTTP/1.0 200
Cache-Control: no-cache
Connection: close
Content-Type: text/plain

Test123456


The ECC key:
openssl ecparam -genkey -name prime256v1 -out /etc/haproxy/test.pem-ECC.key
openssl req -new -sha256 -key /etc/haproxy/test.pem-ECC.key -days 365 \
    -nodes -x509 -sha256 -subj "/O=ECC Test/CN=test.example.com" \
    -out /etc/haproxy/test.pem-ECC.crt
cat /etc/haproxy/test.pem-ECC.key /etc/haproxy/test.pem-ECC.crt > /etc/haproxy/test.pem-ECC


So then I tried a local "ab":
ab -n 5000 -c 250 https://127.0.0.1:65420/
Server Hostname:127.0.0.1
Server Port:65420
SSL/TLS Protocol:   TLSv1/SSLv3,ECDHE-ECDSA-AES128-GCM-SHA256,256,128


Document Path:  /
Document Length:107 bytes

Concurrency Level:  250
Time taken for tests:   3.940 seconds
Complete requests:  5000
Failed requests:0
Write errors:   0
Non-2xx responses:  5000
Total transferred:  106 bytes
HTML transferred:   535000 bytes
Requests per second:1268.95 [#/sec] (mean)
Time per request:   197.013 [ms] (mean)
Time per request:   0.788 [ms] (mean, across all concurrent requests)
Transfer rate:  262.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       54  138  34.7    162     193
Processing:     8   51  34.8     24     157
Waiting:        3   40  31.6     18     113
Total:        177  189   7.5    188     333

Percentage of the requests served within a certain time (ms)
  50%    188
  66%    189
  75%    190
  80%    190
  90%    191
  95%    192
  98%    196
  99%    205
 100%    333 (longest request)

The same test with just nbproc 1 was about ~1500 requests/s. So 1,5k *
nbproc would have been what I expected, at least somewhere near that 
value.


Then I set up 61 EC2 instances, standard setup t2-micro. They're
somewhat slower with ~1k ECC requests per second but that's ok for the
test.
HTTP (one proc) via localhost was around 27-28k r/s, remote (EC2) ~4500.


So then I started "ab" parallel from each and it 

Re: SSL/ECC and nbproc >1

2016-10-21 Thread Conrad Hoffmann
Hi,

it's a lot of information, and I don't have time to go into all details
right now, but from a quick read, here are the things I noticed:

- Why nbproc 64? Your CPU has 18 cores (36 w/ HT), so more procs than that
will likely make performance rather worse. HT cores share the cache, so
using 18 might make most sense (see also below). It's best to experiment a
little with that and measure the results, though.

- If you see ksoftirq eating up a lot of one CPU, then your box is most
likely configured to process all IRQs on the first core. Most NICs these
days can be configured to use several IRQs, which you can then distribute
across all cores, smoothing the workload across cores significantly.

- Consider using "bind-process" to lock the processes to a single core (but
make sure to leave out the HT cores, or disable HT altogether). Less
context switching might improve performance.

Hope that helps,
Conrad



On 10/21/2016 04:47 PM, Christian Ruppert wrote:
> Hi,
> 
> again a performance topic.
> I did some further testing/benchmarks with ECC and nbproc >1. I was testing
> on a "E5-2697 v4" and the first thing I noticed was that HAProxy has a
> fixed limit of 64 for nbproc. So the setup:
> 
> HAProxy server with the mentioned E5:
> global
> user haproxy
> group haproxy
> maxconn 75000
> log 127.0.0.2 local0
> ssl-default-bind-ciphers
> ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDH
> 
> ssl-default-bind-options no-sslv3 no-tls-tickets
> tune.ssl.default-dh-param 1024
> 
> nbproc 64
> 
> defaults
> timeout client 300s
> timeout server 300s
> timeout queue 60s
> timeout connect 7s
> timeout http-request 10s
> maxconn 75000
> 
> bind-process 1
> 
> # HTTP
> frontend haproxy_test_http
> bind :65410
> mode http
> option httplog
> option httpclose
> log global
> default_backend bk_ram
> 
> # ECC
> frontend haproxy_test-ECC
> bind-process 3-64
> bind :65420 ssl crt /etc/haproxy/test.pem-ECC
> mode http
> option httplog
> option httpclose
> log global
> default_backend bk_ram
> 
> backend bk_ram
> mode http
> fullconn 75000 # Just in case the lower default limit will be reached...
> errorfile 503 /etc/haproxy/test.error
> 
> 
> 
> /etc/haproxy/test.error:
> HTTP/1.0 200
> Cache-Control: no-cache
> Connection: close
> Content-Type: text/plain
> 
> Test123456
> 
> 
> The ECC key:
> openssl ecparam -genkey -name prime256v1 -out /etc/haproxy/test.pem-ECC.key
> openssl req -new -sha256 -key /etc/haproxy/test.pem-ECC.key -days 365
> -nodes -x509 -sha256 -subj "/O=ECC Test/CN=test.example.com" -out
> /etc/haproxy/test.pem-ECC.crt
> cat /etc/haproxy/test.pem-ECC.key /etc/haproxy/test.pem-ECC.crt >
> /etc/haproxy/test.pem-ECC
> 
> 
> So then I tried a local "ab":
> ab -n 5000 -c 250 https://127.0.0.1:65420/
> Server Hostname:127.0.0.1
> Server Port:65420
> SSL/TLS Protocol:   TLSv1/SSLv3,ECDHE-ECDSA-AES128-GCM-SHA256,256,128
> 
> Document Path:  /
> Document Length:107 bytes
> 
> Concurrency Level:  250
> Time taken for tests:   3.940 seconds
> Complete requests:  5000
> Failed requests:0
> Write errors:   0
> Non-2xx responses:  5000
> Total transferred:  106 bytes
> HTML transferred:   535000 bytes
> Requests per second:1268.95 [#/sec] (mean)
> Time per request:   197.013 [ms] (mean)
> Time per request:   0.788 [ms] (mean, across all concurrent requests)
> Transfer rate:  262.71 [Kbytes/sec] received
> 
> Connection Times (ms)
>   min  mean[+/-sd] median   max
> Connect:       54  138  34.7    162     193
> Processing:     8   51  34.8     24     157
> Waiting:        3   40  31.6     18     113
> Total:        177  189   7.5    188     333
> 
> Percentage of the requests served within a certain time (ms)
>   50%    188
>   66%    189
>   75%    190
>   80%    190
>   90%    191
>   95%    192
>   98%    196
>   99%    205
>  100%    333 (longest request)
> 
> The same test with just nbproc 1 was about ~1500 requests/s. So 1,5k *
> nbproc would have been what I expected, at least somewhere near that value.
> 
> Then I set up 61 EC2 instances, standard setup t2-micro. They're somewhat
> slower with ~1k ECC requests per second but that's ok for the test.
> HTTP (one proc) via localhost was around 27-28k r/s, remote (EC2) ~4500.
> 
> So then I started "ab" parallel from each and it was going down to about
> ~4xx requests/s for ECC on each node which is far below the ~1500 (single
> proc) or ~1300 (multi proc) which is much more than I expected tbh. I
> thought it would scale much better for up to nbproc and getting worse when
> >nbproc. I did some basic checks to figure out the reason/bottleneck and to
> me it looks like a lot of switch/epoll_wait. In 

SSL/ECC and nbproc >1

2016-10-21 Thread Christian Ruppert

Hi,

again a performance topic.
I did some further testing/benchmarks with ECC and nbproc >1. I was 
testing on a "E5-2697 v4" and the first thing I noticed was that HAProxy 
has a fixed limit of 64 for nbproc. So the setup:


HAProxy server with the mentioned E5:
global
user haproxy
group haproxy
maxconn 75000
log 127.0.0.2 local0
ssl-default-bind-ciphers 
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDH

ssl-default-bind-options no-sslv3 no-tls-tickets
tune.ssl.default-dh-param 1024

nbproc 64

defaults
timeout client 300s
timeout server 300s
timeout queue 60s
timeout connect 7s
timeout http-request 10s
maxconn 75000

bind-process 1

# HTTP
frontend haproxy_test_http
bind :65410
mode http
option httplog
option httpclose
log global
default_backend bk_ram

# ECC
frontend haproxy_test-ECC
bind-process 3-64
bind :65420 ssl crt /etc/haproxy/test.pem-ECC
mode http
option httplog
option httpclose
log global
default_backend bk_ram

backend bk_ram
mode http
fullconn 75000 # Just in case the lower default limit will be reached...
errorfile 503 /etc/haproxy/test.error



/etc/haproxy/test.error:
HTTP/1.0 200
Cache-Control: no-cache
Connection: close
Content-Type: text/plain

Test123456


The ECC key:
openssl ecparam -genkey -name prime256v1 -out /etc/haproxy/test.pem-ECC.key
openssl req -new -sha256 -key /etc/haproxy/test.pem-ECC.key -days 365 \
    -nodes -x509 -sha256 -subj "/O=ECC Test/CN=test.example.com" \
    -out /etc/haproxy/test.pem-ECC.crt
cat /etc/haproxy/test.pem-ECC.key /etc/haproxy/test.pem-ECC.crt > /etc/haproxy/test.pem-ECC
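
To double-check that the bundle really serves the EC certificate (and
thus the ECDSA suites), something like this can be used; a hedged check
against the test setup above, not part of the original procedure:

  # the public key algorithm should show id-ecPublicKey / P-256
  openssl x509 -in /etc/haproxy/test.pem-ECC.crt -noout -text | grep -A2 'Public Key Algorithm'
  # negotiate against the running frontend and print the chosen cipher
  echo | openssl s_client -connect 127.0.0.1:65420 2>/dev/null | grep -E 'Cipher|Server public key'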



So then I tried a local "ab":
ab -n 5000 -c 250 https://127.0.0.1:65420/
Server Hostname:127.0.0.1
Server Port:65420
SSL/TLS Protocol:   TLSv1/SSLv3,ECDHE-ECDSA-AES128-GCM-SHA256,256,128


Document Path:  /
Document Length:107 bytes

Concurrency Level:  250
Time taken for tests:   3.940 seconds
Complete requests:  5000
Failed requests:0
Write errors:   0
Non-2xx responses:  5000
Total transferred:  106 bytes
HTML transferred:   535000 bytes
Requests per second:1268.95 [#/sec] (mean)
Time per request:   197.013 [ms] (mean)
Time per request:   0.788 [ms] (mean, across all concurrent requests)
Transfer rate:  262.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       54  138  34.7    162     193
Processing:     8   51  34.8     24     157
Waiting:        3   40  31.6     18     113
Total:        177  189   7.5    188     333

Percentage of the requests served within a certain time (ms)
  50%    188
  66%    189
  75%    190
  80%    190
  90%    191
  95%    192
  98%    196
  99%    205
 100%    333 (longest request)

The same test with just nbproc 1 was about ~1500 requests/s. So 1,5k * 
nbproc would have been what I expected, at least somewhere near that 
value.


Then I set up 61 EC2 instances, standard setup t2-micro. They're somewhat
slower with ~1k ECC requests per second but that's ok for the test.

HTTP (one proc) via localhost was around 27-28k r/s, remote (EC2) ~4500.

So then I started "ab" parallel from each and it was going down to about 
~4xx requests/s for ECC on each node which is far below the ~1500 
(single proc) or ~1300 (multi proc) which is much more than I expected 
tbh. I thought it would scale much better for up to nbproc and getting 
worse when >nbproc. I did some basic checks to figure out the 
reason/bottleneck and to me it looks like a lot of switch/epoll_wait. In 
(h)top it shows that ksoftirqd + one haproxy proc is burning 100% CPU of
a single core; it's not distributed across multiple cores. I'm not sure
yet whether it's related to the SSL part, HAProxy or some kernel foo.
HTTP performs better. ~27k total on localhost, ~5400 single ab via EC2 
and still ~2100 per EC2 with a total of 15 instances - and the HTTP proc 
is just a single proc!


So I wonder what's the reason for the single blocking core. Is that the
cause of the rather poor performance, since it has an impact on all of
those processes? Can we distribute that onto multiple cores/processes
as well? Any ideas?
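
A few things that may help narrow down where the softirq work lands; a
rough sketch assuming the interface is eth2 and the sysstat package is
installed (none of this is from the original test):

  # which CPUs actually receive the NIC's interrupts
  grep eth2 /proc/interrupts
  # per-CPU softirq load; one hot %soft column matches the ksoftirqd symptom
  mpstat -P ALL 1 5
  # NET_RX counters per CPU; growth in only one column means all RX lands on one core
  watch -n1 'grep -E "CPU|NET_RX" /proc/softirqs'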


Oh, and I was using 1.6.5:
HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_LIBCRYPT=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200


Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity("identity"), 
deflate("deflate"),