Re: Optimizing HAProxy CPU usage for SSL

2024-01-31 Thread Willy Tarreau
Hi Miles,

On Thu, Feb 01, 2024 at 05:09:20PM +1100, Miles Hampson wrote:
> Hi,
> 
> We recently hit an issue where we observed the
> haproxy_frontend_current_sessions reported by the prometheus endpoint
> plateau at 4095 and some requests start dropping. Increasing the global and
> listen maxconn from 4096 to something larger (as well as making the kernel
> TCP queues on our Ubuntu 22.04 OS slightly larger) fixed the issue.
> 
> The cause seems to have been a switch from http to https traffic due to a
> client side config change, rather than an increase in the number of
> requests, so I started looking at CPU usage to see if the SSL load was too
> much for our server CPUs. However on one of the modern 24 core machines
> running HAProxy I noticed top was only reporting around 100% CPU usage,
> with both the user and system CPU distributed pretty evenly across all the
> cores (4-8% user per core, 0.5-2% system). The idle percentage was in the
> high nineties, both as reported by top and by the haproxy socket Idle_pct.
> This was just a quick gathering of info and may not be representative,
> since our prometheus node exporter only shows overall CPU (which was a low
> 5% of the total on all cores throughout). This is for a bare metal server
> which is just running a HAProxy processing around 200 SSL req/sec, and not
> doing much else.

So extrapolating from ~5% total CPU at 200 SSL req/s, that would give
roughly 4000 SSL req/s max over all cores, or only ~166 per core. That
sounds quite low!

> I started wondering if our global settings:
> 
>   master-worker
>   nbthread 24
>   cpu-map auto:1/1-24 0-23
>   tune.ssl.cachesize 10
> 
> were appropriate or if they had caused some inefficiency in using our
> machine's cores, which then caused this backlog. Or whether what I am
> observing is completely normal, given that we are now spending more time on
> SSL decoding so can expect more queuing (our backend servers are very fast
> and so we run them with a small maxconn, but they don't care if the request
> is SSL or not so the overall request time should be the same other than SSL
> processing time). We are running either the latest OpenSSL 1.1.1 or
> WolfSSL, all compiled sensibly (AES-NI etc).

Normally on a modern x86 CPU core, you should expect roughly 500 RSA2048/s
per core and per GHz (or keep in mind 1000/s for an average 2GHz core).
RSA4096, however, is much slower, usually 7 times or so. Example here on
a core i9-9900K at 5GHz:

  $ openssl speed rsa2048 rsa4096
                    sign    verify    sign/s verify/s
  rsa 2048 bits 0.000404s 0.000012s   2476.7  83266.3
  rsa 4096 bits 0.002726s 0.000042s    366.8  23632.8

(2476.7 sign/s at 5 GHz is indeed ~495/s per GHz, matching the rule of
thumb above.)

On ARM, however, it will vary with the cores, but up to the Neoverse-N1
(e.g. Graviton2) it was not fantastic, to say the least: around 100/s
per GHz for RSA2048 and 14/s/GHz for RSA4096. The Neoverse-V1 as in
Graviton3 is way better though, about 2/3 of x86.

> I turned to https://docs.haproxy.org/2.9/management.html#7 which had some
> very interesting advice about pinning haproxy to one CPU core and the
> interrupts to another one, but it also mentioned nbproc and the bind
> process option for better SSL traffic processing. Given that seems to be a
> bit out of date, I thought I might ask my question here instead.

Oops, good catch, I hoped we got rid of all references to nbproc, we'll
definitely have to clean that one!

> Is there a way to use the CPU cores available on our HAProxy machines to
> handle SSL requests better than I have with the global config above?

At least I have a question in return: what is this CPU exactly? I'm
asking because you mentioned 24 cores and you started 24 threads, but
x86 CPUs are usually SMT-capable via HyperThreading or the equivalent
from AMD, so you get twice the number of hardware threads. The gain is
not much, since both threads of a core share the same compute units; for
lots of workloads it ends up as about a 10-15% performance increase, and
for SSL it brings almost zero, since the calculation code is already
optimized to use the ALU fully.
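
If in doubt, a quick look at the topology (plain Linux tooling, nothing
haproxy-specific; the output below is only an illustration) tells cores
and threads apart:

  $ lscpu | grep -E 'Model name|Socket|Core|Thread'
  Thread(s) per core:  2
  Core(s) per socket:  12
  Socket(s):           1
  Model name:          ...

With "Thread(s) per core: 2", 24 visible CPUs would mean only 12
physical cores.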

But what's sometimes interesting with the second thread is to let the
network stack run on it. Usually, though, when you're tuning for SSL, the network
is not the problem, and conversely. Let me explain. It takes just a few
Mbps of network traffic to saturate a machine with SSL handshakes. This
means that if your SSL stack is working like crazy doing computations
to the point of maxing out the CPU, chances are that you're under attack
and that your network load is very low. Conversely, if you're dealing
with a lot of network traffic, it usually means your site is designed
so that you deliver a lot of bytes without forcing clients to perform
handshakes all the time.

So in the end it often makes sense to let both haproxy and the network
stack coexist on all threads of all cores, so that the unused CPU in a
certain situation is available to the other (if only to better deal with
attacks or unexpected traffic surges).
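
As a rough sketch, if your machine really has 24 physical cores with SMT
enabled (hence 48 hardware threads), that could look like the following
in the global section (the numbers are assumptions to adjust to your
actual topology):

  global
      nbthread 48
      cpu-map auto:1/1-48 0-47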

OpenSSL doesn't scale well, but 1.1.1 isn't that bad. It reaches a plateau

Optimizing HAProxy CPU usage for SSL

2024-01-31 Thread Miles Hampson
Hi,

We recently hit an issue where we observed the
haproxy_frontend_current_sessions reported by the prometheus endpoint
plateau at 4095 and some requests start dropping. Increasing the global and
listen maxconn from 4096 to something larger (as well as making the kernel
TCP queues on our Ubuntu 22.04 OS slightly larger) fixed the issue.
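
Concretely the change was along these lines (the values and the cert
path here are illustrative rather than our exact ones):

  global
      maxconn 20480      # raised from 4096

  listen fe_https
      bind :443 ssl crt /etc/haproxy/site.pem
      maxconn 20480      # raised from 4096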

The cause seems to have been a switch from http to https traffic due to a
client side config change, rather than an increase in the number of
requests, so I started looking at CPU usage to see if the SSL load was too
much for our server CPUs. However on one of the modern 24 core machines
running HAProxy I noticed top was only reporting around 100% CPU usage,
with both the user and system CPU distributed pretty evenly across all the
cores (4-8% user per core, 0.5-2% system). The idle percentage was in the
high nineties, both as reported by top and by the haproxy socket Idle_pct.
This was just a quick gathering of info and may not be representative,
since our prometheus node exporter only shows overall CPU (which was a low
5% of the total on all cores throughout). This is for a bare metal server
which is just running a HAProxy processing around 200 SSL req/sec, and not
doing much else.
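
For reference, the Idle_pct figure came from the runtime API, along
these lines (the socket path is whatever your config defines, and the
value shown is just an example):

  $ echo "show info" | socat stdio /var/run/haproxy.sock | grep Idle_pct
  Idle_pct: 98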

I started wondering if our global settings:

  master-worker
  nbthread 24
  cpu-map auto:1/1-24 0-23
  tune.ssl.cachesize 10

were appropriate or if they had caused some inefficiency in using our
machine's cores, which then caused this backlog. Or whether what I am
observing is completely normal, given that we are now spending more time on
SSL decoding so can expect more queuing (our backend servers are very fast
and so we run them with a small maxconn, but they don't care if the request
is SSL or not so the overall request time should be the same other than SSL
processing time). We are running either the latest OpenSSL 1.1.1 or
WolfSSL, all compiled sensibly (AES-NI etc).

I turned to https://docs.haproxy.org/2.9/management.html#7 which had some
very interesting advice about pinning haproxy to one CPU core and the
interrupts to another one, but it also mentioned nbproc and the bind
process option for better SSL traffic processing. Given that seems to be a
bit out of date, I thought I might ask my question here instead.
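
For what it's worth, the pinning part of that advice boils down to
standard Linux tooling; a sketch, run as root (the IRQ number, CPU
choices and PID are machine-specific assumptions, not from the doc):

  # pin the haproxy worker to CPU 0 (replace <worker-pid> accordingly)
  $ taskset -pc 0 <worker-pid>
  # route IRQ 123 (a NIC queue, machine-specific) to CPU 1 (mask 0x2)
  $ echo 2 > /proc/irq/123/smp_affinity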

Is there a way to use the CPU cores available on our HAProxy machines to
handle SSL requests better than I have with the global config above? I
realise this is a bit of an open-ended question, but for example I was
wondering if we could reduce the number of active sessions (so we don't
hit maxconn) by increasing threads beyond the number of CPU cores;
naively it seems that might increase per-session latency but improve
overall throughput, since we don't appear to be taxing any of the cores
(and have lots of memory available on these machines). As I said, I am
not even sure
there is a problem, but I would like to understand a bit better if there is
anything we can do to help HAProxy use the CPU cores more effectively,
since all the advice I can find is obsolete (nbproc etc) and it is quite
hard to experiment when I don't know what is good to measure.
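
For example, would something like the per-thread counters from the
stats socket be the right thing to watch (socket path depends on the
local config)?

  $ echo "show activity" | socat stdio /var/run/haproxy.sock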

Thanks for your time,

Miles


[ANNOUNCE] haproxy-2.9.4

2024-01-31 Thread Willy Tarreau
Hi,

HAProxy 2.9.4 was released on 2024/01/31. It added 24 new commits
after version 2.9.3.

This version addresses various long-term stability issues that popped all
at once, so I preferred to issue it shortly after they were all addressed
rather than risk letting more accumulate and seeing users forced to roll
back later due to any regression that may yet happen. We'll
also issue new versions of older branches progressively as time permits.

The issues fixed in this version are:
  - an API issue with OpenSSL. The SSL_do_handshake() function returns
SSL_ERROR_WANT_READ when it needs more data, but in certain obscure
circumstances related to internal error handling, it was found that
it may stop trying to read available data and continue to return that
status! This results in wakeup loops that prevent the process from
sleeping, hence it consumes 100% of the CPU (but it's still working
fine). The code does what the doc suggests (but the doc is basically
a one-liner), and neither aws-lc nor wolfSSL exhibit this problem.

Regardless, we decided to do as OpenSSL does in its socket BIO, which
doesn't show this problem: it always clears both direction flags before
any attempt in either direction. This addressed the issue without
degrading anything, even for the other libs. This problem
has been there since 2.0 and is very hard to reproduce without a prior
trace (it's the first time it's reported). I'd like to thank Valentin
Gutierrez for his invaluable report with a capture and a working
reproducer, and Olivier Houchard for the quick fix that clearly looks
more robust than my early workaround. That's issue #2403.

  - a regression in the cache's handling of secondary keys in 2.9 that
may sometimes cause a crash (issue #2417).

  - a possible crash in the QPACK encoder when encoding HTTP/3 responses
carrying status codes above 599.

  - another QUIC issue whereby some streams reset with pending outgoing
data may clog the output buffer until the connection closes, possibly
causing the connection to slow down or even stall.

  - in H2, certain errors would only trigger a stream error (i.e. RESET)
instead of a connection error, due to an insufficient fix that was
merged in 2.9.3.

  - the status of agent checks is returned as-is in the stats CSV output,
resulting in a mangled CLI output if it contains line feeds. It
has been there since 2.0.

  - the HTTP/1 chunk and header parsers were strengthened a bit. Indeed,
Ben Kallus kindly reminded us that we would still accept the NUL byte
in header values and plain LF in chunks, while we were (wrongly) quite
certain that these had long been rejected. Ben is currently not aware
of situations where this could help convey an attack to any existing
component, but given the surprises he certainly faces in his reviews,
it's probably only a matter of time before one implementation proves
too weak and we fail to properly protect it. So it was better to
address both at once. In the extremely unlikely case that anyone would
discover such an invalid byte on their network with an application that
heavily relies on it, the option accept-invalid-http* keywords will work
as usual to bypass the check (see the example after this list). We'll
backport that to older versions as well, and I
think it would be prudent for distros to take that as well.

  - an interesting arch-specific bug in the JWT parser: by initializing
a 64-bit variable a bit too early, everything was fine on 64-bit
platforms, but on 32-bit ones, a pointer located closer to the
beginning of the structure got reset by this initialization before it
was used, causing a crash! The fact this was only noticed now by running
VTest on a 32-bit platform just shows that 32-bit users are less common
these days and that their configs are probably simple enough not to use
JWT ;-)

  - the "newreno" congestion control algorithm for QUIC was misspelled
"newrno" in the code, making the config parser not recognize it.

  - and a few other low-importance fixes and doc updates.
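
For illustration, enabling that bypass would look like the following
sketch (the section and server names are made up; only do this where an
application is known to depend on such invalid bytes):

  frontend fe_legacy
      bind :8080
      option accept-invalid-http-request    # relax request parsing
      default_backend be_legacy

  backend be_legacy
      option accept-invalid-http-response   # relax response parsing
      server app1 192.0.2.10:8080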

I'd suggest that all users of 2.9 adopt this one now so that we can later
switch to less important fixes and backports if needed. There's currently
nothing else in the pipe concerning bugs, but we're still investigating a
case we triggered in the lab where the QUIC congestion window sometimes
doesn't open enough, and which could be responsible for lower than expected
performance on large objects when using the default Cubic algorithm (as
Tristan observes). But that's quite difficult because the original RFC was
barely usable due to numerous ambiguities; fortunately there's a very
recent new one that allows us to recheck the code against it (and we'll
take this opportunity to rename some parts according to the updated spec).
We're hopeful that we'll get some good news from this front soon!

Please 

Re: [RFC PATCH] DOC: httpclient: add dedicated httpclient section

2024-01-31 Thread William Lallemand

Hello Lukas,

On 2024-01-30 22:17, Lukas Tribus wrote:

Move the httpclient keywords into their own section and add an
introductory paragraph explaining it.

Also see Github issue #2409

Should be backported to 2.6; but note that:
2.7 does not have httpclient.resolvers.disabled
2.6 does not have httpclient.retries and httpclient.timeout.connect
---
  doc/configuration.txt | 131 ++
  1 file changed, 69 insertions(+), 62 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 208b474471..402fa3d317 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -58,6 +58,7 @@ Summary
  3.8.  HTTP-errors
  3.9.  Rings
  3.10. Log forwarding
+3.11. httpclient



I replaced the title with "HTTPClient tuning".


[...]
  
+3.11. httpclient
+----------------
+
+httpclient is an internal HTTP library, it can be used by various subsystems,
+for example in LUA scripts. httpclient is not used in the data path; in other
+words, it has nothing to do with HTTP traffic passing through HAProxy.
+


I replaced "httpclient" by "HTTPClient" to match the lua documentation, 
and separated the httpclient.* keywords in the global index in a "* 
HTTPClient".


That would be clear enough, thanks! I pushed it to master.

I also feel like the "3. Global parameters" section drifted a long time
ago: some subsections describe other configuration sections like
"userlists", "peers", "mailers", "programs" etc. instead of keywords
from the global section, which is confusing. Maybe we should try to
clean this up.



Regards,

--
William Lallemand