Re: Support arbitrary PROXY protocol v2 TLVs as samples

2023-01-24 Thread Willy Tarreau
Hi Johannes,

On Wed, Jan 18, 2023 at 10:49:18AM +0000, Bitsch, Johannes (external - Project)
wrote:
> Hi again,
> 
> I checked my patch file from a few weeks ago using the recommended
> checkpatch.pl [1] and realized that the indentation was off as well as some
> other small things.
> To make this easier to review, I fixed all the issues mentioned by checkpatch
> (except for editing MAINTAINERS, I don't think we're quite there yet). The
> implementation was not changed. Apologies for not checking this earlier.

No problem, thanks, I'll have a look hopefully this week. Do not hesitate
to ping again if you don't see a comment, as constantly switching between
plenty of issues and reviews makes it easy to just forget some of them.

Thanks,
Willy



Re: Theoretical limits for a HAProxy instance

2023-01-24 Thread Willy Tarreau
Hi Iago,

On Tue, Jan 24, 2023 at 04:45:54PM +0100, Iago Alonso wrote:
> We are happy to report that after downgrading to OpenSSL 1.1.1s (from
> 3.0.7), our performance problems are solved, and now it looks like
> HAProxy scales linearly with the available resources.

Excellent, thanks for this nice feedback! And welcome to the constantly
growing list of users condemned to use 1.1.1 forever :-/

> For reference, in a synthetic load test with a request payload of 2k,
> and a 32-core server (128GB RAM) with 10Gb bandwidth available, we are
> able to sustain 1.83M conn, 82K rps, and a sslrate of 22k, with a load
> average of about 31 (idle time percent was about 6)

These look like excellent numbers, which should hopefully give you
plenty of headroom.

Thanks for your feedback!
Willy



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Wed, Jan 25, 2023 at 12:04:14AM +0100, Olivier Houchard wrote:
> >   0x0af892c770b0 :   mov    %r12,%rdi
> >   0x0af892c770b3 :   callq  0xaf892c24e40 
> > 
> >   0x0af892c770b8 :   mov    %rax,%r12
> >   0x0af892c770bb :   test   %rax,%rax
> >   0x0af892c770be :   je     0xaf892c770e5 
> > 
> >   0x0af892c770c0 :   mov    0x18(%r12),%rax
> > =>0x0af892c770c5 :   mov    0xa0(%rax),%r11
> >   0x0af892c770cc :   test   %r11,%r11
> >   0x0af892c770cf :   je     0xaf892c770b0 
> > 
> >   0x0af892c770d1 :   mov    %r12,%rdi
> >   0x0af892c770d4 :   mov    %r13d,%esi
> > 
> >   1229  conn = 
> > srv_lookup_conn(&srv->per_thr[i].safe_conns, hash);
> >   1230  while (conn) {
> >   1231==>>> if (conn->mux->takeover && 
> > conn->mux->takeover(conn, i) == 0) {
> > ^^^
> > 
> > It's here. %rax==0, which is conn->mux when we're trying to dereference
> > it to retrieve ->takeover. I can't see how it's possible to have a NULL
> > mux here in an idle connection since they're placed there by the muxes
> > themselves. But maybe there's a tiny race somewhere that's impossible to
> > reach except under such an extreme contention between the two sockets.
> > I really have no idea regarding this one for now, maybe Olivier could
> > spot a race here ? Maybe we're missing a barrier somewhere.
>  
> One way I can see that happening, as unlikely as it may be, is if
> h1_takeover() is called, but fails to allocate a task or a tasklet, and
> will call h1_release(), which will set the mux to NULL and call
> conn_free().

:-)  thanks for the analysis!

I would find it strange that the task allocation fails, as it would
indicate the machine is lacking RAM. Or maybe, in order to avoid extreme
lock contention in malloc() on this platform when running on non-uniform
machines, they decided to use trylocks that can simply fail to return
memory instead of waiting forever. In this case we should be able to
reproduce the same issue by setting the fail-alloc fault injection rate
to a non-zero value.
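
For example, something like this in the global section should exercise
that path (a minimal sketch, assuming a build with DEBUG_FAIL_ALLOC, or
a recent enough version started with -dMfail):

    global
        # make roughly 1% of pool allocations fail, to exercise
        # the error recovery paths discussed above
        tune.fail-alloc 1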

> It was previously ok to happen, because conn_free() would
> unconditionally remove the connection from any list, but since the idle
> connections now stay in a tree, and no longer in a list, that no longer
> happens.

I'm not sure why this has to be different, we could as well decide to
unconditionally remove it from the tree. I'll have a look and try to
figure out why.
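
Something along these lines in conn_free() would restore the old
behaviour (an untested sketch, assuming the conn_delete_from_tree()
helper wrapping the ebtree removal):

	/* sketch: unconditionally detach the connection from the idle
	 * tree before its hash node is released, mirroring what
	 * conn_free() used to do for the idle lists */
	if (conn->hash_node)
		conn_delete_from_tree(&conn->hash_node->node);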

> Another possibility of h1_release() being called while still in the
> idle tree is if h1_wake() gets called, but the only way I see that
> happening is from mux_stopping_process(), whatever that is, so that's
> unlikely to be the problem in this case.

Indeed. It's still good to keep this in mind though.

> Anyway, I suggest doing something along the line of the patch attached,
> and add a BUG_ON() in conn_free() to catch any freeing of connection
> still in the idle tree.

I agree, better crash where the problem is caused than where it hurts.
I'll take your patch, thanks!

Willy



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2023 at 11:59:16PM -0600, Marc West wrote:
> On 2023-01-24 23:04:14, Olivier Houchard wrote:
> > On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> > > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > > > Stupid question but I prefer to ask in order to be certain, are all of
> > > > > these 32 threads located on the same physical CPU ? I just want to be
> > > > > sure that locks (kernel or user) are not traveling between multiple 
> > > > > CPU
> > > > > sockets, as that ruins scalability.
> > > > 
> > > > Very good observation. This is a 2 socket system, 20 cores per socket,
> > > > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > > > are not on the same physical CPU.
> > > 
> > > Ah, then that's particularly bad. When you're saying "there is no
> > > affinity", do you mean the OS doesn't implement anything or only that
> > > haproxy doesn't support it ? Could you please check if you have man
> > > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> > > is the FreeBSD version and the second the Linux version.
> > 
> > Unfortunately, having a quick look at OpenBSD's sources, I don't think
> > they provide anything that could look like that. In fact, I'm not sure
> > there is any way to restrict a process to a specific CPU.
> 
> Unfortunately that is what I found also, no references to affinity in
> the manpages, nor any tools like taskset or NUMA-related knobs.
> It seems to be a design decision.
> 
> https://marc.info/?l=openbsd-misc&m=152507006602422&w=2
> https://marc.info/?l=openbsd-tech&m=133909957708933&w=2
> https://marc.info/?l=openbsd-misc&m=135884346431916&w=2

Then it's really pathetic, because implementing multi-CPU support
without being able to pin tasks to CPUs is counter-productive. I know
that their focus is primarily on code simplicity and auditability but
they already lost half of it by supporting SMP, and not doing it right
only makes the situation worse, and even opens the way to trivial DoS
attacks on running processes. Sadly the response to that message from
Ilya in the last comment you pointed above says it all:

  "The scheduler will know what CPUs are busy and which ones are
   not and will make apprpriate decisions."

That's of course an indication of total ignorance of how hardware
works, since the right CPU to use is not directly related to the
ones that are busy or not, but first and foremost to the cost of
the communications with other threads/processes/devices that is
related to the distance between them, and the impacts of cache
line migration. That's precisely why operating systems supporting
multi-processing all offer the application the ability to set its
affinity based on what it knows about its workload. And seeing that
it was already required 10 years ago and still not addressed just
shows a total lack of interest in multicore machines.

> I thought there might be a way to enable just the CPUs that are on the
> first package via the kernel config mechanism but attempting that
> resulted in an immediate reboot. It looks like some kernel hacking would
> be needed to do this. There are no options in the BIOS to disable one of
> the sockets unfortunately and I don't have any single-socket machines to
> test with (aside from a laptop).

Yeah that's annoying. And you're not necessarily interested in running
that OS inside a virtual machine managed by a more capable operating
system.

> Given that, I think we are going to have to plan to switch OS to FreeBSD
> for now.

Yeah in your case I think that's the best plan. In addition, there's a
large community using haproxy on FreeBSD, and it's generally agreed that
it's the second operating system working the best with haproxy. Like
Linux it remains a modern and featureful OS with a powerful network
stack, and I think you shouldn't have any issues seeking help about it
if needed.

> OpenBSD is a great OS and a joy to work with but for this
> particular use case we do need to handle much more HTTPS traffic in
> production.

I agree. I've used it for a long time and really loved its light weight,
simplicity and reliability. It has been working flawlessly for something
like 10 years on my VAX with 24 MB RAM. I had to stop due to disks failing
repeatedly. But while I think that running it on a Geode-LX based firewall
or VPN can still make sense nowadays for an ADSL line, I also think it's
about time to abandon designs stuck in the 90s that try too hard to run
correctly on hardware designed in 2020+ with totally different concepts.
As a rule of thumb it should probably not be installed on a machine which
possesses a fan.

> Thanks again for all the replies and I will have this hardware available
> for the foreseeable future in case there is any more testing that would
> be helpful.

You're welcome, and thanks for sharing your experience, it's always
useful to many others!

Willy



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Marc West
On 2023-01-24 23:04:14, Olivier Houchard wrote:
> On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > > Stupid question but I prefer to ask in order to be certain, are all of
> > > > these 32 threads located on the same physical CPU ? I just want to be
> > > > sure that locks (kernel or user) are not traveling between multiple CPU
> > > > sockets, as that ruins scalability.
> > > 
> > > Very good observation. This is a 2 socket system, 20 cores per socket,
> > > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > > are not on the same physical CPU.
> > 
> > Ah, then that's particularly bad. When you're saying "there is no
> > affinity", do you mean the OS doesn't implement anything or only that
> > haproxy doesn't support it ? Could you please check if you have man
> > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> > is the FreeBSD version and the second the Linux version.
> 
> Unfortunately, having a quick look at OpenBSD's sources, I don't think
> they provide anything that could look like that. In fact, I'm not sure
> there is any way to restrict a process to a specific CPU.

Unfortunately that is what I found also, no references to affinity in
the manpages, nor any tools like taskset or NUMA-related knobs.
It seems to be a design decision.

https://marc.info/?l=openbsd-misc&m=152507006602422&w=2
https://marc.info/?l=openbsd-tech&m=133909957708933&w=2
https://marc.info/?l=openbsd-misc&m=135884346431916&w=2

I thought there might be a way to enable just the CPUs that are on the
first package via the kernel config mechanism but attempting that
resulted in an immediate reboot. It looks like some kernel hacking would
be needed to do this. There are no options in the BIOS to disable one of
the sockets unfortunately and I don't have any single-socket machines to
test with (aside from a laptop).

Given that, I think we are going to have to plan to switch OS to FreeBSD
for now. OpenBSD is a great OS and a joy to work with but for this
particular use case we do need to handle much more HTTPS traffic in
production.

Thanks again for all the replies and I will have this hardware available
for the foreseeable future in case there is any more testing that would
be helpful.



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Olivier Houchard
On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > Stupid question but I prefer to ask in order to be certain, are all of
> > > these 32 threads located on the same physical CPU ? I just want to be
> > > sure that locks (kernel or user) are not traveling between multiple CPU
> > > sockets, as that ruins scalability.
> > 
> > Very good observation. This is a 2 socket system, 20 cores per socket,
> > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > are not on the same physical CPU.
> 
> Ah, then that's particularly bad. When you're saying "there is no
> affinity", do you mean the OS doesn't implement anything or only that
> haproxy doesn't support it ? Could you please check if you have man
> pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> is the FreeBSD version and the second the Linux version.

Unfortunately, having a quick look at OpenBSD's sources, I don't think
they provide anything that could look like that. In fact, I'm not sure
there is any way to restrict a process to a specific CPU.

[...]

> Oh thank you very much. From what I'm seeing we're here:
> 
>   0x0af892c770b0 :   mov    %r12,%rdi
>   0x0af892c770b3 :   callq  0xaf892c24e40 
> 
>   0x0af892c770b8 :   mov    %rax,%r12
>   0x0af892c770bb :   test   %rax,%rax
>   0x0af892c770be :   je     0xaf892c770e5 
> 
>   0x0af892c770c0 :   mov    0x18(%r12),%rax
> =>0x0af892c770c5 :   mov    0xa0(%rax),%r11
>   0x0af892c770cc :   test   %r11,%r11
>   0x0af892c770cf :   je     0xaf892c770b0 
> 
>   0x0af892c770d1 :   mov    %r12,%rdi
>   0x0af892c770d4 :   mov    %r13d,%esi
> 
>   1229  conn = 
> srv_lookup_conn(&srv->per_thr[i].safe_conns, hash);
>   1230  while (conn) {
>   1231==>>> if (conn->mux->takeover && 
> conn->mux->takeover(conn, i) == 0) {
> ^^^
> 
> It's here. %rax==0, which is conn->mux when we're trying to dereference
> it to retrieve ->takeover. I can't see how it's possible to have a NULL
> mux here in an idle connection since they're placed there by the muxes
> themselves. But maybe there's a tiny race somewhere that's impossible to
> reach except under such an extreme contention between the two sockets.
> I really have no idea regarding this one for now, maybe Olivier could
> spot a race here ? Maybe we're missing a barrier somewhere.
 
One way I can see that happening, as unlikely as it may be, is if
h1_takeover() is called, but fails to allocate a task or a tasklet, and
will call h1_release(), which will set the mux to NULL and call
conn_free(). It was previously ok to happen, because conn_free() would
unconditionally remove the connection from any list, but since the idle
connections now stay in a tree, and no longer in a list, that no longer
happens.
Another possibility of h1_release() being called while still in the
idle tree is if h1_wake() gets called, but the only way I see that
happening is from mux_stopping_process(), whatever that is, so that's
unlikely to be the problem in this case.
Anyway, I suggest doing something along the lines of the attached patch,
adding a BUG_ON() in conn_free() to catch any freeing of a connection
still in the idle tree.

Olivier
From 2a0ac4b84b97aa05cd7befc5fcf45b03795f2e76 Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Tue, 24 Jan 2023 23:59:32 +0100
Subject: [PATCH] MINOR: Add a BUG_ON() to detect destroying connection in idle
 list

Add a BUG_ON() in conn_free(), to check that when we're freeing a
connection, it is not still in the idle connections tree, otherwise the
next thread that will try to use it will probably crash.
---
 src/connection.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/connection.c b/src/connection.c
index 4a73dbcc8..97619ec26 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -498,6 +498,10 @@ void conn_free(struct connection *conn)
 	pool_free(pool_head_uniqueid, istptr(conn->proxy_unique_id));
 	conn->proxy_unique_id = IST_NULL;
 
+	/* Make sure the connection is not left in the idle connection tree */
+	if (conn->hash_node != NULL)
+		BUG_ON(conn->hash_node->node.node.leaf_p != NULL);
+
 	pool_free(pool_head_conn_hash_node, conn->hash_node);
 	conn->hash_node = NULL;
 
-- 
2.36.1



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > Stupid question but I prefer to ask in order to be certain, are all of
> > these 32 threads located on the same physical CPU ? I just want to be
> > sure that locks (kernel or user) are not traveling between multiple CPU
> > sockets, as that ruins scalability.
> 
> Very good observation. This is a 2 socket system, 20 cores per socket,
> and since there is no affinity in OpenBSD unfortunately the 32 threads
> are not on the same physical CPU.

Ah, then that's particularly bad. When you're saying "there is no
affinity", do you mean the OS doesn't implement anything or only that
haproxy doesn't support it ? Could you please check if you have man
pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
is the FreeBSD version and the second the Linux version.
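
For reference, pinning the current thread on FreeBSD looks roughly like
this (an untested sketch of the cpuset API, just to show what OpenBSD
appears to be missing):

    #include <sys/param.h>
    #include <sys/cpuset.h>

    /* restrict the calling thread to CPU 0 only */
    static int pin_to_cpu0(void)
    {
        cpuset_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        /* id -1 with CPU_WHICH_TID means "the current thread" */
        return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID,
                                  -1, sizeof(set), &set);
    }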

> cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on, so
> that the odd cpus are on one socket and the even ones on the other.
> Watching threads in top with a 1 sec interval I see them bouncing
> around a lot between sockets.

Yes that must be horrible. In fact, the worst possible scenario!

> > It's great already to be able to rule out that one. Another useful
> > test you could run is to place a reject rule at the connection level
> > in your frontend (it will happen before SSL tries to process traffic):
> > 
> >tcp-request connection reject
> > 
> > (don't do that in production of course). Once this is installed you
> > can compare again the CPU usage with no-thread, and 32-thread without
> > shards and 32 threads with 8 shards.
> 
> This is HTTPS with the reject rule:
> 
> - no nbthread, no shards:
> -- 40  CPUs:  0.3% user,  0.0% nice,  0.7% sys,  0.1% spin,  0.1% intr, 98.9% 
> idle
> -- haproxy peaked at 20% CPU
> 
> - nbthread 32, no shards:
> -- 40  CPUs:  0.5% user,  0.0% nice, 19.9% sys,  4.6% spin,  0.1% intr, 75.0% 
> idle
> -- haproxy peaked at 863% CPU
> 
> - nbthread 32, 8 shards:
> -- 40  CPUs:  0.4% user,  0.0% nice,  1.9% sys,  0.1% spin,  0.1% intr, 97.4% 
> idle
> -- haproxy peaked at 62% CPU
> -- this only used 5 threads according to top FYI 

OK, let's basically say it doesn't work very well... At least it makes
sense now.

> Since the SSL stack has been mentioned a couple times I retested with
> HTTP instead of HTTPS and the results are interesting:
> 
> - HTTP, no nbthread, no shards 
> -- 40  CPUs:  0.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.2% intr, 97.0% 
> idle
> -- haproxy peaked at 54% CPU
> -- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate 
> 403.512 Mbps, Running tasks: 0/922; idle = 28 %
> -- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout
> test. 100% success 0 fail
> 
> - HTTP, nbthread 32, no shards
> -- 40  CPUs:  0.7% user,  0.0% nice, 29.6% sys,  0.2% spin,  0.5% intr, 69.1% 
> idle
> -- haproxy peaked at 800% CPU
> -- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate 
> 410.711 Mbps, Running tasks: 28/967; idle = 64 %
> -- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% 
> success 0 fail
> 
> - HTTP, nbthread 32, 8 shards
> -- 40  CPUs:  0.7% user,  0.0% nice,  5.0% sys,  0.2% spin,  0.2% intr, 94.0% 
> idle
> -- haproxy peaked at 147% CPUs
> -- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate 
> 401.946 Mbps, Running tasks: 0/746; idle = 93 %
> -- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 100% 
> success 0 fail 
>
> Here we still see sys% growing with more threads but even with the waste,
> with HTTP we get a reliable 7-8k responses/sec and 100% success rate
> instead of 400-500/sec for a few second burst and then a total stall /
> nearly 0% success rate with HTTPS. I expect SSL to have some cost but 
> not quite this huge, and the "stall" of traffic/health checks under 
> heavy HTTPS load is a bit puzzling.

It's important to split the problems. Here clearly the test is limited
by the client to ~7.3k/s, but it's still sufficient to exacerbate the
threading issue with 32 threads and no shards. The SSL part is a second
problem, but since openssl uses a lot of locks, it could be a bigger
victim of the apparently poor thread implementation, explaining why it
doesn't scale at all.

> Do you think it would be worth trying to install OpenSSL 3.0.7 from
> ports and manually build haproxy against that to compare with the
> current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else?

No, OpenSSL 3 is much much worse. It's blatant that its performance was
never tested outside of "openssl speed" before being released, because
its extreme abuse of locks makes it totally unusable for any usage in
a network-connected daemon :-(  LibreSSL 3.6 could work, but it could
as well be a victim of the threading problem.

Don't you have at least an equivalent of the linux taskset utility on
OpenBSD, to force all your threads on the same physical package ? Or
maybe you'll find some NUMA tools ? Or in the worst case, 

Re: HAProxy performance on OpenBSD

2023-01-24 Thread Marc West
On 2023-01-24 06:58:57, Willy Tarreau wrote:
> Hi Marc,

Hi Willy,

> See the difference ? There seems to be an insane FD locking cost on this
> system that simply wastes 40% of the CPU there. So I suspect that in your
> first tests you were stressing the locking while in the last ones you
> were stressing the SSL stack.
> 
> Stupid question but I prefer to ask in order to be certain, are all of
> these 32 threads located on the same physical CPU ? I just want to be
> sure that locks (kernel or user) are not traveling between multiple CPU
> sockets, as that ruins scalability.

Very good observation. This is a 2 socket system, 20 cores per socket,
and since there is no affinity in OpenBSD unfortunately the 32 threads
are not on the same physical CPU.

cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on, so
that the odd cpus are on one socket and the even ones on the other.
Watching threads in top with a 1 sec interval I see them bouncing around
a lot between sockets.

> It's great already to be able to rule out that one. Another useful
> test you could run is to place a reject rule at the connection level
> in your frontend (it will happen before SSL tries to process traffic):
> 
>tcp-request connection reject
> 
> (don't do that in production of course). Once this is installed you
> can compare again the CPU usage with no-thread, and 32-thread without
> shards and 32 threads with 8 shards.
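
(For context, the test frontend looked something like the following;
names and ports are illustrative:)

    frontend fe_https
        bind :443 ssl crt /etc/haproxy/cert.pem
        # reject every connection before any SSL processing, to
        # isolate connection handling cost from the SSL stack
        tcp-request connection reject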

This is HTTPS with the reject rule:

- no nbthread, no shards:
-- 40  CPUs:  0.3% user,  0.0% nice,  0.7% sys,  0.1% spin,  0.1% intr, 98.9% 
idle
-- haproxy peaked at 20% CPU

- nbthread 32, no shards:
-- 40  CPUs:  0.5% user,  0.0% nice, 19.9% sys,  4.6% spin,  0.1% intr, 75.0% 
idle
-- haproxy peaked at 863% CPU

- nbthread 32, 8 shards:
-- 40  CPUs:  0.4% user,  0.0% nice,  1.9% sys,  0.1% spin,  0.1% intr, 97.4% 
idle
-- haproxy peaked at 62% CPU
-- this only used 5 threads according to top FYI 

Since the SSL stack has been mentioned a couple times I retested with
HTTP instead of HTTPS and the results are interesting:

- HTTP, no nbthread, no shards 
-- 40  CPUs:  0.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.2% intr, 97.0% 
idle
-- haproxy peaked at 54% CPU
-- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate 
403.512 Mbps, Running tasks: 0/922; idle = 28 %
-- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout
test. 100% success 0 fail

- HTTP, nbthread 32, no shards
-- 40  CPUs:  0.7% user,  0.0% nice, 29.6% sys,  0.2% spin,  0.5% intr, 69.1% 
idle
-- haproxy peaked at 800% CPU
-- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate 
410.711 Mbps, Running tasks: 28/967; idle = 64 %
-- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% 
success 0 fail

- HTTP, nbthread 32, 8 shards
-- 40  CPUs:  0.7% user,  0.0% nice,  5.0% sys,  0.2% spin,  0.2% intr, 94.0% 
idle
-- haproxy peaked at 147% CPUs
-- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate 
401.946 Mbps, Running tasks: 0/746; idle = 93 %
-- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 100% 
success 0 fail 

Here we still see sys% growing with more threads but even with the waste,
with HTTP we get a reliable 7-8k responses/sec and 100% success rate
instead of 400-500/sec for a few second burst and then a total stall /
nearly 0% success rate with HTTPS. I expect SSL to have some cost but 
not quite this huge, and the "stall" of traffic/health checks under 
heavy HTTPS load is a bit puzzling.

Do you think it would be worth trying to install OpenSSL 3.0.7 from
ports and manually build haproxy against that to compare with the
current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else?

> Ah, great! Do you have any info on the signal that was received there ?
> If you still have the core, issuing "info registers", then "disassemble
> conn_backend_get" and pressing enter till the end of the function could
> be useful to try to locate what's happening there.

It was signal 11 and yes I do still have the core, attached!
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-unknown-openbsd7.2"...
Core was generated by `haproxy'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libpthread.so.26.2...done.
Loaded symbols for /usr/lib/libpthread.so.26.2
Loaded symbols for /usr/local/sbin/haproxy
Reading symbols from /usr/lib/libz.so.7.0...done.
Loaded symbols for /usr/lib/libz.so.7.0
Symbols already loaded for /usr/lib/libpthread.so.26.2
Reading symbols from /usr/lib/libssl.so.53.0...done.
Loaded symbols for /usr/lib/libssl.so.53.0
Reading symbols from /usr/lib/libcrypto.so.50.0...done.
Loaded 

Re: Theoretical limits for a HAProxy instance

2023-01-24 Thread Iago Alonso
We are happy to report that after downgrading to OpenSSL 1.1.1s (from
3.0.7), our performance problems are solved, and now it looks like
HAProxy scales linearly with the available resources.

For reference, in a synthetic load test with a request payload of 2k,
and a 32-core server (128GB RAM) with 10Gb bandwidth available, we are
able to sustain 1.83M conn, 82K rps, and a sslrate of 22k, with a load
average of about 31 (idle time percent was about 6)
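
For anyone wanting to reproduce the comparison, a build pinned to
OpenSSL 1.1.1 looks roughly like this (illustrative paths, assuming a
locally installed 1.1.1s):

    # build against a local OpenSSL 1.1.1s instead of the system v3;
    # SSL_INC/SSL_LIB point at wherever 1.1.1s is installed
    make -j$(nproc) TARGET=linux-glibc USE_OPENSSL=1 \
         SSL_INC=/opt/openssl-1.1.1s/include \
         SSL_LIB=/opt/openssl-1.1.1s/lib \
         ADDLIB="-Wl,-rpath,/opt/openssl-1.1.1s/lib"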

On Wed, Dec 21, 2022 at 5:39 PM Iago Alonso  wrote:
>
> > Interesting, so you have conntrack enabled. With 5M conns, there's no
> > reason to fill your table. However, have you checked your servers' kernel
> > logs to see if you find any "conntrack table full" message, that might
> > be caused by too fast connection recycling ?
> We can't see any message mentioning conntrack.
>
> > With that said, excessive use of rwlock indicates that openssl is
> > very likely there, because that's about the only caller in the process.
> Indeed, we tried compiling HAProxy 2.6.6 against OpenSSL 1.1.1s and the
> `perf top` results look much better. We can't recall why we decided
> to use OpenSSL v3, but it seems like we have to stick to v1.
>
> 2.61%  haproxy[.] sha512_block_data_order_avx2
> 1.89%  haproxy[.] fe_mul
> 1.64%  haproxy[.] x25519_fe64_sqr
> 1.44%  libc.so.6[.] malloc
> 1.39%  [kernel] [k] __nf_conntrack_find_get
> 1.21%  libc.so.6[.] cfree
> 0.95%  [kernel] [k] native_queued_spin_lock_slowpath.part.0
> 0.94%  haproxy[.] OPENSSL_cleanse
> 0.78%  libc.so.6[.] 0x000a3b46
> 0.78%  [kernel] [k] do_epoll_ctl
> 0.75%  [kernel] [k] ___slab_alloc
> 0.73%  [kernel] [k] psi_group_change
>
> For reference, with a 12 core machine, we ran a test with 420k connections,
> 32k rps, and a sustained ssl rate of 5k, and the load1 was about 9, and
> the idle was about 30.
>
> And as a side note, we didn't have http/2 enabled, and after doing
> so, we ran the same test, and the idle was about 55, and the load1 was
> about 6.
>
> > OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
> > fixes, though these should not impact your tests and 2.7.0 will be OK).
> > And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
> > If so it will definitely confirm that you're endlessly waiting for some
> > threads stuck in SSL waiting on a mutex.
> We've just tried 2.7.1 with USE_PTHREAD_EMULATION=1 (OpenSSL 3.0.7), and
> the results are pretty much the same, no noticeable improvement.
>
> > Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
> > rule in your iptables whose counter increases during the tests. It might
> > be possible that certain outgoing connections are rejected due to fast
> > port recycling if they were closed by an RST that was lost before hitting
> > conntrack. Another possibility here is that the TCP connection was
> > established but that the SSL handshake didn't complete in time. This
> > could be caused precisely by the openssl 3 locking issues I was speaking
> > about, but I guess not at such numbers, and you'd very likely see
> > the native_queue_spinlock_slowpath thing in perf top.
> We do have a `-A ufw-before-input -m conntrack --ctstate INVALID -j DROP`
> but the counter does not increase during the tests. Looks like OpenSSL
> v3 is the issue.
>
>
> On 16/12/22 18:11, Willy Tarreau wrote:
> > On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
> >> Hi,
> >>
> >>> Ah that's pretty useful :-) It's very likely dealing with the handshake.
> >>> Could you please run "perf top" on this machine and list the 10 top-most
> >>> lines ? I'm interested in seeing if you're saturating on crypto functions
> >>> or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> >>> indicate an abuse of mutexes.
> >> We are load testing with k6 (https://k6.io/), in a server with an
> >> AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test
> >> with 300k connections, 15k rps, and a sslrate and connrate of about 4k.
> >> The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
> >> 450k. The `haproxy_server_connect_time_average_seconds` was between
> >> 0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
> >> always about 0.2s lower.
> >>
> >> We have these custom kernel parameters set:
> >> net.nf_conntrack_max=5000000
> >> fs.nr_open=5000000
> >> net.ipv4.ip_local_port_range=12768 60999
> > Interesting, so you have conntrack enabled. With 5M conns, there's no
> > reason to fill your table. However, have you checked your servers' kernel
> > logs to see if you find any "conntrack table full" message, that might
> > be caused by too fast connection 

[ANNOUNCE] haproxy-2.5.11

2023-01-24 Thread Christopher Faulet

Hi,

HAProxy 2.5.11 was released on 2023/01/24. It added 65 new commits
after version 2.5.10.

As with 2.6.8, this release includes the fix for the "set-uri" HTTP
action. This fix had been deferred from 2.5.10 and is now shipped with
2.5.11. The behavior of this action is no longer the same: the action had
been bogus for a while, was not working as documented, and used to make
HTTP/1 and HTTP/2 produce different outputs. The URI is now sent to the
H1 server exactly as set by the action.
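
For instance, with a rule like the following, the URI now reaches an
HTTP/1 server byte-for-byte as written (an illustrative configuration,
not taken from the release notes):

    frontend fe_main
        bind :8080
        # the URI set here is now forwarded to an HTTP/1 server
        # exactly as written, without being re-normalized
        http-request set-uri /rewritten/index.html?src=mail
        default_backend be_app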

Otherwise, for other fixes, this release is pretty similar to the 2.6.8,
excluding QUIC/H2 fixes:

About the H2:
  * Interim responses that carry the end-of-stream flag are now rejected as
invalid, whereas they were previously handled as a full response. The
consequences of this issue are uncertain in 2.4 and newer, but on 2.2 and
older it could trigger a BUG_ON() condition and cause a panic.

About the H1:
  * Authority validation was improved to conform to RFC 3986 for non-CONNECT
methods. The validation was too strict and expected an exact match
between the authority and the host header value. Default ports are now
properly handled.

  * In addition, having an empty port in the authority for CONNECT requests
is now considered invalid and a 400 Bad Request is returned. For
other methods, empty ports in the authority are considered valid and are
handled as default ports.

About the FCGI:
  * The path-info subexpression was not properly handled due to an inverted
condition.

  * A major fix regarding uninitialized bytes in the FCGI mux was backported.
Before the fix, it could have leaked sensitive data to the backends.

About listeners:
  * Multiple races related to closed FDs were found and addressed (mostly
happening on reload, sometimes on resuming after an aborted reload).

About HTTP rules:
  * Make sure that the logged status matches the reported status even upon
errors and also after http-after-response

  * There was a parsing error reported for responses carrying a websocket
header when the status was not 101.

About the Master-Worker:
  * When trying to upgrade from a previous version with a reload instead of
a restart, a bug in the master-worker was preventing the reload and
stopping the whole process.

About other fixes:
  * A buffer realignment bug introduced in 1.9 was fixed. It's
uncertain whether it was possible to trigger it or not, but it could
possibly have been responsible for some rare unexplained corruptions.

  * JWT ECDSA signatures were not properly handled; this is now fixed.
However, another issue was just discovered after the release that may
still randomly trigger errors.

  * The maxconn automatic computation was fixed; its output value had not
been correct since the introduction of the httpclient SSL backend.

  * The haproxy_backend_agg_check_status metric for the Prometheus exporter
was backported.

  * A scheduling issue in the resolvers was preventing the resolution during
runtime.

  * A possible crash with the Lua HTTP-client during the cleanup stage was
fixed.

Thanks everyone for your help and your contributions !

Please find the usual URLs below :
   Site index   : https://www.haproxy.org/
   Documentation: https://docs.haproxy.org/
   Wiki : https://github.com/haproxy/wiki/wiki
   Discourse: https://discourse.haproxy.org/
   Slack channel: https://slack.haproxy.org/
   Issue tracker: https://github.com/haproxy/haproxy/issues
   Sources  : https://www.haproxy.org/download/2.5/src/
   Git repository   : https://git.haproxy.org/git/haproxy-2.5.git/
   Git Web browsing : https://git.haproxy.org/?p=haproxy-2.5.git
   Changelog: https://www.haproxy.org/download/2.5/src/CHANGELOG
   Dataplane API: 
https://github.com/haproxytech/dataplaneapi/releases/latest
   Pending bugs : https://www.haproxy.org/l/pending-bugs
   Reviewed bugs: https://www.haproxy.org/l/reviewed-bugs
   Code reports : https://www.haproxy.org/l/code-reports
   Latest builds: https://www.haproxy.org/l/dev-packages


---
Complete changelog :
Aurelien DARRAGON (3):
  REGTEST: fix the race conditions in json_query.vtc
  REGTEST: fix the race conditions in digest.vtc
  REGTEST: fix the race conditions in hmac.vtc

Cedric Paillet (2):
  BUG/MINOR: promex: create haproxy_backend_agg_server_status
  MINOR: promex: introduce haproxy_backend_agg_check_status

Christopher Faulet (25):
  BUG/MINOR: http-htx: Don't consider an URI as normalized after a set-uri 
action
  BUG/MEDIIM: stconn: Flush output data before forwarding close to write 
side
  Revert "CI: switch to the "latest" LibreSSL"
  Revert "CI: enable QUIC for LibreSSL builds"
  Revert "CI: determine actual OpenSSL version dynamically"
  DOC: promex: Add missing backend metrics
  REGTESTS: fix the race conditions in iff.vtc
  BUG/MEDIUM: resolvers: 

[ANNOUNCE] haproxy-2.6.8

2023-01-24 Thread Christopher Faulet

Hi,

HAProxy 2.6.8 was released on 2023/01/23. It added 94 new commits
after version 2.6.7.

The delayed fix for the "set-uri" HTTP action that was not included in
2.6.7 was finally backported and shipped with this release. The behavior
of this action is no longer the same: the action had been bogus for a
while, was not working as documented, and used to make HTTP/1 and HTTP/2
produce different outputs. The URI is now sent to the H1 server exactly
as set by the action.

Besides that, this release includes its usual batch of fixes.

About the H3/QUIC:
  * a double-delete could happen in a list, causing memory corruption and
crashes.

  * Empty HTTP responses are now properly transferred. This case is pretty
rare: it only happens with HTTP/0.9 responses carrying no payload.

  * Remote unidirectional stream closures are now ignored and no longer
trigger aborts.

  * Invalid request header or pseudo-header names are now rejected. For now,
this triggers a connection close, but in the future it should be handled
with a stream reset. The same is done for messages with an
announced content-length that does not match the total size of the DATA
frames.

  * The cookie header parsing was fixed.

About the H2:
  * Interim responses that carry the end-of-stream flag are now rejected as
invalid, whereas they were previously handled as a full response. The
consequences of this issue are uncertain in 2.4 and newer, but on 2.2
and older it could trigger a BUG_ON() condition and cause a panic.

  * Logs were not emitted for invalid requests that were blocked due to
forbidden headers or syntax, which made it complicated to debug errors
reported by clients. They will now be emitted, and traces will also
reflect this.

About the H1:
  * Authority validation was improved to conform to RFC 3986 for non-CONNECT
methods. The validation was too strict and expected an exact match
between the authority and the host header value. Default ports are now
properly handled.

  * In addition, having an empty port in the authority for CONNECT requests
is now considered invalid and a 400 Bad Request is returned. For
other methods, empty ports in the authority are considered valid and are
handled as default ports.

About the FCGI:
  * The path-info subexpression was not properly handled due to an inverted
condition.

  * A major fix regarding uninitialized bytes in the FCGI mux was backported.
Before the fix, it could have leaked sensitive data to the backends.

About listeners:
  * Multiple races related to closed FDs were found and addressed (mostly
happening on reload, sometimes on resuming after an aborted reload).

About HTTP rules:
  * Make sure that the logged status matches the reported status even upon
errors and also after http-after-response

  * There was a small leak per request when using the "originalto" option,
and another leak (per config entry) for "http-request redirect" lines.

  * There was a parsing error reported for responses carrying a websocket
header when the status was not 101.

About the Master-Worker:
  * When trying to upgrade from a previous version with a reload instead of
a restart, a bug in the master-worker was preventing the reload and
stopping the whole process.

About other fixes:
  * A buffer realignment bug introduced in 1.9 was fixed. It's
uncertain whether it was possible to trigger it or not, but it could
possibly have been responsible for some rare unexplained corruptions.

  * JWT ECDSA signatures were not properly handled; this is now fixed.
However, another issue was just discovered after the release that may
still randomly trigger errors.

  * Some fixes on the stats output were backported. One of them is about the
json output. This output type is a lot more verbose and was starting to
reach the default buffer size limit, leading to truncated responses.
This is no longer an issue.

  * The maxconn automatic computation was fixed; its output value had not
been correct since the introduction of the httpclient SSL backend.

  * The haproxy_backend_agg_check_status metric for the Prometheus exporter
was backported.

  * A scheduling issue in the resolvers was preventing the resolution during
runtime.

  * A possible crash with the Lua HTTP-client during the cleanup stage was
fixed.

Thanks everyone for your help and your contributions !

Please find the usual URLs below :
   Site index   : https://www.haproxy.org/
   Documentation: https://docs.haproxy.org/
   Wiki : https://github.com/haproxy/wiki/wiki
   Discourse: https://discourse.haproxy.org/
   Slack channel: https://slack.haproxy.org/
   Issue tracker: https://github.com/haproxy/haproxy/issues
   Sources  : https://www.haproxy.org/download/2.6/src/
   Git repository   : https://git.haproxy.org/git/haproxy-2.6.git/
   Git Web browsing :