Re: Support arbitrary PROXY protocol v2 TLVs as samples
Hi Johannes, On Wed, Jan 18, 2023 at 10:49:18AM +, Bitsch, Johannes (external - Project) wrote: > Hi again, > > I checked my patch file from a few weeks ago using the recommended > checkpatch.pl [1] and realized that the indentation was off as well as some > other small things. > To make this easier to review, I fixed all the issues mentioned by checkpatch > (except for editing MAINTAINERS, I don't think we're quite there yet). The > implementation was not changed. Apologies for not checking this earlier. No problem, thanks, I'll have a look hopefully this week. Do not hesitate to ping again if you don't see a comment, as constantly switching between plenty of issues and reviews makes it easy to just forget some of them. Thanks, Willy
Re: Theoretical limits for a HAProxy instance
Hi Iago, On Tue, Jan 24, 2023 at 04:45:54PM +0100, Iago Alonso wrote: > We are happy to report that after downgrading to OpenSSL 1.1.1s (from > 3.0.7), our performance problems are solved, and now it looks like > HAProxy scales linearly with the available resources. Excellent, thanks for this nice feedback! And welcome to the constantly growing list of users condemned to use 1.1.1 forever :-/ > For reference, in a synthetic load test with a request payload of 2k, > and a 32-core server (128GB RAM) with 10Gb bandwidth available, we are > able to sustain 1.83M conn, 82K rps, and a sslrate of 22k, with a load > average of about 31 (idle time percent was about 6) These look like excellent numbers, which should hopefully give you plenty of headroom. Thanks for your feedback! Willy
Re: HAProxy performance on OpenBSD
On Wed, Jan 25, 2023 at 12:04:14AM +0100, Olivier Houchard wrote: > > 0x0af892c770b0 : mov %r12,%rdi > > 0x0af892c770b3 : callq 0xaf892c24e40 > > > > 0x0af892c770b8 : mov %rax,%r12 > > 0x0af892c770bb : test %rax,%rax > > 0x0af892c770be : je 0xaf892c770e5 > > > > 0x0af892c770c0 : mov 0x18(%r12),%rax > > =>0x0af892c770c5 : mov 0xa0(%rax),%r11 > > 0x0af892c770cc : test %r11,%r11 > > 0x0af892c770cf : je 0xaf892c770b0 > > > > 0x0af892c770d1 : mov %r12,%rdi > > 0x0af892c770d4 : mov %r13d,%esi > > > > 1229 conn = > > srv_lookup_conn(&srv->per_thr[i].safe_conns, hash); > > 1230 while (conn) { > > 1231==>>> if (conn->mux->takeover && > > conn->mux->takeover(conn, i) == 0) { > > ^^^ > > > > It's here. %rax==0, which is conn->mux when we're trying to dereference > > it to retrieve ->takeover. I can't see how it's possible to have a NULL > > mux here in an idle connection since they're placed there by the muxes > > themselves. But maybe there's a tiny race somewhere that's impossible to > > reach except under such an extreme contention between the two sockets. > > I really have no idea regarding this one for now, maybe Olivier could > > spot a race here ? Maybe we're missing a barrier somewhere. > > One way I can see that happening, as unlikely as it may be, is if > h1_takeover() is called, but fails to allocate a task or a tasklet, and > will call h1_release(), which will set the mux to NULL and call > conn_free(). :-) thanks for the analysis! I would find it strange that the task allocation fails, it would indicate the machine is lacking RAM. Or maybe, in order to avoid extreme lock contention in malloc() on this platform when running on non-uniform machines, they decided to use trylocks that can simply fail to return memory instead of waiting forever. In this case we should be able to reproduce the same issue by setting the fail-alloc fault injection rate to a non-null value.
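[The fault injection Willy refers to is, as far as I can tell, the `tune.fail-alloc` global setting, which only takes effect in debug builds; a minimal sketch, assuming haproxy was built with DEBUG_FAIL_ALLOC=1 and an illustrative ratio:]

```
global
    # fail roughly 2% of pool allocations, so that error paths such as
    # the failed task allocation in h1_takeover() get exercised
    # (ratio value is illustrative; requires a DEBUG_FAIL_ALLOC build)
    tune.fail-alloc 2
```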
> It was previously ok to happen, because conn_free() would > unconditionally remove the connection from any list, but since the idle > connections now stay in a tree, and no longer in a list, that no longer > happens. I'm not sure why this has to be different, we could as well decide to unconditionally remove it from the tree. I'll have a look and try to figure out why. > Another possibility of h1_release() being called while still in the > idle tree is if h1_wake() gets called, but the only way I see that > happening is from mux_stopping_process(), whatever that is, so that's > unlikely to be the problem in this case. Indeed. It's still good to keep this in mind though. > Anyway, I suggest doing something along the line of the patch attached, > and add a BUG_ON() in conn_free() to catch any freeing of connection > still in the idle tree. I agree, better crash where the problem is caused than where it hurts. I'll take your patch, thanks! Willy
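[The unconditional removal Willy mentions would amount to something like the following fragment in conn_free() (a pseudocode sketch reusing the field names visible in Olivier's patch; not a committed change):]

```
/* hypothetical alternative to a BUG_ON(): detach the connection
 * from the idle tree unconditionally before freeing it, mirroring
 * what conn_free() used to do when idle connections lived in lists */
if (conn->hash_node && conn->hash_node->node.node.leaf_p)
        eb64_delete(&conn->hash_node->node);
```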
Re: HAProxy performance on OpenBSD
On Tue, Jan 24, 2023 at 11:59:16PM -0600, Marc West wrote: > On 2023-01-24 23:04:14, Olivier Houchard wrote: > > On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote: > > > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote: > > > > > Stupid question but I prefer to ask in order to be certain, are all of > > > > > these 32 threads located on the same physical CPU ? I just want to be > > > > > sure that locks (kernel or user) are not traveling between multiple > > > > > CPU > > > > > sockets, as that ruins scalability. > > > > > > > > Very good observation. This is a 2 socket system, 20 cores per socket, > > > > and since there is no affinity in OpenBSD unfortunately the 32 threads > > > > are not on the same physical CPU. > > > > > > Ah, then that's particularly bad. When you're saying "there is no > > > affinity", do you mean the OS doesn't implement anything or only that > > > haproxy doesn't support it ? Could you please check if you have man > > > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former > > > is the FreeBSD version and the second the Linux version. > > > > Unfortunately, having a quick look at OpenBSD's sources, I don't think > > they provide anything that could look like that. In fact, I'm not sure > > there is any way to restrict a process to a specific CPU. > > Unfortunately that is what I found also, no references to affinity in > manpages nor finding any tools like taskset or NUMA related knobs. > It seems to be a design decision. > > https://marc.info/?l=openbsd-misc&m=152507006602422&w=2 > https://marc.info/?l=openbsd-tech&m=133909957708933&w=2 > https://marc.info/?l=openbsd-misc&m=135884346431916&w=2 Then it's really pathetic, because implementing multi-CPU support without being able to pin tasks to CPUs is counter-productive.
I know that their focus is primarily on code simplicity and auditability but they already lost half of it by supporting SMP, and not doing it right only makes the situation worse, and even opens the way to trivial DoS attacks on running processes. Sadly the response to that message from Ilya in the last comment you pointed above says it all: "The scheduler will know what CPUs are busy and which ones are not and will make appropriate decisions." That's of course an indication of total ignorance of how hardware works, since the right CPU to use is not directly related to the ones that are busy or not, but first and foremost to the cost of the communications with other threads/processes/devices that is related to the distance between them, and the impacts of cache line migration. That's precisely why operating systems supporting multi-processing all offer the application the ability to set its affinity based on what it knows about its workload. And seeing that it was already required 10 years ago and still not addressed just shows a total lack of interest for multicore machines. > I thought there might be a way to enable just the CPUs that are on the > first package via the kernel config mechanism but attempting that > resulted in an immediate reboot. It looks like some kernel hacking would > be needed to do this. There are no options in the BIOS to disable one of > the sockets unfortunately and I don't have any single-socket machines to > test with (aside from a laptop). Yeah that's annoying. And you're not necessarily interested in running that OS inside a virtual machine managed by a more capable operating system. > Given that I think we are going to have to plan to switch OS to FreeBSD > for now. Yeah in your case I think that's the best plan. In addition, there's a large community using haproxy on FreeBSD, and it's generally agreed that it's the second operating system working the best with haproxy.
Like Linux it remains a modern and featureful OS with a powerful network stack, and I think you shouldn't have any issues seeking help about it if needed. > OpenBSD is a great OS and a joy to work with but for this > particular use case we do need to handle much more HTTPS traffic in > production. I agree. I've used it for a long time and really loved its light weight, simplicity and reliability. It has been working flawlessly for something like 10 years on my VAX with 24 MB RAM. I had to stop due to disks failing repeatedly. But while I think that running it on a Geode-LX based firewall or VPN can still make sense nowadays for an ADSL line, I also think it's about time to abandon designs stuck in the 90s that try too hard to run correctly on hardware designed in 2020+ with totally different concepts. As a rule of thumb it should probably not be installed on a machine which possesses a fan. > Thanks again for all the replies and I will have this hardware available > for the foreseeable future in case there is any more testing that would > be helpful. You're welcome, and thanks for sharing your experience, it's always useful to many others! Willy
Re: HAProxy performance on OpenBSD
On 2023-01-24 23:04:14, Olivier Houchard wrote: > On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote: > > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote: > > > > Stupid question but I prefer to ask in order to be certain, are all of > > > > these 32 threads located on the same physical CPU ? I just want to be > > > > sure that locks (kernel or user) are not traveling between multiple CPU > > > > sockets, as that ruins scalability. > > > > > > Very good observation. This is a 2 socket system, 20 cores per socket, > > > and since there is no affinity in OpenBSD unfortunately the 32 threads > > > are not on the same physical CPU. > > > > Ah, then that's particularly bad. When you're saying "there is no > > affinity", do you mean the OS doesn't implement anything or only that > > haproxy doesn't support it ? Could you please check if you have man > > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former > > is the FreeBSD version and the second the Linux version. > Unfortunately, having a quick look at OpenBSD's sources, I don't think > they provide anything that could look like that. In fact, I'm not sure > there is any way to restrict a process to a specific CPU. Unfortunately that is what I found also, no references to affinity in manpages nor finding any tools like taskset or NUMA related knobs. It seems to be a design decision. https://marc.info/?l=openbsd-misc&m=152507006602422&w=2 https://marc.info/?l=openbsd-tech&m=133909957708933&w=2 https://marc.info/?l=openbsd-misc&m=135884346431916&w=2 I thought there might be a way to enable just the CPUs that are on the first package via the kernel config mechanism but attempting that resulted in an immediate reboot. It looks like some kernel hacking would be needed to do this. There are no options in the BIOS to disable one of the sockets unfortunately and I don't have any single-socket machines to test with (aside from a laptop). 
Given that I think we are going to have to plan to switch OS to FreeBSD for now. OpenBSD is a great OS and a joy to work with but for this particular use case we do need to handle much more HTTPS traffic in production. Thanks again for all the replies and I will have this hardware available for the foreseeable future in case there is any more testing that would be helpful.
Re: HAProxy performance on OpenBSD
On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote: > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote: > > > Stupid question but I prefer to ask in order to be certain, are all of > > > these 32 threads located on the same physical CPU ? I just want to be > > > sure that locks (kernel or user) are not traveling between multiple CPU > > > sockets, as that ruins scalability. > > > > Very good observation. This is a 2 socket system, 20 cores per socket, > > and since there is no affinity in OpenBSD unfortunately the 32 threads > > are not on the same physical CPU. > > Ah, then that's particularly bad. When you're saying "there is no > affinity", do you mean the OS doesn't implement anything or only that > haproxy doesn't support it ? Could you please check if you have man > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former > is the FreeBSD version and the second the Linux version. Unfortunately, having a quick look at OpenBSD's sources, I don't think they provide anything that could look like that. In fact, I'm not sure there is any way to restrict a process to a specific CPU. [...] > Oh thank you very much. From what I'm seeing we're here: > > 0x0af892c770b0 : mov %r12,%rdi > 0x0af892c770b3 : callq 0xaf892c24e40 > > 0x0af892c770b8 : mov %rax,%r12 > 0x0af892c770bb : test %rax,%rax > 0x0af892c770be : je 0xaf892c770e5 > > 0x0af892c770c0 : mov 0x18(%r12),%rax > =>0x0af892c770c5 : mov 0xa0(%rax),%r11 > 0x0af892c770cc : test %r11,%r11 > 0x0af892c770cf : je 0xaf892c770b0 > > 0x0af892c770d1 : mov %r12,%rdi > 0x0af892c770d4 : mov %r13d,%esi > > 1229 conn = > srv_lookup_conn(&srv->per_thr[i].safe_conns, hash); > 1230 while (conn) { > 1231==>>> if (conn->mux->takeover && > conn->mux->takeover(conn, i) == 0) { > ^^^ > > It's here. %rax==0, which is conn->mux when we're trying to dereference > it to retrieve ->takeover. I can't see how it's possible to have a NULL > mux here in an idle connection since they're placed there by the muxes > themselves. 
But maybe there's a tiny race somewhere that's impossible to > reach except under such an extreme contention between the two sockets. > I really have no idea regarding this one for now, maybe Olivier could > spot a race here ? Maybe we're missing a barrier somewhere. One way I can see that happening, as unlikely as it may be, is if h1_takeover() is called, but fails to allocate a task or a tasklet, and will call h1_release(), which will set the mux to NULL and call conn_free(). It was previously ok to happen, because conn_free() would unconditionally remove the connection from any list, but since the idle connections now stay in a tree, and no longer in a list, that no longer happens. Another possibility of h1_release() being called while still in the idle tree is if h1_wake() gets called, but the only way I see that happening is from mux_stopping_process(), whatever that is, so that's unlikely to be the problem in this case. Anyway, I suggest doing something along the line of the patch attached, and add a BUG_ON() in conn_free() to catch any freeing of connection still in the idle tree. Olivier From 2a0ac4b84b97aa05cd7befc5fcf45b03795f2e76 Mon Sep 17 00:00:00 2001 From: Olivier Houchard Date: Tue, 24 Jan 2023 23:59:32 +0100 Subject: [PATCH] MINOR: Add a BUG_ON() to detect destroying connection in idle list Add a BUG_ON() in conn_free(), to check that when we're freeing a connection, it is not still in the idle connections tree, otherwise the next thread that will try to use it will probably crash. 
--- src/connection.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/connection.c b/src/connection.c index 4a73dbcc8..97619ec26 100644 --- a/src/connection.c +++ b/src/connection.c @@ -498,6 +498,10 @@ void conn_free(struct connection *conn) pool_free(pool_head_uniqueid, istptr(conn->proxy_unique_id)); conn->proxy_unique_id = IST_NULL; + /* Make sure the connection is not left in the idle connection tree */ + if (conn->hash_node != NULL) + BUG_ON(conn->hash_node->node.node.leaf_p != NULL); + pool_free(pool_head_conn_hash_node, conn->hash_node); conn->hash_node = NULL; -- 2.36.1
Re: HAProxy performance on OpenBSD
On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote: > > Stupid question but I prefer to ask in order to be certain, are all of > > these 32 threads located on the same physical CPU ? I just want to be > > sure that locks (kernel or user) are not traveling between multiple CPU > > sockets, as that ruins scalability. > > Very good observation. This is a 2 socket system, 20 cores per socket, > and since there is no affinity in OpenBSD unfortunately the 32 threads > are not on the same physical CPU. Ah, then that's particularly bad. When you're saying "there is no affinity", do you mean the OS doesn't implement anything or only that haproxy doesn't support it ? Could you please check if you have man pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former is the FreeBSD version and the second the Linux version. > cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on that > the odd cpus are socket 1 and evens are socket 2. Watching threads in > top with a 1 sec interval I see them bouncing around a lot between > sockets. Yes that must be horrible. In fact, the worst possible scenario! > > It's great already to be able to rule out that one. Another useful > > test you could run is to place a reject rule at the connection level > > in your frontend (it will happen before SSL tries to process traffic): > > > >tcp-request connection reject > > > > (don't do that in production of course). Once this is installed you > > can compare again the CPU usage with no-thread, and 32-thread without > > shards and 32 threads with 8 shards. 
> > This is HTTPS with the reject rule: > > - no nbthread, no shards: > -- 40 CPUs: 0.3% user, 0.0% nice, 0.7% sys, 0.1% spin, 0.1% intr, 98.9% > idle > -- haproxy peaked at 20% CPU > > - nbthread 32, no shards: > -- 40 CPUs: 0.5% user, 0.0% nice, 19.9% sys, 4.6% spin, 0.1% intr, 75.0% > idle > -- haproxy peaked at 863% CPU > > - nbthread 32, 8 shards: > -- 40 CPUs: 0.4% user, 0.0% nice, 1.9% sys, 0.1% spin, 0.1% intr, 97.4% > idle > -- haproxy peaked at 62% CPU > -- this only used 5 threads according to top FYI OK, let's basically say it doesn't work very well... At least it makes sense now. > Since the SSL stack has been mentioned a couple times I retested with > HTTP instead of HTTPS and the results are interesting: > > - HTTP, no nbthread, no shards > -- 40 CPUs: 0.6% user, 0.0% nice, 2.0% sys, 0.2% spin, 0.2% intr, 97.0% > idle > -- haproxy peaked at 54% CPU > -- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate > 403.512 Mbps, Running tasks: 0/922; idle = 28 % > -- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout > test. 100% success 0 fail > > - HTTP, nbthread 32, no shards > -- 40 CPUs: 0.7% user, 0.0% nice, 29.6% sys, 0.2% spin, 0.5% intr, 69.1% > idle > -- haproxy peaked at 800% CPU > -- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate > 410.711 Mbps, Running tasks: 28/967; idle = 64 % > -- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% > success 0 fail > > - HTTP, nbthread 32, 8 shards > -- 40 CPUs: 0.7% user, 0.0% nice, 5.0% sys, 0.2% spin, 0.2% intr, 94.0% > idle > -- haproxy peaked at 147% CPUs > -- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate > 401.946 Mbps, Running tasks: 0/746; idle = 93 % > -- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 
100% > success 0 fail > > Here we still see sys% growing with more threads but even with the waste, > with HTTP we get a reliable 7-8k responses/sec and 100% success rate > instead of 400-500/sec for a few second burst and then a total stall / > nearly 0% success rate with HTTPS. I expect SSL to have some cost but > not quite this huge, and the "stall" of traffic/health checks under > heavy HTTPS load is a bit puzzling. It's important to split the problems. Here clearly the test is limited by the client to ~7.3k/s, but it's still sufficient to exacerbate the threading issue with 32 threads and no shards. The SSL part is a second problem, but since openssl uses a lot of locks, it could be a bigger victim of the apparently poor thread implementation, explaining why it doesn't scale at all. > Do you think it would be worth trying to install OpenSSL 3.0.7 from > ports and manually build haproxy against that to compare with the > current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else? No, OpenSSL 3 is much much worse. It's blatant that its performance was never tested outside of "openssl speed" before being released, because its extreme abuse of locks makes it totally unusable for any usage in a network-connected daemon :-( LibreSSL 3.6 could work, but it could as well be a victim of the threading problem. Don't you have at least an equivalent of the linux taskset utility on OpenBSD, to force all your threads on the same physical package ? Or maybe you'll find some NUMA tools ? Or in the worst case,
Re: HAProxy performance on OpenBSD
On 2023-01-24 06:58:57, Willy Tarreau wrote: > Hi Marc, Hi Willy, > See the difference ? There seems to be an insane FD locking cost on this > system that simply wastes 40% of the CPU there. So I suspect that in your > first tests you were stressing the locking while in the last ones you > were stressing the SSL stack. > > Stupid question but I prefer to ask in order to be certain, are all of > these 32 threads located on the same physical CPU ? I just want to be > sure that locks (kernel or user) are not traveling between multiple CPU > sockets, as that ruins scalability. Very good observation. This is a 2 socket system, 20 cores per socket, and since there is no affinity in OpenBSD unfortunately the 32 threads are not on the same physical CPU. cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on that the odd cpus are socket 1 and evens are socket 2. Watching threads in top with a 1 sec interval I see them bouncing around a lot between sockets. > It's great already to be able to rule out that one. Another useful > test you could run is to place a reject rule at the connection level > in your frontend (it will happen before SSL tries to process traffic): > >tcp-request connection reject > > (don't do that in production of course). Once this is installed you > can compare again the CPU usage with no-thread, and 32-thread without > shards and 32 threads with 8 shards. 
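[For context, the quoted suggestion corresponds to a frontend along these lines (a sketch; the bind address, certificate path and `shards` count are illustrative, not Marc's actual configuration):]

```
frontend fe_https
    # 'shards 8' creates 8 listening sockets, each served by a subset
    # of the threads, as discussed in this thread
    bind :443 ssl crt /etc/haproxy/site.pem shards 8
    # benchmark-only: refuse every connection at accept time,
    # before the SSL stack ever runs -- never do this in production
    tcp-request connection reject
```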
This is HTTPS with the reject rule: - no nbthread, no shards: -- 40 CPUs: 0.3% user, 0.0% nice, 0.7% sys, 0.1% spin, 0.1% intr, 98.9% idle -- haproxy peaked at 20% CPU - nbthread 32, no shards: -- 40 CPUs: 0.5% user, 0.0% nice, 19.9% sys, 4.6% spin, 0.1% intr, 75.0% idle -- haproxy peaked at 863% CPU - nbthread 32, 8 shards: -- 40 CPUs: 0.4% user, 0.0% nice, 1.9% sys, 0.1% spin, 0.1% intr, 97.4% idle -- haproxy peaked at 62% CPU -- this only used 5 threads according to top FYI Since the SSL stack has been mentioned a couple times I retested with HTTP instead of HTTPS and the results are interesting: - HTTP, no nbthread, no shards -- 40 CPUs: 0.6% user, 0.0% nice, 2.0% sys, 0.2% spin, 0.2% intr, 97.0% idle -- haproxy peaked at 54% CPU -- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate 403.512 Mbps, Running tasks: 0/922; idle = 28 % -- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout test. 100% success 0 fail - HTTP, nbthread 32, no shards -- 40 CPUs: 0.7% user, 0.0% nice, 29.6% sys, 0.2% spin, 0.5% intr, 69.1% idle -- haproxy peaked at 800% CPU -- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate 410.711 Mbps, Running tasks: 28/967; idle = 64 % -- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% success 0 fail - HTTP, nbthread 32, 8 shards -- 40 CPUs: 0.7% user, 0.0% nice, 5.0% sys, 0.2% spin, 0.2% intr, 94.0% idle -- haproxy peaked at 147% CPUs -- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate 401.946 Mbps, Running tasks: 0/746; idle = 93 % -- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 100% success 0 fail Here we still see sys% growing with more threads but even with the waste, with HTTP we get a reliable 7-8k responses/sec and 100% success rate instead of 400-500/sec for a few second burst and then a total stall / nearly 0% success rate with HTTPS. 
I expect SSL to have some cost but not quite this huge, and the "stall" of traffic/health checks under heavy HTTPS load is a bit puzzling. Do you think it would be worth trying to install OpenSSL 3.0.7 from ports and manually build haproxy against that to compare with the current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else? > Ah, great! Do you have any info on the signal that was received there ? > If you still have the core, issuing "info registers", then "disassemble > conn_backend_get" and pressing enter till the end of the function could > be useful to try to locate what's happening there. It was signal 11 and yes I do still have the core, attached! GNU gdb 6.3 Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-unknown-openbsd7.2"... Core was generated by `haproxy'. Program terminated with signal 11, Segmentation fault. Reading symbols from /usr/lib/libpthread.so.26.2...done. Loaded symbols for /usr/lib/libpthread.so.26.2 Loaded symbols for /usr/local/sbin/haproxy Reading symbols from /usr/lib/libz.so.7.0...done. Loaded symbols for /usr/lib/libz.so.7.0 Symbols already loaded for /usr/lib/libpthread.so.26.2 Reading symbols from /usr/lib/libssl.so.53.0...done. Loaded symbols for /usr/lib/libssl.so.53.0 Reading symbols from /usr/lib/libcrypto.so.50.0...done. Loaded
Re: Theoretical limits for a HAProxy instance
We are happy to report that after downgrading to OpenSSL 1.1.1s (from 3.0.7), our performance problems are solved, and now it looks like HAProxy scales linearly with the available resources. For reference, in a synthetic load test with a request payload of 2k, and a 32-core server (128GB RAM) with 10Gb bandwidth available, we are able to sustain 1.83M conn, 82K rps, and a sslrate of 22k, with a load average of about 31 (idle time percent was about 6) On Wed, Dec 21, 2022 at 5:39 PM Iago Alonso wrote: > > > Interesting, so you have conntrack enabled. With 5M conns, there's no > > reason to fill your table. However, have you checked your servers' kernel > > logs to see if you find any "conntrack table full" message, that might > > be caused by too fast connection recycling ? > We can't see any message mentioning conntrack. > > > With that said, excessive use of rwlock indicates that openssl is > > very likely there, because that's about the only caller in the process. > Indeed, we tried compiling HAProxy 2.6.6 against OpenSSL 1.1.1s and the > > `perf top` results look much better. We can't recall why we decided > > to use OpenSSL v3, but it seems like we have to stick to v1. > > 2.61% haproxy [.] sha512_block_data_order_avx2 > 1.89% haproxy [.] fe_mul > 1.64% haproxy [.] x25519_fe64_sqr > 1.44% libc.so.6 [.] malloc > 1.39% [kernel] [k] __nf_conntrack_find_get > 1.21% libc.so.6 [.] cfree > 0.95% [kernel] [k] > native_queued_spin_lock_slowpath.part.0 > 0.94% haproxy [.] OPENSSL_cleanse > 0.78% libc.so.6 [.] 0x000a3b46 > 0.78% [kernel] [k] do_epoll_ctl > 0.75% [kernel] [k] ___slab_alloc > 0.73% [kernel] [k] psi_group_change > > For reference, with a 12 core machine, we ran a test with 420k connections, > > 32k rps, and a sustained ssl rate of 5k, and the load1 was about 9, and > > the idle was about 30. > > And as a side note, we didn't have http/2 enabled, and after doing > > so, we ran the same test, and the idle was about 55, and the load1 was > about 6. > > > OK. 
Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few > > fixes, though these should not impact your tests and 2.7.0 will be OK). > > And please check whether USE_PTHREAD_EMULATION=1 improves your situation. > > If so it will definitely confirm that you're endlessly waiting for some > > threads stuck in SSL waiting on a mutex. > We've just tried 2.7.1 with USE_PTHREAD_EMULATION=1 (OpenSSL 3.0.7), and > > the results are pretty much the same, no noticeable improvement. > > > Hmmm not good. Check if you have any "-m state --state INVALID -j DROP" > > rule in your iptables whose counter increases during the tests. It might > > be possible that certain outgoing connections are rejected due to fast > > port recycling if they were closed by an RST that was lost before hitting > > conntrack. Another possibility here is that the TCP connection was > > established but that the SSL handshake didn't complete in time. This > > could be caused precisely by the openssl 3 locking issues I was speaking > > about, but I guess not at such numbers, and you'd very likely see > > the native_queued_spin_lock_slowpath thing in perf top. > We do have a `-A ufw-before-input -m conntrack --ctstate INVALID -j DROP` > > but the counter does not increase during the tests. Looks like OpenSSL > > v3 is the issue. > > > On 16/12/22 18:11, Willy Tarreau wrote: > > On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote: > >> Hi, > >> > >>> Ah that's pretty useful :-) It's very likely dealing with the handshake. > >>> Could you please run "perf top" on this machine and list the 10 top-most > >>> lines ? I'm interested in seeing if you're saturating on crypto functions > >>> or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would > >>> indicate an abuse of mutexes. > >> We are load testing with k6 (https://k6.io/), in a server with an > >> AMD Ryzen 5 3600 6-Core Processor (12 threads). 
We ran a test > >> with 300k connections, 15k rps, and a sslrate and connrate of about 4k. > >> The server had a `load1` of 12, the `node_nf_conntrack_entries` were about > >> 450k. The `haproxy_server_connect_time_average_seconds` was between > >> 0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was > >> always about 0.2s lower. > >> > >> We have these custom kernel parameters set: > >> net.nf_conntrack_max=5000000 > >> fs.nr_open=5000000 > >> net.ipv4.ip_local_port_range=12768 60999 > > Interesting, so you have conntrack enabled. With 5M conns, there's no > > reason to fill your table. However, have you checked your servers' kernel > > logs to see if you find any "conntrack table full" message, that might > > be caused by too fast connection
[ANNOUNCE] haproxy-2.5.11
Hi, HAProxy 2.5.11 was released on 2023/01/24. It added 65 new commits after version 2.5.10. As for 2.6.8, this release includes the fix for the "set-uri" HTTP action. This fix was delayed for 2.5.10 and is now shipped with 2.5.11. The behavior of this action is no longer the same. This action had been bogus for a while: it was not working as documented, and used to make HTTP/1 and HTTP/2 produce different outputs. The URI is now sent to the H1 server exactly as set by the action. Otherwise, for other fixes, this release is pretty similar to the 2.6.8, excluding QUIC/H2 fixes: About the H2: * Interim responses that carry the end-of-stream flag are now rejected as invalid, while they were previously handled as a full response. The consequences of this issue are uncertain in 2.4 and newer, but on 2.2 and older it could trigger a BUG_ON() condition and cause a panic. About the H1: * Authority validation was improved to conform to RFC 3986 for non-CONNECT methods. The validation was too strict and expected an exact match between the authority and the host header value. Default ports are now properly handled. * In addition, having an empty port in the authority of CONNECT requests is now considered invalid and a 400-Bad-request is returned. For other methods, empty ports in the authority are considered valid and are handled as default ports. About the FCGI: * The path-info subexpression was not properly handled due to an inverted condition. * A major fix regarding uninitialized bytes in the FCGI mux was backported. It could have leaked sensitive data to the backends before the fix. About listeners: * Multiple races were found and addressed related to closed FDs (mostly happening on reload, sometimes on resuming after an aborted reload). About HTTP rules: * Make sure that the logged status matches the reported status even upon errors and also after http-after-response. * There was a parsing error reported for responses carrying a websocket header when the status was not 101. 
About the Master-Worker:
  * When trying to upgrade from a previous version with a reload instead of a restart, a bug in the master-worker was preventing the reload and was stopping the whole process.

About other fixes:
  * A buffer realignment bug introduced in 1.9 was fixed. It's uncertain whether it was possible to trigger it or not, but it could possibly have been responsible for some rare unexplained corruptions.
  * JWT ECDSA signatures were not properly handled; this is now fixed. However, another issue was discovered just after the release that may still randomly trigger errors.
  * The automatic maxconn computation was fixed; its output value was no longer correct since the introduction of the httpclient SSL backend.
  * The haproxy_backend_agg_check_status metric for the Prometheus exporter was backported.
  * A scheduling issue in the resolvers was preventing resolution at runtime.
  * A possible crash with the Lua HTTP client during the cleanup stage was fixed.

Thanks everyone for your help and your contributions!
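As an illustration of the "set-uri" change mentioned above, here is a minimal configuration sketch (frontend/backend names and the server address are hypothetical). Since the fix, the value produced by the action is forwarded to an HTTP/1 server exactly as set, identically for HTTP/1 and HTTP/2 clients:

```haproxy
frontend fe_main
    bind :8080
    default_backend be_app

backend be_app
    # Rewrite the request URI; the rewritten value is now sent
    # to the H1 server exactly as produced by this expression.
    http-request set-uri /app%[path]
    server s1 192.0.2.10:80
```

Note that, as stated in the fix itself, a URI rewritten this way is no longer considered normalized, since the action may set any arbitrary value.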
Please find the usual URLs below :
   Site index       : https://www.haproxy.org/
   Documentation    : https://docs.haproxy.org/
   Wiki             : https://github.com/haproxy/wiki/wiki
   Discourse        : https://discourse.haproxy.org/
   Slack channel    : https://slack.haproxy.org/
   Issue tracker    : https://github.com/haproxy/haproxy/issues
   Sources          : https://www.haproxy.org/download/2.5/src/
   Git repository   : https://git.haproxy.org/git/haproxy-2.5.git/
   Git Web browsing : https://git.haproxy.org/?p=haproxy-2.5.git
   Changelog        : https://www.haproxy.org/download/2.5/src/CHANGELOG
   Dataplane API    : https://github.com/haproxytech/dataplaneapi/releases/latest
   Pending bugs     : https://www.haproxy.org/l/pending-bugs
   Reviewed bugs    : https://www.haproxy.org/l/reviewed-bugs
   Code reports     : https://www.haproxy.org/l/code-reports
   Latest builds    : https://www.haproxy.org/l/dev-packages

---
Complete changelog :
Aurelien DARRAGON (3):
      REGTEST: fix the race conditions in json_query.vtc
      REGTEST: fix the race conditions in digest.vtc
      REGTEST: fix the race conditions in hmac.vtc

Cedric Paillet (2):
      BUG/MINOR: promex: create haproxy_backend_agg_server_status
      MINOR: promex: introduce haproxy_backend_agg_check_status

Christopher Faulet (25):
      BUG/MINOR: http-htx: Don't consider an URI as normalized after a set-uri action
      BUG/MEDIUM: stconn: Flush output data before forwarding close to write side
      Revert "CI: switch to the "latest" LibreSSL"
      Revert "CI: enable QUIC for LibreSSL builds"
      Revert "CI: determine actual OpenSSL version dynamically"
      DOC: promex: Add missing backend metrics
      REGTESTS: fix the race conditions in iff.vtc
      BUG/MEDIUM: resolvers:
[ANNOUNCE] haproxy-2.6.8
Hi,

HAProxy 2.6.8 was released on 2023/01/23. It added 94 new commits after version 2.6.7.

The delayed fix for the "set-uri" HTTP action that was not included in 2.6.7 was finally backported and shipped with this release. The behavior of this action is no longer the same: the action had been bogus for a while, was not working as documented, and used to make HTTP/1 and HTTP/2 produce different outputs. The URI is now sent to the H1 server exactly as set by the action.

Besides that, this release includes its usual batch of fixes.

About the H3/QUIC:
  * A double-delete could happen in a list, causing memory corruption and crashes.
  * Empty HTTP responses are now properly transferred. This is pretty rare; it only happens with HTTP/0.9 responses with no payload.
  * Remote unidirectional stream closures are now ignored and no longer trigger aborts.
  * Invalid request header or pseudo-header names are now rejected. For now, this triggers a connection close, but in the future it should be handled with a stream reset. The same is done for messages with an announced content-length that does not match the total size of the DATA frames.
  * The cookie header parsing was fixed.

About the H2:
  * Interim responses that carry the end-of-stream flag are now rejected as invalid, while they were previously handled as full responses. The consequences of this issue are uncertain in 2.4 and newer, but on 2.2 and older it could trigger a BUG_ON() condition and cause a panic.
  * Logs were not emitted for invalid requests blocked due to forbidden headers or syntax, which made it complicated to debug errors reported by clients. They are now emitted, and traces also reflect this.

About the H1:
  * Authority validation was improved to conform to RFC 3986 for non-CONNECT methods. The validation was too strict and expected an exact match between the authority and the Host header value. Default ports are now properly handled.
  * In addition, an empty port in the authority of a CONNECT request is now considered invalid and a 400 Bad Request is returned. For other methods, empty ports in the authority are considered valid and are handled as default ports.

About the FCGI:
  * The path-info subexpression was not properly handled due to an inverted condition.
  * A major fix regarding uninitialized bytes in the FCGI mux was backported. Before the fix, it could leak sensitive data to the backends.

About listeners:
  * Multiple races related to closed FDs were found and addressed (mostly happening on reload, sometimes on resuming after an aborted reload).

About HTTP rules:
  * Make sure that the logged status matches the reported status, even upon errors and also after http-after-response.
  * There was a small leak per request when using the "originalto" option, and another leak (per config entry) for "http-request redirect" lines.
  * A parsing error was wrongly reported for responses carrying a websocket header when the status was not 101.

About the Master-Worker:
  * When trying to upgrade from a previous version with a reload instead of a restart, a bug in the master-worker was preventing the reload and was stopping the whole process.

About other fixes:
  * A buffer realignment bug introduced in 1.9 was fixed. It's uncertain whether it was possible to trigger it or not, but it could possibly have been responsible for some rare unexplained corruptions.
  * JWT ECDSA signatures were not properly handled; this is now fixed. However, another issue was discovered just after the release that may still randomly trigger errors.
  * Some fixes on the stats output were backported. One of them concerns the JSON output: this output type is a lot more verbose and was starting to reach the default buffer size limit, leading to truncated responses. This is no longer an issue.
  * The automatic maxconn computation was fixed; its output value was no longer correct since the introduction of the httpclient SSL backend.
  * The haproxy_backend_agg_check_status metric for the Prometheus exporter was backported.
  * A scheduling issue in the resolvers was preventing resolution at runtime.
  * A possible crash with the Lua HTTP client during the cleanup stage was fixed.

Thanks everyone for your help and your contributions!

Please find the usual URLs below :
   Site index       : https://www.haproxy.org/
   Documentation    : https://docs.haproxy.org/
   Wiki             : https://github.com/haproxy/wiki/wiki
   Discourse        : https://discourse.haproxy.org/
   Slack channel    : https://slack.haproxy.org/
   Issue tracker    : https://github.com/haproxy/haproxy/issues
   Sources          : https://www.haproxy.org/download/2.6/src/
   Git repository   : https://git.haproxy.org/git/haproxy-2.6.git/
   Git Web browsing :