Re: Theoretical limits for a HAProxy instance
On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
> Hi,
>
> > Ah that's pretty useful :-) It's very likely dealing with the handshake.
> > Could you please run "perf top" on this machine and list the 10 top-most
> > lines ? I'm interested in seeing if you're saturating on crypto functions
> > or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> > indicate an abuse of mutexes.
>
> We are load testing with k6 (https://k6.io/), on a server with an
> AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test with 300k
> connections, 15k rps, and an sslrate and connrate of about 4k. The server
> had a `load1` of 12, and the `node_nf_conntrack_entries` were about 450k.
> The `haproxy_server_connect_time_average_seconds` was between 0.5s and 2s,
> and the `haproxy_backend_connect_time_average_seconds` was always about
> 0.2s lower.
>
> We have these custom kernel parameters set:
> net.nf_conntrack_max=5000000
> fs.nr_open=5000000
> net.ipv4.ip_local_port_range=12768 60999

Interesting, so you have conntrack enabled. With 5M conns, there's no reason
to fill your table. However, have you checked your servers' kernel logs to
see if you find any "conntrack table full" message, which might be caused by
too fast connection recycling ?

> At that load, we were starting to see just a small number of 5xx, and in
> the logs we could see that they were caused by connect timeouts.
>
> XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker
> 261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET
> / HTTP/1.1" 0//0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384

Hmmm, not good. Check whether you have any "-m state --state INVALID -j DROP"
rule in your iptables whose counter increases during the tests. It is
possible that certain outgoing connections are rejected due to fast port
recycling if they were closed by an RST that was lost before hitting
conntrack.
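As a quick reference, the conntrack and iptables checks suggested above could be run roughly like this (the exact log message and rule names vary by kernel and firewall setup, so treat these as a sketch):

```shell
# Hypothetical diagnostics for the conntrack issues discussed above.
# 1) Look for table-full drops in the kernel log (message text varies):
#      dmesg | grep -i 'nf_conntrack: table full'
# 2) Watch the INVALID-drop rule counters during a test run:
#      iptables -L -v -n | grep INVALID
# 3) Compare current conntrack usage against the configured maximum
#    (falls back to 0 if conntrack is not loaded on this machine):
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 0)
echo "conntrack usage: $count / $max"
```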
Another possibility here is that the TCP connection was established but that
the SSL handshake didn't complete in time. This could be caused precisely by
the openssl 3 locking issues I was speaking about, but I guess not at such
numbers, and you'd very likely see the native_queued_spin_lock_slowpath thing
in perf top.

> This is the output from `perf top`, after 40M samples, during the test
>
> 17.38%  libc.so.6  [.] pthread_rwlock_rdlock
> 10.25%  libc.so.6  [.] pthread_rwlock_unlock

With that said, such excessive use of rwlocks indicates that openssl is very
likely the culprit, because that's about the only caller in the process. To
counter this abuse, we've developed a dirty counter-measure that consists in
providing our own rwlock implementation as a replacement for the pthread one.
That's really not beautiful, but we have to deal with what we have. You can
enable it in 2.7 by adding USE_PTHREAD_EMULATION=1 to your "make" command
line. These locks are way less expensive but they will never sleep (no
syscall), so instead of idling when the machine is congested, you'll see
pl_wait_unlock_long() eating CPU in the "perf top" output. In our tests that
roughly doubles the performance under high congestion, but it's hard to be
more specific. In fact it mostly gets us our lock callbacks back, after they
were removed in openssl 1.1.0.

> Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
> and we tested all the possible combinations between HAProxy 2.5.3 and
> 2.6.6, compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a
> noticeable performance hit running tests similar to the previously
> mentioned one. We are planning to try again (next week) just stressing
> the sslrate so we can assess the performance hit a bit more.

OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
fixes, though these should not impact your tests and 2.7.0 will be OK). And
please check whether USE_PTHREAD_EMULATION=1 improves your situation.
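For reference, the build flag mentioned above would be added to an ordinary haproxy build command line something like this (the TARGET value and extra options are illustrative and depend on your platform and OpenSSL location):

```shell
# Illustrative haproxy 2.7 build with the pthread rwlock emulation enabled.
# TARGET and USE_* options depend on your environment; only
# USE_PTHREAD_EMULATION=1 is the flag under discussion here.
make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_PTHREAD_EMULATION=1
```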
If so it will definitely confirm that you're endlessly waiting for some
threads stuck in SSL, waiting on a mutex.

I would really like us to distinguish L4 from L6 in the logs output, to
indicate whether the connection was established at L4 with SSL not
completing, or whether it's blocking at L4 itself. The tricky part is that
it requires adding new values for the termination flags, and everyone got
very used to the current ones... But I think we'll have to do it anyway, at
least to sort out such issues.

Regards,
Willy
Re: Followup on openssl 3.0 note seen in another thread
On Fri, Dec 16, 2022 at 06:58:33AM -0700, Shawn Heisey wrote:
> On 12/16/22 01:59, Shawn Heisey wrote:
> > On 12/16/22 00:26, Willy Tarreau wrote:
> > > Both work for me using firefox (green flash after reload).
> >
> > It wasn't working when I tested it. I rebooted for a kernel upgrade and
> > it still wasn't working.
> >
> > And then a while later I was poking around in my zabbix UI and saw the
> > green lightning bolt. No idea what changed. Glad it's working, but
> > problems that fix themselves annoy me because I usually never learn
> > what happened.
>
> I think I know what happened.
>
> I was having problems with my pacemaker cluster where it got very
> confused about the haproxy resource. I had the haproxy service enabled at
> boot on both systems. I have now disabled that in systemd so it's fully
> under the control of pacemaker. I'm pretty sure that pacemaker was
> confused because it saw the service running on a system where it should
> have been disabled and pacemaker didn't start it ... and it decided that
> was unacceptable and basically broke the cluster.
>
> So for a while I had the virtual IP resource on the "lesser" server and
> the haproxy resource on the main server. But because I had haproxy
> enabled at boot time, it was actually running on both. The haproxy config
> is the same between both systems, but the other server was still running
> a broken haproxy version. Most of the backends are actually on the better
> server, accessed by br0 IP address rather than localhost, so the broken
> haproxy was still sending them to the right place. This also explains why
> I was not seeing traffic with tcpdump filtering on "udp port 443". I have
> a ways to go before I've got true HA for my websites. Setting up a
> database cluster is going to be challenging, I think.
>
> I got pacemaker back in working order after I was done with my testing,
> so both resources were colocated on the better server and haproxy was not
> running on the other one.
> I think you tried the URLs after I had fixed pacemaker, and when I saw it
> working on zabbix, that was also definitely after I fixed pacemaker.

Thanks for sharing your analysis. Indeed, everything makes sense now.

> On that UDP bind thing ... I now have three binds defined. The virtual
> IP, the IP of the first server, and the IP of the second server.

As long as you don't have too many nodes, that's often the simplest thing
to do. It requires net.ipv4.ip_nonlocal_bind=1, but that's extremely
frequent where haproxy runs.

Willy
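For anyone reproducing this setup, the non-local bind sysctl mentioned above is typically made persistent with a drop-in file like the following (the file name is just an example):

```
# /etc/sysctl.d/90-haproxy.conf (example file name)
# Allow haproxy to bind to a virtual IP that is not currently
# assigned to this node (needed for VIP failover setups).
net.ipv4.ip_nonlocal_bind = 1
```

Apply it immediately with `sysctl -p /etc/sysctl.d/90-haproxy.conf` or simply `sysctl -w net.ipv4.ip_nonlocal_bind=1`.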
Re: Theoretical limits for a HAProxy instance
Hi,

> Ah that's pretty useful :-) It's very likely dealing with the handshake.
> Could you please run "perf top" on this machine and list the 10 top-most
> lines ? I'm interested in seeing if you're saturating on crypto functions
> or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> indicate an abuse of mutexes.

We are load testing with k6 (https://k6.io/), on a server with an
AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test with 300k
connections, 15k rps, and an sslrate and connrate of about 4k. The server
had a `load1` of 12, and the `node_nf_conntrack_entries` were about 450k.
The `haproxy_server_connect_time_average_seconds` was between 0.5s and 2s,
and the `haproxy_backend_connect_time_average_seconds` was always about
0.2s lower.

We have these custom kernel parameters set:

net.nf_conntrack_max=5000000
fs.nr_open=5000000
net.ipv4.ip_local_port_range=12768 60999

At that load, we were starting to see just a small number of 5xx, and in
the logs we could see that they were caused by connect timeouts.

XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker
261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET
/ HTTP/1.1" 0//0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384

This is the output from `perf top`, after 40M samples, during the test:

17.38%  libc.so.6  [.] pthread_rwlock_rdlock
10.25%  libc.so.6  [.] pthread_rwlock_unlock
 2.32%  haproxy    [.] sha512_block_data_order_avx2
 1.64%  libc.so.6  [.] malloc
 1.62%  libc.so.6  [.] cfree
 1.51%  haproxy    [.] fe_mul
 1.43%  libc.so.6  [.] 0x000a4620
 1.29%  haproxy    [.] OPENSSL_LH_retrieve
 1.16%  haproxy    [.] x25519_fe64_sqr
 0.80%  haproxy    [.] bn_sqrx8x_internal
 0.75%  haproxy    [.] ossl_sa_doall_arg
 0.70%  haproxy    [.] ossl_lh_strcasehash
 0.67%  [kernel]   [k] __nf_conntrack_find_get
 0.67%  haproxy    [.] OPENSSL_cleanse
 0.66%  haproxy    [.] do_name
 0.56%  haproxy    [.] OPENSSL_LH_doall_arg
 0.53%  haproxy    [.] x25519_fe64_mul
 0.51%  haproxy    [.] fe_sq

> By the way, are you running with OpenSSL
> 3.0 ?
> That one is absolutely terrible and makes extreme abuse of mutexes and
> locks, to the point that certain workloads were divided by 2-digit
> numbers between 1.1.1 and 3.0. It took me one day to figure out that my
> load generator which was capping at 400 conn/s was in fact suffering from
> an accidental build using 3.0, while with 1.1.1 the perf went back to
> 75000/s!

Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
and we tested all the possible combinations between HAProxy 2.5.3 and
2.6.6, compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a
noticeable performance hit running tests similar to the previously
mentioned one. We are planning to try again (next week) just stressing the
sslrate so we can assess the performance hit a bit more.

On Wed, Dec 14, 2022 at 2:07 PM Willy Tarreau wrote:
>
> On Wed, Dec 14, 2022 at 12:50:10PM +0100, Iago Alonso wrote:
> > Hi,
> >
> > We are not sure what element produces the errors, in the haproxy logs
> > we don't see them.
>
> Then they don't pass through haproxy nor do they come from haproxy.
>
> > What happens with the new connections when we hit the limit?
>
> If you're speaking about the maxconnrate/maxsessrate, the connection
> acceptance is simply delayed until the rate becomes lower again.
>
> > What server resource should be affected by it, if any?
>
> None at all, actually the connection remains in the kernel's accept
> queue. However this can have as a side effect that once too many are
> queued, new ones are just rejected once the kernel queue is full, and
> maybe the client translates this into a 5xx (this wouldn't shock me
> as long as there is a way for the user to figure it out).
>
> > We have our logs in `warning` level so we do not see the response time
> > on the haproxy logs.
>
> Ah, that's an important point. So in fact you almost don't have logs,
> and maybe you're really having 5xx that you don't log.
> You could perhaps just leave them at "info" level and add
> "option dontlog-normal" that will eliminate logs for successful requests
> and only log the failed ones.
>
> > In the `haproxy_backend_response_time_average_seconds` we see "normal"
> > values (20ms average and some spikes at less than 80ms).
>
> OK, that's good.
>
> > We see a spike in the `haproxy_backend_connect_time_average_seconds`,
> > as well as in the `haproxy_server_connect_time_average_seconds` metrics
> > when the errors start.
>
> Interesting, then this could be related to either a conntrack filling on
> one element in the chain (including both extremities),
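The logging suggestion quoted above ("info" level plus "option dontlog-normal") could look roughly like this in a frontend section; the names, address, and certificate path are illustrative, not taken from the poster's configuration:

```
# Illustrative frontend: log at "info" level but drop logs for
# successful requests, keeping only errors, timeouts and aborts.
frontend https-front
    bind :443 ssl crt /etc/haproxy/certs/site.pem
    log /dev/log local0 info
    option dontlog-normal
    default_backend https-back
```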
Re: Followup on openssl 3.0 note seen in another thread
On 12/16/22 01:59, Shawn Heisey wrote:
> On 12/16/22 00:26, Willy Tarreau wrote:
> > Both work for me using firefox (green flash after reload).
>
> It wasn't working when I tested it. I rebooted for a kernel upgrade and
> it still wasn't working.
>
> And then a while later I was poking around in my zabbix UI and saw the
> green lightning bolt. No idea what changed. Glad it's working, but
> problems that fix themselves annoy me because I usually never learn what
> happened.

I think I know what happened.

I was having problems with my pacemaker cluster where it got very confused
about the haproxy resource. I had the haproxy service enabled at boot on
both systems. I have now disabled that in systemd so it's fully under the
control of pacemaker. I'm pretty sure that pacemaker was confused because
it saw the service running on a system where it should have been disabled
and pacemaker didn't start it ... and it decided that was unacceptable and
basically broke the cluster.

So for a while I had the virtual IP resource on the "lesser" server and the
haproxy resource on the main server. But because I had haproxy enabled at
boot time, it was actually running on both. The haproxy config is the same
between both systems, but the other server was still running a broken
haproxy version. Most of the backends are actually on the better server,
accessed by br0 IP address rather than localhost, so the broken haproxy was
still sending them to the right place. This also explains why I was not
seeing traffic with tcpdump filtering on "udp port 443". I have a ways to
go before I've got true HA for my websites. Setting up a database cluster
is going to be challenging, I think.

I got pacemaker back in working order after I was done with my testing, so
both resources were colocated on the better server and haproxy was not
running on the other one. I think you tried the URLs after I had fixed
pacemaker, and when I saw it working on zabbix, that was also definitely
after I fixed pacemaker.
On that UDP bind thing ... I now have three binds defined: the virtual IP,
the IP of the first server, and the IP of the second server.

Thanks,
Shawn
Re: Followup on openssl 3.0 note seen in another thread
On 12/16/22 00:26, Willy Tarreau wrote:
> Both work for me using firefox (green flash after reload).

It wasn't working when I tested it. I rebooted for a kernel upgrade and it
still wasn't working.

And then a while later I was poking around in my zabbix UI and saw the
green lightning bolt. No idea what changed. Glad it's working, but problems
that fix themselves annoy me because I usually never learn what happened.

> You indeed need to bind to both the native and the virtual IP addresses
> (you can have the two on the same "bind" line, delimited by comma).

That's the little bit of info that I needed. Now it works the way I was
expecting with both IP addresses.

I have a lot less experience with UDP than TCP, and I wasn't aware of that
gotcha. It does make perfect sense now that it's been pointed out.

Thanks,
Shawn
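A sketch of what the comma-delimited bind described above might look like for an HTTP/3 frontend. The addresses, certificate path, and backend name are made up for illustration; only the comma-delimited address list and the quic4@ prefix reflect the advice in the thread:

```
# Illustrative QUIC frontend binding both the server's native IP and the
# cluster's virtual IP on each "bind" line (comma-delimited addresses).
frontend https-front
    bind 192.0.2.10:443,192.0.2.20:443 ssl crt /etc/haproxy/certs/site.pem alpn h2,http/1.1
    bind quic4@192.0.2.10:443,quic4@192.0.2.20:443 ssl crt /etc/haproxy/certs/site.pem alpn h3
    default_backend web-servers
```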
Re: Followup on openssl 3.0 note seen in another thread
On 12/16/22 00:01, Willy Tarreau wrote:
> - if you want to use QUIC, use quictls-1.1.1. Once you have to build
>   something yourself, you definitely don't want to waste your time on the
>   performance-crippled 3.0, and 1.1.1 will change less often than 3.0, so
>   that also means fewer package updates.
>
> - if you want to experiment with QUIC and help developers, running
>   compatibility tests with the latest haproxy master and the latest
>   WolfSSL master could be useful. I just don't know if the maintainers
>   are ready to receive lots of uncoordinated reports yet; I'm aware that
>   they're still in the process of fixing a few basic integration issues
>   that will make things run much smoother soon. Similarly, LibreSSL's
>   QUIC support is very recent (3.6) and few people seem to use LibreSSL,
>   so I don't know how well it's supported in distros these days. More
>   tests on this one would probably be nice and might encourage its
>   support.

I'd say that I am somewhere in between these two. Helping the devs is not
an EXPLICIT goal, but I am already tinkering with this stuff for myself, so
it's not a lot of extra effort to be involved here. I think my setup can
provide a little bit of useful data and another test environment. Pursuing
http3 has been fun.

Straying offtopic: I find that being a useful member of open source
communities is an awesome experience. For this one I'm not as much use at
the code level as I am for other communities. My experience with C was a
long time ago ... it was one of my first languages. I spend more time with
Bash and Java than anything else these days, and occasionally delve into
Perl, which I really like.

On the subject of building things myself ... way back in the 90s I used to
build all my own Linux kernels, enabling only what I needed, building it
into the kernel directly, and optimizing for the specific CPU in the
machine. And I tended to build most of the software I used from source as
well.
These days, some distros have figured out how to do all these things better
than I ever could, so I mostly install from apt repos. For really
mainstream software, they keep up with recent versions pretty well. For
some software, haproxy being one of the most prominent, the distro packages
are so far behind what's current that I pretty much have to build it myself
if I want useful features.

I got started using haproxy with version 1.4, and quickly went to 1.5-dev
because I was pursuing the best TLS setup I could get. In those days I
wasn't using source repositories; I would download tarballs from 1wt.eu.

Thanks,
Shawn