Re: Theoretical limits for a HAProxy instance

2022-12-16 Thread Willy Tarreau
On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
> Hi,
> 
> > Ah that's pretty useful :-) It's very likely dealing with the handshake.
> > Could you please run "perf top" on this machine and list the 10 top-most
> > lines ? I'm interested in seeing if you're saturating on crypto functions
> > or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> > indicate an abuse of mutexes.
> 
> We are load testing with k6 (https://k6.io/), in a server with an
> AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test
> with 300k connections, 15k rps, and a sslrate and connrate of about 4k.
> The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
> 450k. The `haproxy_server_connect_time_average_seconds` was between
> 0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
> always about 0.2s lower.
> 
> We have these custom kernel parameters set:
> net.nf_conntrack_max=5000000
> fs.nr_open=5000000
> net.ipv4.ip_local_port_range=12768 60999

Interesting, so you have conntrack enabled. With 5M conns, there's no
reason to fill your table. However, have you checked your servers' kernel
logs to see if you find any "conntrack table full" message, that might
be caused by too fast connection recycling ? 
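A quick way to check both points on the haproxy box is sketched below; the
sysctl paths are the standard netfilter ones, and the script degrades
gracefully if the conntrack module isn't loaded (dmesg may need root):

```shell
#!/bin/sh
# Compare current conntrack usage against the configured ceiling.
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo "n/a")
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo "n/a")
echo "conntrack usage: $count / $max"

# Look for drops logged by the kernel when the table overflows.
dmesg 2>/dev/null | grep -i "table full" || echo "no 'table full' messages seen"
```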

> At that load, we were starting to see just a small number of 5xx, and
> in the logs we could see that they were caused by connect timeouts:
> XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker 261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET / HTTP/1.1" 0//0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384

Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
rule in your iptables whose counter increases during the tests. It might
be possible that certain outgoing connections are rejected due to fast
port recycling if they were closed by an RST that was lost before hitting
conntrack. Another possibility here is that the TCP connection was
established but that the SSL handshake didn't complete in time. This
could be caused precisely by the openssl 3 locking issues I was speaking
about, but I guess not at such numbers, and you'd very likely see
the native_queued_spin_lock_slowpath thing in perf top.
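One way to test the first hypothesis is to watch the packet counters on any
INVALID-state rules while the load test runs. A sketch (the exact rule layout
will differ per firewall setup, and listing rules usually needs root):

```shell
#!/bin/sh
# Dump filter-table rules with their counters and keep only the
# INVALID-state matches. Run before and during the test and compare
# the packet counts: a growing counter means dropped connections.
if command -v iptables >/dev/null 2>&1; then
    iptables -L -v -n 2>/dev/null | grep -i "state INVALID" \
        || echo "no INVALID-state rules found (or not enough privileges)"
else
    echo "iptables not installed"
fi
```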

> This is the output from `perf top`, after 40M samples, during the test
> 
>   17.38%  libc.so.6   [.] pthread_rwlock_rdlock
>   10.25%  libc.so.6   [.] pthread_rwlock_unlock

With that said, excessive use of rwlock indicates that openssl is
very likely there, because that's about the only caller in the process.

To counter this abuse, we've developed a dirty counter-measure that
consists of providing our own rwlock implementation as a replacement for
the pthread one. That's really not beautiful but we have to deal with
what we have. You can enable it in 2.7 by adding USE_PTHREAD_EMULATION=1
to your "make" command line. These locks are way less expensive but they
will never sleep (no syscall), so instead of idling when the machine is
congested, you'll see pl_wait_unlock_long() eating CPU in "perf top"
output. In our tests that roughly doubles the performance under high
congestion, but it's hard to be more specific. In fact it mostly gets us
back the lock callbacks that were removed in OpenSSL 1.1.0.
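For reference, a build line enabling it could look like the following; the
TARGET and TLS options here are illustrative, match them to your existing
build flags:

```
make -j$(nproc) TARGET=linux-glibc USE_OPENSSL=1 USE_PTHREAD_EMULATION=1
```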

> Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
> and we tested all the possible combinations between HAProxy 2.5.3 and
> 2.6.6, compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a
> noticeable performance hit running tests similar to the one mentioned
> above. We are planning to try again (next week), just stressing the
> sslrate, so we can assess the performance hit a bit more.

OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
fixes, though these should not impact your tests and 2.7.0 will be OK).
And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
If so it will definitely confirm that you're endlessly waiting for some
threads stuck in SSL waiting on a mutex.

I would really like us to distinguish L4 from L6 in the log output, to
indicate whether the L4 connection was established but SSL was not, or
whether it's blocking at L4. The tricky part is that it requires adding
new values for the termination flags, and everyone is very used to the
current ones... But I think we'll have to do it anyway, at least to sort
out such issues.

Regards,
Willy



Re: Followup on openssl 3.0 note seen in another thread

2022-12-16 Thread Willy Tarreau
On Fri, Dec 16, 2022 at 06:58:33AM -0700, Shawn Heisey wrote:
> On 12/16/22 01:59, Shawn Heisey wrote:
> > On 12/16/22 00:26, Willy Tarreau wrote:
> >  > Both work for me using firefox (green flash after reload).
> > 
> > It wasn't working when I tested it.  I rebooted for a kernel upgrade and
> > it still wasn't working.
> > 
> > And then a while later I was poking around in my zabbix UI and saw the
> > green lightning bolt.  No idea what changed.  Glad it's working, but
> > problems that fix themselves annoy me because I usually never learn what
> > happened.
> 
> I think I know what happened.
> 
> I was having problems with my pacemaker cluster where it got very confused
> about the haproxy resource.  I had the haproxy service enabled at boot for
> both systems.  I have now disabled that in systemd so it's fully under the
> control of pacemaker.  I'm pretty sure that pacemaker was confused because
> it saw the service running on a system where it should have been disabled
> and pacemaker didn't start it ... and it decided that was unacceptable and
> basically broke the cluster.
> 
> So for a while I had the virtual IP resource on the "lesser" server and the
> haproxy resource on the main server.  But because I had haproxy enabled at
> boot time, it was actually running on both.  The haproxy config is the same
> between both systems, but the other server was still running a broken
> haproxy version.  Most of the backends are actually on the better server
> accessed by br0 IP address rather than localhost, so the broken haproxy was
> still sending them to the right place.  This also explains why I was not
> seeing traffic with tcpdump filtering on "udp port 443".  I have a ways to
> go before I've got true HA for my websites.  Setting up a database cluster
> is going to be challenging, I think.
> 
> I got pacemaker back in working order after I was done with my testing, so
> both resources were colocated on the better server and haproxy was not
> running on the other one.  I think you tried the URLs after I had fixed
> pacemaker, and when I saw it working on zabbix, that was also definitely
> after I fixed pacemaker.

Thanks for sharing your analysis. Indeed, everything makes sense now.

> On that UDP bind thing ... I now have three binds defined.  The virtual IP,
> the IP of the first server, and the IP of the second server.

As long as you don't have too many nodes, that's often the simplest thing
to do. It requires ip_nonlocal_bind=1 (the net.ipv4.ip_nonlocal_bind
sysctl) but that's extremely common where haproxy runs.
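For reference, making that setting persistent could look like this (the
sysctl's canonical name is net.ipv4.ip_nonlocal_bind; the file name below is
just a convention):

```
# /etc/sysctl.d/99-haproxy.conf
net.ipv4.ip_nonlocal_bind = 1
```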

Willy



Re: Theoretical limits for a HAProxy instance

2022-12-16 Thread Iago Alonso
Hi,

> Ah that's pretty useful :-) It's very likely dealing with the handshake.
> Could you please run "perf top" on this machine and list the 10 top-most
> lines ? I'm interested in seeing if you're saturating on crypto functions
> or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> indicate an abuse of mutexes.

We are load testing with k6 (https://k6.io/), in a server with an
AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test
with 300k connections, 15k rps, and a sslrate and connrate of about 4k.
The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
450k. The `haproxy_server_connect_time_average_seconds` was between
0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
always about 0.2s lower.

We have these custom kernel parameters set:
net.nf_conntrack_max=5000000
fs.nr_open=5000000
net.ipv4.ip_local_port_range=12768 60999

At that load, we were starting to see just a small number of 5xx, and in
the logs we could see that they were caused by connect timeouts:

XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker 261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET / HTTP/1.1" 0//0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384

This is the output from `perf top`, after 40M samples, during the test

  17.38%  libc.so.6  [.] pthread_rwlock_rdlock
  10.25%  libc.so.6  [.] pthread_rwlock_unlock
   2.32%  haproxy    [.] sha512_block_data_order_avx2
   1.64%  libc.so.6  [.] malloc
   1.62%  libc.so.6  [.] cfree
   1.51%  haproxy    [.] fe_mul
   1.43%  libc.so.6  [.] 0x000a4620
   1.29%  haproxy    [.] OPENSSL_LH_retrieve
   1.16%  haproxy    [.] x25519_fe64_sqr
   0.80%  haproxy    [.] bn_sqrx8x_internal
   0.75%  haproxy    [.] ossl_sa_doall_arg
   0.70%  haproxy    [.] ossl_lh_strcasehash
   0.67%  [kernel]   [k] __nf_conntrack_find_get
   0.67%  haproxy    [.] OPENSSL_cleanse
   0.66%  haproxy    [.] do_name
   0.56%  haproxy    [.] OPENSSL_LH_doall_arg
   0.53%  haproxy    [.] x25519_fe64_mul
   0.51%  haproxy    [.] fe_sq

> By the way, are you running with OpenSSL
> 3.0 ?  That one is absolutely terrible and makes extreme abuse of mutexes
> and locks, to the point that certain workloads were divided by 2-digit
> numbers between 1.1.1 and 3.0. It took me one day to figure that my load
> generator which was capping at 400 conn/s was in fact suffering from an
> accidental build using 3.0 while in 1.1.1 the perf went back to 75000/s!

Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
and we tested all the possible combinations between HAProxy 2.5.3 and
2.6.6, compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a
noticeable performance hit running tests similar to the one mentioned
above. We are planning to try again (next week), just stressing the
sslrate, so we can assess the performance hit a bit more.

On Wed, Dec 14, 2022 at 2:07 PM Willy Tarreau  wrote:
>
> On Wed, Dec 14, 2022 at 12:50:10PM +0100, Iago Alonso wrote:
> > Hi,
> >
> > We are not sure what element produces the errors, in the haproxy logs we
> > don't see them.
>
> Then they don't pass through haproxy nor do they come from haproxy.
>
> > What does it happen with the new connections when we hit the
> > limit?
>
> If you're speaking about the maxconnrate/maxsessrate, the connection
> acceptance is simply delayed until the rate becomes lower again.
>
> > What server resource should be affected by it, if any?
>
> None at all, actually the connection remains in the kernel's accept
> queue. However this can have as a side effect that once too many are
> queued, new ones are just rejected once the kernel queue is full, and
> maybe the client translates this into a 5xx (this wouldn't shock me
> as long as there is a way for the user to figure it).
>
> > We have our logs in `warning` level so we do not see the response time on
> > the haproxy logs.
>
> Ah, that's an important point. So in fact you almost don't have logs,
> and maybe you're really having 5xx that you don't log. You could
> perhaps just leave them at "info" level and add "option dontlog-normal"
> that will eliminate logs for successful requests and only log the failed
> ones.
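A sketch of the change Willy suggests, in a haproxy defaults section; the
surrounding directives are illustrative, only the last line is the one being
discussed:

```
defaults
    log global
    option httplog
    # keep errors in the logs but drop entries for fully
    # successful requests to keep the volume down
    option dontlog-normal
```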
>
> > In the `haproxy_backend_response_time_average_seconds` we
> > see "normal" values (20ms average and some spikes at less than 80ms).
>
> OK, that's good.
>
> > We see a spike in the `haproxy_backend_connect_time_average_seconds`, as
> > well as in the `haproxy_server_connect_time_average_seconds` metrics when
> > the errors start.
>
> Interesting, then this could be related to either a conntrack filling on
> one element in the chain (including both extremities), 

Re: Followup on openssl 3.0 note seen in another thread

2022-12-16 Thread Shawn Heisey

On 12/16/22 01:59, Shawn Heisey wrote:

On 12/16/22 00:26, Willy Tarreau wrote:
 > Both work for me using firefox (green flash after reload).

It wasn't working when I tested it.  I rebooted for a kernel upgrade and 
it still wasn't working.


And then a while later I was poking around in my zabbix UI and saw the 
green lightning bolt.  No idea what changed.  Glad it's working, but 
problems that fix themselves annoy me because I usually never learn what 
happened.


I think I know what happened.

I was having problems with my pacemaker cluster where it got very 
confused about the haproxy resource.  I had the haproxy service enabled 
at boot for both systems.  I have now disabled that in systemd so it's 
fully under the control of pacemaker.  I'm pretty sure that pacemaker 
was confused because it saw the service running on a system where it 
should have been disabled and pacemaker didn't start it ... and it 
decided that was unacceptable and basically broke the cluster.


So for a while I had the virtual IP resource on the "lesser" server and 
the haproxy resource on the main server.  But because I had haproxy 
enabled at boot time, it was actually running on both.  The haproxy 
config is the same between both systems, but the other server was still 
running a broken haproxy version.  Most of the backends are actually on 
the better server accessed by br0 IP address rather than localhost, so 
the broken haproxy was still sending them to the right place.  This also 
explains why I was not seeing traffic with tcpdump filtering on "udp 
port 443".  I have a ways to go before I've got true HA for my websites.
Setting up a database cluster is going to be challenging, I think.


I got pacemaker back in working order after I was done with my testing, 
so both resources were colocated on the better server and haproxy was 
not running on the other one.  I think you tried the URLs after I had 
fixed pacemaker, and when I saw it working on zabbix, that was also 
definitely after I fixed pacemaker.


On that UDP bind thing ... I now have three binds defined.  The virtual 
IP, the IP of the first server, and the IP of the second server.


Thanks,
Shawn



Re: Followup on openssl 3.0 note seen in another thread

2022-12-16 Thread Shawn Heisey

On 12/16/22 00:26, Willy Tarreau wrote:
> Both work for me using firefox (green flash after reload).

It wasn't working when I tested it.  I rebooted for a kernel upgrade and 
it still wasn't working.


And then a while later I was poking around in my zabbix UI and saw the 
green lightning bolt.  No idea what changed.  Glad it's working, but 
problems that fix themselves annoy me because I usually never learn what 
happened.


> You indeed need to
> bind to both the native and the virtual IP addresses (you can have the
> two on the same "bind" line, delimited by comma).

That's the little bit of info that I needed.  Now it works the way I was 
expecting with both IP addresses.  I have a lot less experience with UDP 
than TCP, I wasn't aware of that gotcha.  It does make perfect sense now 
that it's been pointed out.
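For reference, the comma-delimited form Willy mentions could look like this
for a QUIC listener; the addresses, frontend name, and certificate path below
are placeholders:

```
frontend https-in
    # one bind line covering both the native and the virtual address
    bind quic4@192.0.2.10:443,quic4@192.0.2.11:443 ssl crt /etc/haproxy/site.pem alpn h3
```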


Thanks,
Shawn



Re: Followup on openssl 3.0 note seen in another thread

2022-12-16 Thread Shawn Heisey

On 12/16/22 00:01, Willy Tarreau wrote:

   - if you want to use QUIC, use quictls-1.1.1. Once you have to build
 something yourself, you definitely don't want to waste your time on
 the performance-crippled 3.0, and 1.1.1 will change less often than
 3.0 so that also means less package updates.

   - if you want to experiment with QUIC and help developers, running
 compatibility tests with the latest haproxy master and the latest
 WolfSSL master could be useful. I just don't know if the maintainers
 are ready to receive lots of uncoordinated reports yet, I'm aware
 that they're still in the process of fixing a few basic integration
 issues that will make things run much smoother soon. Similarly,
 LibreSSL's QUIC support is very recent (3.6) and few people seem to
 use LibreSSL, I don't know how well it's supported in distros these
 days. More tests on this one would probably be nice and may possibly
 encourage its support.


I'd say that I am somewhere in between these two.  Helping the devs is 
not an EXPLICIT goal, but I am already tinkering with this stuff for 
myself, so it's not a lot of extra effort to be involved here.  I think 
my setup can provide a little bit of useful data and another test 
environment.  Pursuing http3 has been fun.


Straying offtopic:

I find that being a useful member of open source communities is an 
awesome experience.  For this one I'm not as much use at the code level 
as I am for other communities.  My experience with C was a long time ago 
... it was one of my first languages.  I spend more time with Bash and 
Java than anything else these days.  I occasionally delve into Perl, 
which I really like.


On the subject of building things myself ... way back in the 90s I used 
to build all my own Linux kernels, enabling only what I needed, building 
it into the kernel directly, and optimizing for the specific CPU in the 
machine.  And I tended to build most of the software I used from source 
as well.


These days, some distros have figured out how to do all these things 
better than I ever could, so I mostly install from apt repos.  For 
really mainstream software, they keep up with recent versions pretty well.


For some software, haproxy being one of the most prominent, the distro 
packages are so far behind what's current that I pretty much have to 
build it myself if I want useful features.  I got started using haproxy 
with version 1.4, and quickly went to 1.5-dev because I was pursuing the 
best TLS setup I could get.  In those days I wasn't using source 
repositories, I would download tarballs from 1wt.eu.


Thanks,
Shawn