We are happy to report that after downgrading to OpenSSL 1.1.1s (from
3.0.7), our performance problems are solved, and it now looks like
HAProxy scales linearly with the available resources.

For reference, in a synthetic load test with a request payload of 2k,
on a 32-core server (128GB RAM) with 10Gb of bandwidth available, we are
able to sustain 1.83M connections, 82k rps, and an sslrate of 22k, with
a load average of about 31 (idle was about 6%).
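
For anyone wanting to reproduce the setup: one way to build HAProxy
against a specific OpenSSL is through the SSL_INC/SSL_LIB make
variables. A rough sketch, with placeholder paths rather than our exact
layout:

    # Build OpenSSL 1.1.1s into its own prefix:
    ./config --prefix=/opt/openssl-1.1.1s shared
    make -j"$(nproc)" && make install_sw

    # Point HAProxy's build at that prefix, keeping an rpath so the
    # matching libssl/libcrypto are loaded at runtime:
    make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 \
         SSL_INC=/opt/openssl-1.1.1s/include \
         SSL_LIB=/opt/openssl-1.1.1s/lib \
         ADDLIB="-Wl,-rpath,/opt/openssl-1.1.1s/lib"

    # Confirm what the binary was built with and is running on:
    ./haproxy -vv | grep -i "openssl version"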

On Wed, Dec 21, 2022 at 5:39 PM Iago Alonso <iago.alo...@marfeel.com> wrote:
>
> > Interesting, so you have conntrack enabled. With 5M conns, there's no
> > reason to fill your table. However, have you checked your servers' kernel
> > logs to see if you find any "conntrack table full" message, that might
> > be caused by too fast connection recycling ?
> We can't see any message mentioning conntrack.
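>
> For reference, the sort of checks we ran (commands are illustrative,
> not our exact ones):
>
>     # Any "table full, dropping packet" messages from conntrack?
>     dmesg -T | grep -i conntrack
>
>     # Current entry count vs. the configured ceiling:
>     sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max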
>
> > With that said, excessive use of rwlock indicates that openssl is
> > very likely there, because that's about the only caller in the process.
> Indeed, we tried compiling HAProxy 2.6.6 against OpenSSL 1.1.1s and the
> `perf top` results look much better. We can't recall why we decided to
> use OpenSSL v3, but it seems we have to stick to v1.
>
>     2.61%  haproxy     [.] sha512_block_data_order_avx2
>     1.89%  haproxy     [.] fe_mul
>     1.64%  haproxy     [.] x25519_fe64_sqr
>     1.44%  libc.so.6   [.] malloc
>     1.39%  [kernel]    [k] __nf_conntrack_find_get
>     1.21%  libc.so.6   [.] cfree
>     0.95%  [kernel]    [k] native_queued_spin_lock_slowpath.part.0
>     0.94%  haproxy     [.] OPENSSL_cleanse
>     0.78%  libc.so.6   [.] 0x00000000000a3b46
>     0.78%  [kernel]    [k] do_epoll_ctl
>     0.75%  [kernel]    [k] ___slab_alloc
>     0.73%  [kernel]    [k] psi_group_change
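>
> For anyone wanting the same kind of profile: the listing above is
> `perf top` output taken during the test; something along these lines
> (flags illustrative) also works to keep a sample around for comparison:
>
>     # Record ~30s system-wide, then browse per-symbol:
>     perf record -F 99 -a -g -- sleep 30
>     perf report --sort symbol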
>
> For reference, with a 12-core machine, we ran a test with 420k
> connections, 32k rps, and a sustained sslrate of 5k; the load1 was
> about 9 and the idle was about 30.
>
> And as a side note, we didn't have http/2 enabled; after enabling it,
> we ran the same test, and the idle was about 55 and the load1 was
> about 6.
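>
> (In case it's useful: with a TLS frontend, h2 is negotiated via ALPN on
> the bind line, and a quick way to check what actually gets negotiated
> is something like this, hostname being a placeholder:
>
>     openssl s_client -connect lb.example.com:443 -alpn h2,http/1.1 \
>         </dev/null 2>/dev/null | grep -i "alpn"
>
> which should print "ALPN protocol: h2" when http/2 is enabled.)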
>
> > OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
> > fixes, though these should not impact your tests and 2.7.0 will be OK).
> > And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
> > If so it will definitely confirm that you're endlessly waiting for some
> > threads stuck in SSL waiting on a mutex.
> We've just tried 2.7.1 with USE_PTHREAD_EMULATION=1 (OpenSSL 3.0.7), and
> the results are pretty much the same, no noticeable improvement.
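>
> (For reference, the flag is just added on the make command line,
> roughly:
>
>     make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_PTHREAD_EMULATION=1
>
> and `haproxy -vv` should list the flag among the build options.)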
>
> > Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
> > rule in your iptables whose counter increases during the tests. It might
> > be possible that certain outgoing connections are rejected due to fast
> > port recycling if they were closed by an RST that was lost before hitting
> > conntrack. Another possibility here is that the TCP connection was
> > established but that the SSL handshake didn't complete in time. This
> > could be caused precisely by the openssl 3 locking issues I was speaking
> > about, but I guess not at such numbers, and you'd very likely see
> > the native_queue_spinlock_slowpath thing in perf top.
> We do have a `-A ufw-before-input -m conntrack --ctstate INVALID -j DROP`
> rule, but the counter does not increase during the tests. It looks like
> OpenSSL v3 is the issue.
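>
> For completeness, what we mean by watching the counter is a check along
> these lines (chain name as in the ufw rule above; re-run during a test
> to see whether the packet counts move):
>
>     iptables -L ufw-before-input -v -n -x | grep -i "ctstate INVALID"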
>
>
> On 16/12/22 18:11, Willy Tarreau wrote:
> > On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
> >> Hi,
> >>
> >>> Ah that's pretty useful :-) It's very likely dealing with the handshake.
> >>> Could you please run "perf top" on this machine and list the 10 top-most
> >>> lines ? I'm interested in seeing if you're saturating on crypto functions
> >>> or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> >>> indicate an abuse of mutexes.
> >> We are load testing with k6 (https://k6.io/), in a server with an
> >> AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test
> >> with 300k connections, 15k rps, and a sslrate and connrate of about 4k.
> >> The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
> >> 450k. The `haproxy_server_connect_time_average_seconds` was between
> >> 0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
> >> always about 0.2s lower.
> >>
> >> We have this custom kernel parameters set:
> >> net.nf_conntrack_max=5000000
> >> fs.nr_open=5000000
> >> net.ipv4.ip_local_port_range=12768    60999
> > Interesting, so you have conntrack enabled. With 5M conns, there's no
> > reason to fill your table. However, have you checked your servers' kernel
> > logs to see if you find any "conntrack table full" message, that might
> > be caused by too fast connection recycling ?
> >
> >> At that load, we were starting to see just a small number of 5xx, and
> >> in the logs
> >> we could see that they were caused by connect timeout.
> >> XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker
> >> 261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET
> >> / HTTP/1.1" 0/0000000000000000/0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384
> > Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
> > rule in your iptables whose counter increases during the tests. It might
> > be possible that certain outgoing connections are rejected due to fast
> > port recycling if they were closed by an RST that was lost before hitting
> > conntrack. Another possibility here is that the TCP connection was
> > established but that the SSL handshake didn't complete in time. This
> > could be caused precisely by the openssl 3 locking issues I was speaking
> > about, but I guess not at such numbers, and you'd very likely see
> > the native_queue_spinlock_slowpath thing in perf top.
> >
> >> This is the output from `perf top`, after 40M samples, during the test
> >>
> >>    17.38%  libc.so.6                   [.] pthread_rwlock_rdlock
> >>    10.25%  libc.so.6                   [.] pthread_rwlock_unlock
> > With that said, excessive use of rwlock indicates that openssl is
> > very likely there, because that's about the only caller in the process.
> >
> > To counter this abuse, we've developed a dirty counter-measure that
> > consists in providing our own rwlock implementation as a replacement for
> > the pthread one. That's really not beautiful but we have to deal with
> > what we have. You can enable it in 2.7 by adding USE_PTHREAD_EMULATION=1
> > to your "make" command line. These locks are way less expensive but they
> > will never sleep (no syscall) so instead of idle when the machine is
> > congested, you'll see pl_wait_unlock_long() eating CPU in "perf top"
> > output. In our tests that roughly doubles the performance under high
> > congestion but it's hard to be that specific. In fact it mostly gets
> > us our lock callbacks back, after they were removed in 1.1.0.
> >
> >> Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
> >> and we tested all the possible combinations between HAProxy 2.5.3 and
> >> 2.6.6, compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a
> >> noticeable performance hit running tests similar to the previously
> >> mentioned one. We are planning to try again (next week), just stressing
> >> the sslrate, so we can assess the performance hit a bit more.
> > OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
> > fixes, though these should not impact your tests and 2.7.0 will be OK).
> > And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
> > If so it will definitely confirm that you're endlessly waiting for some
> > threads stuck in SSL waiting on a mutex.
> >
> > I would really like that we distinguish L4 from L6 in logs output to
> > indicate whether L4 was established and SSL not or if it's blocking at
> > L4. The tricky part is that it requires adding new values for the
> > termination flags, and everyone got used a lot to the current ones...
> > But I think we'll have to do it anyway, at least to sort out such
> > issues.
> >
> > Regards,
> > Willy
