Re: Stalled H3/QUIC Connections with Packet Loss

Tristan Sun, 24 Sep 2023 16:25:14 -0700

That said, Luke, even if this fixed the issue of this thread, it would be 
helpful if you (and other people using QUIC or considering it) reported on how 
it runs for you.

At the moment I have a couple of suspected bugs that I can’t report because I 
lack a reproducer and diagnostic info, plus some performance issues that might 
be my fault rather than HAProxy’s, which is always very hard to tell apart 
from performance metrics alone.

I think we should make a how-to-help wiki page btw, since the information is 
mostly spread across github issues, but the main 3 points would be:
- to build with ASAN [1]
- to enable traces (quic, qmux, and h3) using a ring in /dev/shm [2]
- to enable core dumps for haproxy [3]

1: Use clang if you don’t build haproxy directly on the runtime host btw. GCC 
doesn’t actually support static linking with ASAN. Longer rant on it here: 
https://github.com/haproxy/haproxy/issues/2120#issuecomment-1515405340
Also set the following environment variable for the HAProxy process so that 
core dumps don’t get prevented:
ASAN_OPTIONS="disable_coredump=0:unmap_shadow_on_exit=1:abort_on_error=1"

2: For traces, you want to set up a ring (see the haproxy doc) backed by a 
file in a RAM-backed directory (typically /dev/shm, e.g. 
/dev/shm/haproxy-quic), sized to hold at least 1 or 2 minutes worth of traces 
(the size depends mostly on request rate; with 350 rqps that works out to 
120MB for 2.5 minutes or so here). This is where low-level detailed logs will 
be written, in a circular RAM-backed file. Using disk-based files or stdout 
will just waste resources, if it can even sustain the throughput, so it’s 
definitely not a good idea to handwave that part.

In the end, it’ll look something like this:
ring traces
    format timed
    maxlen 3072
    size 134217728 # 128MB
    backing-file /dev/shm/haproxy-quic

Then in the global section:
traces h3 level developer verbosity minimal sink traces
traces qmux level developer verbosity minimal sink traces
traces quic level developer sink traces
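
As a rough sizing aid for the ring above, here is a minimal sketch that scales 
the figures quoted earlier (about 120MB for about 2.5 minutes at about 350 
rqps) to another request rate and retention window. The function names are 
hypothetical helpers, not HAProxy tooling, and the per-request figure is only 
that one observation: trace volume will vary with verbosity and traffic mix.

```shell
#!/bin/sh
# Rough ring sizing from the figures in this mail: ~120 MB of traces
# for ~150 s at ~350 requests per second. Hypothetical helpers only.

# Approximate bytes of trace output per request (integer division).
bytes_per_request() {
    echo $(( 120 * 1024 * 1024 / (350 * 150) ))
}

# estimate_ring_bytes RQPS SECONDS
# Prints a suggested value in bytes for the ring's "size" setting.
estimate_ring_bytes() {
    rqps=$1
    seconds=$2
    echo $(( $(bytes_per_request) * rqps * seconds ))
}
```

For example, `estimate_ring_bytes 350 150` prints 125790000, which fits in the 
128MB ring shown above; doubling the request rate roughly doubles the size 
needed for the same window.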

3: This is supposed to be trivial, but alas. First you need to check the 
kernel.core_pattern sysctl. This is either a command or an output file path, 
relative to the working directory of the HAProxy *processes* (not the root 
process; so if you use a chroot, it is a path within the chroot). It needs to 
be writable by those haproxy subprocesses. Then you need to check that 
ulimit -c (max core size) is big enough for haproxy to dump fully (it’s often 
set to 0 out of the box), and check that systemd doesn’t override it if you 
use it (it does, out of the box, and you need a systemd unit override like 
LimitCORE=infinity).

Then with all this, when HAProxy crashes, you’ll have:
1. ASAN logs if applicable via stdout/stderr (I forget which, and the 
distinction remains one of the great pains of IT), which you can symbolize 
using the asan_symbolize.py script in clang’s (LLVM) source tree
2. Low-level traces in /dev/shm/haproxy-quic.bak (on process restart the old 
ramfile is mv’d with a .bak suffix)
3. Your coredump in whatever path is in core_pattern
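
Since /dev/shm does not survive a reboot, and the .bak file is presumably 
replaced on a later restart (an assumption, not something verified here), it 
can help to copy these artifacts somewhere persistent right after a crash. A 
minimal sketch, with a hypothetical helper name and example paths:

```shell
#!/bin/sh
# collect_crash_artifacts OUTDIR FILE...
# Copies each existing FILE into OUTDIR; reports missing ones on stderr.
# Returns non-zero if any file was missing or could not be copied.
collect_crash_artifacts() {
    out=$1
    shift
    mkdir -p "$out" || return 1
    status=0
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -- "$f" "$out"/ || status=1
        else
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (paths are placeholders; the core path depends on core_pattern):
#   collect_crash_artifacts /var/tmp/haproxy-crash \
#       /dev/shm/haproxy-quic.bak /path/to/core
```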

Alas, that is only for crash debugging.
For performance there’s no really good solution atm.
I just monitor the closest Prometheus metric to %Tu 
(average_total_time_seconds) alongside process uptimes, plus how often users 
complain about issues, though that’s not the most reliable metric… Something 
like the NEL spec is by far the more scientific approach to this: 
https://web.dev/network-error-logging/

Hopefully in the coming days I’ll find time to compile this long mail into a 
slightly more organized wiki article with snippets etc, but this will do in 
the meantime.

Tristan

> On 22 Sep 2023, at 16:07, Amaury Denoyelle <[email protected]> wrote:
> On Fri, Sep 22, 2023 at 03:30:58PM +0200, Luke Seelenbinder wrote:
>> If it's any help, here's `show quic full` for a stalled connection:
>> [...]
> 
> Tristan was right: as we saw here, fd=-1 means that there is
> probably a permission issue for bind() on connection sockets. You should
> have a look at the new setcap directive to fix it.
> 
> Thanks to Tristan's previous report, we know there is a real performance
> issue when using QUIC listener sockets. We have to investigate this. You
> have probably encountered an occurrence of it. In the meantime, it's
> important to ensure you run with connection-dedicated sockets instead.
> 
> The socket fallback was implemented silently at first, as on some
> platforms it may not be supported at all. We plan to change this soon to
> report a log on the first permission error to improve the current
> situation.
> 
> Hope this helps,
> 
> --
> Amaury Denoyelle
