That said, Luke, even if this fixed the issue of this thread, it would be helpful if you (and other people using QUIC or considering it) reported on how it runs for you.
Because at the moment I have a couple of suspected bugs that I lack the reproducibility and diagnostic info to report, and some performance issues that might be my fault rather than HAProxy’s, which is always very hard to tell apart with performance metrics.

I think we should make a how-to-help wiki page btw, since the information is mostly spread across GitHub issues, but the 3 main points would be:
- to build with ASAN [1]
- to enable traces (quic, qmux, and h3) using a ring in /dev/shm [2]
- to enable core dumps for haproxy [3]

1: Use clang if you don’t build haproxy directly on the runtime host btw. GCC doesn’t actually support static linking with ASAN. Longer rant on it here:
https://github.com/haproxy/haproxy/issues/2120#issuecomment-1515405340
Also set the following environment variable for the HAProxy process so that core dumps don’t get prevented:

    ASAN_OPTIONS="disable_coredump=0:unmap_shadow_on_exit=1:abort_on_error=1"

2: For traces, you want to set up a ring (see the haproxy doc) backed by a file in a RAM-backed directory (typically /dev/shm, e.g. /dev/shm/haproxy-quic), sized to hold at least 1 or 2 minutes’ worth of traces (the size depends mostly on the number of rqps; with 350 rqps that works out to 120MB for 2.5 minutes or so here). This is where the low-level detailed logs will be written, in a circular RAM-backed file. Using disk-based files or stdout will just waste resources, if it even sustains the throughput, so it is definitely not a good idea to handwave that part. In the end, it’ll look something like this:

    ring traces
        format timed
        maxlen 3072
        size 134217728 #128MB
        backing-file /dev/shm/haproxy-quic

Then in the global section:

    traces h3 level developer verbosity minimal sink traces
    traces qmux level developer verbosity minimal sink traces
    traces quic level developer sink traces

3: This one is supposed to be trivial, but alas. First you need to check the core_pattern sysctl.
This is either a command or an output file relative to the execution directory of the HAProxy *processes* (not the root process; so if you use a chroot, it is a path within the chroot). It needs to be writable by those haproxy subprocesses. Then you need to check that ulimit -c (max core size) is big enough for haproxy to dump fully (it’s often set to 0 out of the box), and that systemd doesn’t override it if you use it (it does, out of the box, and you need a systemd unit override like LimitCORE=infinity).

Then with all this, when HAProxy crashes, you’ll have:
1. ASAN logs if applicable, via stdout/stderr (I forget which, and the distinction remains one of the great pains of IT), which you can symbolize using the asan_symbolize.py python script in clang’s source tree
2. Low-level traces in /dev/shm/haproxy-quic.bak (on process restart the old ramfile is mv’d with a .bak suffix)
3. Your coredump in whatever path is in core_pattern

Alas, that is only for crash debugging. For performance there’s no really good solution atm. I just monitor the closest prometheus metric to %Tu (average_total_time_seconds) alongside process uptimes. And how often users complain about issues, but that’s not the most reliable metric… Something like the NEL spec is the more scientific approach to this by far:
https://web.dev/network-error-logging/

Hopefully in the coming days I find time to compile this long mail into a slightly more organized wiki article with snippets etc, but it will do in the meantime.

Tristan

> On 22 Sep 2023, at 16:07, Amaury Denoyelle <[email protected]> wrote:
> On Fri, Sep 22, 2023 at 03:30:58PM +0200, Luke Seelenbinder wrote:
>> If it's any help, here's `show quic full` for a stalled connection:
>> [...]
>
> Tristan has been right, as we saw here fd=-1 meaning that there is
> probably a permission issue for bind() on connection sockets. You should
> have a look at the new setcap directive to fix it.
>
> Thanks to Tristan previous report, we know there is a real performance
> issue when using QUIC listener sockets. We have to investigate this. You
> have probably encounter an occurence of it. In the meantime, it's
> important to ensure you run with connection dedicated socket instead.
>
> The socket fallback has been implemented silently first as on some
> platform it may not be supported at all. We plan to change this soon to
> report a log on the first permission error to improve the current
> situation.
>
> Hope this helps,
>
> --
> Amaury Denoyelle
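
Appendix (sketch): the core dump checks from point 3 of the mail above could be scripted roughly as follows. This is only a minimal illustration, not something from the original mail: the function names core_pattern_kind and core_limit_ok are made up for this example, and the script inspects the limits of the shell it runs in, which may differ from what the haproxy processes actually get (for a running process, /proc/<pid>/limits shows the effective "Max core file size").

```shell
#!/bin/sh
# Sketch: sanity-check the core dump prerequisites (core_pattern, ulimit -c).
# Function names are hypothetical helpers, not part of HAProxy or any tool.

# Classify a core_pattern value: a leading '|' means the kernel pipes the
# core to a helper command; anything else is an output file template,
# either absolute or relative to the crashing process's working directory.
core_pattern_kind() {
    case "$1" in
        '|'*) echo "pipe" ;;
        /*)   echo "absolute-file" ;;
        *)    echo "relative-file" ;;
    esac
}

# Succeed if a `ulimit -c` value allows non-empty core files.
core_limit_ok() {
    [ "$1" = "unlimited" ] || [ "${1:-0}" -gt 0 ] 2>/dev/null
}

pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    pattern=$(cat "$pattern_file")
    echo "core_pattern: $pattern ($(core_pattern_kind "$pattern"))"
fi

limit=$(ulimit -c)
if core_limit_ok "$limit"; then
    echo "ulimit -c: $limit (ok)"
else
    echo "ulimit -c: $limit (core dumps disabled for this shell)"
fi
```

If the pattern is a file template, remember that with a chroot the path is resolved inside the chroot and must be writable by the haproxy subprocesses, as described in the mail.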

