[ANNOUNCE] haproxy-3.0.0

Willy Tarreau Wed, 29 May 2024 08:08:46 -0700

Hi,

HAProxy 3.0.0 was released on 2024/05/29. It added 21 new commits
after version 3.0-dev13. I do appreciate that everything was only
cosmetic.


This release contains a total of 1108 patches, 850 of which do not
concern a bug, which makes it the smallest LTS release of all time
(2.6 and 2.4 still remain the largest ones, respectively 65% and 58%
larger). This is good news in terms of expected stability, which might
possibly break the old myth of "better avoid dot zero".

Let's try to summarize what's new in this release. It has been one of the
most difficult for me to summarize because I'm not seeing one big killer
feature, instead it's an LTS as we like them: mostly a nice polishing of
existing stuff and small improvements all over the place as permitted by
the previous version's architectural changes. I tried to classify this
into a few categories, depending on the intended benefits.

First, let's enumerate the new features, and improvements of existing ones:

  - stats can finally be preserved across reloads for frontends,
    listeners, backends and servers. When using this, the config objects
    of the new process are preloaded with the relevant values from a dump
    of the previous process. This essentially concerns counters, ages and
    rates. Please have a look at "stats-file" and "dump stats-file" for
    more information.
  
  - the log outgoing load-balancing now relies on a regular backend,
    meaning that the load balancing algorithms could finally be unified
    with the ones used by other protocols, and servers now support
    weights.
  
  - log-format now supports JSON and CBOR output encoding. In such a case,
    the field name is taken from a new naming scheme that is placed within
    the log-format itself, allowing to assign a name to each field.
  
  - the load balancing algorithm "sticky" that was initially reserved for
    logs was generalized to other protocols.
  
  - the HTTP/2 RST_STREAM reason code can finally be forwarded to the
    server for client aborts. This addresses the problem a few users were
    facing with gRPC where request cancellation appeared as communication
    errors on the server side. For now this is purposely limited to only a
    few reason codes that are relevant to gRPC so that we don't ruin the
    possibility to later extend that to H3 and maybe H1.
  
  - QUIC now supports the HyStart++ (RFC9406) alternative to slowstart
    with the Cubic algorithm. It's supposed to show better recovery
    patterns. It's not yet enabled by default.
  
  - a new set of converters, map_*_key, will report the matching part of
    the key itself instead of the associated pattern. The main target use
    cases for this are to know what address mask an address did match, or
    what regex a pattern did match.
  
  - the "uuid()" sample fetch function, which takes an optional version as
    an argument, now also supports "7" for UUIDv7. These UUIDs regroup many
    properties found in ULID and other mechanisms, one of the most
    interesting ones being time-based locality that, for example, eases the
    archiving of old data, or the grouping of events on systems where
    they'll be processed together.
  
  - the name associated with servers in connection pools can now be
    overridden by the expression in "pool-conn-name" when SNI is not
    desired (useful with rhttp without SSL for example, but may also make
    sense when reaching remote servers over SSL tunnels). It also allows
    to entirely drop SSL from the server.
  
  - the "namespace" argument now works for "bind" and "server" lines using
    UNIX sockets.
  
  - Linux capabilities: the use of namespaces on the server side used to
    require capability "cap_sys_admin" but it was neither checked nor
    reported on startup so it would silently fail. The capability is now
    supported and is being checked for. Similarly, the need for
    capabilities for transparent proxying or QUIC is checked and reported
    on startup. Finally, file-system capabilities set on the executable are
    also supported now.
  
  - the set-mark/set-tos actions were extended to support an expression in
    addition to the constant, and were extended to also support the backend
    side. This can for example be used to select an outgoing link from a
    single IP address. The new backend actions are called "set-bc-mark" and
    "set-bc-tos", and by analogy new frontend actions called "set-fc-mark"
    and "set-fc-tos" were created, and the old actions are aliases of these
    last ones.
  
  - QUIC built with latest AWS-LC TLS library now correctly supports 0-RTT.
  
  - a new global setting "ssl-security-level" allows to adjust OpenSSL's
    internal security level between 0 and 5. Previously it could only be
    done in openssl.cnf.
  
  - the key used by consistent hash to map to a server used to always be
    the server's id (either explicit or implicit, position-based), but
    that was not always convenient when dealing with servers being quickly
    added and removed within a large fleet of LBs. Now the "hash-key"
    directive will also allow to use the server's address or address+port
    for this so that the same key ends up on the same server for all LBs.
  
  - The HTTP client now has an option to use either origin or absolute
    URIs. This should make it easier to configure it to talk to old
    servers which are not spec-compliant and do not support absolute
    URIs. The ocsp_update agent already exploits this ability via a new
    setting "ocsp-update.httpproxy".
  
  - it is now possible to suppress Content-Length and Transfer-Encoding
    headers from HTTP/1 requests and responses. It must never be done of
    course but there are rare situations where users dealing with bogus
    clients or servers need to perform such cleanups. Most of the time
    when done, this will mark a connection non-reusable and it will be
    closed at the end of the transfer.
  
  - the proxy protocol now also parses TLV for LOCAL mode and supports
    sending them without a stream so that elements can be passed during
    the preconnect phase of a reverse-HTTP instance to a next stage that
    will no longer ignore them.
  
  - the new sched_setaffinity() of FreeBSD 14 and newer is now supported.
  
  - the new certificate selection callback for WolfSSL is now enabled
    since it's finally available in the upstream project.
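
To make the "hash-key" item above a bit more concrete, here is a minimal
sketch of why keying the ring on the server's address helps a fleet of
LBs agree. It is only an illustration: the ring, the SHA-256 based hash,
the number of points and the addresses are all assumptions made for the
example, and none of this is HAProxy's actual hashing code.

```python
import bisect
import hashlib


def _h(value: str) -> int:
    """Stable 64-bit hash (SHA-256 based; NOT the function HAProxy uses)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(servers, key_of, points=64):
    """Place `points` virtual nodes per server, derived from key_of(server)."""
    ring = []
    for srv in servers:
        for i in range(points):
            ring.append((_h(f"{key_of(srv)}#{i}"), srv["addr"]))
    ring.sort()
    return ring


def lookup(ring, request_key: str) -> str:
    """Return the address of the first node at or after the key's position."""
    pos = bisect.bisect_left(ring, (_h(request_key), ""))
    return ring[pos % len(ring)][1]


def by_id(srv):
    return str(srv["id"])


def by_addr(srv):
    return srv["addr"]


# Two LBs know the same servers but learned them in a different order, so
# the implicit position-based ids differ (hypothetical addresses).
lb1 = [{"id": 1, "addr": "10.0.0.1:80"}, {"id": 2, "addr": "10.0.0.2:80"},
       {"id": 3, "addr": "10.0.0.3:80"}]
lb2 = [{"id": 1, "addr": "10.0.0.3:80"}, {"id": 2, "addr": "10.0.0.1:80"},
       {"id": 3, "addr": "10.0.0.2:80"}]

keys = [f"user-{n}" for n in range(200)]

ring1_id, ring2_id = build_ring(lb1, by_id), build_ring(lb2, by_id)
ring1_addr, ring2_addr = build_ring(lb1, by_addr), build_ring(lb2, by_addr)

# Count the keys for which both LBs pick the same server.
agree_id = sum(lookup(ring1_id, k) == lookup(ring2_id, k) for k in keys)
agree_addr = sum(lookup(ring1_addr, k) == lookup(ring2_addr, k) for k in keys)

print("same server on both LBs, keyed by id:  ", agree_id, "of", len(keys))
print("same server on both LBs, keyed by addr:", agree_addr, "of", len(keys))
```

In this toy setup every id designates a different address on each LB, so
the two LBs never agree when the ring is keyed by id, while they always
agree when it is keyed by address.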

Second, there was a reasonable set of usability improvements, all the
small features that make config management and day-to-day operations
easier:

  - maps are often used to operate at run time on some parts of the
    configuration. When no initial value is desired, it was still needed
    to have an empty file (/dev/null is not usable since a map is indexed
    by its name). As such, some users have expressed their desire to have
    virtual and/or optional maps. Both are brought by this version. When
    a map is loaded from a file whose name begins with "opt@", the file
    will only be loaded if it exists, otherwise an empty map will be created
    with this name. And maps whose name begins with "virt@" are exclusively
    virtual and never backed by a file. They're always created empty at
    boot, for use at run time.
  
  - the default certificate selection method was improved: till now, the
    default certificate was the first one mentioned on the bind line. This
    causes issues with sites that want to support both RSA and ECDSA. A
    new approach was brought, with an optional "default-crt" keyword that
    designates the default certs on the bind line, and its equivalent in
    the crt-list files designated by "*" in the name. This allows the right
    cert to be picked based on the desired algorithm. Of course the default
    behavior doesn't change.
  
  - the list of status codes that increment the http_err_cnt and
    http_fail_cnt counters can now be changed with the global directives
    "http-err-codes" and "http-fail-codes". This has long been requested,
    both by those whose applications randomly return 500 that are not
    server failures, and those where 404s happen a lot and do not
    necessarily indicate a URL scanner. All of the 1xx-5xx range is
    permitted for both classes.
        
  - cookies, both static and dynamic, are now permitted for dynamically
    added servers.
  
  - API clients will find the CLI more friendly when it comes to removing
    a server. First, idle connections are now automatically closed when
    trying to delete a server, so that it's no longer needed to wait for
    them to vanish. Second, a new "wait" command pauses operations for at
    most as long as specified, optionally waiting for a condition. A new
    such condition is "srv-removable", which checks when a server may
    safely be removed. This means that issuing this "wait" command before
    a "del server" command will save the client from having to
    periodically retry the operation.
  
  - a new "crt-store" configuration section is supported. It allows to
    declare certificates by specifying the path for each element. The aim
    is essentially to decorrelate the storage from the instantiation, both
    of which are currently correlated in crt-lists, and to allow easier
    specification of individual components. This section supports
    "crt-base" and "key-base" to ease the splitting of certificates and
    keys into distinct directories, as well as "ocsp-update" to indicate
    which certificates need to have their OCSP part periodically updated.
    The certificates also support aliases so that they can be referenced
    from a bind line with more convenient names than a file name.
    crt-lists may now make use of these certificates to only decide which
    ones to instantiate for a given listener, without having to deal with
    deployment concerns such as paths and file names.
  
  - the "thread-hard-limit" global parameter was added. It allows to only
    set a hard limit on the number of threads without enforcing that value
    as the thread count (like nbthread does). This is convenient to
    prepare portable configs with no more than X threads when one knows
    it's only a waste of resources to use more.
  
  - certain warnings about the presence of HTTP rules in TCP frontends
    that are going to be upgraded to HTTP when switching to a backend will
    now no longer be reported when it is certain that they will work as
    expected.
  
  - a new "guid" keyword was added for servers, listeners and proxies.
    The purpose will be to make it possible for external APIs to assign a
    globally unique object identifier to each of them in stats dumps or
    CLI accesses, and to later reliably recognize a server upon reloads.
    One usage example right now is stats preservation across reloads where
    this GUID uniquely identifies a server between two configs.
  
  - it has become easier to pass extra CFLAGS / LDFLAGS to the Makefile,
    just pass them into these variables (and a few other ones). Many were
    removed as the result of the simplification. The removed ones will
    trigger a build warning indicating what to use instead. A warning will
    also be emitted when passing an unknown USE_* setting, and such
    settings can now be set to zero to disable them.
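
The "opt@" and "virt@" map prefixes described above boil down to a small
decision on the map's name at load time. The sketch below models only
that decision in Python; the function name, the returned dict and the
simplified "key value" file parsing are assumptions for illustration and
say nothing about how HAProxy implements it internally.

```python
import os
import tempfile


def load_map(name: str) -> dict:
    """Return the initial content of a map according to its name prefix."""
    if name.startswith("virt@"):
        return {}                       # purely virtual: never backed by a file
    optional = name.startswith("opt@")
    path = name[len("opt@"):] if optional else name
    if optional and not os.path.exists(path):
        return {}                       # optional and missing: start empty
    entries = {}
    with open(path) as f:               # a missing regular file raises here
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                # skip blank lines and comments
            parts = line.split(None, 1)
            entries[parts[0]] = parts[1] if len(parts) > 1 else ""
    return entries


# Small demonstration with a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "hosts.map")
    with open(existing, "w") as f:
        f.write("# host -> backend\nexample.com be_app\n")
    missing = os.path.join(tmp, "absent.map")

    from_file = load_map(existing)
    from_opt_existing = load_map("opt@" + existing)
    from_opt_missing = load_map("opt@" + missing)
    from_virt = load_map("virt@runtime-hosts")
    try:
        load_map(missing)
        regular_missing_failed = False
    except FileNotFoundError:
        regular_missing_failed = True
```

Only the regular (unprefixed) name fails when the file is absent; the two
new prefixes always end up with a usable, possibly empty, map.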

In addition to this, some changes aim at improving the reliability:

  - the draining of HTTP/1 request body was finally implemented. It is
    needed when an early response is sent before the end of a POST
    request, typically due to a redirect or authentication issue. It used
    to cause difficulties due to the TCP stack emitting an RST that would
    sometimes destroy the response before it had a chance to be sent, but
    this is now something of the past.
  
  - the buffer allocator's behavior on out-of-memory condition was finally
    fixed. It had been flaky since version 1.7, with possibilities for all
    requesters to deadlock if none had enough room to complete their work.
    A new, more robust algorithm was finally implemented, making sure that
    at least one requester has enough resources to make forward progress
    and let the system recover by itself.

Other ones put a particular focus on robustness against various threats in
general:

  - H2, H3 and QUIC now maintain a counter of per-connection glitches,
    which are characterized by not strictly illegal but suspicious or
    bogus protocol handling and behavior from a peer. Such counters are
    reported at upper layers, are trackable in stick-tables, and can be
    used to kill a misbehaving connection past a threshold. The goal here
    is to significantly reduce the CPU impact and log pollution caused by
    bots that blindly try to exploit various well-known vulnerabilities or
    limitations of some implementations. Since this works on both sides it
    can also be used to detect faulty applications that would need to be
    fixed.
  
  - H2 now supports forcefully closing connections after a configurable
    number of streams. This can be used to accelerate the switchover during
    reloads, as well as maintain an optimal balance between multiple front
    nodes, and force the re-evaluation of sanity checks at the connection
    level regarding tracked metrics to more easily get rid of abusers.
  
  - two new global settings now make it possible to simply prevent HAProxy
    from accepting traffic from privileged ports; one setting is for TCP
    and the other one for QUIC. QUIC is configured by default to refuse
    such traffic, because by relying on UDP it's particularly exposed to
    DNS and NTP amplification attacks, and while it's more efficient to
    filter such ports upstream, it's still very simple and cheap to just
    drop such undesirable packets before processing them.
  
  - the code no longer depends on libsystemd, so that we will not pull in
    a myriad of questionable dependencies anymore. This also allows to
    enable USE_SYSTEMD by default (it's only done on linux-glibc though),
    thus reducing configuration combinations.
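
The glitch counter idea above (count suspicious but not strictly illegal
events per connection, and kill the connection past a threshold) can be
sketched in a few lines. This is only a toy model: the class, the
function and the threshold value are hypothetical and do not reflect
HAProxy's internal structures or configuration keywords.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """Toy connection tracking suspicious-but-legal protocol events."""
    glitches: int = 0
    closed: bool = False


def report_glitch(conn: Connection, threshold: int) -> bool:
    """Count one glitch and close the connection once past the threshold.

    Returns True only for the call that closes the connection.
    """
    if conn.closed:
        return False
    conn.glitches += 1
    if conn.glitches > threshold:
        conn.closed = True
        return True
    return False


# A peer producing 5 glitches against a threshold of 3 (assumed value).
conn = Connection()
closed_at = None
for event in range(1, 6):
    if report_glitch(conn, threshold=3):
        closed_at = event

print("closed after glitch number:", closed_at)
```

Since the counter lives with the connection, a well-behaved client on a
fresh connection starts again from zero.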

As with every version comes a comprehensive collection of performance
improvements:

  - quic: the fast-forwarding mechanism now considers the flow control
    state, resulting in a reduction of the number of wakeups and better
    filling of packets. The internal send API was reworked and simplified
    and one buffer copy could be removed. Some minor fixes and cleanups
    were done in the cubic congestion controller.
  
  - a new QUIC setting, "tune.quic.reorder-ratio" was added to let the
    user adjust the size of holes over the in-flight window before we
    declare a loss. Normally QUIC users should observe much better
    performance now, even with the default setting (50%), which was
    sufficient for us to observe x10-20 at 3% losses. The send path was
    improved and cleaned up, by using exclusively sendmsg() and avoiding
    some copies where possible. Some CPU savings are expected on intense
    workloads.
  
  - the H1 mux now also supports zero-copy forwarding for chunks of unknown
    size (i.e. those larger than a buffer).
  
  - the fast forward zero-copy mechanism is now supported by applets. This
    will ultimately result in lower memory usage and higher performance
    for some applets such as the cache by carefully avoiding to queue more
    data than the mux can take without buffering. This can still be
    disabled by unsetting tune.cache.zero-copy-forwarding.
  
  - a few ebtree backports improved the performance on non-x86 machines
    (typically ~2% faster string lookups were measured on ARM and ~3%
    task switching rate was measured).
  
  - some of the remaining server name lookups that were still linear moved
    to use the tree instead, speeding up certain operations or config
    parsing.
        
  - ring: the ring internal API used to represent a bottleneck for traces
    and TCP logs, especially on multi-threaded systems due to the initially
    unplanned locking that resulted from the underlying buffer API. All of
    this was entirely rewritten so that the code is almost lockfree and
    waiting threads can prepare their work as groups in parallel. The
    performance increased by a factor of 2.5 on NUMA systems and even by
    20 on uniform systems, reaching up to around 7 million messages per
    second. This is sufficient to enable traces at the "developer" level
    even on moderately loaded systems. The "haring" utility was updated to
    automatically detect the new, slightly different format and support
    both the old and the new ones (the old haring tool will still read the
    new format in repair mode).
  
  - stick-tables are now sharded over multiple tree heads each with their
    own locks. This significantly reduces locking contention on systems
    with many threads (gains of ~6x measured on an 80-thread system). In
    addition, the locking could be reduced even with low thread counts,
    particularly when using peers, where the performance could be doubled.
    This is particularly noticeable when using the bandwidth limiting
    filter "bwlim".
  
  - The Lua latency with single-threaded scripts (loaded by "lua-load")
    running on multi-thread instances was improved a lot by reducing the
    amount of consecutive instructions a thread may run when there are
    many threads.
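
The stick-table sharding mentioned above follows a classical pattern:
split one big locked structure into several smaller ones, each with its
own lock, and pick the shard from the key. Here is a minimal sketch of
that pattern, assuming plain dicts instead of trees and a CRC32 of the
key to select the shard; the class and its methods are hypothetical and
are not HAProxy's code.

```python
import threading
import zlib


class ShardedCounterTable:
    """Counters spread over several shards, each protected by its own lock."""

    def __init__(self, shards: int = 8):
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key: str):
        # The same key always lands in the same shard.
        return self._shards[zlib.crc32(key.encode()) % len(self._shards)]

    def incr(self, key: str, by: int = 1) -> int:
        table, lock = self._shard(key)
        with lock:                      # only this shard is locked
            table[key] = table.get(key, 0) + by
            return table[key]

    def get(self, key: str) -> int:
        table, lock = self._shard(key)
        with lock:
            return table.get(key, 0)


table = ShardedCounterTable(shards=4)


def worker():
    # Each thread spreads 1000 increments over 16 keys.
    for n in range(1000):
        table.incr(f"192.0.2.{n % 16}")


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(table.get(f"192.0.2.{i}") for i in range(16))
print("total increments recorded:", total)
```

Threads touching keys in different shards never wait for each other,
which is where the reduced contention comes from.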

A few changes that improve observability:

  - a few more sample fetches corresponding to certain log-format aliases
    were added (txn.redispatched, bc_be_queue, bc_srv_queue, etc).
  
  - new sample fetch functions retrieve the number of concurrent streams
    over the same connection for a frontend or a backend, as well as the
    maximum number negotiated. This can be useful to sort out connection
    performance from stream performance when looking at timings in logs.
  
  - the Prometheus exporter now exposes a bunch of new metrics (resolvers,
    more server stuff) and supports applying filters to limit the metrics
    that have to be returned.

Some debugging aids to save experts' time in the field, speed up recovery and
reduce the number of round trips in issues:

  - stick-table operations over the CLI using commands like "show table",
    "set table" and "clear table" now support a "ptr" argument to directly
    use the pointer retrieved from a previous "show" command. This is
    convenient to remove bogus entries manually for example.
  
  - haproxy -dD will now report suspicious ACL pattern values which look
    like known ACL/sample fetch keywords.
  
  - the "insecure-fork-wanted" option now has an equivalent on the command
    line, "-dI". It's convenient to obtain decoded ASAN outputs for
    example, without having to edit a config.
  
  - QUIC and HTTP/3 added some traces, refined some error reporting, and
    improved the accuracy of the "show quic" output.
  
  - the backend equivalent of the frontend keylog mechanism was
    implemented, so that it is now possible to decipher TLS captures on
    the backend side. The log-format to be used becomes a bit large,
    please refer to the example in the doc.
  
  - some internal large memory areas (file descriptor tables, HTTP and SSL
    session caches, ring buffers etc) now have a name that is visible on
    Linux >= 5.17 in /proc/$pid/maps or using pmap. This will help figure
    out where the memory is being used and why.
  
  - traces are way faster on multi-threaded systems thanks to the ring
    locking changes, making them usable without risks on moderately loaded
    systems.
 
Some possibly (but unlikely) breaking changes:

  - an update of the DeviceAtlas addon was made to support the new version
    of the library. It slightly changes the build system but so far no issue
    was reported.
  
  - a mistake I accidentally introduced two years ago with a bug fix had
    the undesired side effect of randomly accepting chained commands on
    the CLI in non-interactive mode, when delimited by line feeds. The
    likelihood that it would work is essentially time-based, so a short
    string of multiple commands had great chances of working while a large
    one almost none. This started to cause side effects to other issues and
    had to be fixed, so that we no longer accept multiple commands delimited
    by '\n' in non-interactive mode, as documented. If you happen to have
    such scripts sending multiple commands this way, you may have to fix
    them (either use the semi-colon ';' to delimit the commands, or switch
    to interactive mode via the "prompt" command). A warning is emitted when
    this unreliable behavior is detected, to ease detection of faulty
    scripts.
  
  - the "enabled" server keyword used to be silently ignored when adding a
    dynamic server. Now it's properly rejected to avoid confusing scripts.
  
  - the way the memory limitation specified by "-m" on the command line
    was handled on Linux using RLIMIT_AS got completely useless over time
    due to much more fragmented memory spaces on 64-bit platforms, ASLR,
    and the fact that it had been chosen exclusively to avoid
    underestimating the allocated buffers' cost, which originally were
    allocated all the time even when empty. Nowadays this is no longer
    relevant since buffers are only allocated when used, and the current
    state had the nasty effect of causing OOMs way below the configured
    limit, rendering it pretty useless. The use of RLIMIT_AS has now been
    dropped in favor of the more reliable RLIMIT_DATA like on other
    operating systems.
  
  - the "namespace" keyword used to be silently ignored on "bind" and
    "server" lines using UNIX sockets. Now it is properly used and
    checked, thus it may fail if it references an invalid value. If the
    previous configuration used to work, it probably means the keyword was
    not needed. In addition, the presence of the keyword on a "server"
    line may also cause a boot failure that was previously only detected
    at run time, if permissions are insufficient. There's no loss of
    functionality here, only a check performed earlier to ensure the
    process boots in a properly working state.
  
  - the HTTP/1 URI parser no longer accepts invalid origin-form URIs that
    start neither with a '/' nor a '*' (e.g. "index.html" without leading
    slash). Even if some servers would still accept that, clients that
    would be compatible with this disappeared well over a decade ago,
    and continuing to support this for such broken applications would
    probably lead to an abuse sooner or later, so better put an end to
    this now.
  
  - a workaround for an issue affecting QUIC on LibreSSL when running on
    non-x86 machines was developed jointly with the LibreSSL team. There's
    an issue with the CHACHA20_POLY1305 cipher when used in-place (for
    QUIC) that has been well identified and will be fixed in version 4.0
    of LibreSSL. The workaround consists in making the QUIC connection
    fail fast so that the client can quickly retry using TCP. We'll
    disable it once a stable LibreSSL version is out with the fix. A
    config-based workaround consists in forcing the ciphers and excluding
    this one.
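
For scripts affected by the CLI change above (multiple commands delimited
by '\n' in non-interactive mode), one possible fix is to join the
commands with ';' before sending them. The sketch below shows the idea
in Python. The helper names and the socket path in the usage note are
hypothetical, and the joining is deliberately simplistic: it rejects
commands that themselves contain a ';' rather than trying to escape them.

```python
import socket


def to_single_line(script: str) -> str:
    """Join newline-delimited CLI commands with ';' and end with one LF."""
    commands = [line.strip() for line in script.splitlines()]
    commands = [c for c in commands if c]
    for c in commands:
        if ";" in c:
            raise ValueError(f"command contains ';', handle it by hand: {c!r}")
    return "; ".join(commands) + "\n"


def send_commands(socket_path: str, script: str) -> str:
    """Send the joined commands to a UNIX stats socket, return the output."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(to_single_line(script).encode())
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:                # peer closed: response is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode()


old_script = "show info\nshow stat\n"
payload = to_single_line(old_script)
print(repr(payload))
```

With a running process, send_commands("/path/to/haproxy.sock", old_script)
would then deliver both commands on a single line; the path is only a
placeholder for wherever the stats socket is configured.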

And we even found some room to improve the code's maintainability and
clarity, which will hopefully further lower the barrier to contribution:

  - applet: most of the internal API rework was done, which simplifies the
    upper layers and the applet code as well (for those that were
    converted). New applet code will have its own buffers and even less
    stuff to care about. This is also true for the CLI keyword handlers
    which can now be written in a more natural way and may now yield even
    when not blocked.
  
  - a significant part of the internal "shutdown" API was cleaned up so
    that there is now only one function at each layer instead of one per
    direction. Not only did this eliminate very old legacy code ported
    over the years, it also made it possible to forward gRPC
    cancellations.
  
  - prometheus: a new registration mechanism was added to permit to
    register metrics per module (e.g. stick-tables, resolvers etc). The
    extra counters are also dumped if requested now (frontend, backend,
    listener, server).

I'm fairly certain that I forgot a few things. As usual, I'm told that my
coworkers at HAProxyTech also went through this tedious task of enumerating
the changes, and it will be posted soon here:

  https://www.haproxy.com/blog/announcing-haproxy-3-0

My understanding is that there will be some followups with a focus on
selected points. I'm not surprised by the difficulty of the exercise
this time ;-)

For this version, we got increased help from various testers who
accepted to run one (or a few) servers with the development version, and
who were able to report a few problems with accurate version ranges, as
well as traces and info that permitted to fix the issues quickly. It
worked amazingly well and allowed us to address some nasty bugs that are
fairly hard to reproduce and that were present for several versions
already. At the risk of repeating myself, thanks for that! I know that
operating a -dev version requires a bit more involvement than a stable one
but it's also a win-win: when something doesn't please you, it's not too
late to suggest a change, and you can benefit from the latest debugging
features and performance improvements. I sincerely hope that this success
will encourage other users into that direction. The nice benefit for the
user of facing a bug in -dev vs -stable is that we have no problem
developing new debugging extensions just for that issue, so a git pull is
enough to suddenly make the problem much more observable and require less
work to filter data than with a stable version. And something
that's human is that developers tend to be much more attracted by issues
affecting areas that are still fresh in their heads and will tend to treat
them with higher priority.

I also noticed more exchanges from various participants on the issues
and here on the list, so big thanks as well to those who take time to
review other users' problem reports and requests for help. Especially
for first-time reporters, it gives them a great experience of the
project and its community.

As usual with a new major release comes the death of an old one. This time
it's 2.0 that passed away after 5 years serving as a transition between
the old legacy versions and the newer HTX-enabled ones. I'm fairly sure
there are still some here and there, so please consider this as a reminder
that it's about time to upgrade. And 2.4 turned to critical fixes only
status.

On a side note (not very funny but surprising), apparently there was a big
GitHub outage last night, and this morning we're getting a "Ooops 500" page
on the haproxy repository there: https://github.com/haproxy/haproxy
The issues seem to be working, the wiki and docs projects as well. So I
suspect that an error page got cached during the outage and continues to
be delivered for whatever reason. I opened a ticket to their support and
we'll see when we get a response. Fortunately we're not completely blocked,
but it feels strange to release on a day of outage. After all, that's a
form of resilience that also makes one use a load balancer, so there's
some logic there.

Speaking of resilience, I'm going to take a bit of vacation next week and
the week after (maybe I should have postponed given the heavy rain here),
but you're in good hands with the rest of the team, and Christopher is
back on Monday, fresh and in full force. Maybe you'll even manage to
convince him to emit -dev1 himself, who knows :-)

Please find the usual URLs below :
   Site index       : https://www.haproxy.org/
   Documentation    : https://docs.haproxy.org/
   Wiki             : https://github.com/haproxy/wiki/wiki
   Discourse        : https://discourse.haproxy.org/
   Slack channel    : https://slack.haproxy.org/
   Issue tracker    : https://github.com/haproxy/haproxy/issues
   Sources          : https://www.haproxy.org/download/3.0/src/
   Git repository   : https://git.haproxy.org/git/haproxy-3.0.git/
   Git Web browsing : https://git.haproxy.org/?p=haproxy-3.0.git
   Changelog        : https://www.haproxy.org/download/3.0/src/CHANGELOG
   Dataplane API    : https://github.com/haproxytech/dataplaneapi/releases/latest
   Pending bugs     : https://www.haproxy.org/l/pending-bugs
   Reviewed bugs    : https://www.haproxy.org/l/reviewed-bugs
   Code reports     : https://www.haproxy.org/l/code-reports
   Latest builds    : https://www.haproxy.org/l/dev-packages

I verified what I had in mind for 3.0 and 3.1-dev0 (that just opened),
and I think all is good (Tim already fixed an incorrect color on the
docs index). As usual, if (should I say when?) you detect a broken link,
just let me know so I can fix it.

Have fun!
Willy
---
Complete changelog from 3.0-dev13:
Amaury Denoyelle (2):
      DOC: streamline http-reuse and connection naming definition
      REGTESTS: complete http-reuse test with pool-conn-name

Aurelien DARRAGON (3):
      MINOR: log: rename 'log-format tag' to 'log-format alias'
      DOC: config: document logformat item naming and typecasting features
      DOC: config: add %ID logformat alias alternative

Valentine Krasnobaeva (3):
      CLEANUP: ssl/ocsp: readable ifdef in ssl_sock_load_ocsp
      BUG/MINOR: ssl/ocsp: init callback func ptr as NULL
      BUG/MINOR: activity: fix Delta_calls and Delta_bytes count

William Lallemand (2):
      MINOR: sample: implement the uptime sample fetch
      CI: github: upgrade the WolfSSL job to 5.7.0

Willy Tarreau (11):
      CI: scripts: fix build of vtest regarding option -C
      CI: scripts: build vtest using multiple CPUs
      BUILD: makefile: yearly reordering of objects by build time
      BUILD: fd: errno is also needed without poll()
      DOC: config: fix two typos "RST_STEAM" vs "RST_STREAM"
      DOC: config: refer to the non-deprecated keywords in ocsp-update on/off
      CLEANUP: ssl_sock: move dirty openssl-1.0.2 wrapper to openssl-compat
      DOC: install: update quick build reminders with some missing options
      DOC: install: update the range of tested openssl version to cover 3.3
      DEV: patchbot: prepare for new version 3.1-dev
      MINOR: version: mention that it's 3.0 LTS now.
