Package: openvpn
Version: 2.6.14-1+deb13u2
Severity: important
Tags: patch upstream fixed-upstream

Hello,

Full disclosure: I used Claude for a lot of this work, including the text
below. That said, I was thorough in the testing, careful to verify the bug
and the fix, and have read through the details below -- but I don't want to
pretend there weren't robots involved.

Summary
-------
OpenVPN's DCO (data-channel offload) userspace code in 2.6.x can desync its
client table from the kernel under a peer-deletion storm: it leaks client
instances until the server reaches --max-clients and refuses new/reconnecting clients, recoverable only by restarting openvpn. This is fixed upstream in 2.7
by commit 7791f535 ("dco: process messages immediately after read", fixes
OpenVPN issue #919), but that fix is not in the 2.6 branch, and trixie ships
2.6.14.

This is the userspace counterpart to the kernel-module crash I reported in
#1140548 (openvpn-dco-dkms). A DCO server needs both fixes to be safe under
load: #1140548 stops the kernel crash; this one stops the userspace
client-table desync. (I confirmed the two are independent -- with the
openvpn-dco-dkms fix applied but stock openvpn, the desync still occurs; with this openvpn fix applied but the stock buggy module, the kernel still crashes.)

To be clear, this is a separate bug from #1140548 in a different source package (openvpn vs openvpn-dco-dkms) -- not a duplicate; both fixes are required and
neither substitutes for the other.

The bug
-------
On a busy DCO server, when many peers are deleted in a short window (e.g. a
network blip that makes a large number of peers hit --keepalive timeout at
nearly the same time), libnl delivers multiple netlink messages per
nl_recvmsgs(), but the 2.6 DCO read path stores each notification into
single-slot dco_context_t fields and processes only the last, silently
dropping the rest. openvpn then never reaps those client instances ->
n_clients is never decremented -> the instances leak. Once enough leak, new
and reconnecting clients are rejected with:

  MULTI: new incoming connection would exceed maximum number of clients

and the only recovery is restarting openvpn. Both the management interface and
the status file report the inflated client count (they read the same
multi_context list).

Reproduced, and the fix tested
------------------------------
Reproduced against the stock Debian package (2.6.14 + openvpn-dco-dkms): 1024
DCO clients, server under heavy CPU load (stress-ng), then drop the server's
inbound VPN packets so all peers time out near-simultaneously. The kernel
deletes all 1024 peers but openvpn is left with a nondeterministic number of
stale instances (one run left 767/1024), and reconnects are then refused.

I built a patched package (2.6.14 + the attached patch) and re-ran the same
load: the desync is gone -- repeated 1024-peer deletion storms under stress-ng
drain cleanly to 0 with no stale instances, reconnect to 1024 with zero
max-clients rejections, no regressions, and it survives reboot.

Patch
-----
Attached is a quilt patch (DEP-3 header). I generated it with quilt against the
current openvpn source package, so it slots in at the end of the existing
debian/patches series, and the resulting package builds and passed all the load testing above -- i.e. it is ready to drop into debian/patches/ and add to the
series as-is. Please note it is an *adaptation* of upstream
7791f535, not a clean cherry-pick: 7791f535 depends on two earlier 2.7 commits that are not in release/2.6 (a699681b, which introduces dco->c, and 7f5a6dea, which introduces c->multi), and the intervening 2.7 DCO/new-module rework means
neither those nor 7791f535 apply to 2.6.14 (they conflict in 5 and 7 files
respectively). So the patch reimplements 7791f535's logic against the old
ovpn-dco API and inlines the two prerequisites as small c/multi back-pointers in dco_context_t. It is therefore worth a real review (back-pointer lifetime,
the server/client dispatch, the lock placement) rather than a verbatim diff.

The option I'd like to offer
----------------------------
To be explicit: the attached patch can be applied as-is now -- you do not need to wait for upstream to ship anything. It's a forwarded delta (see below), it slots into the existing series, and the built package passed the load testing
above. This would fit the same trixie stable-update path as the
openvpn-dco-dkms fix in #1140548.

Also raised upstream
--------------------
I have also posted this to the openvpn-devel mailing list, asking whether they will backport 7791f535 to release/2.6 and in what form (verbatim prerequisites
vs. a minimal adaptation like mine):

https://sourceforge.net/p/openvpn/mailman/openvpn-devel/thread/5afdb852-eabf-4829-b95f-6a322ed5d56a%40midjourney.com/#msg59351167

That thread is for coordination, not a blocker. Upstream may well decline a
release/2.6 backport at all -- 2.6 + the out-of-tree ovpn-dco module is the
older path, and their answer may be "move to 2.7 + the in-tree module." If so, carrying this delta is the practical way to fix 2.6.x users regardless of what
upstream decides; and if they do land an official release/2.6 commit, it can
simply replace this delta later. (If you carry it, the DEP-3 Forwarded: field should point at that mailing-list post once it has an archive URL -- happy to
provide that.)

Full validation details (logs, etc.) available on request.

Thanks,
Thomas Nyberg
Description: dco: process messages immediately after read
 Backport of upstream 7791f5358a5574d4ef1bd27e2d52300c9d98bd72. libnl delivers
 multiple netlink messages per nl_recvmsgs() call, but the DCO read path stored
 each into single-slot dco_context_t fields and processed only the last,
 silently dropping the rest. Under a peer-deletion storm this desyncs openvpn
 client table from the kernel (ghost instances -> max-clients lockout). Process
 each message inside the parse callback, with a lock to avoid re-entrant
 GET_PEER stats requests during parsing. Adapted to 2.6.x (adds c/multi
 backpointers to dco_context_t, which 2.7 has natively).
Origin: backport, https://github.com/OpenVPN/openvpn/commit/7791f5358a5574d4ef1bd27e2d52300c9d98bd72
Bug: https://github.com/OpenVPN/openvpn/issues/919
Forwarded: https://sourceforge.net/p/openvpn/mailman/openvpn-devel/thread/5afdb852-eabf-4829-b95f-6a322ed5d56a%40midjourney.com/#msg59351167
---
Index: openvpn-2.6.14/src/openvpn/dco.h
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/dco.h
+++ openvpn-2.6.14/src/openvpn/dco.h
@@ -134,7 +134,7 @@ void close_tun_dco(struct tuntap *tt, op
  * @param dco       the DCO context
  * @return          0 on success or a negative error code otherwise
  */
-int dco_do_read(dco_context_t *dco);
+int dco_read_and_process(dco_context_t *dco);
 
 /**
  * Install a DCO in the main event loop
@@ -301,7 +301,7 @@ close_tun_dco(struct tuntap *tt, openvpn
 }
 
 static inline int
-dco_do_read(dco_context_t *dco)
+dco_read_and_process(dco_context_t *dco)
 {
     ASSERT(false);
     return 0;
Index: openvpn-2.6.14/src/openvpn/dco_linux.c
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/dco_linux.c
+++ openvpn-2.6.14/src/openvpn/dco_linux.c
@@ -40,6 +40,7 @@
 #include "ssl.h"
 #include "fdmisc.h"
 #include "multi.h"
+#include "forward.h"
 #include "ssl_verify.h"
 
 #include "ovpn_dco_linux.h"
@@ -51,6 +52,15 @@
 #include <netlink/genl/ctrl.h>
 
 
+/* When parsing multiple DEL_PEER notifications, openvpn tries to request stats
+ * for each DEL_PEER message. This triggers a GET_PEER request-reply while we
+ * are still parsing the rest of the initial notifications, which can lead to
+ * NLE_BUSY or even NLE_NOMEM. This basic lock ensures we don't bite our own
+ * tail by issuing a dco_get_peer_stats while still busy receiving and parsing
+ * other messages. (backport of upstream 7791f535)
+ */
+static bool __is_locked = false;
+
 /* libnl < 3.5.0 does not set the NLA_F_NESTED on its own, therefore we
  * have to explicitly do it to prevent the kernel from failing upon
  * parsing of the message
@@ -131,7 +141,9 @@ nla_put_failure:
 static int
 ovpn_nl_recvmsgs(dco_context_t *dco, const char *prefix)
 {
+    __is_locked = true;
     int ret = nl_recvmsgs(dco->nl_sock, dco->nl_cb);
+    __is_locked = false;
 
     switch (ret)
     {
@@ -813,11 +825,22 @@ ovpn_handle_msg(struct nl_msg *msg, void
             return NL_SKIP;
     }
 
+    /* process the message immediately, before the next one parsed in the same
+     * nl_recvmsgs() batch overwrites the single-slot dco fields */
+    if (dco->multi)
+    {
+        multi_process_incoming_dco(dco);
+    }
+    else
+    {
+        process_incoming_dco(dco);
+    }
+
     return NL_OK;
 }
 
 int
-dco_do_read(dco_context_t *dco)
+dco_read_and_process(dco_context_t *dco)
 {
     msg(D_DCO_DEBUG, __func__);
     nl_cb_set(dco->nl_cb, NL_CB_VALID, NL_CB_CUSTOM, ovpn_handle_msg, dco);
@@ -925,6 +948,12 @@ dco_get_peer_stats_multi(dco_context_t *
 {
     msg(D_DCO_DEBUG, "%s", __func__);
 
+    if (__is_locked)
+    {
+        msg(D_DCO_DEBUG, "%s: cannot request peer stats while parsing other messages", __func__);
+        return 0;
+    }
+
     struct nl_msg *nl_msg = ovpn_dco_nlmsg_create(dco, OVPN_CMD_GET_PEER);
 
     nlmsg_hdr(nl_msg)->nlmsg_flags |= NLM_F_DUMP;
@@ -979,6 +1008,12 @@ dco_parse_peer(struct nl_msg *msg, void
 int
 dco_get_peer_stats(struct context *c)
 {
+    if (__is_locked)
+    {
+        msg(D_DCO_DEBUG, "%s: cannot request peer stats while parsing other messages", __func__);
+        return 0;
+    }
+
     uint32_t peer_id = c->c2.tls_multi->dco_peer_id;
     msg(D_DCO_DEBUG, "%s: peer-id %d", __func__, peer_id);
 
Index: openvpn-2.6.14/src/openvpn/dco_linux.h
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/dco_linux.h
+++ openvpn-2.6.14/src/openvpn/dco_linux.h
@@ -36,6 +36,8 @@
 typedef enum ovpn_key_slot dco_key_slot_t;
 typedef enum ovpn_cipher_alg dco_cipher_t;
 
+struct context;
+struct multi_context;
 
 typedef struct
 {
@@ -50,6 +52,11 @@ typedef struct
 
     unsigned int ifindex;
 
+    /* backpointers set before each read so the parse callback can process
+     * a message immediately (backport of upstream 7791f535) */
+    struct context *c;
+    struct multi_context *multi;
+
     int dco_message_type;
     int dco_message_peer_id;
     int dco_del_peer_reason;
Index: openvpn-2.6.14/src/openvpn/forward.c
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/forward.c
+++ openvpn-2.6.14/src/openvpn/forward.c
@@ -1234,13 +1234,11 @@ process_incoming_link(struct context *c)
     perf_pop();
 }
 
-static void
-process_incoming_dco(struct context *c)
+void
+process_incoming_dco(dco_context_t *dco)
 {
 #if defined(ENABLE_DCO) && (defined(TARGET_LINUX) || defined(TARGET_FREEBSD))
-    dco_context_t *dco = &c->c1.tuntap->dco;
-
-    dco_do_read(dco);
+    struct context *c = dco->c;
 
     /* FreeBSD currently sends us removal notifcation with the old peer-id in
      * p2p mode with the ping timeout reason, so ignore that one to not shoot
@@ -2309,7 +2307,9 @@ process_io(struct context *c)
     {
         if (!IS_SIG(c))
         {
-            process_incoming_dco(c);
+            c->c1.tuntap->dco.c = c;
+            c->c1.tuntap->dco.multi = NULL;
+            dco_read_and_process(&c->c1.tuntap->dco);
         }
     }
 }
Index: openvpn-2.6.14/src/openvpn/forward.h
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/forward.h
+++ openvpn-2.6.14/src/openvpn/forward.h
@@ -74,6 +74,13 @@ void pre_select(struct context *c);
 
 void process_io(struct context *c);
 
+/**
+ * Process an incoming DCO message (from kernel space).
+ *
+ * @param dco - Pointer to the structure representing the DCO context.
+ */
+void process_incoming_dco(dco_context_t *dco);
+
 /**********************************************************************/
 /**
  * Process a data channel packet that will be sent through a VPN tunnel.
Index: openvpn-2.6.14/src/openvpn/mtcp.c
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/mtcp.c
+++ openvpn-2.6.14/src/openvpn/mtcp.c
@@ -747,7 +747,9 @@ multi_tcp_process_io(struct multi_contex
             /* incoming data on DCO? */
             else if (e->arg == MTCP_DCO)
             {
-                multi_process_incoming_dco(m);
+                m->top.c1.tuntap->dco.multi = m;
+                m->top.c1.tuntap->dco.c = &m->top;
+                dco_read_and_process(&m->top.c1.tuntap->dco);
             }
 #endif
             /* signal received? */
Index: openvpn-2.6.14/src/openvpn/mudp.c
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/mudp.c
+++ openvpn-2.6.14/src/openvpn/mudp.c
@@ -409,11 +409,9 @@ multi_process_io_udp(struct multi_contex
     {
         if (!IS_SIG(&m->top))
         {
-            bool ret = true;
-            while (ret)
-            {
-                ret = multi_process_incoming_dco(m);
-            }
+            m->top.c1.tuntap->dco.multi = m;
+            m->top.c1.tuntap->dco.c = &m->top;
+            dco_read_and_process(&m->top.c1.tuntap->dco);
         }
     }
 #endif
Index: openvpn-2.6.14/src/openvpn/multi.c
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/multi.c
+++ openvpn-2.6.14/src/openvpn/multi.c
@@ -3275,14 +3275,12 @@ process_incoming_del_peer(struct multi_c
     multi_signal_instance(m, mi, SIGTERM);
 }
 
-bool
-multi_process_incoming_dco(struct multi_context *m)
+void
+multi_process_incoming_dco(dco_context_t *dco)
 {
-    dco_context_t *dco = &m->top.c1.tuntap->dco;
+    ASSERT(dco->multi);
 
-    struct multi_instance *mi = NULL;
-
-    int ret = dco_do_read(&m->top.c1.tuntap->dco);
+    struct multi_context *m = dco->multi;
 
     int peer_id = dco->dco_message_peer_id;
 
@@ -3291,12 +3289,12 @@ multi_process_incoming_dco(struct multi_
      */
     if (peer_id < 0)
     {
-        return ret > 0;
+        return;
     }
 
     if ((peer_id < m->max_clients) && (m->instances[peer_id]))
     {
-        mi = m->instances[peer_id];
+        struct multi_instance *mi = m->instances[peer_id];
         if (dco->dco_message_type == OVPN_CMD_DEL_PEER)
         {
             process_incoming_del_peer(m, mi, dco);
@@ -3325,13 +3323,6 @@ multi_process_incoming_dco(struct multi_
             "type %d, del_peer_reason %d", peer_id, dco->dco_message_type,
             dco->dco_del_peer_reason);
     }
-
-    dco->dco_message_type = 0;
-    dco->dco_message_peer_id = -1;
-    dco->dco_del_peer_reason = -1;
-    dco->dco_read_bytes = 0;
-    dco->dco_write_bytes = 0;
-    return ret > 0;
 }
 #endif /* if defined(ENABLE_DCO) && defined(TARGET_LINUX) */
 
Index: openvpn-2.6.14/src/openvpn/multi.h
===================================================================
--- openvpn-2.6.14.orig/src/openvpn/multi.h
+++ openvpn-2.6.14/src/openvpn/multi.h
@@ -323,7 +323,7 @@ bool multi_process_post(struct multi_con
  *  - True, if the message was received correctly.
  *  - False, if there was an error while reading the message.
  */
-bool multi_process_incoming_dco(struct multi_context *m);
+void multi_process_incoming_dco(dco_context_t *dco);
 
 /**************************************************************************/
 /**

Reply via email to