r

On 29 Jan 2026, at 15:09, Ilya Maximets wrote:

> There are crossing stream cases where one of the connections can only
> have established IKE SA, but no Child SA.  In this case the connection
> is not really active and may never become active being stuck in a
> strange configuration.  For example:
>
>  node-2:
>  #80: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; IKE SA #72;
>  #80: "tun-20-in-1" [email protected] [email protected] Traffic: ...
>  #87: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA 
> #86;
>  #87: "tun-20-in-1" [email protected] [email protected] Traffic: ...
>  #72: "tun-20-out-1":500 ESTABLISHED_IKE_SA; idle;
>  #86: "tun-20-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>
>  node-20:
>  #73: "tun-2-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>  #74: "tun-2-out-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA 
> #73;
>  #74: "tun-2-out-1" [email protected] [email protected] Traffic: ...
>
> In the case above we ended up with two Child SAs on tun-20-in-1, and
> zero on the tun-20-out-1.  At the same time tun-20-out-1 has two
> separate IKE SAs somehow.
>
> The reason why we end up in situations like this is beyond the scope
> of ovs-monitor-ipsec daemon, but we can detect such cases and
> potentially recover the connection.
>
> Today when looking for active connections, our regex is looking for
> any type of SA, i.e., any of the lines above.  However, the connection
> is not actually established until there is an established Child SA.
> So, let's look for those instead.  The ESTABLISHED_CHILD_SA lines
> changed a few times in different versions of Libreswan, so they are
> not reliable, but the 'esp' lines that list the actual SPI values
> didn't really change through the history and can be used for matching.
>
> Changing the regex to look for 'esp' lines instead.  This also helps
> with unification of the tunnels/show output, as it currently shows
> a random matched line, and it will always show the 'esp' lines now
> instead, which was the most common case before anyway.
>
> Changing the meaning of the 'active' connection however requires
> some updates for the reconciliation logic.  Namely, we can no longer
> delete "half-loaded" connections, as that would mean tearing down
> already fully established connections, which creates a lot of churn
> and extra problems for libreswan re-negotiating them.  May also
> result in us actively destroying connections in a way that pluto will
> not be able to start them all back before we delete them again.
>
> So, re-working the reconciliation logic as well to no longer delete
> the fully established connections (ones with established Child SAs).
> This actually allows us to remove a few hacks form the code and make
> the reconciliation logic more straightforward.
>
> Testing of the change shows improved resilience to crossing stream
> problems at scale.
>
> The check for the Libreswan version in the system test has to go
> up a bit, unfortunately, as now we're waiting for all the Child SAs
> to be established and that may not happen in time on buggy versions
> of Libreswan v4.
>
> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections 
> periodically.")
> Reported-at: https://github.com/openvswitch/ovs-issues/issues/374
> Signed-off-by: Ilya Maximets <[email protected]>

Thanks Ilya, these changes also look good to me. One small nit below to be 
fixed on commit.

Acked-by: Eelco Chaudron <[email protected]>

> ---
>  ipsec/ovs-monitor-ipsec.in | 83 +++++++++++++++++---------------------
>  tests/system-ipsec.at      | 32 +++++++--------
>  2 files changed, 53 insertions(+), 62 deletions(-)
>
> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
> index fa4613b56..70d3e5330 100755
> --- a/ipsec/ovs-monitor-ipsec.in
> +++ b/ipsec/ovs-monitor-ipsec.in
> @@ -673,57 +673,47 @@ conn prevent_unencrypted_vxlan
>                  self.conns_not_active["added"].pop(conn, None)
>                  self.conns_not_active["started"].pop(conn, None)
>
> -            # Remove connections that didn't become active after --start
> -            # or an explicit --up.
> -            for conn in list(loaded):
> +            # Remove all the loaded or active but not desired connections.
> +            for conn in (loaded | active) - desired:
> +                self._delete_ipsec_connection(conn, "is outdated")
> +                loaded.discard(conn)
> +                active.discard(conn)
> +
> +            # Check all the loaded but not active connections.
> +            for conn in loaded - active:
> +                # Remove connections that didn't become active after --start
> +                # or an explicit --up.
>                  started = self.conns_not_active["started"].get(conn, None)
>
>                  if started and self.conn_should_reconcile(started):
>                      self._delete_ipsec_connection(conn, "is defunct")
> -                    loaded.remove(conn)
> -
> -            # Remove all the loaded or active but not desired connections.
> -            for conn in loaded | active:
> -                if conn not in desired:
> -                    self._delete_ipsec_connection(conn, "is outdated")
> -                    loaded.discard(conn)
> -                    active.discard(conn)
> -
> -            # If not all desired are loaded, remove all the loaded and
> -            # active for this tunnel and re-load only the desired ones.
> -            # Need to do that, because connections for the same tunnel
> -            # may share SAs.  If one is loaded and the other is not,
> -            # it means the second one failed, so the shared SA may be in
> -            # a broken state.
> -            if desired != loaded:
> -                for conn in loaded | active:
> -                    self._delete_ipsec_connection(conn, "is half-loaded")
>                      loaded.discard(conn)
> -                    active.discard(conn)
> -
> -                for conn in desired:
> -                    # Start (add + up) outgoing connections and only add
> -                    # incoming ones.  If the other side will not initiate
> -                    # the connection and it will not become active, we'll
> -                    # bring it up during the next refresh.
> -                    if re.match(r".*-in-\d+$", conn):
> -                        vlog.info("Adding ipsec connection %s" % conn)
> -                        self._start_ipsec_connection(conn, "add")
> -                    else:
> -                        vlog.info("Starting ipsec connection %s" % conn)
> -                        self._start_ipsec_connection(conn, "start")
> -            else:
> -                # Ask pluto to bring UP connections that are loaded,
> -                # but not active for some reason.
> -                #
> -                # desired == loaded and desired >= loaded + active,
> -                # so loaded >= active
> -                for conn in loaded - active:
> -                    added = self.conns_not_active["added"].get(conn, None)
>
> -                    if added and self.conn_should_reconcile(added):
> -                        vlog.info("Bringing up ipsec connection %s" % conn)
> -                        self._start_ipsec_connection(conn, "up")
> +                # Bring UP connections that were loaded but didn't become
> +                # active on their own.
> +                added = self.conns_not_active["added"].get(conn, None)
> +
> +                if added and self.conn_should_reconcile(added):
> +                    vlog.info("Bringing up ipsec connection %s" % conn)
> +                    self._start_ipsec_connection(conn, "up")
> +
> +                # Check for connections that were active bafore, but no 
> longer.
> +                if not added and not started:

bafore->before

[...]

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to