r On 29 Jan 2026, at 15:09, Ilya Maximets wrote:
> There are crossing stream cases where one of the connections can only > have established IKE SA, but no Child SA. In this case the connection > is not really active and may never become active being stuck in a > strange configuration. For example: > > node-2: > #80: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; IKE SA #72; > #80: "tun-20-in-1" [email protected] [email protected] Traffic: ... > #87: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA > #86; > #87: "tun-20-in-1" [email protected] [email protected] Traffic: ... > #72: "tun-20-out-1":500 ESTABLISHED_IKE_SA; idle; > #86: "tun-20-out-1":500 ESTABLISHED_IKE_SA; newest; idle; > > node-20: > #73: "tun-2-out-1":500 ESTABLISHED_IKE_SA; newest; idle; > #74: "tun-2-out-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA > #73; > #74: "tun-2-out-1" [email protected] [email protected] Traffic: ... > > In the case above we ended up with two Child SAs on tun-20-in-1, and > zero on the tun-20-out-1. At the same time tun-20-out-1 has two > separate IKE SAs somehow. > > The reason why we end up in situations like this is beyond the scope > of ovs-monitor-ipsec daemon, but we can detect such cases and > potentially recover the connection. > > Today when looking for active connections, our regex is looking for > any type of SA, i.e., any of the lines above. However, the connection > is not actually established until there is an established Child SA. > So, let's look for those instead. The ESTABLISHED_CHILD_SA lines > changed a few times in different versions of Libreswan, so they are > not reliable, but the 'esp' lines that list the actual SPI values > didn't really change through the history and can be used for matching. > > Changing the regex to look for 'esp' lines instead. This also helps > with unification of the tunnels/show output, as it currently shows > a random matched line, and it will always show the 'esp' lines now > instead, which was the most common case before anyway. > > Changing the meaning of the 'active' connection however requires > some updates for the reconciliation logic. Namely, we can no longer > delete "half-loaded" connections, as that would mean tearing down > already fully established connections, which creates a lot of churn > and extra problems for libreswan re-negotiating them. May also > result in us actively destroying connections in a way that pluto will > not be able to start them all back before we delete them again. > > So, re-working the reconciliation logic as well to no longer delete > the fully established connections (ones with established Child SAs). > This actually allows us to remove a few hacks form the code and make > the reconciliation logic more straightforward. > > Testing of the change shows improved resilience to crossing stream > problems at scale. > > The check for the Libreswan version in the system test has to go > up a bit, unfortunately, as now we're waiting for all the Child SAs > to be established and that may not happen in time on buggy versions > of Libreswan v4. > > Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections > periodically.") > Reported-at: https://github.com/openvswitch/ovs-issues/issues/374 > Signed-off-by: Ilya Maximets <[email protected]> Thanks Ilya, these changes also look good to me. One small nit below to be fixed on commit. Acked-by: Eelco Chaudron <[email protected]> > --- > ipsec/ovs-monitor-ipsec.in | 83 +++++++++++++++++--------------------- > tests/system-ipsec.at | 32 +++++++-------- > 2 files changed, 53 insertions(+), 62 deletions(-) > > diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in > index fa4613b56..70d3e5330 100755 > --- a/ipsec/ovs-monitor-ipsec.in > +++ b/ipsec/ovs-monitor-ipsec.in > @@ -673,57 +673,47 @@ conn prevent_unencrypted_vxlan > self.conns_not_active["added"].pop(conn, None) > self.conns_not_active["started"].pop(conn, None) > > - # Remove connections that didn't become active after --start > - # or an explicit --up. > - for conn in list(loaded): > + # Remove all the loaded or active but not desired connections. > + for conn in (loaded | active) - desired: > + self._delete_ipsec_connection(conn, "is outdated") > + loaded.discard(conn) > + active.discard(conn) > + > + # Check all the loaded but not active connections. > + for conn in loaded - active: > + # Remove connections that didn't become active after --start > + # or an explicit --up. > started = self.conns_not_active["started"].get(conn, None) > > if started and self.conn_should_reconcile(started): > self._delete_ipsec_connection(conn, "is defunct") > - loaded.remove(conn) > - > - # Remove all the loaded or active but not desired connections. > - for conn in loaded | active: > - if conn not in desired: > - self._delete_ipsec_connection(conn, "is outdated") > - loaded.discard(conn) > - active.discard(conn) > - > - # If not all desired are loaded, remove all the loaded and > - # active for this tunnel and re-load only the desired ones. > - # Need to do that, because connections for the same tunnel > - # may share SAs. If one is loaded and the other is not, > - # it means the second one failed, so the shared SA may be in > - # a broken state. > - if desired != loaded: > - for conn in loaded | active: > - self._delete_ipsec_connection(conn, "is half-loaded") > loaded.discard(conn) > - active.discard(conn) > - > - for conn in desired: > - # Start (add + up) outgoing connections and only add > - # incoming ones. If the other side will not initiate > - # the connection and it will not become active, we'll > - # bring it up during the next refresh. > - if re.match(r".*-in-\d+$", conn): > - vlog.info("Adding ipsec connection %s" % conn) > - self._start_ipsec_connection(conn, "add") > - else: > - vlog.info("Starting ipsec connection %s" % conn) > - self._start_ipsec_connection(conn, "start") > - else: > - # Ask pluto to bring UP connections that are loaded, > - # but not active for some reason. > - # > - # desired == loaded and desired >= loaded + active, > - # so loaded >= active > - for conn in loaded - active: > - added = self.conns_not_active["added"].get(conn, None) > > - if added and self.conn_should_reconcile(added): > - vlog.info("Bringing up ipsec connection %s" % conn) > - self._start_ipsec_connection(conn, "up") > + # Bring UP connections that were loaded but didn't become > + # active on their own. > + added = self.conns_not_active["added"].get(conn, None) > + > + if added and self.conn_should_reconcile(added): > + vlog.info("Bringing up ipsec connection %s" % conn) > + self._start_ipsec_connection(conn, "up") > + > + # Check for connections that were active bafore, but no > longer. > + if not added and not started: bafore->before [...] _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
