On 2/2/26 12:25 PM, Eelco Chaudron wrote: > r > > On 29 Jan 2026, at 15:09, Ilya Maximets wrote: > >> There are crossing stream cases where one of the connections can only >> have established IKE SA, but no Child SA. In this case the connection >> is not really active and may never become active being stuck in a >> strange configuration. For example: >> >> node-2: >> #80: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; IKE SA #72; >> #80: "tun-20-in-1" [email protected] [email protected] Traffic: ... >> #87: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA >> #86; >> #87: "tun-20-in-1" [email protected] [email protected] Traffic: ... >> #72: "tun-20-out-1":500 ESTABLISHED_IKE_SA; idle; >> #86: "tun-20-out-1":500 ESTABLISHED_IKE_SA; newest; idle; >> >> node-20: >> #73: "tun-2-out-1":500 ESTABLISHED_IKE_SA; newest; idle; >> #74: "tun-2-out-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA >> #73; >> #74: "tun-2-out-1" [email protected] [email protected] Traffic: ... >> >> In the case above we ended up with two Child SAs on tun-20-in-1, and >> zero on the tun-20-out-1. At the same time tun-20-out-1 has two >> separate IKE SAs somehow. >> >> The reason why we end up in situations like this is beyond the scope >> of ovs-monitor-ipsec daemon, but we can detect such cases and >> potentially recover the connection. >> >> Today when looking for active connections, our regex is looking for >> any type of SA, i.e., any of the lines above. However, the connection >> is not actually established until there is an established Child SA. >> So, let's look for those instead. The ESTABLISHED_CHILD_SA lines >> changed a few times in different versions of Libreswan, so they are >> not reliable, but the 'esp' lines that list the actual SPI values >> didn't really change through the history and can be used for matching. >> >> Changing the regex to look for 'esp' lines instead. This also helps >> with unification of the tunnels/show output, as it currently shows >> a random matched line, and it will always show the 'esp' lines now >> instead, which was the most common case before anyway. >> >> Changing the meaning of the 'active' connection however requires >> some updates for the reconciliation logic. Namely, we can no longer >> delete "half-loaded" connections, as that would mean tearing down >> already fully established connections, which creates a lot of churn >> and extra problems for libreswan re-negotiating them. May also >> result in us actively destroying connections in a way that pluto will >> not be able to start them all back before we delete them again. >> >> So, re-working the reconciliation logic as well to no longer delete >> the fully established connections (ones with established Child SAs). >> This actually allows us to remove a few hacks form the code and make >> the reconciliation logic more straightforward. >> >> Testing of the change shows improved resilience to crossing stream >> problems at scale. >> >> The check for the Libreswan version in the system test has to go >> up a bit, unfortunately, as now we're waiting for all the Child SAs >> to be established and that may not happen in time on buggy versions >> of Libreswan v4. >> >> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections >> periodically.") >> Reported-at: https://github.com/openvswitch/ovs-issues/issues/374 >> Signed-off-by: Ilya Maximets <[email protected]> > > Thanks Ilya, these changes also look good to me. One small nit below to be > fixed on commit. > > Acked-by: Eelco Chaudron <[email protected]> > >> --- >> ipsec/ovs-monitor-ipsec.in | 83 +++++++++++++++++--------------------- >> tests/system-ipsec.at | 32 +++++++-------- >> 2 files changed, 53 insertions(+), 62 deletions(-) >> >> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in >> index fa4613b56..70d3e5330 100755 >> --- a/ipsec/ovs-monitor-ipsec.in >> +++ b/ipsec/ovs-monitor-ipsec.in >> @@ -673,57 +673,47 @@ conn prevent_unencrypted_vxlan >> self.conns_not_active["added"].pop(conn, None) >> self.conns_not_active["started"].pop(conn, None) >> >> - # Remove connections that didn't become active after --start >> - # or an explicit --up. >> - for conn in list(loaded): >> + # Remove all the loaded or active but not desired connections. >> + for conn in (loaded | active) - desired: >> + self._delete_ipsec_connection(conn, "is outdated") >> + loaded.discard(conn) >> + active.discard(conn) >> + >> + # Check all the loaded but not active connections. >> + for conn in loaded - active: >> + # Remove connections that didn't become active after --start >> + # or an explicit --up. >> started = self.conns_not_active["started"].get(conn, None) >> >> if started and self.conn_should_reconcile(started): >> self._delete_ipsec_connection(conn, "is defunct") >> - loaded.remove(conn) >> - >> - # Remove all the loaded or active but not desired connections. >> - for conn in loaded | active: >> - if conn not in desired: >> - self._delete_ipsec_connection(conn, "is outdated") >> - loaded.discard(conn) >> - active.discard(conn) >> - >> - # If not all desired are loaded, remove all the loaded and >> - # active for this tunnel and re-load only the desired ones. >> - # Need to do that, because connections for the same tunnel >> - # may share SAs. If one is loaded and the other is not, >> - # it means the second one failed, so the shared SA may be in >> - # a broken state. >> - if desired != loaded: >> - for conn in loaded | active: >> - self._delete_ipsec_connection(conn, "is half-loaded") >> loaded.discard(conn) >> - active.discard(conn) >> - >> - for conn in desired: >> - # Start (add + up) outgoing connections and only add >> - # incoming ones. If the other side will not initiate >> - # the connection and it will not become active, we'll >> - # bring it up during the next refresh. >> - if re.match(r".*-in-\d+$", conn): >> - vlog.info("Adding ipsec connection %s" % conn) >> - self._start_ipsec_connection(conn, "add") >> - else: >> - vlog.info("Starting ipsec connection %s" % conn) >> - self._start_ipsec_connection(conn, "start") >> - else: >> - # Ask pluto to bring UP connections that are loaded, >> - # but not active for some reason. >> - # >> - # desired == loaded and desired >= loaded + active, >> - # so loaded >= active >> - for conn in loaded - active: >> - added = self.conns_not_active["added"].get(conn, None) >> >> - if added and self.conn_should_reconcile(added): >> - vlog.info("Bringing up ipsec connection %s" % conn) >> - self._start_ipsec_connection(conn, "up") >> + # Bring UP connections that were loaded but didn't become >> + # active on their own. >> + added = self.conns_not_active["added"].get(conn, None) >> + >> + if added and self.conn_should_reconcile(added): >> + vlog.info("Bringing up ipsec connection %s" % conn) >> + self._start_ipsec_connection(conn, "up") >> + >> + # Check for connections that were active bafore, but no >> longer. >> + if not added and not started: > > bafore->before
Hrm. Sorry, I thought I fixed this before applying, but apparently I didn't. :( I can send a follow up, or we can just keep it until we need to touch this code again. WDYT? Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
