On 4 Feb 2026, at 21:44, Ilya Maximets wrote:
> On 2/2/26 12:25 PM, Eelco Chaudron wrote:
>> r
>>
>> On 29 Jan 2026, at 15:09, Ilya Maximets wrote:
>>
>>> There are crossing stream cases where one of the connections can only
>>> have established IKE SA, but no Child SA. In this case the connection
>>> is not really active and may never become active being stuck in a
>>> strange configuration. For example:
>>>
>>> node-2:
>>> #80: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; IKE SA #72;
>>> #80: "tun-20-in-1" [email protected] [email protected] Traffic: ...
>>> #87: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA
>>> #86;
>>> #87: "tun-20-in-1" [email protected] [email protected] Traffic:
>>> ...
>>> #72: "tun-20-out-1":500 ESTABLISHED_IKE_SA; idle;
>>> #86: "tun-20-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>>>
>>> node-20:
>>> #73: "tun-2-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>>> #74: "tun-2-out-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA
>>> #73;
>>> #74: "tun-2-out-1" [email protected] [email protected] Traffic:
>>> ...
>>>
>>> In the case above we ended up with two Child SAs on tun-20-in-1, and
>>> zero on the tun-20-out-1. At the same time tun-20-out-1 has two
>>> separate IKE SAs somehow.
>>>
>>> The reason why we end up in situations like this is beyond the scope
>>> of ovs-monitor-ipsec daemon, but we can detect such cases and
>>> potentially recover the connection.
>>>
>>> Today when looking for active connections, our regex is looking for
>>> any type of SA, i.e., any of the lines above. However, the connection
>>> is not actually established until there is an established Child SA.
>>> So, let's look for those instead. The ESTABLISHED_CHILD_SA lines
>>> changed a few times in different versions of Libreswan, so they are
>>> not reliable, but the 'esp' lines that list the actual SPI values
>>> didn't really change through the history and can be used for matching.
>>>
>>> Changing the regex to look for 'esp' lines instead. This also helps
>>> with unification of the tunnels/show output, as it currently shows
>>> a random matched line, and it will always show the 'esp' lines now
>>> instead, which was the most common case before anyway.
>>>
>>> Changing the meaning of the 'active' connection however requires
>>> some updates for the reconciliation logic. Namely, we can no longer
>>> delete "half-loaded" connections, as that would mean tearing down
>>> already fully established connections, which creates a lot of churn
>>> and extra problems for libreswan re-negotiating them. May also
>>> result in us actively destroying connections in a way that pluto will
>>> not be able to start them all back before we delete them again.
>>>
>>> So, re-working the reconciliation logic as well to no longer delete
>>> the fully established connections (ones with established Child SAs).
>>> This actually allows us to remove a few hacks form the code and make
>>> the reconciliation logic more straightforward.
>>>
>>> Testing of the change shows improved resilience to crossing stream
>>> problems at scale.
>>>
>>> The check for the Libreswan version in the system test has to go
>>> up a bit, unfortunately, as now we're waiting for all the Child SAs
>>> to be established and that may not happen in time on buggy versions
>>> of Libreswan v4.
>>>
>>> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections
>>> periodically.")
>>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/374
>>> Signed-off-by: Ilya Maximets <[email protected]>
>>
>> Thanks Ilya, these changes also look good to me. One small nit below to be
>> fixed on commit.
>>
>> Acked-by: Eelco Chaudron <[email protected]>
>>
>>> ---
>>> ipsec/ovs-monitor-ipsec.in | 83 +++++++++++++++++---------------------
>>> tests/system-ipsec.at | 32 +++++++--------
>>> 2 files changed, 53 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
>>> index fa4613b56..70d3e5330 100755
>>> --- a/ipsec/ovs-monitor-ipsec.in
>>> +++ b/ipsec/ovs-monitor-ipsec.in
>>> @@ -673,57 +673,47 @@ conn prevent_unencrypted_vxlan
>>> self.conns_not_active["added"].pop(conn, None)
>>> self.conns_not_active["started"].pop(conn, None)
>>>
>>> - # Remove connections that didn't become active after --start
>>> - # or an explicit --up.
>>> - for conn in list(loaded):
>>> + # Remove all the loaded or active but not desired connections.
>>> + for conn in (loaded | active) - desired:
>>> + self._delete_ipsec_connection(conn, "is outdated")
>>> + loaded.discard(conn)
>>> + active.discard(conn)
>>> +
>>> + # Check all the loaded but not active connections.
>>> + for conn in loaded - active:
>>> + # Remove connections that didn't become active after
>>> --start
>>> + # or an explicit --up.
>>> started = self.conns_not_active["started"].get(conn, None)
>>>
>>> if started and self.conn_should_reconcile(started):
>>> self._delete_ipsec_connection(conn, "is defunct")
>>> - loaded.remove(conn)
>>> -
>>> - # Remove all the loaded or active but not desired connections.
>>> - for conn in loaded | active:
>>> - if conn not in desired:
>>> - self._delete_ipsec_connection(conn, "is outdated")
>>> - loaded.discard(conn)
>>> - active.discard(conn)
>>> -
>>> - # If not all desired are loaded, remove all the loaded and
>>> - # active for this tunnel and re-load only the desired ones.
>>> - # Need to do that, because connections for the same tunnel
>>> - # may share SAs. If one is loaded and the other is not,
>>> - # it means the second one failed, so the shared SA may be in
>>> - # a broken state.
>>> - if desired != loaded:
>>> - for conn in loaded | active:
>>> - self._delete_ipsec_connection(conn, "is half-loaded")
>>> loaded.discard(conn)
>>> - active.discard(conn)
>>> -
>>> - for conn in desired:
>>> - # Start (add + up) outgoing connections and only add
>>> - # incoming ones. If the other side will not initiate
>>> - # the connection and it will not become active, we'll
>>> - # bring it up during the next refresh.
>>> - if re.match(r".*-in-\d+$", conn):
>>> - vlog.info("Adding ipsec connection %s" % conn)
>>> - self._start_ipsec_connection(conn, "add")
>>> - else:
>>> - vlog.info("Starting ipsec connection %s" % conn)
>>> - self._start_ipsec_connection(conn, "start")
>>> - else:
>>> - # Ask pluto to bring UP connections that are loaded,
>>> - # but not active for some reason.
>>> - #
>>> - # desired == loaded and desired >= loaded + active,
>>> - # so loaded >= active
>>> - for conn in loaded - active:
>>> - added = self.conns_not_active["added"].get(conn, None)
>>>
>>> - if added and self.conn_should_reconcile(added):
>>> - vlog.info("Bringing up ipsec connection %s" % conn)
>>> - self._start_ipsec_connection(conn, "up")
>>> + # Bring UP connections that were loaded but didn't become
>>> + # active on their own.
>>> + added = self.conns_not_active["added"].get(conn, None)
>>> +
>>> + if added and self.conn_should_reconcile(added):
>>> + vlog.info("Bringing up ipsec connection %s" % conn)
>>> + self._start_ipsec_connection(conn, "up")
>>> +
>>> + # Check for connections that were active bafore, but no
>>> longer.
>>> + if not added and not started:
>>
>> bafore->before
>
> Hrm. Sorry, I thought I fixed this before applying, but apparently I didn't.
> :(
> I can send a follow up, or we can just keep it until we need to touch this
> code
> again. WDYT?
This should be fine for now; we can fix it when we need another change.
//Eelco
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev