On 4 Feb 2026, at 21:44, Ilya Maximets wrote:

> On 2/2/26 12:25 PM, Eelco Chaudron wrote:
>> r
>>
>> On 29 Jan 2026, at 15:09, Ilya Maximets wrote:
>>
>>> There are crossing stream cases where one of the connections can only
>>> have established IKE SA, but no Child SA.  In this case the connection
>>> is not really active and may never become active being stuck in a
>>> strange configuration.  For example:
>>>
>>>  node-2:
>>>  #80: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; IKE SA #72;
>>>  #80: "tun-20-in-1" [email protected] [email protected] Traffic: ...
>>>  #87: "tun-20-in-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA 
>>> #86;
>>>  #87: "tun-20-in-1" [email protected] [email protected] Traffic: 
>>> ...
>>>  #72: "tun-20-out-1":500 ESTABLISHED_IKE_SA; idle;
>>>  #86: "tun-20-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>>>
>>>  node-20:
>>>  #73: "tun-2-out-1":500 ESTABLISHED_IKE_SA; newest; idle;
>>>  #74: "tun-2-out-1":500 ESTABLISHED_CHILD_SA; newest; eroute owner; IKE SA 
>>> #73;
>>>  #74: "tun-2-out-1" [email protected] [email protected] Traffic: 
>>> ...
>>>
>>> In the case above we ended up with two Child SAs on tun-20-in-1, and
>>> zero on the tun-20-out-1.  At the same time tun-20-out-1 has two
>>> separate IKE SAs somehow.
>>>
>>> The reason why we end up in situations like this is beyond the scope
>>> of ovs-monitor-ipsec daemon, but we can detect such cases and
>>> potentially recover the connection.
>>>
>>> Today when looking for active connections, our regex is looking for
>>> any type of SA, i.e., any of the lines above.  However, the connection
>>> is not actually established until there is an established Child SA.
>>> So, let's look for those instead.  The ESTABLISHED_CHILD_SA lines
>>> changed a few times in different versions of Libreswan, so they are
>>> not reliable, but the 'esp' lines that list the actual SPI values
>>> didn't really change through the history and can be used for matching.
>>>
>>> Changing the regex to look for 'esp' lines instead.  This also helps
>>> with unification of the tunnels/show output, as it currently shows
>>> a random matched line, and it will always show the 'esp' lines now
>>> instead, which was the most common case before anyway.
>>>
>>> Changing the meaning of the 'active' connection however requires
>>> some updates for the reconciliation logic.  Namely, we can no longer
>>> delete "half-loaded" connections, as that would mean tearing down
>>> already fully established connections, which creates a lot of churn
>>> and extra problems for libreswan re-negotiating them.  May also
>>> result in us actively destroying connections in a way that pluto will
>>> not be able to start them all back before we delete them again.
>>>
>>> So, re-working the reconciliation logic as well to no longer delete
>>> the fully established connections (ones with established Child SAs).
>>> This actually allows us to remove a few hacks form the code and make
>>> the reconciliation logic more straightforward.
>>>
>>> Testing of the change shows improved resilience to crossing stream
>>> problems at scale.
>>>
>>> The check for the Libreswan version in the system test has to go
>>> up a bit, unfortunately, as now we're waiting for all the Child SAs
>>> to be established and that may not happen in time on buggy versions
>>> of Libreswan v4.
>>>
>>> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections 
>>> periodically.")
>>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/374
>>> Signed-off-by: Ilya Maximets <[email protected]>
>>
>> Thanks Ilya, these changes also look good to me. One small nit below to be 
>> fixed on commit.
>>
>> Acked-by: Eelco Chaudron <[email protected]>
>>
>>> ---
>>>  ipsec/ovs-monitor-ipsec.in | 83 +++++++++++++++++---------------------
>>>  tests/system-ipsec.at      | 32 +++++++--------
>>>  2 files changed, 53 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
>>> index fa4613b56..70d3e5330 100755
>>> --- a/ipsec/ovs-monitor-ipsec.in
>>> +++ b/ipsec/ovs-monitor-ipsec.in
>>> @@ -673,57 +673,47 @@ conn prevent_unencrypted_vxlan
>>>                  self.conns_not_active["added"].pop(conn, None)
>>>                  self.conns_not_active["started"].pop(conn, None)
>>>
>>> -            # Remove connections that didn't become active after --start
>>> -            # or an explicit --up.
>>> -            for conn in list(loaded):
>>> +            # Remove all the loaded or active but not desired connections.
>>> +            for conn in (loaded | active) - desired:
>>> +                self._delete_ipsec_connection(conn, "is outdated")
>>> +                loaded.discard(conn)
>>> +                active.discard(conn)
>>> +
>>> +            # Check all the loaded but not active connections.
>>> +            for conn in loaded - active:
>>> +                # Remove connections that didn't become active after 
>>> --start
>>> +                # or an explicit --up.
>>>                  started = self.conns_not_active["started"].get(conn, None)
>>>
>>>                  if started and self.conn_should_reconcile(started):
>>>                      self._delete_ipsec_connection(conn, "is defunct")
>>> -                    loaded.remove(conn)
>>> -
>>> -            # Remove all the loaded or active but not desired connections.
>>> -            for conn in loaded | active:
>>> -                if conn not in desired:
>>> -                    self._delete_ipsec_connection(conn, "is outdated")
>>> -                    loaded.discard(conn)
>>> -                    active.discard(conn)
>>> -
>>> -            # If not all desired are loaded, remove all the loaded and
>>> -            # active for this tunnel and re-load only the desired ones.
>>> -            # Need to do that, because connections for the same tunnel
>>> -            # may share SAs.  If one is loaded and the other is not,
>>> -            # it means the second one failed, so the shared SA may be in
>>> -            # a broken state.
>>> -            if desired != loaded:
>>> -                for conn in loaded | active:
>>> -                    self._delete_ipsec_connection(conn, "is half-loaded")
>>>                      loaded.discard(conn)
>>> -                    active.discard(conn)
>>> -
>>> -                for conn in desired:
>>> -                    # Start (add + up) outgoing connections and only add
>>> -                    # incoming ones.  If the other side will not initiate
>>> -                    # the connection and it will not become active, we'll
>>> -                    # bring it up during the next refresh.
>>> -                    if re.match(r".*-in-\d+$", conn):
>>> -                        vlog.info("Adding ipsec connection %s" % conn)
>>> -                        self._start_ipsec_connection(conn, "add")
>>> -                    else:
>>> -                        vlog.info("Starting ipsec connection %s" % conn)
>>> -                        self._start_ipsec_connection(conn, "start")
>>> -            else:
>>> -                # Ask pluto to bring UP connections that are loaded,
>>> -                # but not active for some reason.
>>> -                #
>>> -                # desired == loaded and desired >= loaded + active,
>>> -                # so loaded >= active
>>> -                for conn in loaded - active:
>>> -                    added = self.conns_not_active["added"].get(conn, None)
>>>
>>> -                    if added and self.conn_should_reconcile(added):
>>> -                        vlog.info("Bringing up ipsec connection %s" % conn)
>>> -                        self._start_ipsec_connection(conn, "up")
>>> +                # Bring UP connections that were loaded but didn't become
>>> +                # active on their own.
>>> +                added = self.conns_not_active["added"].get(conn, None)
>>> +
>>> +                if added and self.conn_should_reconcile(added):
>>> +                    vlog.info("Bringing up ipsec connection %s" % conn)
>>> +                    self._start_ipsec_connection(conn, "up")
>>> +
>>> +                # Check for connections that were active bafore, but no 
>>> longer.
>>> +                if not added and not started:
>>
>> bafore->before
>
> Hrm.  Sorry, I thought I fixed this before applying, but apparently I didn't. 
> :(
> I can send a follow up, or we can just keep it until we need to touch this 
> code
> again.  WDYT?

This should be fine for now; we can fix it when we need another change.

//Eelco

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to