On 29 Apr 2025, at 21:22, Ilya Maximets wrote:

> Currently we're only tracking the last refresh time and perform
> reconciliation of non-active connections on every refresh.  This is
> causing issues in large clusters when tunnels are added sequentially.
> Consider the following example:
>
>  1. Tun-1 added -> refresh()
>     -> Tun-1: adding 'in' and starting 'out'.
>
>  2. Tun-2 added -> refresh()
>     -> Tun-2: adding 'in' and starting 'out'.
>     -> Tun-1: The other side didn't have time to initiate the 'in'
>               connection yet, so it is not active.  But we see that
>               it's not active and trying to start it.
>
>  3. Tun-3 added -> refresh()
>     -> Tun-3: adding 'in' and starting 'out'.
>     -> Tun-2: The other side didn't have time to initiate the 'in'
>               connection yet, so it is not active.  But we see that
>               it's not active and trying to start it.
>     -> Tun-1: The connection still had no time to become active, but
>               we declare it 'defunct' and re-creating.
>
> Behavior above is specific to Libreswan 4.  Libreswan 5 will report
> UP connections as active in most cases, so they will not be marked
> as defunct, but they will still be started quickly after addition
> when it is not needed.
>
> This creates unnecessary churn in the cluster and puts Libreswan into
> an uncomfortable position where crossing stream issues (where both
> sides are trying to establish the same connection at the same time)
> are far more likely.
>
> Fix that by specifically tracking time between reconciliation attempts
> instead of just the last time we refreshed for any reason.  This should
> make ovs-monitor-ipsec to actually wait for the reconciliation interval
> before attempting to repair connections and give Libreswan a decent
> amount of time to process the changes and try to establish connections
> normally.

Thanks for the patch, Ilya. It looks technically fine; however, I have
one concern/question.

Since last_reconciliation isn't updated when a new connection is added,
reconciliation could still happen immediately afterward if the interval
has already elapsed, potentially leading to premature teardown. Or did
I miss something?

Cheers,

Eelco

> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections 
> periodically.")
> Reported-at: https://issues.redhat.com/browse/FDP-1364
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>  ipsec/ovs-monitor-ipsec.in | 27 +++++++----
>  tests/system-ipsec.at      | 94 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 112 insertions(+), 9 deletions(-)
>
> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
> index 7309bdf3c..53111a7c8 100755
> --- a/ipsec/ovs-monitor-ipsec.in
> +++ b/ipsec/ovs-monitor-ipsec.in
> @@ -515,7 +515,7 @@ conn prevent_unencrypted_vxlan
>          self.IPSEC_CTL = libreswan_root_prefix + ipsec_ctl
>          self.conf_file = None
>          self.conns_not_active = set()
> -        self.last_refresh = time.time()
> +        self.last_reconciliation = time.time()
>          self.secrets_file = None
>          self.use_default_conn = self.IPSEC_CONF == self.ROOT_IPSEC_CONF
>          self.use_default_crypto = args.use_default_crypto
> @@ -656,6 +656,10 @@ conn prevent_unencrypted_vxlan
>                      set(loaded_conns.keys()) | \
>                      set(active_conns.keys())
>
> +        reconcile = False
> +        if time.time() - self.last_reconciliation >= RECONCILIATION_INTERVAL:
> +            reconcile = True
> +
>          for name in all_names:
>              desired = set(self.get_conn_names(monitor, name))
>              loaded = set(loaded_conns.get(name, dict()).keys())
> @@ -663,11 +667,13 @@ conn prevent_unencrypted_vxlan
>
>              # Untrack connections that became active.
>              self.conns_not_active.difference_update(active)
> -            # Remove connections that didn't become active after --start
> -            # and another explicit --up.
> -            for conn in self.conns_not_active & loaded:
> -                self._delete_ipsec_connection(conn, "is defunct")
> -                loaded.remove(conn)
> +
> +            if reconcile:
> +                # Remove connections that didn't become active after --start
> +                # and another explicit --up.
> +                for conn in self.conns_not_active & loaded:
> +                    self._delete_ipsec_connection(conn, "is defunct")
> +                    loaded.remove(conn)
>
>              # Remove all the loaded or active but not desired connections.
>              for conn in loaded | active:
> @@ -699,7 +705,7 @@ conn prevent_unencrypted_vxlan
>                      else:
>                          vlog.info("Starting ipsec connection %s" % conn)
>                          self._start_ipsec_connection(conn, "start")
> -            else:
> +            elif reconcile:
>                  # Ask pluto to bring UP connections that are loaded,
>                  # but not active for some reason.
>                  #
> @@ -746,7 +752,10 @@ conn prevent_unencrypted_vxlan
>                              "--delete",
>                              "--asynchronous", "prevent_unencrypted_vxlan"])
>              monitor.conf_in_use["skb_mark"] = monitor.conf["skb_mark"]
> -        self.last_refresh = time.time()
> +
> +        if reconcile:
> +            self.last_reconciliation = time.time()
> +
>          vlog.info("Refreshing is done.")
>
>      def get_conns_from_status(self, pattern):
> @@ -809,7 +818,7 @@ conn prevent_unencrypted_vxlan
>          return conns
>
>      def need_to_reconcile(self, monitor):
> -        if time.time() - self.last_refresh < RECONCILIATION_INTERVAL:
> +        if time.time() - self.last_reconciliation < RECONCILIATION_INTERVAL:
>              return False
>
>          conns_dict = self.get_active_conns()
> diff --git a/tests/system-ipsec.at b/tests/system-ipsec.at
> index de3ecd361..5828ad437 100644
> --- a/tests/system-ipsec.at
> +++ b/tests/system-ipsec.at
> @@ -535,6 +535,100 @@ AT_CHECK([grep -q -E "(ike|ikev2|esp)=" 
> $ovs_base/right/custom.conf], [1])
>  OVS_TRAFFIC_VSWITCHD_STOP()
>  AT_CLEANUP
>
> +AT_SETUP([IPsec -- Libreswan - reconciliation interval is respected - ipv4])
> +AT_KEYWORDS([ipsec libreswan ipv4 geneve psk reconciliation])
> +dnl Note: Geneve test may not work on older kernels due to CVE-2020-25645
> +dnl https://bugzilla.redhat.com/show_bug.cgi?id=1883988
> +
> +CHECK_LIBRESWAN()
> +OVS_TRAFFIC_VSWITCHD_START()
> +IPSEC_SETUP_UNDERLAY()
> +
> +m4_define([PSK_OPTIONS], [options:remote_ip=$1 options:psk=swordfish])
> +
> +dnl Set up only the left host.
> +IPSEC_ADD_NODE_LEFT([10.1.1.1], [10.1.1.254])
> +IPSEC_ADD_TUNNEL_LEFT([geneve], PSK_OPTIONS([10.1.1.2]))
> +dnl Wait for the monitor to find the new connection.
> +OVS_WAIT_UNTIL([grep -q 'tun appeared' left/ovs-monitor-ipsec.log])
> +
> +dnl Add a few more tunels waiting for the monitor to find them too.
> +m4_for([id], [1], 5, [1], [
> +  OVS_VSCTL([left], add-port br-ipsec tun-id \
> +                    -- set Interface tun-id type=geneve 
> PSK_OPTIONS(10.1.2.id))
> +  OVS_WAIT_UNTIL(grep -q 'tun-id appeared' left/ovs-monitor-ipsec.log)
> +])
> +dnl And now remove all the extra tunnels.
> +m4_for([id], [1], 5, [1], [
> +  OVS_VSCTL([left], del-port tun-id)
> +  OVS_WAIT_UNTIL(grep -q 'tun-id disappeared' left/ovs-monitor-ipsec.log)
> +])
> +
> +dnl Check that none of the connections were marked for reconciliation yet.
> +dnl It should take at least 15 seconds for this to happen.
> +AT_CHECK([grep -E 'half-loaded|defunct' left/ovs-monitor-ipsec.log], [1])
> +AT_CHECK([grep 'Bringing up' left/ovs-monitor-ipsec.log], [1])
> +dnl But we should eventually attempt to reconcile the original 'tun' tunnel.
> +OVS_WAIT_UNTIL([grep -q 'do not match desired' left/ovs-monitor-ipsec.log])
> +OVS_WAIT_UNTIL([grep -qE 'Bringing up ipsec connection tun-[[io]]' \
> +                         left/ovs-monitor-ipsec.log])
> +
> +dnl Now add the right host and check that connection works properly.
> +IPSEC_ADD_NODE_RIGHT([10.1.1.2], [10.1.1.254])
> +IPSEC_ADD_TUNNEL_RIGHT([geneve], PSK_OPTIONS([10.1.1.1]))
> +
> +CHECK_ESP_TRAFFIC
> +
> +OVS_TRAFFIC_VSWITCHD_STOP()
> +AT_CLEANUP
> +
> +AT_SETUP([IPsec -- Libreswan - reconciliation interval is respected - ipv6])
> +AT_KEYWORDS([ipsec libreswan ipv6 geneve psk reconciliation])
> +dnl Note: Geneve test may not work on older kernels due to CVE-2020-25645
> +dnl https://bugzilla.redhat.com/show_bug.cgi?id=1883988
> +
> +CHECK_LIBRESWAN()
> +OVS_TRAFFIC_VSWITCHD_START()
> +IPSEC_SETUP_UNDERLAY()
> +
> +m4_define([PSK_OPTIONS], [options:remote_ip=$1 options:psk=swordfish])
> +
> +dnl Set up only the left host.
> +IPSEC_ADD_NODE_LEFT([fd01::101], [fd01::254])
> +IPSEC_ADD_TUNNEL_LEFT([geneve], PSK_OPTIONS([fd01::102]))
> +dnl Wait for the monitor to find the new connection.
> +OVS_WAIT_UNTIL([grep -q 'tun appeared' left/ovs-monitor-ipsec.log])
> +
> +dnl Add a few more tunels waiting for the monitor to find them too.
> +m4_for([id], [1], 5, [1], [
> +  OVS_VSCTL([left], add-port br-ipsec tun-id \
> +                    -- set Interface tun-id type=geneve 
> PSK_OPTIONS(fd02::id))
> +  OVS_WAIT_UNTIL(grep -q 'tun-id appeared' left/ovs-monitor-ipsec.log)
> +])
> +dnl And now remove all the extra tunnels.
> +m4_for([id], [1], 5, [1], [
> +  OVS_VSCTL([left], del-port tun-id)
> +  OVS_WAIT_UNTIL(grep -q 'tun-id disappeared' left/ovs-monitor-ipsec.log)
> +])
> +
> +dnl Check that none of the connections were marked for reconciliation yet.
> +dnl It should take at least 15 seconds for this to happen.
> +AT_CHECK([grep -E 'half-loaded|defunct' left/ovs-monitor-ipsec.log], [1])
> +AT_CHECK([grep 'Bringing up' left/ovs-monitor-ipsec.log], [1])
> +dnl But we should eventually attempt to reconcile the original 'tun' tunnel.
> +OVS_WAIT_UNTIL([grep -q 'do not match desired' left/ovs-monitor-ipsec.log])
> +OVS_WAIT_UNTIL([grep -qE 'Bringing up ipsec connection tun-[[io]]' \
> +                         left/ovs-monitor-ipsec.log])
> +
> +dnl Now add the right host and check that connection works properly.
> +IPSEC_ADD_NODE_RIGHT([fd01::102], [fd01::254])
> +IPSEC_ADD_TUNNEL_RIGHT([geneve], PSK_OPTIONS([fd01::101]))
> +
> +CHECK_ESP_TRAFFIC
> +
> +OVS_TRAFFIC_VSWITCHD_STOP()
> +AT_CLEANUP
> +
>  AT_SETUP([IPsec -- Libreswan - established conns survive new additions - 
> ipv4])
>  AT_KEYWORDS([ipsec libreswan ipv4 geneve psk persistence])
>  dnl Note: Geneve test may not work on older kernels due to CVE-2020-25645
> -- 
> 2.49.0

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to