On 7 May 2025, at 18:26, Ilya Maximets wrote:

> Currently we only track the last refresh time and perform
> reconciliation of non-active connections on every refresh.  This is
> causing issues in large clusters when tunnels are added sequentially.
> Consider the following example:
>
>  1. Tun-1 added -> refresh()
>     -> Tun-1: adding 'in' and starting 'out'.
>
>  2. Tun-2 added -> refresh()
>     -> Tun-2: adding 'in' and starting 'out'.
>     -> Tun-1: The other side didn't have time to initiate the 'in'
>               connection yet, so it is not active.  But we see that
>               it's not active and try to start it.
>
>  3. Tun-3 added -> refresh()
>     -> Tun-3: adding 'in' and starting 'out'.
>     -> Tun-2: The other side didn't have time to initiate the 'in'
>               connection yet, so it is not active.  But we see that
>               it's not active and try to start it.
>     -> Tun-1: The connection still had no time to become active, but
>               we declare it 'defunct' and re-create it.
>
> Behavior above is specific to Libreswan 4.  Libreswan 5 will report
> UP connections as active in most cases, so they will not be marked
> as defunct, but they will still be started quickly after addition
> when it is not needed.
>
> This creates unnecessary churn in the cluster and puts Libreswan into
> an uncomfortable position where crossing stream issues (where both
> sides are trying to establish the same connection at the same time)
> are far more likely.
>
> Fix that by specifically tracking the time when we add or start each
> connection instead of just the last time we refreshed for any reason.
> This should make ovs-monitor-ipsec actually wait for the
> reconciliation interval before attempting to repair connections and
> give Libreswan a decent amount of time to process the changes and try
> to establish connections normally.
>
> Note: even though we could precisely track 15 seconds for each
> individual connection and wake up when exactly 15 seconds expire,
> we're not doing that in this patch.  The reason is that we still
> need to wake up every 15 seconds to check that all the previously
> active connections are still active, and doing that allows for
> refreshing many connections in the same run instead of waking up
> every second just for one connection.
>
> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections periodically.")
> Reported-at: https://issues.redhat.com/browse/FDP-1364
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>

Thanks, Ilya, for looking into my suggestion. The patch looks good to me.

Acked-by: Eelco Chaudron <echau...@redhat.com>
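
For readers skimming the thread, here is a minimal sketch of the idea
described in the commit message: each connection remembers when it was
last added or started, and a refresh only retries a non-active
connection once a full reconciliation interval has passed since then.
This is not the code from the patch. The Connection class, the
last_started field and the refresh() helper are hypothetical names
used only for illustration; the 15 second interval is the one
mentioned in the commit message.

```python
RECONCILIATION_INTERVAL = 15  # seconds, as mentioned in the commit message


class Connection:
    """Hypothetical per-connection record, not the real ovs-monitor-ipsec class."""

    def __init__(self, name, now):
        self.name = name
        self.active = False
        # Time when this connection was last added or started.
        self.last_started = now


def refresh(conns, now):
    """Return names of connections that this refresh would (re)start.

    A non-active connection is only retried once it has had a full
    reconciliation interval since it was last added or started.
    """
    restarted = []
    for conn in conns:
        if conn.active:
            continue
        if now - conn.last_started >= RECONCILIATION_INTERVAL:
            conn.last_started = now  # Restart and reset its own timer.
            restarted.append(conn.name)
    return restarted


# Tunnels added one second apart, as in the example from the commit message.
tun1 = Connection("tun-1", now=0)
tun2 = Connection("tun-2", now=1)
conns = [tun1, tun2]

early = refresh(conns, now=2)     # Neither tunnel has waited long enough.
first = refresh(conns, now=15)    # Only tun-1 has waited a full interval.
second = refresh(conns, now=16)   # Now tun-2 has; tun-1 was just restarted.
```

With a single global timestamp, the refresh at now=2 would already
have tried to start tun-1. With per-connection timestamps, that
refresh leaves both tunnels alone. The periodic 15 second wakeup from
the commit message's note would simply call refresh() for all
connections in one pass.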
