On 5/16/25 9:05 AM, Eelco Chaudron wrote:
> 
> 
> On 7 May 2025, at 18:26, Ilya Maximets wrote:
> 
>> Currently we're only tracking the last refresh time and perform
>> reconciliation of non-active connections on every refresh.  This is
>> causing issues in large clusters when tunnels are added sequentially.
>> Consider the following example:
>>
>>  1. Tun-1 added -> refresh()
>>     -> Tun-1: adding 'in' and starting 'out'.
>>
>>  2. Tun-2 added -> refresh()
>>     -> Tun-2: adding 'in' and starting 'out'.
>>     -> Tun-1: The other side didn't have time to initiate the 'in'
>>               connection yet, so it is not active.  But we see that
>>               it's not active and try to start it.
>>
>>  3. Tun-3 added -> refresh()
>>     -> Tun-3: adding 'in' and starting 'out'.
>>     -> Tun-2: The other side didn't have time to initiate the 'in'
>>               connection yet, so it is not active.  But we see that
>>               it's not active and try to start it.
>>     -> Tun-1: The connection still had no time to become active, but
>>               we declare it 'defunct' and re-create it.
>>
>> The behavior above is specific to Libreswan 4.  Libreswan 5 will report
>> UP connections as active in most cases, so they will not be marked
>> as defunct, but they will still be started quickly after addition
>> when it is not needed.
>>
>> This creates unnecessary churn in the cluster and puts Libreswan into
>> an uncomfortable position where crossing stream issues (where both
>> sides are trying to establish the same connection at the same time)
>> are far more likely.
>>
>> Fix that by specifically tracking the time when we add or start each
>> connection instead of just the last time we refreshed for any reason.
>> This should make ovs-monitor-ipsec actually wait for the
>> reconciliation interval before attempting to repair connections and
>> give Libreswan a decent amount of time to process the changes and try
>> to establish connections normally.
>>
>> Note: even though we could precisely track 15 seconds for each
>> individual connection and wake up when exactly 15 seconds expire,
>> we're not doing that in this patch.  The reason is that we still
>> need to wake up every 15 seconds to check that all the previously
>> active connections are still active, and doing that allows for
>> refreshing many connections in the same run instead of waking up
>> every second just for one connection.
>>
>> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections periodically.")
>> Reported-at: https://issues.redhat.com/browse/FDP-1364
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> 
> Thanks Ilya, for looking into my suggestion. The patch looks good to me.
> 
> Acked-by: Eelco Chaudron <echau...@redhat.com>
> 

Thanks!  Applied and backported down to 3.2.
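
For readers of the archive, the per-connection tracking described in the
commit message can be sketched roughly as below.  This is only an
illustration and not the code from the patch: the ConnTracker class, its
method names, and the choice to treat a never-seen connection as
repairable are all assumptions made for the example.  Only the 15-second
reconciliation interval comes from the commit message.

```python
import time

# Reconciliation interval from the commit message (15 seconds).
RECONCILIATION_INTERVAL = 15


class ConnTracker:
    """Hypothetical helper; NOT the actual ovs-monitor-ipsec code."""

    def __init__(self, interval=RECONCILIATION_INTERVAL,
                 clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # Injectable so the logic can be tested.
        self.last_touched = {}      # Connection name -> time of add/start.

    def mark_touched(self, conn):
        """Record that 'conn' was just added or started."""
        self.last_touched[conn] = self.clock()

    def forget(self, conn):
        """Stop tracking a connection that was removed."""
        self.last_touched.pop(conn, None)

    def needs_repair(self, conn, is_active):
        """Return True if a non-active 'conn' is old enough to repair."""
        if is_active:
            return False
        touched = self.last_touched.get(conn)
        if touched is None:
            # Assumption: an untracked, non-active connection is fair game.
            return True
        return self.clock() - touched >= self.interval
```

In the three-tunnel example from the commit message, the refresh triggered
by adding Tun-2 would call needs_repair() for Tun-1, get False because
Tun-1 was touched only moments ago, and leave it alone until a later
periodic wake-up finds it still inactive after the full interval.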

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
