On 5/16/25 9:05 AM, Eelco Chaudron wrote:
>
>
> On 7 May 2025, at 18:26, Ilya Maximets wrote:
>
>> Currently we're only tracking the last refresh time and perform
>> reconciliation of non-active connections on every refresh.  This is
>> causing issues in large clusters when tunnels are added sequentially.
>> Consider the following example:
>>
>>  1. Tun-1 added -> refresh()
>>     -> Tun-1: adding 'in' and starting 'out'.
>>
>>  2. Tun-2 added -> refresh()
>>     -> Tun-2: adding 'in' and starting 'out'.
>>     -> Tun-1: The other side didn't have time to initiate the 'in'
>>               connection yet, so it is not active.  But we see that
>>               it's not active and trying to start it.
>>
>>  3. Tun-3 added -> refresh()
>>     -> Tun-3: adding 'in' and starting 'out'.
>>     -> Tun-2: The other side didn't have time to initiate the 'in'
>>               connection yet, so it is not active.  But we see that
>>               it's not active and trying to start it.
>>     -> Tun-1: The connection still had no time to become active, but
>>               we declare it 'defunct' and re-creating.
>>
>> Behavior above is specific to Libreswan 4.  Libreswan 5 will report
>> UP connections as active in most cases, so they will not be marked
>> as defunct, but they will still be started quickly after addition
>> when it is not needed.
>>
>> This creates unnecessary churn in the cluster and puts Libreswan into
>> an uncomfortable position where crossing stream issues (where both
>> sides are trying to establish the same connection at the same time)
>> are far more likely.
>>
>> Fix that by specifically tracking time when we add or start each
>> connection instead of just the last time we refreshed for any reason.
>> This should make ovs-monitor-ipsec to actually wait for the
>> reconciliation interval before attempting to repair connections and
>> give Libreswan a decent amount of time to process the changes and try
>> to establish connections normally.
>>
>> Note: even though we could precisely track 15 seconds for each
>> individual connection and wake up when exactly 15 seconds expire,
>> we're not doing that in this patch.  The reason is that we still
>> need to wake up every 15 seconds to check that all the previously
>> active connections are still active, and doing that allows for
>> refreshing many connections in the same run instead of waking up
>> every second just for one connection.
>>
>> Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections
>>                       periodically.")
>> Reported-at: https://issues.redhat.com/browse/FDP-1364
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>
> Thanks Ilya, for looking into my suggestion.  The patch looks good to me.
>
> Acked-by: Eelco Chaudron <echau...@redhat.com>
>
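For anyone reading the thread later, the per-connection tracking described
in the commit message can be sketched roughly as below.  This is only an
illustration and not the code from the patch: ConnTracker, touch(),
due_for_repair(), reconcile() and the injectable clock are hypothetical
names, and the only value taken from the message is the 15 second interval.

```python
import time

# Reconciliation interval mentioned in the commit message (seconds).
RECONCILIATION_INTERVAL = 15.0


class ConnTracker:
    """Hypothetical helper: remembers when each connection was last
    added or started, instead of one global 'last refresh' time."""

    def __init__(self, interval=RECONCILIATION_INTERVAL,
                 clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable so the logic is testable
        self.last_touched = {}      # connection name -> timestamp

    def touch(self, conn):
        """Record that 'conn' was just added or started."""
        self.last_touched[conn] = self.clock()

    def forget(self, conn):
        """Stop tracking a removed connection."""
        self.last_touched.pop(conn, None)

    def due_for_repair(self, conn, is_active):
        """Inactive connections qualify only after a full interval has
        passed since they were last added or started."""
        if is_active:
            return False
        last = self.last_touched.get(conn)
        return last is None or self.clock() - last >= self.interval

    def reconcile(self, active):
        """Return connections to restart on this refresh.  'active' is
        the set of connection names currently reported as active."""
        due = [c for c in self.last_touched
               if self.due_for_repair(c, c in active)]
        for c in due:
            self.touch(c)           # a restart counts as a fresh start
        return due
```

With this shape, adding Tun-2 a few seconds after Tun-1 would not cause
Tun-1 to be restarted, because Tun-1's own timestamp is still recent.  The
periodic 15 second wakeup stays as it is and simply calls reconcile() for
all tracked connections in one run.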
Thanks!  Applied and backported down to 3.2.

Best regards, Ilya Maximets.