On 11/1/24 02:23, Ilya Maximets wrote:
> This patch set is a result of debugging different Librewan issues
> for the past few weeks in an attempt to solve the problem where
> ovs-monitor-ipsec gets stuck forever while calling ipsec commands
> and cannot progress any further.
> 
> Main parts here are the introduction of the reconciliation mechanism
> for the ipsec connections and termination of the stuck commands on
> timeout.
> 
> This set also contains a lot of small changes that ultimately fix
> compatibility with multiple versions of Libreswan as well as improve
> visibility into what the ovs-monitor-ipsec process is doing by adding
> more verbose logging.
> For example, without the fist patch in the set, ovs-monitor-ipsec
> deadlocks both libreswan and itself with Libreswan 5 pretty easily:
>   https://github.com/libreswan/libreswan/issues/1859
> More details on addressed issues are in the commit messages.
> 
> The last few patches in the set are adding a system test that stresses
> the reconciliation and various failure handling paths inside the
> monitor.  Mainly because we do get a lot of failures from Libreswan
> while running the test.  This test is currently actively used by
> Libreswan team to find and fix the root causes of multiple issues that
> triggered creation of this patch set.
> 
> The intention for this patch set is to be backported to at least
> branch 3.3.  But further down to 3.1 (or even 2.17 ?) may also be good.
> Luckily, the code is not that different on older branches.

Thanks, Roi and Eelco for review!

I applied the set to main.  And since we're still providing releases
for 3.1 according to the 'last 4 release branches' policy, backported
down to 3.1.  This should be enough for most users.

For now, doesn't seem critical enough to backport to 2.17 (old LTS).

Best regards, Ilya Maximets.

> 
> The set is tested with various versions of Libreswan including
> 3.32 (from Ubuntu 22.04), 4.5, 4.6, 4.9, 4.12, 4.14, 4.15 and 5.1.
> 
> Without the set, only 4.5 and below work well enough, 4.9 - 4.15 are
> getting completely stuck with a few dozens of connections, and 5.1
> deadlocks easily.
> 
> With the set: 4.5 and below still work well, 5.1 works well, 4.9 - 4.15
> can get into state with connectivity issues (libreswan issue that cannot
> be worked around externally), but it is much less likely to end up in
> this state and it affects only a couple individual connections instead
> of blocking the daemon as a whole.  Also, 4.14 and 4.15 seems noticeably
> harder to get into that state (but still very possible).
> 
> 
> Version 3:
>   - Updated description logs in the run_command().
>   - Fixed typos and added timeout value to the log.
>   - Changed the test to report as skipped after the configuration is checked
>     instead of silently skipping the ping test with Libreswan 4.x.
>   - Added Acks from Roi to patches 1-4.
>   - Added Acks from Eelco to patches 2-4 and 7-8.
>   - Added one new patch at the end of the set that greatly reduces
>     chances for hitting bugs in Libreswan (still not enough to enable
>     the ping test, but much better than without it).  Can drop this
>     one if not desired, but seems useful in real world setups.
> 
> Version 2:
>   - Moved the regexp patch earlier in the set to avoid CI failures.
>   - Added logic to avoid reconciliation triggered on every wake up
>     if there are no configuration changes.  Now it runs only once in
>     15 seconds, if there are no config changes.
>   - Improved regexp for loaded connections.  Now we match on the
>     string starting with a digit (IP address) after the name.
>     This solves matching on connections that do not have === in their
>     formatting.  No idea why libreswan prints differently sometimes.
>   - Addressed comments from Roi: removed unnecessary len() and moved
>     stdout/err decoding to the common function.
>   - Added grep on pluto's ERRORs to the test, so they are more visible.
> 
> 
> Ilya Maximets (10):
>   ipsec: Add a helper function to run commands from the monitor.
>   ipsec: libreswan: Fix regexp for connections waiting on child SA.
>   ipsec: libreswan: Reconcile missing connections periodically.
>   ipsec: libreswan: Try to bring non-active connections up.
>   ipsec: libreswan: Avoid monitor hanging on stuck ipsec commands.
>   ipsec: Make command timeout configurable.
>   system-tests: Verbose cleanup of ports and namespaces.
>   tests: ipsec: Add NxN + reconciliation test.
>   tests: ipsec: Check that nodes can ping each other in the NxN test.
>   ipsec: libreswan: Reduce chances for crossing streams.
> 
>  ipsec/ovs-monitor-ipsec.in    | 496 +++++++++++++++++++---------------
>  tests/system-common-macros.at |   7 +-
>  tests/system-ipsec.at         | 208 +++++++++++++-
>  3 files changed, 480 insertions(+), 231 deletions(-)
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to