On 11/1/24 02:23, Ilya Maximets wrote: > This patch set is a result of debugging different Librewan issues > for the past few weeks in an attempt to solve the problem where > ovs-monitor-ipsec gets stuck forever while calling ipsec commands > and cannot progress any further. > > Main parts here are the introduction of the reconciliation mechanism > for the ipsec connections and termination of the stuck commands on > timeout. > > This set also contains a lot of small changes that ultimately fix > compatibility with multiple versions of Libreswan as well as improve > visibility into what the ovs-monitor-ipsec process is doing by adding > more verbose logging. > For example, without the fist patch in the set, ovs-monitor-ipsec > deadlocks both libreswan and itself with Libreswan 5 pretty easily: > https://github.com/libreswan/libreswan/issues/1859 > More details on addressed issues are in the commit messages. > > The last few patches in the set are adding a system test that stresses > the reconciliation and various failure handling paths inside the > monitor. Mainly because we do get a lot of failures from Libreswan > while running the test. This test is currently actively used by > Libreswan team to find and fix the root causes of multiple issues that > triggered creation of this patch set. > > The intention for this patch set is to be backported to at least > branch 3.3. But further down to 3.1 (or even 2.17 ?) may also be good. > Luckily, the code is not that different on older branches.
Thanks, Roi and Eelco for review! I applied the set to main. And since we're still providing releases for 3.1 according to the 'last 4 release branches' policy, backported down to 3.1. This should be enough for most users. For now, doesn't seem critical enough to backport to 2.17 (old LTS). Best regards, Ilya Maximets. > > The set is tested with various versions of Libreswan including > 3.32 (from Ubuntu 22.04), 4.5, 4.6, 4.9, 4.12, 4.14, 4.15 and 5.1. > > Without the set, only 4.5 and below work well enough, 4.9 - 4.15 are > getting completely stuck with a few dozens of connections, and 5.1 > deadlocks easily. > > With the set: 4.5 and below still work well, 5.1 works well, 4.9 - 4.15 > can get into state with connectivity issues (libreswan issue that cannot > be worked around externally), but it is much less likely to end up in > this state and it affects only a couple individual connections instead > of blocking the daemon as a whole. Also, 4.14 and 4.15 seems noticeably > harder to get into that state (but still very possible). > > > Version 3: > - Updated description logs in the run_command(). > - Fixed typos and added timeout value to the log. > - Changed the test to report as skipped after the configuration is checked > instead of silently skipping the ping test with Libreswan 4.x. > - Added Acks from Roi to patches 1-4. > - Added Acks from Eelco to patches 2-4 and 7-8. > - Added one new patch at the end of the set that greatly reduces > chances for hitting bugs in Libreswan (still not enough to enable > the ping test, but much better than without it). Can drop this > one if not desired, but seems useful in real world setups. > > Version 2: > - Moved the regexp patch earlier in the set to avoid CI failures. > - Added logic to avoid reconciliation triggered on every wake up > if there are no configuration changes. Now it runs only once in > 15 seconds, if there are no config changes. > - Improved regexp for loaded connections. Now we match on the > string starting with a digit (IP address) after the name. > This solves matching on connections that do not have === in their > formatting. No idea why libreswan prints differently sometimes. > - Addressed comments from Roi: removed unnecessary len() and moved > stdout/err decoding to the common function. > - Added grep on pluto's ERRORs to the test, so they are more visible. > > > Ilya Maximets (10): > ipsec: Add a helper function to run commands from the monitor. > ipsec: libreswan: Fix regexp for connections waiting on child SA. > ipsec: libreswan: Reconcile missing connections periodically. > ipsec: libreswan: Try to bring non-active connections up. > ipsec: libreswan: Avoid monitor hanging on stuck ipsec commands. > ipsec: Make command timeout configurable. > system-tests: Verbose cleanup of ports and namespaces. > tests: ipsec: Add NxN + reconciliation test. > tests: ipsec: Check that nodes can ping each other in the NxN test. > ipsec: libreswan: Reduce chances for crossing streams. > > ipsec/ovs-monitor-ipsec.in | 496 +++++++++++++++++++--------------- > tests/system-common-macros.at | 7 +- > tests/system-ipsec.at | 208 +++++++++++++- > 3 files changed, 480 insertions(+), 231 deletions(-) > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
