Multiple versions of Libreswan have an issue where the ipsec --start
command may get stuck forever.  This issue affects many popular
versions of Libreswan, from 4.5 to 4.15, which are shipped in most
modern distributions.

When ipsec --start gets stuck, ovs-monitor-ipsec hangs and can't do
anything else, so neither this tunnel nor any of the other tunnels
gets started.

Add a timeout to the subprocess call, so we do not wait forever.  The
recently introduced reconciliation process will clean things up and
try to re-add this connection later.

Pluto may take a lot of time to process the --start request.  Notably,
the time depends on the retransmission timeout, which is 60 seconds by
default.  However, even at high scale, it doesn't take much more than
that in tests.  So, a 120-second timeout should be a reasonable
default value.

Note: it is observed in practice that the process doesn't actually
terminate for a long time, so we can't afford to wait for it.
That's the main reason why we're not using subprocess.run() with a
timeout here: it would kill the process and then wait for it to exit.
Besides, we'd have to catch the exception anyway.
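
For illustration, a minimal standalone sketch of the kill-without-wait
pattern described above.  The names run_with_timeout, slow and fast are
hypothetical and exist only in this example, and a sleeping Python
child stands in for a stuck ipsec command; this is not the daemon code
itself:

```python
import subprocess
import sys

TIMEOUT_EXPIRED = 137  # 128 + SIGKILL (9), same convention as the patch.


def run_with_timeout(args, timeout):
    """Run args; on timeout, kill the child and return without reaping it."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        pout, perr = proc.communicate(timeout=timeout)
        ret = proc.returncode
    except subprocess.TimeoutExpired:
        # kill() only sends SIGKILL.  It does not wait for the child,
        # so a child that is slow to die cannot block the caller.
        proc.kill()
        pout, perr = b'', b''
        ret = TIMEOUT_EXPIRED
    return ret, pout.decode(), perr.decode()


# Hypothetical stand-ins: one child outlives the timeout, one does not.
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
fast = [sys.executable, "-c", "print('ok')"]

slow_result = run_with_timeout(slow, timeout=1)
fast_result = run_with_timeout(fast, timeout=10)
```

Since the child is never waited for after kill(), its real exit status
is not collected; the caller only sees the synthetic TIMEOUT_EXPIRED
value in place of a return code.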

Reported-at: https://issues.redhat.com/browse/FDP-846
Signed-off-by: Ilya Maximets <[email protected]>
---
 ipsec/ovs-monitor-ipsec.in | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
index 20f6ccb20..05c1965df 100755
--- a/ipsec/ovs-monitor-ipsec.in
+++ b/ipsec/ovs-monitor-ipsec.in
@@ -84,6 +84,7 @@ exiting = False
 monitor = None
 xfrm = None
 RECONCILIATION_INTERVAL = 15  # seconds
+TIMEOUT_EXPIRED = 137  # Exit code for a SIGKILL (128 + 9).
 
 
 def run_command(args, description=None):
@@ -96,7 +97,16 @@ def run_command(args, description=None):
     vlog.dbg("Running %s" % args)
     proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
-    pout, perr = proc.communicate()
+    try:
+        pout, perr = proc.communicate(timeout=120)
+        ret = proc.returncode
+    except subprocess.TimeoutExpired:
+        vlog.warn("Timed out after 120 seconds trying to %s." % description)
+        pout, perr = b'', b''
+        # Just kill the process here.  We can't afford waiting for it,
+        # as it may be stuck and may not actually be terminated.
+        proc.kill()
+        ret = TIMEOUT_EXPIRED
 
     if proc.returncode or perr:
         vlog.warn("Failed to %s; exit code: %d"
@@ -105,7 +115,7 @@ def run_command(args, description=None):
         vlog.warn("stderr: %s" % perr)
         vlog.warn("stdout: %s" % pout)
 
-    return proc.returncode, pout.decode(), perr.decode()
+    return ret, pout.decode(), perr.decode()
 
 
 class XFRM(object):
-- 
2.46.0
