Multiple versions of Libreswan have an issue where ipsec --start
command may get stuck forever.  This issue affects many popular
versions of Libreswan from 4.5 to 4.15, which are shipped in most
modern distributions.

When ipsec --start gets stuck, ovs-monitor-ipsec hangs and can't do
anything else, so not olny this one but all other tunnels are also
not being started.

Add a timeout to the subprocess call, so we do not wait forever.  Just
introduced reconciliation process will clean things up and will try to
re-add this connection later.

Pluto may take a lot of time to process the --start request.  Notably,
the time depends on the retransmission timeout, which is 60 seconds by
default.  However, even at high scale, it doesn't take much more than
that in tests.  So, 120 second timeout should be a reasonable default
value.

Note: it is observed in practice that the process doesn't actually
terminate for a long time, so we can't afford waiting for it.
That's the main reason why we're not using the subprocess.run() with
a timeout option here (it would wait).  But also, because we'd had to
catch the exception anyway.

Reported-at: https://issues.redhat.com/browse/FDP-846
Signed-off-by: Ilya Maximets <[email protected]>
---
 ipsec/ovs-monitor-ipsec.in | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
index 5d4b77bd2..dba855af5 100755
--- a/ipsec/ovs-monitor-ipsec.in
+++ b/ipsec/ovs-monitor-ipsec.in
@@ -84,6 +84,7 @@ exiting = False
 monitor = None
 xfrm = None
 RECONCILIATION_INTERVAL = 15  # seconds
+TIEMOUT_EXPIRED = 37
 
 
 def run_command(args, description=None):
@@ -96,7 +97,16 @@ def run_command(args, description=None):
     vlog.dbg("Running %s" % args)
     proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
-    pout, perr = proc.communicate()
+    try:
+        pout, perr = proc.communicate(timeout=120)
+        ret = proc.returncode
+    except subprocess.TimeoutExpired:
+        vlog.warn("Command timed out trying to %s." % description)
+        pout, perr = b'', b''
+        # Just kill the process here.  We can't afford waiting for it,
+        # as it may be stuck and may not actually be terminated.
+        proc.kill()
+        ret = TIEMOUT_EXPIRED
 
     if proc.returncode or perr:
         vlog.warn("Failed to %s; exit code: %d"
@@ -105,7 +115,7 @@ def run_command(args, description=None):
         vlog.warn("stderr: %s" % perr)
         vlog.warn("stdout: %s" % pout)
 
-    return proc.returncode, pout.decode(), perr.decode()
+    return ret, pout.decode(), perr.decode()
 
 
 class XFRM(object):
-- 
2.46.0

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to