Multiple versions of Libreswan have an issue where the ipsec --start
command may get stuck forever.  This issue affects many popular versions
of Libreswan, from 4.5 to 4.15, which are shipped in most modern
distributions.

When ipsec --start gets stuck, ovs-monitor-ipsec hangs and can't do
anything else, so not only this tunnel but all other tunnels are not
started either.

Add a timeout to the subprocess call, so we do not wait forever.  The
recently introduced reconciliation process will clean things up and
try to re-add this connection later.

Pluto may take a lot of time to process the --start request.  Notably,
the time depends on the retransmission timeout, which is 60 seconds by
default.  However, even at high scale, it doesn't take much more than
that in tests.  So, a 120-second timeout should be a reasonable default
value.

Note: it is observed in practice that the process doesn't actually
terminate for a long time, so we can't afford waiting for it.  That's
the main reason why we're not using subprocess.run() with a timeout
option here (it would wait).  But also because we'd have to catch the
exception anyway.

Reported-at: https://issues.redhat.com/browse/FDP-846
Signed-off-by: Ilya Maximets <[email protected]>
---
 ipsec/ovs-monitor-ipsec.in | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
index 20f6ccb20..05c1965df 100755
--- a/ipsec/ovs-monitor-ipsec.in
+++ b/ipsec/ovs-monitor-ipsec.in
@@ -84,6 +84,7 @@ exiting = False
 monitor = None
 xfrm = None
 RECONCILIATION_INTERVAL = 15  # seconds
+TIMEOUT_EXPIRED = 137         # Exit code for a SIGKILL (128 + 9).
 
 
 def run_command(args, description=None):
@@ -96,7 +97,16 @@ def run_command(args, description=None):
     vlog.dbg("Running %s" % args)
     proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
-    pout, perr = proc.communicate()
+    try:
+        pout, perr = proc.communicate(timeout=120)
+        ret = proc.returncode
+    except subprocess.TimeoutExpired:
+        vlog.warn("Timed out after 120 seconds trying to %s." % description)
+        pout, perr = b'', b''
+        # Just kill the process here.  We can't afford waiting for it,
+        # as it may be stuck and may not actually be terminated.
+        proc.kill()
+        ret = TIMEOUT_EXPIRED
 
     if proc.returncode or perr:
         vlog.warn("Failed to %s; exit code: %d"
@@ -105,7 +115,7 @@ def run_command(args, description=None):
         vlog.warn("stderr: %s" % perr)
         vlog.warn("stdout: %s" % pout)
 
-    return proc.returncode, pout.decode(), perr.decode()
+    return ret, pout.decode(), perr.decode()
 
 
 class XFRM(object):
-- 
2.46.0
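
For anyone who wants to try the timeout-and-kill pattern from
run_command() outside of ovs-monitor-ipsec, here is a minimal
standalone sketch.  It is an illustration, not part of the patch:
run_with_timeout() is a hypothetical name, the timeout is passed as
an argument instead of being fixed at 120 seconds, the vlog calls are
left out, and a sleeping Python child stands in for a stuck ipsec
call.  As in the patch, the child is killed but not waited for;
subprocess.run(timeout=...) would kill the child and then wait for
it.

```python
import subprocess
import sys

TIMEOUT_EXPIRED = 137  # Exit code for a SIGKILL (128 + 9), as in the patch.


def run_with_timeout(args, timeout):
    """Run args and return (exit code, stdout, stderr), like run_command().

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        pout, perr = proc.communicate(timeout=timeout)
        ret = proc.returncode
    except subprocess.TimeoutExpired:
        pout, perr = b'', b''
        # Kill without waiting: the child may not go away promptly.
        proc.kill()
        ret = TIMEOUT_EXPIRED
    return ret, pout.decode(), perr.decode()


if __name__ == "__main__":
    # Assumed stand-ins for a hung and a healthy command.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    fine = [sys.executable, "-c", "print('ok')"]
    print(run_with_timeout(hung, timeout=1))
    print(run_with_timeout(fine, timeout=30))
```

With this shape a caller can treat a timeout like any other failed
command, since it only sees a non-zero exit code and empty output.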
