On 10/31/24 12:02, Eelco Chaudron wrote: > > > On 30 Oct 2024, at 14:50, Ilya Maximets wrote: > >> Multiple versions of Libreswan have an issue where ipsec --start >> command may get stuck forever. This issue affects many popular >> versions of Libreswan from 4.5 to 4.15, which are shipped in most >> modern distributions. >> >> When ipsec --start gets stuck, ovs-monitor-ipsec hangs and can't do >> anything else, so not olny this one but all other tunnels are also > > nit: olny -> only
OK. > >> not being started. >> >> Add a timeout to the subprocess call, so we do not wait forever. Just >> introduced reconciliation process will clean things up and will try to >> re-add this connection later. >> >> Pluto may take a lot of time to process the --start request. Notably, >> the time depends on the retransmission timeout, which is 60 seconds by >> default. However, even at high scale, it doesn't take much more than >> that in tests. So, 120 second timeout should be a reasonable default >> value. >> >> Note: it is observed in practice that the process doesn't actually >> terminate for a long time, so we can't afford waiting for it. >> That's the main reason why we're not using the subprocess.run() with >> a timeout option here (it would wait). But also, because we'd had to >> catch the exception anyway. >> >> Reported-at: https://issues.redhat.com/browse/FDP-846 >> Signed-off-by: Ilya Maximets <[email protected]> >> --- >> ipsec/ovs-monitor-ipsec.in | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in >> index 5d4b77bd2..dba855af5 100755 >> --- a/ipsec/ovs-monitor-ipsec.in >> +++ b/ipsec/ovs-monitor-ipsec.in >> @@ -84,6 +84,7 @@ exiting = False >> monitor = None >> xfrm = None >> RECONCILIATION_INTERVAL = 15 # seconds >> +TIEMOUT_EXPIRED = 37 > > TIEMOUT -> TIMEOUT Oops. :) > > Also, 37 seems rather random, any reason why this was chosen? It is random. I just needed a non-zero number to return as I do not want to re-raise the exception and catch it in the caller. Maybe 137 would be a good number? We do kill the process and 128 + 9 = 137 would be an exit code for a process terminated with SIGKILL in Linux. WDYT? > >> >> >> def run_command(args, description=None): >> @@ -96,7 +97,16 @@ def run_command(args, description=None): > > Maybe add the timeout as an argument, i.e. timeout=120? It's never used as an argument, but I can see an argument for not using a global value in the function from the next patch, so I'll change that. > >> vlog.dbg("Running %s" % args) >> proc = subprocess.Popen(args, stdout=subprocess.PIPE, >> stderr=subprocess.PIPE) >> - pout, perr = proc.communicate() >> + try: >> + pout, perr = proc.communicate(timeout=120) >> + ret = proc.returncode >> + except subprocess.TimeoutExpired: >> + vlog.warn("Command timed out trying to %s." % description) > > Maye add the timeout value to the log also? Maybe "Command timed out after %s seconds trying to %s." ? > >> + pout, perr = b'', b'' >> + # Just kill the process here. We can't afford waiting for it, >> + # as it may be stuck and may not actually be terminated. >> + proc.kill() >> + ret = TIEMOUT_EXPIRED >> >> if proc.returncode or perr: >> vlog.warn("Failed to %s; exit code: %d" >> @@ -105,7 +115,7 @@ def run_command(args, description=None): >> vlog.warn("stderr: %s" % perr) >> vlog.warn("stdout: %s" % pout) >> >> - return proc.returncode, pout.decode(), perr.decode() >> + return ret, pout.decode(), perr.decode() >> >> >> class XFRM(object): >> -- >> 2.46.0 > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
