On 30 Oct 2024, at 14:50, Ilya Maximets wrote:

> Multiple versions of Libreswan have an issue where ipsec --start
> command may get stuck forever.  This issue affects many popular
> versions of Libreswan from 4.5 to 4.15, which are shipped in most
> modern distributions.
>
> When ipsec --start gets stuck, ovs-monitor-ipsec hangs and can't do
> anything else, so not olny this one but all other tunnels are also

nit: olny -> only

> not being started.
>
> Add a timeout to the subprocess call, so we do not wait forever.  Just
> introduced reconciliation process will clean things up and will try to
> re-add this connection later.
>
> Pluto may take a lot of time to process the --start request.  Notably,
> the time depends on the retransmission timeout, which is 60 seconds by
> default.  However, even at high scale, it doesn't take much more than
> that in tests.  So, 120 second timeout should be a reasonable default
> value.
>
> Note: it is observed in practice that the process doesn't actually
> terminate for a long time, so we can't afford waiting for it.
> That's the main reason why we're not using the subprocess.run() with
> a timeout option here (it would wait).  But also, because we'd had to
> catch the exception anyway.
>
> Reported-at: https://issues.redhat.com/browse/FDP-846
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  ipsec/ovs-monitor-ipsec.in | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
> index 5d4b77bd2..dba855af5 100755
> --- a/ipsec/ovs-monitor-ipsec.in
> +++ b/ipsec/ovs-monitor-ipsec.in
> @@ -84,6 +84,7 @@ exiting = False
>  monitor = None
>  xfrm = None
>  RECONCILIATION_INTERVAL = 15  # seconds
> +TIEMOUT_EXPIRED = 37

TIEMOUT -> TIMEOUT

Also, 37 seems rather random, any reason why this was chosen?

>
>
>  def run_command(args, description=None):
> @@ -96,7 +97,16 @@ def run_command(args, description=None):

Maybe add the timeout as an argument, i.e. timeout=120?

>      vlog.dbg("Running %s" % args)
>      proc = subprocess.Popen(args, stdout=subprocess.PIPE,
>                              stderr=subprocess.PIPE)
> -    pout, perr = proc.communicate()
> +    try:
> +        pout, perr = proc.communicate(timeout=120)
> +        ret = proc.returncode
> +    except subprocess.TimeoutExpired:
> +        vlog.warn("Command timed out trying to %s." % description)

Maye add the timeout value to the log also?

> +        pout, perr = b'', b''
> +        # Just kill the process here.  We can't afford waiting for it,
> +        # as it may be stuck and may not actually be terminated.
> +        proc.kill()
> +        ret = TIEMOUT_EXPIRED
>
>      if proc.returncode or perr:
>          vlog.warn("Failed to %s; exit code: %d"
> @@ -105,7 +115,7 @@ def run_command(args, description=None):
>          vlog.warn("stderr: %s" % perr)
>          vlog.warn("stdout: %s" % pout)
>
> -    return proc.returncode, pout.decode(), perr.decode()
> +    return ret, pout.decode(), perr.decode()
>
>
>  class XFRM(object):
> -- 
> 2.46.0

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to