On December 16, 2022 2:36 pm, Daniel Tschlatscher wrote:
> In some cases the VM API start method would return before the detached
> KVM process would have exited. This is especially problematic with HA,
> because the HA manager would think the VM started successfully, later
> see that it exited and start it again in an endless loop.
> 
> Moreover, another case exists when resuming a hibernated VM. In this
> case, the qemu thread will attempt to load the whole vmstate into
> memory before exiting.
> Depending on vmstate size, disk read speed, and similar factors this
> can take quite a while though and it is not possible to start the VM
> normally during this time.
> 
> To get around this, this patch intercepts the error, looks whether a
> corresponding KVM thread is still running, and waits for/kills it,
> before continuing.
> 
> Signed-off-by: Daniel Tschlatscher <d.tschlatsc...@proxmox.com>
> ---
> 
> Changes from v2:
> * Rebased to current master
> * Changed warn to use 'log_warn' instead
> * Reworded log message when waiting for lingering qemu process
> 
>  PVE/QemuServer.pm | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 2adbe3a..f63dc3f 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -5884,15 +5884,41 @@ sub vm_start_nolock {
>               $tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
>           }
>  
> -         my $exitcode = run_command($cmd, %run_params);
> -         if ($exitcode) {
> -             if ($tpmpid) {
> -                 warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
> -                 kill 'TERM', $tpmpid;
> +         eval {
> +             my $exitcode = run_command($cmd, %run_params);
> +
> +             if ($exitcode) {
> +                 if ($tpmpid) {
> +                     log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";

This warn -> log_warn change seems to have slipped in; it's not really part of
this patch, is it?

> +                     kill 'TERM', $tpmpid;
> +                 }
> +                 die "QEMU exited with code $exitcode\n";
>               }
> -             die "QEMU exited with code $exitcode\n";
> +         };
> +
> +         if (my $err = $@) {
> +             my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
> +
> +             if ($pid ne "") {

can be combined:
if (my $pid = ...) {

}

(an empty string evaluates to false in Perl ;))
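
A minimal, self-contained sketch of the combined form. The stub below is
hypothetical and only stands in for
PVE::QemuServer::Helpers::vm_running_locally(); it assumes the helper returns
the PID when the VM is running and a false value otherwise:

```perl
use strict;
use warnings;

# Hypothetical stand-in for PVE::QemuServer::Helpers::vm_running_locally():
# assumed to return the PID if the VM runs locally, a false value otherwise.
sub vm_running_locally_stub {
    my ($vmid) = @_;
    my %running = (100 => 4242);    # made-up VMID => PID mapping
    return $running{$vmid} // '';
}

sub describe {
    my ($vmid) = @_;

    # Declaration and truthiness check in one step; $pid is scoped to the block.
    if (my $pid = vm_running_locally_stub($vmid)) {
        return "VM $vmid still running as pid $pid";
    }
    return "VM $vmid not running";
}

print describe(100), "\n";
print describe(101), "\n";
```

This also keeps $pid out of the enclosing scope, where nothing else needs it.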

> +                 my $count = 0;
> +                 my $timeout = 300;
> +
> +                 print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
> +                 while (($count < $timeout) &&
> +                     PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
> +                     $count++;
> +                     sleep(1);
> +                 }
> +

either here

> +                 if ($count >= $timeout) {
> +                     log_warn "Reached timeout. Terminating now with SIGKILL\n";

or here, recheck that the VM is still running and still has the same PID, and
if not, log accordingly instead of KILLing.

The same is also true in _do_vm_stop.
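
Roughly what I have in mind, as an untested sketch rather than the final code.
reap_lingering and its $check/$kill callbacks are hypothetical names used for
illustration; $check stands in for vm_running_locally($vmid) and is assumed to
return the current PID or a false value:

```perl
use strict;
use warnings;

# Wait up to $timeout seconds for $pid to exit, then re-check before killing.
# $check and $kill are hypothetical callbacks: $check stands in for
# PVE::QemuServer::Helpers::vm_running_locally($vmid), $kill for kill(9, $pid).
sub reap_lingering {
    my ($pid, $timeout, $check, $kill) = @_;

    my $count = 0;
    while ($count < $timeout) {
        my $current = $check->();
        # Stop waiting once the process is gone or a different PID took over.
        last if !$current || $current != $pid;
        $count++;
        sleep(1);
    }

    # Re-check right before the kill: the process may have exited, or the
    # VM may have been started again with a different PID in the meantime.
    my $current = $check->();
    return 'exited' if !$current;
    return 'replaced' if $current != $pid;

    $kill->($pid);
    return 'killed';
}

# Dry run with stub callbacks and a zero timeout, so nothing is signalled.
my $result = reap_lingering(4242, 0, sub { 4243 }, sub { die "must not kill\n" });
print "result: $result\n";
```

The caller can then log based on the returned state and only warn about
SIGKILL when the original process was really the one being killed.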

> +                     kill(9, $pid);
> +                 }
> +             }
> +
> +             die $err;
>           }
> -     };
> +     }
>      };
>  
>      if ($conf->{hugepages}) {
> -- 
> 2.30.2
> 
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> 
> 

