Re: [PVE-User] VMs hung after live migration - Intel CPU

Eneko Lacunza via pve-user Mon, 17 Apr 2023 08:19:27 -0700

--- Begin Message ---
Hi all,

We just tested today the following migrations with latest PVE 7.4:
Ryzen 5900X 5.13.19-6-pve -> Ryzen 1700 6.2.9-1-pve : OK (Linux and Windows, kvm64 cpu)
Ryzen 5900X 5.13.19-6-pve -> Ryzen 2600X 6.2.9-1-pve : OK (Linux and Windows, kvm64 cpu)
Ryzen 2600X 6.2.9-1-pve <-> Ryzen 1700 6.2.9-1-pve : OK (Linux and Windows, kvm64 cpu)
Ryzen 5900X 6.2.9-1-pve <-> Ryzen 2600X 6.2.9-1-pve : OK (Linux and Windows, kvm64 cpu)
We had been holding on the 5.13.19 kernel because of those issues; now it seems there's a way to upgrade the kernel without stopping VMs in a mixed CPU-model cluster.
Thanks


On 16/11/22 at 12:31, Eneko Lacunza wrote:
Hi,
A new kernel 5.15.74-1 is out, and I saw that a TSC bug fix prepared by Fiona (thanks a lot!) was there, so I just tried it out:
Ryzen 1700 5.13.19-6-pve -> Ryzen 5900X 5.15.74-1-pve : migrations OK
Ryzen 5900X 5.15.74-1-pve -> Ryzen 1700 5.15.74-1-pve : Linux migrations failed, Windows OK
I noticed that in the VMs where something was logged to the console, there was no mention of TSC.
This time the error was (Debian 10 kernel):
PANIC: double fault, error_code: 0x0

Then a kernel panic, I have it in a screenshot if that can help.
I recall some floating point issue was reported, no idea if that has been tracked.
I think there has been progress with the issues we are seeing in this Ryzen cluster, although the 5.15 kernel is still unworkable as of 5.15.74...
Cheers


On 9/11/22 at 9:21, Eneko Lacunza via pve-user wrote:
Hi Jan,

On 8/11/22 at 21:57, Jan Vlach wrote:
thank you a million for taking your time to re-test this! It really helps me to understand what to expect to work and what doesn’t. I had a glimpse of an idea to create a cluster with mixed CPUs of EPYC gen1 and EPYC gen3, but this really seems like a road to hell(tm). So I’ll keep the clusters homogeneous with the same gen of CPU. I have two sites, but fortunately, I can keep the clusters homogeneous (with one having “more power”).
Honestly, up until now, I thought I could abstract from the version of the Linux kernel I’m running. Because, hey, it’s all KVM. I’m setting my VMs with cpu type host to have the benefit of accelerated AES and other instructions, but I have yet to see if EPYCv1 is compatible with EPYCv3. (v being gen) Thanks for teaching me a new trick or a thing to be aware of at least! (I remember this to be an issue with VMware heterogeneous clusters (with CPUs of different generations), but I really thought KVM64 would help you to abstract from all this, KVM64 being a Pentium4-era CPU)
We haven't found any issue with this until kernel 5.15 (we have been using Proxmox since 0.9 or something like that!). The only issue has been trouble live migrating VMs with 2+ cores between Intel and AMD processors, but no issues at all with different generations of the same brand. Until 5.15.* that is.
This is the only reason to have the kvm64 vCPU type, and also the reason for it to be the default! So I don't understand why this is taking so long to be fixed.
Do you use virtio drivers for storage and network card at all? Can you see a pattern there where the 3 Debian/Windows machines were not affected? Did they use virtio or not?
Yes, virtio drivers for storage and network for all Debian and Windows 2008r2.
I really don’t see a reason why the migration back from 5.13 -> 5.19 should bring that 50/100% CPU load and hanging. I’ve had some phantom load before with “Use tablet for pointer: Yes”, but that was in the 5% ballpark per VM.
Issue is not CPU load "per se", but that the VM is hung (not able to do anything in console)
I’m just a fellow Proxmox admin/user. Hope this would ring a bell or spark interest in the core Proxmox team. I’ve had struggles with 5.15 before with GPU passthrough (wasn’t able to do this) and OpenBSD VMs taking minutes compared to tens of seconds to boot on 5.15.
All in all, thanks for all the hints I could test before production, so it won’t hurt “down the road” …
For now, we're pinning the 5.13 kernel, which is working perfectly (except AMD<->Intel migration, but that is a years-long issue).
JV
P.S. I’m trying to push my boss towards a commercial subscription for our clusters, but at this point I really am not sure it would help ...
I'm sure this must have been reported, no idea why it wasn't fixed or the official kernel downgraded to 5.13. In the forum someone from Proxmox even commented that we shouldn't run clusters with different generation CPUs, which was shocking to read, frankly. We have customers with commercial support that we pinned to the 5.13 kernel preventively because we found the issue in our "eat our own food" cluster beforehand!! :-)
Cheers
On 8. 11. 2022, at 18:18, Eneko Lacunza via pve-user <[email protected]> wrote:
From: Eneko Lacunza <[email protected]>
Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
Date: 8 November 2022 18:18:44 CET
To: [email protected]


Hi Jan,

I had some time to re-test this.

I tried live migration with KVM64 CPU between 2 nodes:

node-ryzen1700 - kernel 5.19.7-1-pve
node-ryzen5900x - kernel 5.19.7-1-pve

I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
This works OK in both directions.

Then I downgraded a node to 5.13:
node-ryzen1700 - kernel 5.19.7-1-pve
node-ryzen5900x - kernel 5.13.19-6-pve
Migration of those 9 VMs worked well from node-ryzen1700 -> node-ryzen5900x
But migration of those 9 VMs back node-ryzen5900x -> node-ryzen1700 was a disaster: all 8 Debian VMs hung with 50/100% CPU use. Windows 2008r2 seems not to be affected by the issue at all.
3 other Debian/Windows VMs on node-ryzen1700 were not affected.

After migrating both nodes to kernel 5.13:

node-ryzen1700 - kernel 5.13.19-6-pve
node-ryzen5900x - kernel 5.13.19-6-pve
Migration of those 9 VMs node-ryzen5900x -> node-ryzen1700 works as intended :)
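
The kernel pairs in the tests above can be compared mechanically before a bulk migration. What follows is only a minimal Python sketch and not something from the original thread: the names `parse_pve_kernel` and `same_kernel_series` are hypothetical, and it assumes release strings of the form `5.13.19-6-pve` as written above. A matching kernel series is a pre-flight hint only, not a guarantee that a migration is safe.

```python
import re
from typing import NamedTuple


class PveKernel(NamedTuple):
    """Components of a release string such as '5.13.19-6-pve'."""
    major: int
    minor: int
    patch: int
    build: int


# Assumption: release strings follow the '<major>.<minor>.<patch>-<build>-pve'
# pattern seen in this thread; anything else is rejected.
_KERNEL_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-pve$")


def parse_pve_kernel(release: str) -> PveKernel:
    """Split a release string into numeric parts, or raise ValueError."""
    match = _KERNEL_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return PveKernel(*(int(group) for group in match.groups()))


def same_kernel_series(source: str, target: str) -> bool:
    """Return True when both nodes run the same major.minor kernel series."""
    src, dst = parse_pve_kernel(source), parse_pve_kernel(target)
    return (src.major, src.minor) == (dst.major, dst.minor)


if __name__ == "__main__":
    # The mixed-kernel pair from the test above (5.13 source, 5.19 target).
    source, target = "5.13.19-6-pve", "5.19.7-1-pve"
    if not same_kernel_series(source, target):
        print(f"warning: kernel series differ ({source} -> {target})")
```

The release string for each node would typically come from `uname -r` on that node; how it is collected is left out of the sketch.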
Cheers



On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
Hi Jan,

Yes, there's no issue if CPUs are the same.
VMs hang when the CPUs are of sufficiently different generations, even when they are of the same brand and use the KVM64 vCPU.
On 7/11/22 at 22:59, Jan Vlach wrote:
Hi,
For what it’s worth, live VM migration with Linux VMs with various Debian versions works here just fine. I’m using virtio for networking and virtio scsi for disks. (The only version where I had problems was debian6, where the kernel does not support virtio scsi and megaraid sas 8708EM2 needs to be used. I get a kernel panic in mpt_sas on thaw after migration.)
We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-Core Processor. These are Supermicros with latest BIOS (latest microcode?) and BMC.
Storage is a local ZFS pool, backed by SSDs in striped mirrors (4 devices on each node). Migration has dedicated 2x 10GigE LACP and a dedicated VLAN on the switch stack.
I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.
What does your cluster look like hardware-wise? What are the problems you experienced with VM migration on 5.13->5.19?
Thanks,
JV
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
--- End Message ---
