--- Begin Message ---
Hi Jan,

On 8/11/22 at 21:57, Jan Vlach wrote:
thank you a million for taking the time to re-test this! It really helps me 
understand what works and what doesn't. I had toyed with the idea of creating a 
cluster with mixed EPYC gen1 and EPYC gen3 CPUs, but that really seems like a 
road to hell(tm). So I'll keep the clusters homogeneous, with the same CPU 
generation throughout. I have two sites, but fortunately I can keep each cluster 
homogeneous (with one having "more power").

Honestly, up until now I thought I could abstract away from the Linux kernel 
version I'm running. Because, hey, it's all KVM. I set my VMs to CPU type "host" 
to get the benefit of accelerated AES and other instructions, but I have yet to 
see whether EPYC gen1 is migration-compatible with EPYC gen3. Thanks for 
teaching me a new trick, or at least a thing to be aware of! (I remember this 
being an issue with heterogeneous VMware clusters (CPUs of different 
generations), but I really thought KVM64 would let you abstract away from all 
this, KVM64 being a Pentium 4-era CPU model.)
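
For reference, the vCPU type can be switched per VM with qm; a minimal sketch, 
where VMID 101 is just a placeholder and not a VM from this thread:

    # expose the host CPU model (AES-NI etc.), at the cost of tying the VM to compatible hosts
    qm set 101 --cpu host
    # or use the lowest-common-denominator model meant for cross-host live migration
    qm set 101 --cpu kvm64
    # check the current setting
    qm config 101

The change only takes effect the next time the VM is fully stopped and started.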

We haven't found any issue with this until kernel 5.15 (we have been using Proxmox since 0.9 or something like that!). The only issue has been trouble live-migrating VMs with 2+ cores between Intel and AMD processors, but no issues at all between different generations of the same brand. Until 5.15.*, that is.

This is the whole point of the kvm64 vCPU type, and also the reason it is the default! So I don't understand why this is taking so long to be fixed.


Do you use virtio drivers for storage and the network card at all? Can you see a 
pattern there with the 3 Debian/Windows machines that were not affected? Did they 
use virtio or not?

Yes, virtio drivers for storage and network for all the Debian and Windows 2008r2 VMs.
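
For context, a typical virtio setup in a guest's config (/etc/pve/qemu-server/<vmid>.conf) 
looks roughly like the sketch below; the storage name, VMID and MAC address are 
made up for illustration, not taken from this thread:

    scsihw: virtio-scsi-pci
    scsi0: local-zfs:vm-101-disk-0,discard=on,size=32G
    net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0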


I really don’t see a reason why migrating back from 5.13 -> 5.19 should bring 
that 50/100% CPU load and hanging. I’ve seen some phantom load before with 
“Use tablet for pointer: Yes”, but that was in the 5% ballpark per VM.
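
As an aside, that tablet device can be disabled per VM if the phantom load is 
annoying (again, VMID 101 is just an example):

    qm set 101 --tablet 0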

The issue is not the CPU load per se, but that the VM is hung (you can't do anything in the console).


I’m just a fellow Proxmox admin/user. Hope this rings a bell or sparks 
interest in the core Proxmox team. I’ve had struggles with 5.15 before: GPU 
passthrough (wasn’t able to get it working) and OpenBSD VMs taking minutes 
instead of tens of seconds to boot.

All in all, thanks for all the hints I could test before production, so it 
won’t hurt “down the road” …

For now, we're pinning the 5.13 kernel, which is working perfectly (except for AMD<->Intel migration, but that is a years-long issue).
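
For anyone wanting to do the same: on a reasonably recent PVE 7.x the pinning 
can be done with proxmox-boot-tool; the kernel version below is the one from 
this thread, adjust to whatever is actually installed:

    # list the kernels proxmox-boot-tool knows about
    proxmox-boot-tool kernel list
    # pin the known-good kernel so it stays the boot default across updates
    proxmox-boot-tool kernel pin 5.13.19-6-pve
    # later, to return to the normal "newest installed kernel" behaviour
    proxmox-boot-tool kernel unpin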


JV
P.S. I’m trying to push my boss towards a commercial subscription for our 
clusters, but at this point I’m really not sure it would help ...

I'm sure this must have been reported; no idea why it wasn't fixed or the official kernel downgraded to 5.13. In the forum someone from Proxmox even commented that we shouldn't run clusters with different-generation CPUs, which was shocking to read, frankly. We have customers with commercial support whose clusters we pinned to the 5.13 kernel preventively, because we had found the issue in our own "eat our own food" cluster beforehand!! :-)

Cheers



On 8. 11. 2022, at 18:18, Eneko Lacunza via pve-user <[email protected]> wrote:

From: Eneko Lacunza <[email protected]>
Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
Date: 8 November 2022 18:18:44 CET
To: [email protected]


Hi Jan,

I had some time to re-test this.

I tried live migration with KVM64 CPU between 2 nodes:

node-ryzen1700 - kernel 5.19.7-1-pve
node-ryzen5900x - kernel 5.19.7-1-pve

I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
This worked OK in both directions.
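
For completeness, the same test can also be scripted from the CLI instead of 
using the GUI bulk migrate (the VMIDs below are made up):

    for vmid in 101 102 103; do
        qm migrate "$vmid" node-ryzen1700 --online
    done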

Then I downgraded a node to 5.13:
node-ryzen1700 - kernel 5.19.7-1-pve
node-ryzen5900x - kernel 5.13.19-6-pve

Migration of those 9 VMs worked well from node-ryzen1700 -> node-ryzen5900x.

But migration of those 9 VMs back from node-ryzen5900x -> node-ryzen1700 was a 
disaster: all 8 Debian VMs hung with 50/100% CPU use. Windows 2008r2 seemed not 
affected by the issue at all.

3 other Debian/Windows VMs on node-ryzen1700 were not affected.

After downgrading both nodes to kernel 5.13:

node-ryzen1700 - kernel 5.13.19-6-pve
node-ryzen5900x - kernel 5.13.19-6-pve
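
For the record, the kernel actually running on each node can be confirmed after 
the reboot with:

    uname -r      # should report 5.13.19-6-pve on both nodes now
    pveversion    # also shows the running kernel alongside the pve-manager version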

Migration of those 9 VMs from node-ryzen5900x -> node-ryzen1700 works as intended :)

Cheers



On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
Hi Jan,

Yes, there's no issue if CPUs are the same.

VMs hang when the CPUs are of sufficiently different generations, even when they 
are the same brand and the VMs use the KVM64 vCPU.

On 7/11/22 at 22:59, Jan Vlach wrote:
Hi,

For what it’s worth, live migration of Linux VMs with various Debian versions 
works just fine here. I’m using virtio for networking and virtio-scsi for disks. 
(The only version where I had problems was Debian 6, where the kernel does not 
support virtio-scsi and a megaraid sas 8708EM2 needs to be used instead; I get a 
kernel panic in mpt_sas on thaw after migration.)

We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-core 
processors. These are Supermicros with the latest BIOS (latest microcode?) and BMC firmware.

Storage is a local ZFS pool, backed by SSDs in striped mirrors (4 devices on each 
node). Migration traffic has a dedicated 2x 10GigE LACP bond and a dedicated VLAN 
on the switch stack.
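
For illustration, a striped-mirror pool like that is what you get from something 
along these lines (pool and device names purely hypothetical):

    zpool create -o ashift=12 tank \
        mirror /dev/sda /dev/sdb \
        mirror /dev/sdc /dev/sdd
    zpool status tank   # shows two mirror vdevs striped together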

I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.

What does your cluster look like hardware-wise? What are the problems you 
experienced with VM migration on 5.13 -> 5.19?

Thanks,
JV



Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

--- End Message ---
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
