Hi all, We've been experimenting with cpr-transfer migration mode recently and have discovered the following issue with the guest QXL driver:
Run migration source: > EMULATOR=/path/to/emulator > ROOTFS=/path/to/image > QMPSOCK=/var/run/alma8qmp-src.sock > > $EMULATOR -enable-kvm \ > -machine q35 \ > -cpu host -smp 2 -m 2G \ > -object > memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\ > -machine memory-backend=ram0 \ > -machine aux-ram-share=on \ > -drive file=$ROOTFS,media=disk,if=virtio \ > -qmp unix:$QMPSOCK,server=on,wait=off \ > -nographic \ > -device qxl-vga Run migration target: > EMULATOR=/path/to/emulator > ROOTFS=/path/to/image > QMPSOCK=/var/run/alma8qmp-dst.sock > > > > $EMULATOR -enable-kvm \ > -machine q35 \ > -cpu host -smp 2 -m 2G \ > -object > memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\ > -machine memory-backend=ram0 \ > -machine aux-ram-share=on \ > -drive file=$ROOTFS,media=disk,if=virtio \ > -qmp unix:$QMPSOCK,server=on,wait=off \ > -nographic \ > -device qxl-vga \ > -incoming tcp:0:44444 \ > -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", > "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}' Launch the migration: > QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell > QMPSOCK=/var/run/alma8qmp-src.sock > > $QMPSHELL -p $QMPSOCK <<EOF > migrate-set-parameters mode=cpr-transfer > migrate > channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}] > EOF Then, after a while, QXL guest driver on target crashes spewing the following messages: > [ 73.962002] [TTM] Buffer eviction failed > [ 73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001) > [ 73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate > VRAM BO That seems to be a known kernel QXL driver bug: https://lore.kernel.org/all/20220907094423.93581-1-min_h...@163.com/T/ https://lore.kernel.org/lkml/ztgydqrlk6wx_...@eldamar.lan/ (the latter discussion contains that reproduce script which speeds up the crash in the guest): > #!/bin/bash > > chvt 3 > > for j in $(seq 80); do > echo "$(date) starting round $j" > if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" > ]; then > echo "bug was reproduced after $j tries" > exit 1 > fi > for i in $(seq 100); do > dmesg > /dev/tty3 > done > done > > echo "bug could not be reproduced" > exit 0 The bug itself seems to remain unfixed, as I was able to reproduce that with Fedora 41 guest, as well as AlmaLinux 8 guest. However our cpr-transfer code also seems to be buggy as it triggers the crash - without the cpr-transfer migration the above reproduce doesn't lead to crash on the source VM. I suspect that, as cpr-transfer doesn't migrate the guest memory, but rather passes it through the memory backend object, our code might somehow corrupt the VRAM. However, I wasn't able to trace the corruption so far. Could somebody help the investigation and take a look into this? Any suggestions would be appreciated. Thanks! Andrey