Hi all,
We've been experimenting with cpr-transfer migration mode recently and
have discovered the following issue with the guest QXL driver:
Run migration source:
EMULATOR=/path/to/emulator
ROOTFS=/path/to/image
QMPSOCK=/var/run/alma8qmp-src.sock
$EMULATOR -enable-kvm \
-machine q35 \
-cpu host -smp 2 -m 2G \
-object
memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-machine memory-backend=ram0 \
-machine aux-ram-share=on \
-drive file=$ROOTFS,media=disk,if=virtio \
-qmp unix:$QMPSOCK,server=on,wait=off \
-nographic \
-device qxl-vga
Run migration target:
EMULATOR=/path/to/emulator
ROOTFS=/path/to/image
QMPSOCK=/var/run/alma8qmp-dst.sock
$EMULATOR -enable-kvm \
-machine q35 \
-cpu host -smp 2 -m 2G \
-object
memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-machine memory-backend=ram0 \
-machine aux-ram-share=on \
-drive file=$ROOTFS,media=disk,if=virtio \
-qmp unix:$QMPSOCK,server=on,wait=off \
-nographic \
-device qxl-vga \
-incoming tcp:0:44444 \
-incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix",
"path": "/var/run/alma8cpr-dst.sock"}}'
Launch the migration:
QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
QMPSOCK=/var/run/alma8qmp-src.sock
$QMPSHELL -p $QMPSOCK <<EOF
migrate-set-parameters mode=cpr-transfer
migrate
channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
EOF
Then, after a while, QXL guest driver on target crashes spewing the
following messages:
[ 73.962002] [TTM] Buffer eviction failed
[ 73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
[ 73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate
VRAM BO
That seems to be a known kernel QXL driver bug:
https://lore.kernel.org/all/20220907094423.93581-1-min_h...@163.com/T/
https://lore.kernel.org/lkml/ztgydqrlk6wx_...@eldamar.lan/
(the latter discussion contains that reproduce script which speeds up
the crash in the guest):
#!/bin/bash
chvt 3
for j in $(seq 80); do
echo "$(date) starting round $j"
if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != ""
]; then
echo "bug was reproduced after $j tries"
exit 1
fi
for i in $(seq 100); do
dmesg > /dev/tty3
done
done
echo "bug could not be reproduced"
exit 0
The bug itself seems to remain unfixed, as I was able to reproduce that
with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
cpr-transfer code also seems to be buggy as it triggers the crash -
without the cpr-transfer migration the above reproduce doesn't lead to
crash on the source VM.
I suspect that, as cpr-transfer doesn't migrate the guest memory, but
rather passes it through the memory backend object, our code might
somehow corrupt the VRAM. However, I wasn't able to trace the
corruption so far.
Could somebody help the investigation and take a look into this? Any
suggestions would be appreciated. Thanks!