On 2/28/2025 1:13 PM, Steven Sistare wrote:
On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
Hi all,

We've been experimenting with cpr-transfer migration mode recently and
have discovered the following issue with the guest QXL driver:

Run migration source:
EMULATOR=/path/to/emulator
ROOTFS=/path/to/image
QMPSOCK=/var/run/alma8qmp-src.sock

$EMULATOR -enable-kvm \
     -machine q35 \
     -cpu host -smp 2 -m 2G \
     -object 
memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
     -machine memory-backend=ram0 \
     -machine aux-ram-share=on \
     -drive file=$ROOTFS,media=disk,if=virtio \
     -qmp unix:$QMPSOCK,server=on,wait=off \
     -nographic \
     -device qxl-vga

Run migration target:
EMULATOR=/path/to/emulator
ROOTFS=/path/to/image
QMPSOCK=/var/run/alma8qmp-dst.sock
$EMULATOR -enable-kvm \
     -machine q35 \
     -cpu host -smp 2 -m 2G \
     -object 
memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
     -machine memory-backend=ram0 \
     -machine aux-ram-share=on \
     -drive file=$ROOTFS,media=disk,if=virtio \
     -qmp unix:$QMPSOCK,server=on,wait=off \
     -nographic \
     -device qxl-vga \
     -incoming tcp:0:44444 \
     -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", 
"path": "/var/run/alma8cpr-dst.sock"}}'


Launch the migration:
QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
QMPSOCK=/var/run/alma8qmp-src.sock

$QMPSHELL -p $QMPSOCK <<EOF
     migrate-set-parameters mode=cpr-transfer
     migrate 
channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
EOF

Then, after a while, QXL guest driver on target crashes spewing the
following messages:
[   73.962002] [TTM] Buffer eviction failed
[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate 
VRAM BO

That seems to be a known kernel QXL driver bug:

https://lore.kernel.org/all/20220907094423.93581-1-min_h...@163.com/T/
https://lore.kernel.org/lkml/ztgydqrlk6wx_...@eldamar.lan/

(the latter discussion contains that reproduce script which speeds up
the crash in the guest):
#!/bin/bash

chvt 3

for j in $(seq 80); do
         echo "$(date) starting round $j"
         if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" 
]; then
                 echo "bug was reproduced after $j tries"
                 exit 1
         fi
         for i in $(seq 100); do
                 dmesg > /dev/tty3
         done
done

echo "bug could not be reproduced"
exit 0

The bug itself seems to remain unfixed, as I was able to reproduce that
with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
cpr-transfer code also seems to be buggy as it triggers the crash -
without the cpr-transfer migration the above reproduce doesn't lead to
crash on the source VM.

I suspect that, as cpr-transfer doesn't migrate the guest memory, but
rather passes it through the memory backend object, our code might
somehow corrupt the VRAM.  However, I wasn't able to trace the
corruption so far.

Could somebody help the investigation and take a look into this?  Any
suggestions would be appreciated.  Thanks!

Possibly some memory region created by qxl is not being preserved.
Try adding these traces to see what is preserved:

-trace enable='*cpr*'
-trace enable='*ram_alloc*'

Also try adding this patch to see if it flags any ram blocks as not
compatible with cpr.  A message is printed at migration start time.
  
https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-steven.sist...@oracle.com/

- Steve



Reply via email to