On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
> On 2/28/25 8:20 PM, Steven Sistare wrote:
>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
>>>> Hi all,
>>>>
>>>> We've been experimenting with cpr-transfer migration mode recently and
>>>> have discovered the following issue with the guest QXL driver:
>>>>
>>>> Run migration source:
>>>>> EMULATOR=/path/to/emulator
>>>>> ROOTFS=/path/to/image
>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>
>>>>> $EMULATOR -enable-kvm \
>>>>>     -machine q35 \
>>>>>     -cpu host -smp 2 -m 2G \
>>>>>     -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>     -machine memory-backend=ram0 \
>>>>>     -machine aux-ram-share=on \
>>>>>     -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>     -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>     -nographic \
>>>>>     -device qxl-vga
>>>>
>>>> Run migration target:
>>>>> EMULATOR=/path/to/emulator
>>>>> ROOTFS=/path/to/image
>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
>>>>> $EMULATOR -enable-kvm \
>>>>>     -machine q35 \
>>>>>     -cpu host -smp 2 -m 2G \
>>>>>     -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>     -machine memory-backend=ram0 \
>>>>>     -machine aux-ram-share=on \
>>>>>     -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>     -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>     -nographic \
>>>>>     -device qxl-vga \
>>>>>     -incoming tcp:0:44444 \
>>>>>     -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
>>>>
>>>>
>>>> Launch the migration:
>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>
>>>>> $QMPSHELL -p $QMPSOCK <<EOF
>>>>> migrate-set-parameters mode=cpr-transfer
>>>>> migrate channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
>>>>> EOF
>>>>
>>>> Then, after a while, QXL guest driver on target crashes spewing the
>>>> following messages:
>>>>> [ 73.962002] [TTM] Buffer eviction failed
>>>>> [ 73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
>>>>> [ 73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
>>>>
>>>> That seems to be a known kernel QXL driver bug:
>>>>
>>>> https://lore.kernel.org/all/20220907094423.93581-1-min_h...@163.com/T/
>>>> https://lore.kernel.org/lkml/ztgydqrlk6wx_...@eldamar.lan/
>>>>
>>>> (the latter discussion contains that reproduce script which speeds up
>>>> the crash in the guest):
>>>>> #!/bin/bash
>>>>>
>>>>> chvt 3
>>>>>
>>>>> for j in $(seq 80); do
>>>>>     echo "$(date) starting round $j"
>>>>>     if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
>>>>>         echo "bug was reproduced after $j tries"
>>>>>         exit 1
>>>>>     fi
>>>>>     for i in $(seq 100); do
>>>>>         dmesg > /dev/tty3
>>>>>     done
>>>>> done
>>>>>
>>>>> echo "bug could not be reproduced"
>>>>> exit 0
>>>>
>>>> The bug itself seems to remain unfixed, as I was able to reproduce that
>>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
>>>> cpr-transfer code also seems to be buggy as it triggers the crash -
>>>> without the cpr-transfer migration the above reproduce doesn't lead to
>>>> crash on the source VM.
>>>>
>>>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
>>>> rather passes it through the memory backend object, our code might
>>>> somehow corrupt the VRAM. However, I wasn't able to trace the
>>>> corruption so far.
>>>>
>>>> Could somebody help the investigation and take a look into this? Any
>>>> suggestions would be appreciated. Thanks!
>>>
>>> Possibly some memory region created by qxl is not being preserved.
>>> Try adding these traces to see what is preserved:
>>>
>>> -trace enable='*cpr*'
>>> -trace enable='*ram_alloc*'
>>
>> Also try adding this patch to see if it flags any ram blocks as not
>> compatible with cpr. A message is printed at migration start time.
>> https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-steven.sist...@oracle.com/
>>
>> - Steve
>>
>
> With the traces enabled + the "migration: ram block cpr blockers" patch
> applied:
>
> Source:
>> cpr_find_fd pc.bios, id 0 returns -1
>> cpr_save_fd pc.bios, id 0, fd 22
>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host 0x7fec18e00000
>> cpr_find_fd pc.rom, id 0 returns -1
>> cpr_save_fd pc.rom, id 0, fd 23
>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host 0x7fec18c00000
>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 24 host 0x7fec18a00000
>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 25 host 0x7feb77e00000
>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27 host 0x7fec18800000
>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 28 host 0x7feb73c00000
>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34 host 0x7fec18600000
>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35 host 0x7fec18200000
>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36 host 0x7feb8b600000
>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host 0x7feb8b400000
>>
>> cpr_state_save cpr-transfer mode
>> cpr_transfer_output /var/run/alma8cpr-dst.sock
>
> Target:
>> cpr_transfer_input /var/run/alma8cpr-dst.sock
>> cpr_state_load cpr-transfer mode
>> cpr_find_fd pc.bios, id 0 returns 20
>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host 0x7fcdc9800000
>> cpr_find_fd pc.rom, id 0 returns 19
>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host 0x7fcdc9600000
>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 18 host 0x7fcdc9400000
>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 17 host 0x7fcd27e00000
>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16 host 0x7fcdc9200000
>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 15 host 0x7fcd23c00000
>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14 host 0x7fcdc8800000
>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13 host 0x7fcdc8400000
>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11 host 0x7fcdc8200000
>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host 0x7fcd3be00000
>
> Looks like both vga.vram and qxl.vram are being preserved (with the same
> addresses), and no incompatible ram blocks are found during migration.
>

Sorry, the addresses are not the same, of course. However, the corresponding
ram blocks do seem to be preserved and initialized.
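
In case it helps anyone cross-check the two traces without eyeballing them, here is a rough Python sketch of the comparison I did by hand. It assumes the trace lines have exactly the format shown in the output quoted in this thread (cpr_save_fd, cpr_find_fd, qemu_ram_alloc_shared); the function names parse_trace and compare_traces are made up for illustration and are not part of QEMU. It only checks that every block saved on the source gets a valid fd on the target and that size/max_size match. It says nothing about the contents of the memory, so it cannot detect VRAM corruption.

```python
import re

# Patterns for the trace lines as they appear in the output quoted in this
# thread. Any other trace format is not handled (assumption).
SAVE_RE = re.compile(r"cpr_save_fd (\S+), id (\d+), fd (\d+)")
FIND_RE = re.compile(r"cpr_find_fd (\S+), id (\d+) returns (-?\d+)")
ALLOC_RE = re.compile(
    r"qemu_ram_alloc_shared (\S+) size (\d+) max_size (\d+) fd (\d+)"
)


def parse_trace(text):
    """Return (saved, found, sizes) dicts keyed by ram block name."""
    saved, found, sizes = {}, {}, {}
    for line in text.splitlines():
        if m := SAVE_RE.search(line):
            saved[m[1]] = int(m[3])
        elif m := FIND_RE.search(line):
            found[m[1]] = int(m[3])
        elif m := ALLOC_RE.search(line):
            sizes[m[1]] = (int(m[2]), int(m[3]))
    return saved, found, sizes


def compare_traces(src_text, dst_text):
    """List blocks saved on the source that look wrong on the target."""
    src_saved, _, src_sizes = parse_trace(src_text)
    _, dst_found, dst_sizes = parse_trace(dst_text)
    problems = []
    for name in src_saved:
        # A missing entry or a negative fd means the target did not get it.
        if dst_found.get(name, -1) < 0:
            problems.append(f"{name}: fd not found on target")
        elif src_sizes.get(name) != dst_sizes.get(name):
            problems.append(
                f"{name}: size mismatch {src_sizes.get(name)} "
                f"vs {dst_sizes.get(name)}"
            )
    return problems
```

Feeding it the source and target trace text as two strings (for example read from two log files, whose names are up to you) returns an empty list when all saved blocks are found with matching sizes, which is consistent with what I see above.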