On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
> On 2/28/25 8:20 PM, Steven Sistare wrote:
>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
>>>> Hi all,
>>>>
>>>> We've been experimenting with cpr-transfer migration mode recently and
>>>> have discovered the following issue with the guest QXL driver:
>>>>
>>>> Run migration source:
>>>>> EMULATOR=/path/to/emulator
>>>>> ROOTFS=/path/to/image
>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>
>>>>> $EMULATOR -enable-kvm \
>>>>>      -machine q35 \
>>>>>      -cpu host -smp 2 -m 2G \
>>>>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>      -machine memory-backend=ram0 \
>>>>>      -machine aux-ram-share=on \
>>>>>      -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>      -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>      -nographic \
>>>>>      -device qxl-vga
>>>>
>>>> Run migration target:
>>>>> EMULATOR=/path/to/emulator
>>>>> ROOTFS=/path/to/image
>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
>>>>> $EMULATOR -enable-kvm \
>>>>>      -machine q35 \
>>>>>      -cpu host -smp 2 -m 2G \
>>>>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>      -machine memory-backend=ram0 \
>>>>>      -machine aux-ram-share=on \
>>>>>      -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>      -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>      -nographic \
>>>>>      -device qxl-vga \
>>>>>      -incoming tcp:0:44444 \
>>>>>      -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
>>>>
>>>>
>>>> Launch the migration:
>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>
>>>>> $QMPSHELL -p $QMPSOCK <<EOF
>>>>>      migrate-set-parameters mode=cpr-transfer
>>>>>      migrate channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
>>>>> EOF
>>>>
>>>> Then, after a while, the QXL guest driver on the target crashes, spewing
>>>> the following messages:
>>>>> [   73.962002] [TTM] Buffer eviction failed
>>>>> [   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
>>>>> [   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
>>>>
>>>> That seems to be a known kernel QXL driver bug:
>>>>
>>>> https://lore.kernel.org/all/20220907094423.93581-1-min_h...@163.com/T/
>>>> https://lore.kernel.org/lkml/ztgydqrlk6wx_...@eldamar.lan/
>>>>
>>>> (the latter discussion contains a reproducer script which speeds up
>>>> the crash in the guest):
>>>>> #!/bin/bash
>>>>>
>>>>> chvt 3
>>>>>
>>>>> for j in $(seq 80); do
>>>>>          echo "$(date) starting round $j"
>>>>>          if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
>>>>>                  echo "bug was reproduced after $j tries"
>>>>>                  exit 1
>>>>>          fi
>>>>>          for i in $(seq 100); do
>>>>>                  dmesg > /dev/tty3
>>>>>          done
>>>>> done
>>>>>
>>>>> echo "bug could not be reproduced"
>>>>> exit 0
>>>>
>>>> The bug itself seems to remain unfixed, as I was able to reproduce it
>>>> with a Fedora 41 guest as well as an AlmaLinux 8 guest.  However, our
>>>> cpr-transfer code also seems to be buggy, as it triggers the crash:
>>>> without the cpr-transfer migration the above reproducer doesn't lead
>>>> to a crash on the source VM.
>>>>
>>>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
>>>> rather passes it through the memory backend object, our code might
>>>> somehow corrupt the VRAM.  However, I wasn't able to trace the
>>>> corruption so far.
>>>>
>>>> Could somebody help the investigation and take a look into this?  Any
>>>> suggestions would be appreciated.  Thanks!
>>>
>>> Possibly some memory region created by qxl is not being preserved.
>>> Try adding these traces to see what is preserved:
>>>
>>> -trace enable='*cpr*'
>>> -trace enable='*ram_alloc*'
>>
>> Also try adding this patch to see if it flags any ram blocks as not
>> compatible with cpr.  A message is printed at migration start time.
>>   https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-steven.sist...@oracle.com/
>>
>> - Steve
>>
> 
> With the traces enabled + the "migration: ram block cpr blockers" patch
> applied:
> 
> Source:
>> cpr_find_fd pc.bios, id 0 returns -1
>> cpr_save_fd pc.bios, id 0, fd 22
>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host 0x7fec18e00000
>> cpr_find_fd pc.rom, id 0 returns -1
>> cpr_save_fd pc.rom, id 0, fd 23
>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host 0x7fec18c00000
>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 24 host 0x7fec18a00000
>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 25 host 0x7feb77e00000
>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27 host 0x7fec18800000
>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 28 host 0x7feb73c00000
>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34 host 0x7fec18600000
>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35 host 0x7fec18200000
>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36 host 0x7feb8b600000
>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host 0x7feb8b400000
>>
>> cpr_state_save cpr-transfer mode
>> cpr_transfer_output /var/run/alma8cpr-dst.sock
> 
> Target:
>> cpr_transfer_input /var/run/alma8cpr-dst.sock
>> cpr_state_load cpr-transfer mode
>> cpr_find_fd pc.bios, id 0 returns 20
>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host 0x7fcdc9800000
>> cpr_find_fd pc.rom, id 0 returns 19
>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host 0x7fcdc9600000
>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 18 host 0x7fcdc9400000
>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 17 host 0x7fcd27e00000
>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16 host 0x7fcdc9200000
>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 15 host 0x7fcd23c00000
>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14 host 0x7fcdc8800000
>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13 host 0x7fcdc8400000
>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11 host 0x7fcdc8200000
>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host 0x7fcd3be00000
> 
> Looks like both vga.vram and qxl.vram are being preserved (with the same
> addresses), and no incompatible ram blocks are found during migration.
> 

Sorry, the addresses are not the same, of course.  However, the
corresponding ram blocks do seem to be preserved and initialized.
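
For anyone who wants to repeat the comparison mechanically rather than by
eye, here is a minimal sketch in Python.  It assumes only the
qemu_ram_alloc_shared trace line format shown above; the helper names
(parse_ram_allocs, compare_ram_allocs) are made up for illustration and are
not part of QEMU.  It checks block names and sizes only, so it says nothing
about whether the block contents were corrupted.

```python
import re

# Matches the qemu_ram_alloc_shared trace lines as they appear in this thread.
ALLOC_RE = re.compile(
    r"qemu_ram_alloc_shared (\S+) size (\d+) max_size (\d+) "
    r"fd (\d+) host (0x[0-9a-fA-F]+)"
)


def parse_ram_allocs(trace_text):
    """Return {block_name: (size, max_size)} for every alloc trace found."""
    # Drop mail quote markers ('>') and rejoin lines wrapped by the mailer.
    lines = (line.lstrip("> ").strip() for line in trace_text.splitlines())
    flat = " ".join(" ".join(lines).split())
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in ALLOC_RE.finditer(flat)
    }


def compare_ram_allocs(src_text, dst_text):
    """Compare source and target traces.

    Returns (missing, extra, resized): blocks seen only on the source,
    blocks seen only on the target, and blocks whose sizes differ.
    Host addresses and fd numbers are expected to differ and are ignored.
    """
    src = parse_ram_allocs(src_text)
    dst = parse_ram_allocs(dst_text)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    resized = sorted(n for n in set(src) & set(dst) if src[n] != dst[n])
    return missing, extra, resized
```

Feeding it the two trace dumps (for example saved to hypothetical files
src.trace and dst.trace and read with open(...).read()) should return three
empty lists if every ram block allocated on the source was also found and
allocated with the same sizes on the target.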

