Re: [RFC V2 0/8] Live update: tap and vhost

Steven Sistare Tue, 02 Sep 2025 20:52:32 -0700

On 9/1/2025 7:44 AM, Vladimir Sementsov-Ogievskiy wrote:

On 29.08.25 22:37, Steven Sistare wrote:

On 8/28/2025 11:48 AM, Steven Sistare wrote:

On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:

On 17.07.25 21:39, Steve Sistare wrote:

Tap and vhost devices can be preserved during cpr-transfer using
traditional live migration methods, wherein the management layer
creates new interfaces for the target and fiddles with 'ip link'
to deactivate the old interface and activate the new.


However, CPR can simply send the file descriptors to new QEMU,
with no special management actions required.  The user enables
this behavior by specifying '-netdev tap,cpr=on'.  The default
is cpr=off.
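
As a rough illustration of the underlying mechanism (not QEMU code): an open
file descriptor can be handed to another process over a unix socket as
SCM_RIGHTS ancillary data. The sketch below uses Python's socket.send_fds()
and socket.recv_fds() (Python 3.9+, unix only). The function name, the
socketpair standing in for the cpr channel, and the pipe standing in for a
tap fd are all assumptions made for the example.

```python
import os
import socket


def send_fd_demo():
    """Pass one end of a pipe over a unix socketpair and use the received copy."""
    # The socketpair stands in for the cpr channel between old and new QEMU;
    # the pipe stands in for a tap fd. Both are illustrative only.
    old_side, new_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    rfd, wfd = os.pipe()
    try:
        # Sender: attach the descriptor as SCM_RIGHTS data to a small message.
        socket.send_fds(old_side, [b"tap0"], [wfd])

        # Receiver: gets a new descriptor number that refers to the same
        # open file description as wfd in the sender.
        msg, fds, _flags, _addr = socket.recv_fds(new_side, 1024, 1)
        received = fds[0]

        # Writing through the received fd is visible on the original pipe.
        os.write(received, b"hello")
        data = os.read(rfd, 5)
        os.close(received)
        return msg, data
    finally:
        os.close(rfd)
        os.close(wfd)
        old_side.close()
        new_side.close()
```

The receiver ends up with a different descriptor number for the same open
file, which is why no interface needs to be recreated on the target side.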


Hi Steve!

First, me trying to test the series:


Thank you, Vladimir, for all the work you are doing in this area.  I have
reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
Let me dig into that before I study the larger questions you pose
about preserving tap/vhost-user-blk in local migration versus cpr.


I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
the blocking fd problems which you allude to.  The attached patch fixes
them, and will be squashed into the series.

Ben, you also reported the !r assertion failure, so this fix should help
you also.

SOURCE:

sudo build/qemu-system-x86_64 -display none -vga none -device 
pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device 
pcie-root-port,id=s0,slot=0,bus=pcie.1 -device 
pcie-root-port,id=s1,slot=1,bus=pcie.1 -device 
pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda 
/home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 
-nodefaults -vga std -qmp stdio -msg timestamp -S -object 
memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine 
memory-backend=ram0 -machine aux-ram-share=on

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, 
"ifname": "tap0", "type": "tap", "id": "netdev.1"}}
{"return": {}}
{"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, 
"mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
{"return": {}}
{"execute": "cont"}
{"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": 
"RESUME"}
{"return": {}}
{"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": 
{"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
{"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
{"return": {}}
{"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, 
{"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
{"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
{"return": {}}

TARGET:

sudo build/qemu-system-x86_64 -display none -vga none -device 
pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device 
pcie-root-port,id=s0,slot=0,bus=pcie.1 -device 
pcie-root-port,id=s1,slot=1,bus=pcie.1 -device 
pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda 
/home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 
-nodefaults -vga std -qmp stdio -S -object 
memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine 
memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": 
{ "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'

<need to wait until "migrate" on source>

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, 
"ifname": "tap0", "type": "tap", "id": "netdev.1"}}
{"return": {}}
{"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, 
"mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
could not disable queue
qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: 
Assertion `!r' failed.
fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT 
(Abort)

So, it crashes on device_add..

Second, I've come a long way, backporting your TAP v1 series together with 
needed parts of CPR and migration channels to QEMU 7.2, fixing different issues 
(like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous 
use of tap on source and target, avoid making the fd blocking again on target), 
and it finally started to work.

But next, I went to support similar migration for vhost-user-blk, and that was 
a lot more complex. There is no reason to pass an fd in a preliminary stage, when 
the source is running (like in CPR), because:

1. we just can't use the fd on target at all, until we stop using it on source, 
otherwise we just break the vhost-user-blk protocol on the wire (unlike TAP, where 
some ioctls called on target don't break source)
2. we have to pass enough additional variables, which are simpler to pass 
through the normal migration channel (how to pass anything except fds through 
the cpr channel?)


You can pass extra state through the cpr channel.  See for example
vmstate_cpr_vfio_device, and how vmstate_cpr_vfio_devices is defined as a
sub-section of vmstate_cpr_state.


Oh, I missed this.

Hmm. Still, in the end CPR becomes just an additional stage of migration, which is 
done prior to device initialization on target..

Didn't you think of integrating it into the common scheme: so that devices may 
have .vmsd_cpr in addition to .vmsd? This way we don't need a global CPR state, 
and the CPR stage of migration will work the same way as normal migration?


I proposed a single migration stream containing pre-create state that was read
early, but that was rejected as too complex.

I also proposed refactoring initialization so the monitor and migration streams
could be opened earlier, but again rejected as too complex and/or not
consistent with a long term vision for reworking initialization.

Still2, if we pass some state in CPR it should be a kind of constant. We need a 
guarantee that it will not change between migration start and source stop.

So, I decided to go another way, and just migrate everything backend-related including 
fds through main migration channel. Of course, this requires deep reworking of device 
initialization in case of incoming migration (but for vhost-user-blk we need it anyway). 
The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local 
migration" (you are in CC).


You did a lot of work in those series!
I suspect much less rework of initialization is required if you pass variables 
in cpr state.


Not sure. I had to rework initialization anyway, as initialization damaged the 
connection. And this led me to the idea "if rework anyway, why not go with one 
migration channel".

The success with vhost-user-blk (of course) made me rethink TAP migration too: try to 
avoid using an additional cpr channel and unusual waiting for the QMP interface on target. And, 
I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"


Is there a use case for this outside of CPR?


It just works without CPR.. Will CPR bring more benefit if I enable it in the 
setup with my local-tap + local-vhost-user-blk capabilities ( + ignore-shared 
of course)?

CPR is intended to be the "local migration" solution that does it all :)
But if you do proceed with your local migration tap solution, I would want
to see that CPR could also use your code paths.

CPR can transparently use my code: you may enable both CPR and local-tap
capability and it should work. Some devices will migrate their fds through
CPR, TAP fds and state will migrate through main migration channel.


OK, I believe that.

I also care about cpr-exec mode.  We use it internally, and I am trying to push
it upstream:
  
https://lore.kernel.org/qemu-devel/1755191843-283480-8-git-send-email-steven.sist...@oracle.com/
I believe it would work with your code.  Migrated fd's in both the cpr channel
and the main migration channel would be handled differently as shown in
vmstate-types.c get_fd() and put_fd().  The fd is kept open across execv(),
and vmstate represents the fd by its value (eg a small integer), rather than
as an object in the unix channel.
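
To make the "fd by its value" idea concrete, here is a minimal sketch, not
taken from QEMU: a descriptor marked inheritable stays open in the exec'd
program, so only its integer needs to be communicated. For testability the
sketch uses fork+exec through subprocess rather than an in-place execv(),
and the helper name, the pipe, and the command-line argument carrying the
fd number are assumptions for illustration.

```python
import os
import subprocess
import sys


def exec_with_inherited_fd():
    """Keep a pipe fd open across exec and tell the child its integer value."""
    rfd, wfd = os.pipe()
    try:
        # Clear close-on-exec so the descriptor survives exec with the
        # same number in the new program image.
        os.set_inheritable(wfd, True)

        # The child learns the descriptor only as an integer (argv[1]);
        # nothing is sent over a socket.
        child_code = "import os, sys; os.write(int(sys.argv[1]), b'ok')"
        subprocess.run(
            [sys.executable, "-c", child_code, str(wfd)],
            pass_fds=[wfd],  # keep this fd open in the child
            check=True,
        )

        # The parent's read end sees what the child wrote through wfd.
        return os.read(rfd, 2)
    finally:
        os.close(rfd)
        os.close(wfd)
```

In contrast to SCM_RIGHTS passing over a unix socket, the descriptor number
is unchanged here, which is what allows the saved state to record it as a
plain integer.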

Making both channels unix sockets should not be a considerable overhead, I 
think.

Why I like my solution more:

- no additional channel
- no additional logic in management software (to handle target start with no QMP access 
until "migrate" command on source)
- less code to backport (that's personal, of course not an argument for final 
upstream solution)

It seems that CPR is simpler to support as we don't need to do deep rework of 
initialization code.. But in reality, there is a lot of work anyway: the TAP and 
vhost-user-blk cases prove this. Your series about vfio are also huge.


TAP is the only case where we can compare both approaches, and the numbers tell
the story:

  TAP initialization refactoring: 277 insertions(+), 308 deletions(-)
  live-TAP local migration:       681 insertions(+), 72 deletions(-)
                        total:    958 insertions(+), 380 deletions(-)

  Live update tap and vhost:      223 insertions(+), 55 deletions(-)

For any given system, if the maintainers accept the larger amount of change,
then local migration is cool (and CPR made it possible by adding fd support
to vmstate+QEMUFile).  But the amount of change is a harder sell.

What is the benefit of CPR against simple (unix-socket) migration?

CPR supports vfio, iommufd, and pinned memory.  Memory backend objects are
created early, before the main migration stream is read, and squashing
CPR into migration for those cases would require a major change in how
qemu creates objects during live migration.

Hence CPR is the method that works for all types of objects.  The mgmt
layer does not need to support multiple methods of live update, depending
on what devices the VM contains.

- Steve
