On Sat, Jan 17, 2026 at 08:49:13PM +0100, Lukas Straub wrote:
> On Thu, 15 Jan 2026 18:38:51 -0500
> Peter Xu <[email protected]> wrote:
> 
> > On Thu, Jan 15, 2026 at 10:59:47PM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu ([email protected]) wrote:  
> > > > On Thu, Jan 15, 2026 at 10:49:29PM +0100, Lukas Straub wrote:  
> > > > > Nack.
> > > > > 
> > > > > This code has users, as explained in my other email:
> > > > > https://lore.kernel.org/qemu-devel/20260115224516.7f0309ba@penguin/T/#mc99839451d6841366619c4ec0d5af5264e2f6464
> > > > >   
> > > > 
> > > > Please then rework that series and consider including the following (I
> > > > believe I pointed out a long time ago somewhere..):
> > > >   
> > >   
> > > > - Some form of justification of why multifd needs to be enabled for 
> > > > COLO.
> > > >   For example, in your cluster deployment, using multifd can improve XXX
> > > >   by YYY.  Please describe the use case and improvements.  
> > > 
> > > That one is pretty easy; since COLO is regularly taking snapshots, the
> > > faster the snapshotting, the less overhead there is.
> > 
> > Thanks for chiming in, Dave.  I can explain why I want to request some
> > numbers.
> > 
> > Firstly, numbers normally prove it's used in a real system.  It's at least
> > being used and seriously tested.
> > 
> > Secondly, per my very limited understanding of COLO... the two VMs in most
> > cases should already be in sync when both sides generate the same
> > network packets.
> > 
> > Another sync (where multifd can start to take effect) is only needed when
> > there are packet misalignments, but IIUC that should be rare.  I don't know
> > how rare it is; it would be good if Lukas could share some of those
> > numbers from his deployment to help us understand COLO better if we'll need
> > to keep it.
> 
> It really depends on the workload and whether you want to tune for
> throughput or latency.

Thanks for all the answers from all of you.

If we decide to keep COLO, it looks like I'll have no choice but to understand
it better, as long as it still has anything to do with migration..  I'll leave
some more questions / comments at the end.

> 
> You need to do a checkpoint eventually and the more time passes between
> checkpoints the more dirty memory you have to transfer during the
> checkpoint.
> 
> Also keep in mind that the guest is stopped during checkpoints, because
> even if we continued running the guest, we could not release the mismatched
> packets, since that would expose a state of the guest to the outside
> world that is not yet replicated to the secondary.

Yes, this makes sense.  However, it is also a very confusing part of COLO.

When I said "I was expecting migration to not be the hot path", one reason
is I believe COLO checkpoint (or say, when migration happens) will
introduce a larger downtime than normal migration, because this process
transfers RAM with both VMs stopped.

You helped explain why that large downtime is needed, thanks.  However, it
then means that either (1) a packet misalignment, or (2) the periodic timer
firing, will kick off a checkpoint..

I don't know whether COLO deployments care about such a relatively large
downtime, especially since it does not happen just once, but periodically,
at least every tens of seconds (assuming that with periodic checkpoints,
packet misalignments are rare).
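
Just as a rough illustration (numbers are hypothetical, not measured): if the
guest dirties 2GB between two checkpoints and the channel sustains 10Gbps,
the RAM transfer alone already means ~1.6 seconds of stopped time per
checkpoint, on top of the device state.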

> 
> So migration performance is actually the most important part of
> COLO, to keep the checkpoints as short as possible.

IIUC, if a heartbeat is lost on the PVM _during_ a checkpoint sync, then the
SVM can only roll back to the last completed checkpoint.  Would this be good
enough in reality?  It means that an in-flight TCP transaction may break
anyway.  x-checkpoint-delay / periodic checkpoints definitely make this
more likely to happen.

> 
> I have quite a few more performance and cleanup patches on my hands,
> for example to transfer dirty memory between checkpoints.
> 
> > 
> > IIUC, the critical path of COLO shouldn't be migration on its own?  It
> > should be when the heartbeat gets lost; that normally should happen when
> > the two VMs are in sync.  In this path, I don't see how multifd helps..
> > because there's no migration happening, only the src recording what has
> > changed.  Hence I think some numbers with a description of the measurements
> > may help us understand how important multifd is to COLO.
> > 
> > Supporting multifd will inject new COLO functions into core migration
> > code paths (even if not many..).  I want to make sure such (new)
> > complexity is justified.  I also want to avoid introducing a feature only
> > because "we have XXX, so let's support XXX in COLO too, maybe some day
> > it'll be useful".
> 
> What COLO needs from migration at the low level:
> 
> Primary/Outgoing side:
> 
> Not much actually, we just need a way to incrementally send the
> dirtied memory and the full device state.
> Also, we ensure that migration never actually finishes since we will
> never do a switchover. For example we never set
> RAMState::last_stage with COLO.
> 
> Secondary/Incoming side:
> 
> colo cache:
> Since the secondary always needs to be ready to take over (even during
> checkpointing), we cannot write the received ram pages directly to
> the guest ram, otherwise it could end up with half of the old and half
> of the new contents.
> So we redirect the received ram pages to the colo cache. This is
> basically a mirror of the primary side ram.
> It also simplifies the primary side since from its point of view it's
> just a normal migration target. So the primary side doesn't have to care
> about dirtied pages on the secondary, for example.
> 
> Dirty Bitmap:
> With COLO we also need a dirty bitmap on the incoming side to track
> 1. pages dirtied by the secondary guest
> 2. pages dirtied by the primary guest (incoming ram pages)
> In the last step of the checkpoint, this bitmap is then used
> to overwrite the guest ram with the colo cache contents so the secondary
> guest is in sync with the primary guest.
> 
> All this individually is very little code as you can see from my
> multifd patch. Just something to keep in mind I guess.
> 
> 
> At the high level we have the COLO framework outgoing and incoming
> threads, which just tell the migration code to:
> Send all ram pages (qemu_savevm_live_state()) on the outgoing side,
> paired with qemu_loadvm_state_main() on the incoming side.
> Send the device state (qemu_save_device_state()), paired with writing
> that stream to a buffer on the incoming side.
> And finally, flush the colo cache and load the device state on the
> incoming side.
> 
> And of course we coordinate with the colo block replication and
> colo-compare.

Thank you.  Maybe you should generalize some of these explanations and put
them into docs/devel/migration/ somewhere.  I think many of these details are
not mentioned in the doc on how COLO works internally.
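
Also, just to double check that I followed the flush step: my mental model of
it is roughly the sketch below.  This is my own paraphrase from memory, using
the RAMBlock fields and bitops helpers as I recall them, not the actual
colo_flush_ram_cache() code, so please correct me where it's off:

    /* Sketch only: flush the colo cache into guest RAM on the SVM */
    static void colo_flush_one_block(RAMBlock *block)
    {
        unsigned long npages = block->used_length >> TARGET_PAGE_BITS;
        unsigned long page = 0;

        for (;;) {
            /* next page dirtied either by the PVM stream or by the SVM guest */
            page = find_next_bit(block->bmap, npages, page);
            if (page >= npages) {
                break;
            }
            clear_bit(page, block->bmap);
            /* overwrite guest RAM with the mirrored copy from the colo cache */
            memcpy(block->host + (page << TARGET_PAGE_BITS),
                   block->colo_cache + (page << TARGET_PAGE_BITS),
                   TARGET_PAGE_SIZE);
            page++;
        }
    }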

Let me ask some more questions while I'm reading COLO today:

- For each checkpoint (colo_do_checkpoint_transaction()), COLO will
  do the following:

    bql_lock()
    vm_stop_force_state(RUN_STATE_COLO)     # stop vm
    bql_unlock()

    ...
  
    bql_lock()
    qemu_save_device_state()                # into a temp buffer fb
    bql_unlock()

    ...

    qemu_savevm_state_complete_precopy()    # send RAM, directly to the wire
    qemu_put_buffer(fb)                     # push temp buffer fb to wire

    ...

    bql_lock()
    vm_start()                              # start vm
    bql_unlock()

  A few questions that I didn't ask previously:

  - If the VM is stopped anyway, why put the device state into a temp
    buffer, instead of using what we already have for the precopy phase, or
    just pushing everything directly to the wire?

  - The operation above frequently releases the BQL; why is that needed?  What
    happens if (within a window where the BQL is released) someone invokes the
    QMP command "cont", causing the VM to start?  Would COLO be broken by that?
    Should we take the BQL for the whole process to avoid it?

- Does colo_cache have a size limit, or should we expect the SVM to consume
  double the guest RAM size?  I didn't see where colo_cache is released
  during each sync (e.g. after colo_flush_ram_cache()).  I am expecting that
  over time the SVM will have most of its pages touched, so the colo_cache
  can consume as much as the guest memory on the SVM.
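
  For context on why I am asking: IIUC the cache is set up at COLO start as a
  full-size mirror per RAMBlock, roughly like this sketch (my paraphrase from
  memory, not the exact code; the allocation call here is only illustrative):

    RAMBlock *block;

    RAMBLOCK_FOREACH_NOT_IGNORED(block) {
        /*
         * One full-size mirror per RAMBlock, i.e. effectively a second
         * copy of guest RAM on the SVM.
         */
        block->colo_cache = qemu_anon_ram_alloc(block->used_length, NULL,
                                                false, false);
        memcpy(block->colo_cache, block->host, block->used_length);
    }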

Thanks,

-- 
Peter Xu

