On Thu, Nov 6, 2025 at 11:21 AM Zhang Chen <[email protected]> wrote:
>
> On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
> <[email protected]> wrote:
> >
> >
> >
> > On 06/11/2025 04:58, Peter Xu wrote:
> > > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> > >> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> > >> state before completed during migration, which broke the original
> > >> transition to COLO. The migration flow for precopy has changed to:
> > >> active -> pre-switchover -> device -> completed.
> > >>
> > >> This patch updates the transition state to ensure that the Pre-COLO
> > >> state corresponds to DEVICE state correctly.
> > >>
> > >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> > >> Signed-off-by: Li Zhijian <[email protected]>
> > >> ---
> > >>   migration/migration.c | 4 ++--
> > >>   1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/migration/migration.c b/migration/migration.c
> > >> index a63b46bbef..6ec7f3cec8 100644
> > >> --- a/migration/migration.c
> > >> +++ b/migration/migration.c
> > >> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
> > >>           goto fail;
> > >>       }
> > >>
> > >> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> > >> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
> > >>           /* COLO does not support postcopy */
> > >> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> > >> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
> > >>                             MIGRATION_STATUS_COLO);
> > >>       } else {
> > >>           migration_completion_end(s);
> > >
> > > Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> > > 10.0/10.1..
> > >
> > > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > > enterprise?  I don't expect any individual to be using it.. It definitely
> > > complicates migration logic all over the place.  Fabiano and I discussed
> > > removing legacy code a few times and COLO was always on the list.
> > >
> > > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > > least tried to re-implement the whole RDMA using rsocket, which didn't land
> > > only because of a perf regression.  Meanwhile, Zhijian also provided a
> > > unit test, which we have relied on recently to not break RDMA at the minimum.
> > >
> > > If we do not have known users, I sincerely want to discuss with you the
> > > obsoletion and removal of COLO from the QEMU codebase.  Do you see that as
> > > feasible?
> > >
> > > Zhijian, do you have any input here?
> >
> >
> > If we don't have any known users, I personally have no objection to 
> > removing COLO.
> >
> > From my previous understanding, its use cases are rather limited, and the
> > checkpointing overhead is significant.
> > Moreover, with the continuous development of Cloud Native over the past
> > decade, service-based FT/HA solutions have become very mature, which
> > shrinks the use cases for VM-based FT solutions even further.
> >
> > I think it's worth keeping if we have:
> >
> > - Active users who depend on it.
> > - A unit test for the COLO framework.
> >
> > Thanks
> > Zhijian
> >
> >
>
> Add CC Lukas.
>
> From a technical point of view, I agree with Zhijian's comments. We can
> probably do this gradually.
> On my side, I know some local companies build their HA/FT products based
> on COLO.
> In this case, I think most of them have already forked the upstream QEMU
> code into a private repo that they maintain internally.
> It may cause some upgrade issues in the future.
>
> Another part is that Lukas covered the pacemaker project's COLO
> integration, and I don't know the user status for pacemaker.
> Maybe Lukas can add some comments?
>
> As for the implementation, COLO is not only the migration part of the
> code (which is the core of COLO); it also includes network and block
> replication parts that work together with it.
> If we remove the migration-related code, we need to consider how to handle
> the other parts: maybe change the network part to a general QEMU
> netfilter? And block replication?
>
> For the COLO framework unit test, I think we need to add some "#if
> defined(qtest)" in the migration code for testing (the COLO proxy/netfilter
> already has an independent qtest).
>
> Thanks
> Chen
>
>

Add pacemaker/corosync related details for COLO:
https://wiki.qemu.org/Features/COLO/Managed_HOWTO


>
>
>
> >
> > >
> > > Thanks,
> > >
