Re: Migration and the Job abstraction

Peter Xu Tue, 02 Dec 2025 11:17:33 -0800

On Tue, Dec 02, 2025 at 02:16:31PM +0100, Markus Armbruster wrote:
> Peter Xu <[email protected]> writes:
> 
> > On Thu, Nov 20, 2025 at 01:16:48PM +0100, Kevin Wolf wrote:
> >> Am 20.11.2025 um 11:30 hat Markus Armbruster geschrieben:
> >> > Peter Xu <[email protected]> writes:
> >> > > On Wed, Nov 19, 2025 at 08:45:39AM +0100, Markus Armbruster wrote:
> >> > >> [*] If the job abstraction had been available in time, migration would
> >> > >> totally be a job.  There's no *design* reason for it being not a job.
> >> > >> Plenty of implementation and backward compatibility reasons, though.
> >> > >
> >> > > There might be something common between Jobs that block uses and a
> >> > > migration process.  If so, we can provide CommonJob and make MigrationJob
> >> > > and BlockJobs dependent on it.
> >> 
> >> Conceptually, live migration and the mirror block job are _really_
> >> similar. You have a bulk copy phase and you keep copying data that has
> >> changed to bring both sides in sync. When both sides are close enough,
> >> you stop new changes from coming in, copy the small remainder and finish
> >> the thing.
> >> 
> >> The main difference is that mirror copies disk content whereas live
> >> migration mostly copies RAM. But that's irrelevant conceptually.
> >
> > True at least until here..
> >
> >> 
> >> So it makes a lot of sense to me that the same user-visible state
> >> machine should be applicable to both.
> >
> > MigrationStatus should have quite some more states that block mirror may
> > not use.  They're added over time.
> >
> >> 
> >> (I'm not saying that we have to do this, just that I expect it to be
> >> possible.)
> >> 
> >> > > Possible challenges of adopting Jobs in migration flow
> >> > > ======================================================
> >> > >
> >> > > - Many properties defined by Jobs don't directly suit migration
> >> > >
> >> > >   - JobStatus is not directly suitable for migration purposes.  There
> >> > >     are some JobStatus values that I can't think of any use for
> >> > >     (e.g. JOB_STATUS_WAITING, JOB_STATUS_PENDING, which is fine, because
> >> > >     we can simply not use them), but there are other statuses that
> >> > >     migration needs but that aren't available.  Introducing them seems
> >> > >     overkill for the block layer's use case.
> >> 
> >> Which other status does live migration have that the user cares about?
> >> 
> >> Does it have to be a status on the Job level or could it be some kind of
> >> substatus that could be represented by job-specific information in
> >> query-jobs? (Which doesn't exist yet, but I think we have been talking
> >> about it multiple times before.)
> >
> > Yes, sub-status might work, but I'm not sure how well even if so.  We can
> > evaluate this when we have more solid idea on switching the code over.
> >
> > Meanwhile, the need of sub-status may be a hint at least to me that
> > migration shouldn't move over.
> 
> Not to me.
> 
> Like the existing jobs, migration would be a specialization of Job.
> 
> Digression: in OO, we use subtypes for specialization.  QAPI doesn't
> really support subtyping.  Instead, we commonly make do with the
> old-fashioned way: unions.  For jobs, we haven't had to.  On output
> (query-job, events), there has been no need for job-specific data.  On
> input, we simply use job-specific commands to create the jobs.
> 
> Having a specialization refine a common state machine feels natural to
> me.


Sure, this indeed isn't a blocker.  So if we want to, we can make migration
fit into the picture.

> 
> > IIUC, the major functionality that the Jobs layer provides is about either
> > Jobs status change, or verbs that can invoke hooks.  If migration cannot
> > leverage Jobs interface to either (1) reduce its own code, or (2) getting
> > improvements, then we don't need to move to Jobs interface either.  IMHO if
> > we can settle the two questions (1,2) above, then we can help decide
> > whether this is worth exploring.
> 
> Yes, these are the right questions, but we should consider external
> interface in addition to implementation.
> 
> Example for reduced interface complexity: generic job-cancel superseding
> migrate_cancel.

Yes, this should be listed as one of the benefits.  There could be more,
e.g. the transaction support mentioned below, even though I'm not yet
familiar with it; I also have some comments further down that may make it
less beneficial.

So we have at least two advantages here (compared to... the rest, which
might be disadvantages or challenges..).

> 
> Example for improved interface: migration gaining a progress meter from
> generic query-jobs.

This may or may not be a benefit in the case of migration..  I keep getting
complaints from people about libvirt migrations being stuck at 99% (libvirt
implements its own meter for migration), only because the meter doesn't
make much sense for most migrations, i.e. precopy.. which is
unfortunate..
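
One plausible way a precopy meter ends up hovering near 99% can be shown
with a toy model.  This is only a sketch under stated assumptions: the
progress formula (transferred / (transferred + remaining)) and the function
name precopy_progress are made up for illustration, and are not how libvirt
or QEMU actually compute progress.

```python
def precopy_progress(total_pages, bandwidth, dirty_rate, rounds):
    """Return the progress ratio after each round of a toy precopy model.

    Hypothetical model: each round sends up to `bandwidth` pages, then the
    running guest dirties `dirty_rate` pages that must be sent again.
    """
    remaining = total_pages
    transferred = 0
    history = []
    for _ in range(rounds):
        sent = min(bandwidth, remaining)
        transferred += sent
        remaining -= sent
        # The guest keeps writing, so some pages need to be re-sent.
        remaining = min(total_pages, remaining + dirty_rate)
        history.append(transferred / (transferred + remaining))
    return history


history = precopy_progress(total_pages=1000, bandwidth=100,
                           dirty_rate=90, rounds=200)
```

In this model the remaining set never drops to zero while the guest keeps
dirtying pages, so after 200 rounds the ratio is above 0.99 but still below
1.0.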

> 
> > [1]
> >
> > I apologize if above was a wrong statement, because that was only based on
> > my quick glimpse over job.c.  Please correct me if so.
> 
> These are the right questions whether you got all the details right or
> not!
> 
> > Maybe there is some QEMU feature that may depend on Jobs so that if
> > migration moved over then migration can also benefit from the feature?
> 
> Transactions?  Like ...
> 
> >> > The Job abstraction defines possible states and state transitions.  Each
> >> > job finds its own path from the initial state @created to the final
> >> > state @concluded.  If a state doesn't make sense for a certain type of
> >> > job, it simply doesn't go there.
> >> 
> >> Note that most states are really managed by the common Job abstraction.
> >> The job itself only goes potentially from RUNNING to READY, and then
> >> implicitly to WAITING/ABORTING when it returns from .run().
> >> 
> >> Everything else only exists so that management tools can orchestrate
> >> jobs in the right way and can query errors before the job disappears.
> >> I'm not sure if WAITING is really useless for migration.
> 
> ... this:
> 
> >>                                                          In theory, you
> >> could have a job transaction of some mirror jobs for the disks and live
> >> migration for the device state, and of course you want both to finish
> >> and switch over at the same time. I'm also not completely sure if it
> >> will actually be used in practice, but the Job infrastructure gives you
> >> the option for free.
> 
> Like Kevin, I can't tell whether anybody wants this.  It does feel
> nifty, doesn't it?

Yep.  Is a transaction about auto-rewind when some job fails (hence,
apply-all or apply-none)?

One thing I should mention here: I am aware that not every transaction in
which migration is involved consists only of QMP commands.

One example is when Kubernetes comes into the picture: if we need e.g.
block snapshots over Kubernetes storage (where block drives are almost
always RAW rather than QCOW2), then we may not always be able to benefit
from a QEMU-only / QMP-only transaction system in the future.

The other thing is, IMHO such a transaction system would be more helpful
when we start to adopt new features, so that we write less code in mgmt.
If libvirt has had all the error handling code for migration in place for
all these years anyway... moving it over may add extra work instead..

And combining the two points above: when a transaction may involve
operations outside QEMU, then IIUC libvirt will always need to manage its
own unwind operation.

So the 2nd benefit looks less appealing to me.  But maybe there are gaps in
my understanding.
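
The apply-all-or-none reading asked about above can be sketched as follows.
This is only a toy model of that reading, not how QEMU's job transactions
are implemented; Step, run_transaction, demo and the step names are all
hypothetical.

```python
class Step:
    """One operation in a transaction, with a way to undo it (hypothetical)."""

    def __init__(self, name, apply, undo):
        self.name = name
        self.apply = apply
        self.undo = undo


def run_transaction(steps):
    """Apply steps in order; on the first failure, undo the applied ones
    in reverse order.  Returns (ok, log)."""
    applied = []
    log = []
    for step in steps:
        try:
            step.apply()
        except Exception as err:
            log.append(f"{step.name}: failed ({err})")
            for done in reversed(applied):
                done.undo()
                log.append(f"{done.name}: undone")
            return False, log
        applied.append(step)
        log.append(f"{step.name}: applied")
    return True, log


def demo():
    """A disk step succeeds, the device-state step fails, the disk step
    is rolled back."""
    done = []

    def fail_switchover():
        raise RuntimeError("switchover failed")

    steps = [
        Step("mirror-disk", lambda: done.append("disk"),
             lambda: done.remove("disk")),
        Step("migrate-device-state", fail_switchover, lambda: None),
    ]
    ok, log = run_transaction(steps)
    return ok, done, log
```

If one of the steps lived outside QEMU (say, a storage-side snapshot), its
undo callback would have to be provided by the management layer, which is
the unwind work described above.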

> 
> >> PENDING and the associated job-finalize already exists in live migration
> >> in the form of pause-before-switchover/pre-switchover status and
> >> migrate-continue. So I don't think you can argue you have no use for it.
> >
> > Yes, if we want, we can map some migration status into some of those.
> >
> >> 
> >> > So, job states migration doesn't want are only a problem if there is no
> >> > path from start to finish that doesn't go through unwanted states.
> >> > 
> >> > There may also be states migration wants that aren't job states.  We
> >> > could make them job states.  Or we map multiple migration states to a
> >> > single job state, i.e. have the job state *abstract* from migration
> >> > state details.
> >> > 
> >> > >   - Similarly for JobVerb.  E.g. JOB_VERB_CHANGE doesn't seem to apply
> >> > >     to any concept in migration, while quite a few others are missing
> >> > >     (e.g. JOB_VERB_SET_DOWNTIME, JOB_VERB_POSTCOPY_START, and more).
> >> 
> >> How is SET_DOWNTIME or POSTCOPY_START not just a form of CHANGE?
> >
> > I don't know, hence I listed it. :) If it fits, it's great.
> >
> > However if so, I wonder why JOB_VERB_SET_SPEED isn't part of CHANGE
> > already.  If we go further, RESUME/DISMISS/... can all be seen as CHANGE.
> 
> Jobs use separate commands to trigger state machine state transitions:
> job-resume, job-dismiss, ...
> 
> For other configuration bits that can be changed while the job runs, all
> we have is block-job-set-speed and block-job-change.  Perhaps these
> should both be superseded by a generic reconfiguration interface.
> 
> Migration's configuration interface has grown over many, many years, and
> it shows.  This isn't criticism!  It's what happens when something is so
> useful that it gets extended again and again.  We've talked about making
> that interface simpler and more regular.  Extending the job interface
> for migration should do that.

Yep, this isn't a blocker either, just like the job status.  But we are now
already touching the fundamental core of Jobs, i.e. status and verbs..

It's just that IMHO the more work we need in Jobs to fit migration, the
more points we need to add to the "disadvantages" side when we compare it
against the advantages.
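
To make the status / verb discussion concrete, here is a minimal sketch of
mapping several migration states onto one job state and then gating verbs
on the job state.  The "dismiss" row (only in @concluded) and the
pre-switchover / PENDING pairing come from this thread; every other mapping
entry, and the names MIGRATION_TO_JOB, VERB_ALLOWED and verb_allowed, are
assumptions for illustration and not existing QEMU code.

```python
# Hypothetical mapping: several migration states collapse into one job
# state, so the job state abstracts from migration details.
MIGRATION_TO_JOB = {
    "setup": "running",
    "active": "running",
    "postcopy-active": "running",
    "pre-switchover": "pending",   # pairing suggested in the thread
    "completed": "concluded",
    "failed": "concluded",
    "cancelled": "concluded",
}

# Verb -> job states in which the verb is accepted (simplified; only
# two verbs are modelled here).
VERB_ALLOWED = {
    "dismiss": {"concluded"},
    "finalize": {"pending"},
}


def verb_allowed(verb, migration_status):
    """Return True if `verb` would be accepted for a migration that is
    currently in `migration_status`, under the mapping above."""
    job_state = MIGRATION_TO_JOB[migration_status]
    return job_state in VERB_ALLOWED[verb]
```

For example, verb_allowed("finalize", "pre-switchover") is accepted while
verb_allowed("dismiss", "active") is rejected.  Anything that needs to
distinguish "active" from "postcopy-active" would need either a new job
state or a job-specific sub-status on top of this.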

Thanks,

> 
> >> > JobVerb is used internally to restrict certain job commands to certain
> >> > job states.  For instance, command job-dismiss is rejected unless job is
> >> > in state @concluded.
> >> > 
> >> > This governs the generic job-FOO commands.  It also covers the legacy
> >> > block-job-FOO commands, because these wrap around the same C core as the
> >> > job-FOO commands.
> >> > 
> >> > We could have commands specific to a certain job type (say migration
> >> > jobs) that bypass the JobVerb infrastructure, and do their own thing to
> >> > restrict themselves to certain states.  Probably stupid if the states
> >> > that matter are job states.  Probably necessary if they aren't (say a
> >> > more fine-grained migration state).
> >> 
> >> I suspect we would have to look at specific examples to figure out how
> >> to represent them best. In general, I think a generic job-change (to be
> >> added as a more generic version of block-job-change) and job-specific
> >> data in query-jobs can cover a lot.
> >> 
> >> You may want to have job-specific QMP events outside of the Job
> >> mechanism, or we could have a generic one to notify the user that
> >> something in the queryable state has changed.
> 
> [...]
> 
> > IMHO if we want to move on with this idea, it'll be great if someone can
> > help answer what major benefits migration can get to move over, as I asked
> > above [1].  We'll likely need to pay quite some for it (including Libvirt
> > adopting the new interface), so I want to double check what we get.
> 
> Fair!
> 
> > Thanks,
> 

-- 
Peter Xu
