Re: [PATCH 0/3] migration: Error fixes and improvements

Peter Xu Wed, 19 Nov 2025 12:59:27 -0800

On Wed, Nov 19, 2025 at 08:45:39AM +0100, Markus Armbruster wrote:

[...]


> The hairy part is the background task.
> 
> I believe it used to simply do its job, reporting errors to stderr along
> the way, until it either succeeded or failed.  The errors reported made
> success / failure "obvious" for users.
> 
> This can report multiple errors, which can be confusing.
> 
> Worse, it was no good for management applications.  These need to
> observe migration as a state machine, with final success and error
> states, where the error state comes with an indication of what went
> wrong.  So we made migration store the first of certain errors in the
> migration state in addition to reporting to stderr.
> 
> "First", because we store only when the state doesn't already have an
> error.  "Certain", because I doubt we do it for all errors we report.
> 
> Compare this to how jobs solve this problem.  These are a much, much
> later invention, and designed for management applications from the
> start[*].  A job is a state machine.  Management applications can
> observe and control the state.  Errors are not supposed to be reported,
> they should be fed to the state machine, which goes into an error state
> then.  The job is not supposed to do actual work in an error state.
> Therefore, no further errors should be possible.  When something goes
> wrong, we get a single error, stored in the job state, where the
> management application can find it.
> 
> Migration is also a state machine, and we long ago retrofitted the means
> for management applications to observe and control the state.  What we
> haven't done is the disciplined feeding of errors to the state machine.
> We can still get multiple errors.  We store the first of certain errors
> where the management application can find it, but whether that error
> suffices to explain what went wrong is a crapshoot.  As long as that's
> the case, we need to spew the other errors to stderr, where a human can
> find them.

Since the idea of reusing Jobs came up once more above, this time I tried
to list explicitly why I think it would be challenging, and maybe not
worthwhile (?), to do so; I might be wrong, though.  I attached the list at
the end of this email, mostly for my own future reference.  Please feel
free to comment on it, or to ignore it altogether!  IMHO it's not directly
relevant to the error reporting issues.

IMHO rewriting migration with Jobs will not help much with error reporting,
because the challenge in refactoring on the migration side is not the
"Jobs" interface but migration's internals.  Say, even if migration
provided a "job", it would be the "job" implementation that reports errors
badly, not the Jobs interface.  The "job" implementation would still need
to manage quite a few threads on its own and make sure their errors are
properly reported, at least to the "job" interface.
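
To make the discipline under discussion concrete, here is a minimal sketch
in C: errors are fed into a state machine, only the first one is stored,
and workers stop once the state is no longer running.  All names
(ToyMigration, toy_set_error(), ...) are hypothetical and are not QEMU's
API.  The sketch is single-threaded; real migration code would have to
serialize toy_set_error() across its threads, which is exactly the hard
part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not QEMU's actual API. */
typedef enum {
    TOY_STATE_RUNNING,
    TOY_STATE_FAILED,
    TOY_STATE_COMPLETED,
} ToyState;

typedef struct {
    ToyState state;
    char error[128];   /* the single error a management app would query */
    unsigned dropped;  /* later errors that were not stored */
} ToyMigration;

static void toy_init(ToyMigration *m)
{
    assert(m != NULL);
    m->state = TOY_STATE_RUNNING;
    m->error[0] = '\0';
    m->dropped = 0;
}

/*
 * Feed an error to the state machine.  Only an error arriving while the
 * state is RUNNING is stored, and it moves the state to FAILED; anything
 * arriving later is merely counted.  Returns true if this call stored
 * the error.  A threaded implementation would need a lock around this.
 */
static bool toy_set_error(ToyMigration *m, const char *msg)
{
    assert(m != NULL && msg != NULL);
    if (m->state != TOY_STATE_RUNNING) {
        m->dropped++;
        return false;
    }
    snprintf(m->error, sizeof(m->error), "%s", msg);
    m->state = TOY_STATE_FAILED;
    return true;
}

/* Workers check this before doing more work, so a failed run stays quiet. */
static bool toy_may_work(const ToyMigration *m)
{
    assert(m != NULL);
    return m->state == TOY_STATE_RUNNING;
}
```

If every worker checked toy_may_work() before each step, the "dropped"
counter would stay at zero and the stored error would be the whole story.
Whenever it can become non-zero, the stored error alone may not explain
the failure.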

That said, I totally agree we should try to improve error reporting in
migration, with or without Jobs.

[...]

> > Maybe I should ping Vladimir on his recent work here?
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > That'll be part of such cleanup effort (and yes unfortunately many
> > migration related cleanups will need a lot of code churns...).
> 
> I know...
> 
> Can we afford modest efforts to reduce the mess one step at a time?

Yes, I'll try to follow up on that.

[...]

> [*] If the job abstraction had been available in time, migration would
> totally be a job.  There's no *design* reason for it being not a job.
> Plenty of implementation and backward compatibility reasons, though.

There might be something in common between the Jobs that the block layer
uses and a migration process.  If so, we can provide a CommonJob and make
MigrationJob and BlockJobs depend on it.
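
As a rough illustration of what such a split could look like, here is a
sketch in C using struct embedding.  CommonJob and MigrationJob are the
names floated above; every field, the ops table, BlockJobSketch and all
functions are made up for the example and do not correspond to existing
QEMU code.

```c
#include <assert.h>
#include <stddef.h>

/* All fields and functions here are hypothetical illustrations. */
typedef struct CommonJob CommonJob;

typedef struct {
    const char *name;
    int (*cancel)(CommonJob *job);   /* returns 0 on success */
} CommonJobOps;

/* The part both kinds of job would share. */
struct CommonJob {
    const CommonJobOps *ops;
    int cancelled;
};

typedef struct {
    CommonJob common;
    unsigned long long bytes_done;   /* block jobs have quantified progress */
} BlockJobSketch;

typedef struct {
    CommonJob common;
    unsigned downtime_limit_ms;      /* migration-only knob */
} MigrationJob;

static int generic_cancel(CommonJob *job)
{
    job->cancelled = 1;
    return 0;
}

static const CommonJobOps block_ops = { "block", generic_cancel };
static const CommonJobOps migration_ops = { "migration", generic_cancel };

static void block_job_sketch_init(BlockJobSketch *bj)
{
    bj->common.ops = &block_ops;
    bj->common.cancelled = 0;
    bj->bytes_done = 0;
}

static void migration_job_init(MigrationJob *mj)
{
    mj->common.ops = &migration_ops;
    mj->common.cancelled = 0;
    mj->downtime_limit_ms = 0;
}

/* Shared entry point: works on any job through the common part. */
static int common_job_cancel(CommonJob *job)
{
    assert(job != NULL && job->ops != NULL);
    return job->ops->cancel(job);
}

/* Migration-specific operation that has no place in the common part. */
static void migration_job_set_downtime(MigrationJob *mj, unsigned ms)
{
    assert(mj != NULL);
    mj->downtime_limit_ms = ms;
}
```

The open question is how much ends up in CommonJob versus in the two
derived structs: in this toy, only cancellation is shared.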

However, I sincerely don't know how much common functionality there will
be.  IOW, I doubt that, even in an imaginary world where we could go back
to when Jobs were designed, we would make migration a Job too (note!
snapshots are definitely too simple a migration scenario..).  Is it
possible that after evaluation we still wouldn't?  I don't know, but I
think it's possible.

Thanks!
Peter




Possible challenges of adopting Jobs in migration flow
======================================================

- Many properties defined by Jobs don't directly suit migration

  - JobStatus is not directly suitable for migration purposes.  There are
    some JobStatus values that I can't think of any use for
    (e.g. JOB_STATUS_WAITING, JOB_STATUS_PENDING), which is fine, because
    we can simply not use them.  But there are other statuses that
    migration needs and that aren't available.  Introducing them seems
    overkill for the block layer's use case.

  - Similarly for JobVerb.  E.g. JOB_VERB_CHANGE doesn't seem to map to
    any concept in migration, while JobVerb lacks quite a few others that
    migration would need (e.g. JOB_VERB_SET_DOWNTIME,
    JOB_VERB_POSTCOPY_START, and more).

  - Similarly, JobInfo reports current-progress (which is required, not
    optional), which may make perfect sense for block jobs.  Migration,
    OTOH, is a convergence-triggered process, or a user-triggered one (in
    the case of postcopy).  It doesn't have quantified progress, only
    "COMPLETED" / "IN_PROGRESS".

  - Another major example, which I have discussed a few times previously:
    Jobs are closely attached to AioContext, while migration has none, and
    migration is moving even further away from the event-driven model.
    See:

    https://lore.kernel.org/all/[email protected]/#t

  There are just too many examples showing that Jobs are defined almost
  only for the block layer.  E.g. job-finalize (which may not make much
  sense in a migration context anyway..) mentions finalizing graph changes,
  which also don't exist in the migration process.

  So if we somehow rewrite migration with Jobs, or had designed Jobs with
  migration in mind, Jobs may need to become very bloated to contain both
  migration and block layer requirements.

- Migration involves "two" QEMU instances instead of one

  I'm guessing existing Jobs operations don't work that way, and providing
  such mechanisms in "Jobs" only for migration may introduce unnecessary
  code that the block layer will never use.

  E.g. postcopy migration attaches the two QEMU instances together to
  represent one VM instance.  I do not have a clear picture in mind yet of
  how we can manage that if we see it as two separate Jobs, one on each
  side, what happens if each side operates on its own Job with different
  purposes, and how we should connect the two Jobs to say they're related
  (or maybe we don't need to?).

- More challenges on dest QEMU (VM loader) than src QEMU

  Unlike on the src side, the dest QEMU, when in an incoming state, is not
  a VM at all yet, but is waiting to receive the migration data to become a
  working VM.  It's not a generic long-running process, but purely a
  listening port of QEMU, where QEMU can do nothing until this "job"
  completes.

  If we think about CPR it's even more complicated, because we essentially
  require part of the incoming process to happen before almost everything
  else.. that may even include monitor initialization.

- Deep integration with other subsystems

  Migration is deeply integrated into many other subsystems (auto-converge
  being able to throttle vCPUs, RAM being able to ignore empty pages
  reported by balloons, per-module dirty tracking, etc.), so we're not
  sure whether some limitation of Jobs (designed with the block layer in
  mind) will make such a transition harder.

  For example, we at least want to make sure Jobs won't have simple locks
  that are held while running migration, which could deadlock if the
  migration code invokes something else that tries to re-take the Jobs
  lock.

  Or, since migration nowadays runs with quite a few concurrent threads,
  whether the main migration Job can always properly synchronize all of
  them without problems (maybe yes, but I just don't know Jobs well enough
  to say).  This also relates to the question of how big a role AioContext
  plays in the core of the Jobs idea and whether it can work well in a
  complicated threaded environment.
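
The lock re-take concern in the last item can be shown with a toy,
single-threaded sketch in C.  ToyLock stands in for a simple non-recursive
lock and uses a trylock, so the example terminates instead of hanging; all
names are hypothetical and nothing here reflects how Jobs actually lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a simple, non-recursive lock. */
typedef struct {
    bool held;
} ToyLock;

static bool toy_trylock(ToyLock *l)
{
    if (l->held) {
        return false;
    }
    l->held = true;
    return true;
}

static void toy_unlock(ToyLock *l)
{
    assert(l->held);
    l->held = false;
}

typedef struct {
    ToyLock lock;
    int reentry_attempts;
} ToyJob;

/* Some other subsystem's hook that happens to look at the job. */
static void toy_subsystem_hook(ToyJob *job)
{
    if (!toy_trylock(&job->lock)) {
        /* With a blocking lock, the caller would hang here forever. */
        job->reentry_attempts++;
        return;
    }
    toy_unlock(&job->lock);
}

/* Migration-like code that calls out while still holding the job lock. */
static void toy_job_run_locked(ToyJob *job)
{
    bool ok = toy_trylock(&job->lock);
    assert(ok);
    (void)ok;
    toy_subsystem_hook(job);
    toy_unlock(&job->lock);
}
```

Calling toy_job_run_locked() records one failed re-take, while calling
toy_subsystem_hook() on its own succeeds.  The hazard only shows up on the
nested path, which is why it is easy to miss.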

-- 
Peter Xu
