On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> >      /* TODO we should move all of this lot into postcopy_ram.c or a shared code
> >       * in migration.c
> >       */
> > -    cpu_synchronize_all_post_init();
> > +    cpu_synchronize_all_post_init(&local_err);
> > +    if (local_err) {
> > +        /*
> > +         * TODO: a better way to do this is to tell the src that we cannot
> > +         * run the VM here so hopefully we can keep the VM running on src
> > +         * and immediately halt the switch-over.  But that needs work.
> 
> Yes, I think it is possible; unlike some of the later errors in the same
> function, in this case we know no disks/network/etc have been touched,
> so we should be able to recover.
> I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> out of loadvm_postcopy_handle_run to after this point.
> 
> We've already got the return path, so we should be able to signal the
> failure unless we're very unlucky.

Right.  It's just that for the new ACK we will most likely need to modify
the return path protocol, because none of the existing messages can carry
such information.

One idea is to reuse MIG_RP_MSG_RESUME_ACK.  So far it has only been used by
postcopy recovery, for the final handshake, and only with payload=1 (which
is defined as MIGRATION_RESUME_ACK_VALUE).  We could try to fill in the
payload with some !1 value to tell the source that we NACK the migration,
so that the src fails the migration as soon as possible?

That even seems to be compatible with the scenario of an old qemu migrating
to a new qemu, because when the old qemu sees a MIG_RP_MSG_RESUME_ACK
message with a !1 payload, it'll mark the rp bad:

  if (migrate_handle_rp_resume_ack(ms, tmp32)) {
      mark_source_rp_bad(ms);
      goto out;
  }

  static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
  {
      trace_source_return_path_thread_resume_ack(value);
  
      if (value != MIGRATION_RESUME_ACK_VALUE) {
          error_report("%s: illegal resume_ack value %"PRIu32,
                       __func__, value);
          return -1;
      }
      ...
  }
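
To make the compatibility argument above concrete, here is a minimal
standalone sketch (not QEMU code).  It only models how the source could
interpret the payload.  The enum, the function names and
HYPOTHETICAL_RESUME_NACK_VALUE are made up for illustration; the value 0 is
an assumption, any !1 value would behave the same on an old source.

```c
#include <assert.h>
#include <stdint.h>

#define MIGRATION_RESUME_ACK_VALUE      1u  /* existing ACK payload */
#define HYPOTHETICAL_RESUME_NACK_VALUE  0u  /* assumption: some !1 value */

/* Hypothetical outcomes of handling MIG_RP_MSG_RESUME_ACK on the source */
enum rp_result {
    RP_RESUMED,     /* destination acknowledged, carry on */
    RP_MARKED_BAD,  /* illegal payload, return path marked bad */
    RP_NACKED,      /* destination explicitly refused to run the VM */
};

/* Old source: any payload other than the ACK value is an error. */
static enum rp_result old_src_handle_resume_ack(uint32_t value)
{
    return value == MIGRATION_RESUME_ACK_VALUE ? RP_RESUMED : RP_MARKED_BAD;
}

/* New source (hypothetical): recognise the NACK payload explicitly. */
static enum rp_result new_src_handle_resume_ack(uint32_t value)
{
    switch (value) {
    case MIGRATION_RESUME_ACK_VALUE:
        return RP_RESUMED;
    case HYPOTHETICAL_RESUME_NACK_VALUE:
        return RP_NACKED;
    default:
        return RP_MARKED_BAD;
    }
}

/* Returns 1 if the source must not complete the switch-over. */
static int migration_fails(enum rp_result r)
{
    assert(r == RP_RESUMED || r == RP_MARKED_BAD || r == RP_NACKED);
    return r != RP_RESUMED;
}
```

In this model both an old and a new source refuse to complete the
switch-over when the destination sends the NACK payload; the new source
just knows why.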

If this looks generally good, I can try such a change in v2.

Thanks,

-- 
Peter Xu

