On Fri, Feb 14, 2020 at 03:04:23PM -0500, David Hildenbrand wrote:
> 
> 
> > Am 14.02.2020 um 20:45 schrieb Peter Xu <pet...@redhat.com>:
> > 
> > On Fri, Feb 14, 2020 at 07:26:59PM +0100, David Hildenbrand wrote:
> >>>>>> +    if (!postcopy_is_running()) {
> >>>>>> +        Error *err = NULL;
> >>>>>> +
> >>>>>> +        /*
> >>>>>> +         * Precopy code cannot deal with the size of ram blocks changing at
> >>>>>> +         * random points in time. We're still running on the source, abort
> >>>>>> +         * the migration and continue running here. Make sure to wait until
> >>>>>> +         * migration was canceled.
> >>>>>> +         */
> >>>>>> +        error_setg(&err, "RAM resized during precopy.");
> >>>>>> +        migrate_set_error(migrate_get_current(), err);
> >>>>>> +        error_free(err);
> >>>>>> +        migration_cancel();
> >>>>>> +    } else {
> >>>>>> +        /*
> >>>>>> +         * Postcopy code cannot deal with the size of ram blocks changing at
> >>>>>> +         * random points in time. We're running on the target. Fail hard.
> >>>>>> +         *
> >>>>>> +         * TODO: How to handle this in a better way?
> >>>>>> +         */
> >>>>>> +        error_report("RAM resized during postcopy.");
> >>>>>> +        exit(-1);
> >>>>> 
> >>>>> Now I'm rethinking the postcopy case....
> >>>>> 
> >>>>> ram_dirty_bitmap_reload() should only happen during the postcopy
> >>>>> recovery, and when that happens the VM should be stopped on both
> >>>>> sides.  Which means, ram resizing should not trigger there...
> >>>> 
> >>>> But the guest got the chance to run for a bit and eventually reboot
> >>>> AFAIK. Also, there are other data races possible when used_length
> >>>> suddenly changes; this is just the most obvious one where things will
> >>>> get screwed up.
> >>> 
> >>> Right, the major one could be in ram_load_postcopy() where we call
> >>> host_from_ram_block_offset().  However if FW extension is the major
> >>> use case then it seems to still work (still better than crashing,
> >>> isn't it? :).
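
(To make that concrete: the lookup on that path is basically a bounds
check against used_length.  Below is a simplified sketch, not the actual
QEMU code, with RAMBlock reduced to the fields that matter here.)

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t ram_addr_t;   /* stand-in for QEMU's ram_addr_t */

    /* Illustrative stand-in for the relevant bits of QEMU's RAMBlock. */
    typedef struct RAMBlock {
        uint8_t *host;          /* mapping of max_length bytes */
        ram_addr_t used_length; /* currently used part, changes on resize */
        ram_addr_t max_length;  /* fixed at creation time */
    } RAMBlock;

    /*
     * Roughly what the lookup on the postcopy load path boils down to:
     * reject offsets outside the currently used part of the block.
     */
    static void *host_from_block_offset(RAMBlock *block, ram_addr_t offset)
    {
        if (!block || !block->host || offset >= block->used_length) {
            return NULL;
        }
        return block->host + offset;
    }

If the block only ever grows, as in the FW extension case, the offsets the
source already sent stay below used_length and the lookup keeps working;
after a shrink, offsets beyond the new used_length get rejected, or raced
against, which is the risky part.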
> >> 
> >> "Let's close our eyes and hope it will never happen" ? :) No, I don't
> >> like that. This screams for a better solution long term, and until then
> >> a proper fencing IMHO. We're making here wild guesses about data races
> >> and why they might not be that bad in certain cases (did I mention
> >> load/store tearing? used_length is not an atomic value ...).
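
(On the tearing point, a self-contained illustration, not QEMU code: a
plain read of used_length racing with a resize is a data race in the C
sense; making the accesses atomic at least keeps each individual read
well defined, even though it doesn't solve the higher-level problem of
the value changing under our feet.)

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Shared "used_length" touched by two threads, illustration only. */
    static _Atomic uint64_t used_length = 0x10000000; /* 256 MiB */

    /* Resize path, e.g. the guest rebooting and resizing a firmware block. */
    static void resize_block(uint64_t new_size)
    {
        atomic_store_explicit(&used_length, new_size, memory_order_release);
    }

    /* Migration thread deciding whether an offset is still inside the block. */
    static int offset_in_block(uint64_t offset)
    {
        uint64_t len = atomic_load_explicit(&used_length, memory_order_acquire);
        return offset < len;
    }

    int main(void)
    {
        printf("0x0fff0000 in block: %d\n", offset_in_block(0x0fff0000));
        resize_block(0x08000000); /* shrink to 128 MiB */
        printf("0x0fff0000 in block: %d\n", offset_in_block(0x0fff0000));
        return 0;
    }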
> > 
> > Yeah fencing is good, but crashing a VM while it wasn't going to crash
> > is another thing, imho.  You can dump an error message if you really
> > like, but instead of exit() I really prefer we either still let the
> > old way at least work or really fix it.
> 
> I'll do whatever Juan/Dave think is best. I am not convinced that there
> is no way to corrupt data or crash later when the guest is already
> running again post-reboot and doing real work.

Yeah, I never said it will always work. :)

However, that does not mean it'll break every time.  My guess is that in
the cases where this has happened, it might still survive quite often,
though I admit that's without much evidence.  I just prefer to avoid
having an explicit patch to bail out like that, because crashing earlier
doesn't really help that much.

That's something I learnt when I started to work on migration: we don't
call exit() on the source VM unless we really, really need to.  For
postcopy, it's the destination VM that matters here.
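
Concretely, for the postcopy branch I'd rather see something along these
lines instead of exit().  Just a sketch of the direction, not a tested
patch, and it assumes the surrounding function can propagate an error to
its caller, which may need plumbing that the current resize path doesn't
have:

    } else {
        /*
         * Postcopy code cannot deal with the size of ram blocks changing
         * at random points in time, but killing the destination VM is the
         * worst possible outcome.  Report the problem and let the caller
         * fail the incoming migration instead of exit()ing.
         */
        error_report("RAM resized during postcopy, failing migration");
        return -EINVAL;
    }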

Yeah, not a big deal, since this is really a corner case even if it
happens.  Let's follow the maintainers' judgement.

Thanks,

-- 
Peter Xu

