On 14.02.20 19:11, Peter Xu wrote: > On Fri, Feb 14, 2020 at 06:32:35PM +0100, David Hildenbrand wrote: >> On 14.02.20 18:29, Peter Xu wrote: >>> On Fri, Feb 14, 2020 at 01:02:46PM +0100, David Hildenbrand wrote: >>>> From c0049ac2e95d6756037db918852c507fb88297d9 Mon Sep 17 00:00:00 2001 >>>> From: David Hildenbrand <da...@redhat.com> >>>> Date: Fri, 14 Feb 2020 13:01:03 +0100 >>>> Subject: [PATCH v1] tmp >>>> >>>> Signed-off-by: David Hildenbrand <da...@redhat.com> >>>> --- >>>> migration/migration.c | 9 +++++++-- >>>> migration/migration.h | 1 + >>>> migration/ram.c | 42 ++++++++++++++++++++++++++++++++++++++++++ >>>> 3 files changed, 50 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/migration/migration.c b/migration/migration.c >>>> index 3a21a4686c..0e7efe2920 100644 >>>> --- a/migration/migration.c >>>> +++ b/migration/migration.c >>>> @@ -175,13 +175,18 @@ void migration_object_init(void) >>>> } >>>> } >>>> >>>> +void migration_cancel(void) >>>> +{ >>>> + migrate_fd_cancel(current_migration); >>>> +} >>>> + >>>> void migration_shutdown(void) >>>> { >>>> /* >>>> * Cancel the current migration - that will (eventually) >>>> * stop the migration using this structure >>>> */ >>>> - migrate_fd_cancel(current_migration); >>>> + migration_cancel(); >>>> object_unref(OBJECT(current_migration)); >>>> } >>>> >>>> @@ -2019,7 +2024,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool >>>> blk, >>>> >>>> void qmp_migrate_cancel(Error **errp) >>>> { >>>> - migrate_fd_cancel(migrate_get_current()); >>>> + migration_cancel(); >>>> } >>>> >>>> void qmp_migrate_continue(MigrationStatus state, Error **errp) >>>> diff --git a/migration/migration.h b/migration/migration.h >>>> index 8473ddfc88..79fd74afa5 100644 >>>> --- a/migration/migration.h >>>> +++ b/migration/migration.h >>>> @@ -343,5 +343,6 @@ int foreach_not_ignored_block(RAMBlockIterFunc func, >>>> void *opaque); >>>> void migration_make_urgent_request(void); >>>> void migration_consume_urgent_request(void); >>>> bool migration_rate_limit(void); >>>> +void migration_cancel(void); >>>> >>>> #endif >>>> diff --git a/migration/ram.c b/migration/ram.c >>>> index ed23ed1c7c..f86f32b453 100644 >>>> --- a/migration/ram.c >>>> +++ b/migration/ram.c >>>> @@ -52,6 +52,7 @@ >>>> #include "migration/colo.h" >>>> #include "block.h" >>>> #include "sysemu/sysemu.h" >>>> +#include "sysemu/runstate.h" >>>> #include "savevm.h" >>>> #include "qemu/iov.h" >>>> #include "multifd.h" >>>> @@ -3710,8 +3711,49 @@ static SaveVMHandlers savevm_ram_handlers = { >>>> .resume_prepare = ram_resume_prepare, >>>> }; >>>> >>>> +static void ram_mig_ram_block_resized(RAMBlockNotifier *n, void *host, >>>> + size_t old_size, size_t new_size) >>>> +{ >>>> + /* >>>> + * We don't care about resizes triggered on incoming migration (when >>>> + * syncing ram blocks), or of course, when no migration is going on. >>>> + */ >>>> + if (migration_is_idle() || !runstate_is_running()) { >>>> + return; >>>> + } >>> >>> I feel like migration_is_idle() check is enough. Firstly, I feel like >>> we allow migration even with VM stopped. At the meantime, if VM is >>> not running, I see no reason that this resizing will happen after all? :) >> >> Migration code resizes ram blocks when synchronizing the ram state. See >> qemu_ram_resize() in ram_load_precopy() >> >> That happens on incoming migration while the vm is stopped. > > Ah so your comment is for that which I misread. I'm surprised even > the incoming migration will set MigrationState and migration_is_idle()
I was already surprised by that :) And states are not well document (IOW, not documented) > will return false for it, since we've got MigrationIncomingState after > all. But yeh that seems to be correct. > > Then it seems fine. It's just a bit unclear even with the comment. > Another alternative is we only register this resize() hook when > migration starts, and unregister it when stopped. Then we can even > drop the migration_is_idle(). Yeah, I prefer the current code because its simpler (and I don't have to worry about races when registering/unregistering notifiers, because locking in combination with migration is still a big puzzle to me) I'll try to make the comment clearer, thanks! > >> >>> >>>> + >>>> + if (!postcopy_is_running()) { >>>> + Error *err = NULL; >>>> + >>>> + /* >>>> + * Precopy code cannot deal with the size of ram blocks changing >>>> at >>>> + * random points in time. We're still running on the source, abort >>>> + * the migration and continue running here. Make sure to wait >>>> until >>>> + * migration was canceled. >>>> + */ >>>> + error_setg(&err, "RAM resized during precopy."); >>>> + migrate_set_error(migrate_get_current(), err); >>>> + error_free(err); >>>> + migration_cancel(); >>>> + } else { >>>> + /* >>>> + * Postcopy code cannot deal with the size of ram blocks changing >>>> at >>>> + * random points in time. We're running on the target. Fail hard. >>>> + * >>>> + * TODO: How to handle this in a better way? >>>> + */ >>>> + error_report("RAM resized during postcopy."); >>>> + exit(-1); >>> >>> Now I'm rethinking the postcopy case.... >>> >>> ram_dirty_bitmap_reload() should only happen during the postcopy >>> recovery, and when that happens the VM should be stopped on both >>> sides. Which means, ram resizing should not trigger there... >> >> But that guest got the chance to run for a bit and eventually reboot >> AFAIK. Also, there are other data races possible when used_length >> suddenly changes, this is just the most obvious one where things will; >> get screwed up. > > Right, the major one could be in ram_load_postcopy() where we call > host_from_ram_block_offset(). However if FW extension is the major > use case then it seems to still work (still better than crashing, > isn't it? :). "Let's close our eyes and hope it will never happen" ? :) No, I don't like that. This screams for a better solution long term, and until then a proper fencing IMHO. We're making here wild guesses about data races and why they might not be that bad in certain cases (did I mention load/store tearing? used_length is not an atomic value ...). -- Thanks, David / dhildenb