26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
> 25.06.2018 20:50, Dr. David Alan Gilbert wrote:
>> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
>>> * Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>> * Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>> Why doesn't the code at the bottom of migration_completion,
>>>>> fail_invalidate: and the code in migrate_fd_cancel handle this?
>>>>>
>>>>> What case did you see it in that those didn't handle?
>>>>> (Also I guess it probably should set s->block_inactive = false)
>>>> on source I see:
>>>>
>>>> 81392@1529065750.766289:migrate_set_state new state 7
>>>> 81392@1529065750.766330:migration_thread_file_err
>>>> 81392@1529065750.766332:migration_thread_after_loop
>>>>
>>>> so, we are leaving the loop on
>>>>
>>>>     if (qemu_file_get_error(s->to_dst_file)) {
>>>>         migrate_set_state(&s->state, current_active_state,
>>>>                           MIGRATION_STATUS_FAILED);
>>>>         trace_migration_thread_file_err();
>>>>         break;
>>>>     }
>>>>
>>>> and skip migration_completion()
>
> John is right, this is an unrelated log; here we fail before
> inactivation and there are no problems.
>
> Actual problem is when we fail in postcopy_start, at the end.
> And the source log looks like:
>
> 84297@1530001796.287344:migrate_set_state new state 1
> 84297@1530001796.287374:migration_fd_outgoing fd=101
> 84297@1530001796.287383:migration_set_outgoing_channel ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
> 84297@1530001796.294032:migration_bitmap_sync_start
> 84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
> 84297@1530001796.300561:migrate_set_state new state 4
> 84297@1530001796.300588:migration_thread_setup_complete
> 84297@1530001796.300593:migrate_pending pending size 1107976192 max 0 (pre = 0 compat=1107976192 post=0)
> 84297@1530001796.300595:migrate_set_state new state 5
> Tap fd 33 disable, ret 0
> 84297@1530001796.426477:migration_bitmap_sync_start
> 84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
> 84297@1530001796.439077:migrate_global_state_pre_save saved state: running
> 2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream errored -5
> 84297@1530001796.439141:migrate_set_state new state 7
> 84297@1530001796.439181:migration_thread_after_loop
> Tap fd 33 enable, ret 0
> 84297@1530001796.453639:migrate_fd_cleanup
> qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed.
> 2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
>
>
>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>> test that had previously ended with a 'cancelled' state has now ended up
>>> in 'failed' (which is the state 7 you have above).
>>> I suspect there's something else going on as well; I think what is
>>> supposed to happen in the case of 'cancel' is that it spins in
>>> 'cancelling' for a while in migrate_fd_cancel and then at the bottom
>>> of migrate_fd_cancel it does the recovery, but because it's going to
>>> failed instead, it's jumping over that recovery.
>> Going back and actually looking at the patch again,
>> can I ask for one small change:
>> can you set s->block_inactive = false in the case where you
>> don't get the local_err (like we do at the bottom of migrate_fd_cancel)?
>>
>> Does that make sense?
>
> Ok, I'll resend.
>
> Hm, it looks like I'm fixing an outdated version (based on v2.9.0), and
> my reproducer isn't appropriate for upstream.
> But it looks like the current code has a possibility of the same failure:
>
> postcopy_start()
> ....
>     ret = qemu_file_get_error(ms->to_dst_file);
>     if (ret) {
>         error_report("postcopy_start: Migration stream errored");
>
> leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run.
> Then the loop should finish, as the state should be
> MIGRATION_STATUS_FAILED, so we will not call migration_completion().
>
> Hm, I have questions now:
>
> 1. Should we check s->block_inactive, and if it is false, not
> invalidate? It is done in migrate_fd_cancel(), but not done in
> migration_completion().
> 2. Should we call qemu_mutex_lock_iothread() like in
> migration_completion()? Why is it needed in migration_completion()
> when the vm is not running?
Hm, forgotten thread. I should resend, but what do you think about these questions?

>
>> Thanks,
>>
>> Dave
>>
>>> Dave
>>>
>>>>> Dave
>>>>>
>>>>>> ---
>>>>>>
>>>>>>   migration/migration.c | 9 ++++++++-
>>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>> index 1e99ec9b7e..8f39e0dc02 100644
>>>>>> --- a/migration/migration.c
>>>>>> +++ b/migration/migration.c
>>>>>> @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
>>>>>>       case MIGRATION_STATUS_FAILED:
>>>>>>       case MIGRATION_STATUS_CANCELLED:
>>>>>>           if (s->vm_was_running) {
>>>>>> -            vm_start();
>>>>>> +            Error *local_err = NULL;
>>>>>> +            bdrv_invalidate_cache_all(&local_err);
>>>>>> +            if (local_err) {
>>>>>> +                error_reportf_err(local_err, "Can't invalidate disks before "
>>>>>> +                                  "source vm start");
>>>>>> +            } else {
>>>>>> +                vm_start();
>>>>>> +            }
>>>>>>           } else {
>>>>>>               if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>>>>>>                   runstate_set(RUN_STATE_POSTMIGRATE);
>>>>>> --
>>>>>> 2.11.1
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>>>
>>>> --
>>>> Best regards,
>>>> Vladimir
>>>
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

--
Best regards,
Vladimir