On Mon, May 18, 2020 at 10:52:08AM +0800, Jason Wang wrote: > > On 2020/5/16 上午11:20, Li Feng wrote: > >Hi, Dima. > >This abort is what I have mentioned in my previous email. > >I have triggered this crash without any fix a week ago. > >And I have written a test patch to let vhost_log_global_start return > >int and propagate the error to up layer. > >However, my change is a little large, because the origin callback > >return void, and don't do some rollback. > >After test, the migration could migrate to dst successfully, and fio > >is still running perfectly, but the src vm is still stuck here, no > >crash. > > > >Is it right to return this error to the up layer? > > > That could be a solution or we may ask David for more suggestion. > > Another thing that might be useful is to block re connection during > migration. I've written a little more information as answer to Feng's mail. But what if add some new callback to get the device started state (started or not). And for the vhost-user (or at least vhost-usr-blk) devices it will use the connected field also to return the device state: - disconnect -> not started For other devices we can just return the started field value as it is right now.
No other comments mixed in below. > > Thanks > > > > > >Thanks, > >Feng Li > > > >Dima Stepanov <dimas...@yandex-team.ru> 于2020年5月16日周六 上午12:55写道: > >>On Thu, May 14, 2020 at 03:34:24PM +0800, Jason Wang wrote: > >>>On 2020/5/13 下午5:47, Dima Stepanov wrote: > >>>>>> case CHR_EVENT_CLOSED: > >>>>>> /* a close event may happen during a read/write, but vhost > >>>>>> * code assumes the vhost_dev remains setup, so delay the > >>>>>> * stop & clear to idle. > >>>>>> * FIXME: better handle failure in vhost code, remove bh > >>>>>> */ > >>>>>> if (s->watch) { > >>>>>> AioContext *ctx = qemu_get_current_aio_context(); > >>>>>> > >>>>>> g_source_remove(s->watch); > >>>>>> s->watch = 0; > >>>>>> qemu_chr_fe_set_handlers(&s->chr, NULL, NULL, NULL, NULL, > >>>>>> NULL, NULL, false); > >>>>>> > >>>>>> aio_bh_schedule_oneshot(ctx, chr_closed_bh, opaque); > >>>>>> } > >>>>>> break; > >>>>>> > >>>>>>I think it's time we dropped the FIXME and moved the handling to common > >>>>>>code. Jason? Marc-André? > >>>>>I agree. Just to confirm, do you prefer bh or doing changes like what is > >>>>>done in this series? It looks to me bh can have more easier codes. > >>>>Could it be a good idea just to make disconnect in the char device but > >>>>postphone clean up in the vhost-user-blk (or any other vhost-user > >>>>device) itself? So we are moving the postphone logic and decision from > >>>>the char device to vhost-user device. One of the idea i have is as > >>>>follows: > >>>> - Put ourself in the INITIALIZATION state > >>>> - Start these vhost-user "handshake" commands > >>>> - If we got a disconnect error, perform disconnect, but don't clean up > >>>> device (it will be clean up on the roll back). I can be done by > >>>> checking the state in vhost_user_..._disconnect routine or smth like > >>>> it > >>> > >>>Any issue you saw just using the aio bh as Michael posted above. > >>> > >>>Then we don't need to deal with the silent vhost_dev_stop() and we will > >>>have > >>>codes that is much more easier to understand. > >>I've implemented this solution inside > >>hw/block/vhost-user-blk.c:vhost_user_blk_event() in the similar way by > >>using the s->connected field. Looks good and more correct fix ). I have > >>two questions here before i'll rework the fixes: > >>1. Is it okay to make the similar fix inside vhost_user_blk_event() or > >>we are looking for more generic vhost-user solution? What do you think? > >>2. For migration we require an additional information that for the > >>vhost-user device it isn't an error, because i'm trigerring the > >>following assert error: > >> Core was generated by `x86_64-softmmu/qemu-system-x86_64 -nodefaults > >> -no-user-config -M q35,sata=false'. > >> Program terminated with signal SIGABRT, Aborted. > >> #0 0x00007fb56e729428 in raise () from /lib/x86_64-linux-gnu/libc.so.6 > >> [Current thread is 1 (Thread 0x7fb486ef5700 (LWP 527734))] > >> > >> (gdb) bt > >> #0 0x00007fb56e729428 in raise () from /lib/x86_64-linux-gnu/libc.so.6 > >> #1 0x00007fb56e72b02a in abort () from /lib/x86_64-linux-gnu/libc.so.6 > >> #2 0x00005648ea376ee6 in vhost_log_global_start > >> (listener=0x5648ece4eb08) at ./hw/virtio/vhost.c:857 > >> #3 0x00005648ea2dde7e in memory_global_dirty_log_start () > >> at ./memory.c:2611 > >> #4 0x00005648ea2e68e7 in ram_init_bitmaps (rs=0x7fb4740008c0) > >> at ./migration/ram.c:2305 > >> #5 0x00005648ea2e698b in ram_init_all (rsp=0x5648eb1f0f20 <ram_state>) > >> at ./migration/ram.c:2323 > >> #6 0x00005648ea2e6cc5 in ram_save_setup (f=0x5648ec609e00, > >> opaque=0x5648eb1f0f20 <ram_state>) > >> at ./migration/ram.c:2436 > >> #7 0x00005648ea67b7d3 in qemu_savevm_state_setup (f=0x5648ec609e00) at > >> migration/savevm.c:1176 > >> #8 0x00005648ea674511 in migration_thread (opaque=0x5648ec031ff0) at > >> migration/migration.c:3416 > >> #9 0x00005648ea85d65d in qemu_thread_start (args=0x5648ec6057f0) at > >> util/qemu-thread-posix.c:519 > >> #10 0x00007fb56eac56ba in start_thread () from > >> /lib/x86_64-linux-gnu/libpthread.so.0 > >> #11 0x00007fb56e7fb41d in clone () from /lib/x86_64-linux-gnu/libc.so.6 > >> (gdb) frame 2 > >> #2 0x00005648ea376ee6 in vhost_log_global_start > >> (listener=0x5648ece4eb08) at ./hw/virtio/vhost.c:857 > >> 857 abort(); > >> (gdb) list > >> 852 { > >> 853 int r; > >> 854 > >> 855 r = vhost_migration_log(listener, true); > >> 856 if (r < 0) { > >> 857 abort(); > >> 858 } > >> 859 } > >> 860 > >> 861 static void vhost_log_global_stop(MemoryListener *listener) > >>Since bh postphone the clean up, we can't use the ->started field. > >>Do we have any mechanism to get the device type/state in the common > >>vhost_migration_log() routine? So for example for the vhost-user/disconnect > >>device we will be able to return 0. Or should we implement it and introduce > >>it in this patch set? > >> > >>Thanks, Dima. > >> > >>>Thank > >>> > >>> > >>>> - vhost-user command returns error back to the _start() routine > >>>> - Rollback in one place in the start() routine, by calling this > >>>> postphoned clean up for the disconnect > >>>> >