* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, May 11, 2020 at 08:12:18PM +0200, Lukas Straub wrote:
> > On Mon, 11 May 2020 12:49:47 +0100
> > Daniel P. Berrangé <berra...@redhat.com> wrote:
> >
> > > On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > > > Hello Everyone,
> > > > In many cases, if qemu has a network connection (qmp, migration,
> > > > chardev, etc.) to some other server and that server dies or hangs,
> > > > qemu hangs too.
> > >
> > > If qemu as a whole hangs due to a stalled network connection, that is a
> > > bug in QEMU that we should be fixing IMHO. QEMU should be doing
> > > non-blocking I/O in general, such that if the network connection or
> > > remote server stalls, we simply stop sending I/O - we shouldn't ever
> > > hang the QEMU process or main loop.
> > >
> > > There are places in QEMU code which are not well behaved in this respect,
> > > but many are, and others are getting fixed where found to be important.
> > >
> > > Arguably any place in QEMU code which can result in a hang of QEMU in the
> > > event of a stalled network should be considered a security flaw, because
> > > the network is untrusted in general.
> >
> > The fact that out-of-band qmp commands exist at all shows that we have to
> > make tradeoffs of developer time vs. doing things right. Sure, the
> > migration code can be rewritten to use non-blocking i/o and finegrained
> > locks. But as a hobbyist I don't have time to fix this.
> >
> > > > These patches introduce the new 'yank' out-of-band qmp command to
> > > > recover from these kinds of hangs. The different subsystems register
> > > > callbacks which get executed with the yank command. For example the
> > > > callback can shutdown() a socket. This is intended for the colo
> > > > use-case, but it can be used for other things too of course.
> > >
> > > IIUC, invoking the "yank" command unconditionally kills every single
> > > network connection in QEMU that has registered with the "yank" subsystem.
> > > IMHO this is way too big of a hammer, even if we accept there are bugs in
> > > QEMU not handling stalled networking well.
> > >
> > > eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> > > connection used for the guest disk, we needlessly break I/O.
> >
> > Yeah, these patches are intended to solve the problems with the colo
> > use-case where all external connections (migration, chardevs, nbd)
> > are just for replication. In other use-cases you'd enable the yank
> > feature only on the non-essential connections.
>
> That is a pretty inflexible design for other use cases though,
> as "non-essential" is not a black & white list in general. There
> are varying levels of importance to the different channels. We
> can afford to lose migration without any user visible effects.
> If that doesn't solve it, a serial device chardev, or VNC connection
> can be dropped at the inconvenience of losing the interactive console,
> which is an end user visible impact, so it may only want to be yanked
> if the migration yank didn't fix it.

In the case of COLO that's not the case though - here we explicitly want
to kill the migration to be able to ensure that we can recover - and
we're under time pressure to get the other member of the pair running
again.

Dave

> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
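
To make the two positions in this thread concrete, here is a minimal sketch of a callback registry in the spirit of the proposal: subsystems register a callback, and a yank runs the callbacks, one of which calls shutdown() on a socket. This is not the code from the patches. The names yank_register(), yank(), yank_socket_shutdown() and the instance strings are all hypothetical, and a real implementation would also need locking and unregistration. Passing an instance name models the "yank only the less important channel first" idea, while passing NULL models the "tear down everything" behaviour wanted for COLO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_YANK_ENTRIES 16

/* Hypothetical callback type: each subsystem supplies one of these. */
typedef void (*yank_fn)(void *opaque);

struct yank_entry {
    const char *instance;   /* illustrative label, e.g. "migration" or "nbd" */
    yank_fn fn;
    void *opaque;
};

static struct yank_entry yank_table[MAX_YANK_ENTRIES];
static size_t yank_count;

/* Register a callback under an instance name. Returns 0, or -1 if full. */
int yank_register(const char *instance, yank_fn fn, void *opaque)
{
    assert(instance != NULL && fn != NULL);
    if (yank_count == MAX_YANK_ENTRIES) {
        return -1;
    }
    yank_table[yank_count].instance = instance;
    yank_table[yank_count].fn = fn;
    yank_table[yank_count].opaque = opaque;
    yank_count++;
    return 0;
}

/*
 * Run the callbacks of one instance, or of every instance when
 * `instance` is NULL. Returns the number of callbacks that ran.
 */
int yank(const char *instance)
{
    int ran = 0;
    for (size_t i = 0; i < yank_count; i++) {
        if (instance == NULL ||
            strcmp(yank_table[i].instance, instance) == 0) {
            yank_table[i].fn(yank_table[i].opaque);
            ran++;
        }
    }
    return ran;
}

/*
 * Example callback: `opaque` points at a socket fd. shutdown() makes
 * pending and future reads on the connection return instead of blocking;
 * the fd itself stays open and must still be closed by its owner.
 */
void yank_socket_shutdown(void *opaque)
{
    int fd = *(int *)opaque;
    shutdown(fd, SHUT_RDWR);
}
```

With this shape, a management layer could call yank("migration") first and escalate to a chardev or to yank(NULL) only if the hang persists, whereas the COLO case would go straight to the connections it needs gone.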