On Wed, Sep 28, 2016 at 11:03:15AM +0200, Juan Quintela wrote:
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> > * Stefan Hajnoczi (stefa...@gmail.com) wrote:
> >> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> >> > Heya!
> >> >
> >> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> >> > >
> >> > > At KVM Forum an interesting idea was proposed to avoid
> >> > > bdrv_drain_all() during live migration.  Mike Cui and Felipe Franciosi
> >> > > mentioned running at queue depth 1.  It needs more thought to make it
> >> > > workable but I want to capture it here for discussion and to archive
> >> > > it.
> >> > >
> >> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> >> > > requests hang.  We should find a better way of quiescing I/O that is
> >> > > not synchronous.  Up until now I thought we should simply add a
> >> > > timeout to bdrv_drain_all() so it can at least fail (and live
> >> > > migration would fail) if I/O is stuck instead of hanging the VM.  But
> >> > > the following approach is also interesting...
> >> > >
> >> > > During the iteration phase of live migration we could limit the queue
> >> > > depth so points with no I/O requests in-flight are identified.  At
> >> > > these points the migration algorithm has the opportunity to move to
> >> > > the next phase without requiring bdrv_drain_all() since no requests
> >> > > are pending.
> >> >
> >> > I actually think that this "io quiesced state" is highly unlikely
> >> > to _just_ happen on a busy guest. The main idea behind running at
> >> > QD1 is to naturally throttle the guest and make it easier to
> >> > "force quiesce" the VQs.
> >> >
> >> > In other words, if the guest is busy and we run at QD1, I would
> >> > expect the rings to be quite full of pending (ie. unprocessed)
> >> > requests. At the same time, I would expect that a call to
> >> > bdrv_drain_all() (as part of do_vm_stop()) should complete much
> >> > quicker.
> >> >
> >> > Nevertheless, you mentioned that this is still problematic as that
> >> > single outstanding IO could block, leaving the VM paused for
> >> > longer.
> >> >
> >> > My suggestion is therefore that we leave the vCPUs running, but
> >> > stop picking up requests from the VQs. Provided nothing blocks,
> >> > you should reach the "io quiesced state" fairly quickly. If you
> >> > don't, then the VM is at least still running (despite seeing no
> >> > progress on its VQs).
> >> >
> >> > Thoughts on that?
> >>
> >> If the guest experiences a hung disk it may enter error recovery. QEMU
> >> should avoid this so the guest doesn't remount file systems read-only.
> >>
> >> This can be solved by only quiescing the disk for, say, 30 seconds at a
> >> time. If we don't reach a point where live migration can proceed during
> >> those 30 seconds then the disk will service requests again temporarily
> >> to avoid upsetting the guest.
> >>
> >> I wonder if Juan or David have any thoughts from the live migration
> >> perspective?
> >
> > Throttling IO to reduce the time in the final drain makes sense
> > to me, however:
> >   a) It doesn't solve the problem if the IO device dies at just the
> >      wrong time, so you can still get that hang in bdrv_drain_all
> >
> >   b) Completely stopping guest IO sounds too drastic to me unless you
> >      can time it to be just at the point before the end of migration;
> >      that feels tricky to get right unless you can somehow tie it to
> >      an estimate of remaining dirty RAM (that never works that well).
> >
> >   c) Something like a 30 second pause still feels too long; if that
> >      was a big hairy database workload it would effectively be 30
> >      seconds of downtime.
> >
> > Dave
>
> I think something like the proposed thing could work.
>
> We can put queue depth = 1 or somesuch when we know we are near
> completion for migration.
> What we need then is a way to call the equivalent of
> bdrv_drain_all() such that it returns EAGAIN or EBUSY if it is a bad
> moment. In that case, we just do another round over the whole memory,
> or retry in X seconds. Anything is fine for us; we just need a way to
> request the operation without it blocking.
>
> Notice that migration is the equivalent of:
>
>     while (true) {
>         write_some_dirty_pages();
>         if (dirty_pages < threshold) {
>             break;
>         }
>     }
>     bdrv_drain_all();
>     write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted)
>
> What we really want is to issue the bdrv_drain_all() equivalent inside
> the while loop, so that if there is any problem we just do another
> cycle, no problem.
It seems that the main downside of this is that it makes normal pre-copy
live migration even less likely to successfully complete than it already
is. This increases the likelihood of needing to use post-copy live
migration, which has the same bdrv_drain_all problem. This is hard to
solve because QEMU isn't in charge of when post-copy starts, so it can't
simply wait for a convenient moment to switch to post-copy if
bdrv_drain_all is busy.

Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org       -o-    http://virt-manager.org                 :|
|: http://entangle-photo.org -o-   http://search.cpan.org/~danberr/        :|