"Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> * Stefan Hajnoczi (stefa...@gmail.com) wrote:
>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>> > Heya!
>> >
>> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>> > >
>> > > At KVM Forum an interesting idea was proposed to avoid
>> > > bdrv_drain_all() during live migration. Mike Cui and Felipe Franciosi
>> > > mentioned running at queue depth 1. It needs more thought to make it
>> > > workable but I want to capture it here for discussion and to archive
>> > > it.
>> > >
>> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>> > > requests hang. We should find a better way of quiescing I/O that is
>> > > not synchronous. Up until now I thought we should simply add a
>> > > timeout to bdrv_drain_all() so it can at least fail (and live
>> > > migration would fail) if I/O is stuck instead of hanging the VM. But
>> > > the following approach is also interesting...
>> > >
>> > > During the iteration phase of live migration we could limit the queue
>> > > depth so points with no I/O requests in-flight are identified. At
>> > > these points the migration algorithm has the opportunity to move to
>> > > the next phase without requiring bdrv_drain_all() since no requests
>> > > are pending.
>> >
>> > I actually think that this "io quiesced state" is highly unlikely
>> > to _just_ happen on a busy guest. The main idea behind running at
>> > QD1 is to naturally throttle the guest and make it easier to
>> > "force quiesce" the VQs.
>> >
>> > In other words, if the guest is busy and we run at QD1, I would
>> > expect the rings to be quite full of pending (ie. unprocessed)
>> > requests. At the same time, I would expect that a call to
>> > bdrv_drain_all() (as part of do_vm_stop()) should complete much
>> > quicker.
>> >
>> > Nevertheless, you mentioned that this is still problematic as that
>> > single outstanding IO could block, leaving the VM paused for
>> > longer.
>> >
>> > My suggestion is therefore that we leave the vCPUs running, but
>> > stop picking up requests from the VQs. Provided nothing blocks,
>> > you should reach the "io quiesced state" fairly quickly. If you
>> > don't, then the VM is at least still running (despite seeing no
>> > progress on its VQs).
>> >
>> > Thoughts on that?
>>
>> If the guest experiences a hung disk it may enter error recovery. QEMU
>> should avoid this so the guest doesn't remount file systems read-only.
>>
>> This can be solved by only quiescing the disk for, say, 30 seconds at a
>> time. If we don't reach a point where live migration can proceed during
>> those 30 seconds then the disk will service requests again temporarily
>> to avoid upsetting the guest.
>>
>> I wonder if Juan or David have any thoughts from the live migration
>> perspective?
>
> Throttling IO to reduce the time in the final drain makes sense
> to me, however:
>
> a) It doesn't solve the problem if the IO device dies at just the wrong
>    time, so you can still get that hang in bdrv_drain_all.
>
> b) Completely stopping guest IO sounds too drastic to me unless you can
>    time it to be just at the point before the end of migration; that feels
>    tricky to get right unless you can somehow tie it to an estimate of
>    remaining dirty RAM (that never works that well).
>
> c) Something like a 30 second pause still feels too long; if that was
>    a big hairy database workload it would effectively be 30 seconds
>    of downtime.
>
> Dave
I think something like the proposed thing could work. We can put queue
depth = 1 or some such when we know we are near completion for migration.

What we need then is a way to call the equivalent of bdrv_drain_all()
that returns EAGAIN or EBUSY if it is a bad moment. In that case, we
just do another round over the whole memory, or retry in X seconds.
Anything is good for us; we just need a way to ask for the operation
without it blocking.

Notice that migration is the equivalent of:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold) {
            break;
        }
    }
    bdrv_drain_all();
    write_rest_of_dirty_pages();

(Lots and lots of details omitted.)

What we really want is to issue the equivalent of the bdrv_drain_all()
call inside the while loop, so if there is any problem, we just do
another cycle, no problem.

Later, Juan.