> On 28 Sep 2016, at 10:03, Juan Quintela <quint...@redhat.com> wrote:
>
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
>> * Stefan Hajnoczi (stefa...@gmail.com) wrote:
>>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>>>> Heya!
>>>>
>>>>> On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>>>>>
>>>>> At KVM Forum an interesting idea was proposed to avoid
>>>>> bdrv_drain_all() during live migration. Mike Cui and Felipe
>>>>> Franciosi mentioned running at queue depth 1. It needs more thought
>>>>> to make it workable, but I want to capture it here for discussion
>>>>> and to archive it.
>>>>>
>>>>> bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>>>>> requests hang. We should find a better way of quiescing I/O that is
>>>>> not synchronous. Up until now I thought we should simply add a
>>>>> timeout to bdrv_drain_all() so it can at least fail (and live
>>>>> migration would fail) if I/O is stuck, instead of hanging the VM.
>>>>> But the following approach is also interesting...
>>>>>
>>>>> During the iteration phase of live migration we could limit the
>>>>> queue depth so that points with no I/O requests in flight are
>>>>> identified. At these points the migration algorithm has the
>>>>> opportunity to move to the next phase without requiring
>>>>> bdrv_drain_all(), since no requests are pending.
>>>>
>>>> I actually think that this "I/O quiesced state" is highly unlikely
>>>> to _just_ happen on a busy guest. The main idea behind running at
>>>> QD1 is to naturally throttle the guest and make it easier to
>>>> "force quiesce" the VQs.
>>>>
>>>> In other words, if the guest is busy and we run at QD1, I would
>>>> expect the rings to be quite full of pending (i.e. unprocessed)
>>>> requests. At the same time, I would expect a call to
>>>> bdrv_drain_all() (as part of do_vm_stop()) to complete much
>>>> quicker.
>>>>
>>>> Nevertheless, you mentioned that this is still problematic, as that
>>>> single outstanding I/O could block, leaving the VM paused for
>>>> longer.
>>>>
>>>> My suggestion is therefore that we leave the vCPUs running, but
>>>> stop picking up requests from the VQs. Provided nothing blocks,
>>>> you should reach the "I/O quiesced state" fairly quickly. If you
>>>> don't, then the VM is at least still running (despite seeing no
>>>> progress on its VQs).
>>>>
>>>> Thoughts on that?
>>>
>>> If the guest experiences a hung disk it may enter error recovery.
>>> QEMU should avoid this so the guest doesn't remount file systems
>>> read-only.
>>>
>>> This can be solved by only quiescing the disk for, say, 30 seconds
>>> at a time. If we don't reach a point where live migration can
>>> proceed during those 30 seconds, then the disk will service requests
>>> again temporarily to avoid upsetting the guest.
>>>
>>> I wonder if Juan or David have any thoughts from the live migration
>>> perspective?
>>
>> Throttling I/O to reduce the time spent in the final drain makes
>> sense to me, however:
>>
>> a) It doesn't solve the problem if the I/O device dies at just the
>>    wrong time, so you can still get that hang in bdrv_drain_all().
>>
>> b) Completely stopping guest I/O sounds too drastic to me unless you
>>    can time it to be just at the point before the end of migration;
>>    that feels tricky to get right unless you can somehow tie it to an
>>    estimate of the remaining dirty RAM (which never works that well).
>>
>> c) Something like a 30-second pause still feels too long; for a big
>>    hairy database workload it would effectively be 30 seconds of
>>    downtime.
>>
>> Dave
>
> I think something like the proposed scheme could work.
>
> We can set queue depth = 1 or somesuch when we know we are near
> completion of the migration. What we need, then, is a way to call the
> equivalent of bdrv_drain_all() such that it returns EAGAIN or EBUSY if
> it is a bad moment.
> In that case, we just do another round over the whole memory, or retry
> in X seconds. Anything is fine for us; we just need a way to ask for
> the operation without it blocking.
>
> Notice that migration is the equivalent of:
>
>     while (true) {
>         write_some_dirty_pages();
>         if (dirty_pages < threshold) {
>             break;
>         }
>     }
>     bdrv_drain_all();
>     write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted.)
>
> What we really want is to issue the bdrv_drain_all() equivalent inside
> the while loop, so that if there is any problem we simply do another
> cycle, no problem.
>
> Later, Juan.
Hi,

Actually, the way I perceive the problem is that QEMU is doing a
vm_stop() *after* the "break;" in the pseudocode above (but *before*
the drain). That means the VM could be stopped for a long time while
you're doing bdrv_drain_all().

I don't see a magic solution for this. All we can do is try and find a
way of doing this that improves the VM experience during the migration.
It's easy to argue that it's better to see your storage performance go
down for a short period of time than to see your CPUs not running for a
long period of time. After all, there's a reason "CPU downtime" is an
actual hypervisor metric.

What I'd propose is a simple improvement like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            break;
        } else if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

The idea is simple:

* When we're somewhere near, we pick only one request at a time.
* When we're really close, we stop picking up new requests. That still
  allows the block drivers to complete whatever is outstanding.
* When we're really, really close, we can break. At this point, we're
  very likely drained already.

Knowing that most OSes use 30s by default as a "this request is not
completing anymore" kind of timeout, we can even improve the above to
resume the block drivers (or abort the migration) if the time between
reaching "threshold_low" and "threshold_very_low" exceeds, say, 15s.
That can be combined with actually waiting for everything to complete
before stopping the CPUs.
A more complete version would look like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            if (bdrv_all_is_drained()) {
                break;
            } else if (bdrv_is_stopped() &&
                       (now() - ts_bdrv_stopped > 15s)) {
                bdrv_run_at_qd1();
                // or abort the migration and resume normally,
                // perhaps after a few retries
            }
        }
        if (dirty_pages < threshold_low) {
            if (!bdrv_is_stopped()) {
                bdrv_stop_picking_new_reqs();
                // record the timestamp only once, so the 15s
                // check above can actually fire
                ts_bdrv_stopped = now();
            }
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

Note that this version (somewhat) copes with
(dirty_pages < threshold_very_low) being reached before we actually
observed (dirty_pages < threshold_low). There's still a race where
requests are fired after bdrv_all_is_drained() and before
vm_stop_force_state(), but that can be easily addressed.

Thoughts?

Thanks,
Felipe