> On 28 Sep 2016, at 10:03, Juan Quintela <quint...@redhat.com> wrote:
>
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
>> * Stefan Hajnoczi (stefa...@gmail.com) wrote:
>>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>>>> Heya!
>>>>
>>>>> On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>>>>>
>>>>> At KVM Forum an interesting idea was proposed to avoid
>>>>> bdrv_drain_all() during live migration. Mike Cui and Felipe
>>>>> Franciosi mentioned running at queue depth 1. It needs more thought
>>>>> to make it workable, but I want to capture it here for discussion
>>>>> and to archive it.
>>>>>
>>>>> bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>>>>> requests hang. We should find a better way of quiescing I/O that is
>>>>> not synchronous. Up until now I thought we should simply add a
>>>>> timeout to bdrv_drain_all() so it can at least fail (and live
>>>>> migration would fail) if I/O is stuck, instead of hanging the VM.
>>>>> But the following approach is also interesting...
>>>>>
>>>>> During the iteration phase of live migration we could limit the
>>>>> queue depth so that points with no I/O requests in flight are
>>>>> identified. At these points the migration algorithm has the
>>>>> opportunity to move to the next phase without requiring
>>>>> bdrv_drain_all(), since no requests are pending.
>>>>
>>>> I actually think that this "I/O quiesced state" is highly unlikely
>>>> to _just_ happen on a busy guest. The main idea behind running at
>>>> QD1 is to naturally throttle the guest and make it easier to
>>>> "force quiesce" the VQs.
>>>>
>>>> In other words, if the guest is busy and we run at QD1, I would
>>>> expect the rings to be quite full of pending (i.e. unprocessed)
>>>> requests. At the same time, I would expect a call to
>>>> bdrv_drain_all() (as part of do_vm_stop()) to complete much
>>>> quicker.
>>>>
>>>> Nevertheless, you mentioned that this is still problematic, as that
>>>> single outstanding I/O could block, leaving the VM paused for
>>>> longer.
>>>>
>>>> My suggestion is therefore that we leave the vCPUs running, but
>>>> stop picking up requests from the VQs. Provided nothing blocks,
>>>> you should reach the "I/O quiesced state" fairly quickly. If you
>>>> don't, then the VM is at least still running (despite seeing no
>>>> progress on its VQs).
>>>>
>>>> Thoughts on that?
>>>
>>> If the guest experiences a hung disk it may enter error recovery.
>>> QEMU should avoid this so the guest doesn't remount file systems
>>> read-only.
>>>
>>> This can be solved by only quiescing the disk for, say, 30 seconds
>>> at a time. If we don't reach a point where live migration can
>>> proceed during those 30 seconds, then the disk will service requests
>>> again temporarily to avoid upsetting the guest.
>>>
>>> I wonder if Juan or David have any thoughts from the live migration
>>> perspective?
>>
>> Throttling I/O to reduce the time spent in the final drain makes
>> sense to me, however:
>>
>> a) It doesn't solve the problem if the I/O device dies at just the
>>    wrong time, so you can still get that hang in bdrv_drain_all().
>>
>> b) Completely stopping guest I/O sounds too drastic to me unless you
>>    can time it to be just at the point before the end of migration;
>>    that feels tricky to get right unless you can somehow tie it to an
>>    estimate of the remaining dirty RAM (which never works that well).
>>
>> c) Something like a 30-second pause still feels too long; for a big
>>    hairy database workload it would effectively be 30 seconds of
>>    downtime.
>>
>> Dave
>
> I think something like the proposed scheme could work.
>
> We can set queue depth = 1 or somesuch when we know we are near
> completion of the migration. What we need, then, is a way to call the
> equivalent of bdrv_drain_all() such that it returns EAGAIN or EBUSY if
> it is a bad moment.
> In that case, we just do another round over the whole memory, or retry
> in X seconds. Anything is fine for us; we just need a way to ask for
> the operation without it blocking.
>
> Notice that migration is the equivalent of:
>
>     while (true) {
>         write_some_dirty_pages();
>         if (dirty_pages < threshold) {
>             break;
>         }
>     }
>     bdrv_drain_all();
>     write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted.)
>
> What we really want is to issue the bdrv_drain_all() equivalent inside
> the while loop, so that if there is any problem we simply do another
> cycle, no problem.
>
> Later, Juan.
Hi,

Actually, the way I perceive the problem is that QEMU is doing a
vm_stop() *after* the "break;" in the pseudocode above (but *before*
the drain). That means the VM could be stopped for a long time while
you're doing bdrv_drain_all().

I don't see a magic solution for this. All we can do is try and find a
way of doing this that improves the VM experience during the migration.
It's easy to argue that it's better to see your storage performance go
down for a short period of time than to see your CPUs not running for a
long period of time. After all, there's a reason "CPU downtime" is an
actual hypervisor metric.

What I'd propose is a simple improvement like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            break;
        } else if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

The idea is simple:

* When we're somewhere near, we pick only one request at a time.
* When we're really close, we stop picking up new requests. That still
  allows the block drivers to complete whatever is outstanding.
* When we're really, really close, we can break. At this point, we're
  very likely drained already.

Knowing that most OSes use 30s by default as a "this request is not
completing anymore" kind of timeout, we can even improve the above to
resume the block drivers (or abort the migration) if the time between
reaching "threshold_low" and "threshold_very_low" exceeds, say, 15s.
That can be combined with actually waiting for everything to complete
before stopping the CPUs.
A more complete version would look like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            if (bdrv_all_is_drained()) {
                break;
            } else if (bdrv_is_stopped() &&
                       (now() - ts_bdrv_stopped > 15s)) {
                bdrv_run_at_qd1();
                // or abort the migration and resume normally,
                // perhaps after a few retries
            }
        }
        if (dirty_pages < threshold_low) {
            if (!bdrv_is_stopped()) {
                bdrv_stop_picking_new_reqs();
                // record the timestamp only once, so the 15s
                // check above can actually fire
                ts_bdrv_stopped = now();
            }
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

Note that this version (somewhat) copes with
(dirty_pages < threshold_very_low) being reached before we actually
observed (dirty_pages < threshold_low). There's still a race where
requests are fired after bdrv_all_is_drained() and before
vm_stop_force_state(), but that can be easily addressed.

Thoughts?

Thanks,
Felipe