On 10/10/2016 12:36, Kevin Wolf wrote:
> 
> At the BlockBackend level (or really anywhere in the graph), we have two
> different kinds of drain requests that we just fail to implement
> differently today:
> 
> 1. blk_drain(), i.e. a request to actually drain all requests pending in
>    this BlockBackend. This is what we implement today, even though only
>    indirectly: We call bdrv_drain() on the root BDS and it calls back
>    into what is case 2.
> 
> 2. BdrvChildRole.drained_begin/end(), i.e. stop sending requests to a
>    child node. The efficient way to do this is not what we're doing
>    today (sending all throttled requests and waiting for their
>    completion), but just stopping to process throttled requests.
> 
> Your approach would permanently prevent us from doing this right and
> force us to do the full drain even for the second case.

You're entirely correct that these patches are a layering violation.
However, I think you're wrong that they are a step backwards, because
they are merely exposing pre-existing gaps in the BDS/BB separation.  In
fact, I think that these patches are a very good first step towards
getting the design correct, and I suspect that we agree on the design
since we have talked about it in the past.

Now, the good news is that we have an API that makes sense, and we're
using it mostly correctly.  (The exception is that there are places
where we don't have a BlockBackend and thus call
bdrv_drain/bdrv_co_drain instead of blk_drain/blk_co_drain).  Yes,
bdrv_drained_begin and bdrv_drained_end may need tweaks to work with
multiple parents, but wherever we use one of the APIs in the family
we're using the right one.

The bad news is that while the APIs are good, they are implemented in
the wrong way---all of them, though some less than others.  In
particular, blk_drain, bdrv_drain and
bdrv_drained_begin/bdrv_drained_end represent three completely
different concepts, but we conflate their implementations.  But again,
knowing and exposing the problem is the first step towards solving it,
and IMO these patches, if anything, make the problem easier to solve.


So, let's first of all look at what the three operations should mean.

blk_drain should mean "wait for completion of all requests sent from
*this* BlockBackend".  Typically it would be used by the owner of the
BlockBackend to get its own *data structures* in a known-good state.  It
should _not_ imply any wait for metadata writes to complete, and it
should not do anything to requests sent from other BlockBackends.  Three
random notes: 1) this should obviously include throttled operations,
just like it does now; 2) there should be a blk_co_drain too; 3) if the
implementation can't be contained entirely within BlockBackend, we're
doing it wrong.
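To make note 3 concrete, here is a toy model in plain C (not actual
QEMU code; the fields and helpers are only stand-ins) of a blk_drain
whose implementation lives entirely within BlockBackend: every request
submitted through this BB bumps a counter, and draining just means
waiting for the counter to reach zero, throttled requests included.

```c
#include <assert.h>

/* Toy BlockBackend: only what the model needs. */
typedef struct BlockBackend {
    unsigned in_flight;   /* requests sent from *this* BB, throttled too */
} BlockBackend;

static void blk_inc_in_flight(BlockBackend *blk) { blk->in_flight++; }
static void blk_dec_in_flight(BlockBackend *blk) { blk->in_flight--; }

/* Wait for all requests from this BB.  In real code the loop body
 * would be an aio_poll() that lets completions run; here completion
 * is simulated by decrementing the counter directly. */
static void blk_drain(BlockBackend *blk)
{
    while (blk->in_flight > 0) {
        blk_dec_in_flight(blk);   /* simulated request completion */
    }
}
```

Nothing in this sketch touches a BDS, which is the point: if the real
implementation needed to, we'd be doing it wrong.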

bdrv_drain should only be needed in bdrv_set_aio_context and bdrv_close;
everyone else should drain their own BlockBackend or use a section.  Its
role is to ensure the BDS's children are completely quiescent.
Quiescence of the parents should be a precondition:
bdrv_set_aio_context should use bdrv_drained_begin/bdrv_drained_end for
this purpose, while in the case of bdrv_close there should be no parents.

bdrv_{co_,}drained_begin and bdrv_{co_,}drained_end should split into two:

- blk_{co_,}isolate_begin(blk) and blk_{co_,}isolate_end(blk).  These
operations visit the BdrvChild graph in the child-to-parent direction,
starting at blk->root->bs, and tell each root BdrvChild (other than
blk->root)
to quiesce itself.  Quiescence means ensuring that no new I/O operations
are triggered.  This in turn has both an external aspect (a Notifier or
other callback invoked by blk_root_drained_begin/end) and an internal
aspect (throttling).

- bdrv_drained_begin(bs) and bdrv_drained_end(bs), which quiesce all
root BdrvChildren reachable from bs; bdrv_drained_begin then does a
recursive bdrv_drain(bs) to flush all pending writes.  Just like
bdrv_drain, bdrv_drained_begin and bdrv_drained_end should be used very
rarely, possibly only in bdrv_set_aio_context.
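The composition of the two pieces can be sketched as a toy model (all
helper names here are hypothetical, and quiescence/pending state is
reduced to plain fields): bdrv_drained_begin first quiesces the root
BdrvChildren, and only then flushes what is already pending.

```c
#include <assert.h>

/* Toy node: a quiesce flag and a pending-request count. */
typedef struct BlockDriverState {
    int quiesced;    /* root BdrvChildren told to stop sending I/O */
    int pending;     /* requests already in flight below this node */
} BlockDriverState;

/* Quiesce every root BdrvChild reachable from bs (modelled as a flag). */
static void quiesce_roots(BlockDriverState *bs)   { bs->quiesced = 1; }
static void unquiesce_roots(BlockDriverState *bs) { bs->quiesced = 0; }

/* Recursive flush of pending writes (modelled as zeroing the count). */
static void bdrv_drain(BlockDriverState *bs) { bs->pending = 0; }

static void bdrv_drained_begin(BlockDriverState *bs)
{
    quiesce_roots(bs);   /* first: no new requests can arrive */
    bdrv_drain(bs);      /* then: flush everything already pending */
}

static void bdrv_drained_end(BlockDriverState *bs)
{
    unquiesce_roots(bs);
}
```

The ordering is the whole content of the model: draining is only
meaningful once the parents can no longer submit new I/O.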


In this new setting, BlockBackend need not disable throttling in
blk_root_drained_begin/end, and this is important because it explains
why conflating the concepts is bad.  Once BlockBackends can have
"isolated sections" instead of "drained sections", calling blk_drain
from blk_root_drained_begin/end is just one way to ensure that throttled
writes are not submitted.  It's not a very good one, but that doesn't
change the fact that blk_drain should wait for throttled requests to
complete without disabling throttling!  In fact, I surmise that any
other case where you want blk_drain to ignore throttling is a case where
you have a bug or missing feature elsewhere, for example some minimal
bdrv_cancel support that lets you cancel those throttled requests.  (And
I suspect that the handling of plugged operations in bdrv_drain is
the same story, but I haven't put much thought into it.)
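The distinction argued here can be shown with another toy model (not
QEMU code; the quiesce counter and restart helper are stand-ins for
the real throttling machinery): a quiesced BlockBackend simply stops
dequeuing throttled requests, instead of flushing them through.

```c
#include <assert.h>

typedef struct BlockBackend {
    int throttled_pending;   /* queued, not yet submitted */
    int in_flight;           /* submitted, not yet completed */
    int quiesced;            /* blk_root_drained_begin() nesting depth */
} BlockBackend;

/* Submission is gated on quiescence, never forced through it. */
static void blk_restart_throttled(BlockBackend *blk)
{
    while (!blk->quiesced && blk->throttled_pending > 0) {
        blk->throttled_pending--;
        blk->in_flight++;
        blk->in_flight--;          /* request completes (model) */
    }
}

static void blk_root_drained_begin(BlockBackend *blk)
{
    blk->quiesced++;               /* no blk_drain(), no flushing */
}

static void blk_root_drained_end(BlockBackend *blk)
{
    if (--blk->quiesced == 0) {
        blk_restart_throttled(blk);
    }
}
```

While the section is open, throttled requests stay queued; they resume
only when the last blk_root_drained_end() closes it.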


Now let's look at the implementation.

blk_drain and blk_co_drain can simply count in-flight requests, exactly
as done in patch 1.  Sure, I'm adding it at the wrong layer because not
every user of the block layer has already gotten its own personal BB.
However, moving the count from BDS to BB is easy, and another thing that
patch 1 gets right is to track the lifetime of AIO requests correctly
and remove the "double aio_poll" hack once and for all from the
implementation of bdrv_drain.

Because of the new precondition, bdrv_drain only needs to do the
recursive descent that it does in patch 2, without the previous
quiescing.  So that's another thing that the series nudges in the right
direction. :)  bdrv_drain should probably be renamed, but I can't think
of a good name.  Food for thought: consider merging the .bdrv_drain
callback with .bdrv_inactivate.

blk_isolate_begin(blk) should be just "bdrv_quiesce_begin(blk->root)"
(I'm now really sold on that BdrvChild thing!) and likewise for
blk_isolate_end->bdrv_quiesce_end, with bdrv_quiesce_begin/end doing the
child-to-parent walk.  Their argument means "invoke quiescing
callbacks on all parents of blk->root->bs, except for blk->root itself".
The BdrvChild callbacks would be renamed to .bdrv_quiesce_begin and
.bdrv_quiesce_end.
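The walk itself is short.  A toy model (hypothetical names throughout,
following the renaming above; the parent list is a plain singly linked
list and the callback just sets a flag) of bdrv_quiesce_begin visiting
every parent edge of skip->bs except skip itself:

```c
#include <assert.h>
#include <stddef.h>

typedef struct BdrvChild BdrvChild;

typedef struct BlockDriverState {
    BdrvChild *parents;            /* edges whose child end is this node */
} BlockDriverState;

/* One parent->child edge; quiesced stands in for the effect of the
 * renamed .bdrv_quiesce_begin callback. */
struct BdrvChild {
    BlockDriverState *bs;          /* the child end of the edge */
    int quiesced;
    BdrvChild *next_parent;
};

static void child_quiesce_begin(BdrvChild *c) { c->quiesced = 1; }

/* Quiesce every parent of skip->bs except skip itself. */
static void bdrv_quiesce_begin(BdrvChild *skip)
{
    for (BdrvChild *c = skip->bs->parents; c; c = c->next_parent) {
        if (c != skip) {
            child_quiesce_begin(c);
        }
    }
}
```

A full implementation would also recurse child-to-parent through
non-root parents; one level is enough to show the "skip the caller's
own edge" semantics.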

bdrv_drained_begin(bs) and bdrv_drained_end(bs) should be implemented as
above.

bdrv_drain_all should be changed to bdrv_quiesce_all_{begin,end}.  qdev
drive properties would install a vmstate notifier to quiesce their own
BlockBackend rather than relying on a blind bdrv_drain_all from vm_stop.
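The qdev side could look roughly like this toy model (the handler
shape loosely mirrors QEMU's VMChangeStateHandler; everything else,
including the drain model, is a stand-in): on the running->stopped
transition, each device drains only its own BlockBackend.

```c
#include <assert.h>

typedef struct BlockBackend {
    unsigned in_flight;
} BlockBackend;

/* Model: draining completes everything this BB has outstanding. */
static void blk_drain(BlockBackend *blk) { blk->in_flight = 0; }

/* Installed per device as a vmstate notifier (hypothetical wiring). */
static void drive_vm_state_changed(void *opaque, int running)
{
    BlockBackend *blk = opaque;
    if (!running) {
        blk_drain(blk);   /* quiesce just this device's BB on vm_stop */
    }
}
```

Each device thus quiesces exactly what it owns, and the blanket
bdrv_drain_all in vm_stop becomes unnecessary.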


And thirdly, how to get there.

It should not surprise you that I think that these patches are the first
step.  The second is to separate the sections (isolated and quiescent)
from the concept of draining, including phasing out bdrv_drain_all.  At
that point the multi-parent case is handled correctly and you can
introduce "BB for everyone", including block jobs, block migration and
the NBD server.  Finally you clean up bdrv_drain and move the in-flight
count from BDS to BB.

Thoughts?

Paolo

> 
> Maybe things become even a bit clearer if you extend your view of the
> graph by one level and take the users of BlockBackends into account.
> Nobody would expect block jobs or even guest devices to increment the
> refcount of BDSes for requests that they will ever want to issue.
> Instead, what we do is to stop them from sending requests while the
> child is drained. This is the model for BlockBackend.
