On 10/10/2016 12:36, Kevin Wolf wrote:
> At the BlockBackend level (or really anywhere in the graph), we have two
> different kinds of drain requests that we just fail to implement
> differently today:
>
> 1. blk_drain(), i.e. a request to actually drain all requests pending in
>    this BlockBackend. This is what we implement today, even though only
>    indirectly: We call bdrv_drain() on the root BDS and it calls back
>    into what is case 2.
>
> 2. BdrvChildRole.drained_begin/end(), i.e. stop sending requests to a
>    child node. The efficient way to do this is not what we're doing
>    today (sending all throttled requests and waiting for their
>    completion), but just stopping to process throttled requests.
>
> Your approach would permanently prevent us from doing this right and
> force us to do the full drain even for the second case.
You're entirely correct that these patches are a layering violation. However, I think you're wrong that they are a step backwards, because they are merely exposing pre-existing gaps in the BDS/BB separation. In fact, I think that these patches are a very good first step towards getting the design correct, and I suspect that we agree on the design, since we have talked about it in the past.

Now, the good news is that we have an API that makes sense, and we're using it mostly correctly. (The exception is that there are places where we don't have a BlockBackend and thus call bdrv_drain/bdrv_co_drain instead of blk_drain/blk_co_drain.) Yes, bdrv_drained_begin and bdrv_drained_end may need tweaks to work with multiple parents, but wherever we use one of the APIs in the family, we're using the right one.

The bad news is that while the APIs are good, they are implemented in the wrong way---all of them, though some less than others. In particular, blk_drain, bdrv_drain and bdrv_drained_begin/bdrv_drained_end represent three completely different concepts, but we conflate the implementations. But again, knowing and exposing the problem is the first step of solving it, and IMO these patches, if anything, make the problem easier to solve.

So, let's first of all look at what the three operations should mean.

blk_drain should mean "wait for completion of all requests sent from *this* BlockBackend". Typically it would be used by the owner of the BlockBackend to get its own *data structures* in a known-good state. It should _not_ imply any wait for metadata writes to complete, and it should not do anything to requests sent from other BlockBackends. Three random notes:

1) this should obviously include throttled operations, just like it does now;

2) there should be a blk_co_drain too;

3) if the implementation can't be contained entirely within BlockBackend, we're doing it wrong.
bdrv_drain should only be needed in bdrv_set_aio_context and bdrv_close; everyone else should drain its own BlockBackend or use a section. Its role is to ensure the BDS's children are completely quiescent. Quiescence of the parents should be a precondition: bdrv_set_aio_context should use bdrv_drained_begin/bdrv_drained_end for this purpose, while in the case of bdrv_close there should be no parents.

bdrv_{co_,}drained_begin and bdrv_{co_,}drained_end should split into two:

- blk_{co_,}isolate_begin(blk) and blk_{co_,}isolate_end(blk). These operations visit the BdrvChild graph in the child-to-parent direction, starting at blk->root->bs, and tell each root BdrvChild (other than blk) to quiesce itself. Quiescence means ensuring that no new I/O operations are triggered. This in turn has both an external aspect (a Notifier or other callback invoked by blk_root_drained_begin/end) and an internal aspect (throttling).

- bdrv_drained_begin(bs) and bdrv_drained_end(bs), which quiesce all root BdrvChildren reachable from bs; bdrv_drained_begin then does a recursive bdrv_drain(bs) to flush all pending writes. Just like bdrv_drain, bdrv_drained_begin and bdrv_drained_end should be used very rarely, possibly only in bdrv_set_aio_context.

In this new setting, BlockBackend need not disable throttling in blk_root_drained_begin/end, and this is important because it explains why conflating the concepts is bad. Once BlockBackends can have "isolated sections" instead of "drained sections", calling blk_drain from blk_root_drained_begin/end is just one way to ensure that throttled writes are not submitted. It's not a very good one, but that doesn't change the fact that blk_drain should wait for throttled requests to complete without disabling throttling!
In fact, I surmise that any other case where you want blk_drain to ignore throttling is a case where you have a bug or missing feature elsewhere, for example some minimal bdrv_cancel support that lets you cancel those throttled requests. (And I suspect that the handling of plugged operations in bdrv_drain should be the same, but I haven't put much thought into it.)

Now let's look at the implementation. blk_drain and blk_co_drain can simply count in-flight requests, exactly as done in patch 1. Sure, I'm adding it at the wrong layer, because not every user of the block layer has already gotten its own personal BB. However, moving the count from BDS to BB is easy, and another thing that patch 1 gets right is to track the lifetime of AIO requests correctly and remove the "double aio_poll" hack once and for all from the implementation of bdrv_drain.

Because of the new precondition, bdrv_drain only needs to do the recursive descent that it does in patch 2, without the previous quiescing. So that's another thing that the series nudges in the right direction. :) bdrv_drain should probably be renamed, but I can't think of a good name. Food for thought: think of merging the .bdrv_drain callback with .bdrv_inactivate.

blk_isolate_begin(blk) should be just "bdrv_quiesce_begin(blk->root)" (I'm now really sold on that BdrvChild thing!) and likewise for blk_isolate_end->bdrv_quiesce_end, with bdrv_quiesce_begin/end doing the child-to-parent walk. Their first argument means "invoke quiescing callbacks on all parents of blk->root, except for blk->root itself". The BdrvChild callbacks would be renamed to .bdrv_quiesce_begin and .bdrv_quiesce_end.

bdrv_drained_begin(bs) and bdrv_drained_end(bs) should be implemented as above. bdrv_drain_all should be changed to bdrv_quiesce_all_{begin,end}. qdev drive properties would install a vmstate notifier to quiesce their own BlockBackend rather than relying on a blind bdrv_drain_all from vm_stop.

And thirdly, how to get there.
It should not surprise you that I think that these patches are the first step. The second is to separate sections (isolated and quiescent) from the concept of draining, including phasing out bdrv_drain_all. Now the multi-parent case is handled correctly and you can introduce "BB for everyone", including block jobs, block migration and the NBD server. Finally you clean up bdrv_drain and move the in-flight count from BDS to BB.

Thoughts?

Paolo

> Maybe things become even a bit clearer if you extend your view of the
> graph by one level and take the users of BlockBackends into account.
> Nobody would expect block jobs or even guest devices to increment the
> refcount of BDSes for requests that they will ever want to issue.
> Instead, what we do is to stop them from sending requests while the
> child is drained. This is the model for BlockBackend.