Re: [PATCH v2] block/stream: Drain subtree around graph change

Hanna Reitz Tue, 29 Mar 2022 05:17:05 -0700

On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:

29.03.2022 11:54, Hanna Reitz wrote:
On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
28.03.2022 11:09, Hanna Reitz wrote:
On 28.03.22 09:44, Hanna Reitz wrote:
On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
24.03.2022 17:09, Hanna Reitz wrote:
When the stream block job cuts out the nodes between top and base in stream_prepare(), it does not drain the subtree manually; it fetches the base node, and tries to insert it as the top node's backing node with bdrv_set_backing_hd(). bdrv_set_backing_hd() however will drain, and so the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.
This has two implications:
First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.
Second, even with such a strong reference, the problem remains that the base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.
Hmm.
So, we don't really need a strong reference, as if it helps to avoid some use-after-free, it means that we'll finish up with a wrong block graph..
Sure. But I found it better style to strongly reference a node while it’s used. I’d rather have an outdated block graph (as in: A node that was supposed to disappear would still be in use) than a use-after-free.
Graph modifying operations must be somehow isolated from each other.
Both effects can be seen in 030's TestParallelOps.test_overlapping_5() case, which has five nodes, and simultaneously streams from the middle node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.
Taking a strong reference to the base node, we no longer get a crash, but the resulting block graph is less than ideal: The expected result is obviously that all middle nodes are cut out and the base node is the immediate backing child of the top node. However, if stream_prepare() takes a strong reference to its base node (the middle node), and then the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base node when bdrv_set_backing_hd() sets it as the top node's backing node.
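
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not QEMU code: `Node`, `Graph`, `drain()`, `set_backing()` and the four-node chain are hypothetical stand-ins, and "draining" is modelled simply as letting other jobs' pending completion callbacks run.

```python
class Node:
    """Hypothetical stand-in for a block node with one backing child."""
    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing


class Graph:
    """Toy graph; `pending` holds completion callbacks of other jobs."""
    def __init__(self):
        self.pending = []

    def drain(self):
        # Assumption: a drain may let other jobs finish, and finishing
        # jobs may change the graph.
        pending, self.pending = self.pending, []
        for callback in pending:
            callback()

    def set_backing(self, node, backing):
        # Models the drain that happens inside bdrv_set_backing_hd().
        self.drain()
        node.backing = backing


def make_graph():
    graph = Graph()
    base = Node("base")
    mid = Node("mid", base)
    above_base = Node("above_base", mid)
    top = Node("top", above_base)

    def commit_done():
        # The commit job drops `mid` when it completes.
        above_base.backing = base

    graph.pending.append(commit_done)
    return graph, top, above_base


def chain(node):
    names = []
    while node is not None:
        names.append(node.name)
        node = node.backing
    return names


def stream_prepare_unsafe(graph, top, above_base):
    base = above_base.backing        # fetched before any drain
    graph.set_backing(top, base)     # drain inside changes the graph


def stream_prepare_drained(graph, top, above_base):
    graph.drain()                    # drain first, then fetch
    base = above_base.backing
    graph.set_backing(top, base)


def run(prepare):
    graph, top, above_base = make_graph()
    prepare(graph, top, above_base)
    return chain(top)
```

In this model, `run(stream_prepare_unsafe)` leaves the stale `mid` node installed under `top`, while `run(stream_prepare_drained)` ends with `top` directly on `base`.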
Emanuele has a similar idea of isolating graph changes from each other by subtree-drain.
If I understand correctly the idea is that we'll drain all other block jobs, so they wouldn't do their block-graph modification during drained section. So, we can safely modify the graph.
I don't like this idea:
1. drained section = stop IO. But we don't need to stop IO in the whole subtree to do a needed block-graph modification.
If you mean to say that draining just the single node should be sufficient, I’ll be happy to change it.
Not sure which node, though, because I’d think it would be `base`, but to safely fetch it I’d need to drain it, which seems to bite itself in the tail. That’s why I went for a subtree drain from `above_base`.
2. Drained section is not a lock, several clients may drain same set of nodes.. So we exploit the fact that concurrent clients will be paused by drained section and don't proceed to graph-modification code.. But are we sure that block-jobs are (and will be?) the only concurrent block-graph modifying clients? Can qmp commands interleave somehow?
They can under very specific circumstances and that’s a bug. See https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html.
Can some jobs from other subtree start a block-graph modification that touches our subtree?
That would be wrong. A block job shouldn’t change nodes it doesn’t own; stream doesn’t own the base, but it also doesn’t change it, it only needs to have the top node point to it.
If we go this way, it would be safer to drain the whole block-graph on any block-graph modification..
I think we'd better have a separate global mechanism for isolating graph modifications. Something like a global co-mutex or queue, where clients wait for their turn in block graph modifications.
Here is my old proposal on that topic: https://patchew.org/QEMU/20201120161622.1537-1-vsement...@virtuozzo.com/
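
The "global co-mutex" idea can be sketched as follows. This is a rough illustration only, using Python's `asyncio.Lock` as a stand-in for a coroutine mutex; `modify_graph()`, `main()` and `run_demo()` are hypothetical names and do not correspond to anything in the linked proposal.

```python
import asyncio


async def modify_graph(lock, name, log):
    # Every graph-modifying client takes the same global lock, so a
    # modification cannot interleave with another one at a yield point.
    async with lock:
        log.append(f"{name} begin")
        await asyncio.sleep(0)  # yield point, e.g. rewriting image metadata
        log.append(f"{name} end")


async def main():
    lock = asyncio.Lock()  # the single global graph-modification lock
    log = []
    await asyncio.gather(
        modify_graph(lock, "stream", log),
        modify_graph(lock, "commit", log),
    )
    return log


def run_demo():
    return asyncio.run(main())
```

Each "begin" entry in the log returned by `run_demo()` is immediately followed by the matching "end", even though both clients yield in the middle of their modification.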
That would only solve the very specific issue in 030, right?
It should solve the same issue as your patch. You don't add subtree_drain around every graph modification.. Or we already have it?
Well, I’m not saying it will solve every single bug, but draining in stream_prepare() will at least mean that that is safe from basically anything else, because it will prevent concurrent automatic graph changes (e.g. because of jobs finishing), and QMP commands shouldn’t be executed in drained sections either (when they do, it’s wrong, but that seems to occur only extremely rarely). Draining alone should make a place safe, it isn’t a lock that all sides need to take.
The stream job isn’t protected from any graph modifications but those coming from mirror. Might be a solution going forward (I didn’t look closer at it at the time, given I saw you had a discussion with Kevin), if we lock every graph change operation (though a global lock honestly doesn’t sound strictly better than draining subsections of the graph, both have their drawbacks), but that doesn’t look like it’d be something for 7.1.
Same way, with draining solution you should make a subtree-drain for every graph change operation.
Since we don’t have any lock yet, draining is the de-facto way we use to forbid concurrent graph modifications. I’m not saying we use it correctly and thoroughly, but it is what we do right now.
I wonder whether we could have a short-term version of `BdrvChild.frozen` that’s a coroutine mutex. If `.frozen` is set, you just can’t change the graph, and you also can’t wait, so that’s just an error. But if `.frozen_lock` is set, you can wait on it. Here, we’d keep `.frozen` set for all links between top and above_base, and then in prepare() take `.frozen_lock` on the link between above_base and base.
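
A minimal sketch of how the `.frozen` versus `.frozen_lock` distinction might behave, assuming the semantics described above. `.frozen_lock` does not exist in QEMU; `ChildLink`, `FrozenError`, `replace_child()` and `demo()` are hypothetical names, and `asyncio.Lock` stands in for a coroutine mutex.

```python
import asyncio


class FrozenError(Exception):
    """Raised when someone tries to change a long-term frozen link."""


class ChildLink:
    """Hypothetical parent -> child link with both freezing mechanisms."""
    def __init__(self, parent, child):
        self.parent = parent
        self.child = child
        self.frozen = False                # long-term: changing is an error
        self.frozen_lock = asyncio.Lock()  # short-term: callers can wait

    async def replace_child(self, new_child):
        if self.frozen:
            raise FrozenError(f"{self.parent} -> {self.child} is frozen")
        async with self.frozen_lock:
            self.child = new_child


async def demo():
    log = []
    top_link = ChildLink("top", "above_base")
    top_link.frozen = True
    base_link = ChildLink("above_base", "base")

    async def prepare():
        # Hold the short-term lock across a yield point.
        async with base_link.frozen_lock:
            log.append("prepare: locked")
            await asyncio.sleep(0)
            log.append(f"prepare: base is {base_link.child}")

    async def other_job():
        # Waits until prepare() releases the lock.
        await base_link.replace_child("new_base")
        log.append("other: replaced")

    await asyncio.gather(prepare(), other_job())
    try:
        await top_link.replace_child("elsewhere")
    except FrozenError:
        log.append("frozen: rejected")
    return log, base_link.child
```

In this sketch the other job's change to the `above_base` link only lands after `prepare()` has finished looking at it, while a change to the frozen link fails immediately instead of waiting.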
Yes, that seems an alternative to a global lock that doesn't block the whole graph. Still, I don't think that it is bad to lock the whole graph for graph modification, as modification should be rare and fast.
Fair enough.
Another thought: does subtree-drain really drain the whole connectivity component of the graph?
imagine something like this:

[A]  [   C  ]
 |    |    |
 v    v    v
[ B    ]  [ D ]


If we do subtree drain at node A, this will drain B and C, but not D..
Imagine, some another job is attached to node D, and it will start drained section too. So, for example both jobs will share drained section on node C. That doesn't seem safe, and draining is not a lock.
So, if we are going to rely on drained section as on lock, that isolates graph modifications from each other, we should drain the whole connectivity component of the graph.
The drained section is not a lock, but if the other job is only attached to node D, it won’t change node C. It might change the link from C to D, but that doesn’t concern the job that is concerned about A and B. Overlapping drains are fine.
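
The difference between "what a subtree drain at A reaches" and "the whole connectivity component" can be made concrete for the example graph. This is a simplified model following the description in this thread (the subtree below the start node is drained, and the parents of drained nodes are quiesced as well); `GRAPH`, `subtree_drained_nodes()` and `connected_component()` are hypothetical helpers, not QEMU functions.

```python
# Parent -> children edges of the example: A -> B, C -> B, C -> D.
GRAPH = {"A": ["B"], "C": ["B", "D"], "B": [], "D": []}


def subtree_drained_nodes(children, root):
    """Nodes quiesced by a subtree drain at `root` in this toy model."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)

    # Step 1: the subtree below `root`.
    subtree, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in subtree:
            continue
        subtree.add(node)
        stack.extend(children.get(node, ()))

    # Step 2: parents of drained nodes (transitively), but not the
    # parents' other children.
    drained, stack = set(subtree), list(subtree)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in drained:
                drained.add(parent)
                stack.append(parent)
    return drained


def connected_component(children, root):
    """All nodes reachable from `root` when edges are undirected."""
    neighbours = {}
    for parent, kids in children.items():
        neighbours.setdefault(parent, set())
        for kid in kids:
            neighbours[parent].add(kid)
            neighbours.setdefault(kid, set()).add(parent)

    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbours.get(node, ()))
    return seen
```

For the example graph, `subtree_drained_nodes(GRAPH, "A")` contains A, B and C, while `connected_component(GRAPH, "A")` additionally contains D.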
OK. Maybe it works. It's just not obvious to me that subtree_drain works good in all cases. And global graph-modification-lock should obviously work.
Next, I'm not really sure that two jobs can simultaneously enter drained section and do graph modifications. What prevents this? Assume two block-stream jobs reach their finish simultaneously and go to subtree-drain. That just means that job_pause will be called for both jobs.. But what does that mean for the block-stream job that is in the bdrv_subtree_drained_begin() call in stream_prepare()? Seems nothing? Then both jobs will start the graph modification process simultaneously and can interleave on any yield point (for example rewriting backing_file in qcow2 metadata).
So I don’t think that scenario can really happen, because the stream job freezes the chain between above_base and top, so you can’t really have two simultaneous stream jobs that cause problems between each other.
They cause problem on the boundary, as base of one stream job may be top of another, and we have also a filter, that should be inserted/removed at some moment. As I remember, that's the problematic case in 030..
Furthermore, the prepare() functions are run in the main thread, so the only real danger is actually that draining around the actual graph modification (bdrv_set_backing_hd()) causes another block job to finish, modifying the block graph. But then that job will also actually finish, because it’s all in the main thread.
It is true that child_job_drained_poll() says that jobs that are about to prepare() are quiesced, but I don’t think that’s a problem, given that all jobs in that state run in the main thread.
Another reason, why I think that subtree drain is a wrong tool, as I said, is extra IO-stop.
I know and agree, but that’s an optimization question.
Imagine the following graph:

[A]
 |
 v
[B] [C]
 |   |
 v   v
[base]
If we want to rebase A onto base, we actually need only stop IO requests in A and B. Why C should suffer from this graph modification? IO requests produced by C, and that are living in C and in base don't intersect with rebasing A on base process in any way.
====
Actually, I'm not strictly against your patch, and believe that it fixes the problem in most cases. And it's probably OK in short term. The only real doubt on including it now is that drained sections sometimes lead to dead locks, and is it possible that we now fix the bug that happens only in iotest 30 (or is it reported somewhere?) and risking to introduce some dead-lock?
Saying that the example in 030 is contrived would mean we could/should re-include the base into the list of nodes that belong to the stream job, which would simply disallow the case in 030 that’s being tested and fails.
Then we don’t need a subtree drain, and the plain drain in bdrv_set_backing_hd() would be fine.
Seems that if in some code it's safe to call drained_begin(), it should be safe to call subtree_drained_begin(). And if it triggers some deadlock, it just shows some another bug.. Is it worth fixing now, near to 7.0 release? We live with this bug for years.. Or something changed that I missed?
I mean... I can understand your concern that adding a subtree drain has performance implications (when a stream job ends, which shouldn’t be often). But I’m not sure whether I should share the deadlock concern. Sounds like a sad state of affairs if I can’t just drain something when I need it to be drained.
I wasn’t aware of this bug, actually. Now I am, and I feel rather uncomfortable living with a use-after-free bug, because that’s on the worse end of the bug spectrum.
OK, I don't know:) And others are silent. Agree that global-lock solution is not for 7.0 anyway. And this one is simple enough. So, take my
Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov...@mail.ru>


Thanks!

A global lock solution sounds good to me for 7.1+, and it even provides a solution to solving the problem with QMP commands being executed while other main-thread code is running. (I mean, the QMP command would still be executed, but at least concurrent graph changes would be impossible.)
