Am 27.06.2018 um 16:30 hat Max Reitz geschrieben: > On 2018-04-11 18:39, Kevin Wolf wrote: > > We cannot allow aio_poll() in bdrv_drain_invoke(begin=true) until we're > > done with propagating the drain through the graph and are doing the > > single final BDRV_POLL_WHILE(). > > > > Just schedule the coroutine with the callback and increase bs->in_flight > > to make sure that the polling phase will wait for it. > > > > Signed-off-by: Kevin Wolf <[email protected]> > > --- > > block/io.c | 28 +++++++++++++++++++++++----- > > 1 file changed, 23 insertions(+), 5 deletions(-) > > According to bisect, this breaks blockdev-snapshot with QED:
I have no idea why it would trigger only with qed, but I think the real
bug is in commit dcf94a23b1 ('block: Don't poll in parent drain
callbacks').
The crash is specifically in the place where the new overlay needs to be
drained because it's new backing file is drained. First,
bdrv_drain_invoke() creates new activity on the overlay when it gets the
old active layer attached as a backing file:
#0 0x000055b90eb73b88 in bdrv_drain_invoke (bs=0x55b910fd8440,
begin=<optimized out>) at block/io.c:217
#1 0x000055b90eb1e2b0 in bdrv_replace_child_noperm (child=0x55b911f32450,
new_bs=<optimized out>) at block.c:2069
#2 0x000055b90eb20875 in bdrv_replace_child (child=<optimized out>,
new_bs=0x55b910fd2130) at block.c:2098
#3 0x000055b90eb20c53 in bdrv_root_attach_child
(child_bs=child_bs@entry=0x55b910fd2130,
child_name=child_name@entry=0x55b90eda75e5 "backing",
child_role=child_role@entry=0x55b90f3d3300 <child_backing>, perm=0,
shared_perm=31, opaque=opaque@entry=0x55b910fd8440, errp=0x7fffb8943620) at
block.c:2141
#4 0x000055b90eb20d60 in bdrv_attach_child
(parent_bs=parent_bs@entry=0x55b910fd8440,
child_bs=child_bs@entry=0x55b910fd2130,
child_name=child_name@entry=0x55b90eda75e5 "backing",
child_role=child_role@entry=0x55b90f3d3300 <child_backing>,
errp=0x7fffb8943620) at block.c:2162
#5 0x000055b90eb23c30 in bdrv_set_backing_hd (bs=bs@entry=0x55b910fd8440,
backing_hd=backing_hd@entry=0x55b910fd2130, errp=errp@entry=0x7fffb8943620) at
block.c:2249
#6 0x000055b90eb24a76 in bdrv_append (bs_new=0x55b910fd8440,
bs_top=0x55b910fd2130, errp=errp@entry=0x7fffb8943680) at block.c:3535
#7 0x000055b90e937a89 in external_snapshot_prepare (common=0x55b910f5cf90,
errp=0x7fffb89436f8) at blockdev.c:1680
And then, when trying to move all users of the old active layer to the
new overlay, it turns out that the bdrv_drain_invoke() BH hasn't been
executed yet:
#0 0x00007fdfcef5d9fb in raise () at /lib64/libc.so.6
#1 0x00007fdfcef5f800 in abort () at /lib64/libc.so.6
#2 0x00007fdfcef560da in __assert_fail_base () at /lib64/libc.so.6
#3 0x00007fdfcef56152 in () at /lib64/libc.so.6
#4 0x000055b90eb24867 in bdrv_replace_node (from=<optimized out>,
to=0x55b910fd8440, errp=0x7fffb8943620) at block.c:3470
#5 0x000055b90eb24abe in bdrv_append (bs_new=0x55b910fd8440,
bs_top=0x55b910fd2130, errp=errp@entry=0x7fffb8943680) at block.c:3541
#6 0x000055b90e937a89 in external_snapshot_prepare (common=0x55b910f5cf90,
errp=0x7fffb89436f8) at blockdev.c:1680
Not polling in bdrv_child_cb_drained_begin() is great if the original
drain is still polling, but if the original drain_begin has already
returned, we obviously do need to poll so we don't enter a drained
section while requests are still running.
Another thing I noticed is that qmp_transaction() only calls a one-time
bdrv_drain_all() instead of using a proper begin/end section. I suppose
this should be fixed as well.
Kevin
signature.asc
Description: PGP signature
