Re: [Qemu-block] [Qemu-discuss] qemu-img convert stuck

2018-04-11 Thread David Lee
On Thu, Apr 12, 2018 at 10:03 AM, Fam Zheng  wrote:

> On Thu, 04/12 09:51, David Lee wrote:
> > On Mon, Apr 9, 2018 at 3:35 AM, Benny Zlotnik 
> wrote:
> >
> > > $ gdb -p 13024 -batch -ex "thread apply all bt"
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > 0x7f98275cfaff in ppoll () from /lib64/libc.so.6
> > >
> > > Thread 1 (Thread 0x7f983e30ab00 (LWP 13024)):
> > > #0  0x7f98275cfaff in ppoll () from /lib64/libc.so.6
> > > #1  0x55b55cf59d69 in qemu_poll_ns ()
> > > #2  0x55b55cf5ba45 in aio_poll ()
> > > #3  0x55b55ceedc0f in bdrv_get_block_status_above ()
> > > #4  0x55b55cea3611 in convert_iteration_sectors ()
> > > #5  0x55b55cea4352 in img_convert ()
> > > #6  0x55b55ce9d819 in main ()
> >
> >
> > My team hit this issue too after switching to CentOS 7.4 with qemu-img
> > 2.9.0.
> > gdb shows exactly the same backtrace when the convert is stuck, and we
> > are on NFS.
> >
> > Later we found the following:
> > 1. The hang can happen on local storage, too.
> > 2. Replacing qemu-img 2.9.0 with 2.6.0 makes everything work smoothly
> > again.
> >
> > BTW, we use "qemu-img convert" to convert a qcow2 image and its backing
> > files into a single qcow2 image.
>
> Maybe it is RHBZ 1508886?
>
> Fam
>


Thanks, Fam.  We just tracked this down to the same BZ and are about to
try commit ef6dada8b44e1e7c4bec5c1115903af9af415b50.


-- 
Thanks,
Li Qun


Re: [Qemu-block] [Qemu-discuss] qemu-img convert stuck

2018-04-11 Thread Fam Zheng
On Thu, 04/12 09:51, David Lee wrote:
> On Mon, Apr 9, 2018 at 3:35 AM, Benny Zlotnik  wrote:
> 
> > $ gdb -p 13024 -batch -ex "thread apply all bt"
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > 0x7f98275cfaff in ppoll () from /lib64/libc.so.6
> >
> > Thread 1 (Thread 0x7f983e30ab00 (LWP 13024)):
> > #0  0x7f98275cfaff in ppoll () from /lib64/libc.so.6
> > #1  0x55b55cf59d69 in qemu_poll_ns ()
> > #2  0x55b55cf5ba45 in aio_poll ()
> > #3  0x55b55ceedc0f in bdrv_get_block_status_above ()
> > #4  0x55b55cea3611 in convert_iteration_sectors ()
> > #5  0x55b55cea4352 in img_convert ()
> > #6  0x55b55ce9d819 in main ()
> 
> 
> My team hit this issue too after switching to CentOS 7.4 with qemu-img
> 2.9.0.
> gdb shows exactly the same backtrace when the convert is stuck, and we are
> on NFS.
> 
> Later we found the following:
> 1. The hang can happen on local storage, too.
> 2. Replacing qemu-img 2.9.0 with 2.6.0 makes everything work smoothly again.
> 
> BTW, we use "qemu-img convert" to convert a qcow2 image and its backing
> files into a single qcow2 image.

Maybe it is RHBZ 1508886?

Fam



Re: [Qemu-block] [Qemu-discuss] qemu-img convert stuck

2018-04-11 Thread David Lee
On Mon, Apr 9, 2018 at 3:35 AM, Benny Zlotnik  wrote:

> $ gdb -p 13024 -batch -ex "thread apply all bt"
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x7f98275cfaff in ppoll () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f983e30ab00 (LWP 13024)):
> #0  0x7f98275cfaff in ppoll () from /lib64/libc.so.6
> #1  0x55b55cf59d69 in qemu_poll_ns ()
> #2  0x55b55cf5ba45 in aio_poll ()
> #3  0x55b55ceedc0f in bdrv_get_block_status_above ()
> #4  0x55b55cea3611 in convert_iteration_sectors ()
> #5  0x55b55cea4352 in img_convert ()
> #6  0x55b55ce9d819 in main ()


My team hit this issue too after switching to CentOS 7.4 with qemu-img
2.9.0.
gdb shows exactly the same backtrace when the convert is stuck, and we are
on NFS.

Later we found the following:
1. The hang can happen on local storage, too.
2. Replacing qemu-img 2.9.0 with 2.6.0 makes everything work smoothly again.

BTW, we use "qemu-img convert" to convert a qcow2 image and its backing
files into a single qcow2 image.


> On Sun, Apr 8, 2018 at 10:28 PM, Nir Soffer  wrote:
>
> > On Sun, Apr 8, 2018 at 9:27 PM Benny Zlotnik 
> wrote:
> >
> >> Hi,
> >>
> >> A copy operation initiated by RHEV got stuck for more than a day
> >> and is consuming plenty of CPU:
> >> vdsm 13024  3117 99 Apr07 ?1-06:58:43 /usr/bin/qemu-img
> >> convert
> >> -p -t none -T none -f qcow2
> >> /rhev/data-center/bb422fac-81c5-4fea-8782-3498bb5c8a59/
> >> 26989331-2c39-4b34-a7ed-d7dd7703646c/images/597e12b6-
> >> 19f5-45bd-868f-767600c7115e/62a5492e-e120-4c25-898e-9f5f5629853e
> >> -O raw /rhev/data-center/mnt/mantis-nfs-lif1.lab.eng.tlv2.redhat.com:
> >> _vol__service/26989331-2c39-4b34-a7ed-d7dd7703646c/images/
> >> 9ece9408-9ca6-48cd-992a-6f590c710672/06d6d3c0-beb8-
> 4b6b-ab00-56523df185da
> >>
> >> The target image appears to have no data yet:
> >> qemu-img info 06d6d3c0-beb8-4b6b-ab00-56523df185da
> >> image: 06d6d3c0-beb8-4b6b-ab00-56523df185da
> >> file format: raw
> >> virtual size: 120G (128849018880 bytes)
> >> disk size: 0
> >>
> >> strace -p 13024 -tt -T -f shows only:
> >> ...
> >> 21:13:01.309382 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.10>
> >> 21:13:01.309411 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.09>
> >> 21:13:01.309440 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.09>
> >> 21:13:01.309468 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.10>
> >>
> >> version: qemu-img-rhev-2.9.0-16.el7_4.13.x86_64
> >>
> >> What could cause this? I'll provide any additional information needed
> >>
> >
> > A backtrace may help, try:
> >
> > gdb -p 13024 -batch -ex "thread apply all bt"
> >
> > Also adding Kevin and qemu-block.
> >
> > Nir
> >
>


-- 
Thanks,
Li Qun


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-11 Thread Nir Soffer
On Tue, Apr 10, 2018 at 6:53 PM Richard W.M. Jones 
wrote:

> On Tue, Apr 10, 2018 at 03:25:47PM +, Nir Soffer wrote:
> > On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones 
> > wrote:
> >
> > > On Tue, Apr 10, 2018 at 02:07:33PM +, Nir Soffer wrote:
> > > > This makes sense if the device is backed by a block device on oVirt
> side,
> > > > and the NBD support efficient zeroing. But in this case the device is
> > > backed
> > > > by an empty sparse file on NFS, and oVirt does not support yet
> efficient
> > > > zeroing, we just write zeros manually.
> > > >
> > > > I think should be handled on virt-v2v plugin side. When zeroing a
> file
> > > raw
> > > > image,
> > > > you can ignore zero requests after the highest write offset, since
> the
> > > > plugin
> > > > created a new image, and we know that the image is empty.
> > > >
> > > > When the destination is a block device we cannot avoid zeroing since
> a
> > > block
> > > > device may contain junk data (we usually get dirty empty images from
> our
> > > > local
> > > > xtremio server).
> > >
> > > (Off topic for qemu-block but ...)  We don't have enough information
> > > at our end to know about any of this.
> > >
> >
> > Can't use use this logic in the oVirt plugin?
> >
> > file based storage -> skip initial zeroing
> > block based storage -> use initial zeroing
> >
> > Do you think that publishing disk capabilities in the sdk will solve
> this?
>
> The plugin would have to do some complicated gymnastics.  It would
> have to keep track of which areas of the disk have been written and
> ignore NBD_CMD_WRITE_ZEROES for other areas, except if block-based
> storage is being used.  And so yes we'd also need the imageio API to
> publish that information to the plugin.
>
> So it's possible but not trivial.
>

I think this should be fixed in qemu-img. The current zero optimization may
bring an insignificant improvement when the backend supports fast zeroing
(e.g. fallocate, blkzeroout) by minimizing the number of syscalls, but it
has a severe performance issue when the destination does not support fast
zeroing (e.g. NFS < 4.2).

We plan to improve zeroing performance in 4.2.z, but this will not solve
the issue on NFS < 4.2.

> By the way I think we're slowly reimplementing NBD in the imageio API.

You can also look at it as - slowly showing that HTTP can replace
NBD :-)


> Dan Berrange pointed out earlier on that it might be easier if imageio
> just exposed NBD, or if we found a way to tunnel NBD requests over web
> sockets (in the format case nbdkit would not be needed, in the latter
> case nbdkit could act as a bridge).
>

I'm not sure the more complex web sockets are needed for moving images
around. Web sockets are good when the client is a browser and you need
to pass lots of small messages. If you have a real socket you don't need
web sockets. And the overhead of HTTP headers is insignificant when you
move lots of data around.

Exposing NBD using the same authentication mechanism can be interesting,
but only if clients have an easy way to use this.

Will this allow clients (e.g. backup vendors) to download and upload images
using qemu-img?

qemu-img <-nbd-> ovirt-imageio <-nbd-> qemu/nbd-server

We also need streaming support - with HTTP you can stream the data from
anywhere, while qemu-img needs a file. I don't think that backup vendors
or users will be happy to implement nbdkit plugins.

If qemu-img supported input and output via stdin/stdout instead of
files, this model would sound much more interesting.

Nir


Re: [Qemu-block] [Qemu-devel] [PATCH v4 11/13] block/mirror: Add active mirroring

2018-04-11 Thread Eric Blake
On 04/11/2018 01:54 PM, Max Reitz wrote:
> This patch implements active synchronous mirroring.  In active mode, the
> passive mechanism will still be in place and is used to copy all
> initially dirty clusters off the source disk; but every write request
> will write data both to the source and the target disk, so the source
> cannot be dirtied faster than data is mirrored to the target.  Also,
> once the block job has converged (BLOCK_JOB_READY sent), source and
> target are guaranteed to stay in sync (unless an error occurs).
> 
> Active mode is completely optional and currently disabled at runtime.  A
> later patch will add a way for users to enable it.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Fam Zheng 
> ---
>  qapi/block-core.json |  18 
>  block/mirror.c   | 252 ++-
>  2 files changed, 265 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index c50517bff3..8210d601f4 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1049,6 +1049,24 @@
>  { 'enum': 'MirrorSyncMode',
>'data': ['top', 'full', 'none', 'incremental'] }
>  
> +##
> +# @MirrorCopyMode:
> +#
> +# An enumeration whose values tell the mirror block job when to
> +# trigger writes to the target.
> +#
> +# @background: copy data in background only.
> +#
> +# @write-blocking: when data is written to the source, write it
> +#  (synchronously) to the target as well.  In
> +#  addition, data is copied in background just like in
> +#  @background mode.
> +#
> +# Since: 2.12

Missed an instance of s/2.12/2.13/ here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v4 08/13] test-hbitmap: Add non-advancing iter_next tests

2018-04-11 Thread Max Reitz
Add a function that wraps hbitmap_iter_next() and always calls it in
non-advancing mode first, and in advancing mode next.  The result should
always be the same.

By using this function everywhere we called hbitmap_iter_next() before,
we should get good test coverage for non-advancing hbitmap_iter_next().

Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
Reviewed-by: John Snow 
---
 tests/test-hbitmap.c | 36 
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index f2158f767d..5e67ac1d3a 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -30,6 +30,18 @@ typedef struct TestHBitmapData {
 } TestHBitmapData;
 
 
+static int64_t check_hbitmap_iter_next(HBitmapIter *hbi)
+{
+int next0, next1;
+
+next0 = hbitmap_iter_next(hbi, false);
+next1 = hbitmap_iter_next(hbi, true);
+
+g_assert_cmpint(next0, ==, next1);
+
+return next0;
+}
+
 /* Check that the HBitmap and the shadow bitmap contain the same data,
  * ignoring the same "first" bits.
  */
@@ -46,7 +58,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
 
 i = first;
 for (;;) {
-next = hbitmap_iter_next(&hbi, true);
+next = check_hbitmap_iter_next(&hbi);
 if (next < 0) {
 next = data->size;
 }
@@ -435,25 +447,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData *data,
/* Note that hbitmap_test_check has to be invoked manually in this test.  */
 hbitmap_test_init(data, 131072 << 7, 7);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), <, 0);
 
hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), <, 0);
 
hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
hbitmap_test_set(data, (131072 << 7) - 8, 8);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), ==, 131071 << 7);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), <, 0);
 
hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), ==, 131071 << 7);
+g_assert_cmpint(check_hbitmap_iter_next(&hbi), <, 0);
 }
 
 static void hbitmap_test_set_boundary_bits(TestHBitmapData *data, ssize_t diff)
@@ -893,7 +905,7 @@ static void test_hbitmap_serialize_zeroes(TestHBitmapData *data,
 for (i = 0; i < num_positions; i++) {
 hbitmap_deserialize_zeroes(data->hb, positions[i], min_l1, true);
hbitmap_iter_init(&hbi, data->hb, 0);
-next = hbitmap_iter_next(&hbi, true);
+next = check_hbitmap_iter_next(&hbi);
 if (i == num_positions - 1) {
 g_assert_cmpint(next, ==, -1);
 } else {
@@ -919,10 +931,10 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData *data,
 
hbitmap_iter_init(&hbi, data->hb, BITS_PER_LONG - 1);
 
-hbitmap_iter_next(&hbi, true);
+check_hbitmap_iter_next(&hbi);
 
hbitmap_reset_all(data->hb);
-hbitmap_iter_next(&hbi, true);
+check_hbitmap_iter_next(&hbi);
 }
 
 static void test_hbitmap_next_zero_check(TestHBitmapData *data, int64_t start)
-- 
2.14.3




[Qemu-block] [PATCH v4 07/13] hbitmap: Add @advance param to hbitmap_iter_next()

2018-04-11 Thread Max Reitz
This new parameter allows the caller to just query the next dirty
position without moving the iterator.

Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
Reviewed-by: John Snow 
---
 include/qemu/hbitmap.h |  5 -
 block/backup.c |  2 +-
 block/dirty-bitmap.c   |  2 +-
 tests/test-hbitmap.c   | 26 +-
 util/hbitmap.c | 10 +++---
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 6b6490ecad..ddca52c48e 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -324,11 +324,14 @@ void hbitmap_free_meta(HBitmap *hb);
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
+ * @advance: If true, advance the iterator.  Otherwise, the next call
+ *   of this function will return the same result (if that
+ *   position is still dirty).
  *
  * Return the next bit that is set in @hbi's associated HBitmap,
  * or -1 if all remaining bits are zero.
  */
-int64_t hbitmap_iter_next(HBitmapIter *hbi);
+int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance);
 
 /**
  * hbitmap_iter_next_word:
diff --git a/block/backup.c b/block/backup.c
index 453cd62c24..60374fe866 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -368,7 +368,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
 HBitmapIter hbi;
 
hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
-while ((cluster = hbitmap_iter_next(&hbi)) != -1) {
+while ((cluster = hbitmap_iter_next(&hbi, true)) != -1) {
 do {
 if (yield_and_check(job)) {
 return 0;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 967159479d..df7b711610 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -546,7 +546,7 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)
 
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-return hbitmap_iter_next(&iter->hbi);
+return hbitmap_iter_next(&iter->hbi, true);
 }
 
 /* Called within bdrv_dirty_bitmap_lock..unlock */
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index f29631f939..f2158f767d 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
 
 i = first;
 for (;;) {
-next = hbitmap_iter_next(&hbi);
+next = hbitmap_iter_next(&hbi, true);
 if (next < 0) {
 next = data->size;
 }
@@ -435,25 +435,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData *data,
/* Note that hbitmap_test_check has to be invoked manually in this test.  */
 hbitmap_test_init(data, 131072 << 7, 7);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
hbitmap_test_set(data, (131072 << 7) - 8, 8);
hbitmap_iter_init(&hbi, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 
hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
+g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
 }
 
 static void hbitmap_test_set_boundary_bits(TestHBitmapData *data, ssize_t diff)
@@ -893,7 +893,7 @@ static void test_hbitmap_serialize_zeroes(TestHBitmapData *data,
 for (i = 0; i < num_positions; i++) {
 hbitmap_deserialize_zeroes(data->hb, positions[i], min_l1, true);
hbitmap_iter_init(&hbi, data->hb, 0);
-next = hbitmap_iter_next(&hbi);
+next = hbitmap_iter_next(&hbi, true);
 if (i == num_positions - 1) {
 g_assert_cmpint(next, ==, -1);
 } else {
@@ -919,10 +919,10 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData *data,
 
hbitmap_iter_init(&hbi, data->hb, BITS_PER_LONG - 1);
 
-hbitmap_iter_next(&hbi);
+hbitmap_iter_next(&hbi, true);
 
hbitmap_reset_all(data->hb);
-hbitmap_iter_next(&hbi);
+hbitmap_iter_next(&hbi, true);
 }
 
 static void test_hbitmap_next_zero_check(TestHBitmapData *data, 

[Qemu-block] [PATCH v4 06/13] block: Generalize should_update_child() rule

2018-04-11 Thread Max Reitz
Currently, bdrv_replace_node() refuses to create loops from one BDS to
itself if the BDS to be replaced is the backing node of the BDS to
replace it: Say there is a node A and a node B.  Replacing B by A means
making all references to B point to A.  If B is a child of A (i.e. A has
a reference to B), that would mean we would have to make this reference
point to A itself -- so we'd create a loop.

bdrv_replace_node() (through should_update_child()) refuses to do so if
B is the backing node of A.  There is no reason why we should create
loops if B is not the backing node of A, though.  The BDS graph should
never contain loops, so we should always refuse to create them.

If B is a child of A and B is to be replaced by A, we should simply
leave B in place there because it is the most sensible choice.

A more specific argument would be: Putting filter drivers into the BDS
graph is basically the same as appending an overlay to a backing chain.
But the main child BDS of a filter driver is not "backing" but "file",
so restricting the no-loop rule to backing nodes would fail here.

Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 include/block/block_int.h |  2 ++
 block.c   | 44 ++--
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index c4dd1d4bb8..8d63a1b0c1 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -616,6 +616,8 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+typedef QLIST_HEAD(BdrvChildList, BdrvChild) BdrvChildList;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
diff --git a/block.c b/block.c
index a2caadf0a0..9294b89eb7 100644
--- a/block.c
+++ b/block.c
@@ -3383,16 +3383,39 @@ static bool should_update_child(BdrvChild *c, BlockDriverState *to)
 return false;
 }
 
-if (c->role == &child_backing) {
-/* If @from is a backing file of @to, ignore the child to avoid
- * creating a loop. We only want to change the pointer of other
- * parents. */
-QLIST_FOREACH(to_c, &to->children, next) {
-if (to_c == c) {
-break;
-}
-}
-if (to_c) {
+/* If the child @c belongs to the BDS @to, replacing the current
+ * c->bs by @to would mean to create a loop.
+ *
+ * Such a case occurs when appending a BDS to a backing chain.
+ * For instance, imagine the following chain:
+ *
+ *   guest device -> node A -> further backing chain...
+ *
+ * Now we create a new BDS B which we want to put on top of this
+ * chain, so we first attach A as its backing node:
+ *
+ *   node B
+ * |
+ * v
+ *   guest device -> node A -> further backing chain...
+ *
+ * Finally we want to replace A by B.  When doing that, we want to
+ * replace all pointers to A by pointers to B -- except for the
+ * pointer from B because (1) that would create a loop, and (2)
+ * that pointer should simply stay intact:
+ *
+ *   guest device -> node B
+ * |
+ * v
+ *   node A -> further backing chain...
+ *
+ * In general, when replacing a node A (c->bs) by a node B (@to),
+ * if A is a child of B, that means we cannot replace A by B there
+ * because that would create a loop.  Silently detaching A from B
+ * is also not really an option.  So overall just leaving A in
+ * place there is the most sensible choice. */
+QLIST_FOREACH(to_c, &to->children, next) {
+if (to_c == c) {
 return false;
 }
 }
@@ -3418,6 +3441,7 @@ void bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
 
 /* Put all parents into @list and calculate their cumulative permissions */
QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
+assert(c->bs == from);
 if (!should_update_child(c, to)) {
 continue;
 }
-- 
2.14.3




[Qemu-block] [PATCH v4 for-2.13 00/13] block/mirror: Add active-sync mirroring

2018-04-11 Thread Max Reitz
This series implements an active and synchronous mirroring mode.

Currently, the mirror block job is passive and asynchronous: Depending on
your start conditions, some part of the source disk starts as "dirty".
Then, the block job will (as a background operation) continuously copy
dirty parts to the target disk until all of the source disk is clean.
In the meantime, any write to the source disk dirties the affected area.

One effect of this operational mode is that the job may never converge:
If the writes to the source happen faster than the block job copies data
to the target, the job can never finish.

When the active mode implemented in this series is enabled, every write
request to the source will automatically trigger a synchronous write to
the target right afterwards.  Therefore, the source can never get dirty
faster than data is copied to the target.  Most importantly, once source
and target are in sync (BLOCK_JOB_READY is emitted), they will not
diverge (unless e.g. an I/O error occurs).

Active mirroring also improves on a second issue of the passive mode: We
do not have to read data from the source in order to write it to the
target.  When new data is written to the source in active mode, it is
automatically mirrored to the target, which saves us the superfluous
read from the source.


Things to do on top of this series:
- Allow switching between active and passive mode at runtime: Mainly
  hinges on the question of how to expose it to the user (ideally
  through a generic block-job-set-option command)

- Implement an asynchronous active mode (launch both write operations to
  the source and the target at the same time, and do not wait for the
  target operation to finish)

- Integrate the mirror BDS more tightly into the BDS graph:  Both source
  and target should be BdrvChildren (and the source should not be the
  "backing" child).  I'm working on this in a follow-up.

- Improve the mirror job coroutine use: Currently more of a hack, a
  follow-up will make this nicer.

- Add read-write-blocking mode: This series adds the write-blocking
  mode, where every write blocks until the data has been mirrored to the
  target.  read-write-blocking would also mirror data on reads from the
  source, which saves some performance (because that data does not have
  to be read twice) at the cost of latency on mirroring read operations.
  (Will be in the same follow-up.)


v4:
- Dropped patches 1 through 3.  Kevin has taken the old patch 3 (a drain
test case) into his latest drain series ("Drain fixes and cleanups,
  part 3"), which to me implies that my preceding patches (the old 1 and
  2) may not have been enough.  As I explained in some older cover
letter (it might have been v1...), all of those patches actually would
  have been only necessary for the follow-up (that is going to come
  along at some point...) where I plan to make the mirror target an
  immediate BdrvChild of the mirror node.
  This series does not make the mirror target such an immediate child,
  thus the mirror node continues to only have a single child, which
  means that those patches are actually not required for this series.  I
  only included them because they still made sense.
  However, now I am no longer convinced it makes sense to include them
  (because that would create a dependency on Kevin's series), so I'll
  push them off to the follow-up.

- Patch 12 (was: 15): Replaced "2.12" by "2.13" [Eric]

- Added Rb-s, rebased (with no effect, judging from
  git-backport-diff...)


git-backport-diff to v3:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/13:[----] [--] 'block/mirror: Pull out mirror_perform()'
002/13:[----] [--] 'block/mirror: Convert to coroutines'
003/13:[----] [--] 'block/mirror: Use CoQueue to wait on in-flight ops'
004/13:[----] [--] 'block/mirror: Wait for in-flight op conflicts'
005/13:[----] [--] 'block/mirror: Use source as a BdrvChild'
006/13:[----] [--] 'block: Generalize should_update_child() rule'
007/13:[----] [--] 'hbitmap: Add @advance param to hbitmap_iter_next()'
008/13:[----] [--] 'test-hbitmap: Add non-advancing iter_next tests'
009/13:[----] [--] 'block/dirty-bitmap: Add bdrv_dirty_iter_next_area'
010/13:[----] [--] 'block/mirror: Add MirrorBDSOpaque'
011/13:[----] [--] 'block/mirror: Add active mirroring'
012/13:[0004] [FC] 'block/mirror: Add copy mode QAPI interface'
013/13:[----] [--] 'iotests: Add test for active mirroring'


Max Reitz (13):
  block/mirror: Pull out mirror_perform()
  block/mirror: Convert to coroutines
  block/mirror: Use CoQueue to wait on in-flight ops
  block/mirror: Wait for in-flight op conflicts
  block/mirror: Use source as a BdrvChild
  block: Generalize should_update_child() rule
  hbitmap: Add @advance param to hbitmap_iter_next()
  test-hbitmap: Add non-advancing iter_next 

[Qemu-block] [PATCH v4 11/13] block/mirror: Add active mirroring

2018-04-11 Thread Max Reitz
This patch implements active synchronous mirroring.  In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target.  Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).

Active mode is completely optional and currently disabled at runtime.  A
later patch will add a way for users to enable it.

Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 qapi/block-core.json |  18 
 block/mirror.c   | 252 ++-
 2 files changed, 265 insertions(+), 5 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c50517bff3..8210d601f4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1049,6 +1049,24 @@
 { 'enum': 'MirrorSyncMode',
   'data': ['top', 'full', 'none', 'incremental'] }
 
+##
+# @MirrorCopyMode:
+#
+# An enumeration whose values tell the mirror block job when to
+# trigger writes to the target.
+#
+# @background: copy data in background only.
+#
+# @write-blocking: when data is written to the source, write it
+#  (synchronously) to the target as well.  In
+#  addition, data is copied in background just like in
+#  @background mode.
+#
+# Since: 2.12
+##
+{ 'enum': 'MirrorCopyMode',
+  'data': ['background', 'write-blocking'] }
+
 ##
 # @BlockJobType:
 #
diff --git a/block/mirror.c b/block/mirror.c
index abaf2a83c7..a700862029 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -53,8 +53,12 @@ typedef struct MirrorBlockJob {
 Error *replace_blocker;
 bool is_none_mode;
 BlockMirrorBackingMode backing_mode;
+MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
 bool synced;
+/* Set when the target is synced (dirty bitmap is clean, nothing
+ * in flight) and the job is running in active mode */
+bool actively_synced;
 bool should_complete;
 int64_t granularity;
 size_t buf_size;
@@ -76,6 +80,7 @@ typedef struct MirrorBlockJob {
 int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
+int in_active_write_counter;
 } MirrorBlockJob;
 
 typedef struct MirrorBDSOpaque {
@@ -93,6 +98,7 @@ struct MirrorOp {
 int64_t *bytes_handled;
 
 bool is_pseudo_op;
+bool is_active_write;
 CoQueue waiting_requests;
 
 QTAILQ_ENTRY(MirrorOp) next;
@@ -108,6 +114,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
 s->synced = false;
+s->actively_synced = false;
 if (read) {
return block_job_error_action(&s->common, s->on_source_error,
   true, error);
@@ -274,7 +281,7 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
 return ret;
 }
 
-static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+static inline void mirror_wait_for_any_operation(MirrorBlockJob *s, bool active)
 {
 MirrorOp *op;
 
@@ -284,7 +291,7 @@ static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
  * caller of this function.  Since there is only one pseudo op
  * at any given time, we will always find some real operation
  * to wait on. */
-if (!op->is_pseudo_op) {
+if (!op->is_pseudo_op && op->is_active_write == active) {
qemu_co_queue_wait(&op->waiting_requests, NULL);
 return;
 }
@@ -292,6 +299,12 @@ static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 abort();
 }
 
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+{
+/* Only non-active operations use up in-flight slots */
+mirror_wait_for_any_operation(s, false);
+}
+
 /* Perform a mirror copy operation.
  *
  * *op->bytes_handled is set to the number of bytes copied after and
@@ -849,6 +862,7 @@ static void coroutine_fn mirror_run(void *opaque)
 /* Report BLOCK_JOB_READY and wait for complete. */
 block_job_event_ready(&s->common);
 s->synced = true;
+s->actively_synced = true;
 while (!block_job_is_cancelled(&s->common) && !s->should_complete) {
 block_job_yield(&s->common);
 }
@@ -900,6 +914,12 @@ static void coroutine_fn mirror_run(void *opaque)
 int64_t cnt, delta;
 bool should_complete;
 
+/* Do not start passive operations while there are active
+ * writes in progress */
+while (s->in_active_write_counter) {
+mirror_wait_for_any_operation(s, true);
+}
+
 if (s->ret < 0) {
 ret = s->ret;
 goto immediate_exit;
@@ -947,6 +967,9 @@ static 

[Qemu-block] [PATCH v4 03/13] block/mirror: Use CoQueue to wait on in-flight ops

2018-04-11 Thread Max Reitz
Attach a CoQueue to each in-flight operation so if we need to wait for
any we can use it to wait instead of just blindly yielding and hoping
for some operation to wake us.

A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.

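As a rough illustration of what this buys (a toy model outside QEMU's tree — real code parks coroutines with qemu_co_queue_wait() and wakes them with qemu_co_queue_restart_all(); here waiters are plain counters so the wake-up routing can be exercised without a coroutine runtime):

```c
#include <assert.h>

/* Toy model of the per-operation wait queue this patch introduces.
 * Previously a single s->waiting_for_io flag meant any completion woke
 * the one global waiter; now each waiter parks on a specific op and is
 * only woken when *that* op completes.  Names are illustrative. */
struct op {
    int in_flight;
    int waiters;   /* stand-in for the CoQueue waiting_requests */
    int woken;
};

static void wait_on(struct op *op)
{
    op->waiters++;               /* qemu_co_queue_wait(&op->waiting_requests) */
}

static void complete(struct op *op)
{
    op->in_flight = 0;
    op->woken += op->waiters;    /* qemu_co_queue_restart_all(...) */
    op->waiters = 0;
}
```

Completing an unrelated op no longer disturbs waiters parked elsewhere, which is exactly the property the later conflict-waiting patch relies on.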
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 block/mirror.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index bd8ee7d92c..1af6481876 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/coroutine.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -34,6 +35,8 @@ typedef struct MirrorBuffer {
 QSIMPLEQ_ENTRY(MirrorBuffer) next;
 } MirrorBuffer;
 
+typedef struct MirrorOp MirrorOp;
+
 typedef struct MirrorBlockJob {
 BlockJob common;
 RateLimit limit;
@@ -67,15 +70,15 @@ typedef struct MirrorBlockJob {
 unsigned long *in_flight_bitmap;
 int in_flight;
 int64_t bytes_in_flight;
+QTAILQ_HEAD(MirrorOpList, MirrorOp) ops_in_flight;
 int ret;
 bool unmap;
-bool waiting_for_io;
 int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
 } MirrorBlockJob;
 
-typedef struct MirrorOp {
+struct MirrorOp {
 MirrorBlockJob *s;
 QEMUIOVector qiov;
 int64_t offset;
@@ -84,7 +87,11 @@ typedef struct MirrorOp {
 /* The pointee is set by mirror_co_read(), mirror_co_zero(), and
  * mirror_co_discard() before yielding for the first time */
 int64_t *bytes_handled;
-} MirrorOp;
+
+CoQueue waiting_requests;
+
+QTAILQ_ENTRY(MirrorOp) next;
+};
 
 typedef enum MirrorMethod {
 MIRROR_METHOD_COPY,
@@ -125,7 +132,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 
 chunk_num = op->offset / s->granularity;
 nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
+
 bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
+QTAILQ_REMOVE(&s->ops_in_flight, op, next);
 if (ret >= 0) {
 if (s->cow_bitmap) {
 bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
@@ -135,11 +144,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 }
 }
 qemu_iovec_destroy(&op->qiov);
-g_free(op);
 
-if (s->waiting_for_io) {
-qemu_coroutine_enter(s->common.co);
-}
+qemu_co_queue_restart_all(&op->waiting_requests);
+g_free(op);
 }
 
 static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
@@ -229,10 +236,11 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
 
 static inline void mirror_wait_for_io(MirrorBlockJob *s)
 {
-assert(!s->waiting_for_io);
-s->waiting_for_io = true;
-qemu_coroutine_yield();
-s->waiting_for_io = false;
+MirrorOp *op;
+
+op = QTAILQ_FIRST(&s->ops_in_flight);
+assert(op);
+qemu_co_queue_wait(&op->waiting_requests, NULL);
 }
 
 /* Perform a mirror copy operation.
@@ -342,6 +350,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 .bytes  = bytes,
 .bytes_handled  = &bytes_handled,
 };
+qemu_co_queue_init(&op->waiting_requests);
 
 switch (mirror_method) {
 case MIRROR_METHOD_COPY:
@@ -357,6 +366,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 abort();
 }
 
+QTAILQ_INSERT_TAIL(&s->ops_in_flight, op, next);
 qemu_coroutine_enter(co);
 /* At this point, ownership of op has been moved to the coroutine
  * and the object may already be freed */
@@ -1283,6 +1293,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 }
 }
 
+QTAILQ_INIT(&s->ops_in_flight);
+
 trace_mirror_start(bs, s, opaque);
 block_job_start(&s->common);
 return;
-- 
2.14.3




[Qemu-block] [PATCH v4 02/13] block/mirror: Convert to coroutines

2018-04-11 Thread Max Reitz
In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.

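The control-flow change can be sketched in miniature (illustrative stubs only, no real I/O — in QEMU the pair is blk_aio_pwritev() with a completion callback versus a direct blk_co_pwritev() call):

```c
#include <assert.h>

/* Callback style: the result arrives via a completion function that
 * was registered with the AIO submission. */
static void write_complete_cb(void *opaque, int ret)
{
    *(int *)opaque = ret;
}

static void aio_write_stub(void (*cb)(void *opaque, int ret), void *opaque)
{
    cb(opaque, 0);   /* pretend the write succeeded */
}

static int step_callback_style(void)
{
    int result = -1;
    aio_write_stub(write_complete_cb, &result);
    return result;
}

/* Coroutine style: the "blocking" call simply returns the result, so
 * read-then-write chains become straight-line code. */
static int co_write_stub(void)
{
    return 0;
}

static int step_coroutine_style(void)
{
    int ret = co_write_stub();   /* was: blk_aio_pwritev(..., cb, op) */
    return ret;
}
```

Both styles produce the same result; the coroutine form just keeps the error handling next to the call instead of in a separate completion function.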
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 block/mirror.c | 152 ++---
 1 file changed, 90 insertions(+), 62 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1718571766..bd8ee7d92c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -80,6 +80,10 @@ typedef struct MirrorOp {
 QEMUIOVector qiov;
 int64_t offset;
 uint64_t bytes;
+
+/* The pointee is set by mirror_co_read(), mirror_co_zero(), and
+ * mirror_co_discard() before yielding for the first time */
+int64_t *bytes_handled;
 } MirrorOp;
 
 typedef enum MirrorMethod {
@@ -101,7 +105,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 }
 }
 
-static void mirror_iteration_done(MirrorOp *op, int ret)
+static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
 MirrorBlockJob *s = op->s;
 struct iovec *iov;
@@ -138,9 +142,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
 }
 }
 
-static void mirror_write_complete(void *opaque, int ret)
+static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
 {
-MirrorOp *op = opaque;
 MirrorBlockJob *s = op->s;
 
 aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -157,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
 aio_context_release(blk_get_aio_context(s->common.blk));
 }
 
-static void mirror_read_complete(void *opaque, int ret)
+static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
 {
-MirrorOp *op = opaque;
 MirrorBlockJob *s = op->s;
 
 aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -174,8 +176,9 @@ static void mirror_read_complete(void *opaque, int ret)
 
 mirror_iteration_done(op, ret);
 } else {
-blk_aio_pwritev(s->target, op->offset, &op->qiov,
-0, mirror_write_complete, op);
+ret = blk_co_pwritev(s->target, op->offset,
+ op->qiov.size, &op->qiov, 0);
+mirror_write_complete(op, ret);
 }
 aio_context_release(blk_get_aio_context(s->common.blk));
 }
@@ -232,60 +235,57 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
 s->waiting_for_io = false;
 }
 
-/* Submit async read while handling COW.
- * Returns: The number of bytes copied after and including offset,
- *  excluding any bytes copied prior to offset due to alignment.
- *  This will be @bytes if no alignment is necessary, or
- *  (new_end - offset) if tail is rounded up or down due to
- *  alignment or buffer limit.
+/* Perform a mirror copy operation.
+ *
+ * *op->bytes_handled is set to the number of bytes copied after and
+ * including offset, excluding any bytes copied prior to offset due
+ * to alignment.  This will be op->bytes if no alignment is necessary,
+ * or (new_end - op->offset) if the tail is rounded up or down due to
+ * alignment or buffer limit.
  */
-static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
-   uint64_t bytes)
+static void coroutine_fn mirror_co_read(void *opaque)
 {
+MirrorOp *op = opaque;
+MirrorBlockJob *s = op->s;
 BlockBackend *source = s->common.blk;
 int nb_chunks;
 uint64_t ret;
-MirrorOp *op;
 uint64_t max_bytes;
 
 max_bytes = s->granularity * s->max_iov;
 
 /* We can only handle as much as buf_size at a time. */
-bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
-assert(bytes);
-assert(bytes < BDRV_REQUEST_MAX_BYTES);
-ret = bytes;
+op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
+assert(op->bytes);
+assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
+*op->bytes_handled = op->bytes;
 
 if (s->cow_bitmap) {
-ret += mirror_cow_align(s, &offset, &bytes);
+*op->bytes_handled += mirror_cow_align(s, >offset, >bytes);
 }
-assert(bytes <= s->buf_size);
+/* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
+assert(*op->bytes_handled <= UINT_MAX);
+assert(op->bytes <= s->buf_size);
 /* The offset is granularity-aligned because:
  * 1) Caller passes in aligned values;
  * 2) mirror_cow_align is used only when target cluster is larger. */
-assert(QEMU_IS_ALIGNED(offset, s->granularity));
+assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
 /* The range is sector-aligned, since bdrv_getlength() rounds up. */
-assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
-nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
+assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
+nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
 
 while (s->buf_free_count < nb_chunks) {
-trace_mirror_yield_in_flight(s, offset, s->in_flight);
+

[Qemu-block] [PATCH v4 04/13] block/mirror: Wait for in-flight op conflicts

2018-04-11 Thread Max Reitz
This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).

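The conflict test works at chunk granularity: both ranges are widened to granularity-sized chunks before the overlap check, so two byte ranges that merely share a chunk still conflict. A standalone sketch of that arithmetic (ranges_overlap() here is a simplified stand-in for QEMU's helper from qemu/range.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* true iff [a, a + na) and [b, b + nb) intersect */
static bool ranges_overlap(uint64_t a, uint64_t na, uint64_t b, uint64_t nb)
{
    return a + na > b && b + nb > a;
}

/* Chunk-level conflict as computed by mirror_wait_on_conflicts():
 * convert each byte range to [start_chunk, start_chunk + nb_chunks)
 * and intersect those instead of the raw byte ranges. */
static bool chunks_conflict(uint64_t off_a, uint64_t bytes_a,
                            uint64_t off_b, uint64_t bytes_b,
                            uint64_t granularity)
{
    uint64_t start_a = off_a / granularity;
    uint64_t nb_a = DIV_ROUND_UP(off_a + bytes_a, granularity) - start_a;
    uint64_t start_b = off_b / granularity;
    uint64_t nb_b = DIV_ROUND_UP(off_b + bytes_b, granularity) - start_b;

    return ranges_overlap(start_a, nb_a, start_b, nb_b);
}
```

With the default 64k granularity, two 512-byte writes within the same chunk conflict even though their byte ranges are disjoint — which is why a new request must wait on any in-flight op touching its chunks.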
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 block/mirror.c | 102 +++--
 1 file changed, 84 insertions(+), 18 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1af6481876..964ffbe682 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/coroutine.h"
+#include "qemu/range.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -88,6 +89,7 @@ struct MirrorOp {
  * mirror_co_discard() before yielding for the first time */
 int64_t *bytes_handled;
 
+bool is_pseudo_op;
 CoQueue waiting_requests;
 
 QTAILQ_ENTRY(MirrorOp) next;
@@ -112,6 +114,41 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 }
 }
 
+static void coroutine_fn mirror_wait_on_conflicts(MirrorOp *self,
+  MirrorBlockJob *s,
+  uint64_t offset,
+  uint64_t bytes)
+{
+uint64_t self_start_chunk = offset / s->granularity;
+uint64_t self_end_chunk = DIV_ROUND_UP(offset + bytes, s->granularity);
+uint64_t self_nb_chunks = self_end_chunk - self_start_chunk;
+
+while (find_next_bit(s->in_flight_bitmap, self_end_chunk,
+ self_start_chunk) < self_end_chunk &&
+   s->ret >= 0)
+{
+MirrorOp *op;
+
+QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
+uint64_t op_start_chunk = op->offset / s->granularity;
+uint64_t op_nb_chunks = DIV_ROUND_UP(op->offset + op->bytes,
+ s->granularity) -
+op_start_chunk;
+
+if (op == self) {
+continue;
+}
+
+if (ranges_overlap(self_start_chunk, self_nb_chunks,
+   op_start_chunk, op_nb_chunks))
+{
+qemu_co_queue_wait(&op->waiting_requests, NULL);
+break;
+}
+}
+}
+}
+
 static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
 MirrorBlockJob *s = op->s;
@@ -234,13 +271,22 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
 return ret;
 }
 
-static inline void mirror_wait_for_io(MirrorBlockJob *s)
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
 MirrorOp *op;
 
-op = QTAILQ_FIRST(&s->ops_in_flight);
-assert(op);
-qemu_co_queue_wait(&op->waiting_requests, NULL);
+QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
+/* Do not wait on pseudo ops, because it may in turn wait on
+ * some other operation to start, which may in fact be the
+ * caller of this function.  Since there is only one pseudo op
+ * at any given time, we will always find some real operation
+ * to wait on. */
+if (!op->is_pseudo_op) {
+qemu_co_queue_wait(&op->waiting_requests, NULL);
+return;
+}
+}
+abort();
 }
 
 /* Perform a mirror copy operation.
@@ -284,7 +330,7 @@ static void coroutine_fn mirror_co_read(void *opaque)
 
 while (s->buf_free_count < nb_chunks) {
 trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
-mirror_wait_for_io(s);
+mirror_wait_for_free_in_flight_slot(s);
 }
 
 /* Now make a QEMUIOVector taking enough granularity-sized chunks
@@ -384,8 +430,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->source;
-int64_t offset, first_chunk;
-uint64_t delay_ns = 0;
+MirrorOp *pseudo_op;
+int64_t offset;
+uint64_t delay_ns = 0, ret = 0;
 /* At least the first dirty chunk is mirrored in one iteration. */
 int nb_chunks = 1;
 bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
@@ -401,11 +448,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
-first_chunk = offset / s->granularity;
-while (test_bit(first_chunk, s->in_flight_bitmap)) {
-trace_mirror_yield_in_flight(s, offset, s->in_flight);
-mirror_wait_for_io(s);
-}
+mirror_wait_on_conflicts(NULL, s, offset, 1);
 
 block_job_pause_point(>common);
 
@@ -442,6 +485,21 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
nb_chunks * s->granularity);
 

[Qemu-block] [PATCH v4 13/13] iotests: Add test for active mirroring

2018-04-11 Thread Max Reitz
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 tests/qemu-iotests/151 | 120 +
 tests/qemu-iotests/151.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 126 insertions(+)
 create mode 100755 tests/qemu-iotests/151
 create mode 100644 tests/qemu-iotests/151.out

diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
new file mode 100755
index 00..8d8e050f98
--- /dev/null
+++ b/tests/qemu-iotests/151
@@ -0,0 +1,120 @@
+#!/usr/bin/env python
+#
+# Tests for active mirroring
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import qemu_img
+
+source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
+target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
+
+class TestActiveMirror(iotests.QMPTestCase):
+image_len = 128 * 1024 * 1024 # MB
+potential_writes_in_flight = True
+
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
+qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
+
+blk_source = {'id': 'source',
+  'if': 'none',
+  'node-name': 'source-node',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': source_img}}
+
+blk_target = {'node-name': 'target-node',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': target_img}}
+
+self.vm = iotests.VM()
+self.vm.add_drive_raw(self.qmp_to_opts(blk_source))
+self.vm.add_blockdev(self.qmp_to_opts(blk_target))
+self.vm.add_device('virtio-blk,drive=source')
+self.vm.launch()
+
+def tearDown(self):
+self.vm.shutdown()
+
+if not self.potential_writes_in_flight:
+self.assertTrue(iotests.compare_images(source_img, target_img),
+'mirror target does not match source')
+
+os.remove(source_img)
+os.remove(target_img)
+
+def doActiveIO(self, sync_source_and_target):
+# Fill the source image
+self.vm.hmp_qemu_io('source',
+'write -P 1 0 %i' % self.image_len);
+
+# Start some background requests
+for offset in range(1 * self.image_len / 8, 3 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -P 2 %i 1M' % offset)
+for offset in range(2 * self.image_len / 8, 3 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -z %i 1M' % offset)
+
+# Start the block job
+result = self.vm.qmp('blockdev-mirror',
+ job_id='mirror',
+ filter_node_name='mirror-node',
+ device='source-node',
+ target='target-node',
+ sync='full',
+ copy_mode='write-blocking')
+self.assert_qmp(result, 'return', {})
+
+# Start some more requests
+for offset in range(3 * self.image_len / 8, 5 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -P 3 %i 1M' % offset)
+for offset in range(4 * self.image_len / 8, 5 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -z %i 1M' % offset)
+
+# Wait for the READY event
+self.wait_ready(drive='mirror')
+
+# Now start some final requests; all of these (which land on
+# the source) should be settled using the active mechanism.
+# The mirror code itself asserts that the source BDS's dirty
+# bitmap will stay clean between READY and COMPLETED.
+for offset in range(5 * self.image_len / 8, 7 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -P 3 %i 1M' % offset)
+for offset in range(6 * self.image_len / 8, 7 * self.image_len / 8, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'aio_write -z %i 1M' % offset)
+
+if sync_source_and_target:
+# If source and target should be in sync 

[Qemu-block] [PATCH v4 10/13] block/mirror: Add MirrorBDSOpaque

2018-04-11 Thread Max Reitz
This will allow us to access the block job data when the mirror block
driver becomes more complex.

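The pattern is a back-pointer stored in the filter node's opaque field so driver callbacks can reach the job. A minimal model (plain structs stand in for BlockDriverState; in QEMU the pointer lives in mirror_top_bs->opaque and is cleared in mirror_exit() before the job goes away):

```c
#include <assert.h>
#include <stddef.h>

struct job { int running; };

/* Counterpart of MirrorBDSOpaque: just a pointer back to the job */
struct bds_opaque { struct job *job; };

struct bds { void *opaque; };

/* How a block-driver callback would recover the job from the node;
 * returns NULL once the job has detached itself. */
static struct job *job_of(struct bds *bs)
{
    struct bds_opaque *o = bs->opaque;
    return o ? o->job : NULL;
}
```

Clearing the pointer on job teardown is the important part: a callback that runs after completion sees NULL instead of a dangling job pointer.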
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 block/mirror.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/mirror.c b/block/mirror.c
index 40c7c55f07..abaf2a83c7 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -78,6 +78,10 @@ typedef struct MirrorBlockJob {
 bool initial_zeroing_ongoing;
 } MirrorBlockJob;
 
+typedef struct MirrorBDSOpaque {
+MirrorBlockJob *job;
+} MirrorBDSOpaque;
+
 struct MirrorOp {
 MirrorBlockJob *s;
 QEMUIOVector qiov;
@@ -602,6 +606,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 MirrorExitData *data = opaque;
+MirrorBDSOpaque *bs_opaque = s->mirror_top_bs->opaque;
 AioContext *replace_aio_context = NULL;
 BlockDriverState *src = s->mirror_top_bs->backing->bs;
 BlockDriverState *target_bs = blk_bs(s->target);
@@ -694,6 +699,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 blk_set_perm(job->blk, 0, BLK_PERM_ALL, &error_abort);
 blk_insert_bs(job->blk, mirror_top_bs, &error_abort);
 
+bs_opaque->job = NULL;
 block_job_completed(&s->common, data->ret);
 
 g_free(data);
@@ -1225,6 +1231,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
  Error **errp)
 {
 MirrorBlockJob *s;
+MirrorBDSOpaque *bs_opaque;
 BlockDriverState *mirror_top_bs;
 bool target_graph_mod;
 bool target_is_backing;
@@ -1258,6 +1265,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 mirror_top_bs->implicit = true;
 }
 mirror_top_bs->total_sectors = bs->total_sectors;
+bs_opaque = g_new0(MirrorBDSOpaque, 1);
+mirror_top_bs->opaque = bs_opaque;
 bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
 
 /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
@@ -1282,6 +1291,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 if (!s) {
 goto fail;
 }
+bs_opaque->job = s;
+
 /* The block job now has a reference to this node */
 bdrv_unref(mirror_top_bs);
 
@@ -1371,6 +1382,7 @@ fail:
 
 g_free(s->replaces);
 blk_unref(s->target);
+bs_opaque->job = NULL;
 block_job_early_fail(&s->common);
 }
 
-- 
2.14.3




[Qemu-block] [PATCH v4 12/13] block/mirror: Add copy mode QAPI interface

2018-04-11 Thread Max Reitz
This patch allows the user to specify whether to use active or only
background mode for mirror block jobs.  Currently, this setting will
remain constant for the duration of the entire block job.

Signed-off-by: Max Reitz 
---
 qapi/block-core.json  | 11 +--
 include/block/block_int.h |  4 +++-
 block/mirror.c| 12 +++-
 blockdev.c|  9 -
 4 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8210d601f4..1653f4ce93 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1791,6 +1791,9 @@
 # written. Both will result in identical contents.
 # Default is true. (Since 2.4)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'background'
+# (Since: 2.13)
+#
 # Since: 1.3
 ##
 { 'struct': 'DriveMirror',
@@ -1800,7 +1803,7 @@
 '*speed': 'int', '*granularity': 'uint32',
 '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
 '*on-target-error': 'BlockdevOnError',
-'*unmap': 'bool' } }
+'*unmap': 'bool', '*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @BlockDirtyBitmap:
@@ -1979,6 +1982,9 @@
 #above @device. If this option is not given, a node name is
 #autogenerated. (Since: 2.9)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'background'
+# (Since: 2.13)
+#
 # Returns: nothing on success.
 #
 # Since: 2.6
@@ -1999,7 +2005,8 @@
 '*speed': 'int', '*granularity': 'uint32',
 '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
 '*on-target-error': 'BlockdevOnError',
-'*filter-node-name': 'str' } }
+'*filter-node-name': 'str',
+'*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @block_set_io_throttle:
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 8d63a1b0c1..9a307cc7ad 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -975,6 +975,7 @@ void commit_active_start(const char *job_id, 
BlockDriverState *bs,
  * @filter_node_name: The node name that should be assigned to the filter
  * driver that the mirror job inserts into the graph above @bs. NULL means that
  * a node name should be autogenerated.
+ * @copy_mode: When to trigger writes to the target.
  * @errp: Error object.
  *
  * Start a mirroring operation on @bs.  Clusters that are allocated
@@ -988,7 +989,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
-  bool unmap, const char *filter_node_name, Error **errp);
+  bool unmap, const char *filter_node_name,
+  MirrorCopyMode copy_mode, Error **errp);
 
 /*
  * backup_job_create:
diff --git a/block/mirror.c b/block/mirror.c
index a700862029..6144fc25b0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1468,7 +1468,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
  const BlockJobDriver *driver,
  bool is_none_mode, BlockDriverState *base,
  bool auto_complete, const char *filter_node_name,
- bool is_mirror,
+ bool is_mirror, MirrorCopyMode copy_mode,
  Error **errp)
 {
 MirrorBlockJob *s;
@@ -1574,7 +1574,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 s->on_target_error = on_target_error;
 s->is_none_mode = is_none_mode;
 s->backing_mode = backing_mode;
-s->copy_mode = MIRROR_COPY_MODE_BACKGROUND;
+s->copy_mode = copy_mode;
 s->base = base;
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
@@ -1641,7 +1641,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
-  bool unmap, const char *filter_node_name, Error **errp)
+  bool unmap, const char *filter_node_name,
+  MirrorCopyMode copy_mode, Error **errp)
 {
 bool is_none_mode;
 BlockDriverState *base;
@@ -1656,7 +1657,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, NULL, NULL,
 &mirror_job_driver, is_none_mode, base, false,
- filter_node_name, true, errp);
+ filter_node_name, true, copy_mode, errp);
 }
 
 void commit_active_start(const char *job_id, 

[Qemu-block] [PATCH v4 01/13] block/mirror: Pull out mirror_perform()

2018-04-11 Thread Max Reitz
When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created.  mirror_perform() is going to be
that point.

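A compact sketch of the factored-out dispatcher (the handlers here are hypothetical stubs; in QEMU they are mirror_do_read() and mirror_do_zero_or_discard(), and the return value is the byte count actually submitted):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum MirrorMethod {
    MIRROR_METHOD_COPY,
    MIRROR_METHOD_ZERO,
    MIRROR_METHOD_DISCARD,
} MirrorMethod;

/* Stub for mirror_do_read(): pretend alignment clipped the request */
static unsigned do_read_stub(int64_t offset, unsigned bytes)
{
    (void)offset;
    return bytes / 2;
}

/* The single entry point this patch introduces: route a request to the
 * copy/zero/discard handling and report how many bytes were handled. */
static unsigned perform(int64_t offset, unsigned bytes, MirrorMethod m)
{
    switch (m) {
    case MIRROR_METHOD_COPY:
        return do_read_stub(offset, bytes);
    case MIRROR_METHOD_ZERO:
    case MIRROR_METHOD_DISCARD:
        /* zero/discard never clip, so the full byte count is reported */
        return bytes;
    default:
        abort();
    }
}
```

Having one dispatch point is what later lets the coroutine conversion spawn a coroutine per operation in a single place.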
Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Jeff Cody 
---
 block/mirror.c | 51 +--
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 820f512c7b..1718571766 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -82,6 +82,12 @@ typedef struct MirrorOp {
 uint64_t bytes;
 } MirrorOp;
 
+typedef enum MirrorMethod {
+MIRROR_METHOD_COPY,
+MIRROR_METHOD_ZERO,
+MIRROR_METHOD_DISCARD,
+} MirrorMethod;
+
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
@@ -321,6 +327,22 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
 }
 }
 
+static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
+   unsigned bytes, MirrorMethod mirror_method)
+{
+switch (mirror_method) {
+case MIRROR_METHOD_COPY:
+return mirror_do_read(s, offset, bytes);
+case MIRROR_METHOD_ZERO:
+case MIRROR_METHOD_DISCARD:
+mirror_do_zero_or_discard(s, offset, bytes,
+  mirror_method == MIRROR_METHOD_DISCARD);
+return bytes;
+default:
+abort();
+}
+}
+
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->source;
@@ -387,11 +409,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 int ret;
 int64_t io_bytes;
 int64_t io_bytes_acct;
-enum MirrorMethod {
-MIRROR_METHOD_COPY,
-MIRROR_METHOD_ZERO,
-MIRROR_METHOD_DISCARD
-} mirror_method = MIRROR_METHOD_COPY;
+MirrorMethod mirror_method = MIRROR_METHOD_COPY;
 
 assert(!(offset % s->granularity));
 ret = bdrv_block_status_above(source, NULL, offset,
@@ -429,22 +447,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 }
 
 io_bytes = mirror_clip_bytes(s, offset, io_bytes);
-switch (mirror_method) {
-case MIRROR_METHOD_COPY:
-io_bytes = io_bytes_acct = mirror_do_read(s, offset, io_bytes);
-break;
-case MIRROR_METHOD_ZERO:
-case MIRROR_METHOD_DISCARD:
-mirror_do_zero_or_discard(s, offset, io_bytes,
-  mirror_method == MIRROR_METHOD_DISCARD);
-if (write_zeroes_ok) {
-io_bytes_acct = 0;
-} else {
-io_bytes_acct = io_bytes;
-}
-break;
-default:
-abort();
+io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
+if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
+io_bytes_acct = 0;
+} else {
+io_bytes_acct = io_bytes;
 }
 assert(io_bytes);
 offset += io_bytes;
@@ -638,7 +645,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 continue;
 }
 
-mirror_do_zero_or_discard(s, offset, bytes, false);
+mirror_perform(s, offset, bytes, MIRROR_METHOD_ZERO);
 offset += bytes;
 }
 
-- 
2.14.3




[Qemu-block] [PATCH v4 05/13] block/mirror: Use source as a BdrvChild

2018-04-11 Thread Max Reitz
With this, the mirror_top_bs is no longer just a technically required
node in the BDS graph but actually represents the block job operation.

Also, drop MirrorBlockJob.source, as we can reach it through
mirror_top_bs->backing.

Signed-off-by: Max Reitz 
Reviewed-by: Fam Zheng 
---
 block/mirror.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 964ffbe682..40c7c55f07 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -43,7 +43,6 @@ typedef struct MirrorBlockJob {
 RateLimit limit;
 BlockBackend *target;
 BlockDriverState *mirror_top_bs;
-BlockDriverState *source;
 BlockDriverState *base;
 
 /* The name of the graph node to replace */
@@ -301,7 +300,6 @@ static void coroutine_fn mirror_co_read(void *opaque)
 {
 MirrorOp *op = opaque;
 MirrorBlockJob *s = op->s;
-BlockBackend *source = s->common.blk;
 int nb_chunks;
 uint64_t ret;
 uint64_t max_bytes;
@@ -351,7 +349,8 @@ static void coroutine_fn mirror_co_read(void *opaque)
 s->bytes_in_flight += op->bytes;
 trace_mirror_one_iteration(s, op->offset, op->bytes);
 
-ret = blk_co_preadv(source, op->offset, op->bytes, &op->qiov, 0);
+ret = bdrv_co_preadv(s->mirror_top_bs->backing, op->offset, op->bytes,
+ &op->qiov, 0);
 mirror_read_complete(op, ret);
 }
 
@@ -429,7 +428,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
 
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
-BlockDriverState *source = s->source;
+BlockDriverState *source = s->mirror_top_bs->backing->bs;
 MirrorOp *pseudo_op;
 int64_t offset;
 uint64_t delay_ns = 0, ret = 0;
@@ -604,7 +603,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 MirrorExitData *data = opaque;
 AioContext *replace_aio_context = NULL;
-BlockDriverState *src = s->source;
+BlockDriverState *src = s->mirror_top_bs->backing->bs;
 BlockDriverState *target_bs = blk_bs(s->target);
 BlockDriverState *mirror_top_bs = s->mirror_top_bs;
 Error *local_err = NULL;
@@ -719,7 +718,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 {
 int64_t offset;
 BlockDriverState *base = s->base;
-BlockDriverState *bs = s->source;
+BlockDriverState *bs = s->mirror_top_bs->backing->bs;
 BlockDriverState *target_bs = blk_bs(s->target);
 int ret;
 int64_t count;
@@ -801,7 +800,7 @@ static void coroutine_fn mirror_run(void *opaque)
 {
 MirrorBlockJob *s = opaque;
 MirrorExitData *data;
-BlockDriverState *bs = s->source;
+BlockDriverState *bs = s->mirror_top_bs->backing->bs;
 BlockDriverState *target_bs = blk_bs(s->target);
 bool need_drain = true;
 int64_t length;
@@ -1286,7 +1285,6 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 /* The block job now has a reference to this node */
 bdrv_unref(mirror_top_bs);
 
-s->source = bs;
 s->mirror_top_bs = mirror_top_bs;
 
 /* No resize for the target either; while the mirror is still running, a
-- 
2.14.3




Re: [Qemu-block] [Qemu-devel] [PATCH 05/19] tests/test-bdrv-drain: bdrv_drain_all() works in coroutines now

2018-04-11 Thread Eric Blake
On 04/11/2018 11:39 AM, Kevin Wolf wrote:
> Since we use bdrv_do_drained_begin/end() for bdrv_drain_all_begin/end(),
> coroutine context is automatically left with a BH, preventing the
> deadlocks that made bdrv_drain_all*() unsafe in coroutine context. We
> can consider it compatible now the latest, after having removed the old

s/the latest/at last/ ?

> polling code as dead code.
> 
> Enable the coroutine test cases for bdrv_drain_all().
> 
> Signed-off-by: Kevin Wolf 
> ---
>  tests/test-bdrv-drain.c | 16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH 04/19] block: Don't manually poll in bdrv_drain_all()

2018-04-11 Thread Eric Blake
On 04/11/2018 11:39 AM, Kevin Wolf wrote:
> All involved nodes are already idle, we called bdrv_do_draine_begin() on

s/draine/drain/

> them.
> 
> The comment in the code suggested that this were not correct because the

s/were/was/

> completion of a request on one node could spawn a new request on a
> different node (which might have been drained before, so we wouldn't
> drain the new request). In reality, new requests to different nodes
> aren't spawned out of nothing, but only in the context of a parent
> request, and they aren't submitted to random nodes, but only to child
> nodes. As long as we still poll for the completion of the parent request
> (which we do), draining each root node separately is good enough.
> 
> Remove the additional polling code from bdrv_drain_all_begin() and
> replace it with an assertion that all nodes are already idle after we
> drained them separately.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c | 41 -
>  1 file changed, 12 insertions(+), 29 deletions(-)
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-discuss] qemu-img convert stuck

2018-04-11 Thread Max Reitz
On 2018-04-09 08:04, Stefan Hajnoczi wrote:
> On Sun, Apr 08, 2018 at 10:35:16PM +0300, Benny Zlotnik wrote:
> 
> What type of storage are the source and destination images?  (e.g.
> source is a local qcow2 file on xfs, destination is a raw file on NFS)
> 
>> $ gdb -p 13024 -batch -ex "thread apply all bt"
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib64/libthread_db.so.1".
>> 0x7f98275cfaff in ppoll () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x7f983e30ab00 (LWP 13024)):
>> #0  0x7f98275cfaff in ppoll () from /lib64/libc.so.6
>> #1  0x55b55cf59d69 in qemu_poll_ns ()
>> #2  0x55b55cf5ba45 in aio_poll ()
>> #3  0x55b55ceedc0f in bdrv_get_block_status_above ()
>> #4  0x55b55cea3611 in convert_iteration_sectors ()
> 
> CCing Max Reitz in case this is familiar.

Hmm, not really, no...

The culprit I know of (sensing block status outside of qemu) would block
in lseek64() under find_allocation().

I didn't have any luck reproducing the issue either...

Whenever I had some hang in ppoll(), it was usually during a drain, but
that doesn't seem to be the case here either.  So I have no idea.

Maybe I'll test some other configurations at another time, but so far I
didn't experience any hangs and I have no idea what could be provoking
them (other than some network issue outside of qemu, but well...).

Max

>> #5  0x55b55cea4352 in img_convert ()
>> #6  0x55b55ce9d819 in main ()
>>
>>
>> On Sun, Apr 8, 2018 at 10:28 PM, Nir Soffer  wrote:
>>
>>> On Sun, Apr 8, 2018 at 9:27 PM Benny Zlotnik  wrote:
>>>
 Hi,

 A copy operation initiated by RHEV has been stuck for more than a day
 and is consuming plenty of CPU:
 vdsm 13024  3117 99 Apr07 ?1-06:58:43 /usr/bin/qemu-img convert
 -p -t none -T none -f qcow2
 /rhev/data-center/bb422fac-81c5-4fea-8782-3498bb5c8a59/26989331-2c39-4b34-a7ed-d7dd7703646c/images/597e12b6-19f5-45bd-868f-767600c7115e/62a5492e-e120-4c25-898e-9f5f5629853e
 -O raw
 /rhev/data-center/mnt/mantis-nfs-lif1.lab.eng.tlv2.redhat.com:_vol__service/26989331-2c39-4b34-a7ed-d7dd7703646c/images/9ece9408-9ca6-48cd-992a-6f590c710672/06d6d3c0-beb8-4b6b-ab00-56523df185da

 The target image appears to have no data yet:
 qemu-img info 06d6d3c0-beb8-4b6b-ab00-56523df185da
 image: 06d6d3c0-beb8-4b6b-ab00-56523df185da
 file format: raw
 virtual size: 120G (128849018880 bytes)
 disk size: 0

 strace -p 13024 -tt -T -f shows only:
 ...
 21:13:01.309382 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.10>
 21:13:01.309411 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.09>
 21:13:01.309440 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.09>
 21:13:01.309468 ppoll([{fd=12, events=POLLIN|POLLERR|POLLHUP}], 1, {0, 0}, NULL, 8) = 0 (Timeout) <0.10>

 version: qemu-img-rhev-2.9.0-16.el7_4.13.x86_64

 What could cause this? I'll provide any additional information needed

>>>
>>> A backtrace may help, try:
>>>
>>> gdb -p 13024 -batch -ex "thread apply all bt"
>>>
>>> Also adding Kevin and qemu-block.
>>>
>>> Nir
>>>






Re: [Qemu-block] [Qemu-devel] [PATCH 06/19] block: Avoid unnecessary aio_poll() in AIO_WAIT_WHILE()

2018-04-11 Thread Su Hang
Dear Kevin:
I noticed that checkpatch.pl complained about code style problems;
you may want to replace `while (aio_poll(bs->aio_context, false));`
with
```
while (aio_poll(bs->aio_context, false)) {
/* No further work */
}
```
to suppress the complaints.

Best,
Su Hang

"Kevin Wolf" wrote:
> Commit 91af091f923 added an additional aio_poll() to BDRV_POLL_WHILE()
> in order to make sure that all pending BHs are executed on drain. This
> was the wrong place to make the fix, as it is useless overhead for all
> other users of the macro and unnecessarily complicates the mechanism.
> 
> This patch effectively reverts said commit (the context has changed a
> bit and the code has moved to AIO_WAIT_WHILE()) and instead polls in the
> loop condition for drain.
> 
> The effect is probably hard to measure in any real-world use case
> because actual I/O will dominate, but if I run only the initialisation
> part of 'qemu-img convert' where it calls bdrv_block_status() for the
> whole image to find out how much data there is to copy, this phase actually
> needs only roughly half the time after this patch.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/block/aio-wait.h | 22 --
>  block/io.c   | 11 ++-
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index 8c90a2e66e..783d3678dd 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -73,29 +73,23 @@ typedef struct {
>   */
>  #define AIO_WAIT_WHILE(wait, ctx, cond) ({ \
>  bool waited_ = false;  \
> -bool busy_ = true; \
>  AioWait *wait_ = (wait);   \
>  AioContext *ctx_ = (ctx);  \
>  if (in_aio_context_home_thread(ctx_)) {\
> -while ((cond) || busy_) {  \
> -busy_ = aio_poll(ctx_, (cond));\
> -waited_ |= !!(cond) | busy_;   \
> +while ((cond)) {   \
> +aio_poll(ctx_, true);  \
> +waited_ = true;\
>  }  \
>  } else {   \
>  assert(qemu_get_current_aio_context() ==   \
> qemu_get_aio_context());\
>  /* Increment wait_->num_waiters before evaluating cond. */ \
>  atomic_inc(&wait_->num_waiters);   \
> -while (busy_) {\
> -if ((cond)) {  \
> -waited_ = busy_ = true;\
> -aio_context_release(ctx_); \
> -aio_poll(qemu_get_aio_context(), true);\
> -aio_context_acquire(ctx_); \
> -} else {   \
> -busy_ = aio_poll(ctx_, false); \
> -waited_ |= busy_;  \
> -}  \
> +while ((cond)) {   \
> +aio_context_release(ctx_); \
> +aio_poll(qemu_get_aio_context(), true);\
> +aio_context_acquire(ctx_); \
> +waited_ = true;\
>  }  \
>  atomic_dec(&wait_->num_waiters);   \
>  }  \
> diff --git a/block/io.c b/block/io.c
> index ea6f9f023a..6f580f49ff 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -181,13 +181,22 @@ static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
>  BDRV_POLL_WHILE(bs, !data.done);
>  }
>  
> +/* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
> +static bool bdrv_drain_poll(BlockDriverState *bs)
> +{
> +/* Execute pending BHs first and check everything else only after the BHs
> + * have executed. */
> +while (aio_poll(bs->aio_context, false));
> +return atomic_read(&bs->in_flight);
> +}
> +
>  static bool bdrv_drain_recurse(BlockDriverState *bs)
>  {
>  BdrvChild *child, *tmp;
>  bool waited;
>  
>  /* Wait for drained requests to finish */
> -waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
> +waited = BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs));

Re: [Qemu-block] [Qemu-devel] [PATCH 00/19] Drain fixes and cleanups, part 3

2018-04-11 Thread no-reply
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180411163940.2523-1-kw...@redhat.com
Subject: [Qemu-devel] [PATCH 00/19] Drain fixes and cleanups, part 3

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
   675608cb84..6523eaca37  master -> master
 t [tag update]
patchew/1523089594-1422-1-git-send-email-lidongc...@tencent.com -> 
patchew/1523089594-1422-1-git-send-email-lidongc...@tencent.com
 t [tag update]
patchew/1523377186-32578-1-git-send-email-c...@braap.org -> 
patchew/1523377186-32578-1-git-send-email-c...@braap.org
 t [tag update]
patchew/20180410134203.17552-1-peter.mayd...@linaro.org -> 
patchew/20180410134203.17552-1-peter.mayd...@linaro.org
 t [tag update]
patchew/20180410193919.28026-1-alex.ben...@linaro.org -> 
patchew/20180410193919.28026-1-alex.ben...@linaro.org
 t [tag update]
patchew/20180411122606.367301-1-vsement...@virtuozzo.com -> 
patchew/20180411122606.367301-1-vsement...@virtuozzo.com
 * [new tag]   patchew/20180411163940.2523-1-kw...@redhat.com -> 
patchew/20180411163940.2523-1-kw...@redhat.com
Switched to a new branch 'test'
4b7690f2d7 test-bdrv-drain: Test graph changes in drain_all section
c3b712e854 block: Allow graph changes in bdrv_drain_all_begin/end sections
b42a7e0a7d block: Move bdrv_drain_all_begin() out of coroutine context
74bb69f37c block: Allow AIO_WAIT_WHILE with NULL ctx
a6e790e0bc test-bdrv-drain: Test that bdrv_drain_invoke() doesn't poll
26fc9f7a2f block: Defer .bdrv_drain_begin callback to polling phase
5de06df1ac test-bdrv-drain: Graph change through parent callback
13fb2f568b block: Don't poll in parent drain callbacks
48cfd9a68a test-bdrv-drain: Test node deletion in subtree recursion
81174751a0 block: Drain recursively with a single BDRV_POLL_WHILE()
1f4daf1742 test-bdrv-drain: Add test for node deletion
df4213f29a block: Remove bdrv_drain_recurse()
5bddf60629 block: Really pause block jobs on drain
aed8d29900 block: Avoid unnecessary aio_poll() in AIO_WAIT_WHILE()
6479633b40 tests/test-bdrv-drain: bdrv_drain_all() works in coroutines now
b02f2e5912 block: Don't manually poll in bdrv_drain_all()
c3fc61add1 block: Remove 'recursive' parameter from bdrv_drain_invoke()
f33873949d block: Use bdrv_do_drain_begin/end in bdrv_drain_all()
9edb04df89 test-bdrv-drain: bdrv_drain() works with cross-AioContext events

=== OUTPUT BEGIN ===
Checking PATCH 1/19: test-bdrv-drain: bdrv_drain() works with cross-AioContext events...
Checking PATCH 2/19: block: Use bdrv_do_drain_begin/end in bdrv_drain_all()...
Checking PATCH 3/19: block: Remove 'recursive' parameter from bdrv_drain_invoke()...
Checking PATCH 4/19: block: Don't manually poll in bdrv_drain_all()...
Checking PATCH 5/19: tests/test-bdrv-drain: bdrv_drain_all() works in coroutines now...
Checking PATCH 6/19: block: Avoid unnecessary aio_poll() in AIO_WAIT_WHILE()...
ERROR: trailing statements should be on next line
#37: FILE: block/io.c:189:
+while (aio_poll(bs->aio_context, false));

ERROR: braces {} are necessary for all arms of this statement
#37: FILE: block/io.c:189:
+while (aio_poll(bs->aio_context, false));
[...]

total: 2 errors, 0 warnings, 60 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 7/19: block: Really pause block jobs on drain...
ERROR: trailing statements should be on next line
#98: FILE: block/io.c:204:
+while (aio_poll(bs->aio_context, false));

ERROR: braces {} are necessary for all arms of this statement
#98: FILE: block/io.c:204:
+while (aio_poll(bs->aio_context, false));
[...]

total: 2 errors, 0 warnings, 234 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 8/19: block: Remove bdrv_drain_recurse()...
Checking PATCH 9/19: test-bdrv-drain: Add test for node deletion...
Checking PATCH 10/19: block: Drain recursively with a single BDRV_POLL_WHILE()...
Checking PATCH 11/19: test-bdrv-drain: Test node deletion in subtree recursion...
WARNING: line over 80 characters
#85: FILE: tests/test-bdrv-drain.c:1029:
+g_test_add_func("/bdrv-drain/detach/drain_subtree", 

Re: [Qemu-block] [PATCH v2 0/2] bitmaps persistent and migration fixes

2018-04-11 Thread Max Reitz
On 2018-04-11 14:26, Vladimir Sementsov-Ogievskiy wrote:
> v2:
> 
> 01: new, proposed by Max
> 02: fix leaving cat processes
> 
> Vladimir Sementsov-Ogievskiy (2):
>   qcow2: try load bitmaps only once
>   iotests: fix 169
> 
>  block/qcow2.h  |  1 +
>  block/qcow2.c  | 16 
>  tests/qemu-iotests/169 | 48 +++-
>  3 files changed, 36 insertions(+), 29 deletions(-)

Thanks, changed the comment wording in patch 1 according to Eric's
suggestion and applied to my block branch:

https://github.com/XanClic/qemu/commits/block

I left the commit wording unchanged because as you said there might be
more cases where we have bitmaps at the node although we haven't yet
loaded any from the file.  I certainly don't want to have to think too
much about those cases and that to me is the main use of patch 1.

If there is going to be an rc4 (it appears there is), I will send a pull
request with these patches for it.  If not, I probably won't.  Then I'll
just add a qemu-stable CC and get them into 2.13 -- unless you disagree
and you think that this series should indeed be in 2.12.

Max



[Qemu-block] [PATCH 14/19] block: Defer .bdrv_drain_begin callback to polling phase

2018-04-11 Thread Kevin Wolf
We cannot allow aio_poll() in bdrv_drain_invoke(begin=true) until we're
done with propagating the drain through the graph and are doing the
single final BDRV_POLL_WHILE().

Just schedule the coroutine with the callback and increase bs->in_flight
to make sure that the polling phase will wait for it.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index f372b9ffb0..fc26c1a2f8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -178,22 +178,40 @@ static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
 
 /* Set data->done before reading bs->wakeup.  */
 atomic_mb_set(&data->done, true);
-bdrv_wakeup(bs);
+bdrv_dec_in_flight(bs);
+
+if (data->begin) {
+g_free(data);
+}
 }
 
 /* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */
 static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
-BdrvCoDrainData data = { .bs = bs, .done = false, .begin = begin};
+BdrvCoDrainData *data;
 
 if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
 (!begin && !bs->drv->bdrv_co_drain_end)) {
 return;
 }
 
-data.co = qemu_coroutine_create(bdrv_drain_invoke_entry, &data);
-bdrv_coroutine_enter(bs, data.co);
-BDRV_POLL_WHILE(bs, !data.done);
+data = g_new(BdrvCoDrainData, 1);
+*data = (BdrvCoDrainData) {
+.bs = bs,
+.done = false,
+.begin = begin
+};
+
+/* Make sure the driver callback completes during the polling phase for
+ * drain_begin. */
+bdrv_inc_in_flight(bs);
+data->co = qemu_coroutine_create(bdrv_drain_invoke_entry, data);
+aio_co_schedule(bdrv_get_aio_context(bs), data->co);
+
+if (!begin) {
+BDRV_POLL_WHILE(bs, !data->done);
+g_free(data);
+}
 }
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
-- 
2.13.6




Re: [Qemu-block] [PATCH v2 2/2] iotests: fix 169

2018-04-11 Thread Max Reitz
On 2018-04-11 14:26, Vladimir Sementsov-Ogievskiy wrote:
> Improve and fix 169:
> - use MIGRATION events instead of RESUME
> - make a TODO: enable dirty-bitmaps capability for offline case
> - recreate vm_b without -incoming near test end
> 
> This (likely) fixes racy faults at least of the following types:
> 
> - timeout on waiting for RESUME event
> - sha256 mismatch on line 136 (142 after this patch)
> - fail to self.vm_b.launch() on line 135 (141 now after this patch)
> 
> And surely fixes cat processes, left after test finish.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  tests/qemu-iotests/169 | 48 +++-
>  1 file changed, 27 insertions(+), 21 deletions(-)

Looks good, makes the test pass reliably (here) and doesn't result in
overly loud meowing due to an abundance of orphaned cats.  Thanks!

Reviewed-by: Max Reitz 



[Qemu-block] [PATCH 18/19] block: Allow graph changes in bdrv_drain_all_begin/end sections

2018-04-11 Thread Kevin Wolf
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and
did a subtree drain for each of them. This works fine as long as the
graph is static, but sadly, reality looks different.

If the graph changes so that root nodes are added or removed, we would
have to compensate for this. bdrv_next() returns each root node only
once even if it's the root node for multiple BlockBackends or for a
monitor-owned block driver tree, which would only complicate things.

The much easier and more obviously correct way is to fundamentally
change the way the functions work: Iterate over all BlockDriverStates,
no matter who owns them, and drain them individually. Compensation is
only necessary when a new BDS is created inside a drain_all section.
Removal of a BDS doesn't require any action because it's gone afterwards
anyway.

This change means that some nodes get a higher bs->quiesce_count now
because each node propagates its individual drain to all of its parents.
In the old subtree drain, propagation back to the parent that made the
recursive drain request was avoided. While this isn't perfectly
beautiful, accepting the higher counts seems preferable to adding drain
code to multiple other places that modify the graph.

The test case is changed to account for the higher counts where
necessary.

Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  1 +
 include/block/block_int.h |  1 +
 block.c   | 20 +++-
 block/io.c| 58 ---
 tests/test-bdrv-drain.c   | 37 +-
 5 files changed, 87 insertions(+), 30 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index de2cba2c74..0b59d519c5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -412,6 +412,7 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
  Error **errp);
 bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base);
 BlockDriverState *bdrv_next_node(BlockDriverState *bs);
+BlockDriverState *bdrv_next_all_states(BlockDriverState *bs);
 
 typedef struct BdrvNextIterator {
 enum {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index dc6985e3ae..2c86a7b53f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -804,6 +804,7 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
 int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
 
+extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
diff --git a/block.c b/block.c
index 330238de19..c1b339c4a4 100644
--- a/block.c
+++ b/block.c
@@ -332,6 +332,10 @@ BlockDriverState *bdrv_new(void)
 
 qemu_co_queue_init(&bs->flush_queue);
 
+for (i = 0; i < bdrv_drain_all_count; i++) {
+bdrv_drained_begin(bs);
+}
+
 QTAILQ_INSERT_TAIL(&all_bdrv_states, bs, bs_list);
 
 return bs;
@@ -1160,7 +1164,7 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 int open_flags, Error **errp)
 {
 Error *local_err = NULL;
-int ret;
+int i, ret;
 
 bdrv_assign_node_name(bs, node_name, &local_err);
 if (local_err) {
@@ -1208,6 +1212,12 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 assert(bdrv_min_mem_align(bs) != 0);
 assert(is_power_of_2(bs->bl.request_alignment));
 
+for (i = 0; i < bs->quiesce_counter; i++) {
+if (drv->bdrv_co_drain_begin) {
+drv->bdrv_co_drain_begin(bs);
+}
+}
+
 return 0;
 open_failed:
 bs->drv = NULL;
@@ -4034,6 +4044,14 @@ BlockDriverState *bdrv_next_node(BlockDriverState *bs)
 return QTAILQ_NEXT(bs, node_list);
 }
 
+BlockDriverState *bdrv_next_all_states(BlockDriverState *bs)
+{
+if (!bs) {
+return QTAILQ_FIRST(&all_bdrv_states);
+}
+return QTAILQ_NEXT(bs, bs_list);
+}
+
 const char *bdrv_get_node_name(const BlockDriverState *bs)
 {
 return bs->node_name;
diff --git a/block/io.c b/block/io.c
index db6810b35f..af65c3ec2f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -38,6 +38,8 @@
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
 #define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
 
+static AioWait drain_all_aio_wait;
+
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 int64_t offset, int bytes, BdrvRequestFlags flags);
 
@@ -445,6 +447,29 @@ static void bdrv_drain_assert_idle(BlockDriverState *bs)
 }
 }
 
+unsigned int bdrv_drain_all_count = 0;
+
+static bool bdrv_drain_all_poll(void)
+{
+BlockDriverState *bs = NULL;
+bool result = false;
+
+/* Execute pending BHs first (may modify the graph) and check everything
+ * else only after the BHs have executed. */
+while (aio_poll(qemu_get_aio_context(), false));
+
+

[Qemu-block] [PATCH 11/19] test-bdrv-drain: Test node deletion in subtree recursion

2018-04-11 Thread Kevin Wolf
If bdrv_do_drained_begin() polls during its subtree recursion, the graph
can change and mess up the bs->children iteration. Test that this
doesn't happen.

Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 05768213be..4af6571ca3 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -875,7 +875,8 @@ static void coroutine_fn test_co_delete_by_drain(void *opaque)
  * If @detach_instead_of_delete is set, the BDS is not going to be
  * deleted but will only detach all of its children.
  */
-static void do_test_delete_by_drain(bool detach_instead_of_delete)
+static void do_test_delete_by_drain(bool detach_instead_of_delete,
+enum drain_type drain_type)
 {
 BlockBackend *blk;
 BlockDriverState *bs, *child_bs, *null_bs;
@@ -931,9 +932,23 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete)
  * test_co_delete_by_drain() resuming.  Thus, @bs will be deleted
  * and the coroutine will exit while this drain operation is still
  * in progress. */
-bdrv_ref(child_bs);
-bdrv_drain(child_bs);
-bdrv_unref(child_bs);
+switch (drain_type) {
+case BDRV_DRAIN:
+bdrv_ref(child_bs);
+bdrv_drain(child_bs);
+bdrv_unref(child_bs);
+break;
+case BDRV_SUBTREE_DRAIN:
+/* Would have to ref/unref bs here for !detach_instead_of_delete, but
+ * then the whole test becomes pointless because the graph changes
+ * don't occur during the drain any more. */
+assert(detach_instead_of_delete);
+bdrv_subtree_drained_begin(bs);
+bdrv_subtree_drained_end(bs);
+break;
+default:
+g_assert_not_reached();
+}
 
 while (!dbdd.done) {
 aio_poll(qemu_get_aio_context(), true);
@@ -946,15 +961,19 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete)
 }
 }
 
-
 static void test_delete_by_drain(void)
 {
-do_test_delete_by_drain(false);
+do_test_delete_by_drain(false, BDRV_DRAIN);
 }
 
 static void test_detach_by_drain(void)
 {
-do_test_delete_by_drain(true);
+do_test_delete_by_drain(true, BDRV_DRAIN);
+}
+
+static void test_detach_by_drain_subtree(void)
+{
+do_test_delete_by_drain(true, BDRV_SUBTREE_DRAIN);
 }
 
 
@@ -1005,8 +1024,9 @@ int main(int argc, char **argv)
 g_test_add_func("/bdrv-drain/blockjob/drain_subtree",
 test_blockjob_drain_subtree);
 
-g_test_add_func("/bdrv-drain/deletion", test_delete_by_drain);
-g_test_add_func("/bdrv-drain/detach", test_detach_by_drain);
+g_test_add_func("/bdrv-drain/deletion/drain", test_delete_by_drain);
+g_test_add_func("/bdrv-drain/detach/drain", test_detach_by_drain);
+g_test_add_func("/bdrv-drain/detach/drain_subtree", test_detach_by_drain_subtree);
 
 ret = g_test_run();
 qemu_event_destroy(&done_event);
-- 
2.13.6




[Qemu-block] [PATCH 19/19] test-bdrv-drain: Test graph changes in drain_all section

2018-04-11 Thread Kevin Wolf
This tests both adding and removing a node between bdrv_drain_all_begin()
and bdrv_drain_all_end(), and enables the existing detach test for
drain_all.

Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 75 +++--
 1 file changed, 73 insertions(+), 2 deletions(-)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 97ca0743c6..462842a761 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -457,7 +457,7 @@ static void test_multiparent(void)
 blk_unref(blk_b);
 }
 
-static void test_graph_change(void)
+static void test_graph_change_drain_subtree(void)
 {
 BlockBackend *blk_a, *blk_b;
 BlockDriverState *bs_a, *bs_b, *backing;
@@ -536,6 +536,63 @@ static void test_graph_change(void)
 blk_unref(blk_b);
 }
 
+static void test_graph_change_drain_all(void)
+{
+BlockBackend *blk_a, *blk_b;
+BlockDriverState *bs_a, *bs_b;
+BDRVTestState *a_s, *b_s;
+
+/* Create node A with a BlockBackend */
+blk_a = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
+bs_a = bdrv_new_open_driver(&bdrv_test, "test-node-a", BDRV_O_RDWR,
+&error_abort);
+a_s = bs_a->opaque;
+blk_insert_bs(blk_a, bs_a, &error_abort);
+
+g_assert_cmpint(bs_a->quiesce_counter, ==, 0);
+g_assert_cmpint(a_s->drain_count, ==, 0);
+
+/* Call bdrv_drain_all_begin() */
+bdrv_drain_all_begin();
+
+g_assert_cmpint(bs_a->quiesce_counter, ==, 1);
+g_assert_cmpint(a_s->drain_count, ==, 1);
+
+/* Create node B with a BlockBackend */
+blk_b = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
+bs_b = bdrv_new_open_driver(&bdrv_test, "test-node-b", BDRV_O_RDWR,
+&error_abort);
+b_s = bs_b->opaque;
+blk_insert_bs(blk_b, bs_b, &error_abort);
+
+g_assert_cmpint(bs_a->quiesce_counter, ==, 1);
+g_assert_cmpint(bs_b->quiesce_counter, ==, 1);
+g_assert_cmpint(a_s->drain_count, ==, 1);
+g_assert_cmpint(b_s->drain_count, ==, 1);
+
+/* Unref and finally delete node A */
+blk_unref(blk_a);
+
+g_assert_cmpint(bs_a->quiesce_counter, ==, 1);
+g_assert_cmpint(bs_b->quiesce_counter, ==, 1);
+g_assert_cmpint(a_s->drain_count, ==, 1);
+g_assert_cmpint(b_s->drain_count, ==, 1);
+
+bdrv_unref(bs_a);
+
+g_assert_cmpint(bs_b->quiesce_counter, ==, 1);
+g_assert_cmpint(b_s->drain_count, ==, 1);
+
+/* End the drained section */
+bdrv_drain_all_end();
+
+g_assert_cmpint(bs_b->quiesce_counter, ==, 0);
+g_assert_cmpint(b_s->drain_count, ==, 0);
+
+bdrv_unref(bs_b);
+blk_unref(blk_b);
+}
+
 struct test_iothread_data {
 BlockDriverState *bs;
 enum drain_type drain_type;
@@ -971,6 +1028,10 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
 bdrv_subtree_drained_begin(bs);
 bdrv_subtree_drained_end(bs);
 break;
+case BDRV_DRAIN_ALL:
+bdrv_drain_all_begin();
+bdrv_drain_all_end();
+break;
 default:
 g_assert_not_reached();
 }
@@ -991,6 +1052,11 @@ static void test_delete_by_drain(void)
 do_test_delete_by_drain(false, BDRV_DRAIN);
 }
 
+static void test_detach_by_drain_all(void)
+{
+do_test_delete_by_drain(true, BDRV_DRAIN_ALL);
+}
+
 static void test_detach_by_drain(void)
 {
 do_test_delete_by_drain(true, BDRV_DRAIN);
@@ -1219,7 +1285,11 @@ int main(int argc, char **argv)
 
 g_test_add_func("/bdrv-drain/nested", test_nested);
 g_test_add_func("/bdrv-drain/multiparent", test_multiparent);
-g_test_add_func("/bdrv-drain/graph-change", test_graph_change);
+
+g_test_add_func("/bdrv-drain/graph-change/drain_subtree",
+test_graph_change_drain_subtree);
+g_test_add_func("/bdrv-drain/graph-change/drain_all",
+test_graph_change_drain_all);
 
 g_test_add_func("/bdrv-drain/iothread/drain_all", test_iothread_drain_all);
 g_test_add_func("/bdrv-drain/iothread/drain", test_iothread_drain);
@@ -1232,6 +1302,7 @@ int main(int argc, char **argv)
 test_blockjob_drain_subtree);
 
 g_test_add_func("/bdrv-drain/deletion/drain", test_delete_by_drain);
+g_test_add_func("/bdrv-drain/detach/drain_all", test_detach_by_drain_all);
 g_test_add_func("/bdrv-drain/detach/drain", test_detach_by_drain);
 g_test_add_func("/bdrv-drain/detach/drain_subtree", test_detach_by_drain_subtree);
 g_test_add_func("/bdrv-drain/detach/parent_cb", test_detach_by_parent_cb);
-- 
2.13.6




[Qemu-block] [PATCH 17/19] block: Move bdrv_drain_all_begin() out of coroutine context

2018-04-11 Thread Kevin Wolf
Before we can introduce a single polling loop for all nodes in
bdrv_drain_all_begin(), we must make sure to run it outside of coroutine
context like we already do for bdrv_do_drained_begin().

Signed-off-by: Kevin Wolf 
---
 block/io.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index fc26c1a2f8..db6810b35f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -255,11 +255,16 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 Coroutine *co = data->co;
 BlockDriverState *bs = data->bs;
 
-bdrv_dec_in_flight(bs);
-if (data->begin) {
-bdrv_do_drained_begin(bs, data->recursive, data->parent, data->poll);
+if (bs) {
+bdrv_dec_in_flight(bs);
+if (data->begin) {
+bdrv_do_drained_begin(bs, data->recursive, data->parent, 
data->poll);
+} else {
+bdrv_do_drained_end(bs, data->recursive, data->parent);
+}
 } else {
-bdrv_do_drained_end(bs, data->recursive, data->parent);
+assert(data->begin);
+bdrv_drain_all_begin();
 }
 
 data->done = true;
@@ -285,7 +290,9 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
 .parent = parent,
 .poll = poll,
 };
-bdrv_inc_in_flight(bs);
+if (bs) {
+bdrv_inc_in_flight(bs);
+}
 aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
 bdrv_co_drain_bh_cb, &data);
 
@@ -455,6 +462,11 @@ void bdrv_drain_all_begin(void)
 BlockDriverState *bs;
 BdrvNextIterator it;
 
+if (qemu_in_coroutine()) {
+bdrv_co_yield_to_drain(NULL, true, false, NULL, true);
+return;
+}
+
 /* BDRV_POLL_WHILE() for a node can only be called from its own I/O thread
  * or the main loop AioContext. We potentially use BDRV_POLL_WHILE() on
  * nodes in several different AioContexts, so make sure we're in the main
-- 
2.13.6




[Qemu-block] [PATCH 16/19] block: Allow AIO_WAIT_WHILE with NULL ctx

2018-04-11 Thread Kevin Wolf
bdrv_drain_all() wants to have a single polling loop for draining the
in-flight requests of all nodes. This means that the AIO_WAIT_WHILE()
condition relies on activity in multiple AioContexts, which is polled
from the mainloop context. We must therefore call AIO_WAIT_WHILE() from
the mainloop thread and use the AioWait notification mechanism.

Just randomly picking the AioContext of any non-mainloop thread would
work, but instead of bothering to find such a context in the caller, we
can just as well accept NULL for ctx.

Signed-off-by: Kevin Wolf 
---
 include/block/aio-wait.h | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index 783d3678dd..c85a62f798 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -57,7 +57,8 @@ typedef struct {
 /**
  * AIO_WAIT_WHILE:
  * @wait: the aio wait object
- * @ctx: the aio context
+ * @ctx: the aio context, or NULL if multiple aio contexts (for which the
+ *   caller does not hold a lock) are involved in the polling condition.
  * @cond: wait while this conditional expression is true
  *
  * Wait while a condition is true.  Use this to implement synchronous
@@ -75,7 +76,7 @@ typedef struct {
 bool waited_ = false;  \
 AioWait *wait_ = (wait);   \
 AioContext *ctx_ = (ctx);  \
-if (in_aio_context_home_thread(ctx_)) {\
+if (ctx_ && in_aio_context_home_thread(ctx_)) {\
 while ((cond)) {   \
 aio_poll(ctx_, true);  \
 waited_ = true;\
@@ -86,9 +87,13 @@ typedef struct {
 /* Increment wait_->num_waiters before evaluating cond. */ \
 atomic_inc(&wait_->num_waiters);   \
 while ((cond)) {   \
-aio_context_release(ctx_); \
+if (ctx_) {\
+aio_context_release(ctx_); \
+}  \
 aio_poll(qemu_get_aio_context(), true);\
-aio_context_acquire(ctx_); \
+if (ctx_) {\
+aio_context_acquire(ctx_); \
+}  \
 waited_ = true;\
 }  \
 atomic_dec(&wait_->num_waiters);   \
-- 
2.13.6




[Qemu-block] [PATCH 15/19] test-bdrv-drain: Test that bdrv_drain_invoke() doesn't poll

2018-04-11 Thread Kevin Wolf
This adds a test case that goes wrong if bdrv_drain_invoke() calls
aio_poll().

Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 102 +---
 1 file changed, 88 insertions(+), 14 deletions(-)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index fdf3ce19ea..79d845ecbb 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -34,12 +34,16 @@ static QemuEvent done_event;
 typedef struct BDRVTestState {
 int drain_count;
 AioContext *bh_indirection_ctx;
+bool sleep_in_drain_begin;
 } BDRVTestState;
 
 static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
 {
 BDRVTestState *s = bs->opaque;
 s->drain_count++;
+if (s->sleep_in_drain_begin) {
+qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10);
+}
 }
 
 static void coroutine_fn bdrv_test_co_drain_end(BlockDriverState *bs)
@@ -80,6 +84,22 @@ static int coroutine_fn bdrv_test_co_preadv(BlockDriverState 
*bs,
 return 0;
 }
 
+static void bdrv_test_child_perm(BlockDriverState *bs, BdrvChild *c,
+ const BdrvChildRole *role,
+ BlockReopenQueue *reopen_queue,
+ uint64_t perm, uint64_t shared,
+ uint64_t *nperm, uint64_t *nshared)
+{
+/* bdrv_format_default_perms() accepts only these two, so disguise
+ * detach_by_driver_cb_role as one of them. */
+if (role != &child_file && role != &child_backing) {
+role = &child_file;
+}
+
+bdrv_format_default_perms(bs, c, role, reopen_queue, perm, shared,
+  nperm, nshared);
+}
+
 static BlockDriver bdrv_test = {
 .format_name= "test",
 .instance_size  = sizeof(BDRVTestState),
@@ -90,7 +110,7 @@ static BlockDriver bdrv_test = {
 .bdrv_co_drain_begin= bdrv_test_co_drain_begin,
 .bdrv_co_drain_end  = bdrv_test_co_drain_end,
 
-.bdrv_child_perm= bdrv_format_default_perms,
+.bdrv_child_perm= bdrv_test_child_perm,
 };
 
 static void aio_ret_cb(void *opaque, int ret)
@@ -982,13 +1002,14 @@ struct detach_by_parent_data {
 BdrvChild *child_b;
 BlockDriverState *c;
 BdrvChild *child_c;
+bool by_parent_cb;
 };
+static struct detach_by_parent_data detach_by_parent_data;
 
-static void detach_by_parent_aio_cb(void *opaque, int ret)
+static void detach_indirect_bh(void *opaque)
 {
 struct detach_by_parent_data *data = opaque;
 
-g_assert_cmpint(ret, ==, 0);
 bdrv_unref_child(data->parent_b, data->child_b);
 
 bdrv_ref(data->c);
@@ -996,6 +1017,25 @@ static void detach_by_parent_aio_cb(void *opaque, int ret)
  &child_file, &error_abort);
 }
 
+static void detach_by_parent_aio_cb(void *opaque, int ret)
+{
+struct detach_by_parent_data *data = &detach_by_parent_data;
+
+g_assert_cmpint(ret, ==, 0);
+if (data->by_parent_cb) {
+detach_indirect_bh(data);
+}
+}
+
+static void detach_by_driver_cb_drained_begin(BdrvChild *child)
+{
+aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
+detach_indirect_bh, &detach_by_parent_data);
+child_file.drained_begin(child);
+}
+
+static BdrvChildRole detach_by_driver_cb_role;
+
 /*
  * Initial graph:
  *
@@ -1003,17 +1043,25 @@ static void detach_by_parent_aio_cb(void *opaque, int 
ret)
  *\ /   \
  * A B C
  *
- * PA has a pending write request whose callback changes the child nodes of PB:
- * It removes B and adds C instead. The subtree of PB is drained, which will
- * indirectly drain the write request, too.
+ * by_parent_cb == true:  Test that parent callbacks don't poll
+ *
+ * PA has a pending write request whose callback changes the child nodes of
+ * PB: It removes B and adds C instead. The subtree of PB is drained, which
+ * will indirectly drain the write request, too.
+ *
+ * by_parent_cb == false: Test that bdrv_drain_invoke() doesn't poll
+ *
+ * PA's BdrvChildRole has a .drained_begin callback that schedules a BH
+ * that does the same graph change. If bdrv_drain_invoke() calls it, the
+ * state is messed up, but if it is only polled in the single
+ * BDRV_POLL_WHILE() at the end of the drain, this should work fine.
  */
-static void test_detach_by_parent_cb(void)
+static void test_detach_indirect(bool by_parent_cb)
 {
 BlockBackend *blk;
 BlockDriverState *parent_a, *parent_b, *a, *b, *c;
 BdrvChild *child_a, *child_b;
 BlockAIOCB *acb;
-struct detach_by_parent_data data;
 
 QEMUIOVector qiov;
 struct iovec iov = {
@@ -1022,6 +1070,12 @@ static void test_detach_by_parent_cb(void)
 };
 qemu_iovec_init_external(&qiov, &iov, 1);
 
+if (!by_parent_cb) {
+detach_by_driver_cb_role = child_file;
+detach_by_driver_cb_role.drained_begin =
+detach_by_driver_cb_drained_begin;
+}
+
 /* Create all involved nodes */
 

[Qemu-block] [PATCH 13/19] test-bdrv-drain: Graph change through parent callback

2018-04-11 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 130 
 1 file changed, 130 insertions(+)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 4af6571ca3..fdf3ce19ea 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -977,6 +977,135 @@ static void test_detach_by_drain_subtree(void)
 }
 
 
+struct detach_by_parent_data {
+BlockDriverState *parent_b;
+BdrvChild *child_b;
+BlockDriverState *c;
+BdrvChild *child_c;
+};
+
+static void detach_by_parent_aio_cb(void *opaque, int ret)
+{
+struct detach_by_parent_data *data = opaque;
+
+g_assert_cmpint(ret, ==, 0);
+bdrv_unref_child(data->parent_b, data->child_b);
+
+bdrv_ref(data->c);
+data->child_c = bdrv_attach_child(data->parent_b, data->c, "PB-C",
+  &child_file, &error_abort);
+}
+
+/*
+ * Initial graph:
+ *
+ * PA PB
+ *\ /   \
+ * A B C
+ *
+ * PA has a pending write request whose callback changes the child nodes of PB:
+ * It removes B and adds C instead. The subtree of PB is drained, which will
+ * indirectly drain the write request, too.
+ */
+static void test_detach_by_parent_cb(void)
+{
+BlockBackend *blk;
+BlockDriverState *parent_a, *parent_b, *a, *b, *c;
+BdrvChild *child_a, *child_b;
+BlockAIOCB *acb;
+struct detach_by_parent_data data;
+
+QEMUIOVector qiov;
+struct iovec iov = {
+.iov_base = NULL,
+.iov_len = 0,
+};
+qemu_iovec_init_external(&qiov, &iov, 1);
+
+/* Create all involved nodes */
+parent_a = bdrv_new_open_driver(&bdrv_test, "parent-a", BDRV_O_RDWR,
+&error_abort);
+parent_b = bdrv_new_open_driver(&bdrv_test, "parent-b", 0,
+&error_abort);
+
+a = bdrv_new_open_driver(&bdrv_test, "a", BDRV_O_RDWR, &error_abort);
+b = bdrv_new_open_driver(&bdrv_test, "b", BDRV_O_RDWR, &error_abort);
+c = bdrv_new_open_driver(&bdrv_test, "c", BDRV_O_RDWR, &error_abort);
+
+/* blk is a BB for parent-a */
+blk = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
+blk_insert_bs(blk, parent_a, &error_abort);
+bdrv_unref(parent_a);
+
+/* Set child relationships */
+bdrv_ref(b);
+bdrv_ref(a);
+child_b = bdrv_attach_child(parent_b, b, "PB-B", &child_file, &error_abort);
+child_a = bdrv_attach_child(parent_b, a, "PB-A", &child_backing, &error_abort);
+
+bdrv_ref(a);
+bdrv_attach_child(parent_a, a, "PA-A", &child_file, &error_abort);
+
+g_assert_cmpint(parent_a->refcnt, ==, 1);
+g_assert_cmpint(parent_b->refcnt, ==, 1);
+g_assert_cmpint(a->refcnt, ==, 3);
+g_assert_cmpint(b->refcnt, ==, 2);
+g_assert_cmpint(c->refcnt, ==, 1);
+
+g_assert(QLIST_FIRST(&parent_b->children) == child_a);
+g_assert(QLIST_NEXT(child_a, next) == child_b);
+g_assert(QLIST_NEXT(child_b, next) == NULL);
+
+/* Start the evil write request */
+data = (struct detach_by_parent_data) {
+.parent_b = parent_b,
+.child_b = child_b,
+.c = c,
+};
+acb = blk_aio_preadv(blk, 0, &qiov, 0, detach_by_parent_aio_cb, &data);
+g_assert(acb != NULL);
+
+/* Drain and check the expected result */
+bdrv_subtree_drained_begin(parent_b);
+
+g_assert(data.child_c != NULL);
+
+g_assert_cmpint(parent_a->refcnt, ==, 1);
+g_assert_cmpint(parent_b->refcnt, ==, 1);
+g_assert_cmpint(a->refcnt, ==, 3);
+g_assert_cmpint(b->refcnt, ==, 1);
+g_assert_cmpint(c->refcnt, ==, 2);
+
+g_assert(QLIST_FIRST(&parent_b->children) == data.child_c);
+g_assert(QLIST_NEXT(data.child_c, next) == child_a);
+g_assert(QLIST_NEXT(child_a, next) == NULL);
+
+g_assert_cmpint(parent_a->quiesce_counter, ==, 1);
+g_assert_cmpint(parent_b->quiesce_counter, ==, 1);
+g_assert_cmpint(a->quiesce_counter, ==, 1);
+g_assert_cmpint(b->quiesce_counter, ==, 0);
+g_assert_cmpint(c->quiesce_counter, ==, 1);
+
+bdrv_subtree_drained_end(parent_b);
+
+bdrv_unref(parent_b);
+blk_unref(blk);
+
+/* XXX Once bdrv_close() unref's children instead of just detaching them,
+ * this won't be necessary any more. */
+bdrv_unref(a);
+bdrv_unref(a);
+bdrv_unref(c);
+
+g_assert_cmpint(a->refcnt, ==, 1);
+g_assert_cmpint(b->refcnt, ==, 1);
+g_assert_cmpint(c->refcnt, ==, 1);
+bdrv_unref(a);
+bdrv_unref(b);
+bdrv_unref(c);
+}
+
+
 int main(int argc, char **argv)
 {
 int ret;
@@ -1027,6 +1156,7 @@ int main(int argc, char **argv)
 g_test_add_func("/bdrv-drain/deletion/drain", test_delete_by_drain);
 g_test_add_func("/bdrv-drain/detach/drain", test_detach_by_drain);
 g_test_add_func("/bdrv-drain/detach/drain_subtree", 
test_detach_by_drain_subtree);
+g_test_add_func("/bdrv-drain/detach/parent_cb", test_detach_by_parent_cb);
 
 ret = g_test_run();
 qemu_event_destroy(_event);
-- 
2.13.6




[Qemu-block] [PATCH 12/19] block: Don't poll in parent drain callbacks

2018-04-11 Thread Kevin Wolf
bdrv_do_drained_begin() is only safe if we have a single
BDRV_POLL_WHILE() after quiescing all affected nodes. We cannot allow
parent callbacks to introduce a nested polling loop that could cause
graph changes while we're traversing the graph.

Split off bdrv_do_drained_begin_quiesce(), which only quiesces a single
node without waiting for its requests to complete. These requests will
be waited for in the BDRV_POLL_WHILE() call down the call chain.

Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  9 +
 block.c   |  2 +-
 block/io.c| 24 
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 91bf3b4e36..de2cba2c74 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -578,6 +578,15 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool top_level, 
bool recursive);
 void bdrv_drained_begin(BlockDriverState *bs);
 
 /**
+ * bdrv_do_drained_begin_quiesce:
+ *
+ * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
+ * running requests to complete.
+ */
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
+   BdrvChild *parent);
+
+/**
  * Like bdrv_drained_begin, but recursively begins a quiesced section for
  * exclusive access to all child nodes as well.
  */
diff --git a/block.c b/block.c
index 9fe39ac8c1..330238de19 100644
--- a/block.c
+++ b/block.c
@@ -817,7 +817,7 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c)
 static void bdrv_child_cb_drained_begin(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-bdrv_drained_begin(bs);
+bdrv_do_drained_begin_quiesce(bs, NULL);
 }
 
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
diff --git a/block/io.c b/block/io.c
index 1287630c58..f372b9ffb0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -277,15 +277,10 @@ static void coroutine_fn 
bdrv_co_yield_to_drain(BlockDriverState *bs,
 assert(data.done);
 }
 
-void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
-   BdrvChild *parent, bool poll)
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
+   BdrvChild *parent)
 {
-BdrvChild *child, *next;
-
-if (qemu_in_coroutine()) {
-bdrv_co_yield_to_drain(bs, true, recursive, parent, poll);
-return;
-}
+assert(!qemu_in_coroutine());
 
 /* Stop things in parent-to-child order */
     if (atomic_fetch_inc(&bs->quiesce_counter) == 0) {
@@ -294,6 +289,19 @@ void bdrv_do_drained_begin(BlockDriverState *bs, bool 
recursive,
 
 bdrv_parent_drained_begin(bs, parent);
 bdrv_drain_invoke(bs, true);
+}
+
+static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
+  BdrvChild *parent, bool poll)
+{
+BdrvChild *child, *next;
+
+if (qemu_in_coroutine()) {
+bdrv_co_yield_to_drain(bs, true, recursive, parent, poll);
+return;
+}
+
+bdrv_do_drained_begin_quiesce(bs, parent);
 
 if (recursive) {
 bs->recursive_quiesce_counter++;
-- 
2.13.6




[Qemu-block] [PATCH 10/19] block: Drain recursively with a single BDRV_POLL_WHILE()

2018-04-11 Thread Kevin Wolf
Anything can happen inside BDRV_POLL_WHILE(), including graph
changes that may interfere with its callers (e.g. child list iteration
in recursive callers of bdrv_do_drained_begin).

Switch to a single BDRV_POLL_WHILE() call for the whole subtree at the
end of bdrv_do_drained_begin() to avoid such effects. The recursion
happens now inside the loop condition. As the graph can only change
between bdrv_drain_poll() calls, but not inside of it, doing the
recursion here is safe.

Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  2 +-
 block.c   |  2 +-
 block/io.c| 58 +--
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 23dee3c114..91bf3b4e36 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -563,7 +563,7 @@ void bdrv_parent_drained_end(BlockDriverState *bs, 
BdrvChild *ignore);
  *
  * Poll for pending requests in @bs. This is part of bdrv_drained_begin.
  */
-bool bdrv_drain_poll(BlockDriverState *bs, bool top_level);
+bool bdrv_drain_poll(BlockDriverState *bs, bool top_level, bool recursive);
 
 /**
  * bdrv_drained_begin:
diff --git a/block.c b/block.c
index 462287bdfb..9fe39ac8c1 100644
--- a/block.c
+++ b/block.c
@@ -823,7 +823,7 @@ static void bdrv_child_cb_drained_begin(BdrvChild *child)
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-return bdrv_drain_poll(bs, false);
+return bdrv_drain_poll(bs, false, false);
 }
 
 static void bdrv_child_cb_drained_end(BdrvChild *child)
diff --git a/block/io.c b/block/io.c
index f24f39c278..1287630c58 100644
--- a/block/io.c
+++ b/block/io.c
@@ -161,6 +161,7 @@ typedef struct {
 bool done;
 bool begin;
 bool recursive;
+bool poll;
 BdrvChild *parent;
 } BdrvCoDrainData;
 
@@ -196,8 +197,10 @@ static void bdrv_drain_invoke(BlockDriverState *bs, bool 
begin)
 }
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
-bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
+bool bdrv_drain_poll(BlockDriverState *bs, bool top_level, bool recursive)
 {
+BdrvChild *child, *next;
+
 /* Execute pending BHs first and check everything else only after the BHs
  * have executed. */
 if (top_level) {
@@ -208,11 +211,23 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
 return true;
 }
 
-return atomic_read(&bs->in_flight);
+if (atomic_read(&bs->in_flight)) {
+return true;
+}
+
+if (recursive) {
+QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
+if (bdrv_drain_poll(child->bs, false, recursive)) {
+return true;
+}
+}
+}
+
+return false;
 }
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
-  BdrvChild *parent);
+  BdrvChild *parent, bool poll);
 static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
 BdrvChild *parent);
 
@@ -224,7 +239,7 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 
 bdrv_dec_in_flight(bs);
 if (data->begin) {
-bdrv_do_drained_begin(bs, data->recursive, data->parent);
+bdrv_do_drained_begin(bs, data->recursive, data->parent, data->poll);
 } else {
 bdrv_do_drained_end(bs, data->recursive, data->parent);
 }
@@ -235,7 +250,7 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 
 static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
 bool begin, bool recursive,
-BdrvChild *parent)
+BdrvChild *parent, bool poll)
 {
 BdrvCoDrainData data;
 
@@ -250,6 +265,7 @@ static void coroutine_fn 
bdrv_co_yield_to_drain(BlockDriverState *bs,
 .begin = begin,
 .recursive = recursive,
 .parent = parent,
+.poll = poll,
 };
 bdrv_inc_in_flight(bs);
 aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
@@ -262,12 +278,12 @@ static void coroutine_fn 
bdrv_co_yield_to_drain(BlockDriverState *bs,
 }
 
 void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
-   BdrvChild *parent)
+   BdrvChild *parent, bool poll)
 {
 BdrvChild *child, *next;
 
 if (qemu_in_coroutine()) {
-bdrv_co_yield_to_drain(bs, true, recursive, parent);
+bdrv_co_yield_to_drain(bs, true, recursive, parent, poll);
 return;
 }
 
@@ -279,25 +295,35 @@ void bdrv_do_drained_begin(BlockDriverState *bs, bool 
recursive,
 bdrv_parent_drained_begin(bs, parent);
 bdrv_drain_invoke(bs, true);
 
-/* Wait for drained requests to finish */
-BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, true));
-
 if (recursive) {
 

[Qemu-block] [PATCH 00/19] Drain fixes and cleanups, part 3

2018-04-11 Thread Kevin Wolf
This is the third and hopefully for now last part of my work to fix
drain. The main goal of this series is to make drain robust against
graph changes that happen in any callbacks of in-flight requests while
we drain a block node.

The individual patches describe the details, but the rough plan is to
change all three drain types (single node, subtree and all) to work like
this:

1. First call all the necessary callbacks to quiesce external sources
   for new requests. This includes the block driver callbacks, the child
   node callbacks and disabling external AioContext events. This is done
   recursively.

   Much of the trouble we had with drain resulted from the fact that the
   graph changed while we were traversing the graph recursively. None of
   the callbacks called in this phase may change the graph.

2. Then do a single AIO_WAIT_WHILE() to drain the requests of all
   affected nodes. The aio_poll() called by it is where graph changes
   can happen and we need to be careful.

   However, while evaluating the loop condition, the graph can't change,
   so we can safely call all necessary callbacks, if needed recursively,
   to determine whether there are still pending requests in any affected
   nodes. We just need to make sure that we don't rely on the set of
   nodes being the same between any two evaluations of the condition.

There are a few more smaller, mostly self-contained changes needed
before we're actually safe, but this is the main mechanism that will
help you understand what we're working towards during the series.

Kevin Wolf (18):
  test-bdrv-drain: bdrv_drain() works with cross-AioContext events
  block: Use bdrv_do_drain_begin/end in bdrv_drain_all()
  block: Remove 'recursive' parameter from bdrv_drain_invoke()
  block: Don't manually poll in bdrv_drain_all()
  tests/test-bdrv-drain: bdrv_drain_all() works in coroutines now
  block: Avoid unnecessary aio_poll() in AIO_WAIT_WHILE()
  block: Really pause block jobs on drain
  block: Remove bdrv_drain_recurse()
  block: Drain recursively with a single BDRV_POLL_WHILE()
  test-bdrv-drain: Test node deletion in subtree recursion
  block: Don't poll in parent drain callbacks
  test-bdrv-drain: Graph change through parent callback
  block: Defer .bdrv_drain_begin callback to polling phase
  test-bdrv-drain: Test that bdrv_drain_invoke() doesn't poll
  block: Allow AIO_WAIT_WHILE with NULL ctx
  block: Move bdrv_drain_all_begin() out of coroutine context
  block: Allow graph changes in bdrv_drain_all_begin/end sections
  test-bdrv-drain: Test graph changes in drain_all section

Max Reitz (1):
  test-bdrv-drain: Add test for node deletion

 include/block/aio-wait.h |  25 +-
 include/block/block.h|  17 +
 include/block/block_int.h|   8 +
 include/block/blockjob_int.h |   8 +
 block.c  |  31 +-
 block/io.c   | 280 +---
 block/mirror.c   |   8 +
 blockjob.c   |  21 ++
 tests/test-bdrv-drain.c  | 738 ---
 9 files changed, 974 insertions(+), 162 deletions(-)

-- 
2.13.6




[Qemu-block] [PATCH 09/19] test-bdrv-drain: Add test for node deletion

2018-04-11 Thread Kevin Wolf
From: Max Reitz 

This patch adds two bdrv-drain tests for what happens if some BDS goes
away during the drainage.

The basic idea is that you have a parent BDS with some child nodes.
Then, you drain one of the children.  Because of that, the party who
actually owns the parent decides to (A) delete it, or (B) detach all its
children from it -- both while the child is still being drained.

A real-world case where this can happen is the mirror block job, which
may exit if you drain one of its children.

Signed-off-by: Max Reitz 
Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 169 
 1 file changed, 169 insertions(+)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 37f2d6ac8c..05768213be 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -792,6 +792,172 @@ static void test_blockjob_drain_subtree(void)
 test_blockjob_common(BDRV_SUBTREE_DRAIN);
 }
 
+
+typedef struct BDRVTestTopState {
+BdrvChild *wait_child;
+} BDRVTestTopState;
+
+static void bdrv_test_top_close(BlockDriverState *bs)
+{
+BdrvChild *c, *next_c;
+QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
+bdrv_unref_child(bs, c);
+}
+}
+
+static int coroutine_fn bdrv_test_top_co_preadv(BlockDriverState *bs,
+uint64_t offset, uint64_t 
bytes,
+QEMUIOVector *qiov, int flags)
+{
+BDRVTestTopState *tts = bs->opaque;
+return bdrv_co_preadv(tts->wait_child, offset, bytes, qiov, flags);
+}
+
+static BlockDriver bdrv_test_top_driver = {
+.format_name= "test_top_driver",
+.instance_size  = sizeof(BDRVTestTopState),
+
+.bdrv_close = bdrv_test_top_close,
+.bdrv_co_preadv = bdrv_test_top_co_preadv,
+
+.bdrv_child_perm= bdrv_format_default_perms,
+};
+
+typedef struct TestCoDeleteByDrainData {
+BlockBackend *blk;
+bool detach_instead_of_delete;
+bool done;
+} TestCoDeleteByDrainData;
+
+static void coroutine_fn test_co_delete_by_drain(void *opaque)
+{
+TestCoDeleteByDrainData *dbdd = opaque;
+BlockBackend *blk = dbdd->blk;
+BlockDriverState *bs = blk_bs(blk);
+BDRVTestTopState *tts = bs->opaque;
+void *buffer = g_malloc(65536);
+QEMUIOVector qiov;
+struct iovec iov = {
+.iov_base = buffer,
+.iov_len  = 65536,
+};
+
+qemu_iovec_init_external(&qiov, &iov, 1);
+
+/* Pretend some internal write operation from parent to child.
+ * Important: We have to read from the child, not from the parent!
+ * Draining works by first propagating it all up the tree to the
+ * root and then waiting for drainage from root to the leaves
+ * (protocol nodes).  If we have a request waiting on the root,
+ * everything will be drained before we go back down the tree, but
+ * we do not want that.  We want to be in the middle of draining
+ * when the following request returns. */
+bdrv_co_preadv(tts->wait_child, 0, 65536, &qiov, 0);
+
+g_assert_cmpint(bs->refcnt, ==, 1);
+
+if (!dbdd->detach_instead_of_delete) {
+blk_unref(blk);
+} else {
+BdrvChild *c, *next_c;
+QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
+bdrv_unref_child(bs, c);
+}
+}
+
+dbdd->done = true;
+}
+
+/**
+ * Test what happens when some BDS has some children, you drain one of
+ * them and this results in the BDS being deleted.
+ *
+ * If @detach_instead_of_delete is set, the BDS is not going to be
+ * deleted but will only detach all of its children.
+ */
+static void do_test_delete_by_drain(bool detach_instead_of_delete)
+{
+BlockBackend *blk;
+BlockDriverState *bs, *child_bs, *null_bs;
+BDRVTestTopState *tts;
+TestCoDeleteByDrainData dbdd;
+Coroutine *co;
+
+bs = bdrv_new_open_driver(&bdrv_test_top_driver, "top", BDRV_O_RDWR,
+  &error_abort);
+bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
+tts = bs->opaque;
+
+null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
+&error_abort);
+bdrv_attach_child(bs, null_bs, "null-child", &child_file, &error_abort);
+
+/* This child will be the one to pass to requests through to, and
+ * it will stall until a drain occurs */
+child_bs = bdrv_new_open_driver(&bdrv_test, "child", BDRV_O_RDWR,
+&error_abort);
+child_bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
+/* Takes our reference to child_bs */
+tts->wait_child = bdrv_attach_child(bs, child_bs, "wait-child", &child_file,
+&error_abort);
+
+/* This child is just there to be deleted
+ * (for detach_instead_of_delete == true) */
+null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
+&error_abort);
+bdrv_attach_child(bs, 

[Qemu-block] [PATCH 02/19] block: Use bdrv_do_drain_begin/end in bdrv_drain_all()

2018-04-11 Thread Kevin Wolf
bdrv_do_drain_begin/end() implement already everything that
bdrv_drain_all_begin/end() need and currently still do manually: Disable
external events, call parent drain callbacks, call block driver
callbacks.

It also does two more things:

The first is incrementing bs->quiesce_counter. bdrv_drain_all() already
stood out in the test case by behaving different from the other drain
variants. Adding this is not only safe, but in fact a bug fix.

The second is calling bdrv_drain_recurse(). We already do that later in
the same function in a loop, so basically doing an early first iteration
doesn't hurt.

Signed-off-by: Kevin Wolf 
---
 block/io.c  | 10 ++
 tests/test-bdrv-drain.c | 14 --
 2 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/block/io.c b/block/io.c
index 28e71221b0..cad59db2f4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -412,11 +412,8 @@ void bdrv_drain_all_begin(void)
 for (bs = bdrv_first(); bs; bs = bdrv_next()) {
 AioContext *aio_context = bdrv_get_aio_context(bs);
 
-/* Stop things in parent-to-child order */
 aio_context_acquire(aio_context);
-aio_disable_external(aio_context);
-bdrv_parent_drained_begin(bs, NULL);
-bdrv_drain_invoke(bs, true, true);
+bdrv_do_drained_begin(bs, true, NULL);
 aio_context_release(aio_context);
 
 if (!g_slist_find(aio_ctxs, aio_context)) {
@@ -457,11 +454,8 @@ void bdrv_drain_all_end(void)
 for (bs = bdrv_first(); bs; bs = bdrv_next()) {
 AioContext *aio_context = bdrv_get_aio_context(bs);
 
-/* Re-enable things in child-to-parent order */
 aio_context_acquire(aio_context);
-bdrv_drain_invoke(bs, false, true);
-bdrv_parent_drained_end(bs, NULL);
-aio_enable_external(aio_context);
+bdrv_do_drained_end(bs, true, NULL);
 aio_context_release(aio_context);
 }
 }
diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 29634102d8..cd870609c5 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -276,8 +276,7 @@ static void test_quiesce_common(enum drain_type drain_type, 
bool recursive)
 
 static void test_quiesce_drain_all(void)
 {
-// XXX drain_all doesn't quiesce
-//test_quiesce_common(BDRV_DRAIN_ALL, true);
+test_quiesce_common(BDRV_DRAIN_ALL, true);
 }
 
 static void test_quiesce_drain(void)
@@ -319,12 +318,7 @@ static void test_nested(void)
 
 for (outer = 0; outer < DRAIN_TYPE_MAX; outer++) {
 for (inner = 0; inner < DRAIN_TYPE_MAX; inner++) {
-/* XXX bdrv_drain_all() doesn't increase the quiesce_counter */
-int bs_quiesce  = (outer != BDRV_DRAIN_ALL) +
-  (inner != BDRV_DRAIN_ALL);
-int backing_quiesce = (outer == BDRV_SUBTREE_DRAIN) +
-  (inner == BDRV_SUBTREE_DRAIN);
-int backing_cb_cnt  = (outer != BDRV_DRAIN) +
+int backing_quiesce = (outer != BDRV_DRAIN) +
   (inner != BDRV_DRAIN);
 
 g_assert_cmpint(bs->quiesce_counter, ==, 0);
@@ -335,10 +329,10 @@ static void test_nested(void)
 do_drain_begin(outer, bs);
 do_drain_begin(inner, bs);
 
-g_assert_cmpint(bs->quiesce_counter, ==, bs_quiesce);
+g_assert_cmpint(bs->quiesce_counter, ==, 2);
 g_assert_cmpint(backing->quiesce_counter, ==, backing_quiesce);
 g_assert_cmpint(s->drain_count, ==, 2);
-g_assert_cmpint(backing_s->drain_count, ==, backing_cb_cnt);
+g_assert_cmpint(backing_s->drain_count, ==, backing_quiesce);
 
 do_drain_end(inner, bs);
 do_drain_end(outer, bs);
-- 
2.13.6




[Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-11 Thread Kevin Wolf
We already requested that block jobs be paused in .bdrv_drained_begin,
but no guarantee was made that the job was actually inactive at the
point where bdrv_drained_begin() returned.

This introduces a new callback BdrvChildRole.bdrv_drained_poll() and
uses it to make bdrv_drain_poll() consider block jobs using the node to
be drained.

For the test case to work as expected, we have to switch from
block_job_sleep_ns() to qemu_co_sleep_ns() so that the test job is even
considered active and must be waited for when draining the node.

Signed-off-by: Kevin Wolf 
---
 include/block/block.h|  7 +++
 include/block/block_int.h|  7 +++
 include/block/blockjob_int.h |  8 
 block.c  |  9 +
 block/io.c   | 27 ---
 block/mirror.c   |  8 
 blockjob.c   | 21 +
 tests/test-bdrv-drain.c  | 18 ++
 8 files changed, 94 insertions(+), 11 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index cdec3639a3..23dee3c114 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -559,6 +559,13 @@ void bdrv_parent_drained_begin(BlockDriverState *bs, 
BdrvChild *ignore);
 void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore);
 
 /**
+ * bdrv_drain_poll:
+ *
+ * Poll for pending requests in @bs. This is part of bdrv_drained_begin.
+ */
+bool bdrv_drain_poll(BlockDriverState *bs, bool top_level);
+
+/**
  * bdrv_drained_begin:
  *
  * Begin a quiesced section for exclusive access to the BDS, by disabling
diff --git a/include/block/block_int.h b/include/block/block_int.h
index c4dd1d4bb8..dc6985e3ae 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -575,6 +575,13 @@ struct BdrvChildRole {
 void (*drained_begin)(BdrvChild *child);
 void (*drained_end)(BdrvChild *child);
 
+/*
+ * Returns whether the parent has pending requests for the child. This
+ * callback is polled after .drained_begin() has been called until all
+ * activity on the child has stopped.
+ */
+bool (*drained_poll)(BdrvChild *child);
+
 /* Notifies the parent that the child has been activated/inactivated (e.g.
  * when migration is completing) and it can start/stop requesting
  * permissions and doing I/O on it. */
diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 642adce68b..3a2f851d3f 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -106,6 +106,14 @@ struct BlockJobDriver {
 void coroutine_fn (*resume)(BlockJob *job);
 
 /*
+ * Returns whether the job has pending requests for the child or will
+ * submit new requests before the next pause point. This callback is polled
+ * in the context of draining a job node after requesting that the job be
+ * paused, until all activity on the child has stopped.
+ */
+bool (*drained_poll)(BlockJob *job);
+
+/*
  * If the callback is not NULL, it will be invoked before the job is
  * resumed in a new AioContext.  This is the place to move any resources
  * besides job->blk to the new AioContext.
diff --git a/block.c b/block.c
index a2caadf0a0..462287bdfb 100644
--- a/block.c
+++ b/block.c
@@ -820,6 +820,12 @@ static void bdrv_child_cb_drained_begin(BdrvChild *child)
 bdrv_drained_begin(bs);
 }
 
+static bool bdrv_child_cb_drained_poll(BdrvChild *child)
+{
+BlockDriverState *bs = child->opaque;
+return bdrv_drain_poll(bs, false);
+}
+
 static void bdrv_child_cb_drained_end(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
@@ -904,6 +910,7 @@ const BdrvChildRole child_file = {
 .get_parent_desc = bdrv_child_get_parent_desc,
 .inherit_options = bdrv_inherited_options,
 .drained_begin   = bdrv_child_cb_drained_begin,
+.drained_poll= bdrv_child_cb_drained_poll,
 .drained_end = bdrv_child_cb_drained_end,
 .attach  = bdrv_child_cb_attach,
 .detach  = bdrv_child_cb_detach,
@@ -928,6 +935,7 @@ const BdrvChildRole child_format = {
 .get_parent_desc = bdrv_child_get_parent_desc,
 .inherit_options = bdrv_inherited_fmt_options,
 .drained_begin   = bdrv_child_cb_drained_begin,
+.drained_poll= bdrv_child_cb_drained_poll,
 .drained_end = bdrv_child_cb_drained_end,
 .attach  = bdrv_child_cb_attach,
 .detach  = bdrv_child_cb_detach,
@@ -1047,6 +1055,7 @@ const BdrvChildRole child_backing = {
 .detach  = bdrv_backing_detach,
 .inherit_options = bdrv_backing_options,
 .drained_begin   = bdrv_child_cb_drained_begin,
+.drained_poll= bdrv_child_cb_drained_poll,
 .drained_end = bdrv_child_cb_drained_end,
 .inactivate  = bdrv_child_cb_inactivate,
 .update_filename = bdrv_backing_update_filename,
diff --git a/block/io.c b/block/io.c
index 6f580f49ff..0a778eeff4 100644
--- 

[Qemu-block] [PATCH 06/19] block: Avoid unnecessary aio_poll() in AIO_WAIT_WHILE()

2018-04-11 Thread Kevin Wolf
Commit 91af091f923 added an additional aio_poll() to BDRV_POLL_WHILE()
in order to make sure that all pending BHs are executed on drain. This
was the wrong place to make the fix, as it is useless overhead for all
other users of the macro and unnecessarily complicates the mechanism.

This patch effectively reverts said commit (the context has changed a
bit and the code has moved to AIO_WAIT_WHILE()) and instead polls in the
loop condition for drain.

The effect is probably hard to measure in any real-world use case
because actual I/O will dominate, but if I run only the initialisation
part of 'qemu-img convert' where it calls bdrv_block_status() for the
whole image to find out how much data there is to copy, this phase actually
needs only roughly half the time after this patch.

Signed-off-by: Kevin Wolf 
---
 include/block/aio-wait.h | 22 --
 block/io.c   | 11 ++-
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index 8c90a2e66e..783d3678dd 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -73,29 +73,23 @@ typedef struct {
  */
 #define AIO_WAIT_WHILE(wait, ctx, cond) ({ \
 bool waited_ = false;  \
-bool busy_ = true; \
 AioWait *wait_ = (wait);   \
 AioContext *ctx_ = (ctx);  \
 if (in_aio_context_home_thread(ctx_)) {\
-while ((cond) || busy_) {  \
-busy_ = aio_poll(ctx_, (cond));\
-waited_ |= !!(cond) | busy_;   \
+while ((cond)) {   \
+aio_poll(ctx_, true);  \
+waited_ = true;\
 }  \
 } else {   \
 assert(qemu_get_current_aio_context() ==   \
qemu_get_aio_context());\
 /* Increment wait_->num_waiters before evaluating cond. */ \
        atomic_inc(&wait_->num_waiters);   \
-while (busy_) {\
-if ((cond)) {  \
-waited_ = busy_ = true;\
-aio_context_release(ctx_); \
-aio_poll(qemu_get_aio_context(), true);\
-aio_context_acquire(ctx_); \
-} else {   \
-busy_ = aio_poll(ctx_, false); \
-waited_ |= busy_;  \
-}  \
+while ((cond)) {   \
+aio_context_release(ctx_); \
+aio_poll(qemu_get_aio_context(), true);\
+aio_context_acquire(ctx_); \
+waited_ = true;\
 }  \
        atomic_dec(&wait_->num_waiters);   \
 }  \
diff --git a/block/io.c b/block/io.c
index ea6f9f023a..6f580f49ff 100644
--- a/block/io.c
+++ b/block/io.c
@@ -181,13 +181,22 @@ static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 BDRV_POLL_WHILE(bs, !data.done);
 }
 
+/* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
+static bool bdrv_drain_poll(BlockDriverState *bs)
+{
+/* Execute pending BHs first and check everything else only after the BHs
+ * have executed. */
+while (aio_poll(bs->aio_context, false));
+return atomic_read(&bs->in_flight);
+}
+
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
 BdrvChild *child, *tmp;
 bool waited;
 
 /* Wait for drained requests to finish */
-waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
+waited = BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs));
 
    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
 BlockDriverState *bs = child->bs;
-- 
2.13.6




[Qemu-block] [PATCH 05/19] tests/test-bdrv-drain: bdrv_drain_all() works in coroutines now

2018-04-11 Thread Kevin Wolf
Since we use bdrv_do_drained_begin/end() for bdrv_drain_all_begin/end(),
coroutine context is automatically left with a BH, preventing the
deadlocks that made bdrv_drain_all*() unsafe in coroutine context. We
can consider it compatible now, at the latest since we removed the old
polling code as dead code.

Enable the coroutine test cases for bdrv_drain_all().

Signed-off-by: Kevin Wolf 
---
 tests/test-bdrv-drain.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index cd870609c5..6de525b509 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -233,6 +233,11 @@ static void test_drv_cb_drain_subtree(void)
 test_drv_cb_common(BDRV_SUBTREE_DRAIN, true);
 }
 
+static void test_drv_cb_co_drain_all(void)
+{
+call_in_coroutine(test_drv_cb_drain_all);
+}
+
 static void test_drv_cb_co_drain(void)
 {
 call_in_coroutine(test_drv_cb_drain);
@@ -289,6 +294,11 @@ static void test_quiesce_drain_subtree(void)
 test_quiesce_common(BDRV_SUBTREE_DRAIN, true);
 }
 
+static void test_quiesce_co_drain_all(void)
+{
+call_in_coroutine(test_quiesce_drain_all);
+}
+
 static void test_quiesce_co_drain(void)
 {
 call_in_coroutine(test_quiesce_drain);
@@ -795,7 +805,8 @@ int main(int argc, char **argv)
 g_test_add_func("/bdrv-drain/driver-cb/drain_subtree",
 test_drv_cb_drain_subtree);
 
-// XXX bdrv_drain_all() doesn't work in coroutine context
+g_test_add_func("/bdrv-drain/driver-cb/co/drain_all",
+test_drv_cb_co_drain_all);
 g_test_add_func("/bdrv-drain/driver-cb/co/drain", test_drv_cb_co_drain);
 g_test_add_func("/bdrv-drain/driver-cb/co/drain_subtree",
 test_drv_cb_co_drain_subtree);
@@ -806,7 +817,8 @@ int main(int argc, char **argv)
 g_test_add_func("/bdrv-drain/quiesce/drain_subtree",
 test_quiesce_drain_subtree);
 
-// XXX bdrv_drain_all() doesn't work in coroutine context
+g_test_add_func("/bdrv-drain/quiesce/co/drain_all",
+test_quiesce_co_drain_all);
 g_test_add_func("/bdrv-drain/quiesce/co/drain", test_quiesce_co_drain);
 g_test_add_func("/bdrv-drain/quiesce/co/drain_subtree",
 test_quiesce_co_drain_subtree);
-- 
2.13.6




[Qemu-block] [PATCH 01/19] test-bdrv-drain: bdrv_drain() works with cross-AioContext events

2018-04-11 Thread Kevin Wolf
As long as nobody keeps the other I/O thread from working, there is no
reason why bdrv_drain() wouldn't work with cross-AioContext events. The
key is that the root request we're waiting for is in the AioContext
we're polling (which it always is for bdrv_drain()) so that aio_poll()
is woken up in the end.

Add a test case that shows that it works. Remove the comment in
bdrv_drain() that claims otherwise.

Signed-off-by: Kevin Wolf 
---
 block/io.c  |   4 --
 tests/test-bdrv-drain.c | 187 +++-
 2 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index bd9a19a9c4..28e71221b0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -369,10 +369,6 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
  *
  * Note that unlike bdrv_drain_all(), the caller must hold the BlockDriverState
  * AioContext.
- *
- * Only this BlockDriverState's AioContext is run, so in-flight requests must
- * not depend on events in other AioContexts.  In that case, use
- * bdrv_drain_all() instead.
  */
 void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
 {
diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 7673de1062..29634102d8 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -27,9 +27,13 @@
 #include "block/blockjob_int.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
+#include "iothread.h"
+
+static QemuEvent done_event;
 
 typedef struct BDRVTestState {
 int drain_count;
+AioContext *bh_indirection_ctx;
 } BDRVTestState;
 
 static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
@@ -50,16 +54,29 @@ static void bdrv_test_close(BlockDriverState *bs)
 g_assert_cmpint(s->drain_count, >, 0);
 }
 
+static void co_reenter_bh(void *opaque)
+{
+aio_co_wake(opaque);
+}
+
 static int coroutine_fn bdrv_test_co_preadv(BlockDriverState *bs,
 uint64_t offset, uint64_t bytes,
 QEMUIOVector *qiov, int flags)
 {
+BDRVTestState *s = bs->opaque;
+
 /* We want this request to stay until the polling loop in drain waits for
  * it to complete. We need to sleep a while as bdrv_drain_invoke() comes
  * first and polls its result, too, but it shouldn't accidentally complete
  * this request yet. */
 qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10);
 
+if (s->bh_indirection_ctx) {
+aio_bh_schedule_oneshot(s->bh_indirection_ctx, co_reenter_bh,
+qemu_coroutine_self());
+qemu_coroutine_yield();
+}
+
 return 0;
 }
 
@@ -490,6 +507,164 @@ static void test_graph_change(void)
 blk_unref(blk_b);
 }
 
+struct test_iothread_data {
+BlockDriverState *bs;
+enum drain_type drain_type;
+int *aio_ret;
+};
+
+static void test_iothread_drain_entry(void *opaque)
+{
+struct test_iothread_data *data = opaque;
+
+aio_context_acquire(bdrv_get_aio_context(data->bs));
+do_drain_begin(data->drain_type, data->bs);
+g_assert_cmpint(*data->aio_ret, ==, 0);
+do_drain_end(data->drain_type, data->bs);
+aio_context_release(bdrv_get_aio_context(data->bs));
+
+qemu_event_set(&done_event);
+}
+
+static void test_iothread_aio_cb(void *opaque, int ret)
+{
+int *aio_ret = opaque;
+*aio_ret = ret;
+qemu_event_set(&done_event);
+}
+
+/*
+ * Starts an AIO request on a BDS that runs in the AioContext of iothread 1.
+ * The request involves a BH on iothread 2 before it can complete.
+ *
+ * @drain_thread = 0 means that do_drain_begin/end are called from the main
+ * thread, @drain_thread = 1 means that they are called from iothread 1. Drain
+ * for this BDS cannot be called from iothread 2 because only the main thread
+ * may do cross-AioContext polling.
+ */
+static void test_iothread_common(enum drain_type drain_type, int drain_thread)
+{
+BlockBackend *blk;
+BlockDriverState *bs;
+BDRVTestState *s;
+BlockAIOCB *acb;
+int aio_ret;
+struct test_iothread_data data;
+
+IOThread *a = iothread_new();
+IOThread *b = iothread_new();
+AioContext *ctx_a = iothread_get_aio_context(a);
+AioContext *ctx_b = iothread_get_aio_context(b);
+
+QEMUIOVector qiov;
+struct iovec iov = {
+.iov_base = NULL,
+.iov_len = 0,
+};
+qemu_iovec_init_external(&qiov, &iov, 1);
+
+/* bdrv_drain_all() may only be called from the main loop thread */
+if (drain_type == BDRV_DRAIN_ALL && drain_thread != 0) {
+goto out;
+}
+
+blk = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
+bs = bdrv_new_open_driver(&bdrv_test, "test-node", BDRV_O_RDWR,
+  &error_abort);
+s = bs->opaque;
+blk_insert_bs(blk, bs, &error_abort);
+
+blk_set_aio_context(blk, ctx_a);
+aio_context_acquire(ctx_a);
+
+s->bh_indirection_ctx = ctx_b;
+
+aio_ret = -EINPROGRESS;
+if (drain_thread == 0) {
+acb = 

[Qemu-block] [PATCH 08/19] block: Remove bdrv_drain_recurse()

2018-04-11 Thread Kevin Wolf
For bdrv_drain(), recursively waiting for child node requests is
pointless because we didn't quiesce their parents, so new requests could
come in anyway. Letting the function work only on a single node makes it
more consistent.

For subtree drains and drain_all, we already have the recursion in
bdrv_do_drained_begin(), so the extra recursion doesn't add anything
either.

Remove the useless code.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 36 +++-
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/block/io.c b/block/io.c
index 0a778eeff4..f24f39c278 100644
--- a/block/io.c
+++ b/block/io.c
@@ -211,38 +211,6 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
    return atomic_read(&bs->in_flight);
 }
 
-static bool bdrv_drain_recurse(BlockDriverState *bs)
-{
-BdrvChild *child, *tmp;
-bool waited;
-
-/* Wait for drained requests to finish */
-waited = BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, true));
-
-QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
-BlockDriverState *bs = child->bs;
-bool in_main_loop =
-qemu_get_current_aio_context() == qemu_get_aio_context();
-assert(bs->refcnt > 0);
-if (in_main_loop) {
-/* In case the recursive bdrv_drain_recurse processes a
- * block_job_defer_to_main_loop BH and modifies the graph,
- * let's hold a reference to bs until we are done.
- *
- * IOThread doesn't have such a BH, and it is not safe to call
- * bdrv_unref without BQL, so skip doing it there.
- */
-bdrv_ref(bs);
-}
-waited |= bdrv_drain_recurse(bs);
-if (in_main_loop) {
-bdrv_unref(bs);
-}
-}
-
-return waited;
-}
-
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
   BdrvChild *parent);
 static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
@@ -310,7 +278,9 @@ void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
 
 bdrv_parent_drained_begin(bs, parent);
 bdrv_drain_invoke(bs, true);
-bdrv_drain_recurse(bs);
+
+/* Wait for drained requests to finish */
+BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, true));
 
 if (recursive) {
 bs->recursive_quiesce_counter++;
-- 
2.13.6




[Qemu-block] [PATCH 04/19] block: Don't manually poll in bdrv_drain_all()

2018-04-11 Thread Kevin Wolf
All involved nodes are already idle; we have called bdrv_do_drained_begin() on
them.

The comment in the code suggested that this was not correct because the
completion of a request on one node could spawn a new request on a
different node (which might have been drained before, so we wouldn't
drain the new request). In reality, new requests to different nodes
aren't spawned out of nothing, but only in the context of a parent
request, and they aren't submitted to random nodes, but only to child
nodes. As long as we still poll for the completion of the parent request
(which we do), draining each root node separately is good enough.

Remove the additional polling code from bdrv_drain_all_begin() and
replace it with an assertion that all nodes are already idle after we
drained them separately.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 41 -
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/block/io.c b/block/io.c
index d2bd89c3bb..ea6f9f023a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -376,6 +376,16 @@ void bdrv_drain(BlockDriverState *bs)
 bdrv_drained_end(bs);
 }
 
+static void bdrv_drain_assert_idle(BlockDriverState *bs)
+{
+BdrvChild *child, *next;
+
+assert(atomic_read(&bs->in_flight) == 0);
+QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
+bdrv_drain_assert_idle(child->bs);
+}
+}
+
 /*
  * Wait for pending requests to complete across all BlockDriverStates
  *
@@ -390,11 +400,8 @@ void bdrv_drain(BlockDriverState *bs)
  */
 void bdrv_drain_all_begin(void)
 {
-/* Always run first iteration so any pending completion BHs run */
-bool waited = true;
 BlockDriverState *bs;
 BdrvNextIterator it;
-GSList *aio_ctxs = NULL, *ctx;
 
 /* BDRV_POLL_WHILE() for a node can only be called from its own I/O thread
  * or the main loop AioContext. We potentially use BDRV_POLL_WHILE() on
@@ -408,35 +415,11 @@ void bdrv_drain_all_begin(void)
 aio_context_acquire(aio_context);
 bdrv_do_drained_begin(bs, true, NULL);
 aio_context_release(aio_context);
-
-if (!g_slist_find(aio_ctxs, aio_context)) {
-aio_ctxs = g_slist_prepend(aio_ctxs, aio_context);
-}
 }
 
-/* Note that completion of an asynchronous I/O operation can trigger any
- * number of other I/O operations on other devices---for example a
- * coroutine can submit an I/O request to another device in response to
- * request completion.  Therefore we must keep looping until there was no
- * more activity rather than simply draining each device independently.
- */
-while (waited) {
-waited = false;
-
-for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
-AioContext *aio_context = ctx->data;
-
-aio_context_acquire(aio_context);
-for (bs = bdrv_first(); bs; bs = bdrv_next()) {
-if (aio_context == bdrv_get_aio_context(bs)) {
-waited |= bdrv_drain_recurse(bs);
-}
-}
-aio_context_release(aio_context);
-}
+for (bs = bdrv_first(); bs; bs = bdrv_next()) {
+bdrv_drain_assert_idle(bs);
 }
-
-g_slist_free(aio_ctxs);
 }
 
 void bdrv_drain_all_end(void)
-- 
2.13.6




[Qemu-block] [PATCH 03/19] block: Remove 'recursive' parameter from bdrv_drain_invoke()

2018-04-11 Thread Kevin Wolf
All callers pass false for the 'recursive' parameter now. Remove it.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index cad59db2f4..d2bd89c3bb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -167,9 +167,8 @@ static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
 }
 
 /* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */
-static void bdrv_drain_invoke(BlockDriverState *bs, bool begin, bool recursive)
+static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
-BdrvChild *child, *tmp;
 BdrvCoDrainData data = { .bs = bs, .done = false, .begin = begin};
 
 if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
@@ -180,12 +179,6 @@ static void bdrv_drain_invoke(BlockDriverState *bs, bool begin, bool recursive)
    data.co = qemu_coroutine_create(bdrv_drain_invoke_entry, &data);
 bdrv_coroutine_enter(bs, data.co);
 BDRV_POLL_WHILE(bs, !data.done);
-
-if (recursive) {
-QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
-bdrv_drain_invoke(child->bs, begin, true);
-}
-}
 }
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
@@ -286,7 +279,7 @@ void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
 }
 
 bdrv_parent_drained_begin(bs, parent);
-bdrv_drain_invoke(bs, true, false);
+bdrv_drain_invoke(bs, true);
 bdrv_drain_recurse(bs);
 
 if (recursive) {
@@ -321,7 +314,7 @@ void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
    old_quiesce_counter = atomic_fetch_dec(&bs->quiesce_counter);
 
 /* Re-enable things in child-to-parent order */
-bdrv_drain_invoke(bs, false, false);
+bdrv_drain_invoke(bs, false);
 bdrv_parent_drained_end(bs, parent);
 if (old_quiesce_counter == 1) {
 aio_enable_external(bdrv_get_aio_context(bs));
-- 
2.13.6




Re: [Qemu-block] [PATCH v2 1/2] qcow2: try load bitmaps only once

2018-04-11 Thread Max Reitz
On 2018-04-11 14:26, Vladimir Sementsov-Ogievskiy wrote:
> Checking reopen by existence of some bitmaps is wrong, as it may be
> some other bitmaps, or on the other hand, user may remove bitmaps. This
> criteria is bad. To simplify things and make behavior more predictable
> let's just add a flag to remember, that we've already tried to load
> bitmaps on open and do not want do it again.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h |  1 +
>  block/qcow2.c | 16 
>  2 files changed, 9 insertions(+), 8 deletions(-)

Reviewed-by: Max Reitz 



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Max Reitz
On 2018-04-11 15:05, Vladimir Sementsov-Ogievskiy wrote:

[...]

> Hmm, first type? I'm now not sure about, did I really see sha256
> mismatch, or something like this (should be error, but found bitmap):
> 
> --- /work/src/qemu/up-169/tests/qemu-iotests/169.out    2018-04-11
> 15:35:10.055027392 +0300
> +++ /work/src/qemu/up-169/tests/qemu-iotests/169.out.bad 2018-04-11
> 15:58:09.300450045 +0300
> @@ -1,5 +1,20 @@
> -
> +F...
> +==
> +FAIL: test__not_persistent__migbitmap__offline
> (__main__.TestDirtyBitmapMigration)
> +methodcaller(name, ...) --> methodcaller object
> +--
> +Traceback (most recent call last):
> +  File "169", line 136, in do_test_migration
> +    self.check_bitmap(self.vm_b, sha256 if persistent else False)
> +  File "169", line 77, in check_bitmap
> +    "Dirty bitmap 'bitmap0' not found");
> +  File "/work/src/qemu/up-169/tests/qemu-iotests/iotests.py", line 389,
> in assert_qmp
> +    result = self.dictpath(d, path)
> +  File "/work/src/qemu/up-169/tests/qemu-iotests/iotests.py", line 348,
> in dictpath
> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
> +AssertionError: failed path traversal for "error/desc" in "{u'return':
> {u'sha256':
> u'01d2ebedcb8f549a2547dbf8e231c410e3e747a9479e98909fc936e0035cf8b1'}}"
> 
> 
> Max, did you really seed sha256 mismatch or only something like this?

I'm pretty sure I did see mismatches.

Max



Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] qcow2: try load bitmaps only once

2018-04-11 Thread Vladimir Sementsov-Ogievskiy

11.04.2018 16:22, Eric Blake wrote:

On 04/11/2018 07:26 AM, Vladimir Sementsov-Ogievskiy wrote:

Checking reopen by existence of some bitmaps is wrong, as it may be
some other bitmaps, or on the other hand, user may remove bitmaps. This
criteria is bad. To simplify things and make behavior more predictable
let's just add a flag to remember, that we've already tried to load
bitmaps on open and do not want do it again.

Wording suggestion:

Checking whether we are reopening an image based on the existence of
some bitmaps is wrong, as the set of bitmaps may have changed (for
example, the user may have added or removed bitmaps in the meantime).
To simplify things and make behavior more predictable, let's just add a
flag to remember whether we've already tried to load bitmaps, so that we
only attempt it on the initial open.


You've dropped an idea suggested by Max about migration-related bitmaps.
I said "some other bitmaps" because now I doubt whether it is possible: we
have migration-related bitmaps on the source, not on the target.

Anyway, I'm OK with all your suggestions, thank you. I can respin, or
they may be squashed in while the patches are in flight.
they may be squashed-in inflight.





Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/qcow2.h |  1 +
  block/qcow2.c | 16 
  2 files changed, 9 insertions(+), 8 deletions(-)

@@ -1480,10 +1481,9 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
  s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
  }
  
-if (bdrv_dirty_bitmap_next(bs, NULL)) {

-/* It's some kind of reopen with already existing dirty bitmaps. There
- * are no known cases where we need loading bitmaps in such situation,
- * so it's safer don't load them.
+if (s->dirty_bitmaps_loaded) {
+/* It's some kind of reopen. There are no known cases where we need
+ * loading bitmaps in such situation, so it's safer don't load them.

Pre-existing wording, but sounds better as:

There are no known cases where we need to reload bitmaps in such a
situation, so it's safer to skip them.


   *
   * Moreover, if we have some readonly bitmaps and we are reopening for
   * rw we should reopen bitmaps correspondingly.
@@ -1491,13 +1491,13 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
  if (bdrv_has_readonly_bitmaps(bs) &&
  !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
  {
-bool header_updated = false;
  qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
-update_header = update_header && !header_updated;
  }
-} else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
-update_header = false;
+} else {
+header_updated = qcow2_load_dirty_bitmaps(bs, &local_err);
+s->dirty_bitmaps_loaded = true;
  }
+update_header = update_header && !header_updated;

Could write this as 'update_header &= !header_updated;'




--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Vladimir Sementsov-Ogievskiy

11.04.2018 16:05, Vladimir Sementsov-Ogievskiy wrote:

11.04.2018 12:36, Vladimir Sementsov-Ogievskiy wrote:

11.04.2018 12:02, Vladimir Sementsov-Ogievskiy wrote:

03.04.2018 23:13, John Snow wrote:


On 04/03/2018 12:23 PM, Max Reitz wrote:

On 2018-03-30 18:10, Vladimir Sementsov-Ogievskiy wrote:

Use MIGRATION events instead of RESUME. Also, make a TODO: enable
dirty-bitmaps capability for offline case.

This (likely) fixes racy faults at least of the following types:

 - timeout on waiting for RESUME event
 - sha256 mismatch on 136 (138 after this patch)

Signed-off-by: Vladimir Sementsov-Ogievskiy 


---

This patch is a true change for the test anyway. But I don't 
understand,
why (and do really) it fixes the things. And I'm not sure about 
do we
really have a bug in bitmap migration or persistence. So, it's up 
to you,

take it into 2.12...

It was already discussed, that "STOP" event is bad for tests. 
What about
"RESUME"? How can we miss it? And sha256 mismatch is really 
something

strange.

Max, please check, do it fix 169 for you.

  tests/qemu-iotests/169 | 44 
+++-

  1 file changed, 23 insertions(+), 21 deletions(-)
This makes the test pass (thanks!), but it still leaves behind 
five cats...


Max



Hmm:

jhuston  14772  0.0  0.0   4296   784 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14796  0.0  0.0   4296   764 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14940  0.0  0.0   4296   788 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14964  0.0  0.0   4296   720 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  15052  0.0  0.0   4296   768 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file

Why do these get left behind? Nothing to consume the data...?


aha, understand. it is due to last vm_b.shutdown() and vm_b.launch 
in case of should_migrate. So, at the end of the test I restart vm_b 
with -incoming parameter. But it looks like a bug anyway, If we 
start qemu with -incoming "exec", should not we kill cat process, if 
there were no migration?




third type of fail, without this patch:

+==
+ERROR: test__persistent__migbitmap__offline_shared 
(__main__.TestDirtyBitmapMigration)

+methodcaller(name, ...) --> methodcaller object
+--
+Traceback (most recent call last):
+  File "169", line 135, in do_test_migration
+    self.vm_b.launch()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", 
line 221, in launch

+    self._launch()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", 
line 244, in _launch

+    self._post_launch()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qtest.py", 
line 100, in _post_launch

+    super(QEMUQtestMachine, self)._post_launch()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", 
line 196, in _post_launch

+    self._qmp.accept()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
line 157, in accept

+    return self.__negotiate_capabilities()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
line 75, in __negotiate_capabilities

+    resp = self.cmd('qmp_capabilities')
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
line 191, in cmd

+    return self.cmd_obj(qmp_cmd)
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
line 174, in cmd_obj

+    resp = self.__json_read()
+  File 
"/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
line 82, in __json_read

+    data = self.__sockfile.readline()
+  File "/usr/lib64/python2.7/socket.py", line 447, in readline
+    data = self._sock.recv(self._rbufsize)
+error: [Errno 104] Connection reset by peer
+




Hmm, first type? I'm now not sure about, did I really see sha256 
mismatch, or something like this (should be error, but found bitmap):


--- /work/src/qemu/up-169/tests/qemu-iotests/169.out    2018-04-11 
15:35:10.055027392 +0300
+++ /work/src/qemu/up-169/tests/qemu-iotests/169.out.bad 2018-04-11 
15:58:09.300450045 +0300

@@ -1,5 +1,20 @@
-
+F...
+==
+FAIL: test__not_persistent__migbitmap__offline 
(__main__.TestDirtyBitmapMigration)

+methodcaller(name, ...) --> methodcaller object
+--
+Traceback (most recent call last):
+  File "169", line 136, in do_test_migration
+    self.check_bitmap(self.vm_b, sha256 if persistent else False)
+  File "169", 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] qcow2: try load bitmaps only once

2018-04-11 Thread Eric Blake
On 04/11/2018 07:26 AM, Vladimir Sementsov-Ogievskiy wrote:
> Checking reopen by existence of some bitmaps is wrong, as it may be
> some other bitmaps, or on the other hand, user may remove bitmaps. This
> criteria is bad. To simplify things and make behavior more predictable
> let's just add a flag to remember, that we've already tried to load
> bitmaps on open and do not want do it again.

Wording suggestion:

Checking whether we are reopening an image based on the existence of
some bitmaps is wrong, as the set of bitmaps may have changed (for
example, the user may have added or removed bitmaps in the meantime).
To simplify things and make behavior more predictable, let's just add a
flag to remember whether we've already tried to load bitmaps, so that we
only attempt it on the initial open.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h |  1 +
>  block/qcow2.c | 16 
>  2 files changed, 9 insertions(+), 8 deletions(-)
> 
> @@ -1480,10 +1481,9 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
> *bs, QDict *options,
>  s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
>  }
>  
> -if (bdrv_dirty_bitmap_next(bs, NULL)) {
> -/* It's some kind of reopen with already existing dirty bitmaps. 
> There
> - * are no known cases where we need loading bitmaps in such 
> situation,
> - * so it's safer don't load them.
> +if (s->dirty_bitmaps_loaded) {
> +/* It's some kind of reopen. There are no known cases where we need
> + * loading bitmaps in such situation, so it's safer don't load them.

Pre-existing wording, but sounds better as:

There are no known cases where we need to reload bitmaps in such a
situation, so it's safer to skip them.

>   *
>   * Moreover, if we have some readonly bitmaps and we are reopening 
> for
>   * rw we should reopen bitmaps correspondingly.
> @@ -1491,13 +1491,13 @@ static int coroutine_fn 
> qcow2_do_open(BlockDriverState *bs, QDict *options,
>  if (bdrv_has_readonly_bitmaps(bs) &&
>  !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & 
> BDRV_O_INACTIVE))
>  {
> -bool header_updated = false;
>  qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
> -update_header = update_header && !header_updated;
>  }
> -} else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
> -update_header = false;
> +} else {
> +header_updated = qcow2_load_dirty_bitmaps(bs, &local_err);
> +s->dirty_bitmaps_loaded = true;
>  }
> +update_header = update_header && !header_updated;

Could write this as 'update_header &= !header_updated;'

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Vladimir Sementsov-Ogievskiy

11.04.2018 12:36, Vladimir Sementsov-Ogievskiy wrote:

11.04.2018 12:02, Vladimir Sementsov-Ogievskiy wrote:

03.04.2018 23:13, John Snow wrote:


On 04/03/2018 12:23 PM, Max Reitz wrote:

On 2018-03-30 18:10, Vladimir Sementsov-Ogievskiy wrote:

Use MIGRATION events instead of RESUME. Also, make a TODO: enable
dirty-bitmaps capability for offline case.

This (likely) fixes racy faults at least of the following types:

 - timeout on waiting for RESUME event
 - sha256 mismatch on 136 (138 after this patch)

Signed-off-by: Vladimir Sementsov-Ogievskiy 


---

This patch is a true change for the test anyway. But I don't 
understand,

why (and do really) it fixes the things. And I'm not sure about do we
really have a bug in bitmap migration or persistence. So, it's up 
to you,

take it into 2.12...

It was already discussed, that "STOP" event is bad for tests. What 
about

"RESUME"? How can we miss it? And sha256 mismatch is really something
strange.

Max, please check, do it fix 169 for you.

  tests/qemu-iotests/169 | 44 
+++-

  1 file changed, 23 insertions(+), 21 deletions(-)
This makes the test pass (thanks!), but it still leaves behind five 
cats...


Max



Hmm:

jhuston  14772  0.0  0.0   4296   784 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14796  0.0  0.0   4296   764 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14940  0.0  0.0   4296   788 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14964  0.0  0.0   4296   720 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  15052  0.0  0.0   4296   768 pts/3    S    16:12 0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file

Why do these get left behind? Nothing to consume the data...?


aha, understand. it is due to last vm_b.shutdown() and vm_b.launch in 
case of should_migrate. So, at the end of the test I restart vm_b 
with -incoming parameter. But it looks like a bug anyway, If we start 
qemu with -incoming "exec", should not we kill cat process, if there 
were no migration?




third type of fail, without this patch:

+==
+ERROR: test__persistent__migbitmap__offline_shared 
(__main__.TestDirtyBitmapMigration)

+methodcaller(name, ...) --> methodcaller object
+--
+Traceback (most recent call last):
+  File "169", line 135, in do_test_migration
+    self.vm_b.launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 221, in launch
+    self._launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 244, in _launch
+    self._post_launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qtest.py", line 100, in _post_launch
+    super(QEMUQtestMachine, self)._post_launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 196, in _post_launch
+    self._qmp.accept()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 157, in accept
+    return self.__negotiate_capabilities()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 75, in __negotiate_capabilities
+    resp = self.cmd('qmp_capabilities')
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in cmd_obj
+    resp = self.__json_read()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in __json_read
+    data = self.__sockfile.readline()
+  File "/usr/lib64/python2.7/socket.py", line 447, in readline
+    data = self._sock.recv(self._rbufsize)
+error: [Errno 104] Connection reset by peer
+




Hmm, the first type? I'm now not sure whether I really saw a sha256
mismatch, or something like this (it should be an error, but the bitmap was found):


--- /work/src/qemu/up-169/tests/qemu-iotests/169.out    2018-04-11 15:35:10.055027392 +0300
+++ /work/src/qemu/up-169/tests/qemu-iotests/169.out.bad 2018-04-11 15:58:09.300450045 +0300

@@ -1,5 +1,20 @@
-
+F...
+======================================================================
+FAIL: test__not_persistent__migbitmap__offline (__main__.TestDirtyBitmapMigration)
+methodcaller(name, ...) --> methodcaller object
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "169", line 136, in do_test_migration
+    self.check_bitmap(self.vm_b, sha256 if persistent else False)
+  File "169", line 77, in check_bitmap
+    "Dirty bitmap 'bitmap0' 

[Qemu-block] [PATCH v2 1/2] qcow2: try load bitmaps only once

2018-04-11 Thread Vladimir Sementsov-Ogievskiy
Detecting a reopen by the existence of some bitmaps is wrong: they may be
other bitmaps, or, on the other hand, the user may have removed bitmaps.
This criterion is unreliable. To simplify things and make the behavior more
predictable, let's just add a flag to remember that we have already tried
to load bitmaps on open and do not want to do it again.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |  1 +
 block/qcow2.c | 16 
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index d301f77cea..adf5c3950f 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -298,6 +298,7 @@ typedef struct BDRVQcow2State {
 uint32_t nb_bitmaps;
 uint64_t bitmap_directory_size;
 uint64_t bitmap_directory_offset;
+bool dirty_bitmaps_loaded;
 
 int flags;
 int qcow_version;
diff --git a/block/qcow2.c b/block/qcow2.c
index 486f3e83b7..4242e99f1e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1142,6 +1142,7 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 uint64_t ext_end;
 uint64_t l1_vm_state_index;
 bool update_header = false;
+bool header_updated = false;
 
 ret = bdrv_pread(bs->file, 0, &header, sizeof(header));
 if (ret < 0) {
@@ -1480,10 +1481,9 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
 }
 
-if (bdrv_dirty_bitmap_next(bs, NULL)) {
-/* It's some kind of reopen with already existing dirty bitmaps. There
- * are no known cases where we need loading bitmaps in such situation,
- * so it's safer don't load them.
+if (s->dirty_bitmaps_loaded) {
+/* It's some kind of reopen. There are no known cases where we need
+ * loading bitmaps in such situation, so it's safer don't load them.
  *
  * Moreover, if we have some readonly bitmaps and we are reopening for
  * rw we should reopen bitmaps correspondingly.
@@ -1491,13 +1491,13 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 if (bdrv_has_readonly_bitmaps(bs) &&
 !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
 {
-bool header_updated = false;
 qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
-update_header = update_header && !header_updated;
 }
-} else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
-update_header = false;
+} else {
+header_updated = qcow2_load_dirty_bitmaps(bs, &local_err);
+s->dirty_bitmaps_loaded = true;
 }
+update_header = update_header && !header_updated;
 if (local_err != NULL) {
 error_propagate(errp, local_err);
 ret = -EINVAL;
-- 
2.11.1
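The control flow this patch establishes (load bitmaps on the first open only, skip the load on any reopen) can be sketched in isolation. The names below are hypothetical stand-ins, not the actual qcow2 API:

```python
class Qcow2StateSketch:
    """Stand-in for BDRVQcow2State: only the new flag plus a counter
    that represents calls to qcow2_load_dirty_bitmaps()."""

    def __init__(self):
        self.dirty_bitmaps_loaded = False
        self.load_count = 0

    def do_open(self):
        # First open loads bitmaps; any reopen skips the load, regardless
        # of which bitmaps currently exist (the old, unreliable criterion).
        if self.dirty_bitmaps_loaded:
            return
        self.load_count += 1          # stands in for qcow2_load_dirty_bitmaps()
        self.dirty_bitmaps_loaded = True

s = Qcow2StateSketch()
s.do_open()   # initial open: loads bitmaps
s.do_open()   # reopen: must not load again
s.do_open()
assert s.load_count == 1
```

The point of the flag is exactly this idempotence: the decision no longer depends on which bitmaps happen to exist at reopen time.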




[Qemu-block] [PATCH v2 2/2] iotests: fix 169

2018-04-11 Thread Vladimir Sementsov-Ogievskiy
Improve and fix 169:
- use MIGRATION events instead of RESUME
- make a TODO: enable dirty-bitmaps capability for offline case
- recreate vm_b without -incoming near test end

This (likely) fixes racy failures of at least the following types:

- timeout waiting for the RESUME event
- sha256 mismatch on line 136 (142 after this patch)
- failure of self.vm_b.launch() on line 135 (141 after this patch)

It also definitely fixes the cat processes left behind after the test finishes.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/169 | 48 +++-
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/tests/qemu-iotests/169 b/tests/qemu-iotests/169
index 153b10b6e7..f243db9955 100755
--- a/tests/qemu-iotests/169
+++ b/tests/qemu-iotests/169
@@ -31,6 +31,8 @@ disk_a = os.path.join(iotests.test_dir, 'disk_a')
 disk_b = os.path.join(iotests.test_dir, 'disk_b')
 size = '1M'
 mig_file = os.path.join(iotests.test_dir, 'mig_file')
+mig_cmd = 'exec: cat > ' + mig_file
+incoming_cmd = 'exec: cat ' + mig_file
 
 
 class TestDirtyBitmapMigration(iotests.QMPTestCase):
@@ -49,7 +51,6 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 self.vm_a.launch()
 
 self.vm_b = iotests.VM(path_suffix='b')
-self.vm_b.add_incoming("exec: cat '" + mig_file + "'")
 
 def add_bitmap(self, vm, granularity, persistent):
 params = {'node': 'drive0',
@@ -86,36 +87,30 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
(0xa0201, 0x1000))
 
 should_migrate = migrate_bitmaps or persistent and shared_storage
+mig_caps = [{'capability': 'events', 'state': True}]
+if migrate_bitmaps:
+mig_caps.append({'capability': 'dirty-bitmaps', 'state': True})
 
+result = self.vm_a.qmp('migrate-set-capabilities',
+   capabilities=mig_caps)
+self.assert_qmp(result, 'return', {})
+
+self.vm_b.add_incoming(incoming_cmd if online else "defer")
 self.vm_b.add_drive(disk_a if shared_storage else disk_b)
 
 if online:
 os.mkfifo(mig_file)
 self.vm_b.launch()
+result = self.vm_b.qmp('migrate-set-capabilities',
+   capabilities=mig_caps)
+self.assert_qmp(result, 'return', {})
 
 self.add_bitmap(self.vm_a, granularity, persistent)
 for r in regions:
 self.vm_a.hmp_qemu_io('drive0', 'write %d %d' % r)
 sha256 = self.get_bitmap_hash(self.vm_a)
 
-if migrate_bitmaps:
-capabilities = [{'capability': 'dirty-bitmaps', 'state': True}]
-
-result = self.vm_a.qmp('migrate-set-capabilities',
-   capabilities=capabilities)
-self.assert_qmp(result, 'return', {})
-
-if online:
-result = self.vm_b.qmp('migrate-set-capabilities',
-   capabilities=capabilities)
-self.assert_qmp(result, 'return', {})
-
-result = self.vm_a.qmp('migrate-set-capabilities',
-   capabilities=[{'capability': 'events',
-  'state': True}])
-self.assert_qmp(result, 'return', {})
-
-result = self.vm_a.qmp('migrate', uri='exec:cat>' + mig_file)
+result = self.vm_a.qmp('migrate', uri=mig_cmd)
 while True:
 event = self.vm_a.event_wait('MIGRATION')
 if event['data']['status'] == 'completed':
@@ -124,14 +119,25 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 if not online:
 self.vm_a.shutdown()
 self.vm_b.launch()
-# TODO enable bitmap capability for vm_b in this case
+result = self.vm_b.qmp('migrate-set-capabilities',
+   capabilities=mig_caps)
+self.assert_qmp(result, 'return', {})
+result = self.vm_b.qmp('migrate-incoming', uri=incoming_cmd)
+self.assert_qmp(result, 'return', {})
 
-self.vm_b.event_wait("RESUME", timeout=10.0)
+while True:
+event = self.vm_b.event_wait('MIGRATION')
+if event['data']['status'] == 'completed':
+break
 
 self.check_bitmap(self.vm_b, sha256 if should_migrate else False)
 
 if should_migrate:
 self.vm_b.shutdown()
+# recreate vm_b, as we don't want -incoming option (this will lead
+# to "cat" process left alive after test finish)
+self.vm_b = iotests.VM(path_suffix='b')
+self.vm_b.add_drive(disk_a if shared_storage else disk_b)
 self.vm_b.launch()
 self.check_bitmap(self.vm_b, sha256 if persistent else False)
 
-- 
2.11.1
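For context, the 'exec:' migration URIs used by this test simply pipe the migration stream through a shell command; the FIFO plumbing can be reproduced standalone (paths below are placeholders, not the test's scratch directory):

```python
import os
import tempfile
import threading

# Placeholder path standing in for the test's scratch/mig_file.
mig_file = os.path.join(tempfile.mkdtemp(), 'mig_file')
os.mkfifo(mig_file)

def source_side():
    # Equivalent of mig_cmd = 'exec: cat > mig_file': the source writes
    # the migration stream into the FIFO (open blocks until a reader opens).
    with open(mig_file, 'wb') as f:
        f.write(b'migration-stream')

writer = threading.Thread(target=source_side)
writer.start()

# Equivalent of incoming_cmd = 'exec: cat mig_file': the destination
# reads the stream back out of the FIFO.
with open(mig_file, 'rb') as f:
    data = f.read()
writer.join()

assert data == b'migration-stream'
os.remove(mig_file)
```

If the reading side never opens the FIFO (i.e. migrate-incoming is never run), the writer blocks forever; that is exactly the leftover-cat symptom discussed in this thread.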




[Qemu-block] [PATCH v2 0/2] bitmaps persistent and migration fixes

2018-04-11 Thread Vladimir Sementsov-Ogievskiy
v2:

01: new, proposed by Max
02: fix leaving cat processes

Vladimir Sementsov-Ogievskiy (2):
  qcow2: try load bitmaps only once
  iotests: fix 169

 block/qcow2.h  |  1 +
 block/qcow2.c  | 16 
 tests/qemu-iotests/169 | 48 +++-
 3 files changed, 36 insertions(+), 29 deletions(-)

-- 
2.11.1




Re: [Qemu-block] [PATCH] iotests: Split 214 off of 122

2018-04-11 Thread Max Reitz
On 2018-04-06 18:41, Max Reitz wrote:
> Commit abd3622cc03cf41ed542126a540385f30a4c0175 added a case to 122
> regarding how the qcow2 driver handles an incorrect compressed data
> length value.  This does not really fit into 122, as that file is
> supposed to contain qemu-img convert test cases, which this case is not.
> So this patch splits it off into its own file; maybe we will even get
> more qcow2-only compression tests in the future.
> 
> Also, that test case does not work with refcount_bits=1, so mark that
> option as unsupported.
> 
> Signed-off-by: Max Reitz 
> ---
> Kind of a v2 for "iotests: 122 needs at least two refcount bits now"
> (fulfills the same purpose, but also splits the case into its own file
>  so you can still run 122 with refcount_bits=1 [Eric]).
> 
> I was a bit lost what to do about the copyright text, since this test
> case was written by Berto.  I figured I'd drop the "owner" variable (it
> isn't used anyway), but I put "Red Hat" into the copyright line --
> currently every test has copyright information, so I decided it'd be
> difficult to leave that out, and I figured I simply cannot claim
> copyright for Igalia.  So, here we go.
> ---
>  tests/qemu-iotests/122 | 47 ---
>  tests/qemu-iotests/122.out | 33 
>  tests/qemu-iotests/214 | 96 
> ++
>  tests/qemu-iotests/214.out | 35 +
>  tests/qemu-iotests/group   |  1 +
>  5 files changed, 132 insertions(+), 80 deletions(-)
>  create mode 100755 tests/qemu-iotests/214
>  create mode 100644 tests/qemu-iotests/214.out

Changed the copyright information, added Berto's S-o-b (and Eric's R-b)
and applied to my block-next branch.

Max



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Max Reitz
On 2018-04-11 11:02, Vladimir Sementsov-Ogievskiy wrote:
> 03.04.2018 23:13, John Snow wrote:
>>
>> On 04/03/2018 12:23 PM, Max Reitz wrote:
>>> On 2018-03-30 18:10, Vladimir Sementsov-Ogievskiy wrote:
 Use MIGRATION events instead of RESUME. Also, make a TODO: enable
 dirty-bitmaps capability for offline case.

 This (likely) fixes racy failures of at least the following types:

  - timeout on waiting for RESUME event
  - sha256 mismatch on 136 (138 after this patch)

 Signed-off-by: Vladimir Sementsov-Ogievskiy 
 ---

 This patch is a true change for the test anyway. But I don't understand
 why (and whether) it really fixes things. And I'm not sure whether we
 really have a bug in bitmap migration or persistence. So it's up to you
 whether to take it into 2.12...

 It was already discussed that the "STOP" event is bad for tests. What
 about "RESUME"? How can we miss it? And the sha256 mismatch is really
 something strange.

 Max, please check whether this fixes 169 for you.

   tests/qemu-iotests/169 | 44 +++-
   1 file changed, 23 insertions(+), 21 deletions(-)
>>> This makes the test pass (thanks!), but it still leaves behind five
>>> cats...
>>>
>>> Max
>>>
>>>
>> Hmm:
>>
>> jhuston  14772  0.0  0.0   4296   784 pts/3    S    16:12   0:00 cat
>> /home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
>> jhuston  14796  0.0  0.0   4296   764 pts/3    S    16:12   0:00 cat
>> /home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
>> jhuston  14940  0.0  0.0   4296   788 pts/3    S    16:12   0:00 cat
>> /home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
>> jhuston  14964  0.0  0.0   4296   720 pts/3    S    16:12   0:00 cat
>> /home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
>> jhuston  15052  0.0  0.0   4296   768 pts/3    S    16:12   0:00 cat
>> /home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
>>
>> Why do these get left behind? Nothing to consume the data...?
> 
> Aha, I understand. It is due to the last vm_b.shutdown() and vm_b.launch()
> in the should_migrate case. So, at the end of the test I restart vm_b with
> the -incoming parameter. But it looks like a bug anyway: if we start qemu
> with -incoming "exec", shouldn't we kill the cat process if there was no
> migration?

I agree, but it's your choice whether you want to fix that bug or just
change the test slightly -- I'm responsible for the iotests, but not for
migration, so I have to admit I don't mind just changing this test to
accommodate. O:-)

It appears that just removing the mig_file before the second
vm_b.launch() is sufficient (and enclosing the removal of that file in
tearDown() in a try/except block).  I suppose the cat process will
complain, but that doesn't stop the test from passing.  If you want to
be really nice, I suppose you could just overwrite the FIFO mig_file
with an empty regular file, but I don't think it's actually necessary...

Max
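Max's suggestion above (remove mig_file before the second launch, and tolerate a missing file in tearDown) can be sketched like this; it is a standalone sketch with placeholder paths, not the actual test 169 code:

```python
import os
import tempfile

# Placeholder for iotests.test_dir + '/mig_file' in test 169.
mig_file = os.path.join(tempfile.mkdtemp(), 'mig_file')

def tear_down():
    # A previous failure (or an explicit early removal) may already have
    # deleted the FIFO, so tolerate a missing file instead of raising.
    try:
        os.remove(mig_file)
    except OSError:
        pass

os.mkfifo(mig_file)

# Before relaunching vm_b without -incoming, drop the FIFO so no leftover
# "cat" process keeps the old migration stream around.
tear_down()
tear_down()  # second call is a no-op rather than an error

assert not os.path.exists(mig_file)
```

The try/except makes the cleanup idempotent, so it is safe both mid-test and in the final tearDown.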



Re: [Qemu-block] [PATCH for-2.12 v2] qemu-iotests: update 185 output

2018-04-11 Thread Max Reitz
On 2018-04-10 10:11, Stefan Hajnoczi wrote:
> On Wed, Apr 04, 2018 at 06:16:12PM +0200, Max Reitz wrote:
>> On 2018-04-04 17:01, Stefan Hajnoczi wrote:
>>  === Start mirror job and exit qemu ===
>>
>> This seems to be independent of whether there is actually data on
>> TEST_IMG (the commit source), so something doesn't seem quite right with
>> the block job throttling here...?
> 
> I've been playing around with this test failure.  Let's leave it broken
> (!) in QEMU 2.12 because this test has uncovered a block job ratelimit
> issue that's not a regression.  The fix shouldn't be rushed.

OK for me.

> block/mirror.c calculates delay_ns but then discards it:
> 
>   static void coroutine_fn mirror_run(void *opaque)
>   {
>   ...
> 
>   for (;;) {
>   ...delay_ns is calculated based on job speed...
> 
>   ret = 0;
>   trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
>   if (block_job_is_cancelled(&s->common) && s->common.force) {
>   break;
>   } else if (!should_complete) {
>   delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
>   ^--- we discard it!
>   block_job_sleep_ns(&s->common, delay_ns);
>   }
>   s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>   }
> 
>   ...
>   }

Uh, nice.

> This is why the mirror-based tests are completing even though we'd
> expect them to only reach the "next" iteration due to the job speed.
> 
> Increasing the 4 MB write operation in the test to >16 MB (the mirror
> buffer size) doesn't solve the problem because the next QMP command will
> race with the job's 0 ns sleep.  It would just make the test unreliable.
> 
> I'll work on the following for QEMU 2.13:
> 
> 1. Avoid spurious block_job_enter() from block_job_drain().  This
>eliminates the extra block job iteration that changed the output in
>the first place.
> 
> 2. Honor the job speed in block/mirror.c.
> 
> The end result will be that the test output will not require changes.

Thanks!

Max
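The consequence of discarding delay_ns can be illustrated with a toy rate limiter; ratelimit_delay_ns below is a hypothetical helper for illustration, not the actual block/mirror.c code:

```python
SLICE_TIME = 100 * 1000 * 1000  # 100 ms in ns, matching mirror.c's slice

def ratelimit_delay_ns(num_bytes, speed):
    """How long the job should sleep so that transferring num_bytes in one
    slice does not exceed `speed` bytes/sec (speed == 0 means unlimited)."""
    if speed == 0:
        return 0
    owed_ns = num_bytes * 1_000_000_000 // speed  # time the I/O should take
    return max(owed_ns - SLICE_TIME, 0)

# 4 MiB at 1 MiB/s should take 4 s; with one 100 ms slice elapsed, the job
# still owes 3.9 s of sleep.  Overwriting this computed delay with 0 (as the
# quoted loop does whenever in_flight != 0 or cnt != 0) lets the job run at
# full speed, which is why the throttled tests complete immediately.
assert ratelimit_delay_ns(4 << 20, 1 << 20) == 3_900_000_000
assert ratelimit_delay_ns(1 << 20, 0) == 0  # unlimited
```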





Re: [Qemu-block] [PATCH] iotests: Split 214 off of 122

2018-04-11 Thread Max Reitz
On 2018-04-09 11:28, Alberto Garcia wrote:
> On Fri 06 Apr 2018 06:41:08 PM CEST, Max Reitz  wrote:
>> Commit abd3622cc03cf41ed542126a540385f30a4c0175 added a case to 122
>> regarding how the qcow2 driver handles an incorrect compressed data
>> length value.  This does not really fit into 122, as that file is
>> supposed to contain qemu-img convert test cases, which this case is not.
>> So this patch splits it off into its own file; maybe we will even get
>> more qcow2-only compression tests in the future.
>>
>> Also, that test case does not work with refcount_bits=1, so mark that
>> option as unsupported.
>>
>> Signed-off-by: Max Reitz 
> 
> Looks good to me
> 
>> I was a bit lost what to do about the copyright text, since this test
>> case was written by Berto.  I figured I'd drop the "owner" variable
>> (it isn't used anyway), but I put "Red Hat" into the copyright line --
>> currently every test has copyright information, so I decided it'd be
>> difficult to leave that out, and I figured I simply cannot claim
>> copyright for Igalia.  So, here we go.
> 
> The new file contains only my test, right?

Yep.

> You can use
> 
> Copyright (C) 2018 Igalia, S.L.
> Author: Alberto Garcia 
> 
> and add
> 
> Signed-off-by: Alberto Garcia 

Thanks! :-)

Max



Re: [Qemu-block] [PULL 0/7] Block layer patches for 2.12.0-rc3

2018-04-11 Thread Peter Maydell
On 10 April 2018 at 16:37, Kevin Wolf  wrote:
> The following changes since commit df6378eb0e6cfd58a22a1c3ff8fa4a9039f1eaa8:
>
>   Merge remote-tracking branch 'remotes/kraxel/tags/ui-20180410-pull-request' 
> into staging (2018-04-10 14:04:27 +0100)
>
> are available in the git repository at:
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to c1de5696d6a25b426432c147dfd7fb8a9eb86b89:
>
>   qemu-iotests: update 185 output (2018-04-10 16:34:38 +0200)
>
> 
> Block layer patches
>

Applied, thanks.

-- PMM



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Vladimir Sementsov-Ogievskiy

11.04.2018 12:02, Vladimir Sementsov-Ogievskiy wrote:

03.04.2018 23:13, John Snow wrote:


On 04/03/2018 12:23 PM, Max Reitz wrote:

On 2018-03-30 18:10, Vladimir Sementsov-Ogievskiy wrote:

Use MIGRATION events instead of RESUME. Also, make a TODO: enable
dirty-bitmaps capability for offline case.

This (likely) fixes racy failures of at least the following types:

 - timeout on waiting for RESUME event
 - sha256 mismatch on 136 (138 after this patch)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

This patch is a true change for the test anyway. But I don't understand
why (and whether) it really fixes things. And I'm not sure whether we
really have a bug in bitmap migration or persistence. So it's up to you
whether to take it into 2.12...

It was already discussed that the "STOP" event is bad for tests. What
about "RESUME"? How can we miss it? And the sha256 mismatch is really
something strange.

Max, please check whether this fixes 169 for you.

  tests/qemu-iotests/169 | 44 +++-
  1 file changed, 23 insertions(+), 21 deletions(-)
This makes the test pass (thanks!), but it still leaves behind five 
cats...


Max



Hmm:

jhuston  14772  0.0  0.0   4296   784 pts/3    S    16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14796  0.0  0.0   4296   764 pts/3    S    16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14940  0.0  0.0   4296   788 pts/3    S    16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14964  0.0  0.0   4296   720 pts/3    S    16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  15052  0.0  0.0   4296   768 pts/3    S    16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file

Why do these get left behind? Nothing to consume the data...?


Aha, I understand. It is due to the last vm_b.shutdown() and vm_b.launch()
in the should_migrate case. So, at the end of the test I restart vm_b with
the -incoming parameter. But it looks like a bug anyway: if we start qemu
with -incoming "exec", shouldn't we kill the cat process if there was no
migration?




A third type of failure, without this patch:

+======================================================================
+ERROR: test__persistent__migbitmap__offline_shared (__main__.TestDirtyBitmapMigration)
+methodcaller(name, ...) --> methodcaller object
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "169", line 135, in do_test_migration
+    self.vm_b.launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 221, in launch
+    self._launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 244, in _launch
+    self._post_launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qtest.py", line 100, in _post_launch
+    super(QEMUQtestMachine, self)._post_launch()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qemu.py", line 196, in _post_launch
+    self._qmp.accept()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 157, in accept
+    return self.__negotiate_capabilities()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 75, in __negotiate_capabilities
+    resp = self.cmd('qmp_capabilities')
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in cmd_obj
+    resp = self.__json_read()
+  File "/work/src/qemu/up-169/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in __json_read
+    data = self.__sockfile.readline()
+  File "/usr/lib64/python2.7/socket.py", line 447, in readline
+    data = self._sock.recv(self._rbufsize)
+error: [Errno 104] Connection reset by peer
+


--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: fix 169

2018-04-11 Thread Vladimir Sementsov-Ogievskiy

03.04.2018 23:13, John Snow wrote:


On 04/03/2018 12:23 PM, Max Reitz wrote:

On 2018-03-30 18:10, Vladimir Sementsov-Ogievskiy wrote:

Use MIGRATION events instead of RESUME. Also, make a TODO: enable
dirty-bitmaps capability for offline case.

This (likely) fixes racy failures of at least the following types:

 - timeout on waiting for RESUME event
 - sha256 mismatch on 136 (138 after this patch)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

This patch is a true change for the test anyway. But I don't understand
why (and whether) it really fixes things. And I'm not sure whether we
really have a bug in bitmap migration or persistence. So it's up to you
whether to take it into 2.12...

It was already discussed that the "STOP" event is bad for tests. What about
"RESUME"? How can we miss it? And the sha256 mismatch is really something
strange.

Max, please check whether this fixes 169 for you.

  tests/qemu-iotests/169 | 44 +++-
  1 file changed, 23 insertions(+), 21 deletions(-)

This makes the test pass (thanks!), but it still leaves behind five cats...

Max



Hmm:

jhuston  14772  0.0  0.0   4296   784 pts/3S16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14796  0.0  0.0   4296   764 pts/3S16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14940  0.0  0.0   4296   788 pts/3S16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  14964  0.0  0.0   4296   720 pts/3S16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file
jhuston  15052  0.0  0.0   4296   768 pts/3S16:12   0:00 cat
/home/bos/jhuston/src/qemu/bin/git/tests/qemu-iotests/scratch/mig_file

Why do these get left behind? Nothing to consume the data...?


Aha, I understand. It is due to the last vm_b.shutdown() and vm_b.launch()
in the should_migrate case. So, at the end of the test I restart vm_b with
the -incoming parameter. But it looks like a bug anyway: if we start qemu
with -incoming "exec", shouldn't we kill the cat process if there was no
migration?


--
Best regards,
Vladimir