Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard

2016-11-10 Thread Eric Blake
On 11/08/2016 05:03 AM, Peter Lieven wrote:
> Am 25.10.2016 um 18:12 schrieb Eric Blake:
>> On 10/25/2016 09:36 AM, Paolo Bonzini wrote:
>>>
>>> On 25/10/2016 16:35, Eric Blake wrote:
 So your argument is that we should always pass down every unaligned
 less-than-optimum discard request all the way to the hardware, rather
 than dropping it higher in the stack, even though discard requests are
 already advisory, in order to leave the hardware as the ultimate
 decision on whether to ignore the unaligned request?
>>> Yes, I agree with Peter as to this.
>> Okay, I'll work on patches. I think it counts as bug fix, so appropriate
>> even if I miss soft freeze (I'd still like to get NBD write zero support
>> into 2.8, since it already missed 2.7, but that one is still awaiting
>> review with not much time left).
>>
> 
> Hi Eric,
> 
> have you had time to look at this?
> If you need help, let me know.

Patches posted, but testing help would be appreciated since you have the
actual hardware that exhibits the issue.

I'm also trying to write a patch to extend the blkdebug driver to share
this "feature" of a 15M page, and write a qemu-iotest to make it harder
to regress in the future.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 7/8] quorum: Implement .bdrv_co_preadv/pwritev()

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> This enables byte granularity requests on quorum nodes.
> 
> Note that the QMP events emitted by the driver are an external API that
> we were careless enough to define as sector based. The offset and length
> of requests reported in events are rounded down therefore.

Would it be better to round offset down and length up?  A length of 0
looks odd.

Should we add more fields to the two affected events (QUORUM_FAILURE and
QUORUM_REPORT_BAD)? We have to keep the existing fields for back-compat,
but we could add new fields that give byte-based locations for
management apps smart enough to use the new instead of the old
(particularly since the old fields are named 'sector-num' and
'sectors-count').

> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/quorum.c | 75 
> ++
>  1 file changed, 33 insertions(+), 42 deletions(-)
> 

> @@ -198,14 +198,17 @@ static void quorum_report_bad(QuorumOpType type, 
> uint64_t sector_num,
>  }
>  
>  qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name,
> -  sector_num, nb_sectors, _abort);
> +  offset / BDRV_SECTOR_SIZE,
> +  bytes / BDRV_SECTOR_SIZE, 
> _abort);

Rounding the offset down makes sense, but rounding the bytes down can
lead to weird messages.  Blindly rounding it up to a sector boundary can
also be wrong (example: writing 2 bytes at offset 511 really affects
1024 bytes when you consider that two sectors had to undergo
read-modify-write). Don't we have a helper routine for determining the
end boundary when we have to convert an offset and length to a courser
alignment?


> @@ -462,8 +461,8 @@ static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB 
> *acb,
>  va_list ap;
>  
>  va_start(ap, fmt);
> -fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
> -acb->sector_num, acb->nb_sectors);
> +fprintf(stderr, "quorum: offset=%" PRIu64 " bytes=%" PRIu64 " ",
> +acb->offset, acb->bytes);

Might be worth a separate patch to get rid of fprintf and use correct
error reporting.  But not the work for this patch.

Overall, I like that you are getting rid of more sector-based cruft.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 6/8] quorum: Avoid bdrv_aio_writev() for rewrites

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> Replacing it with bdrv_co_pwritev() prepares us for byte granularity
> requests and gets us rid of the last bdrv_aio_*() user in quorum.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/quorum.c | 52 +---
>  1 file changed, 33 insertions(+), 19 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index b2bb3af..1426115 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -134,6 +134,11 @@ struct QuorumAIOCB {
>  int children_read;  /* how many children have been read from */
>  };
>  
> +typedef struct QuorumCo {
> +QuorumAIOCB *acb;
> +int i;
> +} QuorumCo;
> +

> @@ -586,11 +605,6 @@ free_exit:
>  return rewrite;
>  }
>  
> -typedef struct QuorumCo {
> -QuorumAIOCB *acb;
> -int i;
> -} QuorumCo;

Code motion of a struct added earlier in this series; do you want to
declare it up front to begin with to minimize churn?


-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 4/8] quorum: Do cleanup in caller coroutine

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> Instead of calling quorum_aio_finalize() deeply nested in what used
> to be an AIO callback, do it in the same functions that allocated the
> AIOCB.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/quorum.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 

Looks clean, but again I'm weak enough on coroutines that I won't give
R-b to an RFC

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 3/8] quorum: Implement .bdrv_co_readv/writev

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf 
> ---
>  block/quorum.c | 194 
> ++---
>  1 file changed, 117 insertions(+), 77 deletions(-)
> 


> +
> +static void read_quorum_children_entry(void *opaque)
> +{
> +QuorumCo *co = opaque;
> +QuorumAIOCB *acb = co->acb;
> +BDRVQuorumState *s = acb->bs->opaque;
> +int i = co->i;
> +int ret;
> +co = NULL; /* Not valid after the first yield */

Why bother to invalidate co...

> +
> +acb->qcrs[i].bs = s->children[i]->bs;
> +ret = bdrv_co_preadv(s->children[i], acb->sector_num * BDRV_SECTOR_SIZE,
> + acb->nb_sectors * BDRV_SECTOR_SIZE,
> + >qcrs[i].qiov, 0);
> +quorum_aio_cb(>qcrs[i], ret);
> +}

...when it isn't used later? Is it just for future-proofing edits made
in later patches?

> +static void write_quorum_entry(void *opaque)
> +{
> +QuorumCo *co = opaque;
> +QuorumAIOCB *acb = co->acb;
> +BDRVQuorumState *s = acb->bs->opaque;
> +int i = co->i;
> +int ret;
> +co = NULL; /* Not valid after the first yield */
> +
> +acb->qcrs[i].bs = s->children[i]->bs;
> +ret = bdrv_co_pwritev(s->children[i], acb->sector_num * BDRV_SECTOR_SIZE,
> +  acb->nb_sectors * BDRV_SECTOR_SIZE, acb->qiov, 0);
> +quorum_aio_cb(>qcrs[i], ret);
>  }

and again

Otherwise, the conversion looks sane to me, but I'm just weak enough on
coroutines that I won't give R-b while this is still in RFC

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 2/8] quorum: Remove s from quorum_aio_get() arguments

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> There is no point in passing the value of bs->opaque in order to
> overwrite it with itself.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/quorum.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH 1/8] coroutine: Introduce qemu_coroutine_enter_if_inactive()

2016-11-10 Thread Eric Blake
On 11/10/2016 11:19 AM, Kevin Wolf wrote:
> In the context of asynchronous work, if we have a worker coroutine that
> didn't yield, the parent coroutine cannot be reentered because it hasn't
> yielded yet. In this case we don't even have to reenter the parent
> because it will see that the work is already done and won't even yield.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/qemu/coroutine.h | 6 ++
>  util/qemu-coroutine.c| 8 
>  2 files changed, 14 insertions(+)
> 

> +++ b/util/qemu-coroutine.c
> @@ -19,6 +19,7 @@
>  #include "qemu/atomic.h"
>  #include "qemu/coroutine.h"
>  #include "qemu/coroutine_int.h"
> +#include "block/aio.h"

Why do you need this include?

>  
>  enum {
>  POOL_BATCH_SIZE = 64,
> @@ -131,6 +132,13 @@ void qemu_coroutine_enter(Coroutine *co)
>  }
>  }
>  
> +void qemu_coroutine_enter_if_inactive(Coroutine *co)
> +{
> +if (!qemu_coroutine_entered(co)) {
> +qemu_coroutine_enter(co);
> +}
> +}
> +
>  void coroutine_fn qemu_coroutine_yield(void)
>  {
>  Coroutine *self = qemu_coroutine_self();
> 

Otherwise:
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH 1/2] block: Return -ENOTSUP rather than assert on unaligned discards

2016-11-10 Thread Eric Blake
Right now, the block layer rounds discard requests, so that
individual drivers are able to assert that discard requests
will never be unaligned.  But there are some ISCSI devices
that track and coalesce multiple unaligned requests, turning it
into an actual discard if the requests eventually cover an
entire page, which implies that it is better to always pass
discard requests as low down the stack as possible.

In isolation, this patch has no semantic effect, since the
block layer currently never passes an unaligned request down.
But the block layer already has code that silently ignores
drivers that return -ENOTSUP for a discard request that cannot
be honored (as well as drivers that return 0 even when nothing
was done).  So the next patch will update the block layer to
fragment discard requests, so that clients are guaranteed that
they are either dealing with an unaligned head or tail, or an
aligned body, of the overall request, making it similar to the
block layer semantics of write zero fragmentation.

Signed-off-by: Eric Blake 
CC: qemu-sta...@nongnu.org
---
 block/iscsi.c| 4 +++-
 block/qcow2.c| 4 
 block/sheepdog.c | 5 +++--
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 71bd523..0960929 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, 
int64_t offset, int count)
 struct IscsiTask iTask;
 struct unmap_list list;

-assert(is_byte_request_lun_aligned(offset, count, iscsilun));
+if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
+return -ENOTSUP;
+}

 if (!iscsilun->lbp.lbpu) {
 /* UNMAP is not supported by the target */
diff --git a/block/qcow2.c b/block/qcow2.c
index 6d5689a..9af00fd 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2490,6 +2490,10 @@ static coroutine_fn int 
qcow2_co_pdiscard(BlockDriverState *bs,
 int ret;
 BDRVQcow2State *s = bs->opaque;

+if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
+return -ENOTSUP;
+}
+
 qemu_co_mutex_lock(>lock);
 ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS,
  QCOW2_DISCARD_REQUEST, false);
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 1fb9173..4c9af89 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState 
*bs, int64_t offset,
 iov.iov_len = sizeof(zero);
 discard_iov.iov = 
 discard_iov.niov = 1;
-assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-assert((count & (BDRV_SECTOR_SIZE - 1)) == 0);
+if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) {
+return -ENOTSUP;
+}
 acb = sd_aio_setup(bs, _iov, offset >> BDRV_SECTOR_BITS,
count >> BDRV_SECTOR_BITS);
 acb->aiocb_type = AIOCB_DISCARD_OBJ;
-- 
2.7.4




[Qemu-block] [PATCH 2/2] block: Pass unaligned discard requests to drivers

2016-11-10 Thread Eric Blake
Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong from the data that
the guest sees.  But at least the Dell Equallogic iSCSI SANs
has an interesting property that its advertised discard
alignment is 15M, yet documents that discarding a sequence
of 1M slices will eventually result in the 15M page being
marked as discarded, and it is possible to observe which
pages have been discarded.

Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware.  This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.

Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding it to alignment boundaries early.

Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary.  The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.

Reported by: Peter Lieven 
Signed-off-by: Eric Blake 
CC: qemu-sta...@nongnu.org
---
 block/io.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..76c3030 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2421,7 +2421,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, 
int64_t offset,
 {
 BdrvTrackedRequest req;
 int max_pdiscard, ret;
-int head, align;
+int head, tail, align;

 if (!bs->drv) {
 return -ENOMEDIUM;
@@ -2444,19 +2444,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, 
int64_t offset,
 return 0;
 }

-/* Discard is advisory, so ignore any unaligned head or tail */
+/* Discard is advisory, but some devices track and coalesce
+ * unaligned requests, so we must pass everything down rather than
+ * round here.  Still, most devices will just silently ignore
+ * unaligned requests (by returning -ENOTSUP), so we must fragment
+ * the request accordingly.  */
 align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
 assert(align % bs->bl.request_alignment == 0);
 head = offset % align;
-if (head) {
-head = MIN(count, align - head);
-count -= head;
-offset += head;
-}
-count = QEMU_ALIGN_DOWN(count, align);
-if (!count) {
-return 0;
-}
+tail = (offset + count) % align;

 bdrv_inc_in_flight(bs);
 tracked_request_begin(, bs, offset, count, BDRV_TRACKED_DISCARD);
@@ -2468,11 +2464,25 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, 
int64_t offset,

 max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
align);
-assert(max_pdiscard);
+assert(max_pdiscard >= bs->bl.request_alignment);

 while (count > 0) {
 int ret;
-int num = MIN(count, max_pdiscard);
+int num = count;
+
+if (head) {
+/* Make a small request up to the first aligned sector. */
+num = MIN(count, align - head);
+head = (head + num) % align;
+assert(num < max_pdiscard);
+} else if (tail && num > align) {
+/* Shorten the request to the last aligned sector.  */
+num -= tail;
+}
+/* limit request size */
+if (num > max_pdiscard) {
+num = max_pdiscard;
+}

 if (bs->drv->bdrv_co_pdiscard) {
 ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
-- 
2.7.4




[Qemu-block] [PATCH for-2.8 0/2] pass partial discard requests all the way down

2016-11-10 Thread Eric Blake
Peter reported a mild regression in qemu 2.7 when targetting the
Dell Equallogic iSCSI, which advertizes a preferred and maximum
unmap alignment at 15M.  In qemu 2.6, trims not aligned to those
boundaries still made it to the device, but in 2.7, the block
layer is ignoring unaligned portions of guest requests, resulting
in fewer blocks being actually trimmed.

Since discard is advisary, it is borderline if this counts as a
bug fix worthy for inclusion in 2.8, or if it should be deferred
to 2.9.  At any rate, while I was able to test that the patch
didn't misbehave for me when I tweaked an NBD setup to force
15M alignment, I am unable to test that it makes an actual
difference for Peter's hardware, and that needs to happen before
anyone even thinks of applying this.

Eric Blake (2):
  block: Return -ENOTSUP rather than assert on unaligned discards
  block: Pass unaligned discard requests to drivers

 block/io.c   | 36 +++-
 block/iscsi.c|  4 +++-
 block/qcow2.c|  4 
 block/sheepdog.c |  5 +++--
 4 files changed, 33 insertions(+), 16 deletions(-)

-- 
2.7.4




[Qemu-block] [RFC PATCH 8/8] quorum: Inline quorum_fifo_aio_cb()

2016-11-10 Thread Kevin Wolf
Inlining the function removes some boilerplace code and replaces
recursion by a simple loop, so the code becomes somewhat easier to
understand.

Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 42 +-
 1 file changed, 13 insertions(+), 29 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 45939c5..0e91d59 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -248,30 +248,6 @@ static void quorum_report_bad_acb(QuorumChildRequest 
*sacb, int ret)
 quorum_report_bad(type, acb->offset, acb->bytes, sacb->bs->node_name, ret);
 }
 
-static int quorum_fifo_aio_cb(void *opaque, int ret)
-{
-QuorumChildRequest *sacb = opaque;
-QuorumAIOCB *acb = sacb->parent;
-BDRVQuorumState *s = acb->bs->opaque;
-
-assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
-
-if (ret < 0) {
-quorum_report_bad_acb(sacb, ret);
-
-/* We try to read next child in FIFO order if we fail to read */
-if (acb->children_read < s->num_children) {
-return read_fifo_child(acb);
-}
-}
-
-acb->vote_ret = ret;
-
-/* FIXME: rewrite failed children if acb->children_read > 1? */
-
-return ret;
-}
-
 static void quorum_report_bad_versions(BDRVQuorumState *s,
QuorumAIOCB *acb,
QuorumVoteValue *value)
@@ -677,12 +653,20 @@ static int read_quorum_children(QuorumAIOCB *acb)
 static int read_fifo_child(QuorumAIOCB *acb)
 {
 BDRVQuorumState *s = acb->bs->opaque;
-int n = acb->children_read++;
-int ret;
+int n, ret;
+
+/* We try to read the next child in FIFO order if we failed to read */
+do {
+n = acb->children_read++;
+acb->qcrs[n].bs = s->children[n]->bs;
+ret = bdrv_co_preadv(s->children[n], acb->offset, acb->bytes,
+ acb->qiov, 0);
+if (ret < 0) {
+quorum_report_bad_acb(>qcrs[n], ret);
+}
+} while(ret < 0 && acb->children_read < s->num_children);
 
-acb->qcrs[n].bs = s->children[n]->bs;
-ret = bdrv_co_preadv(s->children[n], acb->offset, acb->bytes, acb->qiov, 
0);
-ret = quorum_fifo_aio_cb(>qcrs[n], ret);
+/* FIXME: rewrite failed children if acb->children_read > 1? */
 
 return ret;
 }
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 3/8] quorum: Implement .bdrv_co_readv/writev

2016-11-10 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 194 ++---
 1 file changed, 117 insertions(+), 77 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index dfa9fd3..3cb579e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -97,7 +97,7 @@ typedef struct QuorumAIOCB QuorumAIOCB;
  * $children_count QuorumChildRequest.
  */
 typedef struct QuorumChildRequest {
-BlockAIOCB *aiocb;
+BlockDriverState *bs;
 QEMUIOVector qiov;
 uint8_t *buf;
 int ret;
@@ -110,7 +110,8 @@ typedef struct QuorumChildRequest {
  * used to do operations on each children and track overall progress.
  */
 struct QuorumAIOCB {
-BlockAIOCB common;
+BlockDriverState *bs;
+Coroutine *co;
 
 /* Request metadata */
 uint64_t sector_num;
@@ -129,36 +130,18 @@ struct QuorumAIOCB {
 QuorumVotes votes;
 
 bool is_read;
+bool has_completed;
 int vote_ret;
 int children_read;  /* how many children have been read from */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
 
-static void quorum_aio_cancel(BlockAIOCB *blockacb)
-{
-QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
-BDRVQuorumState *s = acb->common.bs->opaque;
-int i;
-
-/* cancel all callbacks */
-for (i = 0; i < s->num_children; i++) {
-if (acb->qcrs[i].aiocb) {
-bdrv_aio_cancel_async(acb->qcrs[i].aiocb);
-}
-}
-}
-
-static AIOCBInfo quorum_aiocb_info = {
-.aiocb_size = sizeof(QuorumAIOCB),
-.cancel_async   = quorum_aio_cancel,
-};
-
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
-acb->common.cb(acb->common.opaque, acb->vote_ret);
+acb->has_completed = true;
 g_free(acb->qcrs);
-qemu_aio_unref(acb);
+qemu_coroutine_enter_if_inactive(acb->co);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
@@ -174,14 +157,14 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, 
QuorumVoteValue *b)
 static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
QEMUIOVector *qiov,
uint64_t sector_num,
-   int nb_sectors,
-   BlockCompletionFunc *cb,
-   void *opaque)
+   int nb_sectors)
 {
 BDRVQuorumState *s = bs->opaque;
-QuorumAIOCB *acb = qemu_aio_get(_aiocb_info, bs, cb, opaque);
+QuorumAIOCB *acb = g_new(QuorumAIOCB, 1);
 int i;
 
+acb->co = qemu_coroutine_self();
+acb->bs = bs;
 acb->sector_num = sector_num;
 acb->nb_sectors = nb_sectors;
 acb->qiov = qiov;
@@ -191,6 +174,7 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
 acb->rewrite_count = 0;
 acb->votes.compare = quorum_sha256_compare;
 QLIST_INIT(>votes.vote_list);
+acb->has_completed = false;
 acb->is_read = false;
 acb->vote_ret = 0;
 
@@ -217,7 +201,7 @@ static void quorum_report_bad(QuorumOpType type, uint64_t 
sector_num,
 
 static void quorum_report_failure(QuorumAIOCB *acb)
 {
-const char *reference = bdrv_get_device_or_node_name(acb->common.bs);
+const char *reference = bdrv_get_device_or_node_name(acb->bs);
 qapi_event_send_quorum_failure(reference, acb->sector_num,
acb->nb_sectors, _abort);
 }
@@ -226,7 +210,7 @@ static int quorum_vote_error(QuorumAIOCB *acb);
 
 static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb)
 {
-BDRVQuorumState *s = acb->common.bs->opaque;
+BDRVQuorumState *s = acb->bs->opaque;
 
 if (acb->success_count < s->threshold) {
 acb->vote_ret = quorum_vote_error(acb);
@@ -252,7 +236,7 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
-static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb);
+static int read_fifo_child(QuorumAIOCB *acb);
 
 static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
 {
@@ -272,14 +256,14 @@ static void quorum_report_bad_acb(QuorumChildRequest 
*sacb, int ret)
 QuorumAIOCB *acb = sacb->parent;
 QuorumOpType type = acb->is_read ? QUORUM_OP_TYPE_READ : 
QUORUM_OP_TYPE_WRITE;
 quorum_report_bad(type, acb->sector_num, acb->nb_sectors,
-  sacb->aiocb->bs->node_name, ret);
+  sacb->bs->node_name, ret);
 }
 
-static void quorum_fifo_aio_cb(void *opaque, int ret)
+static int quorum_fifo_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
 QuorumAIOCB *acb = sacb->parent;
-BDRVQuorumState *s = acb->common.bs->opaque;
+BDRVQuorumState *s = acb->bs->opaque;
 
 assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
 
@@ -288,8 +272,7 @@ static void quorum_fifo_aio_cb(void *opaque, int ret)
 
 /* We try to read next child in FIFO order if we fail to read */
 if (acb->children_read < s->num_children) {

[Qemu-block] [RFC PATCH 2/8] quorum: Remove s from quorum_aio_get() arguments

2016-11-10 Thread Kevin Wolf
There is no point in passing the value of bs->opaque in order to
overwrite it with itself.

Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d122299..dfa9fd3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -171,18 +171,17 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, 
QuorumVoteValue *b)
 return a->l == b->l;
 }
 
-static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
-   BlockDriverState *bs,
+static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
QEMUIOVector *qiov,
uint64_t sector_num,
int nb_sectors,
BlockCompletionFunc *cb,
void *opaque)
 {
+BDRVQuorumState *s = bs->opaque;
 QuorumAIOCB *acb = qemu_aio_get(_aiocb_info, bs, cb, opaque);
 int i;
 
-acb->common.bs->opaque = s;
 acb->sector_num = sector_num;
 acb->nb_sectors = nb_sectors;
 acb->qiov = qiov;
@@ -691,7 +690,7 @@ static BlockAIOCB *quorum_aio_readv(BlockDriverState *bs,
 void *opaque)
 {
 BDRVQuorumState *s = bs->opaque;
-QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
+QuorumAIOCB *acb = quorum_aio_get(bs, qiov, sector_num,
   nb_sectors, cb, opaque);
 acb->is_read = true;
 acb->children_read = 0;
@@ -711,7 +710,7 @@ static BlockAIOCB *quorum_aio_writev(BlockDriverState *bs,
  void *opaque)
 {
 BDRVQuorumState *s = bs->opaque;
-QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors,
+QuorumAIOCB *acb = quorum_aio_get(bs, qiov, sector_num, nb_sectors,
   cb, opaque);
 int i;
 
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 4/8] quorum: Do cleanup in caller coroutine

2016-11-10 Thread Kevin Wolf
Instead of calling quorum_aio_finalize() deeply nested in what used
to be an AIO callback, do it in the same functions that allocated the
AIOCB.

Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 3cb579e..a225327 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -139,9 +139,8 @@ static bool quorum_vote(QuorumAIOCB *acb);
 
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
-acb->has_completed = true;
 g_free(acb->qcrs);
-qemu_coroutine_enter_if_inactive(acb->co);
+g_free(acb);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
@@ -233,7 +232,8 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 return;
 }
 
-quorum_aio_finalize(acb);
+acb->has_completed = true;
+qemu_coroutine_enter_if_inactive(acb->co);
 }
 
 static int read_fifo_child(QuorumAIOCB *acb);
@@ -279,7 +279,7 @@ static int quorum_fifo_aio_cb(void *opaque, int ret)
 acb->vote_ret = ret;
 
 /* FIXME: rewrite failed children if acb->children_read > 1? */
-quorum_aio_finalize(acb);
+
 return ret;
 }
 
@@ -317,7 +317,8 @@ static void quorum_aio_cb(void *opaque, int ret)
 
 /* if no rewrite is done the code will finish right away */
 if (!rewrite) {
-quorum_aio_finalize(acb);
+acb->has_completed = true;
+qemu_coroutine_enter_if_inactive(acb->co);
 }
 }
 
@@ -716,7 +717,8 @@ static int quorum_co_readv(BlockDriverState *bs,
 } else {
 ret = read_fifo_child(acb);
 }
-g_free(acb);
+quorum_aio_finalize(acb);
+
 return ret;
 }
 
@@ -759,6 +761,7 @@ static int quorum_co_writev(BlockDriverState *bs,
 }
 
 ret = acb->vote_ret;
+quorum_aio_finalize(acb);
 
 return ret;
 }
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 1/8] coroutine: Introduce qemu_coroutine_enter_if_inactive()

2016-11-10 Thread Kevin Wolf
In the context of asynchronous work, if we have a worker coroutine that
didn't yield, the parent coroutine cannot be reentered because it hasn't
yielded yet. In this case we don't even have to reenter the parent
because it will see that the work is already done and won't even yield.

Signed-off-by: Kevin Wolf 
---
 include/qemu/coroutine.h | 6 ++
 util/qemu-coroutine.c| 8 
 2 files changed, 14 insertions(+)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index e6a60d5..12584ed 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -71,6 +71,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void 
*opaque);
 void qemu_coroutine_enter(Coroutine *coroutine);
 
 /**
+ * Transfer control to a coroutine if it's not active (i.e. part of the call
+ * stack of the running coroutine). Otherwise, do nothing.
+ */
+void qemu_coroutine_enter_if_inactive(Coroutine *co);
+
+/**
  * Transfer control back to a coroutine's caller
  *
  * This function does not return until the coroutine is re-entered using
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 737bffa..a06befe 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -19,6 +19,7 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine.h"
 #include "qemu/coroutine_int.h"
+#include "block/aio.h"
 
 enum {
 POOL_BATCH_SIZE = 64,
@@ -131,6 +132,13 @@ void qemu_coroutine_enter(Coroutine *co)
 }
 }
 
+void qemu_coroutine_enter_if_inactive(Coroutine *co)
+{
+if (!qemu_coroutine_entered(co)) {
+qemu_coroutine_enter(co);
+}
+}
+
 void coroutine_fn qemu_coroutine_yield(void)
 {
 Coroutine *self = qemu_coroutine_self();
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 0/8] quorum: Implement .bdrv_co_preadv/pwritev()

2016-11-10 Thread Kevin Wolf
This is part of a series that I'm working on and that aims to remove the
bdrv_aio_*() emulation in io.c. blkdebug and blkverify were easy, but for
quorum I need a few more patches, so I'm sending this out as an RFC while I
continue work on the rest (QED, and then possibly some polishing).

After the series, quorum doesn't only use the preferred interfaces, but the
code becomes a little easier to follow and byte granularity requests are
supported. There is probably still some potential for additional cleanups in a
follow-up series, but this should already be enough to be worth merging.

Kevin Wolf (8):
  coroutine: Introduce qemu_coroutine_enter_if_inactive()
  quorum: Remove s from quorum_aio_get() arguments
  quorum: Implement .bdrv_co_readv/writev
  quorum: Do cleanup in caller coroutine
  quorum: Inline quorum_aio_cb()
  quorum: Avoid bdrv_aio_writev() for rewrites
  quorum: Implement .bdrv_co_preadv/pwritev()
  quorum: Inline quorum_fifo_aio_cb()

 block/quorum.c   | 373 +--
 include/qemu/coroutine.h |   6 +
 util/qemu-coroutine.c|   8 +
 3 files changed, 211 insertions(+), 176 deletions(-)

-- 
1.8.3.1




[Qemu-block] [RFC PATCH 6/8] quorum: Avoid bdrv_aio_writev() for rewrites

2016-11-10 Thread Kevin Wolf
Replacing it with bdrv_co_pwritev() prepares us for byte granularity
requests and gets us rid of the last bdrv_aio_*() user in quorum.

Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 52 +---
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index b2bb3af..1426115 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -134,6 +134,11 @@ struct QuorumAIOCB {
 int children_read;  /* how many children have been read from */
 };
 
+typedef struct QuorumCo {
+QuorumAIOCB *acb;
+int i;
+} QuorumCo;
+
 static bool quorum_vote(QuorumAIOCB *acb);
 
 static void quorum_aio_finalize(QuorumAIOCB *acb)
@@ -218,15 +223,6 @@ static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb)
 return false;
 }
 
-static void quorum_rewrite_aio_cb(void *opaque, int ret)
-{
-QuorumAIOCB *acb = opaque;
-
-/* one less rewrite to do */
-acb->rewrite_count--;
-qemu_coroutine_enter_if_inactive(acb->co);
-}
-
 static int read_fifo_child(QuorumAIOCB *acb);
 
 static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
@@ -293,7 +289,25 @@ static void quorum_report_bad_versions(BDRVQuorumState *s,
 }
 }
 
-static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
+static void quorum_rewrite_entry(void *opaque)
+{
+QuorumCo *co = opaque;
+QuorumAIOCB *acb = co->acb;
+BDRVQuorumState *s = acb->bs->opaque;
+int ret;
+
+ret = bdrv_co_pwritev(s->children[co->i],
+  acb->sector_num * BDRV_SECTOR_SIZE,
+  acb->nb_sectors * BDRV_SECTOR_SIZE,
+  acb->qiov, 0);
+(void) ret;
+
+/* one less rewrite to do */
+acb->rewrite_count--;
+qemu_coroutine_enter_if_inactive(acb->co);
+}
+
+static bool quorum_rewrite_bad_versions(QuorumAIOCB *acb,
 QuorumVoteValue *value)
 {
 QuorumVoteVersion *version;
@@ -321,9 +335,14 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState 
*s, QuorumAIOCB *acb,
 continue;
 }
 QLIST_FOREACH(item, >items, next) {
-bdrv_aio_writev(s->children[item->index], acb->sector_num,
-acb->qiov, acb->nb_sectors, quorum_rewrite_aio_cb,
-acb);
+Coroutine *co;
+QuorumCo data = {
+.acb = acb,
+.i = item->index,
+};
+
+co = qemu_coroutine_create(quorum_rewrite_entry, );
+qemu_coroutine_enter(co);
 }
 }
 
@@ -577,7 +596,7 @@ static bool quorum_vote(QuorumAIOCB *acb)
 
 /* corruption correction is enabled */
 if (s->rewrite_corrupted) {
-rewrite = quorum_rewrite_bad_versions(s, acb, >value);
+rewrite = quorum_rewrite_bad_versions(acb, >value);
 }
 
 free_exit:
@@ -586,11 +605,6 @@ free_exit:
 return rewrite;
 }
 
-typedef struct QuorumCo {
-QuorumAIOCB *acb;
-int i;
-} QuorumCo;
-
 static void read_quorum_children_entry(void *opaque)
 {
 QuorumCo *co = opaque;
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 7/8] quorum: Implement .bdrv_co_preadv/pwritev()

2016-11-10 Thread Kevin Wolf
This enables byte granularity requests on quorum nodes.

Note that the QMP events emitted by the driver are an external API that
we were careless enough to define as sector based. The offset and length
of requests reported in events are rounded down therefore.

Signed-off-by: Kevin Wolf 
---
 block/quorum.c | 75 ++
 1 file changed, 33 insertions(+), 42 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 1426115..45939c5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -114,8 +114,8 @@ struct QuorumAIOCB {
 Coroutine *co;
 
 /* Request metadata */
-uint64_t sector_num;
-int nb_sectors;
+uint64_t offset;
+uint64_t bytes;
 
 QEMUIOVector *qiov; /* calling IOV */
 
@@ -159,8 +159,8 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, 
QuorumVoteValue *b)
 
 static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
QEMUIOVector *qiov,
-   uint64_t sector_num,
-   int nb_sectors)
+   uint64_t offset,
+   uint64_t bytes)
 {
 BDRVQuorumState *s = bs->opaque;
 QuorumAIOCB *acb = g_new(QuorumAIOCB, 1);
@@ -168,8 +168,8 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
 
 acb->co = qemu_coroutine_self();
 acb->bs = bs;
-acb->sector_num = sector_num;
-acb->nb_sectors = nb_sectors;
+acb->offset = offset;
+acb->bytes = bytes;
 acb->qiov = qiov;
 acb->qcrs = g_new0(QuorumChildRequest, s->num_children);
 acb->count = 0;
@@ -189,8 +189,8 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs,
 return acb;
 }
 
-static void quorum_report_bad(QuorumOpType type, uint64_t sector_num,
-  int nb_sectors, char *node_name, int ret)
+static void quorum_report_bad(QuorumOpType type, uint64_t offset,
+  uint64_t bytes, char *node_name, int ret)
 {
 const char *msg = NULL;
 if (ret < 0) {
@@ -198,14 +198,17 @@ static void quorum_report_bad(QuorumOpType type, uint64_t 
sector_num,
 }
 
 qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name,
-  sector_num, nb_sectors, _abort);
+  offset / BDRV_SECTOR_SIZE,
+  bytes / BDRV_SECTOR_SIZE, _abort);
 }
 
 static void quorum_report_failure(QuorumAIOCB *acb)
 {
 const char *reference = bdrv_get_device_or_node_name(acb->bs);
-qapi_event_send_quorum_failure(reference, acb->sector_num,
-   acb->nb_sectors, _abort);
+qapi_event_send_quorum_failure(reference,
+   acb->offset / BDRV_SECTOR_SIZE,
+   acb->bytes / BDRV_SECTOR_SIZE,
+   _abort);
 }
 
 static int quorum_vote_error(QuorumAIOCB *acb);
@@ -242,8 +245,7 @@ static void quorum_report_bad_acb(QuorumChildRequest *sacb, 
int ret)
 {
 QuorumAIOCB *acb = sacb->parent;
 QuorumOpType type = acb->is_read ? QUORUM_OP_TYPE_READ : 
QUORUM_OP_TYPE_WRITE;
-quorum_report_bad(type, acb->sector_num, acb->nb_sectors,
-  sacb->bs->node_name, ret);
+quorum_report_bad(type, acb->offset, acb->bytes, sacb->bs->node_name, ret);
 }
 
 static int quorum_fifo_aio_cb(void *opaque, int ret)
@@ -282,8 +284,7 @@ static void quorum_report_bad_versions(BDRVQuorumState *s,
 continue;
 }
 QLIST_FOREACH(item, >items, next) {
-quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num,
-  acb->nb_sectors,
+quorum_report_bad(QUORUM_OP_TYPE_READ, acb->offset, acb->bytes,
   s->children[item->index]->bs->node_name, 0);
 }
 }
@@ -296,9 +297,7 @@ static void quorum_rewrite_entry(void *opaque)
 BDRVQuorumState *s = acb->bs->opaque;
 int ret;
 
-ret = bdrv_co_pwritev(s->children[co->i],
-  acb->sector_num * BDRV_SECTOR_SIZE,
-  acb->nb_sectors * BDRV_SECTOR_SIZE,
+ret = bdrv_co_pwritev(s->children[co->i], acb->offset, acb->bytes,
   acb->qiov, 0);
 (void) ret;
 
@@ -462,8 +461,8 @@ static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb,
 va_list ap;
 
 va_start(ap, fmt);
-fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ",
-acb->sector_num, acb->nb_sectors);
+fprintf(stderr, "quorum: offset=%" PRIu64 " bytes=%" PRIu64 " ",
+acb->offset, acb->bytes);
 vfprintf(stderr, fmt, ap);
 fprintf(stderr, "\n");
 va_end(ap);
@@ -481,9 +480,8 @@ static bool quorum_compare(QuorumAIOCB *acb,
 if (s->is_blkverify) {
 offset = qemu_iovec_compare(a, b);
 if (offset != -1) {
-quorum_err(acb, 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/6] block-backend: Always notify on blk_eject

2016-11-10 Thread John Snow

Ping,

Kevin: Look ok?

--js

On 11/07/2016 04:13 PM, John Snow wrote:

Requires patches in my IDE branch, for context.

This series changes blk_eject (used for a software-initiated eject request)
to always trigger a QMP tray event, in contrast to the current behavior where
a tray event only occurs if a medium is already in the tray.

V2: Now with tests. (Kevin)



For convenience, this branch is available at:
https://github.com/jnsnow/qemu.git branch tray-always-notify
https://github.com/jnsnow/qemu/tree/tray-always-notify

This version is tagged tray-always-notify-v2:
https://github.com/jnsnow/qemu/releases/tag/tray-always-notify-v2

John Snow (6):
  block-backend: Always notify on blk_eject
  libqtest: add qmp_eventwait_ref
  libqos/ahci: Support expected errors
  libqos/ahci: Add ATAPI tray macros
  libqos/ahci: Add get_sense and test_ready
  ahci-test: add QMP tray test for ATAPI

 block/block-backend.c | 13 +++
 tests/ahci-test.c | 98 +++
 tests/libqos/ahci.c   | 96 ++---
 tests/libqos/ahci.h   | 26 --
 tests/libqtest.c  | 13 +--
 tests/libqtest.h  | 22 
 6 files changed, 252 insertions(+), 16 deletions(-)



--
—js



[Qemu-block] [PULL 39/47] megasas: remove unnecessary megasas_use_msix()

2016-11-10 Thread Michael S. Tsirkin
From: Cao jin 

Also move certain hunk above, to place msix init related code together.

CC: Hannes Reinecke 
CC: Paolo Bonzini 
CC: Markus Armbruster 
CC: Marcel Apfelbaum 
CC: Michael S. Tsirkin 

Reviewed-by: Markus Armbruster 
Signed-off-by: Cao jin 
Acked-by: Marcel Apfelbaum 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/scsi/megasas.c | 19 ++-
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index 6cef9a3..ba79e7a 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -155,11 +155,6 @@ static bool megasas_use_queue64(MegasasState *s)
 return s->flags & MEGASAS_MASK_USE_QUEUE64;
 }
 
-static bool megasas_use_msix(MegasasState *s)
-{
-return s->msix != ON_OFF_AUTO_OFF;
-}
-
 static bool megasas_is_jbod(MegasasState *s)
 {
 return s->flags & MEGASAS_MASK_USE_JBOD;
@@ -2299,9 +2294,7 @@ static void megasas_scsi_uninit(PCIDevice *d)
 {
 MegasasState *s = MEGASAS(d);
 
-if (megasas_use_msix(s)) {
-msix_uninit(d, >mmio_io, >mmio_io);
-}
+msix_uninit(d, >mmio_io, >mmio_io);
 msi_uninit(d);
 }
 
@@ -2353,7 +2346,7 @@ static void megasas_scsi_realize(PCIDevice *dev, Error 
**errp)
 
 memory_region_init_io(>mmio_io, OBJECT(s), _mmio_ops, s,
   "megasas-mmio", 0x4000);
-if (megasas_use_msix(s)) {
+if (s->msix != ON_OFF_AUTO_OFF) {
 ret = msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000,
 >mmio_io, b->mmio_bar, 0x3800, 0x68, );
 /* Any error other than -ENOTSUP(board's MSI support is broken)
@@ -2373,6 +2366,10 @@ static void megasas_scsi_realize(PCIDevice *dev, Error 
**errp)
 error_free(err);
 }
 
+if (msix_enabled(dev)) {
+msix_vector_use(dev, 0);
+}
+
 memory_region_init_io(>port_io, OBJECT(s), _port_ops, s,
   "megasas-io", 256);
 memory_region_init_io(>queue_io, OBJECT(s), _queue_ops, s,
@@ -2388,10 +2385,6 @@ static void megasas_scsi_realize(PCIDevice *dev, Error 
**errp)
 pci_register_bar(dev, b->mmio_bar, bar_type, >mmio_io);
 pci_register_bar(dev, 3, bar_type, >queue_io);
 
-if (megasas_use_msix(s)) {
-msix_vector_use(dev, 0);
-}
-
 s->fw_state = MFI_FWSTATE_READY;
 if (!s->sas_addr) {
 s->sas_addr = ((NAA_LOCALLY_ASSIGNED_ID << 24) |
-- 
MST




[Qemu-block] [PULL 36/47] pci: Convert msix_init() to Error and fix callers to check it

2016-11-10 Thread Michael S. Tsirkin
From: Cao jin 

msix_init() reports errors with error_report(), which is wrong when
it's used in realize().  The same issue was fixed for msi_init() in
commit 1108b2f.

For some devices(like e1000e, vmxnet3) who won't fail because of
msix_init's failure, suppress the error report by passing NULL error object.

Bonus: add comment for msix_init.

CC: Jiri Pirko 
CC: Gerd Hoffmann 
CC: Dmitry Fleytman 
CC: Jason Wang 
CC: Michael S. Tsirkin 
CC: Hannes Reinecke 
CC: Paolo Bonzini 
CC: Alex Williamson 
CC: Markus Armbruster 
CC: Marcel Apfelbaum 

Reviewed-by: Markus Armbruster 
Signed-off-by: Cao jin 
Acked-by: Marcel Apfelbaum 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/pci/msix.h  |  5 +++--
 hw/block/nvme.c|  5 -
 hw/misc/ivshmem.c  |  8 
 hw/net/e1000e.c|  6 +-
 hw/net/rocker/rocker.c |  7 ++-
 hw/net/vmxnet3.c   |  8 ++--
 hw/pci/msix.c  | 34 +-
 hw/scsi/megasas.c  |  5 -
 hw/usb/hcd-xhci.c  | 13 -
 hw/vfio/pci.c  |  8 ++--
 hw/virtio/virtio-pci.c | 11 +--
 11 files changed, 80 insertions(+), 30 deletions(-)

diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 048a29d..1f27658 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -9,9 +9,10 @@ MSIMessage msix_get_message(PCIDevice *dev, unsigned int 
vector);
 int msix_init(PCIDevice *dev, unsigned short nentries,
   MemoryRegion *table_bar, uint8_t table_bar_nr,
   unsigned table_offset, MemoryRegion *pba_bar,
-  uint8_t pba_bar_nr, unsigned pba_offset, uint8_t cap_pos);
+  uint8_t pba_bar_nr, unsigned pba_offset, uint8_t cap_pos,
+  Error **errp);
 int msix_init_exclusive_bar(PCIDevice *dev, unsigned short nentries,
-uint8_t bar_nr);
+uint8_t bar_nr, Error **errp);
 
 void msix_write_config(PCIDevice *dev, uint32_t address, uint32_t val, int 
len);
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d479fd2..2d703c8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -831,6 +831,7 @@ static int nvme_init(PCIDevice *pci_dev)
 {
 NvmeCtrl *n = NVME(pci_dev);
 NvmeIdCtrl *id = >id_ctrl;
+Error *err = NULL;
 
 int i;
 int64_t bs_size;
@@ -872,7 +873,9 @@ static int nvme_init(PCIDevice *pci_dev)
 pci_register_bar(>parent_obj, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 >iomem);
-msix_init_exclusive_bar(>parent_obj, n->num_queues, 4);
+if (msix_init_exclusive_bar(>parent_obj, n->num_queues, 4, )) {
+error_report_err(err);
+}
 
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 230e51b..ca6f804 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -749,13 +749,13 @@ static void ivshmem_reset(DeviceState *d)
 }
 }
 
-static int ivshmem_setup_interrupts(IVShmemState *s)
+static int ivshmem_setup_interrupts(IVShmemState *s, Error **errp)
 {
 /* allocate QEMU callback data for receiving interrupts */
 s->msi_vectors = g_malloc0(s->vectors * sizeof(MSIVector));
 
 if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
-if (msix_init_exclusive_bar(PCI_DEVICE(s), s->vectors, 1)) {
+if (msix_init_exclusive_bar(PCI_DEVICE(s), s->vectors, 1, errp)) {
 return -1;
 }
 
@@ -897,8 +897,8 @@ static void ivshmem_common_realize(PCIDevice *dev, Error 
**errp)
 qemu_chr_fe_set_handlers(>server_chr, ivshmem_can_receive,
  ivshmem_read, NULL, s, NULL, true);
 
-if (ivshmem_setup_interrupts(s) < 0) {
-error_setg(errp, "failed to initialize interrupts");
+if (ivshmem_setup_interrupts(s, errp) < 0) {
+error_prepend(errp, "Failed to initialize interrupts: ");
 return;
 }
 }
diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index 89f96eb..bf32aca 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -292,7 +292,11 @@ e1000e_init_msix(E1000EState *s)
 E1000E_MSIX_IDX, E1000E_MSIX_TABLE,
 >msix,
 E1000E_MSIX_IDX, E1000E_MSIX_PBA,
-0xA0);
+0xA0, NULL);
+
+/* Any error other than -ENOTSUP(board's MSI support is broken)
+ * is a programming error. Fall back to INTx silently on -ENOTSUP */
+assert(!res || res == -ENOTSUP);
 
 if (res < 0) {
 

[Qemu-block] [PULL 37/47] megasas: change behaviour of msix switch

2016-11-10 Thread Michael S. Tsirkin
From: Cao jin 

Resolve the TODO, msix=auto means msix on; if user specify msix=on,
then device creation fail on msix_init failure.
Also undo the overwrites of user configuration of msix.

CC: Michael S. Tsirkin 
CC: Hannes Reinecke 
CC: Paolo Bonzini 
CC: Markus Armbruster 
CC: Marcel Apfelbaum 

Reviewed-by: Markus Armbruster 
Signed-off-by: Cao jin 
Acked-by: Marcel Apfelbaum 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/scsi/megasas.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index fada834..6cef9a3 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -2353,19 +2353,31 @@ static void megasas_scsi_realize(PCIDevice *dev, Error 
**errp)
 
 memory_region_init_io(>mmio_io, OBJECT(s), _mmio_ops, s,
   "megasas-mmio", 0x4000);
+if (megasas_use_msix(s)) {
+ret = msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000,
+>mmio_io, b->mmio_bar, 0x3800, 0x68, );
+/* Any error other than -ENOTSUP(board's MSI support is broken)
+ * is a programming error */
+assert(!ret || ret == -ENOTSUP);
+if (ret && s->msix == ON_OFF_AUTO_ON) {
+/* Can't satisfy user's explicit msix=on request, fail */
+error_append_hint(, "You have to use msix=auto (default) or "
+"msix=off with this machine type.\n");
+/* No instance_finalize method, need to free the resource here */
+object_unref(OBJECT(>mmio_io));
+error_propagate(errp, err);
+return;
+}
+assert(!err || s->msix == ON_OFF_AUTO_AUTO);
+/* With msix=auto, we fall back to MSI off silently */
+error_free(err);
+}
+
 memory_region_init_io(>port_io, OBJECT(s), _port_ops, s,
   "megasas-io", 256);
 memory_region_init_io(>queue_io, OBJECT(s), _queue_ops, s,
   "megasas-queue", 0x4);
 
-if (megasas_use_msix(s) &&
-msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000,
-  >mmio_io, b->mmio_bar, 0x3800, 0x68, )) {
-/*TODO: check msix_init's error, and should fail on msix=on */
-error_report_err(err);
-s->msix = ON_OFF_AUTO_OFF;
-}
-
 if (pci_is_express(dev)) {
 pcie_endpoint_cap_init(dev, 0xa0);
 }
-- 
MST




Re: [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer

2016-11-10 Thread Kevin Wolf
Am 10.11.2016 um 03:11 hat Fam Zheng geschrieben:
> On Wed, 11/09 14:06, Eric Blake wrote:
> > On 11/09/2016 07:49 AM, Stefan Hajnoczi wrote:
> > > On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote:
> > >> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> > >> an unaligned request never crosses a cluster boundary.  But
> > >> in the rewrite, the new code assumed that at most one iteration
> > >> would be needed to get to an alignment boundary.
> > >>
> > >> However, it is easy to trigger an assertion failure: the Linux
> > >> kernel limits loopback devices to advertise a max_transfer of
> > >> only 64k.  Any operation that requires falling back to writes
> > >> rather than more efficient zeroing must obey max_transfer during
> > >> that fallback, which means an unaligned head may require multiple
> > >> iterations of the write fallbacks before reaching the aligned
> > >> boundaries, when layering a format with clusters larger than 64k
> > >> atop the protocol of file access to a loopback device.
> > >>
> > >> Test case:
> > >>
> > >> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> > >> $ losetup /dev/loop2 /path/to/file
> > >> $ qemu-io -f qcow2 /dev/loop2
> > >> qemu-io> w 7m 1k
> > >> qemu-io> w -z 8003584 2093056
> > > 
> > > Please include a qemu-iotests test case to protect against regressions.
> > 
> > None of the existing qemu-iotests use losetup; I guess the closest thing
> > to do is crib from a test that uses passwordless sudo?
> > 
> > It will certainly be a separate commit, but I'll give it my best shot to
> > post something soon.
> 
> Alternatively, maybe add a blkdebug option to emulate a small max_transfer at
> the protocol layer?

This sounds like the better option indeed.

Kevin