Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard
On 11/08/2016 05:03 AM, Peter Lieven wrote: > Am 25.10.2016 um 18:12 schrieb Eric Blake: >> On 10/25/2016 09:36 AM, Paolo Bonzini wrote: >>> >>> On 25/10/2016 16:35, Eric Blake wrote: So your argument is that we should always pass down every unaligned less-than-optimum discard request all the way to the hardware, rather than dropping it higher in the stack, even though discard requests are already advisory, in order to leave the hardware as the ultimate decision on whether to ignore the unaligned request? >>> Yes, I agree with Peter as to this. >> Okay, I'll work on patches. I think it counts as bug fix, so appropriate >> even if I miss soft freeze (I'd still like to get NBD write zero support >> into 2.8, since it already missed 2.7, but that one is still awaiting >> review with not much time left). >> > > Hi Eric, > > have you had time to look at this? > If you need help, let me know. Patches posted, but testing help would be appreciated since you have the actual hardware that exhibits the issue. I'm also trying to write a patch to extend the blkdebug driver to share this "feature" of a 15M page, and write a qemu-iotest to make it harder to regress in the future. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 7/8] quorum: Implement .bdrv_co_preadv/pwritev()
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > This enables byte granularity requests on quorum nodes. > > Note that the QMP events emitted by the driver are an external API that > we were careless enough to define as sector based. The offset and length > of requests reported in events are rounded down therefore. Would it be better to round offset down and length up? A length of 0 looks odd. Should we add more fields to the two affected events (QUORUM_FAILURE and QUORUM_REPORT_BAD)? We have to keep the existing fields for back-compat, but we could add new fields that give byte-based locations for management apps smart enough to use the new instead of the old (particularly since the old fields are named 'sector-num' and 'sectors-count'). > > Signed-off-by: Kevin Wolf> --- > block/quorum.c | 75 > ++ > 1 file changed, 33 insertions(+), 42 deletions(-) > > @@ -198,14 +198,17 @@ static void quorum_report_bad(QuorumOpType type, > uint64_t sector_num, > } > > qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name, > - sector_num, nb_sectors, _abort); > + offset / BDRV_SECTOR_SIZE, > + bytes / BDRV_SECTOR_SIZE, > _abort); Rounding the offset down makes sense, but rounding the bytes down can lead to weird messages. Blindly rounding it up to a sector boundary can also be wrong (example: writing 2 bytes at offset 511 really affects 1024 bytes when you consider that two sectors had to undergo read-modify-write). Don't we have a helper routine for determining the end boundary when we have to convert an offset and length to a courser alignment? > @@ -462,8 +461,8 @@ static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB > *acb, > va_list ap; > > va_start(ap, fmt); > -fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ", > -acb->sector_num, acb->nb_sectors); > +fprintf(stderr, "quorum: offset=%" PRIu64 " bytes=%" PRIu64 " ", > +acb->offset, acb->bytes); Might be worth a separate patch to get rid of fprintf and use correct error reporting. But not the work for this patch. Overall, I like that you are getting rid of more sector-based cruft. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 6/8] quorum: Avoid bdrv_aio_writev() for rewrites
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > Replacing it with bdrv_co_pwritev() prepares us for byte granularity > requests and gets us rid of the last bdrv_aio_*() user in quorum. > > Signed-off-by: Kevin Wolf> --- > block/quorum.c | 52 +--- > 1 file changed, 33 insertions(+), 19 deletions(-) > > diff --git a/block/quorum.c b/block/quorum.c > index b2bb3af..1426115 100644 > --- a/block/quorum.c > +++ b/block/quorum.c > @@ -134,6 +134,11 @@ struct QuorumAIOCB { > int children_read; /* how many children have been read from */ > }; > > +typedef struct QuorumCo { > +QuorumAIOCB *acb; > +int i; > +} QuorumCo; > + > @@ -586,11 +605,6 @@ free_exit: > return rewrite; > } > > -typedef struct QuorumCo { > -QuorumAIOCB *acb; > -int i; > -} QuorumCo; Code motion of a struct added earlier in this series; do you want to declare it up front to begin with to minimize churn? -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 4/8] quorum: Do cleanup in caller coroutine
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > Instead of calling quorum_aio_finalize() deeply nested in what used > to be an AIO callback, do it in the same functions that allocated the > AIOCB. > > Signed-off-by: Kevin Wolf> --- > block/quorum.c | 15 +-- > 1 file changed, 9 insertions(+), 6 deletions(-) > Looks clean, but again I'm weak enough on coroutines that I won't give R-b to an RFC -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 3/8] quorum: Implement .bdrv_co_readv/writev
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > Signed-off-by: Kevin Wolf> --- > block/quorum.c | 194 > ++--- > 1 file changed, 117 insertions(+), 77 deletions(-) > > + > +static void read_quorum_children_entry(void *opaque) > +{ > +QuorumCo *co = opaque; > +QuorumAIOCB *acb = co->acb; > +BDRVQuorumState *s = acb->bs->opaque; > +int i = co->i; > +int ret; > +co = NULL; /* Not valid after the first yield */ Why bother to invalidate co... > + > +acb->qcrs[i].bs = s->children[i]->bs; > +ret = bdrv_co_preadv(s->children[i], acb->sector_num * BDRV_SECTOR_SIZE, > + acb->nb_sectors * BDRV_SECTOR_SIZE, > + >qcrs[i].qiov, 0); > +quorum_aio_cb(>qcrs[i], ret); > +} ...when it isn't used later? Is it just for future-proofing edits made in later patches? > +static void write_quorum_entry(void *opaque) > +{ > +QuorumCo *co = opaque; > +QuorumAIOCB *acb = co->acb; > +BDRVQuorumState *s = acb->bs->opaque; > +int i = co->i; > +int ret; > +co = NULL; /* Not valid after the first yield */ > + > +acb->qcrs[i].bs = s->children[i]->bs; > +ret = bdrv_co_pwritev(s->children[i], acb->sector_num * BDRV_SECTOR_SIZE, > + acb->nb_sectors * BDRV_SECTOR_SIZE, acb->qiov, 0); > +quorum_aio_cb(>qcrs[i], ret); > } and again Otherwise, the conversion looks sane to me, but I'm just weak enough on coroutines that I won't give R-b while this is still in RFC -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 2/8] quorum: Remove s from quorum_aio_get() arguments
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > There is no point in passing the value of bs->opaque in order to > overwrite it with itself. > > Signed-off-by: Kevin Wolf> --- > block/quorum.c | 9 - > 1 file changed, 4 insertions(+), 5 deletions(-) > Reviewed-by: Eric Blake -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [RFC PATCH 1/8] coroutine: Introduce qemu_coroutine_enter_if_inactive()
On 11/10/2016 11:19 AM, Kevin Wolf wrote: > In the context of asynchronous work, if we have a worker coroutine that > didn't yield, the parent coroutine cannot be reentered because it hasn't > yielded yet. In this case we don't even have to reenter the parent > because it will see that the work is already done and won't even yield. > > Signed-off-by: Kevin Wolf> --- > include/qemu/coroutine.h | 6 ++ > util/qemu-coroutine.c| 8 > 2 files changed, 14 insertions(+) > > +++ b/util/qemu-coroutine.c > @@ -19,6 +19,7 @@ > #include "qemu/atomic.h" > #include "qemu/coroutine.h" > #include "qemu/coroutine_int.h" > +#include "block/aio.h" Why do you need this include? > > enum { > POOL_BATCH_SIZE = 64, > @@ -131,6 +132,13 @@ void qemu_coroutine_enter(Coroutine *co) > } > } > > +void qemu_coroutine_enter_if_inactive(Coroutine *co) > +{ > +if (!qemu_coroutine_entered(co)) { > +qemu_coroutine_enter(co); > +} > +} > + > void coroutine_fn qemu_coroutine_yield(void) > { > Coroutine *self = qemu_coroutine_self(); > Otherwise: Reviewed-by: Eric Blake -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
[Qemu-block] [PATCH 1/2] block: Return -ENOTSUP rather than assert on unaligned discards
Right now, the block layer rounds discard requests, so that individual drivers are able to assert that discard requests will never be unaligned. But there are some ISCSI devices that track and coalesce multiple unaligned requests, turning it into an actual discard if the requests eventually cover an entire page, which implies that it is better to always pass discard requests as low down the stack as possible. In isolation, this patch has no semantic effect, since the block layer currently never passes an unaligned request down. But the block layer already has code that silently ignores drivers that return -ENOTSUP for a discard request that cannot be honored (as well as drivers that return 0 even when nothing was done). So the next patch will update the block layer to fragment discard requests, so that clients are guaranteed that they are either dealing with an unaligned head or tail, or an aligned body, of the overall request, making it similar to the block layer semantics of write zero fragmentation. Signed-off-by: Eric BlakeCC: qemu-sta...@nongnu.org --- block/iscsi.c| 4 +++- block/qcow2.c| 4 block/sheepdog.c | 5 +++-- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 71bd523..0960929 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, int64_t offset, int count) struct IscsiTask iTask; struct unmap_list list; -assert(is_byte_request_lun_aligned(offset, count, iscsilun)); +if (!is_byte_request_lun_aligned(offset, count, iscsilun)) { +return -ENOTSUP; +} if (!iscsilun->lbp.lbpu) { /* UNMAP is not supported by the target */ diff --git a/block/qcow2.c b/block/qcow2.c index 6d5689a..9af00fd 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -2490,6 +2490,10 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs, int ret; BDRVQcow2State *s = bs->opaque; +if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) { +return -ENOTSUP; +} + qemu_co_mutex_lock(>lock); ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS, QCOW2_DISCARD_REQUEST, false); diff --git a/block/sheepdog.c b/block/sheepdog.c index 1fb9173..4c9af89 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset, iov.iov_len = sizeof(zero); discard_iov.iov = discard_iov.niov = 1; -assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0); -assert((count & (BDRV_SECTOR_SIZE - 1)) == 0); +if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) { +return -ENOTSUP; +} acb = sd_aio_setup(bs, _iov, offset >> BDRV_SECTOR_BITS, count >> BDRV_SECTOR_BITS); acb->aiocb_type = AIOCB_DISCARD_OBJ; -- 2.7.4
[Qemu-block] [PATCH 2/2] block: Pass unaligned discard requests to drivers
Discard is advisory, so rounding the requests to alignment boundaries is never semantically wrong from the data that the guest sees. But at least the Dell Equallogic iSCSI SANs has an interesting property that its advertised discard alignment is 15M, yet documents that discarding a sequence of 1M slices will eventually result in the 15M page being marked as discarded, and it is possible to observe which pages have been discarded. Between commits 9f1963b and b8d0a980, we converted the block layer to a byte-based interface that ultimately ignores any unaligned head or tail based on the driver's advertised discard granularity, which means that qemu 2.7 refuses to pass any discard request smaller than 15M down to the Dell Equallogic hardware. This is a slight regression in behavior compared to earlier qemu, where a guest executing discards in power-of-2 chunks used to be able to get every page discarded, but is now left with various pages still allocated because the guest requests did not align with the hardware's 15M pages. Since the SCSI specification says nothing about a minimum discard granularity, and only documents the preferred alignment, it is best if the block layer gives the driver every bit of information about discard requests, rather than rounding it to alignment boundaries early. Rework the block layer discard algorithm to mirror the write zero algorithm: always peel off any unaligned head or tail and manage that in isolation, then do the bulk of the request on an aligned boundary. The fallback when the driver returns -ENOTSUP for an unaligned request is to silently ignore that portion of the discard request; but for devices that can pass the partial request all the way down to hardware, this can result in the hardware coalescing requests and discarding aligned pages after all. Reported by: Peter LievenSigned-off-by: Eric Blake CC: qemu-sta...@nongnu.org --- block/io.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/block/io.c b/block/io.c index aa532a5..76c3030 100644 --- a/block/io.c +++ b/block/io.c @@ -2421,7 +2421,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, { BdrvTrackedRequest req; int max_pdiscard, ret; -int head, align; +int head, tail, align; if (!bs->drv) { return -ENOMEDIUM; @@ -2444,19 +2444,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, return 0; } -/* Discard is advisory, so ignore any unaligned head or tail */ +/* Discard is advisory, but some devices track and coalesce + * unaligned requests, so we must pass everything down rather than + * round here. Still, most devices will just silently ignore + * unaligned requests (by returning -ENOTSUP), so we must fragment + * the request accordingly. */ align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment); assert(align % bs->bl.request_alignment == 0); head = offset % align; -if (head) { -head = MIN(count, align - head); -count -= head; -offset += head; -} -count = QEMU_ALIGN_DOWN(count, align); -if (!count) { -return 0; -} +tail = (offset + count) % align; bdrv_inc_in_flight(bs); tracked_request_begin(, bs, offset, count, BDRV_TRACKED_DISCARD); @@ -2468,11 +2464,25 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX), align); -assert(max_pdiscard); +assert(max_pdiscard >= bs->bl.request_alignment); while (count > 0) { int ret; -int num = MIN(count, max_pdiscard); +int num = count; + +if (head) { +/* Make a small request up to the first aligned sector. */ +num = MIN(count, align - head); +head = (head + num) % align; +assert(num < max_pdiscard); +} else if (tail && num > align) { +/* Shorten the request to the last aligned sector. */ +num -= tail; +} +/* limit request size */ +if (num > max_pdiscard) { +num = max_pdiscard; +} if (bs->drv->bdrv_co_pdiscard) { ret = bs->drv->bdrv_co_pdiscard(bs, offset, num); -- 2.7.4
[Qemu-block] [PATCH for-2.8 0/2] pass partial discard requests all the way down
Peter reported a mild regression in qemu 2.7 when targetting the Dell Equallogic iSCSI, which advertizes a preferred and maximum unmap alignment at 15M. In qemu 2.6, trims not aligned to those boundaries still made it to the device, but in 2.7, the block layer is ignoring unaligned portions of guest requests, resulting in fewer blocks being actually trimmed. Since discard is advisary, it is borderline if this counts as a bug fix worthy for inclusion in 2.8, or if it should be deferred to 2.9. At any rate, while I was able to test that the patch didn't misbehave for me when I tweaked an NBD setup to force 15M alignment, I am unable to test that it makes an actual difference for Peter's hardware, and that needs to happen before anyone even thinks of applying this. Eric Blake (2): block: Return -ENOTSUP rather than assert on unaligned discards block: Pass unaligned discard requests to drivers block/io.c | 36 +++- block/iscsi.c| 4 +++- block/qcow2.c| 4 block/sheepdog.c | 5 +++-- 4 files changed, 33 insertions(+), 16 deletions(-) -- 2.7.4
[Qemu-block] [RFC PATCH 8/8] quorum: Inline quorum_fifo_aio_cb()
Inlining the function removes some boilerplace code and replaces recursion by a simple loop, so the code becomes somewhat easier to understand. Signed-off-by: Kevin Wolf--- block/quorum.c | 42 +- 1 file changed, 13 insertions(+), 29 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 45939c5..0e91d59 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -248,30 +248,6 @@ static void quorum_report_bad_acb(QuorumChildRequest *sacb, int ret) quorum_report_bad(type, acb->offset, acb->bytes, sacb->bs->node_name, ret); } -static int quorum_fifo_aio_cb(void *opaque, int ret) -{ -QuorumChildRequest *sacb = opaque; -QuorumAIOCB *acb = sacb->parent; -BDRVQuorumState *s = acb->bs->opaque; - -assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO); - -if (ret < 0) { -quorum_report_bad_acb(sacb, ret); - -/* We try to read next child in FIFO order if we fail to read */ -if (acb->children_read < s->num_children) { -return read_fifo_child(acb); -} -} - -acb->vote_ret = ret; - -/* FIXME: rewrite failed children if acb->children_read > 1? */ - -return ret; -} - static void quorum_report_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb, QuorumVoteValue *value) @@ -677,12 +653,20 @@ static int read_quorum_children(QuorumAIOCB *acb) static int read_fifo_child(QuorumAIOCB *acb) { BDRVQuorumState *s = acb->bs->opaque; -int n = acb->children_read++; -int ret; +int n, ret; + +/* We try to read the next child in FIFO order if we failed to read */ +do { +n = acb->children_read++; +acb->qcrs[n].bs = s->children[n]->bs; +ret = bdrv_co_preadv(s->children[n], acb->offset, acb->bytes, + acb->qiov, 0); +if (ret < 0) { +quorum_report_bad_acb(>qcrs[n], ret); +} +} while(ret < 0 && acb->children_read < s->num_children); -acb->qcrs[n].bs = s->children[n]->bs; -ret = bdrv_co_preadv(s->children[n], acb->offset, acb->bytes, acb->qiov, 0); -ret = quorum_fifo_aio_cb(>qcrs[n], ret); +/* FIXME: rewrite failed children if acb->children_read > 1? */ return ret; } -- 1.8.3.1
[Qemu-block] [RFC PATCH 3/8] quorum: Implement .bdrv_co_readv/writev
Signed-off-by: Kevin Wolf--- block/quorum.c | 194 ++--- 1 file changed, 117 insertions(+), 77 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index dfa9fd3..3cb579e 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -97,7 +97,7 @@ typedef struct QuorumAIOCB QuorumAIOCB; * $children_count QuorumChildRequest. */ typedef struct QuorumChildRequest { -BlockAIOCB *aiocb; +BlockDriverState *bs; QEMUIOVector qiov; uint8_t *buf; int ret; @@ -110,7 +110,8 @@ typedef struct QuorumChildRequest { * used to do operations on each children and track overall progress. */ struct QuorumAIOCB { -BlockAIOCB common; +BlockDriverState *bs; +Coroutine *co; /* Request metadata */ uint64_t sector_num; @@ -129,36 +130,18 @@ struct QuorumAIOCB { QuorumVotes votes; bool is_read; +bool has_completed; int vote_ret; int children_read; /* how many children have been read from */ }; static bool quorum_vote(QuorumAIOCB *acb); -static void quorum_aio_cancel(BlockAIOCB *blockacb) -{ -QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common); -BDRVQuorumState *s = acb->common.bs->opaque; -int i; - -/* cancel all callbacks */ -for (i = 0; i < s->num_children; i++) { -if (acb->qcrs[i].aiocb) { -bdrv_aio_cancel_async(acb->qcrs[i].aiocb); -} -} -} - -static AIOCBInfo quorum_aiocb_info = { -.aiocb_size = sizeof(QuorumAIOCB), -.cancel_async = quorum_aio_cancel, -}; - static void quorum_aio_finalize(QuorumAIOCB *acb) { -acb->common.cb(acb->common.opaque, acb->vote_ret); +acb->has_completed = true; g_free(acb->qcrs); -qemu_aio_unref(acb); +qemu_coroutine_enter_if_inactive(acb->co); } static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b) @@ -174,14 +157,14 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b) static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, QEMUIOVector *qiov, uint64_t sector_num, - int nb_sectors, - BlockCompletionFunc *cb, - void *opaque) + int nb_sectors) { BDRVQuorumState *s = bs->opaque; -QuorumAIOCB *acb = qemu_aio_get(_aiocb_info, bs, cb, opaque); +QuorumAIOCB *acb = g_new(QuorumAIOCB, 1); int i; +acb->co = qemu_coroutine_self(); +acb->bs = bs; acb->sector_num = sector_num; acb->nb_sectors = nb_sectors; acb->qiov = qiov; @@ -191,6 +174,7 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, acb->rewrite_count = 0; acb->votes.compare = quorum_sha256_compare; QLIST_INIT(>votes.vote_list); +acb->has_completed = false; acb->is_read = false; acb->vote_ret = 0; @@ -217,7 +201,7 @@ static void quorum_report_bad(QuorumOpType type, uint64_t sector_num, static void quorum_report_failure(QuorumAIOCB *acb) { -const char *reference = bdrv_get_device_or_node_name(acb->common.bs); +const char *reference = bdrv_get_device_or_node_name(acb->bs); qapi_event_send_quorum_failure(reference, acb->sector_num, acb->nb_sectors, _abort); } @@ -226,7 +210,7 @@ static int quorum_vote_error(QuorumAIOCB *acb); static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb) { -BDRVQuorumState *s = acb->common.bs->opaque; +BDRVQuorumState *s = acb->bs->opaque; if (acb->success_count < s->threshold) { acb->vote_ret = quorum_vote_error(acb); @@ -252,7 +236,7 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } -static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb); +static int read_fifo_child(QuorumAIOCB *acb); static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) { @@ -272,14 +256,14 @@ static void quorum_report_bad_acb(QuorumChildRequest *sacb, int ret) QuorumAIOCB *acb = sacb->parent; QuorumOpType type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE; quorum_report_bad(type, acb->sector_num, acb->nb_sectors, - sacb->aiocb->bs->node_name, ret); + sacb->bs->node_name, ret); } -static void quorum_fifo_aio_cb(void *opaque, int ret) +static int quorum_fifo_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; QuorumAIOCB *acb = sacb->parent; -BDRVQuorumState *s = acb->common.bs->opaque; +BDRVQuorumState *s = acb->bs->opaque; assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO); @@ -288,8 +272,7 @@ static void quorum_fifo_aio_cb(void *opaque, int ret) /* We try to read next child in FIFO order if we fail to read */ if (acb->children_read < s->num_children) {
[Qemu-block] [RFC PATCH 2/8] quorum: Remove s from quorum_aio_get() arguments
There is no point in passing the value of bs->opaque in order to overwrite it with itself. Signed-off-by: Kevin Wolf--- block/quorum.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d122299..dfa9fd3 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -171,18 +171,17 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b) return a->l == b->l; } -static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, - BlockDriverState *bs, +static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, QEMUIOVector *qiov, uint64_t sector_num, int nb_sectors, BlockCompletionFunc *cb, void *opaque) { +BDRVQuorumState *s = bs->opaque; QuorumAIOCB *acb = qemu_aio_get(_aiocb_info, bs, cb, opaque); int i; -acb->common.bs->opaque = s; acb->sector_num = sector_num; acb->nb_sectors = nb_sectors; acb->qiov = qiov; @@ -691,7 +690,7 @@ static BlockAIOCB *quorum_aio_readv(BlockDriverState *bs, void *opaque) { BDRVQuorumState *s = bs->opaque; -QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, +QuorumAIOCB *acb = quorum_aio_get(bs, qiov, sector_num, nb_sectors, cb, opaque); acb->is_read = true; acb->children_read = 0; @@ -711,7 +710,7 @@ static BlockAIOCB *quorum_aio_writev(BlockDriverState *bs, void *opaque) { BDRVQuorumState *s = bs->opaque; -QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors, +QuorumAIOCB *acb = quorum_aio_get(bs, qiov, sector_num, nb_sectors, cb, opaque); int i; -- 1.8.3.1
[Qemu-block] [RFC PATCH 4/8] quorum: Do cleanup in caller coroutine
Instead of calling quorum_aio_finalize() deeply nested in what used to be an AIO callback, do it in the same functions that allocated the AIOCB. Signed-off-by: Kevin Wolf--- block/quorum.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 3cb579e..a225327 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -139,9 +139,8 @@ static bool quorum_vote(QuorumAIOCB *acb); static void quorum_aio_finalize(QuorumAIOCB *acb) { -acb->has_completed = true; g_free(acb->qcrs); -qemu_coroutine_enter_if_inactive(acb->co); +g_free(acb); } static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b) @@ -233,7 +232,8 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) return; } -quorum_aio_finalize(acb); +acb->has_completed = true; +qemu_coroutine_enter_if_inactive(acb->co); } static int read_fifo_child(QuorumAIOCB *acb); @@ -279,7 +279,7 @@ static int quorum_fifo_aio_cb(void *opaque, int ret) acb->vote_ret = ret; /* FIXME: rewrite failed children if acb->children_read > 1? */ -quorum_aio_finalize(acb); + return ret; } @@ -317,7 +317,8 @@ static void quorum_aio_cb(void *opaque, int ret) /* if no rewrite is done the code will finish right away */ if (!rewrite) { -quorum_aio_finalize(acb); +acb->has_completed = true; +qemu_coroutine_enter_if_inactive(acb->co); } } @@ -716,7 +717,8 @@ static int quorum_co_readv(BlockDriverState *bs, } else { ret = read_fifo_child(acb); } -g_free(acb); +quorum_aio_finalize(acb); + return ret; } @@ -759,6 +761,7 @@ static int quorum_co_writev(BlockDriverState *bs, } ret = acb->vote_ret; +quorum_aio_finalize(acb); return ret; } -- 1.8.3.1
[Qemu-block] [RFC PATCH 1/8] coroutine: Introduce qemu_coroutine_enter_if_inactive()
In the context of asynchronous work, if we have a worker coroutine that didn't yield, the parent coroutine cannot be reentered because it hasn't yielded yet. In this case we don't even have to reenter the parent because it will see that the work is already done and won't even yield. Signed-off-by: Kevin Wolf--- include/qemu/coroutine.h | 6 ++ util/qemu-coroutine.c| 8 2 files changed, 14 insertions(+) diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h index e6a60d5..12584ed 100644 --- a/include/qemu/coroutine.h +++ b/include/qemu/coroutine.h @@ -71,6 +71,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque); void qemu_coroutine_enter(Coroutine *coroutine); /** + * Transfer control to a coroutine if it's not active (i.e. part of the call + * stack of the running coroutine). Otherwise, do nothing. + */ +void qemu_coroutine_enter_if_inactive(Coroutine *co); + +/** * Transfer control back to a coroutine's caller * * This function does not return until the coroutine is re-entered using diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c index 737bffa..a06befe 100644 --- a/util/qemu-coroutine.c +++ b/util/qemu-coroutine.c @@ -19,6 +19,7 @@ #include "qemu/atomic.h" #include "qemu/coroutine.h" #include "qemu/coroutine_int.h" +#include "block/aio.h" enum { POOL_BATCH_SIZE = 64, @@ -131,6 +132,13 @@ void qemu_coroutine_enter(Coroutine *co) } } +void qemu_coroutine_enter_if_inactive(Coroutine *co) +{ +if (!qemu_coroutine_entered(co)) { +qemu_coroutine_enter(co); +} +} + void coroutine_fn qemu_coroutine_yield(void) { Coroutine *self = qemu_coroutine_self(); -- 1.8.3.1
[Qemu-block] [RFC PATCH 0/8] quorum: Implement .bdrv_co_preadv/pwritev()
This is part of a series that I'm working on and that aims to remove the bdrv_aio_*() emulation in io.c. blkdebug and blkverify were easy, but for quorum I need a few more patches, so I'm sending this out as an RFC while I continue work on the rest (QED, and then possibly some polishing). After the series, quorum doesn't only use the preferred interfaces, but the code becomes a little easier to follow and byte granularity requests are supported. There is probably still some potential for additional cleanups in a follow-up series, but this should already be enough to be worth merging. Kevin Wolf (8): coroutine: Introduce qemu_coroutine_enter_if_inactive() quorum: Remove s from quorum_aio_get() arguments quorum: Implement .bdrv_co_readv/writev quorum: Do cleanup in caller coroutine quorum: Inline quorum_aio_cb() quorum: Avoid bdrv_aio_writev() for rewrites quorum: Implement .bdrv_co_preadv/pwritev() quorum: Inline quorum_fifo_aio_cb() block/quorum.c | 373 +-- include/qemu/coroutine.h | 6 + util/qemu-coroutine.c| 8 + 3 files changed, 211 insertions(+), 176 deletions(-) -- 1.8.3.1
[Qemu-block] [RFC PATCH 6/8] quorum: Avoid bdrv_aio_writev() for rewrites
Replacing it with bdrv_co_pwritev() prepares us for byte granularity requests and gets us rid of the last bdrv_aio_*() user in quorum. Signed-off-by: Kevin Wolf--- block/quorum.c | 52 +--- 1 file changed, 33 insertions(+), 19 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index b2bb3af..1426115 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -134,6 +134,11 @@ struct QuorumAIOCB { int children_read; /* how many children have been read from */ }; +typedef struct QuorumCo { +QuorumAIOCB *acb; +int i; +} QuorumCo; + static bool quorum_vote(QuorumAIOCB *acb); static void quorum_aio_finalize(QuorumAIOCB *acb) @@ -218,15 +223,6 @@ static bool quorum_has_too_much_io_failed(QuorumAIOCB *acb) return false; } -static void quorum_rewrite_aio_cb(void *opaque, int ret) -{ -QuorumAIOCB *acb = opaque; - -/* one less rewrite to do */ -acb->rewrite_count--; -qemu_coroutine_enter_if_inactive(acb->co); -} - static int read_fifo_child(QuorumAIOCB *acb); static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) @@ -293,7 +289,25 @@ static void quorum_report_bad_versions(BDRVQuorumState *s, } } -static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb, +static void quorum_rewrite_entry(void *opaque) +{ +QuorumCo *co = opaque; +QuorumAIOCB *acb = co->acb; +BDRVQuorumState *s = acb->bs->opaque; +int ret; + +ret = bdrv_co_pwritev(s->children[co->i], + acb->sector_num * BDRV_SECTOR_SIZE, + acb->nb_sectors * BDRV_SECTOR_SIZE, + acb->qiov, 0); +(void) ret; + +/* one less rewrite to do */ +acb->rewrite_count--; +qemu_coroutine_enter_if_inactive(acb->co); +} + +static bool quorum_rewrite_bad_versions(QuorumAIOCB *acb, QuorumVoteValue *value) { QuorumVoteVersion *version; @@ -321,9 +335,14 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb, continue; } QLIST_FOREACH(item, >items, next) { -bdrv_aio_writev(s->children[item->index], acb->sector_num, -acb->qiov, acb->nb_sectors, quorum_rewrite_aio_cb, -acb); +Coroutine *co; +QuorumCo data = { +.acb = acb, +.i = item->index, +}; + +co = qemu_coroutine_create(quorum_rewrite_entry, ); +qemu_coroutine_enter(co); } } @@ -577,7 +596,7 @@ static bool quorum_vote(QuorumAIOCB *acb) /* corruption correction is enabled */ if (s->rewrite_corrupted) { -rewrite = quorum_rewrite_bad_versions(s, acb, >value); +rewrite = quorum_rewrite_bad_versions(acb, >value); } free_exit: @@ -586,11 +605,6 @@ free_exit: return rewrite; } -typedef struct QuorumCo { -QuorumAIOCB *acb; -int i; -} QuorumCo; - static void read_quorum_children_entry(void *opaque) { QuorumCo *co = opaque; -- 1.8.3.1
[Qemu-block] [RFC PATCH 7/8] quorum: Implement .bdrv_co_preadv/pwritev()
This enables byte granularity requests on quorum nodes. Note that the QMP events emitted by the driver are an external API that we were careless enough to define as sector based. The offset and length of requests reported in events are rounded down therefore. Signed-off-by: Kevin Wolf--- block/quorum.c | 75 ++ 1 file changed, 33 insertions(+), 42 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 1426115..45939c5 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -114,8 +114,8 @@ struct QuorumAIOCB { Coroutine *co; /* Request metadata */ -uint64_t sector_num; -int nb_sectors; +uint64_t offset; +uint64_t bytes; QEMUIOVector *qiov; /* calling IOV */ @@ -159,8 +159,8 @@ static bool quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b) static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, QEMUIOVector *qiov, - uint64_t sector_num, - int nb_sectors) + uint64_t offset, + uint64_t bytes) { BDRVQuorumState *s = bs->opaque; QuorumAIOCB *acb = g_new(QuorumAIOCB, 1); @@ -168,8 +168,8 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, acb->co = qemu_coroutine_self(); acb->bs = bs; -acb->sector_num = sector_num; -acb->nb_sectors = nb_sectors; +acb->offset = offset; +acb->bytes = bytes; acb->qiov = qiov; acb->qcrs = g_new0(QuorumChildRequest, s->num_children); acb->count = 0; @@ -189,8 +189,8 @@ static QuorumAIOCB *quorum_aio_get(BlockDriverState *bs, return acb; } -static void quorum_report_bad(QuorumOpType type, uint64_t sector_num, - int nb_sectors, char *node_name, int ret) +static void quorum_report_bad(QuorumOpType type, uint64_t offset, + uint64_t bytes, char *node_name, int ret) { const char *msg = NULL; if (ret < 0) { @@ -198,14 +198,17 @@ static void quorum_report_bad(QuorumOpType type, uint64_t sector_num, } qapi_event_send_quorum_report_bad(type, !!msg, msg, node_name, - sector_num, nb_sectors, _abort); + offset / BDRV_SECTOR_SIZE, + bytes / BDRV_SECTOR_SIZE, _abort); } static void quorum_report_failure(QuorumAIOCB *acb) { const char *reference = bdrv_get_device_or_node_name(acb->bs); -qapi_event_send_quorum_failure(reference, acb->sector_num, - acb->nb_sectors, _abort); +qapi_event_send_quorum_failure(reference, + acb->offset / BDRV_SECTOR_SIZE, + acb->bytes / BDRV_SECTOR_SIZE, + _abort); } static int quorum_vote_error(QuorumAIOCB *acb); @@ -242,8 +245,7 @@ static void quorum_report_bad_acb(QuorumChildRequest *sacb, int ret) { QuorumAIOCB *acb = sacb->parent; QuorumOpType type = acb->is_read ? QUORUM_OP_TYPE_READ : QUORUM_OP_TYPE_WRITE; -quorum_report_bad(type, acb->sector_num, acb->nb_sectors, - sacb->bs->node_name, ret); +quorum_report_bad(type, acb->offset, acb->bytes, sacb->bs->node_name, ret); } static int quorum_fifo_aio_cb(void *opaque, int ret) @@ -282,8 +284,7 @@ static void quorum_report_bad_versions(BDRVQuorumState *s, continue; } QLIST_FOREACH(item, >items, next) { -quorum_report_bad(QUORUM_OP_TYPE_READ, acb->sector_num, - acb->nb_sectors, +quorum_report_bad(QUORUM_OP_TYPE_READ, acb->offset, acb->bytes, s->children[item->index]->bs->node_name, 0); } } @@ -296,9 +297,7 @@ static void quorum_rewrite_entry(void *opaque) BDRVQuorumState *s = acb->bs->opaque; int ret; -ret = bdrv_co_pwritev(s->children[co->i], - acb->sector_num * BDRV_SECTOR_SIZE, - acb->nb_sectors * BDRV_SECTOR_SIZE, +ret = bdrv_co_pwritev(s->children[co->i], acb->offset, acb->bytes, acb->qiov, 0); (void) ret; @@ -462,8 +461,8 @@ static void GCC_FMT_ATTR(2, 3) quorum_err(QuorumAIOCB *acb, va_list ap; va_start(ap, fmt); -fprintf(stderr, "quorum: sector_num=%" PRId64 " nb_sectors=%d ", -acb->sector_num, acb->nb_sectors); +fprintf(stderr, "quorum: offset=%" PRIu64 " bytes=%" PRIu64 " ", +acb->offset, acb->bytes); vfprintf(stderr, fmt, ap); fprintf(stderr, "\n"); va_end(ap); @@ -481,9 +480,8 @@ static bool quorum_compare(QuorumAIOCB *acb, if (s->is_blkverify) { offset = qemu_iovec_compare(a, b); if (offset != -1) { -quorum_err(acb,
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/6] block-backend: Always notify on blk_eject
Ping, Kevin: Look ok? --js On 11/07/2016 04:13 PM, John Snow wrote: Requires patches in my IDE branch, for context. This series changes blk_eject (used for a software-initiated eject request) to always trigger a QMP tray event, in contrast to the current behavior where a tray event only occurs if a medium is already in the tray. V2: Now with tests. (Kevin) For convenience, this branch is available at: https://github.com/jnsnow/qemu.git branch tray-always-notify https://github.com/jnsnow/qemu/tree/tray-always-notify This version is tagged tray-always-notify-v2: https://github.com/jnsnow/qemu/releases/tag/tray-always-notify-v2 John Snow (6): block-backend: Always notify on blk_eject libqtest: add qmp_eventwait_ref libqos/ahci: Support expected errors libqos/ahci: Add ATAPI tray macros libqos/ahci: Add get_sense and test_ready ahci-test: add QMP tray test for ATAPI block/block-backend.c | 13 +++ tests/ahci-test.c | 98 +++ tests/libqos/ahci.c | 96 ++--- tests/libqos/ahci.h | 26 -- tests/libqtest.c | 13 +-- tests/libqtest.h | 22 6 files changed, 252 insertions(+), 16 deletions(-) -- —js
[Qemu-block] [PULL 39/47] megasas: remove unnecessary megasas_use_msix()
From: Cao jinAlso move certain hunk above, to place msix init related code together. CC: Hannes Reinecke CC: Paolo Bonzini CC: Markus Armbruster CC: Marcel Apfelbaum CC: Michael S. Tsirkin Reviewed-by: Markus Armbruster Signed-off-by: Cao jin Acked-by: Marcel Apfelbaum Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin --- hw/scsi/megasas.c | 19 ++- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c index 6cef9a3..ba79e7a 100644 --- a/hw/scsi/megasas.c +++ b/hw/scsi/megasas.c @@ -155,11 +155,6 @@ static bool megasas_use_queue64(MegasasState *s) return s->flags & MEGASAS_MASK_USE_QUEUE64; } -static bool megasas_use_msix(MegasasState *s) -{ -return s->msix != ON_OFF_AUTO_OFF; -} - static bool megasas_is_jbod(MegasasState *s) { return s->flags & MEGASAS_MASK_USE_JBOD; @@ -2299,9 +2294,7 @@ static void megasas_scsi_uninit(PCIDevice *d) { MegasasState *s = MEGASAS(d); -if (megasas_use_msix(s)) { -msix_uninit(d, >mmio_io, >mmio_io); -} +msix_uninit(d, >mmio_io, >mmio_io); msi_uninit(d); } @@ -2353,7 +2346,7 @@ static void megasas_scsi_realize(PCIDevice *dev, Error **errp) memory_region_init_io(>mmio_io, OBJECT(s), _mmio_ops, s, "megasas-mmio", 0x4000); -if (megasas_use_msix(s)) { +if (s->msix != ON_OFF_AUTO_OFF) { ret = msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000, >mmio_io, b->mmio_bar, 0x3800, 0x68, ); /* Any error other than -ENOTSUP(board's MSI support is broken) @@ -2373,6 +2366,10 @@ static void megasas_scsi_realize(PCIDevice *dev, Error **errp) error_free(err); } +if (msix_enabled(dev)) { +msix_vector_use(dev, 0); +} + memory_region_init_io(>port_io, OBJECT(s), _port_ops, s, "megasas-io", 256); memory_region_init_io(>queue_io, OBJECT(s), _queue_ops, s, @@ -2388,10 +2385,6 @@ static void megasas_scsi_realize(PCIDevice *dev, Error **errp) pci_register_bar(dev, b->mmio_bar, bar_type, >mmio_io); pci_register_bar(dev, 3, bar_type, >queue_io); -if (megasas_use_msix(s)) { -msix_vector_use(dev, 0); -} - s->fw_state = MFI_FWSTATE_READY; if (!s->sas_addr) { s->sas_addr = ((NAA_LOCALLY_ASSIGNED_ID << 24) | -- MST
[Qemu-block] [PULL 36/47] pci: Convert msix_init() to Error and fix callers to check it
From: Cao jinmsix_init() reports errors with error_report(), which is wrong when it's used in realize(). The same issue was fixed for msi_init() in commit 1108b2f. For some devices(like e1000e, vmxnet3) who won't fail because of msix_init's failure, suppress the error report by passing NULL error object. Bonus: add comment for msix_init. CC: Jiri Pirko CC: Gerd Hoffmann CC: Dmitry Fleytman CC: Jason Wang CC: Michael S. Tsirkin CC: Hannes Reinecke CC: Paolo Bonzini CC: Alex Williamson CC: Markus Armbruster CC: Marcel Apfelbaum Reviewed-by: Markus Armbruster Signed-off-by: Cao jin Acked-by: Marcel Apfelbaum Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin --- include/hw/pci/msix.h | 5 +++-- hw/block/nvme.c| 5 - hw/misc/ivshmem.c | 8 hw/net/e1000e.c| 6 +- hw/net/rocker/rocker.c | 7 ++- hw/net/vmxnet3.c | 8 ++-- hw/pci/msix.c | 34 +- hw/scsi/megasas.c | 5 - hw/usb/hcd-xhci.c | 13 - hw/vfio/pci.c | 8 ++-- hw/virtio/virtio-pci.c | 11 +-- 11 files changed, 80 insertions(+), 30 deletions(-) diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h index 048a29d..1f27658 100644 --- a/include/hw/pci/msix.h +++ b/include/hw/pci/msix.h @@ -9,9 +9,10 @@ MSIMessage msix_get_message(PCIDevice *dev, unsigned int vector); int msix_init(PCIDevice *dev, unsigned short nentries, MemoryRegion *table_bar, uint8_t table_bar_nr, unsigned table_offset, MemoryRegion *pba_bar, - uint8_t pba_bar_nr, unsigned pba_offset, uint8_t cap_pos); + uint8_t pba_bar_nr, unsigned pba_offset, uint8_t cap_pos, + Error **errp); int msix_init_exclusive_bar(PCIDevice *dev, unsigned short nentries, -uint8_t bar_nr); +uint8_t bar_nr, Error **errp); void msix_write_config(PCIDevice *dev, uint32_t address, uint32_t val, int len); diff --git a/hw/block/nvme.c b/hw/block/nvme.c index d479fd2..2d703c8 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -831,6 +831,7 @@ static int nvme_init(PCIDevice *pci_dev) { NvmeCtrl *n = NVME(pci_dev); NvmeIdCtrl *id = >id_ctrl; +Error *err = NULL; int i; int64_t bs_size; @@ -872,7 +873,9 @@ static int nvme_init(PCIDevice *pci_dev) pci_register_bar(>parent_obj, 0, PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64, >iomem); -msix_init_exclusive_bar(>parent_obj, n->num_queues, 4); +if (msix_init_exclusive_bar(>parent_obj, n->num_queues, 4, )) { +error_report_err(err); +} id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID)); id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID)); diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index 230e51b..ca6f804 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -749,13 +749,13 @@ static void ivshmem_reset(DeviceState *d) } } -static int ivshmem_setup_interrupts(IVShmemState *s) +static int ivshmem_setup_interrupts(IVShmemState *s, Error **errp) { /* allocate QEMU callback data for receiving interrupts */ s->msi_vectors = g_malloc0(s->vectors * sizeof(MSIVector)); if (ivshmem_has_feature(s, IVSHMEM_MSI)) { -if (msix_init_exclusive_bar(PCI_DEVICE(s), s->vectors, 1)) { +if (msix_init_exclusive_bar(PCI_DEVICE(s), s->vectors, 1, errp)) { return -1; } @@ -897,8 +897,8 @@ static void ivshmem_common_realize(PCIDevice *dev, Error **errp) qemu_chr_fe_set_handlers(>server_chr, ivshmem_can_receive, ivshmem_read, NULL, s, NULL, true); -if (ivshmem_setup_interrupts(s) < 0) { -error_setg(errp, "failed to initialize interrupts"); +if (ivshmem_setup_interrupts(s, errp) < 0) { +error_prepend(errp, "Failed to initialize interrupts: "); return; } } diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c index 89f96eb..bf32aca 100644 --- a/hw/net/e1000e.c +++ b/hw/net/e1000e.c @@ -292,7 +292,11 @@ e1000e_init_msix(E1000EState *s) E1000E_MSIX_IDX, E1000E_MSIX_TABLE, >msix, E1000E_MSIX_IDX, E1000E_MSIX_PBA, -0xA0); +0xA0, NULL); + +/* Any error other than -ENOTSUP(board's MSI support is broken) + * is a programming error. Fall back to INTx silently on -ENOTSUP */ +assert(!res || res == -ENOTSUP); if (res < 0) {
[Qemu-block] [PULL 37/47] megasas: change behaviour of msix switch
From: Cao jinResolve the TODO, msix=auto means msix on; if user specify msix=on, then device creation fail on msix_init failure. Also undo the overwrites of user configuration of msix. CC: Michael S. Tsirkin CC: Hannes Reinecke CC: Paolo Bonzini CC: Markus Armbruster CC: Marcel Apfelbaum Reviewed-by: Markus Armbruster Signed-off-by: Cao jin Acked-by: Marcel Apfelbaum Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin --- hw/scsi/megasas.c | 28 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c index fada834..6cef9a3 100644 --- a/hw/scsi/megasas.c +++ b/hw/scsi/megasas.c @@ -2353,19 +2353,31 @@ static void megasas_scsi_realize(PCIDevice *dev, Error **errp) memory_region_init_io(>mmio_io, OBJECT(s), _mmio_ops, s, "megasas-mmio", 0x4000); +if (megasas_use_msix(s)) { +ret = msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000, +>mmio_io, b->mmio_bar, 0x3800, 0x68, ); +/* Any error other than -ENOTSUP(board's MSI support is broken) + * is a programming error */ +assert(!ret || ret == -ENOTSUP); +if (ret && s->msix == ON_OFF_AUTO_ON) { +/* Can't satisfy user's explicit msix=on request, fail */ +error_append_hint(, "You have to use msix=auto (default) or " +"msix=off with this machine type.\n"); +/* No instance_finalize method, need to free the resource here */ +object_unref(OBJECT(>mmio_io)); +error_propagate(errp, err); +return; +} +assert(!err || s->msix == ON_OFF_AUTO_AUTO); +/* With msix=auto, we fall back to MSI off silently */ +error_free(err); +} + memory_region_init_io(>port_io, OBJECT(s), _port_ops, s, "megasas-io", 256); memory_region_init_io(>queue_io, OBJECT(s), _queue_ops, s, "megasas-queue", 0x4); -if (megasas_use_msix(s) && -msix_init(dev, 15, >mmio_io, b->mmio_bar, 0x2000, - >mmio_io, b->mmio_bar, 0x3800, 0x68, )) { -/*TODO: check msix_init's error, and should fail on msix=on */ -error_report_err(err); -s->msix = ON_OFF_AUTO_OFF; -} - if (pci_is_express(dev)) { pcie_endpoint_cap_init(dev, 0xa0); } -- MST
Re: [Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
Am 10.11.2016 um 03:11 hat Fam Zheng geschrieben: > On Wed, 11/09 14:06, Eric Blake wrote: > > On 11/09/2016 07:49 AM, Stefan Hajnoczi wrote: > > > On Tue, Nov 08, 2016 at 04:52:15PM -0600, Eric Blake wrote: > > >> Commit 443668ca rewrote the write_zeroes logic to guarantee that > > >> an unaligned request never crosses a cluster boundary. But > > >> in the rewrite, the new code assumed that at most one iteration > > >> would be needed to get to an alignment boundary. > > >> > > >> However, it is easy to trigger an assertion failure: the Linux > > >> kernel limits loopback devices to advertise a max_transfer of > > >> only 64k. Any operation that requires falling back to writes > > >> rather than more efficient zeroing must obey max_transfer during > > >> that fallback, which means an unaligned head may require multiple > > >> iterations of the write fallbacks before reaching the aligned > > >> boundaries, when layering a format with clusters larger than 64k > > >> atop the protocol of file access to a loopback device. > > >> > > >> Test case: > > >> > > >> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M > > >> $ losetup /dev/loop2 /path/to/file > > >> $ qemu-io -f qcow2 /dev/loop2 > > >> qemu-io> w 7m 1k > > >> qemu-io> w -z 8003584 2093056 > > > > > > Please include a qemu-iotests test case to protect against regressions. > > > > None of the existing qemu-iotests use losetup; I guess the closest thing > > to do is crib from a test that uses passwordless sudo? > > > > It will certainly be a separate commit, but I'll give it my best shot to > > post something soon. > > Alternatively, maybe add a blkdebug option to emulate a small max_transfer at > the protocol layer? This sounds like the better option indeed. Kevin