On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
>
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
>
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified
> bdev direct-io")
> Signed-off-by: Martin Wilck <[email protected]>
> ---
> fs/block_dev.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct
> iov_iter *iter,
>
> ret = bio_iov_iter_get_pages(&bio, iter);
> if (unlikely(ret))
> - return ret;
> + goto out;
> +
> + while (ret == 0 &&
> + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> + ret = bio_iov_iter_get_pages(&bio, iter);
> +
> ret = bio.bi_iter.bi_size;
>
> if (iov_iter_rw(iter) == READ) {
> @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct
> iov_iter *iter,
> put_page(bvec->bv_page);
> }
>
> +out:
> if (vecs != inline_vecs)
> kfree(vecs);
>
You might put the 'vecs' leak fix into another patch, and resue the
current code block for that.
Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
what do you think about the following way?
diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..23dd4c163dfc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -904,15 +904,7 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
-/**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
- * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
- *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -951,6 +943,28 @@ int bio_iov_iter_get_pages(struct bio *bio, struct
iov_iter *iter)
iov_iter_advance(iter, size);
return 0;
}
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+ int ret;
+ unsigned int size;
+
+ do {
+ size = bio->bi_iter.bi_size;
+ ret = __bio_iov_iter_get_pages(bio, iter);
+ } while (!bio_full(bio) && iov_iter_count(iter) > 0 &&
+ bio->bi_iter.bi_size > size);
+
+ return bio->bi_iter.bi_size > 0 ? 0 : ret;
+}
EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
static void submit_bio_wait_endio(struct bio *bio)
Thanks,
Ming