bio_iov_iter_get_pages() returns only pages for a single non-empty
segment of the input iov_iter's iovec. This may be far fewer than the number
of pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number
of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
short writes or reads may occur for direct synchronous IOs with multiple
iovec slots (such as those generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which
has been observed to cause data corruption in certain workloads.
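
The effect of looping can be sketched in user space. This is only an
illustrative model, not kernel code: struct mock_iter, struct mock_bio,
mock_get_pages(), fill_once() and fill_loop() are hypothetical stand-ins,
and each segment is assumed to occupy exactly one vector slot.

```c
#include <assert.h>
#include <stddef.h>

struct seg { size_t len; };             /* one iovec slot, simplified */

struct mock_iter {                      /* hypothetical stand-in for struct iov_iter */
	const struct seg *segs;
	size_t nr_segs, idx;
};

struct mock_bio {                       /* hypothetical stand-in for struct bio */
	size_t vcnt, max_vecs, size;
};

/* Bytes still left in the iterator (models iov_iter_count()). */
static size_t iter_count(const struct mock_iter *it)
{
	size_t n = 0;

	for (size_t i = it->idx; i < it->nr_segs; i++)
		n += it->segs[i].len;
	return n;
}

/* Models the described behaviour: at most ONE non-empty segment per call. */
static int mock_get_pages(struct mock_bio *bio, struct mock_iter *it)
{
	while (it->idx < it->nr_segs && it->segs[it->idx].len == 0)
		it->idx++;                      /* skip empty segments */
	if (it->idx == it->nr_segs)
		return -1;                      /* nothing left; stands in for an errno */
	assert(bio->vcnt < bio->max_vecs);  /* caller must leave room */
	bio->size += it->segs[it->idx].len; /* assumption: one slot per segment */
	bio->vcnt++;
	it->idx++;
	return 0;
}

/* Single call, as before the patch: returns the bytes mapped. */
size_t fill_once(struct mock_bio *bio, struct mock_iter *it)
{
	return mock_get_pages(bio, it) ? 0 : bio->size;
}

/* Repeated calls, mirroring the loop condition added by the patch. */
size_t fill_loop(struct mock_bio *bio, struct mock_iter *it)
{
	int ret = mock_get_pages(bio, it);

	if (ret)
		return 0;
	while (ret == 0 && bio->vcnt < bio->max_vecs && iter_count(it) > 0)
		ret = mock_get_pages(bio, it);
	return bio->size;
}
```

With segments of 512, 0, 1024 and 2048 bytes and four free slots,
fill_once() in this model maps only the first 512 bytes, while fill_loop()
maps all 3584; with only two slots, fill_loop() stops at 1536.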

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array referencing the same
page (the byte ranges are guaranteed to be disjoint if the preceding patch is
applied). We haven't seen problems with that in our own or the customer's
tests. It'd be possible to detect this situation and merge bi_io_vec slots
that refer to the same page, but I prefer to keep it simple for now.
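
To illustrate the situation described in the note, here is a small
user-space sketch. struct slot, slot_for(), same_page() and disjoint() are
hypothetical helpers, a 4096-byte page size is assumed, and each segment is
assumed not to cross a page boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size */

/* Simplified stand-in for a bio_vec: page number, offset in page, length. */
struct slot {
	uintptr_t page;
	size_t off, len;
};

/* Slot covering a segment that lies within a single page (assumption). */
struct slot slot_for(uintptr_t base, size_t len)
{
	struct slot s = { base / PAGE_SZ, base % PAGE_SZ, len };

	return s;
}

/* Do both slots reference the same page? */
int same_page(struct slot a, struct slot b)
{
	return a.page == b.page;
}

/* Are the byte ranges within the page non-overlapping? */
int disjoint(struct slot a, struct slot b)
{
	return a.off + a.len <= b.off || b.off + b.len <= a.off;
}
```

Two 512-byte segments at addresses 0x10000 and 0x10200 yield two slots for
which same_page() is true and disjoint() is true: the same page appears
twice, but with non-overlapping byte ranges.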

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <[email protected]>
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
        ret = bio_iov_iter_get_pages(&bio, iter);
        if (unlikely(ret))
-               return ret;
+               goto out;
+
+       while (ret == 0 &&
+              bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+               ret = bio_iov_iter_get_pages(&bio, iter);
+
        ret = bio.bi_iter.bi_size;
 
        if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
                put_page(bvec->bv_page);
        }
 
+out:
        if (vecs != inline_vecs)
                kfree(vecs);
 
-- 
2.17.1
