On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the
> > > number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > short writes or reads may occur for direct synchronous IOs with multiple
> > > iovec slots (such as generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > >
> > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > result in multiple adjacent slots of the bi_io_vec array to reference the
> > > same
> > > page (the byte ranges are guaranteed to be disjunct if the preceding
> > > patch is
> > > applied). We haven't seen problems with that in our and the customer's
> > > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > > that refer to the same page, but I prefer to keep it simple for now.
> > >
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > > simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck <[email protected]>
> > > ---
> > > fs/block_dev.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct
> > > iov_iter *iter,
> > >
> > > ret = bio_iov_iter_get_pages(&bio, iter);
> > > if (unlikely(ret))
> > > - return ret;
> > > + goto out;
> > > +
> > > + while (ret == 0 &&
> > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > + ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > > ret = bio.bi_iter.bi_size;
> > >
> > > if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct
> > > iov_iter *iter,
> > > put_page(bvec->bv_page);
> > > }
> > >
> > > +out:
> > > if (vecs != inline_vecs)
> > > kfree(vecs);
> > >
> >
> > You might put the 'vecs' leak fix into another patch, and resue the
> > current code block for that.
> >
> > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> > what do you think about the following way?
>
> No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> fine with it returning less pages and they loop appropriately.
OK, but this way still may make one bio to hold more data, especially
the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from
*iter', so looks it is the correct way to do.
thanks,
Ming