Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-14 Thread Avi Kivity



On 09/07/2020 05.25, Dave Chinner wrote:



Nobody's proposing changing Direct I/O to exclusively work through the
pagecache.  The proposal is to behave less weirdly when there's already
data in the pagecache.

No, the proposal is to make direct IO behave *less
deterministically* if there is data in the page cache.

e.g. Instead of having a predictable submission CPU overhead and
read latency of 100us for your data, this proposal makes the claim
that it is always better to burn 10x the IO submission CPU for a
single IO to copy the data and give that specific IO 10x lower
latency than it is to submit 10 async IOs to keep the IO pipeline
full.

What it fails to take into account is that in spending that CPU time
to copy the data, we haven't submitted 10 other IOs and so the
actual in-flight IO for the application has decreased. If
performance comes from keeping the IO pipeline as close to 100% full
as possible, then copying the data out of the page cache will cause
performance regressions.

i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
depth craters because we've only fulfilled 5 complete IOs instead of
submitting 50 entire IOs. This is the hidden cost of synchronous IO
via CPU data copying vs async IO via hardware offload, and if we
take that into account we must look at future hardware performance
trends to determine if this cost is going to increase or decrease in
future.

That is: CPUs are not getting faster anytime soon. IO subsystems are
still deep in the "performance doubles every 2 years" part of the
technology curve (pcie 3.0->4.0 just happened, 4->5 is a year away,
5->6 is 3-4 years away, etc). Hence our reality is that we are deep
within a performance trend curve that tells us synchronous CPU
operations are not getting faster, but IO bandwidth and IOPS are
going to increase massively over the next 5-10 years. Hence putting
(already expensive) synchronous CPU operations in the asynchronous
zero-data-touch IO fast path is -exactly the wrong direction to be
moving-.

This is simple math. The gap between IO latency and bandwidth and
CPU addressable memory latency and bandwidth is closing all the
time, and the closer that gap gets the less sense it makes to use
CPU addressable memory for buffering syscall based read and write
IO. We are not quite yet at the cross-over point, but we really
aren't that far from it.




My use-case supports this. The application uses AIO+DIO, but backup may 
bring pages into page cache. For me, it is best to ignore page cache (as 
long as it's clean, which it is for backup) and serve from disk as usual.
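For reference, a minimal sketch of the AIO+DIO submission pattern described
above (libaio, hypothetical file name, error handling omitted; illustrative
only, not the actual application code):

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		void *buf;
		int fd;

		fd = open("datafile", O_RDONLY | O_DIRECT);	/* bypass the page cache */
		posix_memalign(&buf, 4096, 4096);		/* DIO needs aligned buffers */
		io_setup(128, &ctx);				/* per-process submission context */

		io_prep_pread(&cb, fd, buf, 4096, 0);		/* describe one async read */
		io_submit(ctx, 1, cbs);				/* submit, returns immediately */
		io_getevents(ctx, 1, 1, &ev, NULL);		/* reap the completion later */

		io_destroy(ctx);
		close(fd);
		free(buf);
		return 0;
	}

A backup job reading the same file through the page cache does not change
this submission path; the question in this thread is only what the kernel
does with the cached pages it finds.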





Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-10 Thread Christoph Hellwig
This looks sane - slightly updated version below to not bother with
the ret and a few tidyups.

That being said and to get back to the discussion in this thread:
I think it would be saner to give up on direct I/O in case of the
invalidation failure.  I've cooked up a patch on top of this one
(for which I had a few trivial cleanups).  It is still under testing,
but I'll send the two out in a new thread.
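The direction being discussed is roughly the following (a sketch only, not
Christoph's actual patch; it assumes -ENOTBLK is used as the "retry this
I/O buffered" sentinel):

	if (iov_iter_rw(iter) == WRITE) {
		/*
		 * If we cannot invalidate the cached pages over the range we
		 * are about to write directly, give up on direct I/O here and
		 * let the caller redo the write through the page cache.
		 */
		ret = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
		if (ret) {
			ret = -ENOTBLK;
			goto out_free_dio;
		}
	}

The filesystem then checks for -ENOTBLK from iomap_dio_rw() and falls back
to its buffered write path, instead of the current behaviour of warning and
pressing on with the direct write anyway.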



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Dave Chinner
On Thu, Jul 09, 2020 at 10:10:38AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> > On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > > -*/
> > > > -   ret = invalidate_inode_pages2_range(mapping,
> > > > -   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -   if (ret)
> > > > -   dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -   ret = 0;
> > > > +   if (iov_iter_rw(iter) == WRITE) {
> > > > +   /*
> > > > +* Try to invalidate cache pages for the range we're direct
> > > > +* writing.  If this invalidation fails, tough, the write will
> > > > +* still work, but racing two incompatible write paths is a
> > > > +* pretty crazy thing to do, so we don't support it 100%.
> > > > +*/
> > > > +   ret = invalidate_inode_pages2_range(mapping,
> > > > +   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > +   if (ret)
> > > > +   dio_warn_stale_pagecache(iocb->ki_filp);
> > > > +   ret = 0;
> > > >  
> > > > -   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > > -   !inode->i_sb->s_dio_done_wq) {
> > > > -   ret = sb_init_dio_done_wq(inode->i_sb);
> > > > -   if (ret < 0)
> > > > -   goto out_free_dio;
> > > > +   if (!wait_for_completion &&
> > > > +   !inode->i_sb->s_dio_done_wq) {
> > > > +   ret = sb_init_dio_done_wq(inode->i_sb);
> > > > +   if (ret < 0)
> > > > +   goto out_free_dio;
> 
> ...and yes I did add in the closing brace here. :P

Doh! I forgot to refresh the patch after fixing that. :/

Thanks!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Darrick J. Wong
On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > iomap: Only invalidate page cache pages on direct IO writes
> > > 
> > > From: Dave Chinner 
> > > 
> > > The historic requirement for XFS to invalidate cached pages on
> > > direct IO reads has been lost in the twisty pages of history - it was
> > > inherited from Irix, which implemented page cache invalidation on
> > > read as a method of working around problems synchronising page
> > > cache state with uncached IO.
> > 
> > Urk.
> > 
> > > XFS has carried this ever since. In the initial linux ports it was
> > > necessary to get mmap and DIO to play "ok" together and not
> > > immediately corrupt data. This was the state of play until the linux
> > > kernel had infrastructure to track unwritten extents and synchronise
> > > page faults with allocations and unwritten extent conversions
> > > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > > on DIO read was necessary to prevent trivial data corruptions. This
> > > didn't solve all the problems, though.
> > > 
> > > There were performance problems if we didn't invalidate the entire
> > > page cache over the file on read - we couldn't easily determine if
> > > the cached pages were over the range of the IO, and invalidation
> > > required taking a serialising lock (i_mutex) on the inode. This
> > > serialising lock was an issue for XFS, as it was the only exclusive
> > > lock in the direct IO read path.
> > > 
> > > Hence if there were any cached pages, we'd just invalidate the
> > > entire file in one go so that subsequent IOs didn't need to take the
> > > serialising lock. This was a problem that prevented ranged
> > > invalidation from being particularly useful for avoiding the
> > > remaining coherency issues. This was solved with the conversion of
> > > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > > use i_rwsem. Hence we could now just do ranged invalidation and the
> > > performance problem went away.
> > > 
> > > However, page cache invalidation was still needed to serialise
> > > sub-page/sub-block zeroing via direct IO against buffered IO because
> > > bufferhead state attached to the cached page could get out of whack
> > > when direct IOs were issued.  We've removed bufferheads from the
> > > XFS code, and we don't carry any extent state on the cached pages
> > > anymore, and so this problem has gone away, too.
> > > 
> > > IOWs, it would appear that we don't have any good reason to be
> > > invalidating the page cache on DIO reads anymore. Hence remove the
> > > invalidation on read because it is unnecessary overhead,
> > > not needed to maintain coherency between mmap/buffered access and
> > > direct IO anymore, and prevents anyone from using direct IO reads
> > > to intentionally invalidate the page cache of a file.
> > > 
> > > Signed-off-by: Dave Chinner 
> > > ---
> > >  fs/iomap/direct-io.c | 33 +
> > >  1 file changed, 17 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ec7b78e6feca..ef0059eb34b5 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >   if (ret)
> > >   goto out_free_dio;
> > >  
> > > - /*
> > > -  * Try to invalidate cache pages for the range we're direct
> > > -  * writing.  If this invalidation fails, tough, the write will
> > > -  * still work, but racing two incompatible write paths is a
> > > -  * pretty crazy thing to do, so we don't support it 100%.
> > 
> > I always wondered about the repeated use of 'write' in this comment
> > despite the lack of any sort of WRITE check logic.  Seems fine to me,
> > let's throw it on the fstests pile and see what happens.
> > 
> > Reviewed-by: Darrick J. Wong 
> 
> Reviewed-by: Matthew Wilcox (Oracle) 
> 
> > --D
> > 
> > > -  */
> > > - ret = invalidate_inode_pages2_range(mapping,
> > > - pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > - if (ret)
> > > - dio_warn_stale_pagecache(iocb->ki_filp);
> > > - ret = 0;
> > > + if (iov_iter_rw(iter) == WRITE) {
> > > + /*
> > > +  * Try to invalidate cache pages for the range we're direct
> > > +  * writing.  If this invalidation fails, tough, the write will
> > > +  * still work, but racing two incompatible write paths is a
> > > +  * pretty crazy thing to do, so we don't support it 100%.
> > > +  */
> > > + ret = invalidate_inode_pages2_range(mapping,
> > > + pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > + if (ret)
> > > + dio_warn_stale_pagecache(iocb->ki_filp);
> > > + ret = 0;
> > >  
> > > - if (iov_iter_rw(iter) == WRITE && 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Matthew Wilcox
On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > iomap: Only invalidate page cache pages on direct IO writes
> > 
> > From: Dave Chinner 
> > 
> > The historic requirement for XFS to invalidate cached pages on
> > direct IO reads has been lost in the twisty pages of history - it was
> > inherited from Irix, which implemented page cache invalidation on
> > read as a method of working around problems synchronising page
> > cache state with uncached IO.
> 
> Urk.
> 
> > XFS has carried this ever since. In the initial linux ports it was
> > necessary to get mmap and DIO to play "ok" together and not
> > immediately corrupt data. This was the state of play until the linux
> > kernel had infrastructure to track unwritten extents and synchronise
> > page faults with allocations and unwritten extent conversions
> > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > on DIO read was necessary to prevent trivial data corruptions. This
> > didn't solve all the problems, though.
> > 
> > There were performance problems if we didn't invalidate the entire
> > page cache over the file on read - we couldn't easily determine if
> > the cached pages were over the range of the IO, and invalidation
> > required taking a serialising lock (i_mutex) on the inode. This
> > serialising lock was an issue for XFS, as it was the only exclusive
> > lock in the direct IO read path.
> > 
> > Hence if there were any cached pages, we'd just invalidate the
> > entire file in one go so that subsequent IOs didn't need to take the
> > serialising lock. This was a problem that prevented ranged
> > invalidation from being particularly useful for avoiding the
> > remaining coherency issues. This was solved with the conversion of
> > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > use i_rwsem. Hence we could now just do ranged invalidation and the
> > performance problem went away.
> > 
> > However, page cache invalidation was still needed to serialise
> > sub-page/sub-block zeroing via direct IO against buffered IO because
> > bufferhead state attached to the cached page could get out of whack
> > when direct IOs were issued.  We've removed bufferheads from the
> > XFS code, and we don't carry any extent state on the cached pages
> > anymore, and so this problem has gone away, too.
> > 
> > IOWs, it would appear that we don't have any good reason to be
> > invalidating the page cache on DIO reads anymore. Hence remove the
> > invalidation on read because it is unnecessary overhead,
> > not needed to maintain coherency between mmap/buffered access and
> > direct IO anymore, and prevents anyone from using direct IO reads
> > to intentionally invalidate the page cache of a file.
> > 
> > Signed-off-by: Dave Chinner 
> > ---
> >  fs/iomap/direct-io.c | 33 +
> >  1 file changed, 17 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index ec7b78e6feca..ef0059eb34b5 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > if (ret)
> > goto out_free_dio;
> >  
> > -   /*
> > -* Try to invalidate cache pages for the range we're direct
> > -* writing.  If this invalidation fails, tough, the write will
> > -* still work, but racing two incompatible write paths is a
> > -* pretty crazy thing to do, so we don't support it 100%.
> 
> I always wondered about the repeated use of 'write' in this comment
> despite the lack of any sort of WRITE check logic.  Seems fine to me,
> let's throw it on the fstests pile and see what happens.
> 
> Reviewed-by: Darrick J. Wong 

Reviewed-by: Matthew Wilcox (Oracle) 

> --D
> 
> > -*/
> > -   ret = invalidate_inode_pages2_range(mapping,
> > -   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -   if (ret)
> > -   dio_warn_stale_pagecache(iocb->ki_filp);
> > -   ret = 0;
> > +   if (iov_iter_rw(iter) == WRITE) {
> > +   /*
> > +* Try to invalidate cache pages for the range we're direct
> > +* writing.  If this invalidation fails, tough, the write will
> > +* still work, but racing two incompatible write paths is a
> > +* pretty crazy thing to do, so we don't support it 100%.
> > +*/
> > +   ret = invalidate_inode_pages2_range(mapping,
> > +   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > +   if (ret)
> > +   dio_warn_stale_pagecache(iocb->ki_filp);
> > +   ret = 0;
> >  
> > -   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > -   !inode->i_sb->s_dio_done_wq) {
> > -   ret = sb_init_dio_done_wq(inode->i_sb);
> > -   if (ret < 0)
> > -   goto out_free_dio;
> > +   if 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Darrick J. Wong
On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> > > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > > > Indeed, I'm in favour of not invalidating
> > > > > the page cache at all for direct I/O.  For reads, I think the page 
> > > > > cache
> > > > > should be used to satisfy any portion of the read which is currently
> > > > > cached.  For writes, I think we should write into the page cache pages
> > > > > which currently exist, and then force those pages to be written back,
> > > > > but left in cache.
> > > > 
> > > > Something like that, yes.
> > > 
> > > So are we really willing to take the performance regression that
> > > occurs from copying out of the page cache consuming lots more CPU
> > > than an actual direct IO read? Or that direct IO writes suddenly
> > > serialise because there are page cache pages and now we have to do
> > > buffered IO?
> > > 
> > > Direct IO should be a deterministic, zero-copy IO path to/from
> > > storage. Using the CPU to copy data during direct IO is the complete
> > > opposite of the intended functionality, not to mention the behaviour
> > > that many applications have been carefully designed and tuned for.
> > 
> > Direct I/O isn't deterministic though.
> 
> When all the application is doing is direct IO, it is as
> deterministic as the underlying storage behaviour. This is the best
> we can possibly do from the perspective of the filesystem, and this
> is largely what Direct IO requires from the filesystem.
> 
> Direct IO starts from delegating all responsibility for IO
> synchronisation, data coherency and integrity to userspace, and then
> follows up by requiring the filesystem and kernel to stay out of the
> IO path except where it is absolutely necessary to read or write
> data to/from the underlying storage hardware. Serving Direct IO from
> the page cache violates the second of these requirements.
> 
> > If the file isn't shared, then
> > it works great, but as soon as you get mixed buffered and direct I/O,
> > everything is already terrible.
> 
> Right, but that's *the rare exception* for applications using direct
> IO, not the common fast path. It is the slow path where -shit has
> already gone wrong on the production machine-, and it most
> definitely does not change the DIO requirements that the filesystem
> needs to provide userspace via the direct IO path.
> 
> Optimising the slow exception path is fine if it does not affect the
> guarantees we try to provide through the common/fast path. If it
> does affect behaviour of the fast path, then we must be able to
> either turn it off or provide our own alternative implementation
> that does not have that cost.
> 
> > Direct I/Os perform pagecache lookups
> > already, but instead of using the data that we found in the cache, we
> > (if it's dirty) write it back, wait for the write to complete, remove
> > the page from the pagecache and then perform another I/O to get the data
> > that we just wrote out!  And then the app that's using buffered I/O has
> > to read it back in again.
> 
> Yup, that's because we have a history going back 20+ years of people
> finding performance regressions in applications using direct IO when
> we incorrectly leave pages in the page cache rather than
> invalidating them and continuing to do direct IO.
> 
> 
> > Nobody's proposing changing Direct I/O to exclusively work through the
> > pagecache.  The proposal is to behave less weirdly when there's already
> > data in the pagecache.
> 
> No, the proposal is to make direct IO behave *less
> deterministically* if there is data in the page cache.
> 
> e.g. Instead of having a predictable submission CPU overhead and
> read latency of 100us for your data, this proposal makes the claim
> that it is always better to burn 10x the IO submission CPU for a
> single IO to copy the data and give that specific IO 10x lower
> latency than it is to submit 10 async IOs to keep the IO pipeline
> full.
> 
> What it fails to take into account is that in spending that CPU time
> to copy the data, we haven't submitted 10 other IOs and so the
> actual in-flight IO for the application has decreased. If
> performance comes from keeping the IO pipeline as close to 100% full
> as possible, then copying the data out of the page cache will cause
> performance regressions.
> 
> i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
> depth craters because we've only fulfilled 5 complete IOs instead of
> submitting 50 entire IOs. This is the hidden cost of synchronous IO
> via CPU data copying vs async IO via hardware offload, and if we
> take that into account we must look at future hardware performance
> trends to determine if this cost is going to increase or decrease in
> future.
> 
> That is: CPUs 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Steven Whitehouse

Hi,

On 08/07/2020 17:54, Christoph Hellwig wrote:

On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:

Direct I/O isn't deterministic though.  If the file isn't shared, then
it works great, but as soon as you get mixed buffered and direct I/O,
everything is already terrible.  Direct I/Os perform pagecache lookups
already, but instead of using the data that we found in the cache, we
(if it's dirty) write it back, wait for the write to complete, remove
the page from the pagecache and then perform another I/O to get the data
that we just wrote out!  And then the app that's using buffered I/O has
to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.


I have had an objection raised off-list.  In a scenario with a block
device shared between two systems and an application which does direct
I/O, everything is normally fine.  If one of the systems uses tar to
back up the contents of the block device then the application on that
system will no longer see the writes from the other system because
there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications, all bets are off anyway.


On GFS2 the locking should take care of that. Not 100% sure about OCFS2
without looking, but I'm fairly sure that it has a similar arrangement.
So this shouldn't be a problem unless they are using an additional
cluster fs that I'm not aware of in this case. It would be good to
confirm which fs they are using.


Steve.




Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-08 Thread Dave Chinner
On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > > Indeed, I'm in favour of not invalidating
> > > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > > should be used to satisfy any portion of the read which is currently
> > > > cached.  For writes, I think we should write into the page cache pages
> > > > which currently exist, and then force those pages to be written back,
> > > > but left in cache.
> > > 
> > > Something like that, yes.
> > 
> > So are we really willing to take the performance regression that
> > occurs from copying out of the page cache consuming lots more CPU
> > than an actual direct IO read? Or that direct IO writes suddenly
> > serialise because there are page cache pages and now we have to do
> > buffered IO?
> > 
> > Direct IO should be a deterministic, zero-copy IO path to/from
> > storage. Using the CPU to copy data during direct IO is the complete
> > opposite of the intended functionality, not to mention the behaviour
> > that many applications have been carefully designed and tuned for.
> 
> Direct I/O isn't deterministic though.

When all the application is doing is direct IO, it is as
deterministic as the underlying storage behaviour. This is the best
we can possibly do from the perspective of the filesystem, and this
is largely what Direct IO requires from the filesystem.

Direct IO starts from delegating all responsibility for IO
synchronisation, data coherency and integrity to userspace, and then
follows up by requiring the filesystem and kernel to stay out of the
IO path except where it is absolutely necessary to read or write
data to/from the underlying storage hardware. Serving Direct IO from
the page cache violates the second of these requirements.

> If the file isn't shared, then
> it works great, but as soon as you get mixed buffered and direct I/O,
> everything is already terrible.

Right, but that's *the rare exception* for applications using direct
IO, not the common fast path. It is the slow path where -shit has
already gone wrong on the production machine-, and it most
definitely does not change the DIO requirements that the filesystem
needs to provide userspace via the direct IO path.

Optimising the slow exception path is fine if it does not affect the
guarantees we try to provide through the common/fast path. If it
does affect behaviour of the fast path, then we must be able to
either turn it off or provide our own alternative implementation
that does not have that cost.

> Direct I/Os perform pagecache lookups
> already, but instead of using the data that we found in the cache, we
> (if it's dirty) write it back, wait for the write to complete, remove
> the page from the pagecache and then perform another I/O to get the data
> that we just wrote out!  And then the app that's using buffered I/O has
> to read it back in again.

Yup, that's because we have a history going back 20+ years of people
finding performance regressions in applications using direct IO when
we incorrectly leave pages in the page cache rather than
invalidating them and continuing to do direct IO.


> Nobody's proposing changing Direct I/O to exclusively work through the
> pagecache.  The proposal is to behave less weirdly when there's already
> data in the pagecache.

No, the proposal is to make direct IO behave *less
deterministically* if there is data in the page cache.

e.g. Instead of having a predictable submission CPU overhead and
read latency of 100us for your data, this proposal makes the claim
that it is always better to burn 10x the IO submission CPU for a
single IO to copy the data and give that specific IO 10x lower
latency than it is to submit 10 async IOs to keep the IO pipeline
full.

What it fails to take into account is that in spending that CPU time
to copy the data, we haven't submitted 10 other IOs and so the
actual in-flight IO for the application has decreased. If
performance comes from keeping the IO pipeline as close to 100% full
as possible, then copying the data out of the page cache will cause
performance regressions.

i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
depth craters because we've only fulfilled 5 complete IOs instead of
submitting 50 entire IOs. This is the hidden cost of synchronous IO
via CPU data copying vs async IO via hardware offload, and if we
take that into account we must look at future hardware performance
trends to determine if this cost is going to increase or decrease in
future.

That is: CPUs are not getting faster anytime soon. IO subsystems are
still deep in the "performance doubles every 2 years" part of the
technology curve (pcie 3.0->4.0 just happened, 4->5 is a year away,
5->6 is 3-4 years away, etc). Hence our reality is that we are deep
within a 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-08 Thread Matthew Wilcox
On Wed, Jul 08, 2020 at 06:54:12PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> > Direct I/O isn't deterministic though.  If the file isn't shared, then
> > it works great, but as soon as you get mixed buffered and direct I/O,
> > everything is already terrible.  Direct I/Os perform pagecache lookups
> > already, but instead of using the data that we found in the cache, we
> > (if it's dirty) write it back, wait for the write to complete, remove
> > the page from the pagecache and then perform another I/O to get the data
> > that we just wrote out!  And then the app that's using buffered I/O has
> > to read it back in again.
> 
> Mostly agreed.  That being said I suspect invalidating clean cache
> might still be a good idea.  The original idea was mostly on how
> to deal with invalidation failures of any kind, but falling back for
> any kind of dirty cache also makes at least some sense.

That's certainly the btrfs problem that needs to be solved, but I think
it's all part of the directio misdesign.

> > I have had an objection raised off-list.  In a scenario with a block
> > device shared between two systems and an application which does direct
> > I/O, everything is normally fine.  If one of the systems uses tar to
> > back up the contents of the block device then the application on that
> > system will no longer see the writes from the other system because
> > there's nothing to invalidate the pagecache on the first system.
> 
> Err, WTF?  If someone accesses shared block devices with random
> applications, all bets are off anyway.

That doesn't mean that customers don't do it.  It is, of course, not
recommended, but we suspect people do it anyway.  Because it does
work, unfortunately.  I'd be open to making this exact situation
deterministically not work (eg disallowing mixing O_DIRECT and
non-O_DIRECT openers of block devices), but making it suddenly
non-deterministically give you old data is a non-starter.

> > Unfortunately, this is in direct conflict with the performance
> > problem caused by some little arsewipe deciding to do:
> > 
> > $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done
> > 
> > ... doesn't hurt me because my root filesystem is on ext4 which doesn't
> > purge the cache.  But anything using iomap gets all the pages for libc
> > kicked out of the cache, and that's a lot of fun.
> 
> ext4 uses iomap..

I happen to be running an older kernel that doesn't on this laptop ;-)



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-08 Thread Christoph Hellwig
On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> Direct I/O isn't deterministic though.  If the file isn't shared, then
> it works great, but as soon as you get mixed buffered and direct I/O,
> everything is already terrible.  Direct I/Os perform pagecache lookups
> already, but instead of using the data that we found in the cache, we
> (if it's dirty) write it back, wait for the write to complete, remove
> the page from the pagecache and then perform another I/O to get the data
> that we just wrote out!  And then the app that's using buffered I/O has
> to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.

> I have had an objection raised off-list.  In a scenario with a block
> device shared between two systems and an application which does direct
> I/O, everything is normally fine.  If one of the systems uses tar to
> back up the contents of the block device then the application on that
> system will no longer see the writes from the other system because
> there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications, all bets are off anyway.

> 
> Unfortunately, this is in direct conflict with the performance
> problem caused by some little arsewipe deciding to do:
> 
> $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done
> 
> ... doesn't hurt me because my root filesystem is on ext4 which doesn't
> purge the cache.  But anything using iomap gets all the pages for libc
> kicked out of the cache, and that's a lot of fun.

ext4 uses iomap..



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-08 Thread Dave Chinner
On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues 
> > > > > 
> > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > that if the page invalidation fails, control is returned to the
> > > > > filesystem so it may fall back to buffered mode.
> > > > > 
> > > > > Reviewed-by: Darrick J. Wong 
> > > > > Signed-off-by: Goldwyn Rodrigues 
> > > > 
> > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > default behavior.  If we have page cache that can't be invalidated, it
> > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > warnings, etc.
> > > > 
> > > > Adding all the relevant lists.
> > > 
> > > Since no one responded so far, let me see if I can stir the cauldron :)
> > > 
> > > What error should be returned in case of such an error? I think the
> > 
> > Christoph's message is ambiguous.  I don't know if he means "fail the
> > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > strongly in favour of the latter.
> 
> Same here.  Sorry if my previous mail was unclear.
> 
> > Indeed, I'm in favour of not invalidating
> > the page cache at all for direct I/O.  For reads, I think the page cache
> > should be used to satisfy any portion of the read which is currently
> > cached.  For writes, I think we should write into the page cache pages
> > which currently exist, and then force those pages to be written back,
> > but left in cache.
> 
> Something like that, yes.

So are we really willing to take the performance regression that
occurs from copying out of the page cache consuming lots more CPU
than an actual direct IO read? Or that direct IO writes suddenly
serialise because there are page cache pages and now we have to do
buffered IO?

Direct IO should be a deterministic, zero-copy IO path to/from
storage. Using the CPU to copy data during direct IO is the complete
opposite of the intended functionality, not to mention the behaviour
that many applications have been carefully designed and tuned for.

Hence I think that forcing iomap to use cached pages for DIO is a
non-starter. I have no problems with providing infrastructure that
allows filesystems to -opt in- to using buffered IO for the direct
IO path. However, the change in IO behaviour caused by unpredictably
switching between direct IO and buffered IO (e.g. suddenly DIO
writes serialise -all IO-) will cause unacceptable performance
regressions for many applications and be -very difficult to
diagnose- in production systems.

IOWs, we need to let the individual filesystems decide how they want
to use the page cache for direct IO. Just because we have new direct
IO infrastructure (i.e. iomap) it does not mean we can just make
wholesale changes to the direct IO path behaviour...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
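To make the opt-in model concrete, a filesystem that wants the buffered
fallback would do something like the following (a sketch under the
assumption that this series' flag-taking iomap_dio_rw() variant lands and
that -ENOTBLK is the fallback signal; fs_file_write_iter(), fs_iomap_ops,
fs_dio_ops and fs_buffered_write() are hypothetical names):

	static ssize_t fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
	{
		ssize_t ret;

		/* Opt in: ask iomap to hand back control on invalidation failure. */
		ret = iomap_dio_rw(iocb, from, &fs_iomap_ops, &fs_dio_ops,
				   IOMAP_DIO_RWF_NO_STALE_PAGECACHE);
		if (ret == -ENOTBLK)
			ret = fs_buffered_write(iocb, from);	/* this fs chose to fall back */
		return ret;
	}

Filesystems that do not pass the flag keep today's pure direct IO
behaviour, which is the point Dave is making about not changing the fast
path underneath existing applications.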



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Darrick J. Wong
On Tue, Jul 07, 2020 at 08:49:52AM -0500, Goldwyn Rodrigues wrote:
> On 13:57 07/07, Matthew Wilcox wrote:
> > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues 
> > > > > 
> > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > that if the page invalidation fails, control is returned to the
> > > > > filesystem so it may fall back to buffered mode.
> > > > > 
> > > > > Reviewed-by: Darrick J. Wong 
> > > > > Signed-off-by: Goldwyn Rodrigues 
> > > > 
> > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > default behavior.  If we have page cache that can't be invalidated, it
> > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > warnings, etc.
> > > > 
> > > > Adding all the relevant lists.
> > > 
> > > Since no one responded so far, let me see if I can stir the cauldron :)
> > > 
> > > What error should be returned in case of such an error? I think the
> > 
> > Christoph's message is ambiguous.  I don't know if he means "fail the
> > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
> > the page cache at all for direct I/O.  For reads, I think the page cache
> > should be used to satisfy any portion of the read which is currently
> 
> That indeed would make reads faster. How about if the pages are dirty
> during DIO reads?
> Should a direct I/O read be responsible for making sure that the dirty
> pages are written back? Technically, a direct I/O read means that we are
> reading from the device.

The filemap_write_and_wait_range should persist that data, right?

> > cached.  For writes, I think we should write into the page cache pages
> > which currently exist, and then force those pages to be written back,
> > but left in cache.
> 
> Yes, that makes sense.
> If this is implemented, what would be the difference between O_DIRECT
> and O_DSYNC, if any?

Presumably a direct write would proceed as it does today if there's no
pagecache at all?

--D

> -- 
> Goldwyn
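For reference, the step Darrick is pointing at is the range flush the iomap
DIO path already performs before issuing the I/O, roughly (simplified
sketch, error handling and locking omitted):

	/*
	 * Push any dirty cached pages over the request range to disk and
	 * wait for them, so a direct read that bypasses the cache still
	 * sees the most recently written data.
	 */
	ret = filemap_write_and_wait_range(mapping, pos, end);
	if (ret)
		goto out_free_dio;

After this the cached pages are clean, which is what makes "leave them in
place and read from the device" a coherent option for DIO reads.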



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Goldwyn Rodrigues
On  7:01 07/07, Darrick J. Wong wrote:
> On Tue, Jul 07, 2020 at 08:49:52AM -0500, Goldwyn Rodrigues wrote:
> > On 13:57 07/07, Matthew Wilcox wrote:
> > > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > > From: Goldwyn Rodrigues 
> > > > > > 
> > > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > > that if the page invalidation fails, control is returned to the
> > > > > > filesystem so it may fall back to buffered mode.
> > > > > > 
> > > > > > Reviewed-by: Darrick J. Wong 
> > > > > > Signed-off-by: Goldwyn Rodrigues 
> > > > > 
> > > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > > default behavior.  If we have page cache that can't be invalidated, it
> > > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > > warnings, etc.
> > > > > 
> > > > > Adding all the relevant lists.
> > > > 
> > > > Since no one responded so far, let me see if I can stir the cauldron :)
> > > > 
> > > > What error should be returned in case of such an error? I think the
> > > 
> > > Christoph's message is ambiguous.  I don't know if he means "fail the
> > > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > > strongly in favour of the latter.  Indeed, I'm in favour of not 
> > > invalidating
> > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > should be used to satisfy any portion of the read which is currently
> > 
> > That indeed would make reads faster. How about if the pages are dirty
> > during DIO reads?
> > Should a direct I/O read be responsible for making sure that the dirty
> > pages are written back? Technically, a direct I/O read means that we are
> > reading from the device.
> 
> The filemap_write_and_wait_range should persist that data, right?

Right. filemap_write_and_wait_range() would not make sense for writes
though.

> 
> > > cached.  For writes, I think we should write into the page cache pages
> > > which currently exist, and then force those pages to be written back,
> > > but left in cache.
> > 
> > Yes, that makes sense.
> > If this is implemented, what would be the difference between O_DIRECT
> > and O_DSYNC, if any?
> 
> Presumably a direct write would proceed as it does today if there's no
> pagecache at all?
> 

Yes, correct. Just that it would leave pages in the cache instead of
invalidating them after the DIO write is complete.

-- 
Goldwyn
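The write-side behaviour being agreed on here would look roughly like the
following (pseudocode only; submit_direct_write() is an invented helper and
pos/end/iter/mapping are taken from the enclosing DIO context, none of this
is an existing kernel path):

	/* For each page-sized chunk of the write range... */
	size_t chunk = min_t(u64, end - pos,
			     PAGE_SIZE - offset_in_page(pos));
	struct page *page = find_get_page(mapping, pos >> PAGE_SHIFT);

	if (page) {
		/* Cached: write through the existing page and keep it. */
		lock_page(page);
		copy_page_from_iter(page, offset_in_page(pos), chunk, iter);
		set_page_dirty(page);
		unlock_page(page);
		put_page(page);
		/* Kick off writeback so the data still reaches the device. */
		filemap_fdatawrite_range(mapping, pos, pos + chunk - 1);
	} else {
		/* Not cached: plain zero-copy direct write, as today. */
		submit_direct_write(iocb, iter, pos, chunk);	/* invented helper */
	}

Goldwyn's O_DIRECT-versus-O_DSYNC question above is essentially about how
small the remaining difference becomes once writes behave this way.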



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Goldwyn Rodrigues
On 13:57 07/07, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > On  9:53 01/07, Christoph Hellwig wrote:
> > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues 
> > > > 
> > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > that if the page invalidation fails, control is returned to the
> > > > filesystem so it may fall back to buffered mode.
> > > > 
> > > > Reviewed-by: Darrick J. Wong 
> > > > Signed-off-by: Goldwyn Rodrigues 
> > > 
> > > I'd like to start a discussion of whether this shouldn't really be the
> > > default behavior.  If we have page cache that can't be invalidated, it
> > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > warnings, etc.
> > > 
> > > Adding all the relevant lists.
> > 
> > Since no one responded so far, let me see if I can stir the cauldron :)
> > 
> > What error should be returned in case of such an error? I think the
> 
> Christoph's message is ambiguous.  I don't know if he means "fail the
> I/O with an error" or "satisfy the I/O through the page cache".  I'm
> strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
> the page cache at all for direct I/O.  For reads, I think the page cache
> should be used to satisfy any portion of the read which is currently

That indeed would make reads faster. How about if the pages are dirty
during DIO reads?
Should a direct I/O read be responsible for making sure that the dirty
pages are written back? Technically, a direct I/O read means that we are
reading from the device.

> cached.  For writes, I think we should write into the page cache pages
> which currently exist, and then force those pages to be written back,
> but left in cache.

Yes, that makes sense.
If this is implemented, what would be the difference between O_DIRECT
and O_DSYNC, if any?

-- 
Goldwyn



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Matthew Wilcox
On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> On  9:53 01/07, Christoph Hellwig wrote:
> > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues 
> > > 
> > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > that if the page invalidation fails, control is returned to the
> > > filesystem so it may fall back to buffered mode.
> > > 
> > > Reviewed-by: Darrick J. Wong 
> > > Signed-off-by: Goldwyn Rodrigues 
> > 
> > I'd like to start a discussion of whether this shouldn't really be the
> > default behavior.  If we have page cache that can't be invalidated, it
> > actually makes a whole lot of sense to not do direct I/O, avoid the
> > warnings, etc.
> > 
> > Adding all the relevant lists.
> 
> Since no one responded so far, let me see if I can stir the cauldron :)
> 
> What error should be returned in case of such an error? I think the

Christoph's message is ambiguous.  I don't know if he means "fail the
I/O with an error" or "satisfy the I/O through the page cache".  I'm
strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
the page cache at all for direct I/O.  For reads, I think the page cache
should be used to satisfy any portion of the read which is currently
cached.  For writes, I think we should write into the page cache pages
which currently exist, and then force those pages to be written back,
but left in cache.
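Sketched as pseudocode, the read side of this proposal would be something
like the loop below (illustrative only; submit_direct_read() is an invented
placeholder and the iterator advancement is simplified):

	while (pos < end) {
		size_t chunk = min_t(u64, end - pos,
				     PAGE_SIZE - offset_in_page(pos));
		struct page *page = find_get_page(mapping, pos >> PAGE_SHIFT);

		if (page && PageUptodate(page)) {
			/* Serve this chunk straight from the page cache. */
			copy_page_to_iter(page, offset_in_page(pos), chunk, iter);
			put_page(page);
		} else {
			if (page)
				put_page(page);
			/* No usable cached copy: do a normal direct read. */
			submit_direct_read(iocb, iter, pos, chunk);
		}
		pos += chunk;
	}

This is the zero-invalidation, copy-when-cached behaviour that Dave argues
against elsewhere in the thread on CPU-cost and queue-depth grounds; it is
a sketch of the proposal, not a recommendation.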



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Goldwyn Rodrigues
On  9:53 01/07, Christoph Hellwig wrote:
> On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues 
> > 
> > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > that if the page invalidation fails, control is returned to the
> > filesystem so it may fall back to buffered mode.
> > 
> > Reviewed-by: Darrick J. Wong 
> > Signed-off-by: Goldwyn Rodrigues 
> 
> I'd like to start a discussion of whether this shouldn't really be the
> default behavior.  If we have page cache that can't be invalidated, it
> actually makes a whole lot of sense to not do direct I/O, avoid the
> warnings, etc.
> 
> Adding all the relevant lists.

Since no one responded so far, let me see if I can stir the cauldron :)

What error should be returned in case of such an error? I think the
userspace process must be immediately informed if it is unable to
invalidate the page cache and complete the direct I/O. Currently, the
iomap code treats this as a writeback error and continues with the
direct I/O, and the userspace process comes to know of it only at file
closure.

If such a change is incorporated, are the current userspace applications
prepared for it?


-- 
Goldwyn
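From the application's point of view, the current behaviour Goldwyn
describes looks roughly like this (userspace sketch; handle_error() is a
placeholder and the exact reporting point depends on the filesystem):

	ssize_t n = pwrite(fd, buf, len, off);	/* O_DIRECT write "succeeds" */
	if (n < 0)
		handle_error(errno);		/* invalidation failure not seen here */

	if (fsync(fd) < 0)			/* a deferred writeback-style error may */
		handle_error(errno);		/* surface here... */

	if (close(fd) < 0)			/* ...or, at the latest, here */
		handle_error(errno);

The question in this sub-thread is whether that deferred reporting is
acceptable, or whether the write itself should either fail or transparently
fall back to buffered I/O.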



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Christoph Hellwig
On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > On  9:53 01/07, Christoph Hellwig wrote:
> > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues 
> > > > 
> > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > that if the page invalidation fails, control is returned to the
> > > > filesystem so it may fall back to buffered mode.
> > > > 
> > > > Reviewed-by: Darrick J. Wong 
> > > > Signed-off-by: Goldwyn Rodrigues 
> > > 
> > > I'd like to start a discussion of whether this shouldn't really be the
> > > default behavior.  If we have page cache that can't be invalidated, it
> > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > warnings, etc.
> > > 
> > > Adding all the relevant lists.
> > 
> > Since no one responded so far, let me see if I can stir the cauldron :)
> > 
> > What error should be returned in case of such an error? I think the
> 
> Christoph's message is ambiguous.  I don't know if he means "fail the
> I/O with an error" or "satisfy the I/O through the page cache".  I'm
> strongly in favour of the latter.

Same here.  Sorry if my previous mail was unclear.

> Indeed, I'm in favour of not invalidating
> the page cache at all for direct I/O.  For reads, I think the page cache
> should be used to satisfy any portion of the read which is currently
> cached.  For writes, I think we should write into the page cache pages
> which currently exist, and then force those pages to be written back,
> but left in cache.

Something like that, yes.