Re: [Cluster-devel] [PATCH 16/17] block: use iomap for writes to block devices

2023-05-23 Thread Dave Chinner
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> On 4/24/23 07:49, Christoph Hellwig wrote:
> > Use iomap in buffer_head compat mode to write to block devices.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >   block/Kconfig |  1 +
> >   block/fops.c  | 33 +
> >   2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 941b2dca70db73..672b08f0096ab4 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -5,6 +5,7 @@
> >   menuconfig BLOCK
> >  bool "Enable the block layer" if EXPERT
> >  default y
> > +   select IOMAP
> >  select SBITMAP
> >  help
> >  Provide block layer support for the kernel.
> > diff --git a/block/fops.c b/block/fops.c
> > index 318247832a7bcf..7910636f8df33b 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -15,6 +15,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include "blk.h"
> > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, 
> > struct iov_iter *iter)
> > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
> >   }
> > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t 
> > length,
> > +   unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +   struct block_device *bdev = I_BDEV(inode);
> > +   loff_t isize = i_size_read(inode);
> > +
> > +   iomap->bdev = bdev;
> > +   iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> > +   if (WARN_ON_ONCE(iomap->offset >= isize))
> > +   return -EIO;
> 
> I'm hitting this during booting:
> [5.016324]  
> [5.030256]  iomap_iter+0x11a/0x350
> [5.030264]  iomap_readahead+0x1eb/0x2c0
> [5.030272]  read_pages+0x5d/0x220
> [5.030279]  page_cache_ra_unbounded+0x131/0x180
> [5.030284]  filemap_get_pages+0xff/0x5a0

Why is filemap_get_pages() using unbounded readahead? Surely
readahead should be limited to reading within EOF

> [5.030292]  filemap_read+0xca/0x320
> [5.030296]  ? aa_file_perm+0x126/0x500
> [5.040216]  ? touch_atime+0xc8/0x150
> [5.040224]  blkdev_read_iter+0xb0/0x150
> [5.040228]  vfs_read+0x226/0x2d0
> [5.040234]  ksys_read+0xa5/0xe0
> [5.040238]  do_syscall_64+0x5b/0x80
> 
> Maybe we should consider this patch:
> 
> diff --git a/block/fops.c b/block/fops.c
> index 524b8a828aad..d202fb663f25 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode,
> loff_t offset, loff_t length,
> 
> iomap->bdev = bdev;
> iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> -   if (WARN_ON_ONCE(iomap->offset >= isize))
> -   return -EIO;
> -   iomap->type = IOMAP_MAPPED;
> -   iomap->addr = iomap->offset;
> +   if (WARN_ON_ONCE(iomap->offset >= isize)) {
> +   iomap->type = IOMAP_HOLE;
> +   iomap->addr = IOMAP_NULL_ADDR;
> +   } else {
> +   iomap->type = IOMAP_MAPPED;
> +   iomap->addr = iomap->offset;
> +   }

I think Christoph's code is correct. IMO, any attempt to read beyond
the end of the device should throw out a warning and return an
error, not silently return zeros.

If readahead is trying to read beyond the end of the device, then it
really seems to me like the problem here is readahead, not the iomap
code detecting the OOB read request

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Christoph Hellwig
On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> 
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

generic/708 is the btrfs version of this.

But I think all of the file systems that have this deadlock are actually
fundamentally broken because they have a mess up locking hierarchy
where page faults take the same lock that is held over the the direct I/
operation.  And the right thing is to fix this.  I have work in progress
for btrfs, and something similar should apply to gfs2, with the added
complication that it probably means a revision to their network
protocol.

I'm absolutely not in favour to add workarounds for thes kind of locking
problems to the core kernel.  I already feel bad for allowing the
small workaround in iomap for btrfs, as just fixing the locking back
then would have avoid massive ratholing.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Kent Overstreet
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
> 
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Oof, that sounds a bit sketchy. What happens if the dio call passes in
an address from the same address space? What happens if we race with the
pages we faulted in being evicted?

> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

More tests more good.

So if we want to lift this scheme to the VFS layer, we'd start by
replacing the lock you added (grepping for it, the name escapes me) with
a different type of lock - two_state_shared_lock in my code, it's like a
rw lock except writers don't exclude other writers. That way the DIO
path can use it without singlethreading writes to a single file.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Kent Overstreet
On Tue, May 23, 2023 at 09:21:56AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > 
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
> 
> generic/708 is the btrfs version of this.
> 
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a mess up locking hierarchy
> where page faults take the same lock that is held over the the direct I/
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

No, this is fundamentally because userspace controls the ordering of
locking because the buffer passed to dio can point into any address
space. You can't solve this by changing the locking heirarchy.

If you want to be able to have locking around adding things to the
pagecache so that things that bypass the pagecache can prevent
inconsistencies (and we do, the big one is fcollapse), and if you want
dio to be able to use that same locking (because otherwise dio will also
cause page cache inconsistency), this is the way to do it.



Re: [Cluster-devel] [PATCH 10/13] fs: factor out a direct_write_fallback helper

2023-05-23 Thread Christoph Hellwig
On Mon, May 22, 2023 at 04:19:38PM +0200, Miklos Szeredi wrote:
> > +   ssize_t direct_written, ssize_t buffered_written)
> > +{
> > +   struct address_space *mapping = iocb->ki_filp->f_mapping;
> > +   loff_t pos = iocb->ki_pos, end;
> 
> At this point pos will point after the end of the buffered write (as
> per earlier patches), yes?

Yes.  I'll fix the pos and end calculation.



Re: [Cluster-devel] [PATCH 08/13] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-05-23 Thread Christoph Hellwig
On Tue, May 23, 2023 at 04:30:51AM +0100, Matthew Wilcox wrote:
> AFAICT (the code went through some metamorphoses in the intervening
> twenty years), the last use of it ended up in current_may_throttle(),
> and it was removed in March 2022 by Neil Brown in commit b9b1335e6403.
> Since then, there have been no users of task->backing_dev_info, and I'm
> pretty sure it can go away.

Oh, nice.  I hadn't noticed it finally went away.  The next iteration
of the series will just remove it.



Re: [Cluster-devel] [PATCH 07/13] iomap: update ki_pos in iomap_file_buffered_write

2023-05-23 Thread Christoph Hellwig
On Mon, May 22, 2023 at 09:01:05AM +0900, Damien Le Moal wrote:
> > -   int ret;
> > +   ssize_t ret;
> >  
> > if (iocb->ki_flags & IOCB_NOWAIT)
> > iter.flags |= IOMAP_NOWAIT;
> >  
> > while ((ret = iomap_iter(, ops)) > 0)
> > iter.processed = iomap_write_iter(, i);
> > -   if (iter.pos == iocb->ki_pos)
> > +
> > +   if (unlikely(ret < 0))
> 
> Nit: This could be if (unlikely(ret <= 0)), no ?

No.  iomap_iter does not return te amount of bytes written.



Re: [Cluster-devel] [PATCH 06/13] filemap: add a kiocb_invalidate_post_write helper

2023-05-23 Thread Christoph Hellwig
On Mon, May 22, 2023 at 08:56:34AM +0900, Damien Le Moal wrote:
> On 5/19/23 18:35, Christoph Hellwig wrote:
> > Add a helper to invalidate page cache after a dio write.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Nit: kiocb_invalidate_post_dio_write() may be a better name to be explicit 
> about
> the fact that this is for DIOs only ?

I've renamed it to kiocb_invalidate_post_direct_write, thanks.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Jan Kara
On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > From: Kent Overstreet 
> > > 
> > > This is used by bcachefs to fix a page cache coherency issue with
> > > O_DIRECT writes.
> > > 
> > > Also relevant: mapping->invalidate_lock, see below.
> > > 
> > > O_DIRECT writes (and other filesystem operations that modify file data
> > > while bypassing the page cache) need to shoot down ranges of the page
> > > cache - and additionally, need locking to prevent those pages from
> > > pulled back in.
> > > 
> > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > and the page fault handler will need to take that same lock - this is a
> > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > writing to and uses those pages for the buffer to write from, and it's a
> > > lock ordering deadlock in general.
> > > 
> > > Thus we need a way to signal from the dio code to the page fault handler
> > > when we already are holding the pagecache add lock on an address space -
> > > this patch just adds a member to task_struct for this purpose. For now
> > > only bcachefs is implementing this locking, though it may be moved out
> > > of bcachefs and made available to other filesystems in the future.
> > 
> > It would be nice to have at least a link to the code that's actually using
> > the field you are adding.
> 
> Bit of a trick to link to a _later_ patch in the series from a commit
> message, but...
> 
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

Thanks and I'm sorry for the delay.

> > Also I think we were already through this discussion [1] and we ended up
> > agreeing that your scheme actually solves only the AA deadlock but a
> > malicious userspace can easily create AB BA deadlock by running direct IO
> > to file A using mapped file B as a buffer *and* direct IO to file B using
> > mapped file A as a buffer.
> 
> No, that's definitely handled (and you can see it in the code I linked),
> and I wrote a torture test for fstests as well.

I've checked the code and AFAICT it is all indeed handled. BTW, I've now
remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
way (by prefaulting pages from the iter before grabbing the problematic
lock and then disabling page faults for the iomap_dio_rw() call). I guess
we should somehow unify these schemes so that we don't have two mechanisms
for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Also good that you've written a fstest for this, that is definitely a useful
addition, although I suspect GFS2 guys added a test for this not so long
ago when testing their stuff. Maybe they have a pointer handy?

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-05-23 Thread Matthew Wilcox
On Tue, May 23, 2023 at 02:46:07PM +0200, Andreas Gruenbacher wrote:
> >  void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> > -unsigned int from, unsigned int len)
> > +size_t from, size_t len)
> >  {
> > struct buffer_head *head = folio_buffers(folio);
> > unsigned int bsize = head->b_size;
> 
> This only makes sense if the to, start, and end variables in
> gfs2_trans_add_databufs() are changed from unsigned int to size_t as
> well.

The history of this patch is that I started doing conversions from page
-> folio in gfs2, then you came out with a very similar series.  This
patch is the remainder after rebasing my patches on yours.  So we can
either drop this patch or just apply it.  I wasn't making a concerted
effort to make gfs2 support 4GB+ sized folios, it's just part of the
conversion that I do.

> >  extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio 
> > *folio,
> > -   unsigned int from, unsigned int len);
> > +   size_t from, size_t len);



Re: [Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-05-23 Thread Andreas Gruenbacher
On Wed, May 17, 2023 at 5:24 AM Matthew Wilcox (Oracle)
 wrote:
> We may someday support folios larger than 4GB, so use a size_t for
> the byte count within a folio to prevent unpleasant truncations.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/gfs2/aops.c | 2 +-
>  fs/gfs2/aops.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index e97462a5302e..8da4397aafc6 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -38,7 +38,7 @@
>
>
>  void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> -unsigned int from, unsigned int len)
> +size_t from, size_t len)
>  {
> struct buffer_head *head = folio_buffers(folio);
> unsigned int bsize = head->b_size;

This only makes sense if the to, start, and end variables in
gfs2_trans_add_databufs() are changed from unsigned int to size_t as
well.

> diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
> index 09db1914425e..f08322ef41cf 100644
> --- a/fs/gfs2/aops.h
> +++ b/fs/gfs2/aops.h
> @@ -10,6 +10,6 @@
>
>  extern void adjust_fs_space(struct inode *inode);
>  extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio 
> *folio,
> -   unsigned int from, unsigned int len);
> +   size_t from, size_t len);
>
>  #endif /* __AOPS_DOT_H__ */
> --
> 2.39.2
>

Thanks,
Andreas