Re: [Cluster-devel] [PATCH RFC v5 00/29] io_uring getdents

2023-08-25 Thread Darrick J. Wong
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This series introduces getdents64 to io_uring; the code logic is similar
> to the synchronized version's. It first tries a nowait issue and offloads
> it to io-wq threads if the first try fails.
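The "try nowait first, offload on failure" flow described above can be sketched outside the kernel. This is an illustrative Python model only (a pipe plus a worker thread standing in for io-wq), not io_uring code; the `submit` helper is a made-up name:

```python
# Illustrative sketch of the nowait-first pattern: attempt a
# non-blocking operation inline, and hand it to a worker thread
# (playing the role of an io-wq worker) only on EAGAIN.
import os
import threading

def submit(fd, nbytes, results):
    """Try a non-blocking read; offload a blocking retry on EAGAIN."""
    os.set_blocking(fd, False)
    try:
        results.append(("inline", os.read(fd, nbytes)))
        return None                  # completed without blocking
    except BlockingIOError:
        pass                         # nothing ready: fall back to a worker

    def worker():
        os.set_blocking(fd, True)    # the worker is allowed to sleep
        results.append(("offloaded", os.read(fd, nbytes)))

    t = threading.Thread(target=worker)
    t.start()
    return t

r, w = os.pipe()
results = []
t = submit(r, 5, results)            # pipe is empty, so this offloads
os.write(w, b"hello")                # data arrives after submission
t.join()
print(results)                       # [('offloaded', b'hello')]
```

The same call completes inline (returning `None`) when data is already available, mirroring the fast path the cover letter describes.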

NAK on the entire series until Jens actually writes down what NOWAIT
does, so that we can check that the *existing* nowait code branches
actually behave how he says it should.

https://lore.kernel.org/all/e2d8e5f1-f794-38eb-cecf-ed30c5712...@kernel.dk/

--D

> 
> Patches 1 and 2 are preparation.
> Patch 3 supports nowait for the xfs getdents code.
> Patches 4-11 are vfs changes, including adding helpers and trylock for locks.
> Patches 12-29 support nowait for the involved xfs journal stuff.
> Note: Patches 24 and 27 are actually two questions and might be removed later.
> An xfs test may come later.
> 
> Tests I've done:
> a liburing test case for functional test:
> https://github.com/HowHsu/liburing/commit/39dc9a8e19c06a8cebf8c2301b85320eb45c061e?diff=unified
> 
> xfstests:
> test/generic: 1 fails and 171 not run
> test/xfs: 72 fails and 156 not run
> Running the code without this patchset gives the same results.
> I'll try to set up the environment properly so more tests can run.
> 
> 
> Tested it with a liburing performance test:
> https://github.com/HowHsu/liburing/blob/getdents/test/getdents2.c
> 
> The test is controlled by the below script[2], which runs getdents2.t 100
> times and calculates the average.
> The results show that the io_uring version is about 2.6% faster:
> 
> note:
> [1] the number of getdents calls/requests in the io_uring and normal sync
> versions is made sure to be the same beforehand.
> 
> [2] run_getdents.py
> 
> ```python3
> 
> import subprocess
> 
> N = 100
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "sync"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of sync:", average)
> 
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "iouring"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of iouring:", average)
> 
> ```
> 
> v4->v5:
>  - move atime update to the beginning of getdents operation
>  - trylock for i_rwsem
>  - nowait semantics for involved xfs journal stuff
> 
> v3->v4:
>  - add Dave's xfs nowait code and fix a deadlock problem, with some
>    code style tweaks.
>  - disable fixed file to avoid a race problem for now
>  - add a test program.
> 
> v2->v3:
>  - removed the kernfs patches
>  - add f_pos_lock logic
>  - remove the "reduce last EOF getdents try" optimization since
>    Dominique reports that it doesn't make a difference
>  - remove the rewind logic; I think the right way is to introduce lseek
>    in io_uring, not to patch this logic into getdents.
>  - add Signed-off-by of Stefan Roesch for patch 1 since checkpatch
>    complained that a Co-developed-by should be accompanied by a
>    Signed-off-by from the same person; I can remove them if Stefan
>    thinks that's not proper.
> 
> 
> Dominique Martinet (1):
>   fs: split off vfs_getdents function of getdents64 syscall
> 
> Hao Xu (28):
>   xfs: rename XBF_TRYLOCK to XBF_NOWAIT
>   xfs: add NOWAIT semantics for readdir
>   vfs: add nowait flag for struct dir_context
>   vfs: add a vfs helper for io_uring file pos lock
>   vfs: add file_pos_unlock() for io_uring usage
>   vfs: add a nowait parameter for touch_atime()
>   vfs: add nowait parameter for file_accessed()
>   vfs: move file_accessed() to the beginning of iterate_dir()
>   vfs: add S_NOWAIT for nowait time update
>   vfs: trylock inode->i_rwsem in iterate_dir() to support nowait
>   xfs: enforce GFP_NOIO implicitly during nowait time update
>   xfs: make xfs_trans_alloc() support nowait semantics
>   xfs: support nowait for xfs_log_reserve()
>   xfs: don't wait for free space in xlog_grant_head_check() in nowait
> case
>   xfs: add nowait parameter for xfs_inode_item_init()
>   xfs: make xfs_trans_ijoin() error out -EAGAIN
>   xfs: set XBF_NOWAIT for xfs_buf_read_map if necessary
>   xfs: support nowait memory allocation in _xfs_buf_alloc()
>   xfs: distinguish error type of memory allocation failure for nowait
> case
>   xfs: return -EAGAIN when bulk memory allocation fails in nowait case
>   xfs: comment page allocation for nowait case in xfs_buf_find_insert()
>   xfs: don't print warn info for -EAGAIN error in xfs_buf_get_map()
>   xfs: support nowait for xfs_buf_read_map()
>   xfs: support nowait for xfs_buf_item_init()
>   xfs: return -EAGAIN when nowait meets sync in transaction commit
>   xfs: add a comment for xlog_kvmalloc()
>   xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()
>   io_uring: add support for getdents
> 
>  arch/s390/hypfs/inode.c |  2 +-
>  block/fops.c|  2 +-
>  fs/btrfs/file.c |  2 +-

Re: [Cluster-devel] [PATCH v7 07/13] xfs: have xfs_vn_update_time gets its own timestamp

2023-08-09 Thread Darrick J. Wong
On Mon, Aug 07, 2023 at 03:38:38PM -0400, Jeff Layton wrote:
> In later patches we're going to drop the "now" parameter from the
> update_time operation. Prepare XFS for this by reworking how it fetches
> timestamps and sets them in the inode. Ensure that we update the ctime
> even if only S_MTIME is set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/xfs_iops.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 731f45391baa..72d18e7840f5 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1037,6 +1037,7 @@ xfs_vn_update_time(
>   int log_flags = XFS_ILOG_TIMESTAMP;
>   struct xfs_trans*tp;
>   int error;
> + struct timespec64   now = current_time(inode);
>  
>   trace_xfs_update_time(ip);
>  
> @@ -1056,12 +1057,15 @@ xfs_vn_update_time(
>   return error;
>  
>   xfs_ilock(ip, XFS_ILOCK_EXCL);
> - if (flags & S_CTIME)
> - inode_set_ctime_to_ts(inode, *now);
> + if (flags & (S_CTIME|S_MTIME))

Minor nit: spaces around^ the operator.

Otherwise looks ok to me...
Acked-by: Darrick J. Wong 

--D

> + now = inode_set_ctime_current(inode);
> + else
> + now = current_time(inode);
> +
>   if (flags & S_MTIME)
> - inode->i_mtime = *now;
> + inode->i_mtime = now;
>   if (flags & S_ATIME)
> - inode->i_atime = *now;
> + inode->i_atime = now;
>  
>   xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>   xfs_trans_log_inode(tp, ip, log_flags);
> 
> -- 
> 2.41.0
> 
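The reworked flag handling in the patch above can be modeled in a few lines of Python. The flag names mirror the kernel's, but the code is purely illustrative, not kernel logic:

```python
# Illustrative model of the reworked xfs_vn_update_time flag handling:
# the ctime is refreshed whenever S_CTIME *or* S_MTIME is requested,
# and that same "now" value is reused for mtime/atime so all updated
# stamps agree.
import time

S_ATIME, S_MTIME, S_CTIME = 1, 2, 4

def update_time(inode, flags):
    if flags & (S_CTIME | S_MTIME):
        now = inode["ctime"] = time.time_ns()  # inode_set_ctime_current()
    else:
        now = time.time_ns()                   # current_time()
    if flags & S_MTIME:
        inode["mtime"] = now
    if flags & S_ATIME:
        inode["atime"] = now
    return inode

inode = {"atime": 0, "mtime": 0, "ctime": 0}
update_time(inode, S_MTIME)
assert inode["ctime"] == inode["mtime"]  # an mtime update bumps ctime too
```

Note how an S_MTIME-only update still lands on the `inode_set_ctime_current()` branch, which is exactly the behavior change the commit message calls out.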



Re: [Cluster-devel] [PATCH v6 5/7] xfs: switch to multigrain timestamps

2023-08-02 Thread Darrick J. Wong
On Tue, Jul 25, 2023 at 10:58:18AM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> always set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
>  fs/xfs/xfs_iops.c   | 4 ++--
>  fs/xfs/xfs_super.c  | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 6b2296ff248a..ad22656376d3 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + ASSERT(flags & XFS_ICHGTIME_CHG);
>  
> + tv = inode_set_ctime_current(inode);
>   if (flags & XFS_ICHGTIME_MOD)
>   inode->i_mtime = tv;
> - if (flags & XFS_ICHGTIME_CHG)
> - inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 3a9363953ef2..3f89ef5a2820 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -573,10 +573,10 @@ xfs_vn_getattr(
>   stat->gid = vfsgid_into_kgid(vfsgid);
>   stat->ino = ip->i_ino;
>   stat->atime = inode->i_atime;
> - stat->mtime = inode->i_mtime;
> - stat->ctime = inode_get_ctime(inode);
>   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>  
> + fill_mg_cmtime(request_mask, inode, stat);

Huh.  I would've thought @stat would come first since that's what we're
acting upon, but ... eh. :)

If everyone else is ok with the fill_mg_cmtime signature,
Acked-by: Darrick J. Wong 

--D

> +
>   if (xfs_has_v3inodes(mp)) {
>   if (request_mask & STATX_BTIME) {
>   stat->result_mask |= STATX_BTIME;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 818510243130..4b10edb2c972 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2009,7 +2009,7 @@ static struct file_system_type xfs_fs_type = {
>   .init_fs_context= xfs_init_fs_context,
>   .parameters = xfs_fs_parameters,
>   .kill_sb= kill_block_super,
> - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
>  };
>  MODULE_ALIAS_FS("xfs");
>  
> 
> -- 
> 2.41.0
> 
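A toy Python model of the multigrain idea Jeff describes above (not the kernel implementation): stamps are coarse-grained by default, and a fine-grained stamp is used only when the previous value has been actively observed, so an observer is guaranteed to see an apparent change:

```python
# Toy model of multigrain timestamps: track whether the ctime has been
# observed via getattr since it was last set; if so, the next update
# must take a fine-grained stamp even if the coarse clock hasn't ticked.
class Inode:
    def __init__(self):
        self.ctime = 0
        self.queried = False

    def getattr(self):
        self.queried = True          # mark the ctime as observed
        return self.ctime

    def change(self, coarse_now, fine_now):
        # Use the fine-grained clock only when someone has looked at
        # the old value; otherwise the cheaper coarse stamp suffices.
        self.ctime = fine_now if self.queried else coarse_now
        self.queried = False

ino = Inode()
ino.change(coarse_now=100, fine_now=100_001)
seen = ino.getattr()                 # observer samples the ctime
ino.change(coarse_now=100, fine_now=100_002)  # same coarse tick!
assert ino.getattr() != seen         # yet an apparent change is visible
```

Unobserved inodes keep taking coarse stamps, which is what keeps the fast path cheap.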



Re: [Cluster-devel] [PATCH 09/12] fs: factor out a direct_write_fallback helper

2023-06-05 Thread Darrick J. Wong
On Thu, Jun 01, 2023 at 04:59:01PM +0200, Christoph Hellwig wrote:
> Add a helper that handles the syncing of a buffered write fallback
> for direct I/O.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Damien Le Moal 
> Reviewed-by: Miklos Szeredi 

Looks good to me; whose tree do you want this to go through?

Reviewed-by: Darrick J. Wong 

--D
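The return-value policy of the helper quoted below can be condensed into a small Python model. This is illustrative only; the real helper also performs the page-cache writeback and invalidation, which are omitted here:

```python
# Sketch of direct_write_fallback()'s return-value policy:
#  - if the buffered fallback failed, report the direct-I/O byte count
#    when any bytes went out, otherwise the fallback's error;
#  - if flushing the fallback's pagecache pages failed, likewise prefer
#    the direct byte count over the writeback error;
#  - otherwise return the combined total.
def direct_write_fallback(direct_written, buffered_written, writeback_err=0):
    if buffered_written < 0:                 # buffered fallback failed
        return direct_written or buffered_written
    if writeback_err < 0:                    # flush/invalidate failed
        return direct_written or writeback_err
    return direct_written + buffered_written

assert direct_write_fallback(4096, -5) == 4096   # partial direct win
assert direct_write_fallback(0, -5) == -5        # nothing written: error
assert direct_write_fallback(4096, 512) == 4608  # both halves succeeded
```

As the patch's comment notes, this deliberately differs from normal direct-I/O semantics, which return the error even when some bytes were written.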

> ---
>  fs/libfs.c | 41 
>  include/linux/fs.h |  2 ++
>  mm/filemap.c   | 66 +++---
>  3 files changed, 58 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 89cf614a327158..5b851315eeed03 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1613,3 +1613,44 @@ u64 inode_query_iversion(struct inode *inode)
>   return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  EXPORT_SYMBOL(inode_query_iversion);
> +
> +ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
> + ssize_t direct_written, ssize_t buffered_written)
> +{
> + struct address_space *mapping = iocb->ki_filp->f_mapping;
> + loff_t pos = iocb->ki_pos - buffered_written;
> + loff_t end = iocb->ki_pos - 1;
> + int err;
> +
> + /*
> +  * If the buffered write fallback returned an error, we want to return
> +  * the number of bytes which were written by direct I/O, or the error
> +  * code if that was zero.
> +  *
> +  * Note that this differs from normal direct-io semantics, which will
> +  * return -EFOO even if some bytes were written.
> +  */
> + if (unlikely(buffered_written < 0)) {
> + if (direct_written)
> + return direct_written;
> + return buffered_written;
> + }
> +
> + /*
> +  * We need to ensure that the page cache pages are written to disk and
> +  * invalidated to preserve the expected O_DIRECT semantics.
> +  */
> + err = filemap_write_and_wait_range(mapping, pos, end);
> + if (err < 0) {
> + /*
> +  * We don't know how much we wrote, so just return the number of
> +  * bytes which were direct-written
> +  */
> + if (direct_written)
> + return direct_written;
> + return err;
> + }
> + invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> + return direct_written + buffered_written;
> +}
> +EXPORT_SYMBOL_GPL(direct_write_fallback);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 91021b4e1f6f48..6af25137543824 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2738,6 +2738,8 @@ extern ssize_t __generic_file_write_iter(struct kiocb 
> *, struct iov_iter *);
>  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
>  ssize_t generic_perform_write(struct kiocb *, struct iov_iter *);
> +ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
> + ssize_t direct_written, ssize_t buffered_written);
>  
>  ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
>   rwf_t flags);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ddb6f8aa86d6ca..137508da5525b6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4006,23 +4006,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>  {
>   struct file *file = iocb->ki_filp;
>   struct address_space *mapping = file->f_mapping;
> - struct inode*inode = mapping->host;
> - ssize_t written = 0;
> - ssize_t err;
> - ssize_t status;
> + struct inode *inode = mapping->host;
> + ssize_t ret;
>  
> - err = file_remove_privs(file);
> - if (err)
> - goto out;
> + ret = file_remove_privs(file);
> + if (ret)
> + return ret;
>  
> - err = file_update_time(file);
> - if (err)
> - goto out;
> + ret = file_update_time(file);
> + if (ret)
> + return ret;
>  
>   if (iocb->ki_flags & IOCB_DIRECT) {
> - loff_t pos, endbyte;
> -
> - written = generic_file_direct_write(iocb, from);
> + ret = generic_file_direct_write(iocb, from);
>   /*
>* If the write stopped short of completing, fall back to
>* buffered writes.  Some filesystems do this for writes to
> @@ -4030,45 +4026,13 @@ ssize_t __generic_file_write_it

Re: [Cluster-devel] [PATCH 01/11] backing_dev: remove current->backing_dev_info

2023-05-24 Thread Darrick J. Wong
On Wed, May 24, 2023 at 08:38:00AM +0200, Christoph Hellwig wrote:
> The last user of current->backing_dev_info disappeared in commit
> b9b1335e6403 ("remove bdi_congested() and wb_congested() and related
> functions").  Remove the field and all assignments to it.
> 
> Signed-off-by: Christoph Hellwig 

Yay code removal!!!! :)

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/btrfs/file.c   | 6 +-
>  fs/ceph/file.c| 4 
>  fs/ext4/file.c| 2 --
>  fs/f2fs/file.c| 2 --
>  fs/fuse/file.c| 4 
>  fs/gfs2/file.c| 2 --
>  fs/nfs/file.c | 5 +
>  fs/ntfs/file.c| 2 --
>  fs/ntfs3/file.c   | 3 ---
>  fs/xfs/xfs_file.c | 4 
>  include/linux/sched.h | 3 ---
>  mm/filemap.c  | 3 ---
>  12 files changed, 2 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f649647392e0e4..ecd43ab66fa6c7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1145,7 +1145,6 @@ static int btrfs_write_check(struct kiocb *iocb, struct 
> iov_iter *from,
>   !(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | 
> BTRFS_INODE_PREALLOC)))
>   return -EAGAIN;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = file_remove_privs(file);
>   if (ret)
>   return ret;
> @@ -1165,10 +1164,8 @@ static int btrfs_write_check(struct kiocb *iocb, 
> struct iov_iter *from,
>   loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
>  
>   ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
> - if (ret) {
> - current->backing_dev_info = NULL;
> + if (ret)
>   return ret;
> - }
>   }
>  
>   return 0;
> @@ -1689,7 +1686,6 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct 
> iov_iter *from,
>   if (sync)
>   atomic_dec(&inode->sync_writers);
>  
> - current->backing_dev_info = NULL;
>   return num_written;
>  }
>  
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f4d8bf7dec88a8..c8ef72f723badd 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1791,9 +1791,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   else
>   ceph_start_io_write(inode);
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   if (iocb->ki_flags & IOCB_APPEND) {
>   err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
>   if (err < 0)
> @@ -1940,7 +1937,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   ceph_end_io_write(inode);
>  out_unlocked:
>   ceph_free_cap_flush(prealloc_cf);
> - current->backing_dev_info = NULL;
>   return written ? written : err;
>  }
>  
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d101b3b0c7dad8..bc430270c23c19 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -285,9 +285,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb 
> *iocb,
>   if (ret <= 0)
>   goto out;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = generic_perform_write(iocb, from);
> - current->backing_dev_info = NULL;
>  
>  out:
>   inode_unlock(inode);
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5ac53d2627d20d..4f423d367a44b9 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4517,9 +4517,7 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb 
> *iocb,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EOPNOTSUPP;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = generic_perform_write(iocb, from);
> - current->backing_dev_info = NULL;
>  
>   if (ret > 0) {
>   iocb->ki_pos += ret;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 89d97f6188e05e..97d435874b14aa 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1362,9 +1362,6 @@ static ssize_t fuse_cache_write_iter(struct kiocb 
> *iocb, struct iov_iter *from)
>  writethrough:
>   inode_lock(inode);
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   err = generic_write_checks(iocb, from);
>   if (err <= 0)
>   goto out;
> @@ -1409,7 +1406,6 @@ static ssize_t fuse_cache_write_iter(struct kiocb 
> *iocb, struct iov_iter *from)
>

Re: [Cluster-devel] cleanup the filemap / direct I/O interaction

2023-05-22 Thread Darrick J. Wong
On Fri, May 19, 2023 at 11:35:08AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series cleans up some of the generic write helper calling
> conventions and the page cache writeback / invalidation for
> direct I/O.  This is a spinoff from the no-bufferhead kernel
> project, for which we'll want to use an iomap-based buffered
> write path in the block layer.

Heh.

For patches 3 and 8, I wonder if you could just get rid of
current->backing_dev_info?

For patches 2, 4-6, and 10:
Acked-by: Darrick J. Wong 

For patches 1, 7, and 9:
Reviewed-by: Darrick J. Wong 

The fuse patches I have no idea about. :/

--D

> diffstat:
>  block/fops.c|   18 
>  fs/ceph/file.c  |6 -
>  fs/direct-io.c  |   10 --
>  fs/ext4/file.c  |   12 ---
>  fs/f2fs/file.c  |3 
>  fs/fuse/file.c  |   47 ++--
>  fs/gfs2/file.c  |7 -
>  fs/iomap/buffered-io.c  |   12 ++-
>  fs/iomap/direct-io.c|   88 --
>  fs/libfs.c  |   36 +
>  fs/nfs/file.c   |6 -
>  fs/xfs/xfs_file.c   |7 -
>  fs/zonefs/file.c|4 -
>  include/linux/fs.h  |7 -
>  include/linux/pagemap.h |4 +
>  mm/filemap.c|  184 
> +---
>  16 files changed, 190 insertions(+), 261 deletions(-)



Re: [Cluster-devel] [PATCH 08/13] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-05-22 Thread Darrick J. Wong
On Fri, May 19, 2023 at 11:35:16AM +0200, Christoph Hellwig wrote:
> Move the assignment to current->backing_dev_info from the callers into
> iomap_file_buffered_write to reduce boiler plate code and reduce the
> scope to just around the page dirtying loop.
> 
> Note that zonefs was missing this assignment before.

I'm still wondering (a) what the hell current->backing_dev_info is for,
and (b) if we need it around the iomap_unshare operation.

$ git grep current..backing_dev_info
fs/btrfs/file.c:1148:   current->backing_dev_info = inode_to_bdi(inode);
fs/btrfs/file.c:1169:   current->backing_dev_info = NULL;
fs/btrfs/file.c:1692:   current->backing_dev_info = NULL;
fs/ceph/file.c:1795:current->backing_dev_info = inode_to_bdi(inode);
fs/ceph/file.c:1943:current->backing_dev_info = NULL;
fs/ext4/file.c:288: current->backing_dev_info = inode_to_bdi(inode);
fs/ext4/file.c:290: current->backing_dev_info = NULL;
fs/f2fs/file.c:4520:current->backing_dev_info = inode_to_bdi(inode);
fs/f2fs/file.c:4522:current->backing_dev_info = NULL;
fs/fuse/file.c:1366:current->backing_dev_info = inode_to_bdi(inode);
fs/fuse/file.c:1412:current->backing_dev_info = NULL;
fs/gfs2/file.c:1044:current->backing_dev_info = inode_to_bdi(inode);
fs/gfs2/file.c:1048:current->backing_dev_info = NULL;
fs/nfs/file.c:652:  current->backing_dev_info = inode_to_bdi(inode);
fs/nfs/file.c:654:  current->backing_dev_info = NULL;
fs/ntfs/file.c:1914:current->backing_dev_info = inode_to_bdi(vi);
fs/ntfs/file.c:1918:current->backing_dev_info = NULL;
fs/ntfs3/file.c:823:current->backing_dev_info = inode_to_bdi(inode);
fs/ntfs3/file.c:996:current->backing_dev_info = NULL;
fs/xfs/xfs_file.c:721:  current->backing_dev_info = inode_to_bdi(inode);
fs/xfs/xfs_file.c:756:  current->backing_dev_info = NULL;
mm/filemap.c:3995:  current->backing_dev_info = inode_to_bdi(inode);
mm/filemap.c:4056:  current->backing_dev_info = NULL;

AFAICT nobody uses it at all?  Unless there's some bizarre user that
isn't extracting it from @current?

Oh, hey, new question (c) isn't this set incorrectly for xfs realtime
files?

--D

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/file.c | 3 ---
>  fs/iomap/buffered-io.c | 3 +++
>  fs/xfs/xfs_file.c  | 5 -
>  3 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 499ef174dec138..261897fcfbc495 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -25,7 +25,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include "gfs2.h"
> @@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
> *iocb,
>   goto out_unlock;
>   }
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   pagefault_disable();
>   ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
>   pagefault_enable();
> - current->backing_dev_info = NULL;
>   if (ret > 0)
>   written += ret;
>  
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 550525a525c45c..b2779bd1f10611 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2010 Red Hat, Inc.
>   * Copyright (C) 2016-2019 Christoph Hellwig.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -869,8 +870,10 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   iter.flags |= IOMAP_NOWAIT;
>  
> + current->backing_dev_info = inode_to_bdi(iter.inode);
>   while ((ret = iomap_iter(&iter, ops)) > 0)
>   iter.processed = iomap_write_iter(&iter, i);
> + current->backing_dev_info = NULL;
>  
>   if (unlikely(ret < 0))
>   return ret;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bfba10e0b0f3c2..98d763cc3b114c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,7 +27,6 @@
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -717,9 +716,6 @@ xfs_file_buffered_write(
>   if (ret)
>   goto out;
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   trace_xfs_file_buffered_write(iocb, from);
>   ret = iomap_file_buffered_write(iocb, from,
>   &xfs_buffered_write_iomap_ops);
> @@ -751,7 +747,6 @@ xfs_file_buffered_write(
>   goto write_retry;
>   }
>  
> - current->backing_dev_info = NULL;
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> -- 
> 2.39.2
> 



Re: [Cluster-devel] [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-04-24 Thread Darrick J. Wong
On Mon, Apr 24, 2023 at 07:49:20AM +0200, Christoph Hellwig wrote:
> Move the assignment to current->backing_dev_info from the callers into
> iomap_file_buffered_write.  Note that zonefs was missing this assignment
> before.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/file.c | 3 ---
>  fs/iomap/buffered-io.c | 4 
>  fs/xfs/xfs_file.c  | 5 -
>  3 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 8c4fad359ff538..4d88c6080b3e30 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -25,7 +25,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include "gfs2.h"
> @@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
> *iocb,
>   goto out_unlock;
>   }
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   pagefault_disable();
>   ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
>   pagefault_enable();
> - current->backing_dev_info = NULL;
>   if (ret > 0) {
>   iocb->ki_pos += ret;
>   written += ret;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2986be63d2bea6..3d5042efda202a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2010 Red Hat, Inc.
>   * Copyright (C) 2016-2019 Christoph Hellwig.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -876,8 +877,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   iter.flags |= IOMAP_NOWAIT;
>  
> + current->backing_dev_info = inode_to_bdi(iter.inode);

Dumb question from me late on a Sunday night, but does the iomap_unshare
code need to set this too?  Since it works by dirtying pagecache folios
without actually changing the contents?

--D

>   while ((ret = iomap_iter(&iter, ops)) > 0)
>   iter.processed = iomap_write_iter(&iter, i);
> + current->backing_dev_info = NULL;
> +
>   if (iter.pos == iocb->ki_pos)
>   return ret;
>   return iter.pos - iocb->ki_pos;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 705250f9f90a1b..f5442e689baf15 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,7 +27,6 @@
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -717,9 +716,6 @@ xfs_file_buffered_write(
>   if (ret)
>   goto out;
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   trace_xfs_file_buffered_write(iocb, from);
>   ret = iomap_file_buffered_write(iocb, from,
>   &xfs_buffered_write_iomap_ops);
> @@ -753,7 +749,6 @@ xfs_file_buffered_write(
>   goto write_retry;
>   }
>  
> - current->backing_dev_info = NULL;
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> -- 
> 2.39.2
> 



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Darrick J. Wong
On Wed, Apr 05, 2023 at 06:02:21PM +0200, Andrey Albershteyn wrote:
> Hi Darrick,
> 
> On Tue, Apr 04, 2023 at 09:36:02AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> > > In case the Merkle tree block size differs from the page size,
> > > fs-verity expects ->read_merkle_tree_page() to return a page filled
> > > with Merkle tree blocks. XFS stores each Merkle tree block under an
> > > extended attribute. Those attributes are addressed by block offset
> > > into the Merkle tree.
> > > 
> > > This patch makes ->read_merkle_tree_page() fetch multiple Merkle
> > > tree blocks based on the size ratio. Also, the reference to each
> > > xfs_buf is passed with page->private to ->drop_page().
> > > 
> > > Signed-off-by: Andrey Albershteyn 
> > > ---
> > >  fs/xfs/xfs_verity.c | 74 +++--
> > >  fs/xfs/xfs_verity.h |  8 +
> > >  2 files changed, 66 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> > > index a9874ff4efcd..ef0aff216f06 100644
> > > --- a/fs/xfs/xfs_verity.c
> > > +++ b/fs/xfs/xfs_verity.c
> > > @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
> > >   struct page *page = NULL;
> > >   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
> > >   uint32_tbs = 1 << log_blocksize;
> > > + int blocks_per_page =
> > > + (1 << (PAGE_SHIFT - log_blocksize));
> > > + int n = 0;
> > > + int offset = 0;
> > >   struct xfs_da_args  args = {
> > >   .dp = ip,
> > >   .attr_filter= XFS_ATTR_VERITY,
> > > @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
> > >   .valuelen   = bs,
> > >   };
> > >   int error = 0;
> > > + boolis_checked = true;
> > > + struct xfs_verity_buf_list  *buf_list;
> > >  
> > >   page = alloc_page(GFP_KERNEL);
> > >   if (!page)
> > >   return ERR_PTR(-ENOMEM);
> > >  
> > > - error = xfs_attr_get(&args);
> > > - if (error) {
> > > - kmem_free(args.value);
> > > - xfs_buf_rele(args.bp);
> > > + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> > > + if (!buf_list) {
> > >   put_page(page);
> > > - return ERR_PTR(-EFAULT);
> > > + return ERR_PTR(-ENOMEM);
> > >   }
> > >  
> > > - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> > > + /*
> > > +  * Fill the page with Merkle tree blocks. The blocks_per_page is higher
> > > +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> > > +  * PAGE_SIZE
> > > +  */
> > > + for (n = 0; n < blocks_per_page; n++) {
> > 
> > Ahah, ok, that's why we can't pass the xfs_buf pages up to fsverity.
> > 
> > > + offset = bs * n;
> > > + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));
> > 
> > Really this ought to be a typechecked helper...
> > 
> > struct xfs_fsverity_merkle_key {
> > __be64  merkleoff;
> 
> Sure, thanks, will change this
> 
> > };
> > 
> > static inline void
> > xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *k, loff_t 
> > pos)
> > {
> > k->merkleoff = cpu_to_be64(pos);
> > }
> > 
> > 
> > 
> > > + args.name = (const uint8_t *)&name;
> > > +
> > > + error = xfs_attr_get(&args);
> > > + if (error) {
> > > + kmem_free(args.value);
> > > + /*
> > > +  * No more Merkle tree blocks (e.g. this was the last
> > > +  * block of the tree)
> > > +  */
> > > + if (error == -ENOATTR)
> > > + break;
> > > + xfs_buf_rele(args.bp);
> > > + put_page(page);
> > > + kmem_free(buf_list);
> > > + return ERR_PTR(-EFAULT);
> > > + }
> > > +
> > > + buf_list->bufs[buf_list->buf_count++] = args.bp;
> > > +
> > > + /* One of the buffers was dropped */
> > > + if (!(ar
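An illustrative Python sketch of the addressing scheme under discussion, assuming 4k pages: each Merkle tree block's xattr name is its big-endian byte offset into the tree (which is what the suggested xfs_fsverity_merkle_key helper would encode in a typechecked way), and a page is filled with `blocks_per_page` consecutive blocks:

```python
# Illustrative model of the Merkle-block xattr addressing in the patch:
# the key for a block is its byte offset into the tree, stored in
# big-endian ("disk") order, derived from the page index and the
# block-within-page number.
import struct

PAGE_SHIFT = 12  # assumed 4k pages for this example

def merkle_key_to_disk(pos):
    """Pack a tree offset the way cpu_to_be64() would."""
    return struct.pack(">Q", pos)

def block_keys(index, log_blocksize):
    """Keys for every Merkle block backing page `index`."""
    bs = 1 << log_blocksize
    blocks_per_page = 1 << (PAGE_SHIFT - log_blocksize)
    return [merkle_key_to_disk((index << PAGE_SHIFT) + n * bs)
            for n in range(blocks_per_page)]

# With 1k Merkle blocks, page 1 holds the blocks at offsets 4096..7168.
assert block_keys(1, 10) == [struct.pack(">Q", o)
                             for o in (4096, 5120, 6144, 7168)]
```

When the Merkle block size equals the page size, `blocks_per_page` collapses to 1 and the loop in the patch degenerates to the single-lookup case.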

Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Darrick J. Wong
On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote:
> On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> > > The direct I/O path is not supported on verity files. Attempts to use
> > > the direct I/O path on such files fall back to the buffered I/O path.
> > > 
> > > Signed-off-by: Andrey Albershteyn 
> > > ---
> > >  fs/xfs/xfs_file.c | 14 +++---
> > >  1 file changed, 11 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 947b5c436172..9e072e82f6c1 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -244,7 +244,8 @@ xfs_file_dax_read(
> > >   struct kiocb*iocb,
> > >   struct iov_iter *to)
> > >  {
> > > - struct xfs_inode*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> > > + struct inode*inode = iocb->ki_filp->f_mapping->host;
> > > + struct xfs_inode*ip = XFS_I(inode);
> > >   ssize_t ret = 0;
> > >  
> > >   trace_xfs_file_dax_read(iocb, to);
> > > @@ -297,10 +298,17 @@ xfs_file_read_iter(
> > >  
> > >   if (IS_DAX(inode))
> > >   ret = xfs_file_dax_read(iocb, to);
> > > - else if (iocb->ki_flags & IOCB_DIRECT)
> > > + else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
> > >   ret = xfs_file_dio_read(iocb, to);
> > > - else
> > > + else {
> > > + /*
> > > +  * In case fs-verity is enabled, we also fall back to the
> > > +  * buffered read from the direct read path. Therefore,
> > > +  * IOCB_DIRECT is set and needs to be cleared.
> > > +  */
> > > + iocb->ki_flags &= ~IOCB_DIRECT;
> > >   ret = xfs_file_buffered_read(iocb, to);
> > 
> > XFS doesn't usually allow directio fallback to the pagecache. Why
> > would fsverity be any different?
> 
> Didn't know that; this is what happens on ext4, so I did the same.
> Then it probably makes sense to just return an error for DIRECT on a
> verity-sealed file.

Thinking about this a little more -- I suppose we shouldn't just go
breaking directio reads from a verity file if we can help it.  Is there
a way to ask fsverity to perform its validation against some arbitrary
memory buffer that happens to be fs-block aligned?  In which case we
could support fsblock-aligned directio reads without falling back to the
page cache?

--D

> > 
> > --D
> > 
> > > + }
> > >  
> > >   if (ret > 0)
> > >   XFS_STATS_ADD(mp, xs_read_bytes, ret);
> > > -- 
> > > 2.38.4
> > > 
> > 
> 
> -- 
> - Andrey
> 



Re: [Cluster-devel] [PATCH v2 09/23] iomap: allow filesystem to implement read path verification

2023-04-05 Thread Darrick J. Wong
On Wed, Apr 05, 2023 at 01:01:16PM +0200, Andrey Albershteyn wrote:
> Hi Christoph,
> 
> On Tue, Apr 04, 2023 at 08:37:02AM -0700, Christoph Hellwig wrote:
> > >   if (iomap_block_needs_zeroing(iter, pos)) {
> > >   folio_zero_range(folio, poff, plen);
> > > + if (iomap->flags & IOMAP_F_READ_VERITY) {
> > 
> > Why do we need the new flag vs just testing that folio_ops and
> > folio_ops->verify_folio is non-NULL?
> 
> Yes, it can be just test, haven't noticed that it's used only here,
> initially I used it in several places.
> 
> > 
> > > - ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
> > > -  REQ_OP_READ, gfp);
> > > + ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
> > > + REQ_OP_READ, GFP_NOFS, 
> > > &iomap_read_ioend_bioset);
> > 
> > All other callers don't really need the larger bioset, so I'd avoid
> > the unconditional allocation here, but more on that later.
> 
> Ok, make sense.
> 
> > 
> > > + ioend = container_of(ctx->bio, struct iomap_read_ioend,
> > > + read_inline_bio);
> > > + ioend->io_inode = iter->inode;
> > > + if (ctx->ops && ctx->ops->prepare_ioend)
> > > + ctx->ops->prepare_ioend(ioend);
> > > +
> > 
> > So what we're doing in writeback and direct I/O, is to:
> > 
> >  a) have a submit_bio hook
> >  b) allow the file system to then hook the bi_end_io caller
> >  c) (only in direct I/O for now) allow the file system to provide
> > a bio_set to allocate from
> 
> I see.
> 
> > 
> > I wonder if that also makes sense and keep all the deferral in the
> > file system.  We'll need that for the btrfs iomap conversion anyway,
> > and it seems more flexible.  The ioend processing would then move into
> > XFS.
> > 
> 
> Not sure what you mean here.

I /think/ Christoph is talking about allowing callers of iomap pagecache
operations to supply a custom submit_bio function and a bio_set so that
filesystems can add in their own post-IO processing and appropriately
sized (read: minimum you can get away with) bios.  I imagine btrfs has
quite a lot of (read) ioend processing they need to do, as will xfs now
that you're adding fsverity.

> > > @@ -156,6 +160,11 @@ struct iomap_folio_ops {
> > >* locked by the iomap code.
> > >*/
> > >   bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
> > > +
> > > + /*
> > > +  * Verify folio when successfully read
> > > +  */
> > > + bool (*verify_folio)(struct folio *folio, loff_t pos, unsigned int len);

Any reason why we shouldn't return the usual negative errno?

> > Why isn't this in iomap_readpage_ops?
> > 
> 
> Yes, it can be. But it appears to me to be more relevant to
> _folio_ops, any particular reason to move it there? Don't mind
> moving it to iomap_readpage_ops.

I think the point is that this is a general "check what we just read"
hook, so it could be in readpage_ops since we're never going to need to
re-validate verity contents, right?  Hence it could be in readpage_ops
instead of the general iomap_folio_ops.

Is there a use case for ->verify_folio that isn't a read
post-processing step?

--D

> -- 
> - Andrey
> 



Re: [Cluster-devel] [PATCH v2 00/23] fs-verity support for XFS

2023-04-04 Thread Darrick J. Wong
On Tue, Apr 04, 2023 at 04:52:56PM +0200, Andrey Albershteyn wrote:
> Hi all,
> 
> This is V2 of fs-verity support in XFS. In this series I did
> numerous changes from V1 which are described below.
> 
> This patchset introduces fs-verity [5] support for XFS. This
> implementation utilizes extended attributes to store fs-verity
> metadata. The Merkle tree blocks are stored in the remote extended
> attributes.
> 
> A few key points:
> - fs-verity metadata is stored in extended attributes
> - Direct path and DAX are disabled for inodes with fs-verity
> - Pages are verified in iomap's read IO path (offloaded to
>   workqueue)
> - New workqueue for verification processing
> - New ro-compat flag
> - Inodes with fs-verity have new on-disk diflag
> - xfs_attr_get() can return buffer with the attribute
> 
> The patchset is tested with xfstests -g auto on xfs_1k, xfs_4k,
> xfs_1k_quota, and xfs_4k_quota. Haven't found any major failures.
> 
> Patches [6/23] and [7/23] touch ext4, f2fs, btrfs, and patch [8/23]
> touches erofs, gfs2, and zonefs.
> 
> The patchset consist of four parts:
> - [1..4]: Patches from Parent Pointer patchset which add binary
>   xattr names with a few deps
> - [5..7]: Improvements to core fs-verity
> - [8..9]: Add read path verification to iomap
> - [10..23]: Integration of fs-verity to xfs
> 
> Changes from V1:
> - Added parent pointer patches for easier testing
> - Many issues and refactoring points fixed from the V1 review
> - Adjusted for recent changes in fs-verity core (folios, non-4k)
> - Dropped disabling of large folios
> - Completely new fsverity patches (fix, callout, log_blocksize)
> - Change approach to verification in iomap to the same one as in
>   write path. Callouts to fs instead of direct fs-verity use.
> - New XFS workqueue for post read folio verification
> - xfs_attr_get() can return underlying xfs_buf
> - xfs_bufs are marked with XBF_VERITY_CHECKED to track verified
>   blocks
> 
> kernel:
> [1]: https://github.com/alberand/linux/tree/xfs-verity-v2
> 
> xfsprogs:
> [2]: https://github.com/alberand/xfsprogs/tree/fsverity-v2

Will there be any means for xfs_repair to check the merkle tree contents?
Should it clear the ondisk inode flag if it decides to trash the xattr
structure, or is it ok to let the kernel deal with flag set and no
verity data?

--D

> xfstests:
> [3]: https://github.com/alberand/xfstests/tree/fsverity-v2
> 
> v1:
> [4]: 
> https://lore.kernel.org/linux-xfs/20221213172935.680971-1-aalbe...@redhat.com/
> 
> fs-verity:
> [5]: https://www.kernel.org/doc/html/latest/filesystems/fsverity.html
> 
> Thanks,
> Andrey
> 
> Allison Henderson (4):
>   xfs: Add new name to attri/d
>   xfs: add parent pointer support to attribute code
>   xfs: define parent pointer xattr format
>   xfs: Add xfs_verify_pptr
> 
> Andrey Albershteyn (19):
>   fsverity: make fsverity_verify_folio() accept folio's offset and size
>   fsverity: add drop_page() callout
>   fsverity: pass Merkle tree block size to ->read_merkle_tree_page()
>   iomap: hoist iomap_readpage_ctx from the iomap_readahead/_folio
>   iomap: allow filesystem to implement read path verification
>   xfs: add XBF_VERITY_CHECKED xfs_buf flag
>   xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer
>   xfs: introduce workqueue for post read IO work
>   xfs: add iomap's readpage operations
>   xfs: add attribute type for fs-verity
>   xfs: add fs-verity ro-compat flag
>   xfs: add inode on-disk VERITY flag
>   xfs: initialize fs-verity on file open and cleanup on inode
> destruction
>   xfs: don't allow to enable DAX on fs-verity sealsed inode
>   xfs: disable direct read path for fs-verity sealed files
>   xfs: add fs-verity support
>   xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
>   xfs: add fs-verity ioctls
>   xfs: enable ro-compat fs-verity flag
> 
>  fs/btrfs/verity.c   |  15 +-
>  fs/erofs/data.c |  12 +-
>  fs/ext4/verity.c|   9 +-
>  fs/f2fs/verity.c|   9 +-
>  fs/gfs2/aops.c  |  10 +-
>  fs/ioctl.c  |   4 +
>  fs/iomap/buffered-io.c  |  89 ++-
>  fs/verity/read_metadata.c   |   7 +-
>  fs/verity/verify.c  |   9 +-
>  fs/xfs/Makefile |   1 +
>  fs/xfs/libxfs/xfs_attr.c|  81 +-
>  fs/xfs/libxfs/xfs_attr.h|   7 +-
>  fs/xfs/libxfs/xfs_attr_leaf.c   |   7 +
>  fs/xfs/libxfs/xfs_attr_remote.c |  13 +-
>  fs/xfs/libxfs/xfs_da_btree.h|   7 +-
>  fs/xfs/libxfs/xfs_da_format.h   |  46 +-
>  fs/xfs/libxfs/xfs_format.h  |  14 +-
>  fs/xfs/libxfs/xfs_log_format.h  |   8 +-
>  fs/xfs/libxfs/xfs_sb.c  |   2 +
>  fs/xfs/scrub/attr.c |   4 +-
>  fs/xfs/xfs_aops.c   |  61 +++-
>  fs/xfs/xfs_attr_item.c  | 142 +++---
>  fs/xfs/xfs_attr_item.h  |   1 +
>  fs/xfs/xfs_attr_list.c  |  17 ++-
>  fs/xfs/xfs_buf.h|  17 ++-
>  

Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-04 Thread Darrick J. Wong
On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> In case of different Merkle tree block size fs-verity expects
> ->read_merkle_tree_page() to return Merkle tree page filled with
> Merkle tree blocks. XFS stores each merkle tree block under an
> extended attribute. Those attributes are addressed by block offset
> into the Merkle tree.
> 
> This patch makes ->read_merkle_tree_page() fetch multiple merkle
> tree blocks based on size ratio. Also the reference to each xfs_buf
> is passed with page->private to ->drop_page().
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/xfs_verity.c | 74 +++--
>  fs/xfs/xfs_verity.h |  8 +
>  2 files changed, 66 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> index a9874ff4efcd..ef0aff216f06 100644
> --- a/fs/xfs/xfs_verity.c
> +++ b/fs/xfs/xfs_verity.c
> @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
>   struct page *page = NULL;
>   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
>   uint32_t bs = 1 << log_blocksize;
> + int blocks_per_page =
> + (1 << (PAGE_SHIFT - log_blocksize));
> + int n = 0;
> + int offset = 0;
>   struct xfs_da_args  args = {
>   .dp = ip,
>   .attr_filter= XFS_ATTR_VERITY,
> @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
>   .valuelen   = bs,
>   };
>   int error = 0;
> + boolis_checked = true;
> + struct xfs_verity_buf_list  *buf_list;
>  
>   page = alloc_page(GFP_KERNEL);
>   if (!page)
>   return ERR_PTR(-ENOMEM);
>  
> - error = xfs_attr_get(&args);
> - if (error) {
> - kmem_free(args.value);
> - xfs_buf_rele(args.bp);
> + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> + if (!buf_list) {
>   put_page(page);
> - return ERR_PTR(-EFAULT);
> + return ERR_PTR(-ENOMEM);
>   }
>  
> - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> + /*
> +  * Fill the page with Merkle tree blocks. The blocks_per_page is higher
> +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> +  * PAGE_SIZE
> +  */
> + for (n = 0; n < blocks_per_page; n++) {

Ahah, ok, that's why we can't pass the xfs_buf pages up to fsverity.

> + offset = bs * n;
> + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));

Really this ought to be a typechecked helper...

struct xfs_fsverity_merkle_key {
__be64  merkleoff;
};

static inline void
xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *k, loff_t pos)
{
k->merkleoff = cpu_to_be64(pos);
}



> + args.name = (const uint8_t *)&name;
> +
> + error = xfs_attr_get(&args);
> + if (error) {
> + kmem_free(args.value);
> + /*
> +  * No more Merkle tree blocks (e.g. this was the last
> +  * block of the tree)
> +  */
> + if (error == -ENOATTR)
> + break;
> + xfs_buf_rele(args.bp);
> + put_page(page);
> + kmem_free(buf_list);
> + return ERR_PTR(-EFAULT);
> + }
> +
> + buf_list->bufs[buf_list->buf_count++] = args.bp;
> +
> + /* One of the buffers was dropped */
> + if (!(args.bp->b_flags & XBF_VERITY_CHECKED))
> + is_checked = false;

If there's enough memory pressure to cause the merkle tree pages to get
evicted, what are the chances that the xfs_bufs survive the eviction?

> + memcpy(page_address(page) + offset, args.value, args.valuelen);
> + kmem_free(args.value);
> + args.value = NULL;
> + }
> +
> + if (is_checked)
>   SetPageChecked(page);
> + page->private = (unsigned long)buf_list;
>  
> - page->private = (unsigned long)args.bp;
> - memcpy(page_address(page), args.value, args.valuelen);
> -
> - kmem_free(args.value);
>   return page;
>  }
>  
> @@ -191,16 +228,21 @@ xfs_write_merkle_tree_block(
>  
>  static void
>  xfs_drop_page(
> - struct page *page)
> + struct page *page)
>  {
> - struct xfs_buf *buf = (struct xfs_buf *)page->private;
> + int i = 0;
> + struct xfs_verity_buf_list  *buf_list =
> + (struct xfs_verity_buf_list *)page->private;
>  
> - ASSERT(buf != NULL);
> + ASSERT(buf_list != NULL);
>  
> - if (PageChecked(page))
> - buf->b_flags |= XBF_VERITY_CHECKED;
> + for (i = 0; i < buf_list->buf_count; i++) {
> + if 

Re: [Cluster-devel] [PATCH v2 20/23] xfs: add fs-verity support

2023-04-04 Thread Darrick J. Wong
On Tue, Apr 04, 2023 at 04:53:16PM +0200, Andrey Albershteyn wrote:
> Add integration with fs-verity. XFS stores fs-verity metadata in
> extended attributes. The metadata consists of the verity descriptor
> and Merkle tree blocks.
> 
> The descriptor is stored under "verity_descriptor" extended
> attribute. The Merkle tree blocks are stored under binary indexes.
> 
> When fs-verity is enabled on an inode, the XFS_IVERITY_CONSTRUCTION
> flag is set meaning that the Merkle tree is being built. The
> initialization ends with storing of verity descriptor and setting
> inode on-disk flag (XFS_DIFLAG2_VERITY).
> 
> The verification on read is done in iomap. Based on the inode verity
> flag the IOMAP_F_READ_VERITY is set in xfs_read_iomap_begin() to let
> iomap know that verification is needed.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/Makefile  |   1 +
>  fs/xfs/libxfs/xfs_attr.c |  13 +++
>  fs/xfs/xfs_inode.h   |   3 +-
>  fs/xfs/xfs_iomap.c   |   3 +
>  fs/xfs/xfs_ondisk.h  |   4 +
>  fs/xfs/xfs_super.c   |   8 ++
>  fs/xfs/xfs_verity.c  | 214 +++
>  fs/xfs/xfs_verity.h  |  19 
>  8 files changed, 264 insertions(+), 1 deletion(-)
>  create mode 100644 fs/xfs/xfs_verity.c
>  create mode 100644 fs/xfs/xfs_verity.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 92d88dc3c9f7..76174770d91a 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -130,6 +130,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)   += xfs_acl.o
>  xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
> +xfs-$(CONFIG_FS_VERITY)  += xfs_verity.o
>  
>  # notify failure
>  ifeq ($(CONFIG_MEMORY_FAILURE),y)
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 298b74245267..39d9038fbeee 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -26,6 +26,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
> +#include "xfs_verity.h"
>  
>  struct kmem_cache *xfs_attr_intent_cache;
>  
> @@ -1635,6 +1636,18 @@ xfs_attr_namecheck(
>   return xfs_verify_pptr(mp, (struct xfs_parent_name_rec *)name);
>   }
>  
> + if (flags & XFS_ATTR_VERITY) {
> + /* Merkle tree pages are stored under u64 indexes */
> + if (length == sizeof(__be64))

This ondisk structure should be actual structs that we can grep and
ctags on, not open-coded __be64 scattered around the xattr code.

> + return true;
> +
> + /* Verity descriptor blocks are held in a named attribute. */
> + if (length == XFS_VERITY_DESCRIPTOR_NAME_LEN)
> + return true;
> +
> + return false;
> + }
> +
>   return xfs_str_attr_namecheck(name, length);
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 69d21e42c10a..a95f28cb049f 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -324,7 +324,8 @@ static inline bool 
> xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * inactivation completes, both flags will be cleared and the inode is a
>   * plain old IRECLAIMABLE inode.
>   */
> -#define XFS_INACTIVATING (1 << 13)
> +#define XFS_INACTIVATING (1 << 13)
> +#define XFS_IVERITY_CONSTRUCTION (1 << 14) /* merkle tree construction */
>  
>  /* All inode state flags related to inode reclaim. */
>  #define XFS_ALL_IRECLAIM_FLAGS   (XFS_IRECLAIMABLE | \
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e0f3c5d709f6..0adde39f02a5 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -143,6 +143,9 @@ xfs_bmbt_to_iomap(
>   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>   iomap->flags |= IOMAP_F_DIRTY;
>  
> + if (fsverity_active(VFS_I(ip)))
> + iomap->flags |= IOMAP_F_READ_VERITY;
> +
>   iomap->validity_cookie = sequence_cookie;
> + iomap->folio_ops = &xfs_iomap_folio_ops;
>   return 0;
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 9737b5a9f405..7fe88ccda519 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -189,6 +189,10 @@ xfs_check_ondisk_structs(void)
>   XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MIN << XFS_DQ_BIGTIME_SHIFT, 4);
>   XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MAX << XFS_DQ_BIGTIME_SHIFT,
>   16299260424LL);
> +
> + /* fs-verity descriptor xattr name */
> + XFS_CHECK_VALUE(strlen(XFS_VERITY_DESCRIPTOR_NAME),

Are you encoding the trailing null in the xattr name too?  The attr name
length is stored explicitly, so the null isn't strictly necessary.

> + XFS_VERITY_DESCRIPTOR_NAME_LEN);
>  }
>  
>  #endif /* __XFS_ONDISK_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d40de32362b1..b6e99ed3b187 100644
> --- a/fs/xfs/xfs_super.c
> +++ 

Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-04 Thread Darrick J. Wong
On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> The direct path is not supported on verity files. Attempts to use direct
> I/O path on such files should fall back to buffered I/O path.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/xfs_file.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 947b5c436172..9e072e82f6c1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -244,7 +244,8 @@ xfs_file_dax_read(
>   struct kiocb*iocb,
>   struct iov_iter *to)
>  {
> - struct xfs_inode*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> + struct inode*inode = iocb->ki_filp->f_mapping->host;
> + struct xfs_inode*ip = XFS_I(inode);
>   ssize_t ret = 0;
>  
>   trace_xfs_file_dax_read(iocb, to);
> @@ -297,10 +298,17 @@ xfs_file_read_iter(
>  
>   if (IS_DAX(inode))
>   ret = xfs_file_dax_read(iocb, to);
> - else if (iocb->ki_flags & IOCB_DIRECT)
> + else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
>   ret = xfs_file_dio_read(iocb, to);
> - else
> + else {
> + /*
> +  * In case fs-verity is enabled, we also fall back to the
> +  * buffered read from the direct read path. Therefore,
> +  * IOCB_DIRECT is set and needs to be cleared
> +  */
> + iocb->ki_flags &= ~IOCB_DIRECT;
>   ret = xfs_file_buffered_read(iocb, to);

XFS doesn't usually allow directio fallback to the pagecache.  Why would
fsverity be any different?

--D

> + }
>  
>   if (ret > 0)
>   XFS_STATS_ADD(mp, xs_read_bytes, ret);
> -- 
> 2.38.4
> 



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Darrick J. Wong
On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > I don't have any objections to pulling everything except patches 8 and
> > 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.

Ok, done.

> > 
> > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > don't think it does, but OTOH zone pointer management might complicate
> > that.
> 
> Adding Damien.
> 
> > 2. How about porting the writeback iomap validation to use this
> > mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the here removed ->iomap_valid
> ?   writeback calls into ->map_blocks for every block while under the
> folio lock, so the validation can (and for XFS currently is) done
> in that.  Moving it out into a separate method with extra indirect
> function call overhead and interactions between the methods seems
> like a retrograde step to me.

Sorry, I should've been more specific -- can xfs writeback use the
validity cookie in struct iomap and thereby get rid of struct
xfs_writepage_ctx entirely?

> > 2. Do we need to revalidate mappings for directio writes?  I think the
> > answer is no (for xfs) because the ->iomap_begin call will allocate
> > whatever blocks are needed and truncate/punch/reflink block on the
> > iolock while the directio writes are pending, so you'll never end up
> > with a stale mapping.
> 
> Yes.

Er... yes as in "Yes, we *do* need to revalidate directio writes", or
"Yes, your reasoning is correct"?

--D



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-15 Thread Darrick J. Wong
On Tue, Jan 10, 2023 at 02:09:07AM +0100, Andreas Grünbacher wrote:
> Am Mo., 9. Jan. 2023 um 23:58 Uhr schrieb Dave Chinner :
> > On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner  wrote:
> > > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > > handler and validating the mapping there.
> > > > >
> > > > > Signed-off-by: Andreas Gruenbacher 
> > > >
> > > > I think this is wrong.
> > > >
> > > > The ->iomap_valid() function handles a fundamental architectural
> > > > issue with cached iomaps: the iomap can become stale at any time
> > > > whilst it is in use by the iomap core code.
> > > >
> > > > The current problem it solves in the iomap_write_begin() path has to
> > > > do with writeback and memory reclaim races over unwritten extents,
> > > > but the general case is that we must be able to check the iomap
> > > > at any point in time to assess it's validity.
> > > >
> > > > Indeed, we also have this same "iomap valid check" functionality in the
> > > > writeback code as cached iomaps can become stale due to racing
> > > > writeback, truncated, etc. But you wouldn't know it by looking at the 
> > > > iomap
> > > > writeback code - this is currently hidden by XFS by embedding
> > > > the checks into the iomap writeback ->map_blocks function.
> > > >
> > > > That is, the first thing that xfs_map_blocks() does is check if the
> > > > cached iomap is valid, and if it is valid it returns immediately and
> > > > the iomap writeback code uses it without question.
> > > >
> > > > The reason that this is embedded like this is that the iomap did not
> > > > have a validity cookie field in it, and so the validity information
> > > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > > filesystem has to decode it from that private wrapping structure.
> > > >
> > > > However, the validity information in the structure wrapper is
> > > > identical to the iomap validity cookie,
> > >
> > > Then could that part of the xfs code be converted to use
> > > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > > eliminated?
> >
> > Yes, that is the plan.
> >
> > >
> > > > and so the direction I've
> > > > been working towards is to replace this implicit, hidden cached
> > > > iomap validity check with an explicit ->iomap_valid call and then
> > > > only call ->map_blocks if the validity check fails (or is not
> > > > implemented).
> > > >
> > > > I want to use the same code for all the iomap validity checks in all
> > > > the iomap core code - this is an iomap issue, the conditions where
> > > > we need to check for iomap validity are different for depending on
> > > > the iomap context being run, and the checks are not necessarily
> > > > dependent on first having locked a folio.
> > > >
> > > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > > that does not dictate where the validity checking needs to be done
> > > > by the iomap core.
> > > >
> > > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > > iomap core code - the iomap core needs to be able to formally verify
> > > > the iomap is valid at any point in time, not just at the point in
> > > > time a folio in the page cache has been locked...
> > >
> > > We don't need to validate an iomap "at any time". It's two specific
> > > places in the code in which we need to check, and we're not going to
> > > end up with ten more such places tomorrow.
> >
> > Not immediately, but that doesn't change the fact this is not a
> > filesystem specific issue - it's an inherent characteristic of
> > cached iomaps and unsynchronised extent state changes that occur
> > outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> > IO completion contexts).
> >
> > Racing mmap + buffered writes can expose these state changes as the
> > iomap bufferred write IO path is not serialised against the iomap
> > mmap IO path except via folio locks. Hence a mmap page fault can
> > invalidate a cached buffered write iomap by causing a hole ->
> > unwritten, hole -> delalloc or hole -> written conversion in the
> > middle of the buffered write range. The buffered write still has a
> > hole mapping cached for that entire range, and it is now incorrect.
> >
> > If the mmap write happens to change extent state at the trailing
> > edge of a partial buffered write, data corruption will occur if we
> > race just right with writeback and memory reclaim. I'm pretty sure
> > that this corruption can be reproduced on gfs2 if we try hard enough
> > - generic/346 triggers the mmap/write race condition, all that is
> > needed from that point is for writeback and reclaiming pages at
> > exactly the right time...
> >
> > > I'd prefer to keep those
> > > filesystem internals in the filesystem specific code instead of
> > > exposing them to 

Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-15 Thread Darrick J. Wong
On Sun, Jan 15, 2023 at 09:01:22AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 10, 2023 at 01:34:16PM +, Matthew Wilcox wrote:
> > On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > > checking for that in iomap_get_folio().  Your patch then turns into
> > > > the below.
> > > 
> > > Exactly.  And as I already pointed out in reply to Dave's original
> > > patch what we really should be doing is returning an ERR_PTR from
> > > __filemap_get_folio instead of reverse-engineering the expected
> > > error code.
> > 
> > Ouch, we have a nasty problem.
> > 
> > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > meaning that some shadow entries will look like errors.  The way I
> > solved this in the XArray code is by shifting the error values by
> > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > 
> > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > but so far we haven't, and I'd like to make that decision intentionally.
> 
> Sorry, I'm not following this at all -- where in buffered-io.c does
> anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
> either...?

Oh, never mind, I worked out that the conflict is between iomap not
passing FGP_ENTRY and wanting a pointer or a negative errno; and someone
who does FGP_ENTRY, in which case the xarray value can be confused for a
negative errno.

OFC now I wonder, can we simply say that the return value is "The found
folio or NULL if you set FGP_ENTRY; or the found folio or a negative
errno if you don't" ?

--D

> --D



Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-15 Thread Darrick J. Wong
On Tue, Jan 10, 2023 at 01:34:16PM +, Matthew Wilcox wrote:
> On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > checking for that in iomap_get_folio().  Your patch then turns into
> > > the below.
> > 
> > Exactly.  And as I already pointed out in reply to Dave's original
> > patch what we really should be doing is returning an ERR_PTR from
> > __filemap_get_folio instead of reverse-engineering the expected
> > error code.
> 
> Ouch, we have a nasty problem.
> 
> If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> encodings for shadow entries overlap with the encodings for ERR_PTR,
> meaning that some shadow entries will look like errors.  The way I
> solved this in the XArray code is by shifting the error values by
> two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> 
> I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> but so far we haven't, and I'd like to make that decision intentionally.

Sorry, I'm not following this at all -- where in buffered-io.c does
anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
either...?

--D



Re: [Cluster-devel] [PATCH v2] filelock: move file locking definitions to separate header file

2023-01-10 Thread Darrick J. Wong
On Thu, Jan 05, 2023 at 04:19:29PM -0500, Jeff Layton wrote:
> The file locking definitions have lived in fs.h since the dawn of time,
> but they are only used by a small subset of the source files that
> include it.
> 
> Move the file locking definitions to a new header file, and add the
> appropriate #include directives to the source files that need them. By
> doing this we trim down fs.h a bit and limit the amount of rebuilding
> that has to be done when we make changes to the file locking APIs.
> 
> Reviewed-by: Xiubo Li 
> Reviewed-by: Christian Brauner (Microsoft) 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: David Howells 
> Acked-by: Chuck Lever 
> Acked-by: Joseph Qi 
> Acked-by: Steve French 
> Signed-off-by: Jeff Layton 

The XFS part looks good,
Acked-by: Darrick J. Wong 

--D

> ---
>  arch/arm/kernel/sys_oabi-compat.c |   1 +
>  fs/9p/vfs_file.c  |   1 +
>  fs/afs/internal.h |   1 +
>  fs/attr.c |   1 +
>  fs/ceph/locks.c   |   1 +
>  fs/cifs/cifsfs.c  |   1 +
>  fs/cifs/cifsglob.h|   1 +
>  fs/cifs/cifssmb.c |   1 +
>  fs/cifs/file.c|   1 +
>  fs/cifs/smb2file.c|   1 +
>  fs/dlm/plock.c|   1 +
>  fs/fcntl.c|   1 +
>  fs/file_table.c   |   1 +
>  fs/fuse/file.c|   1 +
>  fs/gfs2/file.c|   1 +
>  fs/inode.c|   1 +
>  fs/ksmbd/smb2pdu.c|   1 +
>  fs/ksmbd/vfs.c|   1 +
>  fs/ksmbd/vfs_cache.c  |   1 +
>  fs/lockd/clntproc.c   |   1 +
>  fs/lockd/netns.h  |   1 +
>  fs/locks.c|   1 +
>  fs/namei.c|   1 +
>  fs/nfs/file.c |   1 +
>  fs/nfs/nfs4_fs.h  |   1 +
>  fs/nfs/pagelist.c |   1 +
>  fs/nfs/write.c|   1 +
>  fs/nfs_common/grace.c |   1 +
>  fs/nfsd/netns.h   |   1 +
>  fs/ocfs2/locks.c  |   1 +
>  fs/ocfs2/stack_user.c |   1 +
>  fs/open.c |   1 +
>  fs/orangefs/file.c|   1 +
>  fs/posix_acl.c|   1 +
>  fs/proc/fd.c  |   1 +
>  fs/utimes.c   |   1 +
>  fs/xattr.c|   1 +
>  fs/xfs/xfs_linux.h|   1 +
>  include/linux/filelock.h  | 438 ++
>  include/linux/fs.h| 426 -
>  include/linux/lockd/xdr.h |   1 +
>  41 files changed, 477 insertions(+), 426 deletions(-)
>  create mode 100644 include/linux/filelock.h
> 
> v2:
> - drop pointless externs from the new filelock.h
> - move include into xfs_linux.h instead of xfs_buf.h
> - have filelock.h #include fs.h. Any file including filelock.h will
>   almost certainly need fs.h anyway.
> 
> I've left some of Al's comments unaddressed for now, as I'd like to keep
> this move mostly mechanical, and no go changing function prototypes
> until the dust settles.
> 
> I'll plan to drop this into linux-next within the next few days, with an
> eye toward merging this for v6.3.
> 
> diff --git a/arch/arm/kernel/sys_oabi-compat.c 
> b/arch/arm/kernel/sys_oabi-compat.c
> index 68112c172025..006163195d67 100644
> --- a/arch/arm/kernel/sys_oabi-compat.c
> +++ b/arch/arm/kernel/sys_oabi-compat.c
> @@ -73,6 +73,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index b740017634ef..b6ba22975781 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> index fd8567b98e2b..2d6d7dae225a 100644
> --- a/fs/afs/internal.h
> +++ b/fs/afs/internal.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/fs/attr.c b/fs/attr.c
> index b45f30e516fa..f3eb8e57b451 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index f3b461c708a8..476f25bba263 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -7,6 +7,7 @@
>  
>  #include "super.h"
>  #include 

Re: [Cluster-devel] [PATCH v5 7/9] iomap/xfs: Eliminate the iomap_valid handler

2023-01-10 Thread Darrick J. Wong
On Sun, Jan 08, 2023 at 07:50:01PM +0100, Andreas Gruenbacher wrote:
> On Sun, Jan 8, 2023 at 6:32 PM Christoph Hellwig  wrote:
> > On Wed, Jan 04, 2023 at 07:08:17PM +, Matthew Wilcox wrote:
> > > On Wed, Jan 04, 2023 at 09:53:17AM -0800, Darrick J. Wong wrote:
> > > > I wonder if this should be reworked a bit to reduce indenting:
> > > >
> > > > if (PTR_ERR(folio) == -ESTALE) {
> > >
> > > FYI this is a bad habit to be in.  The compiler can optimise
> > >
> > >   if (folio == ERR_PTR(-ESTALE))
> > >
> > > better than it can optimise the other way around.
> >
> > Yes.  I think doing the recording that Darrick suggested combined
> > with this style would be best:
> >
> > if (folio == ERR_PTR(-ESTALE)) {
> > iter->iomap.flags |= IOMAP_F_STALE;
> > return 0;
> > }
> > if (IS_ERR(folio))
> > return PTR_ERR(folio);
> 
> Again, I've implemented this as a nested if because the -ESTALE case
> should be pretty rare, and if we unnest, we end up with an additional
> check on the main code path. To be specific, the "before" code here on
> my current system is this:
> 
> 
> if (IS_ERR(folio)) {
> 22ad:   48 81 fd 00 f0 ff ff    cmp    $0xf000,%rbp
> 22b4:   0f 87 bf 03 00 00       ja     2679
> return 0;
> }
> return PTR_ERR(folio);
> }
> [...]
> 2679:   89 e8                   mov    %ebp,%eax
> if (folio == ERR_PTR(-ESTALE)) {
> 267b:   48 83 fd 8c             cmp    $0xff8c,%rbp
> 267f:   0f 85 b7 fc ff ff       jne    233c
> iter->iomap.flags |= IOMAP_F_STALE;
> 2685:   66 81 4b 42 00 02       orw    $0x200,0x42(%rbx)
> return 0;
> 268b:   e9 aa fc ff ff          jmp    233a
> 
> 
> While the "after" code is this:
> 
> 
> if (folio == ERR_PTR(-ESTALE)) {
> 22ad:   48 83 fd 8c             cmp    $0xff8c,%rbp
> 22b1:   0f 84 bc 00 00 00       je     2373
> iter->iomap.flags |= IOMAP_F_STALE;
> return 0;
> }
> if (IS_ERR(folio))
> return PTR_ERR(folio);
> 22b7:   89 e8                   mov    %ebp,%eax
> if (IS_ERR(folio))
> 22b9:   48 81 fd 00 f0 ff ff    cmp    $0xf000,%rbp
> 22c0:   0f 87 82 00 00 00       ja     2348
> 
> 
> The compiler isn't smart enough to re-nest the ifs by recognizing that
> folio == ERR_PTR(-ESTALE) is a subset of IS_ERR(folio).
> 
> So do you still insist on that un-nesting even though it produces worse code?

Me?  Not anymore. :)

--D

> Thanks,
> Andreas
> 



Re: [Cluster-devel] [PATCH v5 6/9] iomap: Rename page_prepare handler to get_folio

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:16PM +0100, Andreas Gruenbacher wrote:
> The ->page_prepare() handler in struct iomap_page_ops is now somewhat
> misnamed, so rename it to ->get_folio().
> 
> Signed-off-by: Andreas Gruenbacher 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c | 6 +++---
>  fs/iomap/buffered-io.c | 4 ++--
>  include/linux/iomap.h  | 6 +++---
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 41349e09558b..d3adb715ac8c 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -957,7 +957,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t 
> pos, loff_t length,
>  }
>  
>  static struct folio *
> -gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
> +gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
>  {
>   struct inode *inode = iter->inode;
>   unsigned int blockmask = i_blocksize(inode) - 1;
> @@ -998,7 +998,7 @@ static void gfs2_iomap_put_folio(struct inode *inode, 
> loff_t pos,
>  }
>  
>  static const struct iomap_page_ops gfs2_iomap_page_ops = {
> - .page_prepare = gfs2_iomap_page_prepare,
> + .get_folio = gfs2_iomap_get_folio,
>   .put_folio = gfs2_iomap_put_folio,
>  };
>  
> @@ -1291,7 +1291,7 @@ int gfs2_alloc_extent(struct inode *inode, u64 lblock, 
> u64 *dblock,
>  /*
>   * NOTE: Never call gfs2_block_zero_range with an open transaction because it
>   * uses iomap write to perform its actions, which begin their own 
> transactions
> - * (iomap_begin, page_prepare, etc.)
> + * (iomap_begin, get_folio, etc.)
>   */
>  static int gfs2_block_zero_range(struct inode *inode, loff_t from,
>unsigned int length)
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7decd8cdc755..4f363d42dbaf 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -642,8 +642,8 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   if (!mapping_large_folio_support(iter->inode->i_mapping))
>   len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>  
> - if (page_ops && page_ops->page_prepare)
> - folio = page_ops->page_prepare(iter, pos, len);
> + if (page_ops && page_ops->get_folio)
> + folio = page_ops->get_folio(iter, pos, len);
>   else
>   folio = iomap_get_folio(iter, pos);
>   if (IS_ERR(folio))
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 87b5d0f8e578..dd3575ada5d1 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -126,17 +126,17 @@ static inline bool iomap_inline_data_valid(const struct 
> iomap *iomap)
>  }
>  
>  /*
> - * When a filesystem sets page_ops in an iomap mapping it returns, 
> page_prepare
> + * When a filesystem sets page_ops in an iomap mapping it returns, get_folio
>   * and put_folio will be called for each page written to.  This only applies 
> to
>   * buffered writes as unbuffered writes will not typically have pages
>   * associated with them.
>   *
> - * When page_prepare succeeds, put_folio will always be called to do any
> + * When get_folio succeeds, put_folio will always be called to do any
>   * cleanup work necessary.  put_folio is responsible for unlocking and 
> putting
>   * @folio.
>   */
>  struct iomap_page_ops {
> - struct folio *(*page_prepare)(struct iomap_iter *iter, loff_t pos,
> + struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
>   unsigned len);
>   void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
>   struct folio *folio);
> -- 
> 2.38.1
> 



Re: [Cluster-devel] [PATCH v5 8/9] iomap: Rename page_ops to folio_ops

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:18PM +0100, Andreas Gruenbacher wrote:
> The operations in struct page_ops all operate on folios, so rename
> struct page_ops to struct folio_ops.
> 
> Signed-off-by: Andreas Gruenbacher 

Yup.
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c |  4 ++--
>  fs/iomap/buffered-io.c | 12 ++--
>  fs/xfs/xfs_iomap.c |  4 ++--
>  include/linux/iomap.h  | 14 +++---
>  4 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index d3adb715ac8c..e191ecfb1fde 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -997,7 +997,7 @@ static void gfs2_iomap_put_folio(struct inode *inode, 
> loff_t pos,
>   gfs2_trans_end(sdp);
>  }
>  
> -static const struct iomap_page_ops gfs2_iomap_page_ops = {
> +static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
>   .get_folio = gfs2_iomap_get_folio,
>   .put_folio = gfs2_iomap_put_folio,
>  };
> @@ -1075,7 +1075,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, 
> loff_t pos,
>   }
>  
>   if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip))
> - iomap->page_ops = &gfs2_iomap_page_ops;
> + iomap->folio_ops = &gfs2_iomap_folio_ops;
>   return 0;
>  
>  out_trans_end:
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index df6fca11f18c..c4a7aef2a272 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -605,10 +605,10 @@ static int __iomap_write_begin(const struct iomap_iter 
> *iter, loff_t pos,
>  static void iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
>   struct folio *folio)
>  {
> - const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> + const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>  
> - if (page_ops && page_ops->put_folio) {
> - page_ops->put_folio(iter->inode, pos, ret, folio);
> + if (folio_ops && folio_ops->put_folio) {
> + folio_ops->put_folio(iter->inode, pos, ret, folio);
>   } else {
>   folio_unlock(folio);
>   folio_put(folio);
> @@ -627,7 +627,7 @@ static int iomap_write_begin_inline(const struct 
> iomap_iter *iter,
>  static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>   size_t len, struct folio **foliop)
>  {
> - const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> + const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>   const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct folio *folio;
>   int status;
> @@ -642,8 +642,8 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   if (!mapping_large_folio_support(iter->inode->i_mapping))
>   len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>  
> - if (page_ops && page_ops->get_folio)
> - folio = page_ops->get_folio(iter, pos, len);
> + if (folio_ops && folio_ops->get_folio)
> + folio = folio_ops->get_folio(iter, pos, len);
>   else
>   folio = iomap_get_folio(iter, pos);
>   if (IS_ERR(folio)) {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index d0bf99539180..5bddf31e21eb 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -98,7 +98,7 @@ xfs_get_folio(
>   return folio;
>  }
>  
> -const struct iomap_page_ops xfs_iomap_page_ops = {
> +const struct iomap_folio_ops xfs_iomap_folio_ops = {
>   .get_folio  = xfs_get_folio,
>  };
>  
> @@ -148,7 +148,7 @@ xfs_bmbt_to_iomap(
>   iomap->flags |= IOMAP_F_DIRTY;
>  
>   iomap->validity_cookie = sequence_cookie;
> - iomap->page_ops = &xfs_iomap_page_ops;
> + iomap->folio_ops = &xfs_iomap_folio_ops;
>   return 0;
>  }
>  
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6f8e3321e475..2e2be828af86 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -86,7 +86,7 @@ struct vm_fault;
>   */
>  #define IOMAP_NULL_ADDR -1ULL/* addr is not valid */
>  
> -struct iomap_page_ops;
> +struct iomap_folio_ops;
>  
>  struct iomap {
>   u64 addr; /* disk offset of mapping, bytes */
> @@ -98,7 +98,7 @@ struct iomap {
>   struct dax_device   *dax_dev; /* dax_dev for dax operations */
>   void*inline_data;
>   void*private; /* filesystem private */
> - const struct iomap_page_ops *page_ops;
> + const struct iomap_folio_ops *folio_ops;
>   u64

Re: [Cluster-devel] [PATCH v5 7/9] iomap/xfs: Eliminate the iomap_valid handler

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:17PM +0100, Andreas Gruenbacher wrote:
> Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> handler and validating the mapping there.
> 
> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/iomap/buffered-io.c | 25 +
>  fs/xfs/xfs_iomap.c | 37 ++---
>  include/linux/iomap.h  | 22 +-
>  3 files changed, 36 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4f363d42dbaf..df6fca11f18c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -630,7 +630,7 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>   const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct folio *folio;
> - int status = 0;
> + int status;
>  
>   BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> if (srcmap != &iter->iomap)
> @@ -646,27 +646,12 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   folio = page_ops->get_folio(iter, pos, len);
>   else
>   folio = iomap_get_folio(iter, pos);
> - if (IS_ERR(folio))
> - return PTR_ERR(folio);
> -
> - /*
> -  * Now we have a locked folio, before we do anything with it we need to
> -  * check that the iomap we have cached is not stale. The inode extent
> -  * mapping can change due to concurrent IO in flight (e.g.
> -  * IOMAP_UNWRITTEN state can change and memory reclaim could have
> -  * reclaimed a previously partially written page at this index after IO
> -  * completion before this write reaches this file offset) and hence we
> -  * could do the wrong thing here (zero a page range incorrectly or fail
> -  * to zero) and corrupt data.
> -  */
> - if (page_ops && page_ops->iomap_valid) {
> - bool iomap_valid = page_ops->iomap_valid(iter->inode,
> - &iter->iomap);
> - if (!iomap_valid) {
> + if (IS_ERR(folio)) {
> + if (folio == ERR_PTR(-ESTALE)) {
>   iter->iomap.flags |= IOMAP_F_STALE;
> - status = 0;
> - goto out_unlock;
> + return 0;
>   }
> + return PTR_ERR(folio);

I wonder if this should be reworked a bit to reduce indenting:

if (PTR_ERR(folio) == -ESTALE) {
iter->iomap.flags |= IOMAP_F_STALE;
return 0;
}
if (IS_ERR(folio))
return PTR_ERR(folio);

But I don't have any strong opinions about that.

>   }
>  
>   if (pos + len > folio_pos(folio) + folio_size(folio))
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 669c1bc5c3a7..d0bf99539180 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -62,29 +62,44 @@ xfs_iomap_inode_sequence(
>   return cookie | READ_ONCE(ip->i_df.if_seq);
>  }
>  
> -/*
> - * Check that the iomap passed to us is still valid for the given offset and
> - * length.
> - */
> -static bool
> -xfs_iomap_valid(
> - struct inode*inode,
> - const struct iomap  *iomap)
> +static struct folio *
> +xfs_get_folio(
> + struct iomap_iter   *iter,
> + loff_t  pos,
> + unsignedlen)
>  {
> + struct inode*inode = iter->inode;
> + struct iomap *iomap = &iter->iomap;
>   struct xfs_inode*ip = XFS_I(inode);
> + struct folio*folio;
>  
> + folio = iomap_get_folio(iter, pos);
> + if (IS_ERR(folio))
> + return folio;
> +
> + /*
> +  * Now that we have a locked folio, we need to check that the iomap we
> +  * have cached is not stale.  The inode extent mapping can change due to
> +  * concurrent IO in flight (e.g., IOMAP_UNWRITTEN state can change and
> +  * memory reclaim could have reclaimed a previously partially written
> +  * page at this index after IO completion before this write reaches
> +  * this file offset) and hence we could do the wrong thing here (zero a
> +  * page range incorrectly or fail to zero) and corrupt data.
> +  */
>   if (iomap->validity_cookie !=
>   xfs_iomap_inode_sequence(ip, iomap->flags)) {
>   trace_xfs_iomap_invalid(ip, iomap);
> - return false;
> + folio_unlock(folio);
> + folio_put(folio);
> + return ERR_PTR(-ESTALE);
>   }
>  
>   XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
> - return true;
> + return folio;
>  }
>  
>  const struct iomap_page_ops xfs_iomap_page_ops = {
> - .iomap_valid= xfs_iomap_valid,
> + .get_folio  = xfs_get_folio,
>  };
>  
>  int
> diff --git a/include/linux/iomap.h 

Re: [Cluster-devel] [PATCH v5 2/9] iomap/gfs2: Unlock and put folio in page_done handler

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:12PM +0100, Andreas Gruenbacher wrote:
> When an iomap defines a ->page_done() handler in its page_ops, delegate
> unlocking the folio and putting the folio reference to that handler.
> 
> This allows us to fix a race between journaled data writes and folio
> writeback in gfs2: before this change, gfs2_iomap_page_done() was called
> after unlocking the folio, so writeback could start writing back the
> folio's buffers before they could be marked for writing to the journal.
> Also, try_to_free_buffers() could free the buffers before
> gfs2_iomap_page_done() was done adding the buffers to the
> current transaction.  With this change, gfs2_iomap_page_done() adds the
> buffers to the current transaction while the folio is still locked, so
> the problems described above can no longer occur.
> 
> The only current user of ->page_done() is gfs2, so other filesystems are
> not affected.  To catch out any out-of-tree users, switch from a page to
> a folio in ->page_done().

I really hope there aren't any out of tree users...

> Signed-off-by: Andreas Gruenbacher 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c | 15 ---
>  fs/iomap/buffered-io.c |  8 
>  include/linux/iomap.h  |  7 ---
>  3 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index e7537fd305dd..46206286ad42 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -968,14 +968,23 @@ static int gfs2_iomap_page_prepare(struct inode *inode, 
> loff_t pos,
>  }
>  
>  static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
> -  unsigned copied, struct page *page)
> +  unsigned copied, struct folio *folio)
>  {
>   struct gfs2_trans *tr = current->journal_info;
>   struct gfs2_inode *ip = GFS2_I(inode);
>   struct gfs2_sbd *sdp = GFS2_SB(inode);
>  
> - if (page && !gfs2_is_stuffed(ip))
> - gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
> + if (!folio) {
> + gfs2_trans_end(sdp);
> + return;
> + }
> +
> + if (!gfs2_is_stuffed(ip))
> + gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
> +copied);
> +
> + folio_unlock(folio);
> + folio_put(folio);
>  
>   if (tr->tr_num_buf_new)
>   __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c30d150a9303..e13d5694e299 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -580,12 +580,12 @@ static void iomap_put_folio(struct iomap_iter *iter, 
> loff_t pos, size_t ret,
>  {
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>  
> - if (folio)
> + if (page_ops && page_ops->page_done) {
> + page_ops->page_done(iter->inode, pos, ret, folio);
> + } else if (folio) {
>   folio_unlock(folio);
> - if (page_ops && page_ops->page_done)
> - page_ops->page_done(iter->inode, pos, ret, &folio->page);
> - if (folio)
>   folio_put(folio);
> + }
>  }
>  
>  static int iomap_write_begin_inline(const struct iomap_iter *iter,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0983dfc9a203..743e2a909162 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -131,13 +131,14 @@ static inline bool iomap_inline_data_valid(const struct 
> iomap *iomap)
>   * associated with them.
>   *
>   * When page_prepare succeeds, page_done will always be called to do any
> - * cleanup work necessary.  In that page_done call, @page will be NULL if the
> - * associated page could not be obtained.
> + * cleanup work necessary.  In that page_done call, @folio will be NULL if 
> the
> + * associated folio could not be obtained.  When folio is not NULL, page_done
> + * is responsible for unlocking and putting the folio.
>   */
>  struct iomap_page_ops {
>   int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
>   void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
> - struct page *page);
> + struct folio *folio);
>  
>   /*
>* Check that the cached iomap still maps correctly to the filesystem's
> -- 
> 2.38.1
> 



Re: [Cluster-devel] [PATCH v5 5/9] iomap/gfs2: Get page in page_prepare handler

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:15PM +0100, Andreas Gruenbacher wrote:
> Change the iomap ->page_prepare() handler to get and return a locked
> folio instead of doing that in iomap_write_begin().  This allows to
> recover from out-of-memory situations in ->page_prepare(), which
> eliminates the corresponding error handling code in iomap_write_begin().
> The ->put_folio() handler now also isn't called with NULL as the folio
> value anymore.
> 
> Filesystems are expected to use the iomap_get_folio() helper for getting
> locked folios in their ->page_prepare() handlers.
> 
> Signed-off-by: Andreas Gruenbacher 

This patchset makes the page ops make a lot more sense to me now.  I
very much like the way that the new ->get_folio and ->put_folio functions
split the responsibilities for setting up the page cache write and
tearing it down.  Thank you for cleaning this up. :)

(I would still like hch or willy to take a second look at this, however.)

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c | 21 +
>  fs/iomap/buffered-io.c | 17 ++---
>  include/linux/iomap.h  |  9 +
>  3 files changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 0c041459677b..41349e09558b 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -956,15 +956,25 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t 
> pos, loff_t length,
>   goto out;
>  }
>  
> -static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
> -unsigned len)
> +static struct folio *
> +gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
>  {
> + struct inode *inode = iter->inode;
>   unsigned int blockmask = i_blocksize(inode) - 1;
>   struct gfs2_sbd *sdp = GFS2_SB(inode);
>   unsigned int blocks;
> + struct folio *folio;
> + int status;
>  
>   blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
> - return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> + status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> + if (status)
> + return ERR_PTR(status);
> +
> + folio = iomap_get_folio(iter, pos);
> + if (IS_ERR(folio))
> + gfs2_trans_end(sdp);
> + return folio;
>  }
>  
>  static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
> @@ -974,11 +984,6 @@ static void gfs2_iomap_put_folio(struct inode *inode, 
> loff_t pos,
>   struct gfs2_inode *ip = GFS2_I(inode);
>   struct gfs2_sbd *sdp = GFS2_SB(inode);
>  
> - if (!folio) {
> - gfs2_trans_end(sdp);
> - return;
> - }
> -
>   if (!gfs2_is_stuffed(ip))
> gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
>  copied);
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index b84838d2b5d8..7decd8cdc755 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -609,7 +609,7 @@ static void iomap_put_folio(struct iomap_iter *iter, 
> loff_t pos, size_t ret,
>  
>   if (page_ops && page_ops->put_folio) {
>   page_ops->put_folio(iter->inode, pos, ret, folio);
> - } else if (folio) {
> + } else {
>   folio_unlock(folio);
>   folio_put(folio);
>   }
> @@ -642,17 +642,12 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   if (!mapping_large_folio_support(iter->inode->i_mapping))
>   len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>  
> - if (page_ops && page_ops->page_prepare) {
> - status = page_ops->page_prepare(iter->inode, pos, len);
> - if (status)
> - return status;
> - }
> -
> - folio = iomap_get_folio(iter, pos);
> - if (IS_ERR(folio)) {
> - iomap_put_folio(iter, pos, 0, NULL);
> + if (page_ops && page_ops->page_prepare)
> + folio = page_ops->page_prepare(iter, pos, len);
> + else
> + folio = iomap_get_folio(iter, pos);
> + if (IS_ERR(folio))
>   return PTR_ERR(folio);
> - }
>  
>   /*
>* Now we have a locked folio, before we do anything with it we need to
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e5732cc5716b..87b5d0f8e578 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -13,6 +13,7 @@
>  struct address_space;
>  struct fiemap_extent_info;
>  struct inode;
> +struct iomap_iter;
>  struct iomap_dio;
>  struct ioma

Re: [Cluster-devel] [PATCH v5 3/9] iomap: Rename page_done handler to put_folio

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:13PM +0100, Andreas Gruenbacher wrote:
> The ->page_done() handler in struct iomap_page_ops is now somewhat
> misnamed in that it mainly deals with unlocking and putting a folio, so
> rename it to ->put_folio().
> 
> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/gfs2/bmap.c |  4 ++--
>  fs/iomap/buffered-io.c |  4 ++--
>  include/linux/iomap.h  | 10 +-
>  3 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 46206286ad42..0c041459677b 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -967,7 +967,7 @@ static int gfs2_iomap_page_prepare(struct inode *inode, 
> loff_t pos,
>   return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
>  }
>  
> -static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
> +static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
>unsigned copied, struct folio *folio)
>  {
>   struct gfs2_trans *tr = current->journal_info;
> @@ -994,7 +994,7 @@ static void gfs2_iomap_page_done(struct inode *inode, 
> loff_t pos,
>  
>  static const struct iomap_page_ops gfs2_iomap_page_ops = {
>   .page_prepare = gfs2_iomap_page_prepare,
> - .page_done = gfs2_iomap_page_done,
> + .put_folio = gfs2_iomap_put_folio,
>  };
>  
>  static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e13d5694e299..2a9bab4f3c79 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -580,8 +580,8 @@ static void iomap_put_folio(struct iomap_iter *iter, 
> loff_t pos, size_t ret,
>  {
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>  
> - if (page_ops && page_ops->page_done) {
> - page_ops->page_done(iter->inode, pos, ret, folio);
> + if (page_ops && page_ops->put_folio) {
> + page_ops->put_folio(iter->inode, pos, ret, folio);
>   } else if (folio) {
>   folio_unlock(folio);
>   folio_put(folio);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 743e2a909162..10ec36f373f4 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -126,18 +126,18 @@ static inline bool iomap_inline_data_valid(const struct 
> iomap *iomap)
>  
>  /*
>   * When a filesystem sets page_ops in an iomap mapping it returns, 
> page_prepare
> - * and page_done will be called for each page written to.  This only applies 
> to
> + * and put_folio will be called for each page written to.  This only applies 
> to

"...for each folio written to."

With that fixed,
Reviewed-by: Darrick J. Wong 

--D


>   * buffered writes as unbuffered writes will not typically have pages
>   * associated with them.
>   *
> - * When page_prepare succeeds, page_done will always be called to do any
> - * cleanup work necessary.  In that page_done call, @folio will be NULL if 
> the
> - * associated folio could not be obtained.  When folio is not NULL, page_done
> + * When page_prepare succeeds, put_folio will always be called to do any
> + * cleanup work necessary.  In that put_folio call, @folio will be NULL if 
> the
> + * associated folio could not be obtained.  When folio is not NULL, put_folio
>   * is responsible for unlocking and putting the folio.
>   */
>  struct iomap_page_ops {
>   int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
> - void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
> + void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
>   struct folio *folio);
>  
>   /*
> -- 
> 2.38.1
> 



Re: [Cluster-devel] [PATCH v5 1/9] iomap: Add iomap_put_folio helper

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:11PM +0100, Andreas Gruenbacher wrote:
> Add an iomap_put_folio() helper to encapsulate unlocking the folio,
> calling ->page_done(), and putting the folio.  Use the new helper in
> iomap_write_begin() and iomap_write_end().
> 
> This effectively doesn't change the way the code works, but prepares for
> successive improvements.
> 
> Signed-off-by: Andreas Gruenbacher 

Ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 29 +
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..c30d150a9303 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -575,6 +575,19 @@ static int __iomap_write_begin(const struct iomap_iter 
> *iter, loff_t pos,
>   return 0;
>  }
>  
> +static void iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
> + struct folio *folio)
> +{
> + const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> +
> + if (folio)
> + folio_unlock(folio);
> + if (page_ops && page_ops->page_done)
> + page_ops->page_done(iter->inode, pos, ret, &folio->page);
> + if (folio)
> + folio_put(folio);
> +}
> +
>  static int iomap_write_begin_inline(const struct iomap_iter *iter,
>   struct folio *folio)
>  {
> @@ -616,7 +629,8 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   fgp, mapping_gfp_mask(iter->inode->i_mapping));
>   if (!folio) {
>   status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> - goto out_no_page;
> + iomap_put_folio(iter, pos, 0, NULL);
> + return status;
>   }
>  
>   /*
> @@ -656,13 +670,9 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   return 0;
>  
>  out_unlock:
> - folio_unlock(folio);
> - folio_put(folio);
> + iomap_put_folio(iter, pos, 0, folio);
>   iomap_write_failed(iter->inode, pos, len);
>  
> -out_no_page:
> - if (page_ops && page_ops->page_done)
> - page_ops->page_done(iter->inode, pos, 0, NULL);
>   return status;
>  }
>  
> @@ -712,7 +722,6 @@ static size_t iomap_write_end_inline(const struct 
> iomap_iter *iter,
>  static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t 
> len,
>   size_t copied, struct folio *folio)
>  {
> - const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>   const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   loff_t old_size = iter->inode->i_size;
>   size_t ret;
> @@ -735,14 +744,10 @@ static size_t iomap_write_end(struct iomap_iter *iter, 
> loff_t pos, size_t len,
>   i_size_write(iter->inode, pos + ret);
>   iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
>   }
> - folio_unlock(folio);
> + iomap_put_folio(iter, pos, ret, folio);
>  
>   if (old_size < pos)
>   pagecache_isize_extended(iter->inode, old_size, pos);
> - if (page_ops && page_ops->page_done)
> - page_ops->page_done(iter->inode, pos, ret, &folio->page);
> - folio_put(folio);
> -
>   if (ret < len)
>   iomap_write_failed(iter->inode, pos + ret, len - ret);
>   return ret;
> -- 
> 2.38.1
> 



Re: [Cluster-devel] [PATCH v5 4/9] iomap: Add iomap_get_folio helper

2023-01-04 Thread Darrick J. Wong
On Sat, Dec 31, 2022 at 04:09:14PM +0100, Andreas Gruenbacher wrote:
> Add an iomap_get_folio() helper that gets a folio reference based on
> an iomap iterator and an offset into the address space.  Use it in
> iomap_write_begin().
> 
> Signed-off-by: Andreas Gruenbacher 

Pretty straightforward,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 39 ++-
>  include/linux/iomap.h  |  1 +
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2a9bab4f3c79..b84838d2b5d8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, 
> size_t from, size_t count)
>  }
>  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  
> +/**
> + * iomap_get_folio - get a folio reference for writing
> + * @iter: iteration structure
> + * @pos: start offset of write
> + *
> + * Returns a locked reference to the folio at @pos, or an error pointer if 
> the
> + * folio could not be obtained.
> + */
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +{
> + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> + struct folio *folio;
> +
> + if (iter->flags & IOMAP_NOWAIT)
> + fgp |= FGP_NOWAIT;
> +
> + folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> + fgp, mapping_gfp_mask(iter->inode->i_mapping));
> + if (folio)
> + return folio;
> +
> + if (iter->flags & IOMAP_NOWAIT)
> + return ERR_PTR(-EAGAIN);
> + return ERR_PTR(-ENOMEM);
> +}
> +EXPORT_SYMBOL_GPL(iomap_get_folio);
> +
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>  {
>   trace_iomap_release_folio(folio->mapping->host, folio_pos(folio),
> @@ -603,12 +630,8 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>   const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct folio *folio;
> - unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
>   int status = 0;
>  
> - if (iter->flags & IOMAP_NOWAIT)
> - fgp |= FGP_NOWAIT;
> -
>   BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
>   if (srcmap != &iter->iomap)
>   BUG_ON(pos + len > srcmap->offset + srcmap->length);
> @@ -625,12 +648,10 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>   return status;
>   }
>  
> - folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> - fgp, mapping_gfp_mask(iter->inode->i_mapping));
> - if (!folio) {
> - status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> + folio = iomap_get_folio(iter, pos);
> + if (IS_ERR(folio)) {
>   iomap_put_folio(iter, pos, 0, NULL);
> - return status;
> + return PTR_ERR(folio);
>   }
>  
>   /*
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 10ec36f373f4..e5732cc5716b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -261,6 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
>  int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
>  void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.38.1
> 



[Cluster-devel] [ANNOUNCE] xfs-linux: for-next updated (with iomap changes) to 7dd73802f97d

2022-11-29 Thread Darrick J. Wong
Hi folks,

The for-next branch of the xfs-linux repository at:

git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git

has just been updated.

Patches often get missed, so please check if your outstanding patches
were in this update. If they have not been in this update, please
resubmit them to linux-...@vger.kernel.org so they can be picked up in
the next update.

**NOTE** I've merged Dave's fixes for the iomap write race corruption
problem.  The fixes require changes to fs/iomap/, hence the broader than
usual cc list.  There will be at least one more push with all the other
bug fixes pending on the list (realistically, at least two) but I wanted
to get this big piece out /early/ for advance testing by the robots.
This is extremely late in the cycle, but we gotta resolve these
problems.

I also found yet another corruption problem last night involving xfs/179
and a dax+reflink filesystem, so those will be getting written up and
sent out shortly.  Thank you all who have been reviewing the long stream
of bug fixes. :)

The new head of the for-next branch is commit:

7dd73802f97d Merge tag 'xfs-iomap-stale-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB

44 new commits:

Darrick J. Wong (32):
  [9a48b4a6fd51] xfs: fully initialize xfs_da_args in xchk_directory_blocks
  [be1317fdb8d4] xfs: don't track the AGFL buffer in the scrub AG context
  [3e59c0103e66] xfs: log the AGI/AGF buffers when rolling transactions 
during an AG repair
  [48ff40458f87] xfs: standardize GFP flags usage in online scrub
  [b255fab0f80c] xfs: make AGFL repair function avoid crosslinked blocks
  [a7a0f9a5503f] xfs: return EINTR when a fatal signal terminates scrub
  [0a713bd41ea2] xfs: fix return code when fatal signal encountered during 
dquot scrub
  [fcd2a43488d5] xfs: initialize the check_owner object fully
  [6bf2f8791597] xfs: don't retry repairs harder when EAGAIN is returned
  [306195f355bb] xfs: pivot online scrub away from kmem.[ch]
  [9e13975bb062] xfs: load rtbitmap and rtsummary extent mapping btrees at 
mount time
  [11f97e684583] xfs: skip fscounters comparisons when the scan is 
incomplete
  [93b0c58ed04b] xfs: don't return -EFSCORRUPTED from repair when resources 
cannot be grabbed
  [5f369dc5b4eb] xfs: make rtbitmap ILOCKing consistent when scanning the 
rt bitmap file
  [e74331d6fa2c] xfs: online checking of the free rt extent count
  [033985b6fe87] xfs: fix perag loop in xchk_bmap_check_rmaps
  [6a5777865eeb] xfs: teach scrub to check for adjacent bmaps when rmap 
larger than bmap
  [830ffa09fb13] xfs: block map scrub should handle incore delalloc 
reservations
  [f23c40443d1c] xfs: check quota files for unwritten extents
  [31785537010a] xfs: check that CoW fork extents are not shared
  [5eef46358fae] xfs: teach scrub to flag non-extents format cow forks
  [bd5ab5f98741] xfs: don't warn about files that are exactly s_maxbytes 
long
  [f36b954a1f1b] xfs: check inode core when scrubbing metadata files
  [823ca26a8f07] Merge tag 'scrub-fix-ag-header-handling-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [af1077fa87c3] Merge tag 'scrub-cleanup-malloc-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [3d8426b13bac] Merge tag 'scrub-fix-return-value-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [b76f593b33aa] Merge tag 'scrub-fix-rtmeta-ilocking-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [7aab8a05e7c7] Merge tag 'scrub-fscounters-enhancements-6.2_2022-11-16' 
of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [cc5f38fa12fc] Merge tag 'scrub-bmap-enhancements-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [7b082b5e8afa] Merge tag 
'scrub-check-metadata-inode-records-6.2_2022-11-16' of 
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into 
xfs-6.2-mergeA
  [2653d53345bd] xfs: fix incorrect error-out in xfs_remove
  [7dd73802f97d] Merge tag 'xfs-iomap-stale-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB

Dave Chinner (9):
  [118e021b4b66] xfs: write page faults in iomap are not buffered writes
  [198dd8aedee6] xfs: punching delalloc extents on write failure is racy
  [b71f889c18ad] xfs: use byte ranges for write cleanup ranges
  [9c7babf94a0d] xfs,iomap: move delalloc punching to iomap
  [f43dc4dc3eff] iomap: buffered write failure should not truncate the page 
cache
  [7348b322332d] xfs: xfs_bmap_punch_delalloc_range() should take a byte 
range
  [d7b64041164c] iomap: write iomap validity checks
  [304a68b9c63b] xfs: use

Re: [Cluster-devel] [PATCH] filelock: move file locking definitions to separate header file

2022-11-21 Thread Darrick J. Wong
On Mon, Nov 21, 2022 at 12:16:16PM -0500, Jeff Layton wrote:
> On Mon, 2022-11-21 at 08:53 -0800, Darrick J. Wong wrote:
> > On Sun, Nov 20, 2022 at 03:59:57PM -0500, Jeff Layton wrote:
> > > The file locking definitions have lived in fs.h since the dawn of time,
> > > but they are only used by a small subset of the source files that
> > > include it.
> > > 
> > > Move the file locking definitions to a new header file, and add the
> > > appropriate #include directives to the source files that need them. By
> > > doing this we trim down fs.h a bit and limit the amount of rebuilding
> > > that has to be done when we make changes to the file locking APIs.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/9p/vfs_file.c  |   1 +
> > >  fs/afs/internal.h |   1 +
> > >  fs/attr.c |   1 +
> > >  fs/ceph/locks.c   |   1 +
> > >  fs/cifs/cifsfs.c  |   1 +
> > >  fs/cifs/cifsglob.h|   1 +
> > >  fs/cifs/cifssmb.c |   1 +
> > >  fs/cifs/file.c|   1 +
> > >  fs/cifs/smb2file.c|   1 +
> > >  fs/dlm/plock.c|   1 +
> > >  fs/fcntl.c|   1 +
> > >  fs/file_table.c   |   1 +
> > >  fs/fuse/file.c|   1 +
> > >  fs/gfs2/file.c|   1 +
> > >  fs/inode.c|   1 +
> > >  fs/ksmbd/smb2pdu.c|   1 +
> > >  fs/ksmbd/vfs.c|   1 +
> > >  fs/ksmbd/vfs_cache.c  |   1 +
> > >  fs/lockd/clntproc.c   |   1 +
> > >  fs/lockd/netns.h  |   1 +
> > >  fs/locks.c|   1 +
> > >  fs/namei.c|   1 +
> > >  fs/nfs/nfs4_fs.h  |   1 +
> > >  fs/nfs_common/grace.c |   1 +
> > >  fs/nfsd/netns.h   |   1 +
> > >  fs/ocfs2/locks.c  |   1 +
> > >  fs/ocfs2/stack_user.c |   1 +
> > >  fs/open.c |   1 +
> > >  fs/orangefs/file.c|   1 +
> > >  fs/proc/fd.c  |   1 +
> > >  fs/utimes.c   |   1 +
> > >  fs/xattr.c|   1 +
> > >  fs/xfs/xfs_buf.h  |   1 +
> > 
> > What part of the xfs buffer cache requires the file locking APIs?
> > 
> XFS mostly needs the layout recall APIs. Several xfs files seem to pull
> in fs.h currently by including xfs_buf.h. I started adding filelock.h to
> several *.c files, but quite a few needed it.
> 
> I can go back and add them to the xfs *.c files if you prefer.

Hmm, was it a mechanical change?  I was wondering why you didn't
pick xfs_linux.h, since that's where we usually dump header #includes
for the rest of the kernel...

$ git grep \/fs.h fs/xfs/
fs/xfs/xfs_buf.h:13:#include <linux/fs.h>

$ git grep xfs_buf.h fs/xfs/xfs_linux.h
fs/xfs/xfs_linux.h:80:#include "xfs_buf.h"

Would you mind #include'ing fs.h and filelock.h in xfs_linux.h instead?
I think/hope you'll find that the changes to xfs_file.c and xfs_inode.c
become unnecessary after that.

--D

> > 
> > >  fs/xfs/xfs_file.c |   1 +
> > >  fs/xfs/xfs_inode.c|   1 +
> > >  include/linux/filelock.h  | 428 ++
> > >  include/linux/fs.h| 421 -
> > >  include/linux/lockd/xdr.h |   1 +
> > >  38 files changed, 464 insertions(+), 421 deletions(-)
> > >  create mode 100644 include/linux/filelock.h
> > > 
> > > Unless anyone has objections, I'll plan to merge this in via the file
> > > locking tree for v6.3. I'd appreciate Acked-bys or Reviewed-bys from
> > > maintainers, however.
> > > 
> > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> > > index aec43ba83799..5e3c4b5198a6 100644
> > > --- a/fs/9p/vfs_file.c
> > > +++ b/fs/9p/vfs_file.c
> > > @@ -9,6 +9,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include <linux/filelock.h>
> > >  #include 
> > >  #include 
> > >  #include 
> > > diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> > > index 723d162078a3..c41a82a08f8b 100644
> > > --- a/fs/afs/internal.h
> > > +++ b/fs/afs/internal.h
> > > @@ -9,6 +9,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include <linux/filelock.h>
> > >  #include 
> > >  #include 
> > >  #include 
> > > diff --git a/fs/attr.c b/fs/attr.c
> > > index 1552a5f23d6b..e643f17a5465 100644
> > > --- a

Re: [Cluster-devel] [PATCH] filelock: move file locking definitions to separate header file

2022-11-21 Thread Darrick J. Wong
On Sun, Nov 20, 2022 at 03:59:57PM -0500, Jeff Layton wrote:
> The file locking definitions have lived in fs.h since the dawn of time,
> but they are only used by a small subset of the source files that
> include it.
> 
> Move the file locking definitions to a new header file, and add the
> appropriate #include directives to the source files that need them. By
> doing this we trim down fs.h a bit and limit the amount of rebuilding
> that has to be done when we make changes to the file locking APIs.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/9p/vfs_file.c  |   1 +
>  fs/afs/internal.h |   1 +
>  fs/attr.c |   1 +
>  fs/ceph/locks.c   |   1 +
>  fs/cifs/cifsfs.c  |   1 +
>  fs/cifs/cifsglob.h|   1 +
>  fs/cifs/cifssmb.c |   1 +
>  fs/cifs/file.c|   1 +
>  fs/cifs/smb2file.c|   1 +
>  fs/dlm/plock.c|   1 +
>  fs/fcntl.c|   1 +
>  fs/file_table.c   |   1 +
>  fs/fuse/file.c|   1 +
>  fs/gfs2/file.c|   1 +
>  fs/inode.c|   1 +
>  fs/ksmbd/smb2pdu.c|   1 +
>  fs/ksmbd/vfs.c|   1 +
>  fs/ksmbd/vfs_cache.c  |   1 +
>  fs/lockd/clntproc.c   |   1 +
>  fs/lockd/netns.h  |   1 +
>  fs/locks.c|   1 +
>  fs/namei.c|   1 +
>  fs/nfs/nfs4_fs.h  |   1 +
>  fs/nfs_common/grace.c |   1 +
>  fs/nfsd/netns.h   |   1 +
>  fs/ocfs2/locks.c  |   1 +
>  fs/ocfs2/stack_user.c |   1 +
>  fs/open.c |   1 +
>  fs/orangefs/file.c|   1 +
>  fs/proc/fd.c  |   1 +
>  fs/utimes.c   |   1 +
>  fs/xattr.c|   1 +
>  fs/xfs/xfs_buf.h  |   1 +

What part of the xfs buffer cache requires the file locking APIs?

--D

>  fs/xfs/xfs_file.c |   1 +
>  fs/xfs/xfs_inode.c|   1 +
>  include/linux/filelock.h  | 428 ++
>  include/linux/fs.h| 421 -
>  include/linux/lockd/xdr.h |   1 +
>  38 files changed, 464 insertions(+), 421 deletions(-)
>  create mode 100644 include/linux/filelock.h
> 
> Unless anyone has objections, I'll plan to merge this in via the file
> locking tree for v6.3. I'd appreciate Acked-bys or Reviewed-bys from
> maintainers, however.
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index aec43ba83799..5e3c4b5198a6 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> index 723d162078a3..c41a82a08f8b 100644
> --- a/fs/afs/internal.h
> +++ b/fs/afs/internal.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/attr.c b/fs/attr.c
> index 1552a5f23d6b..e643f17a5465 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index f3b461c708a8..476f25bba263 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -7,6 +7,7 @@
>  
>  #include "super.h"
>  #include "mds_client.h"
> +#include <linux/filelock.h>
>  #include 
>  
>  static u64 lock_secret;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index fe220686bba4..8d255916b6bf 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -12,6 +12,7 @@
>  
>  #include 
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> index 1420acf987f0..1b9fee67a25e 100644
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include "../smbfs_common/smb2pdu.h"
>  #include "smb2pdu.h"
> +#include <linux/filelock.h>
>  
>  #define SMB_PATH_MAX 260
>  #define CIFS_PORT 445
> diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
> index 1724066c1536..0410658c00bd 100644
> --- a/fs/cifs/cifssmb.c
> +++ b/fs/cifs/cifssmb.c
> @@ -15,6 +15,7 @@
>   /* want to reuse a stale file handle and only the caller knows the file 
> info */
>  
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 6c1431979495..c230e86f1e09 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -9,6 +9,7 @@
>   *
>   */
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/cifs/smb2file.c b/fs/cifs/smb2file.c
> index ffbd9a99fc12..1f421bfbe797 100644
> --- a/fs/cifs/smb2file.c
> +++ b/fs/cifs/smb2file.c
> @@ -7,6 +7,7 @@
>   *
>   */
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 737f185aad8d..ed4357e62f35 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -4,6 +4,7 @@
>   */
>  
>  #include 
> +#include <linux/filelock.h>
>  #include 
>  #include 
>  #include 
> diff --git a/fs/fcntl.c 

Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-03 Thread Darrick J. Wong
On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote:
> On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote:
> > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > > > Converted function to use folios throughout. This is in preparation for
> > > > the removal of find_get_pages_range_tag().
> > > > 
> > > > Signed-off-by: Vishal Moola (Oracle) 
> > > > ---
> > > >  mm/page-writeback.c | 44 +++-
> > > >  1 file changed, 23 insertions(+), 21 deletions(-)
> > > > 
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index 032a7bf8d259..087165357a5a 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping,
> > > > int ret = 0;
> > > > int done = 0;
> > > > int error;
> > > > -   struct pagevec pvec;
> > > > -   int nr_pages;
> > > > +   struct folio_batch fbatch;
> > > > +   int nr_folios;
> > > > pgoff_t index;
> > > > pgoff_t end;	/* Inclusive */
> > > > pgoff_t done_index;
> > > > int range_whole = 0;
> > > > xa_mark_t tag;
> > > >  
> > > > -   pagevec_init(&pvec);
> > > > +   folio_batch_init(&fbatch);
> > > > if (wbc->range_cyclic) {
> > > > index = mapping->writeback_index; /* prev offset */
> > > > end = -1;
> > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping,
> > > > while (!done && (index <= end)) {
> > > > int i;
> > > >  
> > > > -   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> > > > -   tag);
> > > > -   if (nr_pages == 0)
> > > > +   nr_folios = filemap_get_folios_tag(mapping, &index, end,
> > > > +   tag, &fbatch);
> > > 
> > > This can find and return dirty multi-page folios if the filesystem
> > > enables them in the mapping at instantiation time, right?
> > 
> > Yup, it will.
> > 
> > > > +
> > > > +   if (nr_folios == 0)
> > > > break;
> > > >  
> > > > -   for (i = 0; i < nr_pages; i++) {
> > > > -   struct page *page = pvec.pages[i];
> > > > +   for (i = 0; i < nr_folios; i++) {
> > > > +   struct folio *folio = fbatch.folios[i];
> > > >  
> > > > -   done_index = page->index;
> > > > +   done_index = folio->index;
> > > >  
> > > > -   lock_page(page);
> > > > +   folio_lock(folio);
> > > >  
> > > > /*
> > > >  * Page truncated or invalidated. We can freely skip it
> > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping,
> > > >  * even if there is now a new, dirty page at the same
> > > >  * pagecache address.
> > > >  */
> > > > -   if (unlikely(page->mapping != mapping)) {
> > > > +   if (unlikely(folio->mapping != mapping)) {
> > > >  continue_unlock:
> > > > -   unlock_page(page);
> > > > +   folio_unlock(folio);
> > > > continue;
> > > > }
> > > >  
> > > > -   if (!PageDirty(page)) {
> > > > +   if (!folio_test_dirty(folio)) {
> > > > /* someone wrote it for us */
> > > > goto continue_unlock;
> > > > }
> > > >  
> > > > -   if (PageWriteback(page)) {
> > > > +   if (folio_test_writeback(folio)) {
> > > > if (wbc->sync_mode != WB_SYNC_NONE)
> > > > -   wait_on_page_writeback(page);
> > > > +   folio_wait_writeback(folio);
> > > > else
> > > > goto continue_unlock;
> > > > }
> > > >  
> > > > -   BUG_ON(PageWriteback(page));
> > > > -   if (!clear_page_dirty_for_io(page))
> > > > +   BUG_ON(folio_test_writeback(folio));
> > > > +   if (!folio_clear_dirty_for_io(folio))
> > > > goto continue_unlock;
> > > >  
> > > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > > > -   error = (*writepage)(page, wbc, data);
> > > > +   error = writepage(&folio->page, wbc, data);
> > > 
> > > 

[Cluster-devel] [GIT PULL] iomap: new code for 5.20/6.0, part 2

2022-08-11 Thread Darrick J. Wong
Hi Linus,

Please pull this second branch containing new code for iomap for
5.20^W6.0.  In the past 10 days or so I've not heard any ZOMG STOP style
complaints about removing ->writepage support from gfs2 or zonefs, so
here's the pull request removing them (and the underlying fs iomap
support) from the kernel.

As usual, I did a test-merge with upstream master as of a few minutes
ago, and didn't see any conflicts.  Please let me know if you encounter
any problems.

--D

The following changes since commit f8189d5d5fbf082786fb91c549f5127f23daec09:

  dax: set did_zero to true when zeroing successfully (2022-06-30 10:05:11 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/iomap-6.0-merge-2

for you to fetch changes up to 478af190cb6c501efaa8de2b9c9418ece2e4d0bd:

  iomap: remove iomap_writepage (2022-07-22 10:59:17 -0700)


New code for 6.0:
 - Remove iomap_writepage and all callers, since the mm apparently never
   called the zonefs or gfs2 writepage functions.

Signed-off-by: Darrick J. Wong 


Christoph Hellwig (4):
  gfs2: stop using generic_writepages in gfs2_ail1_start_one
  gfs2: remove ->writepage
  zonefs: remove ->writepage
  iomap: remove iomap_writepage

 fs/gfs2/aops.c | 26 --
 fs/gfs2/log.c  |  5 ++---
 fs/iomap/buffered-io.c | 15 ---
 fs/zonefs/super.c  |  8 
 include/linux/iomap.h  |  3 ---
 5 files changed, 2 insertions(+), 55 deletions(-)



Re: [Cluster-devel] [PATCH 4/4] iomap: remove iomap_writepage

2022-07-22 Thread Darrick J. Wong
On Tue, Jul 19, 2022 at 06:13:11AM +0200, Christoph Hellwig wrote:
> Unused now.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Damien Le Moal 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 15 ---
>  include/linux/iomap.h  |  3 ---
>  2 files changed, 18 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d2a9f699e17ed..1bac8bda40d0c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1518,21 +1518,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>   return 0;
>  }
>  
> -int
> -iomap_writepage(struct page *page, struct writeback_control *wbc,
> - struct iomap_writepage_ctx *wpc,
> - const struct iomap_writeback_ops *ops)
> -{
> - int ret;
> -
> - wpc->ops = ops;
> - ret = iomap_do_writepage(page, wbc, wpc);
> - if (!wpc->ioend)
> - return ret;
> - return iomap_submit_ioend(wpc, wpc->ioend, ret);
> -}
> -EXPORT_SYMBOL_GPL(iomap_writepage);
> -
>  int
>  iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
>   struct iomap_writepage_ctx *wpc,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e552097c67e0b..911888560d3eb 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -303,9 +303,6 @@ void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
>  void iomap_ioend_try_merge(struct iomap_ioend *ioend,
>   struct list_head *more_ioends);
>  void iomap_sort_ioends(struct list_head *ioend_list);
> -int iomap_writepage(struct page *page, struct writeback_control *wbc,
> - struct iomap_writepage_ctx *wpc,
> - const struct iomap_writeback_ops *ops);
>  int iomap_writepages(struct address_space *mapping,
>   struct writeback_control *wbc, struct iomap_writepage_ctx *wpc,
>   const struct iomap_writeback_ops *ops);
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH v2 11/19] mm/migrate: Add filemap_migrate_folio()

2022-06-08 Thread Darrick J. Wong
On Wed, Jun 08, 2022 at 04:02:41PM +0100, Matthew Wilcox (Oracle) wrote:
> There is nothing iomap-specific about iomap_migratepage(), and it fits
> a pattern used by several other filesystems, so move it to mm/migrate.c,
> convert it to be filemap_migrate_folio() and convert the iomap filesystems
> to use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/aops.c  |  2 +-
>  fs/iomap/buffered-io.c  | 25 -
>  fs/xfs/xfs_aops.c   |  2 +-
>  fs/zonefs/super.c   |  2 +-
>  include/linux/iomap.h   |  6 --
>  include/linux/pagemap.h |  6 ++
>  mm/migrate.c| 20 
>  7 files changed, 29 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 106e90a36583..57ff883d432c 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -774,7 +774,7 @@ static const struct address_space_operations gfs2_aops = {
>   .invalidate_folio = iomap_invalidate_folio,
>   .bmap = gfs2_bmap,
>   .direct_IO = noop_direct_IO,
> - .migratepage = iomap_migrate_page,
> + .migrate_folio = filemap_migrate_folio,
>   .is_partially_uptodate = iomap_is_partially_uptodate,
>   .error_remove_page = generic_error_remove_page,
>  };
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 66278a14bfa7..5a91aa1db945 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -489,31 +489,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>  
> -#ifdef CONFIG_MIGRATION
> -int
> -iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> - struct page *page, enum migrate_mode mode)
> -{
> - struct folio *folio = page_folio(page);
> - struct folio *newfolio = page_folio(newpage);
> - int ret;
> -
> - ret = folio_migrate_mapping(mapping, newfolio, folio, 0);
> - if (ret != MIGRATEPAGE_SUCCESS)
> - return ret;
> -
> - if (folio_test_private(folio))
> - folio_attach_private(newfolio, folio_detach_private(folio));
> -
> - if (mode != MIGRATE_SYNC_NO_COPY)
> - folio_migrate_copy(newfolio, folio);
> - else
> - folio_migrate_flags(newfolio, folio);
> - return MIGRATEPAGE_SUCCESS;
> -}
> -EXPORT_SYMBOL_GPL(iomap_migrate_page);
> -#endif /* CONFIG_MIGRATION */
> -
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8ec38b25187b..5d1a995b15f8 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -570,7 +570,7 @@ const struct address_space_operations xfs_address_space_operations = {
>   .invalidate_folio   = iomap_invalidate_folio,
>   .bmap   = xfs_vm_bmap,
>   .direct_IO  = noop_direct_IO,
> - .migratepage= iomap_migrate_page,
> + .migrate_folio  = filemap_migrate_folio,
>   .is_partially_uptodate  = iomap_is_partially_uptodate,
>   .error_remove_page  = generic_error_remove_page,
>   .swap_activate  = xfs_iomap_swapfile_activate,
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bcb21aea990a..d4c3f28f34ee 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -237,7 +237,7 @@ static const struct address_space_operations zonefs_file_aops = {
>   .dirty_folio= filemap_dirty_folio,
>   .release_folio  = iomap_release_folio,
>   .invalidate_folio   = iomap_invalidate_folio,
> - .migratepage= iomap_migrate_page,
> + .migrate_folio  = filemap_migrate_folio,
>   .is_partially_uptodate  = iomap_is_partially_uptodate,
>   .error_remove_page  = generic_error_remove_page,
>   .direct_IO  = noop_direct_IO,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e552097c67e0..758a1125e72f 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -231,12 +231,6 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> -#ifdef CONFIG_MIGRATION
> -int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> - struct page *page, enum migrate_mode mode);
> -#else
> -#define iomap_migrate_p

Re: [Cluster-devel] [GIT PULL] fs/iomap: Fix buffered write page prefaulting

2022-03-25 Thread Darrick J. Wong
On Sat, Mar 26, 2022 at 01:22:17AM +0100, Andreas Gruenbacher wrote:
> On Sat, Mar 26, 2022 at 1:03 AM Darrick J. Wong  wrote:
> > On Fri, Mar 25, 2022 at 03:37:01PM +0100, Andreas Gruenbacher wrote:
> > > Hello Linus,
> > >
> > > please consider pulling the following fix, which I've forgotten to send
> > > in the previous merge window.  I've only improved the patch description
> > > since.
> > >
> > > Thank you very much,
> > > Andreas
> > >
> > > The following changes since commit 
> > > 42eb8fdac2fc5d62392dcfcf0253753e821a97b0:
> > >
> > >   Merge tag 'gfs2-v5.16-rc2-fixes' of 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2021-11-17 
> > > 15:55:07 -0800)
> > >
> > > are available in the Git repository at:
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
> > > tags/write-page-prefaulting
> > >
> > > for you to fetch changes up to 631f871f071746789e9242e514ab0f49067fa97a:
> > >
> > >   fs/iomap: Fix buffered write page prefaulting (2022-03-25 15:14:03 
> > > +0100)
> >
> > When was this sent to fsdevel for public consideration?  The last time I
> > saw any patches related to prefaulting in iomap was November.
> 
> On November 23, exact same patch:
> 
> https://lore.kernel.org/linux-fsdevel/20211123151812.361624-1-agrue...@redhat.com/

Ah, ok, so I just missed it then.  Sorry about that, I seem to suck as
maintainer more and more by the day. :( :(

(Hey, at least you got the other maintainer to RVB it...)

--D

> Andreas
> 



Re: [Cluster-devel] [GIT PULL] fs/iomap: Fix buffered write page prefaulting

2022-03-25 Thread Darrick J. Wong
On Fri, Mar 25, 2022 at 03:37:01PM +0100, Andreas Gruenbacher wrote:
> Hello Linus,
> 
> please consider pulling the following fix, which I've forgotten to send
> in the previous merge window.  I've only improved the patch description
> since.
> 
> Thank you very much,
> Andreas
> 
> The following changes since commit 42eb8fdac2fc5d62392dcfcf0253753e821a97b0:
> 
>   Merge tag 'gfs2-v5.16-rc2-fixes' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2021-11-17 
> 15:55:07 -0800)
> 
> are available in the Git repository at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
> tags/write-page-prefaulting
> 
> for you to fetch changes up to 631f871f071746789e9242e514ab0f49067fa97a:
> 
>   fs/iomap: Fix buffered write page prefaulting (2022-03-25 15:14:03 +0100)

When was this sent to fsdevel for public consideration?  The last time I
saw any patches related to prefaulting in iomap was November.

--D

> 
> 
> Fix buffered write page prefaulting
> 
> 
> Andreas Gruenbacher (1):
>   fs/iomap: Fix buffered write page prefaulting
> 
>  fs/iomap/buffered-io.c | 2 +-
>  mm/filemap.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 



Re: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL

2022-03-01 Thread Darrick J. Wong
On Mon, Feb 28, 2022 at 02:32:03PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Some users recently reported that MariaDB was getting a read corruption
> when using io_uring on top of btrfs. This started to happen in 5.16,
> after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> during direct IO reads and writes"). That changed btrfs to use the new
> iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> corresponds to a memory mapped file region. That type of scenario is
> exercised by test case generic/647 from fstests, and it also affected
> gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> 
> For this MariaDB scenario, we attempt to read 16K from file offset X
> using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> with a size of 4K, and what happens is the following:
> 
> 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> 
> 2) iomap creates a struct iomap_dio object, its reference count is
>initialized to 1 and its ->size field is initialized to 0;
> 
> 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
>first 4K extent, and sets up an iomap for this extent consisting of
>a single page;
> 
> 4) At iomap_dio_bio_iter(), we are able to access the first page of the
>buffer (struct iov_iter) with bio_iov_iter_get_pages() without
>triggering a page fault;
> 
> 5) iomap submits a bio for this 4K extent
>(iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
>the refcount on the struct iomap_dio object to 2; The ->size field
>of the struct iomap_dio object is incremented to 4K;
> 
> 6) iomap calls btrfs_iomap_begin() again, this time with a file
>offset of X + 4K. There we set up an iomap for the next extent
>that also has a size of 4K;
> 
> 7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
>which tries to access the next page (2nd page) of the buffer.
>This triggers a page fault and returns -EFAULT;
> 
> 8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
>to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
>the struct iomap_dio object has a ->size value of 4K (we submitted
>a bio for an extent already). The 'wait_for_completion' variable
>is not set to true, because our iocb has IOCB_NOWAIT set;
> 
> 9) At the bottom of __iomap_dio_rw(), we decrement the reference count
>of the struct iomap_dio object from 2 to 1. Because we were not
>the only ones holding a reference on it and 'wait_for_completion' is
>set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
>just returns it up the callchain, up to io_uring;
> 
> 10) The bio submitted for the first extent (step 5) completes and its
> bio endio function, iomap_dio_bio_end_io(), decrements the last
> reference on the struct iomap_dio object, resulting in calling
> iomap_dio_complete_work() -> iomap_dio_complete().
> 
> 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> and return 4K (the amount of io done) to iomap_dio_complete_work();
> 
> 12) iomap_dio_complete_work() calls the iocb completion callback,
> iocb->ki_complete() with a second argument value of 4K (total io
> done) and the iocb with the adjusted ki_pos of X + 4K. This results
> in completing the read request for io_uring, leaving it with a
> result of 4K bytes read, and only the first page of the buffer
> filled in, while the remaining 3 pages, corresponding to the other
> 3 extents, were not filled;

Just checking my understanding of the problem here -- the first page is
filled, the directio read returns that it read $pagesize bytes, and the
remaining 3 pages in the reader's buffer are untouched?  And it's not
the case that there's some IO running in the background that will
eventually fill those three pages, nor is it the case that iouring will
say that the read completed 4k before the contents actually reach the
page?

> 13) For the application, the result is unexpected because if we ask
> to read N bytes, it expects to get N bytes read as long as those
> N bytes don't cross the EOF (i_size).

Hmm.  Is the problem here that directio readers expect that a read
request for N bytes either (reads N bytes and returns N) or (returns the
usual negative errno), and nobody's expecting a short direct read?

> So fix this by making __iomap_dio_rw() assign true to the boolean variable
> 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we have made some
> progress for a read, and we have not crossed the EOF boundary. Do this even
> if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> an unexpected result to an application. This results in returning a
> positive value to the iomap caller, which tells it to fault in the
> remaining pages associated to the buffer (struct 

Re: [Cluster-devel] [PATCH 1/1] Revert "iomap: fall back to buffered writes for invalidation failures"

2022-02-09 Thread Darrick J. Wong
On Wed, Feb 09, 2022 at 08:52:43AM +, Lee Jones wrote:
> This reverts commit 60263d5889e6dc5987dc51b801be4955ff2e4aa7.
> 
> Reverting since this commit opens a potential avenue for abuse.

What kind of abuse?  Did you conclude there's an avenue solely because
some combination of userspace rigging produced a BUG warning?  Or is
this a real problem that someone found?

> The C-reproducer and more information can be found at the link below.

No.  Post the information and your analysis here.  I'm not going to dig
into some Google site to find out what happened, and I'm not going to
assume that future developers will be able to access that URL to learn
why this patch was created.

I am not convinced that this revert is a proper fix for ... whatever is
the problem, because you never explained what happened.

> With this patch applied, I can no longer get the repro to trigger.

With ext4 completely reverted, I can no longer get the repro to trigger
either.

>   kernel BUG at fs/ext4/inode.c:2647!
>   invalid opcode:  [#1] PREEMPT SMP KASAN
>   CPU: 0 PID: 459 Comm: syz-executor359 Tainted: GW 
> 5.10.93-syzkaller-01028-g0347b1658399 #0

What BUG on fs/ext4/inode.c:2647?

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/inode.c?h=v5.10.93#n2647

All I see is a call to pagevec_release()?  There's a BUG_ON further up
if we wait for page writeback but then it still has Writeback set.  But
I don't see anything in pagevec_release that would actually result in a
BUG warning.

Oh, right, this is one of those inscrutable syzkaller things, where a
person can spend hours figuring out what the hell it set up.

Yeah...no, you don't get to submit patches to core kernel code, claim
it's not your responsibility to know anything about a subsystem that you
want to patch, and then expect us to do the work for you.  If you pick
up a syzkaller report, you get to figure out what broke, why, and how to
fix it in a reasonable manner.

You're a maintainer, would you accept a patch like this?

NAK.

--D

OH WAIT, you're running this on the Android 5.10 kernel, aren't you?
The BUG report came from page_buffers failing to find any buffer heads
attached to the page.
https://android.googlesource.com/kernel/common/+/refs/heads/android12-5.10-2022-02/fs/ext4/inode.c#2647

Yeah, don't care.

>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
>   RIP: 0010:mpage_prepare_extent_to_map+0xbe9/0xc00 fs/ext4/inode.c:2647
>   Code: e8 fc bd c3 ff e9 80 f6 ff ff 44 89 e9 80 e1 07 38 c1 0f 8c a6 fe ff 
> ff 4c 89 ef e8 e1 bd c3 ff e9 99 fe ff ff e8 87 c9 89 ff <0f> 0b e8 80 c9 89 
> ff 0f 0b e8 79 1e b8 02 66 0f 1f 84 00 00 00 00
>   RSP: 0018:c9e2e9c0 EFLAGS: 00010293
>   RAX: 81e321f9 RBX:  RCX: 88810c12cf00
>   RDX:  RSI:  RDI: 
>   RBP: c9e2eb90 R08: 81e31e71 R09: f940008d68b1
>   R10: f940008d68b1 R11:  R12: ea00046b4580
>   R13: c9e2ea80 R14: 011e R15: dc00
>   FS:  7fcfd0717700() GS:8881f700() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 7fcfd07d5520 CR3: 00010a142000 CR4: 003506b0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: fffe0ff0 DR7: 0400
>   Call Trace:
>ext4_writepages+0xcbb/0x3950 fs/ext4/inode.c:2776
>do_writepages+0x13a/0x280 mm/page-writeback.c:2358
>__filemap_fdatawrite_range+0x356/0x420 mm/filemap.c:427
>filemap_write_and_wait_range+0x64/0xe0 mm/filemap.c:660
>__iomap_dio_rw+0x621/0x10c0 fs/iomap/direct-io.c:495
>iomap_dio_rw+0x35/0x80 fs/iomap/direct-io.c:611
>ext4_dio_write_iter fs/ext4/file.c:571 [inline]
>ext4_file_write_iter+0xfc5/0x1b70 fs/ext4/file.c:681
>do_iter_readv_writev+0x548/0x740 include/linux/fs.h:1941
>do_iter_write+0x182/0x660 fs/read_write.c:866
>vfs_iter_write+0x7c/0xa0 fs/read_write.c:907
>iter_file_splice_write+0x7fb/0xf70 fs/splice.c:686
>do_splice_from fs/splice.c:764 [inline]
>direct_splice_actor+0xfe/0x130 fs/splice.c:933
>splice_direct_to_actor+0x4f4/0xbc0 fs/splice.c:888
>do_splice_direct+0x28b/0x3e0 fs/splice.c:976
>do_sendfile+0x914/0x1390 fs/read_write.c:1257
> 
> Link: https://syzkaller.appspot.com/bug?extid=41c966bf0729a530bd8d
> 
> From the patch:
> Cc: Stable 
> Cc: Christoph Hellwig 
> Cc: Dave Chinner 
> Cc: Goldwyn Rodrigues 
> Cc: Darrick J. Wong 
> Cc: Bob Peterson 
> Cc: Damien Le Moal 
> Cc: Theodore Ts'o  # for ext4
> Cc: Andreas Gruenbacher 
> Cc: Ritesh Harjani 
> 
> Others:
> Cc: Johannes Thumshirn 
> Cc: linux-...@vger

Re: [Cluster-devel] [PATCH] iomap: iomap_read_inline_data cleanup

2021-11-17 Thread Darrick J. Wong
On Wed, Nov 17, 2021 at 11:32:02AM +0100, Andreas Gruenbacher wrote:
> Change iomap_read_inline_data to return 0 or an error code; this
> simplifies the callers.  Add a description.
> 
> Signed-off-by: Andreas Gruenbacher 

Looks good, thank you for cleaning this up for me!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 30 ++
>  1 file changed, 14 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index fe10d8a30f6b..f1bc9a35184d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -205,7 +205,15 @@ struct iomap_readpage_ctx {
>   struct readahead_control *rac;
>  };
>  
> -static loff_t iomap_read_inline_data(const struct iomap_iter *iter,
> +/**
> + * iomap_read_inline_data - copy inline data into the page cache
> + * @iter: iteration structure
> + * @page: page to copy to
> + *
> + * Copy the inline data in @iter into @page and zero out the rest of the 
> page.
> + * Only a single IOMAP_INLINE extent is allowed at the end of each file.
> + */
> +static int iomap_read_inline_data(const struct iomap_iter *iter,
>   struct page *page)
>  {
>   const struct iomap *iomap = iomap_iter_srcmap(iter);
> @@ -214,7 +222,7 @@ static loff_t iomap_read_inline_data(const struct 
> iomap_iter *iter,
>   void *addr;
>  
>   if (PageUptodate(page))
> - return PAGE_SIZE - poff;
> + return 0;
>  
>   if (WARN_ON_ONCE(size > PAGE_SIZE - poff))
>   return -EIO;
> @@ -231,7 +239,7 @@ static loff_t iomap_read_inline_data(const struct 
> iomap_iter *iter,
>   memset(addr + size, 0, PAGE_SIZE - poff - size);
>   kunmap_local(addr);
>   iomap_set_range_uptodate(page, poff, PAGE_SIZE - poff);
> - return PAGE_SIZE - poff;
> + return 0;
>  }
>  
>  static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
> @@ -256,13 +264,8 @@ static loff_t iomap_readpage_iter(const struct 
> iomap_iter *iter,
>   unsigned poff, plen;
>   sector_t sector;
>  
> - if (iomap->type == IOMAP_INLINE) {
> - loff_t ret = iomap_read_inline_data(iter, page);
> -
> - if (ret < 0)
> - return ret;
> - return 0;
> - }
> + if (iomap->type == IOMAP_INLINE)
> + return iomap_read_inline_data(iter, page);
>  
>   /* zero post-eof blocks as the page may be mapped */
>   iop = iomap_page_create(iter->inode, page);
> @@ -587,15 +590,10 @@ static int __iomap_write_begin(const struct iomap_iter 
> *iter, loff_t pos,
>  static int iomap_write_begin_inline(const struct iomap_iter *iter,
>   struct page *page)
>  {
> - int ret;
> -
>   /* needs more work for the tailpacking case; disable for now */
>   if (WARN_ON_ONCE(iomap_iter_srcmap(iter)->offset != 0))
>   return -EIO;
> - ret = iomap_read_inline_data(iter, page);
> - if (ret < 0)
> - return ret;
> - return 0;
> + return iomap_read_inline_data(iter, page);
>  }
>  
>  static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> -- 
> 2.31.1
> 



Re: [Cluster-devel] [5.15 REGRESSION v2] iomap: Fix inline extent handling in iomap_readpage

2021-11-16 Thread Darrick J. Wong
On Thu, Nov 11, 2021 at 05:17:14PM +0100, Andreas Gruenbacher wrote:
> Before commit 740499c78408 ("iomap: fix the iomap_readpage_actor return
> value for inline data"), when hitting an IOMAP_INLINE extent,
> iomap_readpage_actor would report having read the entire page.  Since
> then, it only reports having read the inline data (iomap->length).
> 
> This will force iomap_readpage into another iteration, and the
> filesystem will report an unaligned hole after the IOMAP_INLINE extent.
> But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to
> deal with unaligned extents, it will get things wrong on filesystems
> with a block size smaller than the page size, and we'll eventually run
> into the following warning in iomap_iter_advance:
> 
>   WARN_ON_ONCE(iter->processed > iomap_length(iter));
> 
> Fix that by changing iomap_readpage_iter to return 0 when hitting an
> inline extent; this will cause iomap_iter to stop immediately.

I guess this means that we also only support having inline data that
ends at EOF?  IIRC this is true for the three(?) filesystems that have
expressed any interest in inline data: yours, ext4, and erofs?

> To fix readahead as well, change iomap_readahead_iter to pass on
> iomap_readpage_iter return values less than or equal to zero.
> 
> Fixes: 740499c78408 ("iomap: fix the iomap_readpage_actor return value for 
> inline data")
> Cc: sta...@vger.kernel.org # v5.15+
> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/iomap/buffered-io.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 1753c26c8e76..fe10d8a30f6b 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -256,8 +256,13 @@ static loff_t iomap_readpage_iter(const struct 
> iomap_iter *iter,
>   unsigned poff, plen;
>   sector_t sector;
>  
> - if (iomap->type == IOMAP_INLINE)
> - return min(iomap_read_inline_data(iter, page), length);
> + if (iomap->type == IOMAP_INLINE) {
> + loff_t ret = iomap_read_inline_data(iter, page);

Ew, iomap_read_inline_data returns loff_t.  I think I'll slip in a
change of return type to ssize_t, if you don't mind?

> +
> + if (ret < 0)
> + return ret;

...and a comment here explaining that we only support inline data that
ends at EOF?

If the answers to all /four/ questions are 'yes', then consider this:
Reviewed-by: Darrick J. Wong 

--D

> + return 0;
> + }
>  
>   /* zero post-eof blocks as the page may be mapped */
>   iop = iomap_page_create(iter->inode, page);
> @@ -370,6 +375,8 @@ static loff_t iomap_readahead_iter(const struct 
> iomap_iter *iter,
>   ctx->cur_page_in_bio = false;
>   }
>   ret = iomap_readpage_iter(iter, ctx, done);
> + if (ret <= 0)
> + return ret;
>   }
>  
>   return done;
> -- 
> 2.31.1
> 



Re: [Cluster-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw

2021-10-19 Thread Darrick J. Wong
On Tue, Oct 19, 2021 at 09:30:58PM +0200, Andreas Gruenbacher wrote:
> On Tue, Oct 19, 2021 at 5:51 PM Darrick J. Wong  wrote:
> > On Tue, Oct 19, 2021 at 03:42:01PM +0200, Andreas Gruenbacher wrote:
> > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > the request has already been transferred.  When the request succeeds, we
> > > report that done_before additional bytes were transferred.  This is
> > > useful for finishing a request asynchronously when part of the request
> > > has already been completed synchronously.
> > >
> > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > disabled: when a page fault occurs while submitting a request, we
> > > synchronously complete the part of the request that has already been
> > > submitted.  The caller can then take care of the page fault and call
> > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > bytes already transferred.
> > >
> > > Signed-off-by: Andreas Gruenbacher 
> > > ---
> > >  fs/btrfs/file.c   |  5 +++--
> > >  fs/erofs/data.c   |  2 +-
> > >  fs/ext4/file.c|  5 +++--
> > >  fs/gfs2/file.c|  4 ++--
> > >  fs/iomap/direct-io.c  | 11 ---
> > >  fs/xfs/xfs_file.c |  6 +++---
> > >  fs/zonefs/super.c |  4 ++--
> > >  include/linux/iomap.h |  4 ++--
> > >  8 files changed, 24 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index f37211d3bb69..9d41b28c67ba 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1957,7 +1957,7 @@ static ssize_t btrfs_direct_write(struct kiocb 
> > > *iocb, struct iov_iter *from)
> > >   }
> > >
> > >   dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
> > > &btrfs_dio_ops,
> > > -  0);
> > > +  0, 0);
> > >
> > >   btrfs_inode_unlock(inode, ilock_flags);
> > >
> > > @@ -3658,7 +3658,8 @@ static ssize_t btrfs_direct_read(struct kiocb 
> > > *iocb, struct iov_iter *to)
> > >   return 0;
> > >
> > >   btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > > - ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > 0);
> > > + ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > +0, 0);
> > >   btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > >   return ret;
> > >  }
> > > diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> > > index 9db829715652..16a41d0db55a 100644
> > > --- a/fs/erofs/data.c
> > > +++ b/fs/erofs/data.c
> > > @@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(struct kiocb 
> > > *iocb, struct iov_iter *to)
> > >
> > >   if (!err)
> > >   return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
> > > - NULL, 0);
> > > + NULL, 0, 0);
> > >   if (err < 0)
> > >   return err;
> > >   }
> > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > > index ac0e11bbb445..b25c1f8f7c4f 100644
> > > --- a/fs/ext4/file.c
> > > +++ b/fs/ext4/file.c
> > > @@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, 
> > > struct iov_iter *to)
> > >   return generic_file_read_iter(iocb, to);
> > >   }
> > >
> > > - ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
> > > + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
> > >   inode_unlock_shared(inode);
> > >
> > >   file_accessed(iocb->ki_filp);
> > > @@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb 
> > > *iocb, struct iov_iter *from)
> > >   if (ilock_shared)
> > >   iomap_ops = &ext4_iomap_overwrite_ops;
> > >   ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > -(unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT 
> > > : 0);
> > > +(unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT 
> > > : 0,
> > > +0);
> > >   if (ret == -ENOTBLK)
> > >   ret = 0;
> > >
> > > diff --git a/fs/gfs2/file.c b/fs/gfs2/file

Re: [Cluster-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw

2021-10-19 Thread Darrick J. Wong
omap_dio {
>   atomic_t ref;
>   unsigned flags;
>   int error;
> + size_t  done_before;

I have basically the same comment as last time[1]:

"So, now that I actually understand the reason why the count of
previously transferred bytes has to be passed into the iomap_dio, I
would like this field to have a comment so that stupid maintainers like
me don't forget the subtleties again:

/*
 * For asynchronous IO, we have one chance to call the iocb
 * completion method with the results of the directio operation.
 * If this operation is a resubmission after a previous partial
 * completion (e.g. page fault), we need to know about that
 * progress so that we can report both the results of the prior
 * completion and the result of the resubmission to the iocb
 * submitter.
 */
size_t  done_before;

With that added, I think I can live with this enough to:
Reviewed-by: Darrick J. Wong 

--D

[1] https://lore.kernel.org/linux-fsdevel/20210903185351.GD9892@magnolia/

>   bool wait_for_completion;
>  
>   union {
> @@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
>   ret = generic_write_sync(iocb, ret);
>  
> + if (ret > 0)
> + ret += dio->done_before;
> +
>   kfree(dio);
>  
>   return ret;
> @@ -456,7 +460,7 @@ static loff_t iomap_dio_iter(const struct iomap_iter 
> *iter,
>  struct iomap_dio *
>  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct address_space *mapping = iocb->ki_filp->f_mapping;
>   struct inode *inode = file_inode(iocb->ki_filp);
> @@ -486,6 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->dops = dops;
>   dio->error = 0;
>   dio->flags = 0;
> + dio->done_before = done_before;
>  
>   dio->submit.iter = iter;
>   dio->submit.waiter = current;
> @@ -652,11 +657,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct iomap_dio *dio;
>  
> - dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> + dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
>   if (IS_ERR_OR_NULL(dio))
>   return PTR_ERR_OR_ZERO(dio);
>   return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7aa943edfc02..240eb932c014 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
>   ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
>   if (ret)
>   return ret;
> - ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
>   xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>   return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
>   }
>   trace_xfs_file_direct_write(iocb, from);
>   ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -&xfs_dio_write_ops, 0);
> +&xfs_dio_write_ops, 0, 0);
>  out_unlock:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
>  
>   trace_xfs_file_direct_write(iocb, from);
>   ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -&xfs_dio_write_ops, flags);
> +&xfs_dio_write_ops, flags, 0);
>  
>   /*
>* Retry unaligned I/O with exclusive blocking semantics if the DIO
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index ddc346a9df9b..6122c38ab44d 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, 
> struct iov_iter *from)
>   ret = zonefs_file_dio_append(iocb, from);
>   else
>   ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> -&zonefs_write_dio_ops, 0);
> +&zonefs_write_dio_ops, 0, 0);
>   if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>   (ret > 0 || ret == -EIOCBQUEUED)) {

Re: [Cluster-devel] [PATCH v7 15/19] iomap: Support partial direct I/O on user copy failures

2021-09-03 Thread Darrick J. Wong
On Fri, Aug 27, 2021 at 06:49:22PM +0200, Andreas Gruenbacher wrote:
> In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
> IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
> return a partial result.  This allows the caller to deal with the page
> fault and retry the remainder of the request.
> 
> Signed-off-by: Andreas Gruenbacher 

Pretty straightforward.
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/direct-io.c  | 6 ++
>  include/linux/iomap.h | 7 +++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 8054f5d6c273..ba88fe51b77a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
>   iomap_dio_actor);
>   if (ret <= 0) {
> + if (ret == -EFAULT && dio->size &&
> + (dio_flags & IOMAP_DIO_PARTIAL)) {
> + wait_for_completion = true;
> + ret = 0;
> + }
> +
>   /* magic error code to fall back to buffered I/O */
>   if (ret == -ENOTBLK) {
>   wait_for_completion = true;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 479c1da3e221..bcae4814b8e3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -267,6 +267,13 @@ struct iomap_dio_ops {
>*/
>  #define IOMAP_DIO_OVERWRITE_ONLY (1 << 1)
>  
> +/*
> + * When a page fault occurs, return a partial synchronous result and allow
> + * the caller to retry the rest of the operation after dealing with the page
> + * fault.
> + */
> +#define IOMAP_DIO_PARTIAL (1 << 2)
> +
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
>   unsigned int dio_flags);
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v7 14/19] iomap: Fix iomap_dio_rw return value for user copies

2021-09-03 Thread Darrick J. Wong
On Fri, Aug 27, 2021 at 06:49:21PM +0200, Andreas Gruenbacher wrote:
> When a user copy fails in one of the helpers of iomap_dio_rw, fail with
> -EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
> returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
> changes, iomap_dio_actor now consistently fails with -EFAULT when a user
> page cannot be faulted in.
> 
> Signed-off-by: Andreas Gruenbacher 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/direct-io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..8054f5d6c273 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
>  {
>   length = iov_iter_zero(length, dio->submit.iter);
>   dio->size += length;
> - return length;
> + return length ? length : -EFAULT;
>  }
>  
>  static loff_t
> @@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   copied = copy_to_iter(iomap->inline_data + pos, length, iter);
>   }
>   dio->size += copied;
> - return copied;
> + return copied ? copied : -EFAULT;
>  }
>  
>  static loff_t
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-09-03 Thread Darrick J. Wong
peration is a resubmission after a previous partial
 * completion (e.g. page fault), we need to know about that
 * progress so that we can report that and the result of the
 * resubmission to the iocb completion.
 */
size_t  done_before;

With that added, I think I can live with this enough to:
Reviewed-by: Darrick J. Wong 

--D

>   bool wait_for_completion;
>  
>   union {
> @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
>   ret = generic_write_sync(iocb, ret);
>  
> + if (ret > 0)
> + ret += dio->done_before;
> +
>   kfree(dio);
>  
>   return ret;
> @@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>  struct iomap_dio *
>  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct address_space *mapping = iocb->ki_filp->f_mapping;
>   struct inode *inode = file_inode(iocb->ki_filp);
> @@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->dops = dops;
>   dio->error = 0;
>   dio->flags = 0;
> + dio->done_before = done_before;
>  
>   dio->submit.iter = iter;
>   dio->submit.waiter = current;
> @@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct iomap_dio *dio;
>  
> - dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> + dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
>   if (IS_ERR_OR_NULL(dio))
>   return PTR_ERR_OR_ZERO(dio);
>   return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cc3cfb12df53..3103d9bda466 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
>   ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
>   if (ret)
>   return ret;
> - ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
>   xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>   return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
>   }
>   trace_xfs_file_direct_write(iocb, from);
>   ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -&xfs_dio_write_ops, 0);
> +&xfs_dio_write_ops, 0, 0);
>  out_unlock:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
>  
>   trace_xfs_file_direct_write(iocb, from);
>   ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -&xfs_dio_write_ops, flags);
> +&xfs_dio_write_ops, flags, 0);
>  
>   /*
>* Retry unaligned I/O with exclusive blocking semantics if the DIO
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 70055d486bf7..85ca2f5fe06e 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, 
> struct iov_iter *from)
>   ret = zonefs_file_dio_append(iocb, from);
>   else
>   ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> -&zonefs_write_dio_ops, 0);
> +&zonefs_write_dio_ops, 0, 0);
>   if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>   (ret > 0 || ret == -EIOCBQUEUED)) {
>   if (ret > 0)
> @@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, 
> struct iov_iter *to)
>   }
>   file_accessed(iocb->ki_filp);
>   ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> -&zonefs_read_dio_ops, 0);
> +&zonefs_read_dio_ops, 0, 0);
>   } else {
>   ret = generic_file_read_iter(iocb, to);
>   if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bcae4814b8e3..908bda10024c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -276,10 +276,10 @@ struct iomap_dio_ops {
>  
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
>  struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
>  ssize_t iomap_dio_complete(struct iomap_dio *dio);
>  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
>  
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-09-03 Thread Darrick J. Wong
On Fri, Aug 27, 2021 at 03:35:06PM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 2:32 PM Darrick J. Wong  wrote:
> >
> > No, because you totally ignored the second question:
> >
> > If the directio operation succeeds even partially and the PARTIAL flag
> > is set, won't that push the iov iter ahead by however many bytes
> > completed?
> >
> > We already finished the IO for the first page, so the second attempt
> > should pick up where it left off, i.e. the second page.
> 
> Darrick, I think you're missing the point.
> 
> It's the *return*value* that is the issue, not the iovec.
> 
> The iovec is updated as you say. But the return value from the async
> part is - without Andreas' patch - only the async part of it.
> 
> With Andreas' patch, the async part will now return the full return
> value, including the part that was done synchronously.
> 
> And the return value is returned from that async part, which somehow
> thus needs to know what predated it.

Aha, that was the missing piece, thank you.  I'd forgotten that
iomap_dio_complete_work calls iocb->ki_complete with the return value of
iomap_dio_complete, which means that the iomap_dio has to know if there
was a previous transfer that stopped short so that the caller could do
more work and resubmit.

> Could that pre-existing part perhaps be saved somewhere else? Very
> possibly. That 'struct iomap_dio' addition is kind of ugly. So maybe
> what Andreas did could be done differently.

There's probably a more elegant way for the ->ki_complete functions to
figure out how much got transferred, but that's sufficiently ugly and
invasive so as not to be suitable for a bug fix.

> But I think you guys are arguing past each other.

Yes, definitely.

--D

> 
>Linus



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-08-27 Thread Darrick J. Wong
On Fri, Aug 27, 2021 at 10:15:11PM +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 27, 2021 at 8:30 PM Darrick J. Wong  wrote:
> > On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > the request has already been transferred.  When the request succeeds, we
> > > report that done_before additional bytes were transferred.  This is
> > > useful for finishing a request asynchronously when part of the request
> > > has already been completed synchronously.
> > >
> > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > disabled: when a page fault occurs while submitting a request, we
> > > synchronously complete the part of the request that has already been
> > > submitted.  The caller can then take care of the page fault and call
> > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > bytes already transferred.
> > >
> > > Signed-off-by: Andreas Gruenbacher 
> > > ---
> > >  fs/btrfs/file.c   |  5 +++--
> > >  fs/ext4/file.c|  5 +++--
> > >  fs/gfs2/file.c|  4 ++--
> > >  fs/iomap/direct-io.c  | 11 ---
> > >  fs/xfs/xfs_file.c |  6 +++---
> > >  fs/zonefs/super.c |  4 ++--
> > >  include/linux/iomap.h |  4 ++--
> > >  7 files changed, 23 insertions(+), 16 deletions(-)
> > >
> >
> > 
> >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ba88fe51b77a..dcf9a2b4381f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -31,6 +31,7 @@ struct iomap_dio {
> > >   atomic_tref;
> > >   unsignedflags;
> > >   int error;
> > > + size_t  done_before;
> > >   boolwait_for_completion;
> > >
> > >   union {
> > > @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > >   ret = generic_write_sync(iocb, ret);
> > >
> > > + if (ret > 0)
> > > + ret += dio->done_before;
> >
> > Pardon my ignorance since this is the first time I've had a crack at
> > this patchset, but why is it necessary to carry the "bytes copied"
> > count from the /previous/ iomap_dio_rw call all the way through to dio
> > completion of the current call?
> 
> Consider the following situation:
> 
>  * A user submits an asynchronous read request.
> 
>  * The first page of the buffer is in memory, but the following
>pages are not. This isn't uncommon for consecutive reads
>into freshly allocated memory.
> 
>  * iomap_dio_rw writes into the first page. Then it
>hits the next page which is missing, so it returns a partial
>result, synchronously.
> 
>  * We then fault in the remaining pages and call iomap_dio_rw
>for the rest of the request.
> 
>  * The rest of the request completes asynchronously.
> 
> Does that answer your question?

No, because you totally ignored the second question:

If the directio operation succeeds even partially and the PARTIAL flag
is set, won't that push the iov iter ahead by however many bytes
completed?

We already finished the IO for the first page, so the second attempt
should pick up where it left off, i.e. the second page.

--D

> Thanks,
> Andreas
> 



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-08-27 Thread Darrick J. Wong
On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> Add a done_before argument to iomap_dio_rw that indicates how much of
> the request has already been transferred.  When the request succeeds, we
> report that done_before additional bytes were transferred.  This is
> useful for finishing a request asynchronously when part of the request
> has already been completed synchronously.
> 
> We'll use that to allow iomap_dio_rw to be used with page faults
> disabled: when a page fault occurs while submitting a request, we
> synchronously complete the part of the request that has already been
> submitted.  The caller can then take care of the page fault and call
> iomap_dio_rw again for the rest of the request, passing in the number of
> bytes already transferred.
> 
> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/btrfs/file.c   |  5 +++--
>  fs/ext4/file.c|  5 +++--
>  fs/gfs2/file.c|  4 ++--
>  fs/iomap/direct-io.c  | 11 ---
>  fs/xfs/xfs_file.c |  6 +++---
>  fs/zonefs/super.c |  4 ++--
>  include/linux/iomap.h |  4 ++--
>  7 files changed, 23 insertions(+), 16 deletions(-)
> 



> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ba88fe51b77a..dcf9a2b4381f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -31,6 +31,7 @@ struct iomap_dio {
>   atomic_tref;
>   unsignedflags;
>   int error;
> + size_t  done_before;
>   boolwait_for_completion;
>  
>   union {
> @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
>   ret = generic_write_sync(iocb, ret);
>  
> + if (ret > 0)
> + ret += dio->done_before;

Pardon my ignorance since this is the first time I've had a crack at
this patchset, but why is it necessary to carry the "bytes copied"
count from the /previous/ iomap_dio_rw call all the way through to dio
completion of the current call?

If the directio operation succeeds even partially and the PARTIAL flag
is set, won't that push the iov iter ahead by however many bytes
completed?

In other words, why won't this loop work for gfs2?

size_t copied = 0;
while (iov_iter_count(iov) > 0) {
ssize_t ret = iomap_dio_rw(iocb, iov, ..., IOMAP_DIO_PARTIAL);
if (iov_iter_count(iov) == 0 || ret != -EFAULT)
break;

copied += ret;
/* strange gfs2 relocking I don't understand */
/* deal with page faults... */
};
if (ret < 0)
return ret;
return copied + ret;

It feels clunky to make the caller pass the results of a previous
operation through the current operation just so the caller can catch the
value again afterwards.  Is there something I'm missing?

--D
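For what it's worth, the sketch above can be made self-consistent once `copied` is derived from how far the iter advanced rather than from the `-EFAULT` return value.  This is a hypothetical model, not the kernel API: `model_dio_rw` stands in for `iomap_dio_rw` under the assumption that `-EFAULT` comes back with the iter already advanced past the completed bytes:

```c
#include <stddef.h>

#define MODEL_EFAULT 14

/* Hypothetical stand-in for iomap_dio_rw() with IOMAP_DIO_PARTIAL:
 * the iov_iter is modeled as a remaining-byte counter, advanced past
 * whatever completed before the (simulated) page fault. */
static long model_dio_rw(size_t *iter_remaining, size_t fault_after)
{
	if (*iter_remaining <= fault_after) {
		long done = (long)*iter_remaining;

		*iter_remaining = 0;
		return done;
	}
	*iter_remaining -= fault_after;	/* iter advanced past done bytes */
	return -MODEL_EFAULT;
}

/* The caller loop from the mail, with the total computed from iter
 * progress so nothing needs to be threaded through the dio. */
static long model_read(size_t len, size_t fault_after)
{
	size_t remaining = len;
	long ret;

	do {
		ret = model_dio_rw(&remaining, fault_after);
		/* ... fault in the missing pages here ... */
		fault_after = (size_t)-1;	/* next attempt won't fault */
	} while (remaining > 0 && ret == -MODEL_EFAULT);

	return ret < 0 ? ret : (long)(len - remaining);
}
```

This only resolves the question for callers that can loop like this; the thread's counterpoint is that the *async* completion still reports its own count to `->ki_complete`, which no caller-side loop can intercept.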

> +
>   kfree(dio);
>  
>   return ret;
> @@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>  struct iomap_dio *
>  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct address_space *mapping = iocb->ki_filp->f_mapping;
>   struct inode *inode = file_inode(iocb->ki_filp);
> @@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->dops = dops;
>   dio->error = 0;
>   dio->flags = 0;
> + dio->done_before = done_before;
>  
>   dio->submit.iter = iter;
>   dio->submit.waiter = current;
> @@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
>  {
>   struct iomap_dio *dio;
>  
> - dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> + dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
>   if (IS_ERR_OR_NULL(dio))
>   return PTR_ERR_OR_ZERO(dio);
>   return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cc3cfb12df53..3103d9bda466 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
>   ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
>   if (ret)
>   return ret;
> - ret = iomap_dio_rw(iocb, to, _read_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, _read_iomap_ops, NULL, 0, 0);
>   xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>   return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
>   }
>   trace_xfs_file_direct_write(iocb, from);
>   ret = iomap_dio_rw(iocb, from, 

Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model

2021-08-12 Thread Darrick J. Wong
On Thu, Aug 12, 2021 at 08:49:14AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 11, 2021 at 12:17:08PM -0700, Darrick J. Wong wrote:
> > > iter.c is also my preference, but in the end I don't care too much.
> > 
> > Ok.  My plan for this is to change this patch to add the new iter code
> > to apply.c, and change patch 24 to remove iomap_apply.  I'll add a patch
> > on the end to rename apply.c to iter.c, which will avoid breaking the
> > history.
> 
> What history?  There is no shared code, so no shared history either.

The history of the gluecode that enables us to walk a bunch of extent
mappings.  In the beginning it was the _apply function, but now in our
spectre-weary world, you've switched it to a direct loop to reduce the
number of indirect calls in the hot path by 30-50%.

As you correctly point out, there's no /code/ shared by the two
implementations, but Dave and I would like to preserve the continuity
from one to the next.

> > I'll send the updated patches as replies to this series to avoid
> > spamming the list, since I also have a patchset of bugfixes to send out
> > and don't want to overwhelm everyone.
> 
> Just as a clear statement:  I think this dance is obfuscation and doesn't
> help in any way.  But if that's what it takes..

I /would/ appreciate it if you'd rvb (or at least ack) patch 31 so I can
get the 5.15 iomap changes finalized next week.  Pretty please? :)

--D



[Cluster-devel] [PATCH 31/30] iomap: move iomap iteration code to iter.c

2021-08-11 Thread Darrick J. Wong
From: Darrick J. Wong 

Now that we've moved iomap to the iterator model, rename this file to be
in sync with the functions contained inside of it.

Signed-off-by: Darrick J. Wong 
---
 fs/iomap/Makefile |2 +-
 fs/iomap/iter.c   |0 
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename fs/iomap/{apply.c => iter.c} (100%)

diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
index e46f936dde81..bb64215ae256 100644
--- a/fs/iomap/Makefile
+++ b/fs/iomap/Makefile
@@ -26,9 +26,9 @@ ccflags-y += -I $(srctree)/$(src) # needed for 
trace events
 obj-$(CONFIG_FS_IOMAP) += iomap.o
 
 iomap-y+= trace.o \
-  apply.o \
   buffered-io.o \
   direct-io.o \
   fiemap.o \
+  iter.o \
   seek.o
 iomap-$(CONFIG_SWAP)   += swapfile.o
diff --git a/fs/iomap/apply.c b/fs/iomap/iter.c
similarity index 100%
rename from fs/iomap/apply.c
rename to fs/iomap/iter.c



[Cluster-devel] [PATCH v2.1 24/30] iomap: remove iomap_apply

2021-08-11 Thread Darrick J. Wong
From: Christoph Hellwig 

iomap_apply is unused now, so remove it.

Signed-off-by: Christoph Hellwig 
[djwong: rebase this patch to preserve git history of iomap loop control]
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
---
 fs/iomap/apply.c  |   91 -
 fs/iomap/trace.h  |   40 --
 include/linux/iomap.h |   10 -
 3 files changed, 141 deletions(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index e82647aef7ea..a1c7592d2ade 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -3,101 +3,10 @@
  * Copyright (C) 2010 Red Hat, Inc.
  * Copyright (c) 2016-2021 Christoph Hellwig.
  */
-#include 
-#include 
 #include 
 #include 
 #include "trace.h"
 
-/*
- * Execute a iomap write on a segment of the mapping that spans a
- * contiguous range of pages that have identical block mapping state.
- *
- * This avoids the need to map pages individually, do individual allocations
- * for each page and most importantly avoid the need for filesystem specific
- * locking per page. Instead, all the operations are amortised over the entire
- * range of pages. It is assumed that the filesystems will lock whatever
- * resources they require in the iomap_begin call, and release them in the
- * iomap_end call.
- */
-loff_t
-iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
-   const struct iomap_ops *ops, void *data, iomap_actor_t actor)
-{
-   struct iomap iomap = { .type = IOMAP_HOLE };
-   struct iomap srcmap = { .type = IOMAP_HOLE };
-   loff_t written = 0, ret;
-   u64 end;
-
-   trace_iomap_apply(inode, pos, length, flags, ops, actor, _RET_IP_);
-
-   /*
-* Need to map a range from start position for length bytes. This can
-* span multiple pages - it is only guaranteed to return a range of a
-* single type of pages (e.g. all into a hole, all mapped or all
-* unwritten). Failure at this point has nothing to undo.
-*
-* If allocation is required for this range, reserve the space now so
-* that the allocation is guaranteed to succeed later on. Once we copy
-* the data into the page cache pages, then we cannot fail otherwise we
-* expose transient stale data. If the reserve fails, we can safely
-* back out at this point as there is nothing to undo.
-*/
-   ret = ops->iomap_begin(inode, pos, length, flags, , );
-   if (ret)
-   return ret;
-   if (WARN_ON(iomap.offset > pos)) {
-   written = -EIO;
-   goto out;
-   }
-   if (WARN_ON(iomap.length == 0)) {
-   written = -EIO;
-   goto out;
-   }
-
-   trace_iomap_apply_dstmap(inode, );
-   if (srcmap.type != IOMAP_HOLE)
-   trace_iomap_apply_srcmap(inode, );
-
-   /*
-* Cut down the length to the one actually provided by the filesystem,
-* as it might not be able to give us the whole size that we requested.
-*/
-   end = iomap.offset + iomap.length;
-   if (srcmap.type != IOMAP_HOLE)
-   end = min(end, srcmap.offset + srcmap.length);
-   if (pos + length > end)
-   length = end - pos;
-
-   /*
-* Now that we have guaranteed that the space allocation will succeed,
-* we can do the copy-in page by page without having to worry about
-* failures exposing transient data.
-*
-* To support COW operations, we read in data for partially blocks from
-* the srcmap if the file system filled it in.  In that case we the
-* length needs to be limited to the earlier of the ends of the iomaps.
-* If the file system did not provide a srcmap we pass in the normal
-* iomap into the actors so that they don't need to have special
-* handling for the two cases.
-*/
-   written = actor(inode, pos, length, data, ,
-   srcmap.type != IOMAP_HOLE ?  : );
-
-out:
-   /*
-* Now the data has been copied, commit the range we've copied.  This
-* should not fail unless the filesystem has had a fatal error.
-*/
-   if (ops->iomap_end) {
-   ret = ops->iomap_end(inode, pos, length,
-written > 0 ? written : 0,
-flags, );
-   }
-
-   return written ? written : ret;
-}
-
 static inline int iomap_iter_advance(struct iomap_iter *iter)
 {
/* handle the previous iteration (if any) */
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 1012d7af6b68..f1519f9a1403 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -138,49 +138,9 @@ DECLARE_EVENT_CLASS(iomap_class,
 DEFINE_EVENT(iomap_class, name,\
TP_PROTO(struct inode *inode, struct iomap *iomap), \
TP_ARGS(inode, iomap))
-DEFINE_IOMAP_EVENT(iomap_apply_

[Cluster-devel] [PATCH v2.1 11/30] iomap: add the new iomap_iter model

2021-08-11 Thread Darrick J. Wong
From: Christoph Hellwig 

The iomap_iter struct provides a convenient way to package up and
maintain all the arguments to the various mapping and operation
functions.  It is operated on using the iomap_iter() function that
is called in a loop until the whole range has been processed.  Compared
to the existing iomap_apply() function this avoids an indirect call
for each iteration.

For now iomap_iter() calls back into the existing ->iomap_begin and
->iomap_end methods, but in the future this could be further optimized
to avoid indirect calls entirely.

Based on an earlier patch from Matthew Wilcox .

Signed-off-by: Christoph Hellwig 
[djwong: add to apply.c to preserve git history of iomap loop control]
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
---
 fs/iomap/apply.c  |   74 -
 fs/iomap/trace.h  |   37 -
 include/linux/iomap.h |   56 +
 3 files changed, 165 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..e82647aef7ea 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2010 Red Hat, Inc.
- * Copyright (c) 2016-2018 Christoph Hellwig.
+ * Copyright (c) 2016-2021 Christoph Hellwig.
  */
 #include 
 #include 
@@ -97,3 +97,75 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
unsigned flags,
 
return written ? written : ret;
 }
+
+static inline int iomap_iter_advance(struct iomap_iter *iter)
+{
+   /* handle the previous iteration (if any) */
+   if (iter->iomap.length) {
+   if (iter->processed <= 0)
+   return iter->processed;
+   if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
+   return -EIO;
+   iter->pos += iter->processed;
+   iter->len -= iter->processed;
+   if (!iter->len)
+   return 0;
+   }
+
+   /* clear the state for the next iteration */
+   iter->processed = 0;
+   memset(>iomap, 0, sizeof(iter->iomap));
+   memset(>srcmap, 0, sizeof(iter->srcmap));
+   return 1;
+}
+
+static inline void iomap_iter_done(struct iomap_iter *iter)
+{
+   WARN_ON_ONCE(iter->iomap.offset > iter->pos);
+   WARN_ON_ONCE(iter->iomap.length == 0);
+   WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
+
+   trace_iomap_iter_dstmap(iter->inode, >iomap);
+   if (iter->srcmap.type != IOMAP_HOLE)
+   trace_iomap_iter_srcmap(iter->inode, >srcmap);
+}
+
+/**
+ * iomap_iter - iterate over a ranges in a file
+ * @iter: iteration structue
+ * @ops: iomap ops provided by the file system
+ *
+ * Iterate over filesystem-provided space mappings for the provided file range.
+ *
+ * This function handles cleanup of resources acquired for iteration when the
+ * filesystem indicates there are no more space mappings, which means that this
+ * function must be called in a loop that continues as long it returns a
+ * positive value.  If 0 or a negative value is returned, the caller must not
+ * return to the loop body.  Within a loop body, there are two ways to break 
out
+ * of the loop body:  leave @iter.processed unchanged, or set it to a negative
+ * errno.
+ */
+int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
+{
+   int ret;
+
+   if (iter->iomap.length && ops->iomap_end) {
+   ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
+   iter->processed > 0 ? iter->processed : 0,
+   iter->flags, >iomap);
+   if (ret < 0 && !iter->processed)
+   return ret;
+   }
+
+   trace_iomap_iter(iter, ops, _RET_IP_);
+   ret = iomap_iter_advance(iter);
+   if (ret <= 0)
+   return ret;
+
+   ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
+  >iomap, >srcmap);
+   if (ret < 0)
+   return ret;
+   iomap_iter_done(iter);
+   return 1;
+}
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index e9cd5cc0d6ba..1012d7af6b68 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
- * Copyright (c) 2009-2019 Christoph Hellwig
+ * Copyright (c) 2009-2021 Christoph Hellwig
  *
  * NOTE: none of these tracepoints shall be considered a stable kernel ABI
  * as they can change at any time.
@@ -140,6 +140,8 @@ DEFINE_EVENT(iomap_class, name, \
TP_ARGS(inode, iomap))
 DEFINE_IOMAP_EVENT(iomap_apply_dstmap);
 DEFINE_IOMAP_EVENT(iomap_apply_srcmap);
+DEFINE_IOMAP_EVENT(iomap_iter_dstmap);
+DEFI

Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model

2021-08-11 Thread Darrick J. Wong
On Wed, Aug 11, 2021 at 07:38:56AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 10, 2021 at 05:31:18PM -0700, Darrick J. Wong wrote:
> > > +static inline void iomap_iter_done(struct iomap_iter *iter)
> > 
> > I wonder why this is a separate function, since it only has debugging
> > warnings and tracepoints?
> 
> The reason for these two sub-helpers was to force me to structure the
> code so that Matthew's original idea of replacing ->iomap_begin and
> ->iomap_end with a single next callback so that iomap_iter could
> be inlined into callers and the indirect calls could be elided is
> still possible.  This would only be useful for a few specific
> methods (probably dax and direct I/O) where we care so much, but it
> seemed like a nice idea conceptually so I would not want to break it.
> 
> OTOH we could just remove this function for now and do that once needed.



> > Modulo the question about iomap_iter_done, I guess this looks all right
> > to me.  As far as apply.c vs. core.c, I'm not wildly passionate about
> > either naming choice (I would have called it iter.c) but ... fmeh.
> 
> iter.c is also my preference, but in the end I don't care too much.

Ok.  My plan for this is to change this patch to add the new iter code
to apply.c, and change patch 24 to remove iomap_apply.  I'll add a patch
on the end to rename apply.c to iter.c, which will avoid breaking the
history.

I'll send the updated patches as replies to this series to avoid
spamming the list, since I also have a patchset of bugfixes to send out
and don't want to overwhelm everyone.

--D
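The iteration protocol the series settles on can be modeled compactly.  The structures and the fake 4 KiB-per-call mapping below are simplified stand-ins, not the kernel's; only the advance/processed handshake mirrors the patch:

```c
#include <stddef.h>

/* Minimal model of the iomap_iter protocol under discussion: the
 * caller loops while the iter function returns > 0 and reports
 * progress through iter.processed. */
struct model_iter {
	long long pos;		/* current file offset */
	long long len;		/* bytes left to process */
	long long map_len;	/* length of the current "mapping" */
	long long processed;	/* set by the loop body each pass */
};

static int model_iter_step(struct model_iter *iter)
{
	/* advance over the previous iteration, like iomap_iter_advance() */
	if (iter->map_len) {
		if (iter->processed <= 0)
			return (int)iter->processed;
		iter->pos += iter->processed;
		iter->len -= iter->processed;
		if (!iter->len)
			return 0;
	}
	iter->processed = 0;
	/* fake ->iomap_begin: hand back at most 4 KiB at a time */
	iter->map_len = iter->len < 4096 ? iter->len : 4096;
	return 1;
}

static long long model_walk(long long pos, long long len)
{
	struct model_iter iter = { .pos = pos, .len = len };
	int ret;

	while ((ret = model_iter_step(&iter)) > 0)
		iter.processed = iter.map_len;	/* "handle" the mapping */
	return ret < 0 ? ret : iter.pos - pos;
}
```

The loop body makes no indirect call at all; with iomap_apply the equivalent work went through an actor function pointer on every mapping.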



[Cluster-devel] [PATCH v2.1 19/30] iomap: switch iomap_bmap to use iomap_iter

2021-08-11 Thread Darrick J. Wong
From: Christoph Hellwig 

Rewrite the ->bmap implementation based on iomap_iter.

Signed-off-by: Christoph Hellwig 
[djwong: restructure the loop to make its behavior a little clearer]
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
---
 fs/iomap/fiemap.c |   31 +--
 1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index acad09a8c188..66cf267c68ae 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -92,37 +92,32 @@ int iomap_fiemap(struct inode *inode, struct 
fiemap_extent_info *fi,
 }
 EXPORT_SYMBOL_GPL(iomap_fiemap);
 
-static loff_t
-iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-   void *data, struct iomap *iomap, struct iomap *srcmap)
-{
-   sector_t *bno = data, addr;
-
-   if (iomap->type == IOMAP_MAPPED) {
-   addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
-   *bno = addr;
-   }
-   return 0;
-}
-
 /* legacy ->bmap interface.  0 is the error return (!) */
 sector_t
 iomap_bmap(struct address_space *mapping, sector_t bno,
const struct iomap_ops *ops)
 {
-   struct inode *inode = mapping->host;
-   loff_t pos = bno << inode->i_blkbits;
-   unsigned blocksize = i_blocksize(inode);
+   struct iomap_iter iter = {
+   .inode  = mapping->host,
+   .pos= (loff_t)bno << mapping->host->i_blkbits,
+   .len= i_blocksize(mapping->host),
+   .flags  = IOMAP_REPORT,
+   };
+   const unsigned int blkshift = mapping->host->i_blkbits - SECTOR_SHIFT;
int ret;
 
if (filemap_write_and_wait(mapping))
return 0;
 
bno = 0;
-   ret = iomap_apply(inode, pos, blocksize, 0, ops, ,
- iomap_bmap_actor);
+   while ((ret = iomap_iter(, ops)) > 0) {
+   if (iter.iomap.type == IOMAP_MAPPED)
+   bno = iomap_sector(, iter.pos) >> blkshift;
+   /* leave iter.processed unset to abort loop */
+   }
if (ret)
return 0;
+
return bno;
 }
 EXPORT_SYMBOL_GPL(iomap_bmap);
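The removed actor and the new loop body land on the same block number, since right shifts compose: shifting a byte address down by SECTOR_SHIFT and then by (blkbits - SECTOR_SHIFT) equals shifting by blkbits.  A quick model (with the iomap_sector arithmetic inlined, as an assumption from the iomap header) checks this:

```c
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* 512-byte sectors */

/* Old actor: shift the byte address straight down to fs-block units. */
static uint64_t bno_old(uint64_t addr, uint64_t off, uint64_t pos,
			unsigned int blkbits)
{
	return (pos - off + addr) >> blkbits;
}

/* New loop body: go through 512-byte sectors first, the way
 * iomap_sector() computes them, then shift out the remaining
 * (blkbits - SECTOR_SHIFT) bits. */
static uint64_t bno_new(uint64_t addr, uint64_t off, uint64_t pos,
			unsigned int blkbits)
{
	uint64_t sector = (addr + pos - off) >> MODEL_SECTOR_SHIFT;

	return sector >> (blkbits - MODEL_SECTOR_SHIFT);
}
```

Example: a mapping at byte address 1048576 covering file offset 8192, queried at pos 12288 with 4 KiB blocks, gives block 257 either way.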



Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:25AM +0200, Christoph Hellwig wrote:
> The iomap_iter struct provides a convenient way to package up and
> maintain all the arguments to the various mapping and operation
> functions.  It is operated on using the iomap_iter() function that
> is called in a loop until the whole range has been processed.  Compared
> to the existing iomap_apply() function this avoids an indirect call
> for each iteration.
> 
> For now iomap_iter() calls back into the existing ->iomap_begin and
> ->iomap_end methods, but in the future this could be further optimized
> to avoid indirect calls entirely.
> 
> Based on an earlier patch from Matthew Wilcox .
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/Makefile |  1 +
>  fs/iomap/core.c   | 79 +++
>  fs/iomap/trace.h  | 37 +++-
>  include/linux/iomap.h | 56 ++
>  4 files changed, 172 insertions(+), 1 deletion(-)
>  create mode 100644 fs/iomap/core.c
> 
> diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
> index eef2722d93a183..6b56b10ded347a 100644
> --- a/fs/iomap/Makefile
> +++ b/fs/iomap/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP)  += iomap.o
>  
>  iomap-y  += trace.o \
>  apply.o \
> +core.o \
>  buffered-io.o \
>  direct-io.o \
>  fiemap.o \
> diff --git a/fs/iomap/core.c b/fs/iomap/core.c
> new file mode 100644
> index 00..89a87a1654e8e6
> --- /dev/null
> +++ b/fs/iomap/core.c
> @@ -0,0 +1,79 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021 Christoph Hellwig.
> + */
> +#include 
> +#include 
> +#include "trace.h"
> +
> +static inline int iomap_iter_advance(struct iomap_iter *iter)
> +{
> + /* handle the previous iteration (if any) */
> + if (iter->iomap.length) {
> + if (iter->processed <= 0)
> + return iter->processed;
> + if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
> + return -EIO;
> + iter->pos += iter->processed;
> + iter->len -= iter->processed;
> + if (!iter->len)
> + return 0;
> + }
> +
> + /* clear the state for the next iteration */
> + iter->processed = 0;
> + memset(>iomap, 0, sizeof(iter->iomap));
> + memset(>srcmap, 0, sizeof(iter->srcmap));
> + return 1;
> +}
> +
> +static inline void iomap_iter_done(struct iomap_iter *iter)

I wonder why this is a separate function, since it only has debugging
warnings and tracepoints?

> +{
> + WARN_ON_ONCE(iter->iomap.offset > iter->pos);
> + WARN_ON_ONCE(iter->iomap.length == 0);
> + WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
> +
> + trace_iomap_iter_dstmap(iter->inode, >iomap);
> + if (iter->srcmap.type != IOMAP_HOLE)
> + trace_iomap_iter_srcmap(iter->inode, >srcmap);
> +}
> +
> +/**
> + * iomap_iter - iterate over a ranges in a file
> + * @iter: iteration structue
> + * @ops: iomap ops provided by the file system
> + *
> + * Iterate over filesystem-provided space mappings for the provided file 
> range.
> + *
> + * This function handles cleanup of resources acquired for iteration when the
> + * filesystem indicates there are no more space mappings, which means that 
> this
> + * function must be called in a loop that continues as long it returns a
> + * positive value.  If 0 or a negative value is returned, the caller must not
> + * return to the loop body.  Within a loop body, there are two ways to break 
> out
> + * of the loop body:  leave @iter.processed unchanged, or set it to a 
> negative
> + * errno.

Thanks for improving the documentation.

Modulo the question about iomap_iter_done, I guess this looks all right
to me.  As far as apply.c vs. core.c, I'm not wildly passionate about
either naming choice (I would have called it iter.c) but ... fmeh.

Reviewed-by: Darrick J. Wong 

--D

> + */
> +int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
> +{
> + int ret;
> +
> + if (iter->iomap.length && ops->iomap_end) {
> + ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
> + iter->processed > 0 ? iter->processed : 0,
> + iter->flags, >iomap);
> + if (ret

Re: [Cluster-devel] [PATCH 17/30] iomap: switch __iomap_dio_rw to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:31AM +0200, Christoph Hellwig wrote:
> Switch __iomap_dio_rw to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

I like the reduction in ->submit_io arguments.  The conversion seems
straightforward enough.

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/btrfs/inode.c  |   5 +-
>  fs/iomap/direct-io.c  | 164 +-
>  include/linux/iomap.h |   4 +-
>  3 files changed, 86 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0117d867ecf876..3b0595e8bdd929 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8194,9 +8194,10 @@ static struct btrfs_dio_private 
> *btrfs_create_dio_private(struct bio *dio_bio,
>   return dip;
>  }
>  
> -static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
> +static blk_qc_t btrfs_submit_direct(const struct iomap_iter *iter,
>   struct bio *dio_bio, loff_t file_offset)
>  {
> + struct inode *inode = iter->inode;
>   const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
> @@ -8212,7 +8213,7 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
>   int ret;
>   blk_status_t status;
>   struct btrfs_io_geometry geom;
> - struct btrfs_dio_data *dio_data = iomap->private;
> + struct btrfs_dio_data *dio_data = iter->iomap.private;
>   struct extent_map *em = NULL;
>  
>   dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41ccbfc9dc820a..4ecd255e0511ce 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
>   * Copyright (C) 2010 Red Hat, Inc.
> - * Copyright (c) 2016-2018 Christoph Hellwig.
> + * Copyright (c) 2016-2021 Christoph Hellwig.
>   */
>  #include 
>  #include 
> @@ -59,19 +59,17 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  }
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
> -static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> - struct bio *bio, loff_t pos)
> +static void iomap_dio_submit_bio(const struct iomap_iter *iter,
> + struct iomap_dio *dio, struct bio *bio, loff_t pos)
>  {
>   atomic_inc(>ref);
>  
>   if (dio->iocb->ki_flags & IOCB_HIPRI)
>   bio_set_polled(bio, dio->iocb);
>  
> - dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> + dio->submit.last_queue = bdev_get_queue(iter->iomap.bdev);
>   if (dio->dops && dio->dops->submit_io)
> - dio->submit.cookie = dio->dops->submit_io(
> - file_inode(dio->iocb->ki_filp),
> - iomap, bio, pos);
> + dio->submit.cookie = dio->dops->submit_io(iter, bio, pos);
>   else
>   dio->submit.cookie = submit_bio(bio);
>  }
> @@ -181,24 +179,23 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>   }
>  }
>  
> -static void
> -iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> - unsigned len)
> +static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio 
> *dio,
> + loff_t pos, unsigned len)
>  {
>   struct page *page = ZERO_PAGE(0);
>   int flags = REQ_SYNC | REQ_IDLE;
>   struct bio *bio;
>  
>   bio = bio_alloc(GFP_KERNEL, 1);
> - bio_set_dev(bio, iomap->bdev);
> - bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> + bio_set_dev(bio, iter->iomap.bdev);
> + bio->bi_iter.bi_sector = iomap_sector(>iomap, pos);
>   bio->bi_private = dio;
>   bio->bi_end_io = iomap_dio_bio_end_io;
>  
>   get_page(page);
>   __bio_add_page(bio, page, len, 0);
>   bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> - iomap_dio_submit_bio(dio, iomap, bio, pos);
> + iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
>  
>  /*
> @@ -206,8 +203,8 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap 
> *iomap, loff_t pos,
>   * mapping, and whether or not we want FUA.  Note that we can end up
>   * clearing the WRITE_FUA flag in the dio request.
>   */
> -static inline unsigned int
> -iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool 
> use_fua)
> +static inline unsigned int iomap_dio_bio_opflags(struct iomap_dio *dio,
> + const struct iomap *iomap, bool use

Re: [Cluster-devel] [PATCH 21/30] iomap: switch iomap_seek_data to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:35AM +0200, Christoph Hellwig wrote:
> Rewrite iomap_seek_data to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Nice cleanup,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/seek.c | 47 ---
>  1 file changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index fed8f9005f9e46..a845c012b50c67 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -56,47 +56,48 @@ iomap_seek_hole(struct inode *inode, loff_t pos, const 
> struct iomap_ops *ops)
>  }
>  EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
> -static loff_t
> -iomap_seek_data_actor(struct inode *inode, loff_t start, loff_t length,
> -   void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_seek_data_iter(const struct iomap_iter *iter,
> + loff_t *hole_pos)
>  {
> - loff_t offset = start;
> + loff_t length = iomap_length(iter);
>  
> - switch (iomap->type) {
> + switch (iter->iomap.type) {
>   case IOMAP_HOLE:
>   return length;
>   case IOMAP_UNWRITTEN:
> - offset = mapping_seek_hole_data(inode->i_mapping, start,
> - start + length, SEEK_DATA);
> - if (offset < 0)
> + *hole_pos = mapping_seek_hole_data(iter->inode->i_mapping,
> + iter->pos, iter->pos + length, SEEK_DATA);
> + if (*hole_pos < 0)
>   return length;
> - fallthrough;
> + return 0;
>   default:
> - *(loff_t *)data = offset;
> + *hole_pos = iter->pos;
>   return 0;
>   }
>  }
>  
>  loff_t
> -iomap_seek_data(struct inode *inode, loff_t offset, const struct iomap_ops 
> *ops)
> +iomap_seek_data(struct inode *inode, loff_t pos, const struct iomap_ops *ops)
>  {
>   loff_t size = i_size_read(inode);
> - loff_t ret;
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= pos,
> + .flags  = IOMAP_REPORT,
> + };
> + int ret;
>  
>   /* Nothing to be found before or beyond the end of the file. */
> - if (offset < 0 || offset >= size)
> + if (pos < 0 || pos >= size)
>   return -ENXIO;
>  
> - while (offset < size) {
> - ret = iomap_apply(inode, offset, size - offset, IOMAP_REPORT,
> -   ops, &offset, iomap_seek_data_actor);
> - if (ret < 0)
> - return ret;
> - if (ret == 0)
> - return offset;
> - offset += ret;
> - }
> -
> + iter.len = size - pos;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_seek_data_iter(&iter, &pos);
> + if (ret < 0)
> + return ret;
> + if (iter.len) /* found data before EOF */
> + return pos;
>   /* We've reached the end of the file without finding data */
>   return -ENXIO;
>  }
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 20/30] iomap: switch iomap_seek_hole to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:34AM +0200, Christoph Hellwig wrote:
> Rewrite iomap_seek_hole to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/seek.c | 51 +
>  1 file changed, 26 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index ce6fb810854fec..fed8f9005f9e46 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
>   * Copyright (C) 2017 Red Hat, Inc.
> - * Copyright (c) 2018 Christoph Hellwig.
> + * Copyright (c) 2018-2021 Christoph Hellwig.
>   */
>  #include 
>  #include 
> @@ -10,21 +10,20 @@
>  #include 
>  #include 
>  
> -static loff_t
> -iomap_seek_hole_actor(struct inode *inode, loff_t start, loff_t length,
> -   void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_seek_hole_iter(const struct iomap_iter *iter,
> + loff_t *hole_pos)
>  {
> - loff_t offset = start;
> + loff_t length = iomap_length(iter);
>  
> - switch (iomap->type) {
> + switch (iter->iomap.type) {
>   case IOMAP_UNWRITTEN:
> - offset = mapping_seek_hole_data(inode->i_mapping, start,
> - start + length, SEEK_HOLE);
> - if (offset == start + length)
> + *hole_pos = mapping_seek_hole_data(iter->inode->i_mapping,
> + iter->pos, iter->pos + length, SEEK_HOLE);
> + if (*hole_pos == iter->pos + length)
>   return length;
> - fallthrough;
> + return 0;
>   case IOMAP_HOLE:
> - *(loff_t *)data = offset;
> + *hole_pos = iter->pos;
>   return 0;
>   default:
>   return length;
> @@ -32,26 +31,28 @@ iomap_seek_hole_actor(struct inode *inode, loff_t start, 
> loff_t length,
>  }
>  
>  loff_t
> -iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops 
> *ops)
> +iomap_seek_hole(struct inode *inode, loff_t pos, const struct iomap_ops *ops)
>  {
>   loff_t size = i_size_read(inode);
> - loff_t ret;
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= pos,
> + .flags  = IOMAP_REPORT,
> + };
> + int ret;
>  
>   /* Nothing to be found before or beyond the end of the file. */
> - if (offset < 0 || offset >= size)
> + if (pos < 0 || pos >= size)
>   return -ENXIO;
>  
> - while (offset < size) {
> - ret = iomap_apply(inode, offset, size - offset, IOMAP_REPORT,
> -   ops, &offset, iomap_seek_hole_actor);
> - if (ret < 0)
> - return ret;
> - if (ret == 0)
> - break;
> - offset += ret;
> - }
> -
> - return offset;
> + iter.len = size - pos;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_seek_hole_iter(&iter, &pos);
> + if (ret < 0)
> + return ret;
> + if (iter.len) /* found hole before EOF */
> + return pos;
> + return size;
>  }
>  EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 23/30] fsdax: switch dax_iomap_rw to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:37AM +0200, Christoph Hellwig wrote:
> Switch the dax_iomap_rw implementation to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

/me gets excited about this file getting cleaned up
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 49 -
>  1 file changed, 24 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4d63040fd71f56..51da45301350a6 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1103,20 +1103,21 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct 
> iomap *iomap)
>   return size;
>  }
>  
> -static loff_t
> -dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> - struct iomap *iomap, struct iomap *srcmap)
> +static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> + struct iov_iter *iter)
>  {
> + const struct iomap *iomap = &iomi->iomap;
> + loff_t length = iomap_length(iomi);
> + loff_t pos = iomi->pos;
>   struct block_device *bdev = iomap->bdev;
>   struct dax_device *dax_dev = iomap->dax_dev;
> - struct iov_iter *iter = data;
>   loff_t end = pos + length, done = 0;
>   ssize_t ret = 0;
>   size_t xfer;
>   int id;
>  
>   if (iov_iter_rw(iter) == READ) {
> - end = min(end, i_size_read(inode));
> + end = min(end, i_size_read(iomi->inode));
>   if (pos >= end)
>   return 0;
>  
> @@ -1133,7 +1134,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>* written by write(2) is visible in mmap.
>*/
>   if (iomap->flags & IOMAP_F_NEW) {
> - invalidate_inode_pages2_range(inode->i_mapping,
> + invalidate_inode_pages2_range(iomi->inode->i_mapping,
> pos >> PAGE_SHIFT,
> (end - 1) >> PAGE_SHIFT);
>   }
> @@ -1209,31 +1210,29 @@ ssize_t
>  dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops)
>  {
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> - struct inode *inode = mapping->host;
> - loff_t pos = iocb->ki_pos, ret = 0, done = 0;
> - unsigned flags = 0;
> + struct iomap_iter iomi = {
> + .inode  = iocb->ki_filp->f_mapping->host,
> + .pos= iocb->ki_pos,
> + .len= iov_iter_count(iter),
> + };
> + loff_t done = 0;
> + int ret;
>  
>   if (iov_iter_rw(iter) == WRITE) {
> - lockdep_assert_held_write(&inode->i_rwsem);
> - flags |= IOMAP_WRITE;
> + lockdep_assert_held_write(&iomi.inode->i_rwsem);
> + iomi.flags |= IOMAP_WRITE;
>   } else {
> - lockdep_assert_held(&inode->i_rwsem);
> + lockdep_assert_held(&iomi.inode->i_rwsem);
>   }
>  
>   if (iocb->ki_flags & IOCB_NOWAIT)
> - flags |= IOMAP_NOWAIT;
> + iomi.flags |= IOMAP_NOWAIT;
>  
> - while (iov_iter_count(iter)) {
> - ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> - iter, dax_iomap_actor);
> - if (ret <= 0)
> - break;
> - pos += ret;
> - done += ret;
> - }
> + while ((ret = iomap_iter(&iomi, ops)) > 0)
> + iomi.processed = dax_iomap_iter(&iomi, iter);
>  
> - iocb->ki_pos += done;
> + done = iomi.pos - iocb->ki_pos;
> + iocb->ki_pos = iomi.pos;
>   return done ? done : ret;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_rw);
> @@ -1307,7 +1306,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   }
>  
>   /*
> -  * Note that we don't bother to use iomap_apply here: DAX required
> +  * Note that we don't bother to use iomap_iter here: DAX required
>* the file system block size to be equal the page size, which means
>* that we never have to deal with more than a single extent here.
>*/
> @@ -1561,7 +1560,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   }
>  
>   /*
> -  * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> +  * Note that we don't use iomap_iter here.  We aren't doing I/O, only
>* setting up a mapping, so really we're using iomap_begin() as a way
>* to look up our filesystem block.
>*/
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 22/30] iomap: switch iomap_swapfile_activate to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:36AM +0200, Christoph Hellwig wrote:
> Switch iomap_swapfile_activate to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Smooth
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/swapfile.c | 38 --
>  1 file changed, 16 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
> index 6250ca6a1f851d..7069606eca85b2 100644
> --- a/fs/iomap/swapfile.c
> +++ b/fs/iomap/swapfile.c
> @@ -88,13 +88,9 @@ static int iomap_swapfile_fail(struct iomap_swapfile_info 
> *isi, const char *str)
>   * swap only cares about contiguous page-aligned physical extents and makes 
> no
>   * distinction between written and unwritten extents.
>   */
> -static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
> - loff_t count, void *data, struct iomap *iomap,
> - struct iomap *srcmap)
> +static loff_t iomap_swapfile_iter(const struct iomap_iter *iter,
> + struct iomap *iomap, struct iomap_swapfile_info *isi)
>  {
> - struct iomap_swapfile_info *isi = data;
> - int error;
> -
>   switch (iomap->type) {
>   case IOMAP_MAPPED:
>   case IOMAP_UNWRITTEN:
> @@ -125,12 +121,12 @@ static loff_t iomap_swapfile_activate_actor(struct 
> inode *inode, loff_t pos,
>   isi->iomap.length += iomap->length;
>   } else {
>   /* Otherwise, add the retained iomap and store this one. */
> - error = iomap_swapfile_add_extent(isi);
> + int error = iomap_swapfile_add_extent(isi);
>   if (error)
>   return error;
>   memcpy(&isi->iomap, iomap, sizeof(isi->iomap));
>   }
> - return count;
> + return iomap_length(iter);
>  }
>  
>  /*
> @@ -141,16 +137,19 @@ int iomap_swapfile_activate(struct swap_info_struct 
> *sis,
>   struct file *swap_file, sector_t *pagespan,
>   const struct iomap_ops *ops)
>  {
> + struct inode *inode = swap_file->f_mapping->host;
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= 0,
> + .len= ALIGN_DOWN(i_size_read(inode), PAGE_SIZE),
> + .flags  = IOMAP_REPORT,
> + };
>   struct iomap_swapfile_info isi = {
>   .sis = sis,
>   .lowest_ppage = (sector_t)-1ULL,
>   .file = swap_file,
>   };
> - struct address_space *mapping = swap_file->f_mapping;
> - struct inode *inode = mapping->host;
> - loff_t pos = 0;
> - loff_t len = ALIGN_DOWN(i_size_read(inode), PAGE_SIZE);
> - loff_t ret;
> + int ret;
>  
>   /*
>* Persist all file mapping metadata so that we won't have any
> @@ -160,15 +159,10 @@ int iomap_swapfile_activate(struct swap_info_struct 
> *sis,
>   if (ret)
>   return ret;
>  
> - while (len > 0) {
> - ret = iomap_apply(inode, pos, len, IOMAP_REPORT,
> - ops, &isi, iomap_swapfile_activate_actor);
> - if (ret <= 0)
> - return ret;
> -
> - pos += ret;
> - len -= ret;
> - }
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_swapfile_iter(&iter, &iter.iomap, &isi);
> + if (ret < 0)
> + return ret;
>  
>   if (isi.iomap.length) {
>   ret = iomap_swapfile_add_extent();
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 18/30] iomap: switch iomap_fiemap to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:32AM +0200, Christoph Hellwig wrote:
> Rewrite the ->fiemap implementation based on iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Nice cleanups!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/fiemap.c | 70 ---
>  1 file changed, 29 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
> index aab070df4a2175..acad09a8c188df 100644
> --- a/fs/iomap/fiemap.c
> +++ b/fs/iomap/fiemap.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
> - * Copyright (c) 2016-2018 Christoph Hellwig.
> + * Copyright (c) 2016-2021 Christoph Hellwig.
>   */
>  #include 
>  #include 
> @@ -8,13 +8,8 @@
>  #include 
>  #include 
>  
> -struct fiemap_ctx {
> - struct fiemap_extent_info *fi;
> - struct iomap prev;
> -};
> -
>  static int iomap_to_fiemap(struct fiemap_extent_info *fi,
> - struct iomap *iomap, u32 flags)
> + const struct iomap *iomap, u32 flags)
>  {
>   switch (iomap->type) {
>   case IOMAP_HOLE:
> @@ -43,24 +38,22 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>   iomap->length, flags);
>  }
>  
> -static loff_t
> -iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void 
> *data,
> - struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_fiemap_iter(const struct iomap_iter *iter,
> + struct fiemap_extent_info *fi, struct iomap *prev)
>  {
> - struct fiemap_ctx *ctx = data;
> - loff_t ret = length;
> + int ret;
>  
> - if (iomap->type == IOMAP_HOLE)
> - return length;
> + if (iter->iomap.type == IOMAP_HOLE)
> + return iomap_length(iter);
>  
> - ret = iomap_to_fiemap(ctx->fi, &ctx->prev, 0);
> - ctx->prev = *iomap;
> + ret = iomap_to_fiemap(fi, prev, 0);
> + *prev = iter->iomap;
>   switch (ret) {
>   case 0: /* success */
> - return length;
> + return iomap_length(iter);
>   case 1: /* extent array full */
>   return 0;
> - default:
> + default:/* error */
>   return ret;
>   }
>  }
> @@ -68,38 +61,33 @@ iomap_fiemap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>   u64 start, u64 len, const struct iomap_ops *ops)
>  {
> - struct fiemap_ctx ctx;
> - loff_t ret;
> -
> - memset(&ctx, 0, sizeof(ctx));
> - ctx.fi = fi;
> - ctx.prev.type = IOMAP_HOLE;
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= start,
> + .len= len,
> + .flags  = IOMAP_REPORT,
> + };
> + struct iomap prev = {
> + .type   = IOMAP_HOLE,
> + };
> + int ret;
>  
> - ret = fiemap_prep(inode, fi, start, &len, 0);
> + ret = fiemap_prep(inode, fi, start, &len, 0);
>   if (ret)
>   return ret;
>  
> - while (len > 0) {
> - ret = iomap_apply(inode, start, len, IOMAP_REPORT, ops, &ctx,
> - iomap_fiemap_actor);
> - /* inode with no (attribute) mapping will give ENOENT */
> - if (ret == -ENOENT)
> - break;
> - if (ret < 0)
> - return ret;
> - if (ret == 0)
> - break;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_fiemap_iter(&iter, fi, &prev);
>  
> - start += ret;
> - len -= ret;
> - }
> -
> - if (ctx.prev.type != IOMAP_HOLE) {
> - ret = iomap_to_fiemap(fi, &ctx.prev, FIEMAP_EXTENT_LAST);
> + if (prev.type != IOMAP_HOLE) {
> + ret = iomap_to_fiemap(fi, &prev, FIEMAP_EXTENT_LAST);
>   if (ret < 0)
>   return ret;
>   }
>  
> + /* inode with no (attribute) mapping will give ENOENT */
> + if (ret < 0 && ret != -ENOENT)
> + return ret;
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 14/30] iomap: switch iomap_file_unshare to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:28AM +0200, Christoph Hellwig wrote:
> Switch iomap_file_unshare to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 35 ++-
>  1 file changed, 18 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4c7e82928cc546..4f525727462f33 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -817,10 +817,12 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  
> -static loff_t
> -iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void 
> *data,
> - struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  {
> + struct iomap *iomap = &iter->iomap;
> + struct iomap *srcmap = iomap_iter_srcmap(iter);
> + loff_t pos = iter->pos;
> + loff_t length = iomap_length(iter);
>   long status = 0;
>   loff_t written = 0;
>  
> @@ -836,12 +838,12 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>   struct page *page;
>  
> - status = iomap_write_begin(inode, pos, bytes,
> + status = iomap_write_begin(iter->inode, pos, bytes,
>   IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
>   if (unlikely(status))
>   return status;
>  
> - status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
> + status = iomap_write_end(iter->inode, pos, bytes, bytes, page, 
> iomap,
>   srcmap);
>   if (WARN_ON_ONCE(status == 0))
>   return -EIO;
> @@ -852,7 +854,7 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   written += status;
>   length -= status;
>  
> - balance_dirty_pages_ratelimited(inode->i_mapping);
> + balance_dirty_pages_ratelimited(iter->inode->i_mapping);
>   } while (length);
>  
>   return written;
> @@ -862,18 +864,17 @@ int
>  iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>   const struct iomap_ops *ops)
>  {
> - loff_t ret;
> -
> - while (len) {
> - ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
> - iomap_unshare_actor);
> - if (ret <= 0)
> - return ret;
> - pos += ret;
> - len -= ret;
> - }
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= pos,
> + .len= len,
> + .flags  = IOMAP_WRITE,
> + };
> + int ret;
>  
> - return 0;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_unshare_iter(&iter);
> + return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 13/30] iomap: switch iomap_file_buffered_write to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:27AM +0200, Christoph Hellwig wrote:
> Switch iomap_file_buffered_write to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 

Seems pretty straightforward.
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 49 +-
>  1 file changed, 25 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9cda461887afad..4c7e82928cc546 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -726,13 +726,14 @@ static size_t iomap_write_end(struct inode *inode, 
> loff_t pos, size_t len,
>   return ret;
>  }
>  
> -static loff_t
> -iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> - struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  {
> - struct iov_iter *i = data;
> - long status = 0;
> + struct iomap *srcmap = iomap_iter_srcmap(iter);
> + struct iomap *iomap = &iter->iomap;
> + loff_t length = iomap_length(iter);
> + loff_t pos = iter->pos;
>   ssize_t written = 0;
> + long status = 0;
>  
>   do {
>   struct page *page;
> @@ -758,18 +759,18 @@ iomap_write_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   break;
>   }
>  
> - status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
> - srcmap);
> + status = iomap_write_begin(iter->inode, pos, bytes, 0, &page,
> +iomap, srcmap);
>   if (unlikely(status))
>   break;
>  
> - if (mapping_writably_mapped(inode->i_mapping))
> + if (mapping_writably_mapped(iter->inode->i_mapping))
>   flush_dcache_page(page);
>  
>   copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>  
> - status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
> - srcmap);
> + status = iomap_write_end(iter->inode, pos, bytes, copied, page,
> +  iomap, srcmap);
>  
>   if (unlikely(copied != status))
>   iov_iter_revert(i, copied - status);
> @@ -790,29 +791,29 @@ iomap_write_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   written += status;
>   length -= status;
>  
> - balance_dirty_pages_ratelimited(inode->i_mapping);
> + balance_dirty_pages_ratelimited(iter->inode->i_mapping);
>   } while (iov_iter_count(i) && length);
>  
>   return written ? written : status;
>  }
>  
>  ssize_t
> -iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
> +iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>   const struct iomap_ops *ops)
>  {
> - struct inode *inode = iocb->ki_filp->f_mapping->host;
> - loff_t pos = iocb->ki_pos, ret = 0, written = 0;
> -
> - while (iov_iter_count(iter)) {
> - ret = iomap_apply(inode, pos, iov_iter_count(iter),
> - IOMAP_WRITE, ops, iter, iomap_write_actor);
> - if (ret <= 0)
> - break;
> - pos += ret;
> - written += ret;
> - }
> + struct iomap_iter iter = {
> + .inode  = iocb->ki_filp->f_mapping->host,
> + .pos= iocb->ki_pos,
> + .len= iov_iter_count(i),
> + .flags  = IOMAP_WRITE,
> + };
> + int ret;
>  
> - return written ? written : ret;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_write_iter(&iter, i);
> + if (iter.pos == iocb->ki_pos)
> + return ret;
> + return iter.pos - iocb->ki_pos;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 12/30] iomap: switch readahead and readpage to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:26AM +0200, Christoph Hellwig wrote:
> Switch the page cache read functions to use iomap_iter instead of
> iomap_apply.
> 
> Signed-off-by: Christoph Hellwig 

Looks reasonable,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 80 +++---
>  1 file changed, 37 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 26e16cc9d44931..9cda461887afad 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -241,11 +241,12 @@ static inline bool iomap_block_needs_zeroing(struct 
> inode *inode,
>   pos >= i_size_read(inode);
>  }
>  
> -static loff_t
> -iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void 
> *data,
> - struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_readpage_iter(struct iomap_iter *iter,
> + struct iomap_readpage_ctx *ctx, loff_t offset)
>  {
> - struct iomap_readpage_ctx *ctx = data;
> + struct iomap *iomap = &iter->iomap;
> + loff_t pos = iter->pos + offset;
> + loff_t length = iomap_length(iter) - offset;
>   struct page *page = ctx->cur_page;
>   struct iomap_page *iop;
>   loff_t orig_pos = pos;
> @@ -253,15 +254,16 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   sector_t sector;
>  
>   if (iomap->type == IOMAP_INLINE)
> - return min(iomap_read_inline_data(inode, page, iomap), length);
> + return min(iomap_read_inline_data(iter->inode, page, iomap),
> +   length);
>  
>   /* zero post-eof blocks as the page may be mapped */
> - iop = iomap_page_create(inode, page);
> - iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> + iop = iomap_page_create(iter->inode, page);
> + iomap_adjust_read_range(iter->inode, iop, &pos, length, &poff, &plen);
>   if (plen == 0)
>   goto done;
>  
> - if (iomap_block_needs_zeroing(inode, iomap, pos)) {
> + if (iomap_block_needs_zeroing(iter->inode, iomap, pos)) {
>   zero_user(page, poff, plen);
>   iomap_set_range_uptodate(page, poff, plen);
>   goto done;
> @@ -313,23 +315,23 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  int
>  iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  {
> - struct iomap_readpage_ctx ctx = { .cur_page = page };
> - struct inode *inode = page->mapping->host;
> - unsigned poff;
> - loff_t ret;
> + struct iomap_iter iter = {
> + .inode  = page->mapping->host,
> + .pos= page_offset(page),
> + .len= PAGE_SIZE,
> + };
> + struct iomap_readpage_ctx ctx = {
> + .cur_page   = page,
> + };
> + int ret;
>  
>   trace_iomap_readpage(page->mapping->host, 1);
>  
> - for (poff = 0; poff < PAGE_SIZE; poff += ret) {
> - ret = iomap_apply(inode, page_offset(page) + poff,
> - PAGE_SIZE - poff, 0, ops, &ctx,
> - iomap_readpage_actor);
> - if (ret <= 0) {
> - WARN_ON_ONCE(ret == 0);
> - SetPageError(page);
> - break;
> - }
> - }
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
> +
> + if (ret < 0)
> + SetPageError(page);
>  
>   if (ctx.bio) {
>   submit_bio(ctx.bio);
> @@ -348,15 +350,14 @@ iomap_readpage(struct page *page, const struct 
> iomap_ops *ops)
>  }
>  EXPORT_SYMBOL_GPL(iomap_readpage);
>  
> -static loff_t
> -iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_readahead_iter(struct iomap_iter *iter,
> + struct iomap_readpage_ctx *ctx)
>  {
> - struct iomap_readpage_ctx *ctx = data;
> + loff_t length = iomap_length(iter);
>   loff_t done, ret;
>  
>   for (done = 0; done < length; done += ret) {
> - if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> + if (ctx->cur_page && offset_in_page(iter->pos + done) == 0) {
>   if (!ctx->cur_page_in_bio)
>   unlock_page(ctx->cur_page);
>   put_page(ctx->cur_page);
> @@ -366,8 +367,7 @

Re: [Cluster-devel] [PATCH 10/30] iomap: fix the iomap_readpage_actor return value for inline data

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:24AM +0200, Christoph Hellwig wrote:
> The actor should never return a larger value than the length that was
> passed in.  The current code handles this gracefully, but the opcoming
> iter model will be more picky.

s/opcoming/upcoming/

With that fixed,
Reviewed-by: Darrick J. Wong 

--D

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/buffered-io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 44587209e6d7c7..26e16cc9d44931 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -205,7 +205,7 @@ struct iomap_readpage_ctx {
>   struct readahead_control *rac;
>  };
>  
> -static int iomap_read_inline_data(struct inode *inode, struct page *page,
> +static loff_t iomap_read_inline_data(struct inode *inode, struct page *page,
>   const struct iomap *iomap)
>  {
>   size_t size = i_size_read(inode) - iomap->offset;
> @@ -253,7 +253,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   sector_t sector;
>  
>   if (iomap->type == IOMAP_INLINE)
> - return iomap_read_inline_data(inode, page, iomap);
> + return min(iomap_read_inline_data(inode, page, iomap), length);
>  
>   /* zero post-eof blocks as the page may be mapped */
>   iop = iomap_page_create(inode, page);
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model

2021-08-10 Thread Darrick J. Wong
On Tue, Aug 10, 2021 at 08:10:47AM +1000, Dave Chinner wrote:
> On Mon, Aug 09, 2021 at 08:12:25AM +0200, Christoph Hellwig wrote:
> > The iomap_iter struct provides a convenient way to package up and
> > maintain all the arguments to the various mapping and operation
> > functions.  It is operated on using the iomap_iter() function that
> > is called in loop until the whole range has been processed.  Compared
> > to the existing iomap_apply() function this avoid an indirect call
> > for each iteration.
> > 
> > For now iomap_iter() calls back into the existing ->iomap_begin and
> > ->iomap_end methods, but in the future this could be further optimized
> > to avoid indirect calls entirely.
> > 
> > Based on an earlier patch from Matthew Wilcox .
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  fs/iomap/Makefile |  1 +
> >  fs/iomap/core.c   | 79 +++
> >  fs/iomap/trace.h  | 37 +++-
> >  include/linux/iomap.h | 56 ++
> >  4 files changed, 172 insertions(+), 1 deletion(-)
> >  create mode 100644 fs/iomap/core.c
> > 
> > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
> > index eef2722d93a183..6b56b10ded347a 100644
> > --- a/fs/iomap/Makefile
> > +++ b/fs/iomap/Makefile
> > @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP)+= iomap.o
> >  
> >  iomap-y+= trace.o \
> >apply.o \
> > +  core.o \
> 
> This creates a discontinuity in the iomap git history. Can you add
> these new functions to iomap/apply.c, then when the old apply code
> is removed later in the series rename the file to core.c? At least
> that way 'git log --follow fs/iomap/core.c' will walk back into the
> current history of fs/iomap/apply.c and the older pre-disaggregation
> fs/iomap.c without having to take the tree back in time to find
> those files...

...or put the new code in apply.c, remove iomap_apply, and don't bother
with the renaming at all?

I don't see much reason to break the git history.  This is effectively a
new epoch in iomap, but that is plainly obvious from the function
declarations.

I'll wander through the rest of the unreviewed patches tomorrow morning,
these are merely my off-the-cuff impressions.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com



Re: [Cluster-devel] [PATCH 19/30] iomap: switch iomap_bmap to use iomap_iter

2021-08-10 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:33AM +0200, Christoph Hellwig wrote:
> Rewrite the ->bmap implementation based on iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/fiemap.c | 31 +--
>  1 file changed, 13 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
> index acad09a8c188df..60daadba16c149 100644
> --- a/fs/iomap/fiemap.c
> +++ b/fs/iomap/fiemap.c
> @@ -92,35 +92,30 @@ int iomap_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fi,
>  }
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
> -static loff_t
> -iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> -{
> - sector_t *bno = data, addr;
> -
> - if (iomap->type == IOMAP_MAPPED) {
> - addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
> - *bno = addr;
> - }
> - return 0;
> -}
> -
>  /* legacy ->bmap interface.  0 is the error return (!) */
>  sector_t
>  iomap_bmap(struct address_space *mapping, sector_t bno,
>   const struct iomap_ops *ops)
>  {
> - struct inode *inode = mapping->host;
> - loff_t pos = bno << inode->i_blkbits;
> - unsigned blocksize = i_blocksize(inode);
> + struct iomap_iter iter = {
> + .inode  = mapping->host,
> + .pos= (loff_t)bno << mapping->host->i_blkbits,
> + .len= i_blocksize(mapping->host),
> + .flags  = IOMAP_REPORT,
> + };
>   int ret;
>  
>   if (filemap_write_and_wait(mapping))
>   return 0;
>  
>   bno = 0;
> - ret = iomap_apply(inode, pos, blocksize, 0, ops, &bno,
> -   iomap_bmap_actor);
> + while ((ret = iomap_iter(&iter, ops)) > 0) {
> + if (iter.iomap.type != IOMAP_MAPPED)
> + continue;

I still feel uncomfortable about this use of "continue" here, because it
really means "call iomap_iter again to clean up and exit even though we
know it won't even look for more iomaps to iterate".

To me that feels subtly broken (I usually associate 'continue' with
'go run the loop body again'), and even though bmap has been a quirky
hot mess for 45 years, we don't need to make it even more so.

Can't this at least be rephrased as:

const uint bno_shift = (mapping->host->i_blkbits - SECTOR_SHIFT);

while ((ret = iomap_iter(, ops)) > 0) {
if (iter.iomap.type == IOMAP_MAPPED)
bno = iomap_sector(iomap, iter.pos) << bno_shift;
/* leave iter.processed unset to stop iteration */
}

to make the loop exit more explicit?

--D

> + bno = (iter.pos - iter.iomap.offset + iter.iomap.addr) >>
> + mapping->host->i_blkbits;
> + }
> +
>   if (ret)
>   return 0;
>   return bno;
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 05/30] iomap: mark the iomap argument to iomap_inline_data_valid const

2021-08-09 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:19AM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  include/linux/iomap.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 560247130357b5..76bfc5d16ef49d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -109,7 +109,7 @@ static inline void *iomap_inline_data(const struct iomap 
> *iomap, loff_t pos)
>   * This is used to guard against accessing data beyond the page inline_data
>   * points at.
>   */
> -static inline bool iomap_inline_data_valid(struct iomap *iomap)
> +static inline bool iomap_inline_data_valid(const struct iomap *iomap)
>  {
>   return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
>  }
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 04/30] iomap: mark the iomap argument to iomap_inline_data const

2021-08-09 Thread Darrick J. Wong
On Mon, Aug 09, 2021 at 08:12:18AM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  include/linux/iomap.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8030483331d17f..560247130357b5 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -99,7 +99,7 @@ static inline sector_t iomap_sector(const struct iomap 
> *iomap, loff_t pos)
>  /*
>   * Returns the inline data pointer for logical offset @pos.
>   */
> -static inline void *iomap_inline_data(struct iomap *iomap, loff_t pos)
> +static inline void *iomap_inline_data(const struct iomap *iomap, loff_t pos)
>  {
>   return iomap->inline_data + pos - iomap->offset;
>  }
> -- 
> 2.30.2
> 



Re: [Cluster-devel] RFC: switch iomap to an iterator model

2021-07-29 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:53PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series replaces the existing callback-based iomap_apply with an iter based
> model.  The prime aim here is to simplify the DAX reflink support, which

Jan Kara pointed out that recent gcc and clang support a magic attribute
that causes a cleanup function to be called when an automatic variable
goes out of scope.  I've ported the XFS for_each_perag* macros to use
it, but I think this would be roughly (totally untested) what you'd do
for iomap iterators:

/* automatic iteration cleanup via macro hell */
struct iomap_iter_cleanup {
struct iomap_ops*ops;
struct iomap_iter   *iter;
loff_t  *ret;
};

static inline void iomap_iter_cleanup(struct iomap_iter_cleanup *ic)
{
struct iomap_iter *iter = ic->iter;
int ret2 = 0;

if (!iter->iomap.length || !ic->ops->iomap_end)
return;

ret2 = ic->ops->iomap_end(iter->inode, iter->pos,
iomap_length(iter), 0, iter->flags,
&iter->iomap);

if (ret2 && *ic->ret == 0)
*ic->ret = ret2;

iter->iomap.length = 0;
}

#define IOMAP_ITER_CLEANUP(iter, ops, ret) \
struct iomap_iter_cleanup __iomap_iter_cleanup \
__attribute__((__cleanup__(iomap_iter_cleanup))) = \
{ .iter = (iter), .ops = (ops), .ret = &(ret) }

#define for_each_iomap(iter, ops, ret) \
(ret) = iomap_iter((iter), (ops)); \
for (IOMAP_ITER_CLEANUP(iter, ops, ret); \
(ret) > 0; \
(ret) = iomap_iter((iter), (ops)))

Then we actually /can/ write our iteration loops in the normal C style:

struct iomap_iter iter = {
.inode = ...,
.pos = 0,
.length = 32768,
};
loff_t ret = 0;

for_each_iomap(&iter, ops, ret) {
if (iter.iomap.type != WHAT_I_WANT)
break;

ret = am_i_pissed_off(...);
if (ret)
return ret;
}

return ret;

and ->iomap_end will always get called.  There are a few sharp edges:

I can't figure out how far back clang and gcc support this attribute.
The gcc docs mention it at least as far back as 3.3.6.  clang (afaict) docs
don't reference it directly, but the clang 4 docs claim that it can be
as pedantic as gcc w.r.t. attribute use.  That's more than new enough
for upstream, which requires gcc 4.9 or clang 10.

The /other/ problem is that gcc gets fussy about defining variables
inside the for loop parentheses, which means that any code using it has
to compile with -std=gnu99, which is /not/ the usual c89 that the kernel
uses.  OTOH, it's been 22 years since C99 was ratified, c'mon...

--D



Re: [Cluster-devel] [PATCH 16/27] iomap: switch iomap_bmap to use iomap_iter

2021-07-27 Thread Darrick J. Wong
On Tue, Jul 27, 2021 at 08:31:38AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 26, 2021 at 09:39:22AM -0700, Darrick J. Wong wrote:
> > The documentation needs to be much more explicit about the fact that you
> > cannot "break;" your way out of an iomap_iter loop.  I think the comment
> > should be rewritten along these lines:
> > 
> > "Iterate over filesystem-provided space mappings for the provided file
> > range.  This function handles cleanup of resources acquired for
> > iteration when the filesystem indicates there are no more space
> > mappings, which means that this function must be called in a loop that
continues as long as it returns a positive value.  If 0 or a negative value
> > is returned, the caller must not return to the loop body.  Within a loop
> > body, there are two ways to break out of the loop body: leave
> > @iter.processed unchanged, or set it to the usual negative errno."
> > 
> > Hm.
> 
> Yes, I'll update the documentation.

Ok, thanks!

> > Clunky, for sure, but at least we still get to use break as the language
> > designers intended.
> 
> I can't see any advantage there over just proper documentation.  If you
> are totally attached to a working break we might have to come up with
> a nasty for_each macro that ensures we have a final iomap_apply, but I
> doubt it is worth the effort.

I was pushing the explicit _break() function as a means to avoid an even
fuglier loop macro.

--D



Re: [Cluster-devel] [PATCH 17/27] iomap: switch iomap_seek_hole to use iomap_iter

2021-07-26 Thread Darrick J. Wong
On Mon, Jul 26, 2021 at 10:22:36AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 10:22:47AM -0700, Darrick J. Wong wrote:
> > > -static loff_t
> > > -iomap_seek_hole_actor(struct inode *inode, loff_t start, loff_t length,
> > > -   void *data, struct iomap *iomap, struct iomap *srcmap)
> > > +static loff_t iomap_seek_hole_iter(const struct iomap_iter *iter, loff_t 
> > > *pos)
> > 
> > /me wonders if @pos should be named hole_pos (here and in the caller) to
> > make it a little easier to read...
> 
> Sure.
> 
> > ...because what we're really saying here is that if seek_hole_iter found
> > a hole (and returned zero, thereby terminating the loop before iter.len
> > could reach zero), we want to return the position of the hole.
> 
> Yes.
> 
> > > + return size;
> > 
> > Not sure why we return size here...?  Oh, because there's an implicit
> > hole at EOF, so we return i_size.  Uh, does this do the right thing if
> > ->iomap_begin returns posteof mappings?  I don't see anything in
> > iomap_iter_advance that would stop iteration at EOF.
> 
> Nothing in ->iomap_begin checks that, iomap_seek_hole initializes
> iter.len so that it stops at EOF.

Oh, right.  Sorry, I forgot that. :(

--D



Re: [Cluster-devel] [PATCH 16/27] iomap: switch iomap_bmap to use iomap_iter

2021-07-26 Thread Darrick J. Wong
On Mon, Jul 26, 2021 at 10:19:42AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 10:05:45AM -0700, Darrick J. Wong wrote:
> > >   bno = 0;
> > > - ret = iomap_apply(inode, pos, blocksize, 0, ops, ,
> > > -   iomap_bmap_actor);
> > > + while ((ret = iomap_iter(&iter, ops)) > 0) {
> > > + if (iter.iomap.type != IOMAP_MAPPED)
> > > + continue;
> > 
> > There isn't a mapped extent, so return 0 here, right?
> 
> We can't just return 0, we always need the final iomap_iter() call
> to clean up in case a ->iomap_end method is supplied.  Now for bmap
> having and needing one is rather theoretical, but people will copy
> and paste that once we start breaking the rules.

Oh, right, I forgot that someone might want to ->iomap_end.  The
"continue" works because we only asked for one block, therefore we know
that we'll never get to the loop body a second time; and we ignore
iter.processed, which also means we never revisit the loop body.

This "continue without setting iter.processed to break out of loop"
pattern is a rather indirect subtlety, since C programmers are taught
that they can break out of a loop using break;.  This new iomap_iter
pattern fubars that longstanding language feature, and the language
around it is soft:

> /**
>  * iomap_iter - iterate over ranges in a file
>  * @iter: iteration structure
>  * @ops: iomap ops provided by the file system
>  *
>  * Iterate over file system provided contiguous ranges of blocks with the same
>  * state.  Should be called in a loop that continues as long as this function
>  * returns a positive value.  If 0 or a negative value is returned the caller
>  * should break out of the loop - a negative value is an error either from the
>  * file system or from the last iteration stored in @iter.copied.
>  */

The documentation needs to be much more explicit about the fact that you
cannot "break;" your way out of an iomap_iter loop.  I think the comment
should be rewritten along these lines:

"Iterate over filesystem-provided space mappings for the provided file
range.  This function handles cleanup of resources acquired for
iteration when the filesystem indicates there are no more space
mappings, which means that this function must be called in a loop that
continues as long as it returns a positive value.  If 0 or a negative value
is returned, the caller must not return to the loop body.  Within a loop
body, there are two ways to break out of the loop body: leave
@iter.processed unchanged, or set it to the usual negative errno."

Hm.

What if we provide an explicit loop break function?  That would be clear
overkill for bmap, but somebody else wanting to break out of a more
complex loop body ought to be able to say "break" to do that, not
"continue with subtleties".

static inline int
iomap_iter_break(struct iomap_iter *iter, int ret)
{
int ret2;

if (!iter->iomap.length || !ops->iomap_end)
return ret;

ret2 = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
0, iter->flags, &iter->iomap);
return ret ? ret : ret2;
}

And then then theoretical loop body becomes:

while ((ret = iomap_iter(&iter, ops)) > 0) {
if (iter.iomap.type != WHAT_I_WANT) {
ret = iomap_iter_break(&iter, 0);
break;
}



ret = vfs_do_some_risky_thing(...);
if (ret) {
ret = iomap_iter_break(&iter, ret);
break;
}



iter.processed = iter.iomap.length;
}
return ret;

Clunky, for sure, but at least we still get to use break as the language
designers intended.

--D



Re: [Cluster-devel] RFC: switch iomap to an iterator model

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:53PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series replaces the existing callback-based iomap_apply with an iter based
> model.  The prime aim here is to simplify the DAX reflink support, which
> requires iterating through two inodes, something that is rather painful
> with the apply model.  It also helps to kill an indirect call per segment
> as-is.  Compared to the earlier patchset from Matthew Wilcox that this
> series is based upon it does not eliminate all indirect calls, but as the
> upside it does not change the file systems at all (except for the btrfs
> and gfs2 hooks which have slight prototype changes).

FWIW patches 9-20 look ok to me, modulo the discussion I started in
patch 8 about defining a distinct type for iomap byte lengths instead of
the combination of int/ssize_t/u64 that we use now.

> This passes basic testing on XFS for block based file systems.  The DAX
> changes are entirely untested as I haven't managed to get pmem work in
> recent qemu.

This gets increasingly difficult as time goes by.

Right now I have the following bits of libvirt xml in the vm
definitions:

  1073741824
  

  
/run/g.mem
  
  
10487808
0
  
  

  

Which seems to translate to:

-machine pc-q35-4.2,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm=on
-object 
memory-backend-file,id=memnvdimm0,prealloc=no,mem-path=/run/g.mem,share=yes,size=10739515392,align=128M
-device nvdimm,memdev=memnvdimm0,id=nvdimm0,slot=0,label-size=2M

Evidently something was added to the pmem code(?) that makes it fussy if
the memory region doesn't align to a 128M boundary or the label isn't
big enough for ... whatever gets written into them.

The file /run/g.mem is intended to provide 10GB of pmem to the VM, with
an additional 2M allocated for the label.

--D

> Diffstat:
>  b/fs/btrfs/inode.c   |5 
>  b/fs/buffer.c|4 
>  b/fs/dax.c   |  578 
> ++-
>  b/fs/gfs2/bmap.c |5 
>  b/fs/internal.h  |4 
>  b/fs/iomap/Makefile  |2 
>  b/fs/iomap/buffered-io.c |  344 +--
>  b/fs/iomap/direct-io.c   |  162 ++---
>  b/fs/iomap/fiemap.c  |  101 +++-
>  b/fs/iomap/iter.c|   74 ++
>  b/fs/iomap/seek.c|   88 +++
>  b/fs/iomap/swapfile.c|   38 +--
>  b/fs/iomap/trace.h   |   35 +-
>  b/include/linux/iomap.h  |   73 -
>  fs/iomap/apply.c |   99 
>  15 files changed, 777 insertions(+), 835 deletions(-)



Re: [Cluster-devel] [PATCH 21/27] iomap: remove iomap_apply

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:14PM +0200, Christoph Hellwig wrote:
> iomap_apply is unused now, so remove it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/Makefile |  1 -
>  fs/iomap/apply.c  | 99 ---
>  fs/iomap/trace.h  | 40 -
>  include/linux/iomap.h | 10 -
>  4 files changed, 150 deletions(-)

mmm, negative LOC delta ;)
Reviewed-by: Darrick J. Wong 

--D

>  delete mode 100644 fs/iomap/apply.c
> 
> diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
> index 85034deb5a2f19..ebd9866d80ae90 100644
> --- a/fs/iomap/Makefile
> +++ b/fs/iomap/Makefile
> @@ -9,7 +9,6 @@ ccflags-y += -I $(srctree)/$(src) # needed for 
> trace events
>  obj-$(CONFIG_FS_IOMAP)   += iomap.o
>  
>  iomap-y  += trace.o \
> -apply.o \
>  iter.o \
>  buffered-io.o \
>  direct-io.o \
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> deleted file mode 100644
> index 26ab6563181fc6..00
> --- a/fs/iomap/apply.c
> +++ /dev/null
> @@ -1,99 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -/*
> - * Copyright (C) 2010 Red Hat, Inc.
> - * Copyright (c) 2016-2018 Christoph Hellwig.
> - */
> -#include 
> -#include 
> -#include 
> -#include 
> -#include "trace.h"
> -
> -/*
> - * Execute a iomap write on a segment of the mapping that spans a
> - * contiguous range of pages that have identical block mapping state.
> - *
> - * This avoids the need to map pages individually, do individual allocations
> - * for each page and most importantly avoid the need for filesystem specific
> - * locking per page. Instead, all the operations are amortised over the 
> entire
> - * range of pages. It is assumed that the filesystems will lock whatever
> - * resources they require in the iomap_begin call, and release them in the
> - * iomap_end call.
> - */
> -loff_t
> -iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> - const struct iomap_ops *ops, void *data, iomap_actor_t actor)
> -{
> - struct iomap iomap = { .type = IOMAP_HOLE };
> - struct iomap srcmap = { .type = IOMAP_HOLE };
> - loff_t written = 0, ret;
> - u64 end;
> -
> - trace_iomap_apply(inode, pos, length, flags, ops, actor, _RET_IP_);
> -
> - /*
> -  * Need to map a range from start position for length bytes. This can
> -  * span multiple pages - it is only guaranteed to return a range of a
> -  * single type of pages (e.g. all into a hole, all mapped or all
> -  * unwritten). Failure at this point has nothing to undo.
> -  *
> -  * If allocation is required for this range, reserve the space now so
> -  * that the allocation is guaranteed to succeed later on. Once we copy
> -  * the data into the page cache pages, then we cannot fail otherwise we
> -  * expose transient stale data. If the reserve fails, we can safely
> -  * back out at this point as there is nothing to undo.
> -  */
> - ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
> - if (ret)
> - return ret;
> - if (WARN_ON(iomap.offset > pos)) {
> - written = -EIO;
> - goto out;
> - }
> - if (WARN_ON(iomap.length == 0)) {
> - written = -EIO;
> - goto out;
> - }
> -
> - trace_iomap_apply_dstmap(inode, &iomap);
> - if (srcmap.type != IOMAP_HOLE)
> - trace_iomap_apply_srcmap(inode, &srcmap);
> -
> - /*
> -  * Cut down the length to the one actually provided by the filesystem,
> -  * as it might not be able to give us the whole size that we requested.
> -  */
> - end = iomap.offset + iomap.length;
> - if (srcmap.type != IOMAP_HOLE)
> - end = min(end, srcmap.offset + srcmap.length);
> - if (pos + length > end)
> - length = end - pos;
> -
> - /*
> -  * Now that we have guaranteed that the space allocation will succeed,
> -  * we can do the copy-in page by page without having to worry about
> -  * failures exposing transient data.
> -  *
> -  * To support COW operations, we read in data for partially blocks from
> -  * the srcmap if the file system filled it in.  In that case we the
> -  * length needs to be limited to the earlier of the ends of the iomaps.
> -  * If the file system did not provide a srcmap we pass in the normal
> -  * iomap into the actors so that they don't need to have special
> -  * handling for the two cases

Re: [Cluster-devel] [PATCH 22/27] iomap: pass an iomap_iter to various buffered I/O helpers

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:15PM +0200, Christoph Hellwig wrote:
> Pass the iomap_iter structure instead of individual parameters to
> various internal helpers for buffered I/O.
> 
> Signed-off-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 117 -
>  1 file changed, 56 insertions(+), 61 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c273b5d88dd8a8..daabbe8d7edfb5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -226,12 +226,14 @@ iomap_read_inline_data(struct inode *inode, struct page 
> *page,
>   SetPageUptodate(page);
>  }
>  
> -static inline bool iomap_block_needs_zeroing(struct inode *inode,
> - struct iomap *iomap, loff_t pos)
> +static inline bool iomap_block_needs_zeroing(struct iomap_iter *iter,
> + loff_t pos)
>  {
> - return iomap->type != IOMAP_MAPPED ||
> - (iomap->flags & IOMAP_F_NEW) ||
> - pos >= i_size_read(inode);
> + struct iomap *srcmap = iomap_iter_srcmap(iter);
> +
> + return srcmap->type != IOMAP_MAPPED ||
> + (srcmap->flags & IOMAP_F_NEW) ||
> + pos >= i_size_read(iter->inode);
>  }
>  
>  static loff_t iomap_readpage_iter(struct iomap_iter *iter,
> @@ -259,7 +261,7 @@ static loff_t iomap_readpage_iter(struct iomap_iter *iter,
>   if (plen == 0)
>   goto done;
>  
> - if (iomap_block_needs_zeroing(iter->inode, iomap, pos)) {
> + if (iomap_block_needs_zeroing(iter, pos)) {
>   zero_user(page, poff, plen);
>   iomap_set_range_uptodate(page, poff, plen);
>   goto done;
> @@ -541,12 +543,12 @@ iomap_read_page_sync(loff_t block_start, struct page 
> *page, unsigned poff,
>   return submit_bio_wait(&bio);
>  }
>  
> -static int
> -__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> - struct page *page, struct iomap *srcmap)
> +static int __iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> + unsigned len, int flags, struct page *page)
>  {
> - struct iomap_page *iop = iomap_page_create(inode, page);
> - loff_t block_size = i_blocksize(inode);
> + struct iomap *srcmap = iomap_iter_srcmap(iter);
> + struct iomap_page *iop = iomap_page_create(iter->inode, page);
> + loff_t block_size = i_blocksize(iter->inode);
>   loff_t block_start = round_down(pos, block_size);
>   loff_t block_end = round_up(pos + len, block_size);
>   unsigned from = offset_in_page(pos), to = from + len, poff, plen;
> @@ -556,7 +558,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, int flags,
>   ClearPageError(page);
>  
>   do {
> - iomap_adjust_read_range(inode, iop, &block_start,
> + iomap_adjust_read_range(iter->inode, iop, &block_start,
>   block_end - block_start, &poff, &plen);
>   if (plen == 0)
>   break;
> @@ -566,7 +568,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, int flags,
>   (to <= poff || to >= poff + plen))
>   continue;
>  
> - if (iomap_block_needs_zeroing(inode, srcmap, block_start)) {
> + if (iomap_block_needs_zeroing(iter, block_start)) {
>   if (WARN_ON_ONCE(flags & IOMAP_WRITE_F_UNSHARE))
>   return -EIO;
>   zero_user_segments(page, poff, from, to, poff + plen);
> @@ -582,41 +584,40 @@ __iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, int flags,
>   return 0;
>  }
>  
> -static int
> -iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
> - struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
> +static int iomap_write_begin(struct iomap_iter *iter, loff_t pos, unsigned 
> len,
> + unsigned flags, struct page **pagep)
>  {
> - const struct iomap_page_ops *page_ops = iomap->page_ops;
> + const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> + struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct page *page;
>   int status = 0;
>  
> - BUG_ON(pos + len > iomap->offset + iomap->length);
> - if (srcmap != iomap)
> + BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> + if (srcmap != &iter->iomap)
>   BUG_ON(pos + len > srcmap->offset + srcmap->length);
>  
>   if (fatal

Re: [Cluster-devel] [PATCH 07/27] iomap: mark the iomap argument to iomap_read_page_sync const

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:00PM +0200, Christoph Hellwig wrote:
> iomap_read_page_sync never modifies the passed in iomap, so mark
> it const.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e47380259cf7e1..8c26cf7cbd72b0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -535,7 +535,7 @@ iomap_write_failed(struct inode *inode, loff_t pos, 
> unsigned len)
>  
>  static int
>  iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
> - unsigned plen, struct iomap *iomap)
> + unsigned plen, const struct iomap *iomap)
>  {
>   struct bio_vec bvec;
>   struct bio bio;
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 23/27] iomap: rework unshare flag

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:16PM +0200, Christoph Hellwig wrote:
> Instead of another internal flags namespace inside of buffered-io.c,
> just pass a UNSHARE hint in the main iomap flags field.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/buffered-io.c | 23 +--
>  include/linux/iomap.h  |  1 +
>  2 files changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index daabbe8d7edfb5..eb5d742b5bf8b7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -511,10 +511,6 @@ iomap_migrate_page(struct address_space *mapping, struct 
> page *newpage,
>  EXPORT_SYMBOL_GPL(iomap_migrate_page);
>  #endif /* CONFIG_MIGRATION */
>  
> -enum {
> - IOMAP_WRITE_F_UNSHARE   = (1 << 0),
> -};

Oh good, this finally dies.
Reviewed-by: Darrick J. Wong 

--D

> -
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> @@ -544,7 +540,7 @@ iomap_read_page_sync(loff_t block_start, struct page 
> *page, unsigned poff,
>  }
>  
>  static int __iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> - unsigned len, int flags, struct page *page)
> + unsigned len, struct page *page)
>  {
>   struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct iomap_page *iop = iomap_page_create(iter->inode, page);
> @@ -563,13 +559,13 @@ static int __iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   if (plen == 0)
>   break;
>  
> - if (!(flags & IOMAP_WRITE_F_UNSHARE) &&
> + if (!(iter->flags & IOMAP_UNSHARE) &&
>   (from <= poff || from >= poff + plen) &&
>   (to <= poff || to >= poff + plen))
>   continue;
>  
>   if (iomap_block_needs_zeroing(iter, block_start)) {
> - if (WARN_ON_ONCE(flags & IOMAP_WRITE_F_UNSHARE))
> + if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
>   return -EIO;
>   zero_user_segments(page, poff, from, to, poff + plen);
>   } else {
> @@ -585,7 +581,7 @@ static int __iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>  }
>  
>  static int iomap_write_begin(struct iomap_iter *iter, loff_t pos, unsigned 
> len,
> - unsigned flags, struct page **pagep)
> + struct page **pagep)
>  {
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>   struct iomap *srcmap = iomap_iter_srcmap(iter);
> @@ -617,7 +613,7 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos, unsigned len,
>   else if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD)
>   status = __block_write_begin_int(page, pos, len, NULL, srcmap);
>   else
> - status = __iomap_write_begin(iter, pos, len, flags, page);
> + status = __iomap_write_begin(iter, pos, len, page);
>  
>   if (unlikely(status))
>   goto out_unlock;
> @@ -748,7 +744,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, 
> struct iov_iter *i)
>   break;
>   }
>  
> - status = iomap_write_begin(iter, pos, bytes, 0, &page);
> + status = iomap_write_begin(iter, pos, bytes, &page);
>   if (unlikely(status))
>   break;
>  
> @@ -825,8 +821,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>   unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>   struct page *page;
>  
> - status = iomap_write_begin(iter, pos, bytes,
> - IOMAP_WRITE_F_UNSHARE, &page);
> + status = iomap_write_begin(iter, pos, bytes, &page);
>   if (unlikely(status))
>   return status;
>  
> @@ -854,7 +849,7 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
> loff_t len,
>   .inode  = inode,
>   .pos= pos,
>   .len= len,
> - .flags  = IOMAP_WRITE,
> + .flags  = IOMAP_WRITE | IOMAP_UNSHARE,
>   };
>   int ret;
>  
> @@ -871,7 +866,7 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, 
> loff_t pos, u64 length)
>   unsigned offset = offset_in_page(pos);
>   unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
>  
> - status = iomap_write_begin(iter, pos, bytes, 0, &page);
> + status = iomap_write_begin(iter, pos, bytes, &page);
>   if (status)
>   r

Re: [Cluster-devel] [PATCH 27/27] iomap: constify iomap_iter_srcmap

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:20PM +0200, Christoph Hellwig wrote:
> The srcmap returned from iomap_iter_srcmap is never modified, so mark
> the iomap returned from it const and constify a lot of code that never
> modifies the iomap.
> 
> Signed-off-by: Christoph Hellwig 

LGTM!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 32 
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index eb5d742b5bf8b7..a2dd42f3115cfa 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -226,20 +226,20 @@ iomap_read_inline_data(struct inode *inode, struct page 
> *page,
>   SetPageUptodate(page);
>  }
>  
> -static inline bool iomap_block_needs_zeroing(struct iomap_iter *iter,
> +static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
>   loff_t pos)
>  {
> - struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  
>   return srcmap->type != IOMAP_MAPPED ||
>   (srcmap->flags & IOMAP_F_NEW) ||
>   pos >= i_size_read(iter->inode);
>  }
>  
> -static loff_t iomap_readpage_iter(struct iomap_iter *iter,
> +static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>   struct iomap_readpage_ctx *ctx, loff_t offset)
>  {
> - struct iomap *iomap = &iter->iomap;
> + const struct iomap *iomap = &iter->iomap;
>   loff_t pos = iter->pos + offset;
>   loff_t length = iomap_length(iter) - offset;
>   struct page *page = ctx->cur_page;
> @@ -355,7 +355,7 @@ iomap_readpage(struct page *page, const struct iomap_ops 
> *ops)
>  }
>  EXPORT_SYMBOL_GPL(iomap_readpage);
>  
> -static loff_t iomap_readahead_iter(struct iomap_iter *iter,
> +static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
>   struct iomap_readpage_ctx *ctx)
>  {
>   loff_t length = iomap_length(iter);
> @@ -539,10 +539,10 @@ iomap_read_page_sync(loff_t block_start, struct page 
> *page, unsigned poff,
>   return submit_bio_wait(&bio);
>  }
>  
> -static int __iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> +static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>   unsigned len, struct page *page)
>  {
> - struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct iomap_page *iop = iomap_page_create(iter->inode, page);
>   loff_t block_size = i_blocksize(iter->inode);
>   loff_t block_start = round_down(pos, block_size);
> @@ -580,11 +580,11 @@ static int __iomap_write_begin(struct iomap_iter *iter, 
> loff_t pos,
>   return 0;
>  }
>  
> -static int iomap_write_begin(struct iomap_iter *iter, loff_t pos, unsigned 
> len,
> - struct page **pagep)
> +static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> + unsigned len, struct page **pagep)
>  {
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> - struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   struct page *page;
>   int status = 0;
>  
> @@ -655,10 +655,10 @@ static size_t __iomap_write_end(struct inode *inode, 
> loff_t pos, size_t len,
>   return copied;
>  }
>  
> -static size_t iomap_write_end_inline(struct iomap_iter *iter, struct page 
> *page,
> - loff_t pos, size_t copied)
> +static size_t iomap_write_end_inline(const struct iomap_iter *iter,
> + struct page *page, loff_t pos, size_t copied)
>  {
> - struct iomap *iomap = &iter->iomap;
> + const struct iomap *iomap = &iter->iomap;
>   void *addr;
>  
>   WARN_ON_ONCE(!PageUptodate(page));
> @@ -678,7 +678,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, 
> loff_t pos, size_t len,
>   size_t copied, struct page *page)
>  {
>   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> - struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   loff_t old_size = iter->inode->i_size;
>   size_t ret;
>  
> @@ -803,7 +803,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  {
>   struct iomap *iomap = &iter->iomap;
> - struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   loff_t pos = iter->pos;
>

Re: [Cluster-devel] [PATCH 06/27] iomap: mark the iomap argument to iomap_read_inline_data const

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:59PM +0200, Christoph Hellwig wrote:
> iomap_read_inline_data never modifies the passed in iomap, so mark
> it const.
> 
> Signed-off-by: Christoph Hellwig 

Looks good,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 75310f6fcf8401..e47380259cf7e1 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -207,7 +207,7 @@ struct iomap_readpage_ctx {
>  
>  static void
>  iomap_read_inline_data(struct inode *inode, struct page *page,
> - struct iomap *iomap)
> + const struct iomap *iomap)
>  {
>   size_t size = i_size_read(inode);
>   void *addr;
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 05/27] fsdax: mark the iomap argument to dax_iomap_sector as const

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:58PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index da41f9363568e0..4d63040fd71f56 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1005,7 +1005,7 @@ int dax_writeback_mapping_range(struct address_space 
> *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> -static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
>  {
>   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
>  }
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 26/27] fsdax: switch the fault handlers to use iomap_iter

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:19PM +0200, Christoph Hellwig wrote:
> Avoid the open coded calls to ->iomap_begin and ->iomap_end and call
> iomap_iter instead.
> 
> Signed-off-by: Christoph Hellwig 

Finally this nightmare is over...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 193 +--
>  1 file changed, 75 insertions(+), 118 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 6d0c6d28be83b1..118c9e2923f5f8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1010,7 +1010,7 @@ static sector_t dax_iomap_sector(const struct iomap 
> *iomap, loff_t pos)
>   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
>  }
>  
> -static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> +static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>pfn_t *pfnp)
>  {
>   const sector_t sector = dax_iomap_sector(iomap, pos);
> @@ -1068,7 +1068,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  
>  #ifdef CONFIG_FS_DAX_PMD
>  static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> - struct iomap *iomap, void **entry)
> + const struct iomap *iomap, void **entry)
>  {
>   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   unsigned long pmd_addr = vmf->address & PMD_MASK;
> @@ -1120,7 +1120,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #else
>  static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> - struct iomap *iomap, void **entry)
> + const struct iomap *iomap, void **entry)
>  {
>   return VM_FAULT_FALLBACK;
>  }
> @@ -1309,7 +1309,7 @@ static vm_fault_t dax_fault_return(int error)
>   * flushed on write-faults (non-cow), but not read-faults.
>   */
>  static bool dax_fault_is_synchronous(unsigned long flags,
> - struct vm_area_struct *vma, struct iomap *iomap)
> + struct vm_area_struct *vma, const struct iomap *iomap)
>  {
>   return (flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC)
>   && (iomap->flags & IOMAP_F_DIRTY);
> @@ -1329,22 +1329,22 @@ static vm_fault_t dax_fault_synchronous_pfnp(pfn_t 
> *pfnp, pfn_t pfn)
>   return VM_FAULT_NEEDDSYNC;
>  }
>  
> -static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf, struct iomap 
> *iomap,
> - loff_t pos)
> +static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf,
> + const struct iomap_iter *iter)
>  {
> - sector_t sector = dax_iomap_sector(iomap, pos);
> + sector_t sector = dax_iomap_sector(&iter->iomap, iter->pos);
>   unsigned long vaddr = vmf->address;
>   vm_fault_t ret;
>   int error = 0;
>  
> - switch (iomap->type) {
> + switch (iter->iomap.type) {
>   case IOMAP_HOLE:
>   case IOMAP_UNWRITTEN:
>   clear_user_highpage(vmf->cow_page, vaddr);
>   break;
>   case IOMAP_MAPPED:
> - error = copy_cow_page_dax(iomap->bdev, iomap->dax_dev, sector,
> -   vmf->cow_page, vaddr);
> + error = copy_cow_page_dax(iter->iomap.bdev, iter->iomap.dax_dev,
> +   sector, vmf->cow_page, vaddr);
>   break;
>   default:
>   WARN_ON_ONCE(1);
> @@ -1363,29 +1363,31 @@ static vm_fault_t dax_fault_cow_page(struct vm_fault 
> *vmf, struct iomap *iomap,
>  }
>  
>  /**
> - * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
> + * dax_fault_iter - Common actor to handle pfn insertion in PTE/PMD fault.
>   * @vmf: vm fault instance
> + * @iter: iomap iter
> + * @pfnp: pfn to be returned
>   * @xas: the dax mapping tree of a file
>   * @entry:   an unlocked dax entry to be inserted
>   * @pmd: distinguish whether it is a pmd fault
> - * @flags:   iomap flags
> - * @iomap:   from iomap_begin()
> - * @srcmap:  from iomap_begin(), not equal to iomap if it is a CoW
>   */
> -static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
> - struct xa_state *xas, void **entry, bool pmd,
> - unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> + const struct iomap_iter *iter, pfn_t *pfnp,
> + struct xa_state *xas, void **entry, bool pmd)
>  {
>   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
&g

Re: [Cluster-devel] [PATCH 04/27] fs: mark the iomap argument to __block_write_begin_int const

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:57PM +0200, Christoph Hellwig wrote:
> __block_write_begin_int never modifies the passed in iomap, so mark it
> const.
> 
> Signed-off-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/buffer.c   | 4 ++--
>  fs/internal.h | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 6290c3afdba488..bd6a9e9fbd64c9 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1912,7 +1912,7 @@ EXPORT_SYMBOL(page_zero_new_buffers);
>  
>  static void
>  iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
> - struct iomap *iomap)
> + const struct iomap *iomap)
>  {
>   loff_t offset = block << inode->i_blkbits;
>  
> @@ -1966,7 +1966,7 @@ iomap_to_bh(struct inode *inode, sector_t block, struct 
> buffer_head *bh,
>  }
>  
>  int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
> - get_block_t *get_block, struct iomap *iomap)
> + get_block_t *get_block, const struct iomap *iomap)
>  {
>   unsigned from = pos & (PAGE_SIZE - 1);
>   unsigned to = from + len;
> diff --git a/fs/internal.h b/fs/internal.h
> index 3ce8edbaa3ca2f..9ad6b5157584b8 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -48,8 +48,8 @@ static inline int emergency_thaw_bdev(struct super_block 
> *sb)
>  /*
>   * buffer.c
>   */
> -extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned 
> len,
> - get_block_t *get_block, struct iomap *iomap);
> +int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
> + get_block_t *get_block, const struct iomap *iomap);
>  
>  /*
>   * char_dev.c
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 17/27] iomap: switch iomap_seek_hole to use iomap_iter

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:10PM +0200, Christoph Hellwig wrote:
> Rewrite iomap_seek_hole to use iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/seek.c | 46 +++---
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index ce6fb810854fec..7d6ed9af925e96 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
>   * Copyright (C) 2017 Red Hat, Inc.
> - * Copyright (c) 2018 Christoph Hellwig.
> + * Copyright (c) 2018-2021 Christoph Hellwig.
>   */
>  #include 
>  #include 
> @@ -10,21 +10,19 @@
>  #include 
>  #include 
>  
> -static loff_t
> -iomap_seek_hole_actor(struct inode *inode, loff_t start, loff_t length,
> -   void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_seek_hole_iter(const struct iomap_iter *iter, loff_t 
> *pos)

/me wonders if @pos should be named hole_pos (here and in the caller) to
make it a little easier to read...

>  {
> - loff_t offset = start;
> + loff_t length = iomap_length(iter);
>  
> - switch (iomap->type) {
> + switch (iter->iomap.type) {
>   case IOMAP_UNWRITTEN:
> - offset = mapping_seek_hole_data(inode->i_mapping, start,
> - start + length, SEEK_HOLE);
> - if (offset == start + length)
> + *pos = mapping_seek_hole_data(iter->inode->i_mapping,
> + iter->pos, iter->pos + length, SEEK_HOLE);
> + if (*pos == iter->pos + length)
>   return length;
> - fallthrough;
> + return 0;
>   case IOMAP_HOLE:
> - *(loff_t *)data = offset;
> + *pos = iter->pos;
>   return 0;
>   default:
>   return length;
> @@ -35,23 +33,25 @@ loff_t
>  iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops 
> *ops)
>  {
>   loff_t size = i_size_read(inode);
> - loff_t ret;
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= offset,
> + .flags  = IOMAP_REPORT,
> + };
> + int ret;
>  
>   /* Nothing to be found before or beyond the end of the file. */
>   if (offset < 0 || offset >= size)
>   return -ENXIO;
>  
> - while (offset < size) {
> - ret = iomap_apply(inode, offset, size - offset, IOMAP_REPORT,
> -   ops, &offset, iomap_seek_hole_actor);
> - if (ret < 0)
> - return ret;
> - if (ret == 0)
> - break;
> - offset += ret;
> - }
> -
> - return offset;
> + iter.len = size - offset;
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = iomap_seek_hole_iter(&iter, &offset);
> + if (ret < 0)
> + return ret;
> + if (iter.len)
> + return offset;

...because what we're really saying here is that if seek_hole_iter found
a hole (and returned zero, thereby terminating the loop before iter.len
could reach zero), we want to return the position of the hole.

> + return size;

Not sure why we return size here...?  Oh, because there's an implicit
hole at EOF, so we return i_size.  Uh, does this do the right thing if
->iomap_begin returns posteof mappings?  I don't see anything in
iomap_iter_advance that would stop iteration at EOF.

--D

>  }
>  EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 16/27] iomap: switch iomap_bmap to use iomap_iter

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:09PM +0200, Christoph Hellwig wrote:
> Rewrite the ->bmap implementation based on iomap_iter.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/fiemap.c | 31 +--
>  1 file changed, 13 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
> index acad09a8c188df..60daadba16c149 100644
> --- a/fs/iomap/fiemap.c
> +++ b/fs/iomap/fiemap.c
> @@ -92,35 +92,30 @@ int iomap_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fi,
>  }
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
> -static loff_t
> -iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> -{
> - sector_t *bno = data, addr;
> -
> - if (iomap->type == IOMAP_MAPPED) {
> - addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
> - *bno = addr;
> - }
> - return 0;
> -}
> -
>  /* legacy ->bmap interface.  0 is the error return (!) */
>  sector_t
>  iomap_bmap(struct address_space *mapping, sector_t bno,
>   const struct iomap_ops *ops)
>  {
> - struct inode *inode = mapping->host;
> - loff_t pos = bno << inode->i_blkbits;
> - unsigned blocksize = i_blocksize(inode);
> + struct iomap_iter iter = {
> + .inode  = mapping->host,
> + .pos= (loff_t)bno << mapping->host->i_blkbits,
> + .len= i_blocksize(mapping->host),
> + .flags  = IOMAP_REPORT,
> + };
>   int ret;
>  
>   if (filemap_write_and_wait(mapping))
>   return 0;
>  
>   bno = 0;
> - ret = iomap_apply(inode, pos, blocksize, 0, ops, &bno,
> -   iomap_bmap_actor);
> + while ((ret = iomap_iter(&iter, ops)) > 0) {
> + if (iter.iomap.type != IOMAP_MAPPED)
> + continue;

There isn't a mapped extent, so return 0 here, right?

--D

> + bno = (iter.pos - iter.iomap.offset + iter.iomap.addr) >>
> + mapping->host->i_blkbits;
> + }
> +
>   if (ret)
>   return 0;
>   return bno;
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 08/27] iomap: add the new iomap_iter model

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:35:01PM +0200, Christoph Hellwig wrote:
> The iomap_iter struct provides a convenient way to package up and
> maintain all the arguments to the various mapping and operation
> functions.  It is operated on using the iomap_iter() function that
> is called in a loop until the whole range has been processed.  Compared
> to the existing iomap_apply() function this avoids an indirect call
> for each iteration.
> 
> For now iomap_iter() calls back into the existing ->iomap_begin and
> ->iomap_end methods, but in the future this could be further optimized
> to avoid indirect calls entirely.
> 
> Based on an earlier patch from Matthew Wilcox .
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap/Makefile |  1 +
>  fs/iomap/iter.c   | 74 +++
>  fs/iomap/trace.h  | 37 +-
>  include/linux/iomap.h | 56 
>  4 files changed, 167 insertions(+), 1 deletion(-)
>  create mode 100644 fs/iomap/iter.c
> 
> diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
> index eef2722d93a183..85034deb5a2f19 100644
> --- a/fs/iomap/Makefile
> +++ b/fs/iomap/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP)  += iomap.o
>  
>  iomap-y  += trace.o \
>  apply.o \
> +iter.o \
>  buffered-io.o \
>  direct-io.o \
>  fiemap.o \
> diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> new file mode 100644
> index 00..b21e2489700b7c
> --- /dev/null
> +++ b/fs/iomap/iter.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021 Christoph Hellwig.
> + */
> +#include 
> +#include 
> +#include "trace.h"
> +
> +static inline int iomap_iter_advance(struct iomap_iter *iter)
> +{
> + /* handle the previous iteration (if any) */
> + if (iter->iomap.length) {
> + if (iter->processed <= 0)
> + return iter->processed;

Hmm, converting ssize_t to int here... I suppose that's fine since we're
merely returning "the usual negative errno code", but read on.

> + WARN_ON_ONCE(iter->processed > iomap_length(iter));
> + iter->pos += iter->processed;
> + iter->len -= iter->processed;
> + if (!iter->len)
> + return 0;
> + }
> +
> + /* clear the state for the next iteration */
> + iter->processed = 0;
> + memset(&iter->iomap, 0, sizeof(iter->iomap));
> + memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> + return 1;
> +}
> +
> +static inline void iomap_iter_done(struct iomap_iter *iter)
> +{
> + WARN_ON_ONCE(iter->iomap.offset > iter->pos);
> + WARN_ON_ONCE(iter->iomap.length == 0);
> + WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
> +
> + trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
> + if (iter->srcmap.type != IOMAP_HOLE)
> + trace_iomap_iter_srcmap(iter->inode, &iter->srcmap);
> +}
> +
> +/**
> + * iomap_iter - iterate over ranges in a file
> + * @iter: iteration structure
> + * @ops: iomap ops provided by the file system
> + *
> + * Iterate over file system provided contiguous ranges of blocks with the 
> same
> + * state.  Should be called in a loop that continues as long as this function
> + * returns a positive value.  If 0 or a negative value is returned the caller
> + * should break out of the loop - a negative value is an error either from 
> the
> + * file system or from the last iteration stored in @iter.copied.
> + */
> +int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
> +{
> + int ret;
> +
> + if (iter->iomap.length && ops->iomap_end) {
> + ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
> + iter->processed > 0 ? iter->processed : 0,
> + iter->flags, &iter->iomap);
> + if (ret < 0 && !iter->processed)
> + return ret;
> + }
> +
> + trace_iomap_iter(iter, ops, _RET_IP_);
> + ret = iomap_iter_advance(iter);
> + if (ret <= 0)
> + return ret;
> +
> + ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
> +&iter->iomap, &iter->srcmap);
> + if (ret < 0)
> + return ret;
> + iomap_iter_done(iter);
> + return 1;
> +}



> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f9c36df6a3061b..a9f3f736017989 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -143,6 +143,62 @@ struct iomap_ops {
>   ssize_t written, unsigned flags, struct iomap *iomap);
>  };
>  
> +/**
> + * struct iomap_iter - Iterate through a range of a file
> + * @inode: Set at the start of the iteration and should not change.
> + * @pos: The current file position we are operating on.  It is 

Re: [Cluster-devel] [PATCH 03/27] iomap: mark the iomap argument to iomap_sector const

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:56PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

/me wonders, does this have any significant effect on the generated
code?

It's probably a good idea to feed the optimizer as much usage info as we
can, though I would imagine that for such a simple function it can
probably tell that we don't change *iomap.

IMHO, constifiying functions is a good way to signal to /programmers/
that they're not intended to touch the arguments, so

Reviewed-by: Darrick J. Wong 

--D

> ---
>  include/linux/iomap.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 093519d91cc9cc..f9c36df6a3061b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -91,8 +91,7 @@ struct iomap {
>   const struct iomap_page_ops *page_ops;
>  };
>  
> -static inline sector_t
> -iomap_sector(struct iomap *iomap, loff_t pos)
> +static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
>  {
>   return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 02/27] iomap: remove the iomap arguments to ->page_{prepare, done}

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:55PM +0200, Christoph Hellwig wrote:
> These aren't actually used by the only instance implementing the methods.
> 
> Signed-off-by: Christoph Hellwig 

/me finds it kind of amusing that we still don't have any ->page_prepare
use cases for actually passing the page in, but if nobody /else/ has any
objection or imminently wants to use the iomap argument, then...

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c | 5 ++---
>  fs/iomap/buffered-io.c | 6 +++---
>  include/linux/iomap.h  | 5 ++---
>  3 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index ed8b67b2171817..5414c2c3358092 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -1002,7 +1002,7 @@ static void gfs2_write_unlock(struct inode *inode)
>  }
>  
>  static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
> -unsigned len, struct iomap *iomap)
> +unsigned len)
>  {
>   unsigned int blockmask = i_blocksize(inode) - 1;
>   struct gfs2_sbd *sdp = GFS2_SB(inode);
> @@ -1013,8 +1013,7 @@ static int gfs2_iomap_page_prepare(struct inode *inode, 
> loff_t pos,
>  }
>  
>  static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
> -  unsigned copied, struct page *page,
> -  struct iomap *iomap)
> +  unsigned copied, struct page *page)
>  {
>   struct gfs2_trans *tr = current->journal_info;
>   struct gfs2_inode *ip = GFS2_I(inode);
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 87ccb3438becd9..75310f6fcf8401 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -605,7 +605,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>   return -EINTR;
>  
>   if (page_ops && page_ops->page_prepare) {
> - status = page_ops->page_prepare(inode, pos, len, iomap);
> + status = page_ops->page_prepare(inode, pos, len);
>   if (status)
>   return status;
>   }
> @@ -638,7 +638,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>  
>  out_no_page:
>   if (page_ops && page_ops->page_done)
> - page_ops->page_done(inode, pos, 0, NULL, iomap);
> + page_ops->page_done(inode, pos, 0, NULL);
>   return status;
>  }
>  
> @@ -714,7 +714,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t 
> pos, size_t len,
>   if (old_size < pos)
>   pagecache_isize_extended(inode, old_size, pos);
>   if (page_ops && page_ops->page_done)
> - page_ops->page_done(inode, pos, ret, page, iomap);
> + page_ops->page_done(inode, pos, ret, page);
>   put_page(page);
>  
>   if (ret < len)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 479c1da3e2211e..093519d91cc9cc 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -108,10 +108,9 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>   * associated page could not be obtained.
>   */
>  struct iomap_page_ops {
> - int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len,
> - struct iomap *iomap);
> + int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
>   void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
> - struct page *page, struct iomap *iomap);
> + struct page *page);
>  };
>  
>  /*
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH 01/27] iomap: fix a trivial comment typo in trace.h

2021-07-19 Thread Darrick J. Wong
On Mon, Jul 19, 2021 at 12:34:54PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index fdc7ae388476f5..e9cd5cc0d6ba40 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -2,7 +2,7 @@
>  /*
>   * Copyright (c) 2009-2019 Christoph Hellwig
>   *
> - * NOTE: none of these tracepoints shall be consider a stable kernel ABI
> + * NOTE: none of these tracepoints shall be considered a stable kernel ABI
>   * as they can change at any time.
>   */
>  #undef TRACE_SYSTEM
> -- 
> 2.30.2
> 



Re: [Cluster-devel] [PATCH v3 3/3] iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor

2021-07-08 Thread Darrick J. Wong
On Wed, Jul 07, 2021 at 01:55:24PM +0200, Andreas Gruenbacher wrote:
> Now that we create those objects in iomap_writepage_map when needed,
> there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
> 
> Signed-off-by: Andreas Gruenbacher 

I'd like to stage this series as a bugfix branch against -rc1 next week,
if there are no other objections?

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6330dabc451e..9f45050b61dd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -999,7 +999,6 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   block_commit_write(page, 0, length);
>   } else {
>   WARN_ON_ONCE(!PageUptodate(page));
> - iomap_page_create(inode, page);
>   set_page_dirty(page);
>   }
>  
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v3 1/3] iomap: Permit pages without an iop to enter writeback

2021-07-08 Thread Darrick J. Wong
On Wed, Jul 07, 2021 at 01:55:22PM +0200, Andreas Gruenbacher wrote:
> Create an iop in the writeback path if one doesn't exist.  This allows us
> to avoid creating the iop in some cases.  We'll initially do that for pages
> with inline data, but it can be extended to pages which are entirely within
> an extent.  It also allows for an iop to be removed from pages in the
> future (eg page split).
> 
> Co-developed-by: Matthew Wilcox (Oracle) 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Signed-off-by: Andreas Gruenbacher 
> Reviewed-by: Christoph Hellwig 

Seems simple enough...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9023717c5188..598fcfabc337 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1334,14 +1334,13 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   struct writeback_control *wbc, struct inode *inode,
>   struct page *page, u64 end_offset)
>  {
> - struct iomap_page *iop = to_iomap_page(page);
> + struct iomap_page *iop = iomap_page_create(inode, page);
>   struct iomap_ioend *ioend, *next;
>   unsigned len = i_blocksize(inode);
>   u64 file_offset; /* file offset of page */
>   int error = 0, count = 0, i;
>   LIST_HEAD(submit_list);
>  
> - WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
> + WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>  
>   /*
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v3 2/3] iomap: Don't create iomap_page objects for inline files

2021-07-08 Thread Darrick J. Wong
On Wed, Jul 07, 2021 at 01:55:23PM +0200, Andreas Gruenbacher wrote:
> In iomap_readpage_actor, don't create iop objects for inline inodes.
> Otherwise, iomap_read_inline_data will set PageUptodate without setting
> iop->uptodate, and iomap_page_release will eventually complain.
> 
> To prevent this kind of bug from occurring in the future, make sure the
> page doesn't have private data attached in iomap_read_inline_data.
> 
> Signed-off-by: Andreas Gruenbacher 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 598fcfabc337..6330dabc451e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -215,6 +215,7 @@ iomap_read_inline_data(struct inode *inode, struct page 
> *page,
>   if (PageUptodate(page))
>   return;
>  
> + BUG_ON(page_has_private(page));
>   BUG_ON(page->index);
>   BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>  
> @@ -239,7 +240,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  {
>   struct iomap_readpage_ctx *ctx = data;
>   struct page *page = ctx->cur_page;
> - struct iomap_page *iop = iomap_page_create(inode, page);
> + struct iomap_page *iop;
>   bool same_page = false, is_contig = false;
>   loff_t orig_pos = pos;
>   unsigned poff, plen;
> @@ -252,6 +253,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   }
>  
>   /* zero post-eof blocks as the page may be mapped */
> + iop = iomap_page_create(inode, page);
> + iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
>   if (plen == 0)
>   goto done;
> -- 
> 2.26.3
> 



Re: [Cluster-devel] [PATCH v3 2/3] iomap: Don't create iomap_page objects for inline files

2021-07-08 Thread Darrick J. Wong
On Wed, Jul 07, 2021 at 03:28:47PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 07, 2021 at 01:55:23PM +0200, Andreas Gruenbacher wrote:
> > @@ -252,6 +253,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > }
> >  
> > /* zero post-eof blocks as the page may be mapped */
> > +   iop = iomap_page_create(inode, page);
> > iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> > if (plen == 0)
> > goto done;
> 
> I /think/ a subsequent patch would look like this:
> 
> + /* No need to create an iop if the page is within an extent */
> + loff_t page_pos = page_offset(page);
> + if (pos > page_pos || pos + length < page_pos + page_size(page))
> + iop = iomap_page_create(inode, page);
> 
> but that might miss some other reasons to create an iop.

I was under the impression that for blksize

Re: [Cluster-devel] [PATCH 3/3] iomap: fall back to buffered writes for invalidation failures

2020-07-22 Thread Darrick J. Wong
Hey Ted,

Could you please review the fs/ext4/ part of this patch (it's the
follow-on to the directio discussion I had with you last week) so that I
can get this moving for 5.9? Thx,

--D

On Tue, Jul 21, 2020 at 08:31:57PM +0200, Christoph Hellwig wrote:
> Failing to invalidate the page cache means data is incoherent, which is
> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Dave Chinner 
> Reviewed-by: Goldwyn Rodrigues 
> ---
>  fs/ext4/file.c   |  2 ++
>  fs/gfs2/file.c   |  3 ++-
>  fs/iomap/direct-io.c | 16 +++-
>  fs/iomap/trace.h |  1 +
>  fs/xfs/xfs_file.c|  4 ++--
>  fs/zonefs/super.c|  7 +--
>  6 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c4c..129cc1dd6b7952 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   iomap_ops = &ext4_iomap_overwrite_ops;
>   ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  is_sync_kiocb(iocb) || unaligned_io || extend);
> + if (ret == -ENOTBLK)
> + ret = 0;
>  
>   if (extend)
>   ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index bebde537ac8cf2..b085a3bea4f0fd 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -835,7 +835,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, 
> struct iov_iter *from)
>  
>   ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
>  is_sync_kiocb(iocb));
> -
> + if (ret == -ENOTBLK)
> + ret = 0;
>  out:
>   gfs2_glock_dq(&gh);
>  out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 190967e87b69e4..c1aafb2ab99072 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include "trace.h"
>  
>  #include "../internal.h"
>  
> @@ -401,6 +402,9 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>   * can be mapped into multiple disjoint IOs and only a subset of the IOs 
> issued
>   * may be pure data writes. In that case, we still need to do a full data 
> sync
>   * completion.
> + *
> + * Returns -ENOTBLK in case of a page invalidation failure for
> + * writes.  The caller needs to fall back to buffered I/O in this case.
>   */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> @@ -478,13 +482,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   if (iov_iter_rw(iter) == WRITE) {
>   /*
>* Try to invalidate cache pages for the range we are writing.
> -  * If this invalidation fails, tough, the write will still work,
> -  * but racing two incompatible write paths is a pretty crazy
> -  * thing to do, so we don't support it 100%.
> +  * If this invalidation fails, let the caller fall back to
> +  * buffered I/O.
>*/
>   if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> - end >> PAGE_SHIFT))
> - dio_warn_stale_pagecache(iocb->ki_filp);
> + end >> PAGE_SHIFT)) {
> + trace_iomap_dio_invalidate_fail(inode, pos, count);
> + ret = -ENOTBLK;
> + goto out_free_dio;
> + }
>  
>   if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>   ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 5693a39d52fb63..fdc7ae388476f5 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,   \
>  DEFINE_RANGE_EVENT(iomap_writepage);
>  DEFINE_RANGE_EVENT(iomap_releasepage);
>  DEFINE_RANGE_EVENT(iomap_invalidatepage);
> +DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>  
>  #define IOMAP_TYPE_STRINGS \
>   { IOMAP_HOLE,   "HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a6ef90457abf97..1b4517fc55f1b9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
>   xfs_iunlock(ip, iolock);
>  
>   /*
> -  * No fallback to buffered IO on errors for XFS, direct IO will either
> -  * complete fully or fail.
> +  * No fallback to buffered IO after short writes for XFS, direct I/O
> +  * will either complete fully or return an error.
>*/
>   ASSERT(ret < 0 || ret == count);
>   return ret;
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 07bc42d62673ce..d0a04528a7e18e 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c

Re: [Cluster-devel] [PATCH 3/3] iomap: fall back to buffered writes for invalidation failures

2020-07-22 Thread Darrick J. Wong
On Wed, Jul 22, 2020 at 08:18:50AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 01:37:49PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 21, 2020 at 08:31:57PM +0200, Christoph Hellwig wrote:
> > > Failing to invalidate the page cache means data is incoherent, which is
> > > a very bad state for the system.  Always fall back to buffered I/O
> > > through the page cache if we can't invalidate mappings.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > > Acked-by: Dave Chinner 
> > > Reviewed-by: Goldwyn Rodrigues 
> > 
> > For the iomap and xfs parts,
> > Reviewed-by: Darrick J. Wong 
> > 
> > But I'd still like acks from Ted, Andreas, and Damien for ext4, gfs2,
> > and zonefs, respectively.
> > 
> > (Particularly if anyone was harboring ideas about trying to get this in
> > before 5.10, though I've not yet heard anyone say that explicitly...)
> 
> Why would we want to wait another whole merge window?

Well it /is/ past -rc6, which is a tad late...

OTOH we've been talking about this for 2 months now and most of the
actual behavior change is in xfs land so maybe it's fine. :)

--D



Re: [Cluster-devel] [PATCH 1/3] xfs: use ENOTBLK for direct I/O to buffered I/O fallback

2020-07-21 Thread Darrick J. Wong
On Tue, Jul 21, 2020 at 08:31:55PM +0200, Christoph Hellwig wrote:
> This is what the classic fs/direct-io.c implementation and thus other
> file systems use.
> 
> Signed-off-by: Christoph Hellwig 

Looks ok to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d6c..a6ef90457abf97 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -505,7 +505,7 @@ xfs_file_dio_aio_write(
>*/
>   if (xfs_is_cow_inode(ip)) {
>   trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, 
> count);
> - return -EREMCHG;
> + return -ENOTBLK;
>   }
>   iolock = XFS_IOLOCK_EXCL;
>   } else {
> @@ -714,7 +714,7 @@ xfs_file_write_iter(
>* allow an operation to fall back to buffered mode.
>*/
>   ret = xfs_file_dio_aio_write(iocb, from);
> - if (ret != -EREMCHG)
> + if (ret != -ENOTBLK)
>   return ret;
>   }
>  
> -- 
> 2.27.0
> 



Re: [Cluster-devel] [PATCH 3/3] iomap: fall back to buffered writes for invalidation failures

2020-07-21 Thread Darrick J. Wong
On Tue, Jul 21, 2020 at 08:31:57PM +0200, Christoph Hellwig wrote:
> Failing to invalidate the page cache means data is incoherent, which is
> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Dave Chinner 
> Reviewed-by: Goldwyn Rodrigues 

For the iomap and xfs parts,
Reviewed-by: Darrick J. Wong 

But I'd still like acks from Ted, Andreas, and Damien for ext4, gfs2,
and zonefs, respectively.

(Particularly if anyone was harboring ideas about trying to get this in
before 5.10, though I've not yet heard anyone say that explicitly...)

--D

> ---
>  fs/ext4/file.c   |  2 ++
>  fs/gfs2/file.c   |  3 ++-
>  fs/iomap/direct-io.c | 16 +++-
>  fs/iomap/trace.h |  1 +
>  fs/xfs/xfs_file.c|  4 ++--
>  fs/zonefs/super.c|  7 +--
>  6 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c4c..129cc1dd6b7952 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
> >   iomap_ops = &ext4_iomap_overwrite_ops;
> >   ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  is_sync_kiocb(iocb) || unaligned_io || extend);
> + if (ret == -ENOTBLK)
> + ret = 0;
>  
>   if (extend)
>   ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index bebde537ac8cf2..b085a3bea4f0fd 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -835,7 +835,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, 
> struct iov_iter *from)
>  
> >   ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
>  is_sync_kiocb(iocb));
> -
> + if (ret == -ENOTBLK)
> + ret = 0;
>  out:
> >   gfs2_glock_dq(&gh);
>  out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 190967e87b69e4..c1aafb2ab99072 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include "trace.h"
>  
>  #include "../internal.h"
>  
> @@ -401,6 +402,9 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>   * can be mapped into multiple disjoint IOs and only a subset of the IOs 
> issued
>   * may be pure data writes. In that case, we still need to do a full data 
> sync
>   * completion.
> + *
> > + * Returns -ENOTBLK in case of a page invalidation failure for writes.
> > + * The caller needs to fall back to buffered I/O in this case.
>   */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> @@ -478,13 +482,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   if (iov_iter_rw(iter) == WRITE) {
>   /*
>* Try to invalidate cache pages for the range we are writing.
> -  * If this invalidation fails, tough, the write will still work,
> -  * but racing two incompatible write paths is a pretty crazy
> -  * thing to do, so we don't support it 100%.
> +  * If this invalidation fails, let the caller fall back to
> +  * buffered I/O.
>*/
>   if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> - end >> PAGE_SHIFT))
> - dio_warn_stale_pagecache(iocb->ki_filp);
> + end >> PAGE_SHIFT)) {
> + trace_iomap_dio_invalidate_fail(inode, pos, count);
> + ret = -ENOTBLK;
> + goto out_free_dio;
> + }
>  
>   if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>   ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 5693a39d52fb63..fdc7ae388476f5 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,   \
>  DEFINE_RANGE_EVENT(iomap_writepage);
>  DEFINE_RANGE_EVENT(iomap_releasepage);
>  DEFINE_RANGE_EVENT(iomap_invalidatepage);
> +DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>  
>  #define IOMAP_TYPE_STRINGS \
>   { IOMAP_HOLE,   "HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a6ef90457abf97..1b4517fc55f1b9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
>  

Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Darrick J. Wong
On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > > 
> > > Oh, true.  I'll do that ASAP.
> > 
> > Michael, could we add this to manpages?
> 
> Umm, no.  -ENOTBLK is internal - the file systems will retry using
> buffered I/O and the error shall never escape to userspace (or even the
> VFS for that matter).

It's worth dropping a comment somewhere that ENOTBLK is the desired
"fall back to buffered" errcode, seeing as Dave and I missed that in
XFS...

--D


