Re: [f2fs-dev] [syzbot] [f2fs?] WARNING in rcu_sync_dtor

2024-08-01 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 29, 2024 at 03:27:21PM +0200, Jan Kara wrote:
> Also as the "filesystem shutdown" is spreading across multiple filesystems,
> I'm playing with the idea that maybe we could lift a flag like this to VFS
> so that we can check it in VFS paths and abort some operations early. 

I've been thinking the same thing since I saw what CIFS was doing a
couple of days ago with shutdowns. It's basically just stopping all
new incoming modification operations if the flag is set. i.e. it's
just a check in each filesystem method, and I suspect that many
other filesystems that support shutdown do the same thing.

It looks like exactly the same implementation as CIFS is about to be
added to exfat - stop all the incoming methods and check in the
writeback method - so having a generic superblock flag and generic
checks before calling into filesystem methods would make it real
easy for all filesystems to have basic ->shutdown support for when
block devices go away suddenly.
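
Something like this is all it would take at the VFS level (rough
sketch only - none of these names exist today, pick whatever flag
bit is actually free):

/* sketch only: SB_I_SHUTDOWN and sb_is_shutdown() are made up names */
static inline bool sb_is_shutdown(const struct super_block *sb)
{
	return sb->s_iflags & SB_I_SHUTDOWN;
}

/* then in the generic entry points, before calling the fs method: */
	if (unlikely(sb_is_shutdown(inode->i_sb)))
		return -EIO;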

I also think that we should be lifting *IOC_SHUTDOWN to the VFS -
the same ioctl is now implemented in 4-5 filesystems and they
largely do the same thing - just set a bit in the internal
filesystem superblock flags. Yes, filesystems like XFS and ext4 do
special stuff with journals, but the generic VFS implementation could
call the filesystem ->shutdown method to do that.
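
i.e. something along these lines in the generic ioctl code (sketch
only - the flag is the hypothetical one above, and exclusion against
concurrent shutdowns is waved away):

/* sketch: a generic FS_IOC_SHUTDOWN, not what any fs does today */
static int vfs_ioc_shutdown(struct super_block *sb)
{
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	/* stop all new incoming modifications */
	sb->s_iflags |= SB_I_SHUTDOWN;		/* hypothetical flag */

	/* let the filesystem do journal/log specific work if it wants */
	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);
	return 0;
}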

> But
> so far I'm not convinced the gain is worth the need to iron out various
> subtle semantical differences of "shutdown" among filesystems.

I don't think we need to change how any filesystem behaves when it
is shut down. As long as filesystems follow at least the "no new
modifications when shutdown" behaviour, filesystems can implement
internal shutdown however they want...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call

2024-01-23 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Jan 23, 2024 at 09:39:02PM +0100, Mikulas Patocka wrote:
> 
> 
> On Tue, 23 Jan 2024, Johannes Thumshirn wrote:
> 
> > Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in
> > zonefs_zone_mgmt().
> > 
> > As zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called
> > from a place that can recurse back into the filesystem on memory reclaim,
> > it is safe to call blkdev_zone_mgmt() with GFP_KERNEL.
> > 
> > Signed-off-by: Johannes Thumshirn 
> > ---
> >  fs/zonefs/super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> > index 93971742613a..63fbac018c04 100644
> > --- a/fs/zonefs/super.c
> > +++ b/fs/zonefs/super.c
> > @@ -113,7 +113,7 @@ static int zonefs_zone_mgmt(struct super_block *sb,
> >  
> > trace_zonefs_zone_mgmt(sb, z, op);
> > ret = blkdev_zone_mgmt(sb->s_bdev, op, z->z_sector,
> > -  z->z_size >> SECTOR_SHIFT, GFP_NOFS);
> > +  z->z_size >> SECTOR_SHIFT, GFP_KERNEL);
> > if (ret) {
> > zonefs_err(sb,
> >"Zone management operation %s at %llu failed %d\n",
> > 
> > -- 
> > 2.43.0
> 
> zonefs_inode_zone_mgmt calls 
> lockdep_assert_held(&ZONEFS_I(inode)->i_truncate_mutex); - so, this 
> function is called with the mutex held - could it happen that the 
> GFP_KERNEL allocation recurses into the filesystem and attempts to take 
> i_truncate_mutex as well?
> 
> i.e. GFP_KERNEL -> iomap_do_writepage -> zonefs_write_map_blocks -> 
> zonefs_write_iomap_begin -> mutex_lock(&zi->i_truncate_mutex)

zonefs doesn't have a ->writepage method, so writeback can't be
called from memory reclaim like this.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-09-10 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote:
> On 9/3/23 23:30, Dave Chinner wrote:
> > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> > > On 8/29/23 19:53, Matthew Wilcox wrote:
> > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu 
> > > > > > > 
> > > > > > > Add a boolean parameter for file_accessed() to support nowait 
> > > > > > > semantics.
> > > > > > > Currently it is true only with io_uring as its initial caller.
> > > > > > 
> > > > > > So why do we need to do this as part of this series?  Apparently it
> > > > > > hasn't caused any problems for filemap_read().
> > > > > > 
> > > > > 
> > > > > We need this parameter to indicate if nowait semantics should be
> > > > > enforced in touch_atime(); there are locks and maybe IOs in it.
> > > > 
> > > > That's not my point.  We currently call file_accessed() and
> > > > touch_atime() for nowait reads and nowait writes.  You haven't done
> > > > anything to fix those.
> > > > 
> > > > I suspect you can trim this patchset down significantly by avoiding
> > > > fixing the file_accessed() problem.  And then come back with a later
> > > > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> > > 
> > > I'm ok to do that.
> > > 
> > > > first that fixes it for the existing nowait users, and then a second
> > > > series to do all the directory stuff.
> > > > 
> > > > I'd do the first thing.  Just ignore the problem.  Directory atime
> > > > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > > > everyone uses relatime or nodiratime.
> > > 
> > > Hi Matthew,
> > > The previous discussion shows this does cause issues in real
> > > producations: 
> > > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> > > 
> > 
> > Then separate it out into its own patch set so we can have a
> > discussion on the merits of requiring the use of noatime, relatime or
> > lazytime for really latency-sensitive IO applications. Changing code
> > is not always the right solution...
> 
> Separation sounds reasonable, but it can hardly be said that only
> latency sensitive apps would care about >1s nowait/async submission
> delays. Presumably, btrfs can improve on that, but it still looks
> like it's perfectly legit for filesystems to do heavy stuff in
> timestamping like waiting for IO. Right?

Yes, it is, no-one is denying that. And some filesystems are worse
than others, but none of that means it has to be fixed so getdents
can be converted to NOWAIT semantics.

i.e. this patchset is about the getdents NOWAIT machinery, and
fiddling around with timestamps has much, much wider scope than just
NOWAIT getdents machinery. We'll have this discussion about NOWAIT
timestamp updates when a RFC is proposed to address the wider
problem of how timestamp updates should behave in NOWAIT context.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 02/11] xfs: add NOWAIT semantics for readdir

2023-09-03 Thread Dave Chinner via Linux-f2fs-devel
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Implement NOWAIT semantics for readdir. Return EAGAIN error to the
> caller if it would block, like failing to get locks, or going to
> do IO.
> 
> Co-developed-by: Dave Chinner 

Not really.

"Co-developed" implies equal development input between all the
parties, which is not the case here - this patch is based on
prototype I wrote, whilst you're doing the refining, testing and
correctness work.

In these cases with XFS code, we add a line in the commit message to
say:

"This is based on a patch originally written by Dave Chinner."


> Signed-off-by: Dave Chinner 
> Signed-off-by: Hao Xu 
> [fixes deadlock issue, tweak code style]

With a signoff chain like you already have.

In the end you'll also get an RVB from me, which seems rather wrong
to me if I've apparently been "co-developing" the code...


> @@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
>   if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
>   return 0;
>  
> - error = xfs_dir3_block_read(args->trans, dp, &bp);
> + if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> + flags |= XFS_DABUF_NOWAIT;
> + error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
>   if (error)
>   return error;
>  

Given we do this same check in both block and leaf formats to set
XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in
xfs_readdir() as well, we should probably do this check once at the
higher level and pass flags down from there with XFS_DABUF_NOWAIT
already set.
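
i.e. something like this in xfs_readdir() (untested sketch of the
idea, not a drop-in change):

	unsigned int	flags = 0;

	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		flags |= XFS_DABUF_NOWAIT;

	/* ...then hand @flags to the block/leaf/node getdents variants
	 * so they never need to look at the dir_context again. */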

> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
>  STATIC int
>  xfs_dir2_leaf_readbuf(
>   struct xfs_da_args  *args,
> + struct dir_context  *ctx,
>   size_t  bufsize,
>   xfs_dir2_off_t  *cur_off,
>   xfs_dablk_t *ra_blk,
> @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
>   struct xfs_iext_cursor  icur;
>   int ra_want;
>   int error = 0;
> -
> - error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> - if (error)
> - goto out;
> + unsigned intflags = 0;
> +
> + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> + flags |= XFS_DABUF_NOWAIT;
> + } else {
> + error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> + if (error)
> + goto out;
> + }

Especially as, in hindsight, this doesn't make a whole lot of sense.
If XFS_DABUF_NOWAIT is set, we keep going until
xfs_ilock_data_map_shared_nowait() where we call
xfs_need_iread_extents() to see if we need to read the extents in
and abort at that point.

So, really, we shouldn't get this far with nowait semantics if
we haven't read the extents in yet - we're supposed to already have
the inode locked here and so we should have already checked this
condition before we bother locking the inode...

i.e. all we should be doing here is this:

if (!(flags & XFS_DABUF_NOWAIT)) {
error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
if (error)
goto out;
}

And then we don't need to pass the VFS dir_context down into low
level XFS functions unnecessarily.


>  
>   /*
>* Look for mapped directory blocks at or above the current offset.
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
>   new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
>   if (new_off > *cur_off)
>   *cur_off = new_off;
> - error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
> + error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, 
> &bp);
>   if (error)
>   goto out;
>  
> @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
>   int byteoff;/* offset in current block */
>   unsigned intoffset = 0;
>   int error = 0;  /* error return value */
> + int written = 0;
>  
>   /*
>* If the offset is at or past the largest allowed value,
> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>   bp = NULL;
>   }
>  
> - if (*lock_mode == 0)
> - *lock_mode = xfs_ilock_data_map_shared(dp);
> - error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
> - &rablk, &bp);
> + if (*lock_mode == 0) {
> + *lock_mode =
> + xfs_ilock_data_map_shared_generic(dp,
> + ctx->flags & DIR_CONTEXT_F_NOWAIT);
> + if (!*lock_mode) {
> + error = -EAGAIN;
> + break;
> + }
> + }
> +   

Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-09-03 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> On 8/29/23 19:53, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu 
> > > > > 
> > > > > Add a boolean parameter for file_accessed() to support nowait 
> > > > > semantics.
> > > > > Currently it is true only with io_uring as its initial caller.
> > > > 
> > > > So why do we need to do this as part of this series?  Apparently it
> > > > hasn't caused any problems for filemap_read().
> > > > 
> > > 
> > > We need this parameter to indicate if nowait semantics should be enforced
> > > in touch_atime(); there are locks and maybe IOs in it.
> > 
> > That's not my point.  We currently call file_accessed() and
> > touch_atime() for nowait reads and nowait writes.  You haven't done
> > anything to fix those.
> > 
> > I suspect you can trim this patchset down significantly by avoiding
> > fixing the file_accessed() problem.  And then come back with a later
> > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> 
> I'm ok to do that.
> 
> > first that fixes it for the existing nowait users, and then a second
> > series to do all the directory stuff.
> > 
> > I'd do the first thing.  Just ignore the problem.  Directory atime
> > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > everyone uses relatime or nodiratime.
> 
> Hi Matthew,
> The previous discussion shows this does cause issues in real
> producations: 
> https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> 

Then separate it out into its own patch set so we can have a
discussion on the merits of requiring the use of noatime, relatime or
lazytime for really latency-sensitive IO applications. Changing code
is not always the right solution...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH RFC v5 00/29] io_uring getdents

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This series introduces getdents64 to io_uring; the code logic is similar
> to the synchronized version's. It first tries a nowait issue, and offloads
> it to io-wq threads if the first try fails.
> 
> Patch1 and Patch2 are some preparation
> Patch3 supports nowait for xfs getdents code
> Patch4-11 are vfs change, include adding helpers and trylock for locks
> Patch12-29 supports nowait for involved xfs journal stuff
> note, Patch24 and 27 are actually two questions, might be removed later.
> an xfs test may come later.

You need to drop all the XFS journal stuff. It's fundamentally
broken as it stands, and we cannot support non-blocking
transactional changes without first making a massive investment in
transaction and intent chain rollback to allow correctly undoing
partially complete modifications.

Regardless, non-blocking transactions are completely unnecessary for
a non-blocking readdir implementation. readdir should only be
touching atime, and with relatime it should only occur once every 24
hours per inode. If that's a problem, then we have noatime mount
options. Hence I just don't see any point in worrying about having a
timestamp update block occasionally...

I also don't really see why you need to fiddle with xfs buffer
cache semantics - it already has the functionality "nowait" buffer
reads require (i.e.  XBF_INCORE|XBF_TRYLOCK).

However, the readahead IO that the xfs readdir code issues cannot
use your defined NOWAIT semantics - it must be able to allocate
memory and issue IO. Readahead already avoids blocking on memory
allocation and blocking on IO via the XBF_READ_AHEAD flag. This sets
__GFP_NORETRY for buffer allocation and REQ_RAHEAD for IO. Hence
readahead only needs the existing XBF_TRYLOCK flag to be set to be
compatible with the required NOWAIT semantics.
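
For reference, this is (roughly) all the current readahead path does:

void
xfs_buf_readahead_map(
	struct xfs_buftarg	*target,
	struct xfs_buf_map	*map,
	int			nmaps,
	const struct xfs_buf_ops *ops)
{
	struct xfs_buf		*bp;

	xfs_buf_read_map(target, map, nmaps,
			 XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops,
			 __this_address);
}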

As for the NOIO memory allocation restrictions io_uring requires,
that should be enforced at the io_uring layer before calling into
the VFS using memalloc_noio_save/restore.  At that point no memory
allocation will trigger IO and none of the code running under NOWAIT
conditions even needs to be aware that io_uring has a GFP_NOIO
restriction on memory allocation.
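
i.e. the io_uring submission side would do something like this
(sketch - the actual VFS call is just a placeholder here):

	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	ret = do_the_vfs_call();	/* placeholder for the op being issued */
	memalloc_noio_restore(noio_flags);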

Please go back to the simple "do non-blocking buffer IO"
implementation we started with and don't try to solve every little
blocking problem that might exist in the VFS and filesystems...

-Dave
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 25/29] xfs: support nowait for xfs_buf_item_init()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:27PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> support nowait for xfs_buf_item_init() and error out -EAGAIN to
> _xfs_trans_bjoin() when it would block.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_buf_item.c |  9 +++--
>  fs/xfs/xfs_buf_item.h |  2 +-
>  fs/xfs/xfs_buf_item_recover.c |  2 +-
>  fs/xfs/xfs_trans_buf.c| 16 +---
>  4 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 023d4e0385dd..b1e63137d65b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -827,7 +827,8 @@ xfs_buf_item_free_format(
>  int
>  xfs_buf_item_init(
>   struct xfs_buf  *bp,
> - struct xfs_mount *mp)
> + struct xfs_mount *mp,
> + bool   nowait)
>  {
>   struct xfs_buf_log_item *bip = bp->b_log_item;
>   int chunks;
> @@ -847,7 +848,11 @@ xfs_buf_item_init(
>   return 0;
>   }
>  
> - bip = kmem_cache_zalloc(xfs_buf_item_cache, GFP_KERNEL | __GFP_NOFAIL);
> + bip = kmem_cache_zalloc(xfs_buf_item_cache,
> + GFP_KERNEL | (nowait ? 0 : __GFP_NOFAIL));
> + if (!bip)
> + return -EAGAIN;
> +
>   xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
>   bip->bli_buf = bp;

I see filesystem shutdowns...

> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 016371f58f26..a1e4f2e8629a 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -57,13 +57,14 @@ xfs_trans_buf_item_match(
>   * If the buffer does not yet have a buf log item associated with it,
>   * then allocate one for it.  Then add the buf item to the transaction.
>   */
> -STATIC void
> +STATIC int
>  _xfs_trans_bjoin(
>   struct xfs_trans*tp,
>   struct xfs_buf  *bp,
>   int reset_recur)
>  {
>   struct xfs_buf_log_item *bip;
> + int ret;
>  
>   ASSERT(bp->b_transp == NULL);
>  
> @@ -72,7 +73,11 @@ _xfs_trans_bjoin(
>* it doesn't have one yet, then allocate one and initialize it.
>* The checks to see if one is there are in xfs_buf_item_init().
>*/
> - xfs_buf_item_init(bp, tp->t_mountp);
> + ret = xfs_buf_item_init(bp, tp->t_mountp,
> + tp->t_flags & XFS_TRANS_NOWAIT);
> + if (ret < 0)
> + return ret;
> +
>   bip = bp->b_log_item;
>   ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>   ASSERT(!(bip->__bli_format.blf_flags & XFS_BLF_CANCEL));
> @@ -92,6 +97,7 @@ _xfs_trans_bjoin(
>   xfs_trans_add_item(tp, &bip->bli_item);
>   bp->b_transp = tp;
>  
> + return 0;
>  }
>  
>  void
> @@ -309,7 +315,11 @@ xfs_trans_read_buf_map(
>   }
>  
>   if (tp) {
> - _xfs_trans_bjoin(tp, bp, 1);
> + error = _xfs_trans_bjoin(tp, bp, 1);
> + if (error) {
> + xfs_buf_relse(bp);
> + return error;
> + }
>   trace_xfs_trans_read_buf(bp->b_log_item);

So what happens at the callers when we have a dirty transaction and
joining a buffer fails with -EAGAIN?

Apart from the fact this may well propagate -EAGAIN up to userspace,
cancelling a dirty transaction at this point will result in a
filesystem shutdown

Indeed, this can happen in the "simple" timestamp update case that
this "nowait" semantic is being aimed at. We log the inode in the
timestamp update, which dirties the log item and registers a
precommit operation to be run. We commit the
transaction, which then runs xfs_inode_item_precommit() and that
may need to attach the inode to the inode cluster buffer. This
results in:

xfs_inode_item_precommit
  xfs_imap_to_bp
    xfs_trans_read_buf_map
      _xfs_trans_bjoin
        xfs_buf_item_init(XFS_TRANS_NOWAIT)
          kmem_cache_zalloc(GFP_NOFS)
          (allocation fails)
        gets -EAGAIN error
      propagates -EAGAIN
    fails due to -EAGAIN

And now xfs_trans_commit() fails with a dirty transaction and the
filesystem shuts down.

IOWs, XFS_TRANS_NOWAIT as it stands is fundamentally broken. Once we
dirty an item in a transaction, we *cannot* back out of the
transaction. We *must block* in every place that could fail -
locking, memory allocation and/or IO - until the transaction
completes because we cannot undo the changes we've already made to
the dirty items in the transaction

It's even worse than that - once we have committed intents, the
whole chain of intent processing must be run to completion. Hence
we can't tolerate backing out of that deferred processing chain
halfway through because we might have to block.

Until we can roll back partial dirty transactions and partially
completed deferred intent chains at any random point of completion,
XFS_TRANS_NOWAIT will not work.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [f2fs-dev] [PATCH 24/29] xfs: support nowait for xfs_buf_read_map()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:26PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This causes xfstests generic/232 hung in umount process, waiting for ail
> push, so I comment it for now, need some hints from xfs folks.
> Not a real patch.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_buf.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index cdad80e1ae25..284962a9f31a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -828,6 +828,13 @@ xfs_buf_read_map(
>   trace_xfs_buf_read(bp, flags, _RET_IP_);
>  
>   if (!(bp->b_flags & XBF_DONE)) {
> +//   /*
> +//* Let's bypass the _xfs_buf_read() for now
> +//*/
> +//   if (flags & XBF_NOWAIT) {
> +//   xfs_buf_relse(bp);
> +//   return -EAGAIN;
> +//   }

This is *fundamentally broken*, and apart from anything else breaks
readahead.

If we asked for a read, we cannot instantiate the buffer and then
*not issue any IO on it* and release it. That leaves an
uninitialised buffer in memory, and there's every chance that
something then trips over it and bad things happen.

A buffer like this *must* be errored out and marked stale so that
the next access to it will then re-initialise the buffer state and
trigger any preparatory work that needs to be done for the new
operation.
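
i.e. if you really have to bail out here, the buffer needs to be torn
down properly - something like this (sketch, and still not a good
idea):

	if (flags & XBF_NOWAIT) {
		xfs_buf_ioerror(bp, -EAGAIN);
		xfs_buf_stale(bp);
		xfs_buf_relse(bp);
		return -EAGAIN;
	}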

This comes back to my first comments that XBF_TRYLOCK cannot simply
be replaced with XBF_NOWAIT semantics...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 28/29] xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:30PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Apply trylock logic for xc_ctx_lock in xlog_cil_commit() in nowait
> case and error out -EAGAIN for xlog_cil_commit().

Again, fundamentally broken. Any error from xlog_cil_commit() will
result in a filesystem shutdown as we cannot back out from failure
with dirty log items gracefully at this point.

-Dave.

-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 26/29] xfs: return -EAGAIN when nowait meets sync in transaction commit

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:28PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> if the log transaction is a sync one, let's fail the nowait try and
> return -EAGAIN directly since sync transaction means blocked by IO.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_trans.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 7988b4c7f36e..f1f84a3dd456 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -968,12 +968,24 @@ __xfs_trans_commit(
>   xfs_csn_t   commit_seq = 0;
>   int error = 0;
>   int sync = tp->t_flags & XFS_TRANS_SYNC;
> + boolnowait = tp->t_flags & XFS_TRANS_NOWAIT;
> + boolperm_log = tp->t_flags & XFS_TRANS_PERM_LOG_RES;
>  
>   trace_xfs_trans_commit(tp, _RET_IP_);
>  
> + if (nowait && sync) {
> + /*
> +  * Currently nowait is only from xfs_vn_update_time()
> +  * so perm_log is always false here, but let's make
> +  * code general.
> +  */
> + if (perm_log)
> + xfs_defer_cancel(tp);
> + goto out_unreserve;
> + }

This is fundamentally broken.  We cannot abort a transaction commit
with dirty items at this point without shutting down the filesystem.

This points to XFS_TRANS_NOWAIT being completely broken, too,
because we don't call xfs_trans_set_sync() until just before we
commit the transaction. At this point, it is -too late- for
nowait+sync to be handled gracefully, and it will *always* go bad.

IOWs, the whole transaction "nowait" semantics as designed and
implemented are not a workable solution...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 02/29] xfs: rename XBF_TRYLOCK to XBF_NOWAIT

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:04PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> XBF_TRYLOCK means we need the lock but don't block on it,

Yes.


> we can use it to
> stand for not waiting for memory allocation. Rename XBF_TRYLOCK to
> XBF_NOWAIT, which is more generic.

No.

Not only can XBF_TRYLOCK require memory allocation, it can require
IO to be issued. We use TRYLOCK for -readahead- and so we *must* be
able to allocate memory and issue IO under TRYLOCK caller
conditions.

[...]

> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index d440393b40eb..2ccb0867824c 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -661,7 +661,7 @@ xfs_attr_rmtval_invalidate(
>   return error;
>   if (XFS_IS_CORRUPT(args->dp->i_mount, nmap != 1))
>   return -EFSCORRUPTED;
> - error = xfs_attr_rmtval_stale(args->dp, &map, XBF_TRYLOCK);
> + error = xfs_attr_rmtval_stale(args->dp, &map, XBF_NOWAIT);
>   if (error)
>   return error;

XBF_INCORE | XBF_NOWAIT makes no real sense. I mean, XBF_INCORE is
exactly "find a cached buffer or fail" - it's not going to do any
memory allocation or IO, so NOWAIT semantics don't make any sense
here. It's the buffer lock that this lookup is explicitly
avoiding, and so TRYLOCK describes exactly the semantics we want
from this incore lookup.

Indeed, this is a deadlock avoidance mechanism as the transaction
may already have the buffer locked and so we don't want the
xfs_buf_incore() lookup to try to lock the buffer again. TRYLOCK
documents this pretty clearly - NOWAIT loses that context

> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 6a6503ab0cd7..77c4f1d83475 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -1343,7 +1343,7 @@ xfs_btree_read_buf_block(
>   int error;
>  
>   /* need to sort out how callers deal with failures first */
> - ASSERT(!(flags & XBF_TRYLOCK));
> + ASSERT(!(flags & XBF_NOWAIT));
>  
>   error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
>   if (error)
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index ac6d8803e660..9312cf3b20e2 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -460,7 +460,7 @@ xrep_invalidate_block(
>  
>   error = xfs_buf_incore(sc->mp->m_ddev_targp,
>   XFS_FSB_TO_DADDR(sc->mp, fsbno),
> - XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
> + XFS_FSB_TO_BB(sc->mp, 1), XBF_NOWAIT, &bp);

My point exactly.

xfs_buf_incore() is simply a lookup with XBF_INCORE set. (XBF_INCORE
| XBF_TRYLOCK) has exactly the semantics of "return the buffer only
if it is cached and we can lock it without blocking".
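
From memory, xfs_buf_incore() is just a thin wrapper, roughly:

int
xfs_buf_incore(
	struct xfs_buftarg	*target,
	xfs_daddr_t		blkno,
	size_t			numblks,
	xfs_buf_flags_t		flags,
	struct xfs_buf		**bpp)
{
	DEFINE_SINGLE_BUF_MAP(map, blkno, numblks);

	return xfs_buf_get_map(target, &map, 1, XBF_INCORE | flags, bpp);
}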

It will not instantiate a new buffer (i.e. do memory allocation) or
do IO, because if it is under IO the buffer lock will be held.

So, essentially, this "NOWAIT" semantic you want is already supplied
by (XBF_INCORE | XBF_TRYLOCK) buffer lookups.

>   if (error)
>   return 0;
>  
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 15d1e5a7c2d3..9f84bc3b802c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -228,7 +228,7 @@ _xfs_buf_alloc(
>* We don't want certain flags to appear in b_flags unless they are
>* specifically set by later operations on the buffer.
>*/
> - flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
> + flags &= ~(XBF_UNMAPPED | XBF_NOWAIT | XBF_ASYNC | XBF_READ_AHEAD);
>  
>   atomic_set(&bp->b_hold, 1);
>   atomic_set(&bp->b_lru_ref, 1);
> @@ -543,7 +543,7 @@ xfs_buf_find_lock(
>   struct xfs_buf  *bp,
>   xfs_buf_flags_t flags)
>  {
> - if (flags & XBF_TRYLOCK) {
> + if (flags & XBF_NOWAIT) {
>   if (!xfs_buf_trylock(bp)) {
>   XFS_STATS_INC(bp->b_mount, xb_busy_locked);
>   return -EAGAIN;
> @@ -886,7 +886,7 @@ xfs_buf_readahead_map(
>   struct xfs_buf  *bp;
>  
>   xfs_buf_read_map(target, map, nmaps,
> -  XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops,
> +  XBF_NOWAIT | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops,
>__this_address);

That will break readahead (which we use extensively in getdents
operations) if we can't allocate buffers and issue IO under NOWAIT
conditions.

>  }
>  
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 549c60942208..8cd307626939 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -45,7 +45,7 @@ struct xfs_buf;
>  
>  /* flags used only as arguments to access routines */
>  #define XBF_INCORE(1u << 29)/* lookup only, return if found in cache */
> -#define XBF_TRYLOCK   (1u << 30)/* lock requested, but do not wait */
> +#define XBF_NOWAIT(1u << 30)/* mem/lock requested, but do not wait */

Re: [f2fs-dev] [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:34PM +0800, Qi Zheng wrote:
> Like global slab shrink, this commit also uses refcount+RCU method to make
> memcg slab shrink lockless.

This patch does random code cleanups amongst the actual RCU changes.
Can you please move the cleanups to a separate patch to reduce the
noise in this one?

> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index d318f5621862..fee6f62904fb 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -107,6 +107,12 @@ static struct shrinker_info 
> *shrinker_info_protected(struct mem_cgroup *memcg,
>lockdep_is_held(&shrinker_rwsem));
>  }
>  
> +static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
> +int nid)
> +{
> + return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> +}

This helper doesn't add value. It doesn't tell me that
rcu_read_lock() needs to be held when it is called, for one

>  static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_size,
>   int old_size, int new_nr_max)
>  {
> @@ -198,7 +204,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
> int shrinker_id)
>   struct shrinker_info_unit *unit;
>  
>   rcu_read_lock();
> - info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> + info = shrinker_info_rcu(memcg, nid);

... whilst the original code here was obviously correct.

>   unit = info->unit[shriner_id_to_index(shrinker_id)];
>   if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
>   /* Pairs with smp mb in shrink_slab() */
> @@ -211,7 +217,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
> int shrinker_id)
>  
>  static DEFINE_IDR(shrinker_idr);
>  
> -static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +static int shrinker_memcg_alloc(struct shrinker *shrinker)

Cleanups in a separate patch.

> @@ -253,10 +258,15 @@ static long xchg_nr_deferred_memcg(int nid, struct 
> shrinker *shrinker,
>  {
>   struct shrinker_info *info;
>   struct shrinker_info_unit *unit;
> + long nr_deferred;
>  
> - info = shrinker_info_protected(memcg, nid);
> + rcu_read_lock();
> + info = shrinker_info_rcu(memcg, nid);
>   unit = info->unit[shriner_id_to_index(shrinker->id)];
> - return 
> atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
> + nr_deferred = 
> atomic_long_xchg(&unit->nr_deferred[shriner_id_to_offset(shrinker->id)], 0);
> + rcu_read_unlock();
> +
> + return nr_deferred;
>  }

This adds two rcu_read_lock() sections to every call to
do_shrink_slab(). It's not at all clear from any of the other code
that do_shrink_slab() now has internal rcu_read_lock() sections...

> @@ -464,18 +480,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   if (!mem_cgroup_online(memcg))
>   return 0;
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 0;
> -
> - info = shrinker_info_protected(memcg, nid);
> +again:
> + rcu_read_lock();
> + info = shrinker_info_rcu(memcg, nid);
>   if (unlikely(!info))
>   goto unlock;
>  
> - for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> + if (index < shriner_id_to_index(info->map_nr_max)) {
>   struct shrinker_info_unit *unit;
>  
>   unit = info->unit[index];
>  
> + /*
> +  * The shrinker_info_unit will not be freed, so we can
> +  * safely release the RCU lock here.
> +  */
> + rcu_read_unlock();

Why - what guarantees that the shrinker_info_unit exists at this
point? We hold no reference to it, we hold no reference to any
shrinker, etc. What provides this existence guarantee?

> +
>   for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
> @@ -485,12 +506,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   struct shrinker *shrinker;
>   int shrinker_id = calc_shrinker_id(index, offset);
>  
> + rcu_read_lock();
>   shrinker = idr_find(&shrinker_idr, shrinker_id);
> - if (unlikely(!shrinker || !(shrinker->flags & 
> SHRINKER_REGISTERED))) {
> - if (!shrinker)
> - clear_bit(offset, unit->map);
> + if (unlikely(!shrinker || !shrinker_try_get(shrinker))) 
> {
> + clear_bit(offset, unit->map);
> + rcu_read_unlock();
>   continue;
>   }
> + rcu_read_unlock();
>  
>   /* Call non-slab shri

Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.
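
e.g. something along these lines (my wording, going from how the
series uses them):

	/*
	 * @refcount pins the shrinker while a shrink_slab() lookup is
	 * using it; shrinker_register() sets it to 1 and shrinker_free()
	 * drops that registration reference.
	 *
	 * @done is completed when the last reference is dropped so that
	 * shrinker_free() can wait for all in-flight users to finish
	 * before tearing the shrinker down.
	 *
	 * @rcu defers the final free until concurrent shrinker_list
	 * walkers have left their RCU read-side critical sections.
	 */
	refcount_t refcount;
	struct completion done;
	struct rcu_head rcu;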

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This needs to be moved to the head of the function, and document
the whole list walk, get, put and completion parts of the algorithm
that make it safe. There's more to this than "we hold a reference
count", especially the tricky "we might see the shrinker before it
is fully initialised" case


[...]
>  void shrinker_free(struct shrinker *shrinker)
>  {
>   struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>   if (!shrinker)
>   return;
>  
> + if (shrinker->flags & SHRINKER_REGISTERED) {
> + shrinker_put(shrinker);
> + wait_for_completion(&shrinker->done);
> + }

Needs a comment explaining why we need to wait here...
> +
>   down_write(&shrinker_rwsem);
>   if (shrinker->flags & SHRINKER_REGISTERED) {
> - list_del(&shrinker->list);
> + /*
> +  * Lookups on the shrinker are over and will fail in the future,
> +  * so we can now remove it from the lists and free it.
> +  */

 rather than here after the wait has been done and provided the
guarantee that no shrinker is running or will run again...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [f2fs-dev] [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:32PM +0800, Qi Zheng wrote:
> Currently, we maintain two linear arrays per node per memcg, which are
> shrinker_info::map and shrinker_info::nr_deferred. And we need to resize
> them when the shrinker_nr_max is exceeded, that is, allocate a new array,
> and then copy the old array to the new array, and finally free the old
> array by RCU.
> 
> For shrinker_info::map, we do set_bit() under the RCU lock, so we may set
> the value into the old map which is about to be freed. This may cause the
> value set to be lost. The current solution is not to copy the old map when
> resizing, but to set all the corresponding bits in the new map to 1. This
> solves the data loss problem, but bring the overhead of more pointless
> loops while doing memcg slab shrink.
> 
> For shrinker_info::nr_deferred, we will only modify it under the read lock
> of shrinker_rwsem, so it will not run concurrently with the resizing. But
> after we make memcg slab shrink lockless, there will be the same data loss
> problem as shrinker_info::map, and we can't work around it like the map.
> 
> For such resizable arrays, the most straightforward idea is to change it
> to xarray, like we did for list_lru [1]. We need to do xa_store() in the
> list_lru_add()-->set_shrinker_bit(), but this will cause memory
> allocation, and the list_lru_add() doesn't accept failure. A possible
> solution is to pre-allocate, but the location of pre-allocation is not
> well determined.

So you implemented a two level array that preallocates leaf
nodes to work around it? It's remarkably complex for what it does -
I can't help but think a radix tree using a special holder for
nr_deferred values of zero would end up being simpler...

> Therefore, this commit chooses to introduce a secondary array for
> shrinker_info::{map, nr_deferred}, so that we only need to copy this
> secondary array every time the size is resized. Then even if we get the
> old secondary array under the RCU lock, the found map and nr_deferred are
> also true, so no data is lost.

I don't understand what you are trying to describe here. If we get
the old array, then don't we get either a stale nr_deferred value,
or the update we do gets lost because the next shrinker lookup will
find the new array and so the deferred value stored to the old one
is never seen again?

> 
> [1]. 
> https://lore.kernel.org/all/20220228122126.37293-13-songmuc...@bytedance.com/
> 
> Signed-off-by: Qi Zheng 
> Reviewed-by: Muchun Song 
> ---
[...]
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index a27779ed3798..1911c06b8af5 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -12,15 +12,50 @@ DECLARE_RWSEM(shrinker_rwsem);
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
>  
> -/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> -static inline int shrinker_map_size(int nr_items)
> +static inline int shrinker_unit_size(int nr_items)
>  {
> - return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> + return (DIV_ROUND_UP(nr_items, SHRINKER_UNIT_BITS) * sizeof(struct 
> shrinker_info_unit *));
>  }
>  
> -static inline int shrinker_defer_size(int nr_items)
> +static inline void shrinker_unit_free(struct shrinker_info *info, int start)
>  {
> - return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> + struct shrinker_info_unit **unit;
> + int nr, i;
> +
> + if (!info)
> + return;
> +
> + unit = info->unit;
> + nr = DIV_ROUND_UP(info->map_nr_max, SHRINKER_UNIT_BITS);
> +
> + for (i = start; i < nr; i++) {
> + if (!unit[i])
> + break;
> +
> + kvfree(unit[i]);
> + unit[i] = NULL;
> + }
> +}
> +
> +static inline int shrinker_unit_alloc(struct shrinker_info *new,
> +struct shrinker_info *old, int nid)
> +{
> + struct shrinker_info_unit *unit;
> + int nr = DIV_ROUND_UP(new->map_nr_max, SHRINKER_UNIT_BITS);
> + int start = old ? DIV_ROUND_UP(old->map_nr_max, SHRINKER_UNIT_BITS) : 0;
> + int i;
> +
> + for (i = start; i < nr; i++) {
> + unit = kvzalloc_node(sizeof(*unit), GFP_KERNEL, nid);

A unit is 576 bytes. Why is this using kvzalloc_node()?

> + if (!unit) {
> + shrinker_unit_free(new, start);
> + return -ENOMEM;
> + }
> +
> + new->unit[i] = unit;
> + }
> +
> + return 0;
>  }
>  
>  void free_shrinker_info(struct mem_cgroup *memcg)
> @@ -32,6 +67,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>   for_each_node(nid) {
>   pn = memcg->nodeinfo[nid];
>   info = rcu_dereference_protected(pn->shrinker_info, true);
> + shrinker_unit_free(info, 0);
>   kvfree(info);
>   rcu_assign_pointer(pn->shrinker_info, NULL);
>   }

Why is this safe? The info and maps are looked up by RCU, so why is
freeing them without a RCU grace period safe?

Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.

> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.

> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c| 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This comment really should be at the head of the function,
describing the algorithm used within the function itself. i.e. how
reference counts are used w.r.t. the rcu_read_lock() usage to
guarantee existence of the shrinker and the validity of the list
walk.

I'm not going to remember all these little details when I look at
this code in another 6 months' time, and having to work it out from
first principles every time I look at the code will waste a lot
of time...
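
i.e. something like this at the top of shrink_slab() (my wording,
adjust as needed):

/*
 * Lockless shrinker list walk, in short:
 *
 *  - the list is walked under rcu_read_lock(); shrinker_try_get()
 *    pins the current shrinker so we can drop the RCU lock and let
 *    do_shrink_slab() sleep.
 *
 *  - rcu_read_lock() is taken again before shrinker_put(), so the
 *    step to the next list entry is safe even if this shrinker is
 *    unregistered and freed the moment we drop our reference.
 *
 *  - shrinker_free() drops the registration reference, waits on
 *    shrinker->done for in-flight users to finish, then unlinks the
 *    shrinker and frees it after an RCU grace period.
 */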

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [f2fs-dev] [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker

2023-07-27 Thread Dave Chinner via Linux-f2fs-devel
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>>   goto err;
> >>>   }
> >>>   +    zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +    zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +    zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +    zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +    shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this 
> >> code
> >> all over the place as the new API call would be very similar to the current
> >> shrinker_register() call with static allocation.
> > 
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
> 
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6
or 7 parameters with a return value that must be checked and
handled.

This was done in the first versions of the patch set - the amount of
code in each caller does not go down and, IMO, was much harder to
read and determine "this is obviously correct" than what we have
now.
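
To be concrete, the combined helper would have to look something
like this (hypothetical - it doesn't exist, which is rather the
point):

struct shrinker *shrinker_alloc_and_register(unsigned int flags,
		unsigned long (*count_objects)(struct shrinker *,
					       struct shrink_control *),
		unsigned long (*scan_objects)(struct shrinker *,
					      struct shrink_control *),
		int seeks, long batch, void *private_data,
		const char *fmt, ...);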

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having
to go look up a helper function every time I want to check some
random shrinker has been set up correctly

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
> On 2023/7/26 16:08, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker 
> > > *shrinker);
> > >   void shrinker_register(struct shrinker *shrinker);
> > >   void shrinker_unregister(struct shrinker *shrinker);
> > > +static inline bool shrinker_try_get(struct shrinker *shrinker)
> > > +{
> > > + return READ_ONCE(shrinker->registered) &&
> > > +refcount_inc_not_zero(&shrinker->refcount);
> > > +}
> > 
> > Why do we care about shrinker->registered here? If we don't set
> > the refcount to 1 until we have fully initialised everything, then
> > the shrinker code can key entirely off the reference count and
> > none of the lookup code needs to care about whether the shrinker is
> > registered or not.
> 
> The purpose of checking shrinker->registered here is to stop running
> shrinker after calling shrinker_free(), which can prevent the following
> situations from happening:
> 
> CPU 0 CPU 1
> 
> shrinker_try_get()
> 
>shrinker_try_get()
> 
> shrinker_put()
> shrinker_try_get()
>shrinker_put()

I don't see any race here? What is wrong with having multiple active
users at once?

> > 
> > This should use a completion, then it is always safe under
> > rcu_read_lock().  This also gets rid of the shrinker_lock spin lock,
> > which only exists because we can't take a blocking lock under
> > rcu_read_lock(). i.e:
> > 
> > 
> > void shrinker_put(struct shrinker *shrinker)
> > {
> > if (refcount_dec_and_test(&shrinker->refcount))
> > complete(&shrinker->done);
> > }
> > 
> > void shrinker_free()
> > {
> > .
> > refcount_dec(&shrinker->refcount);
> 
> I guess what you mean is shrinker_put(), because here may be the last
> refcount.

Yes, I did.

> > wait_for_completion(&shrinker->done);
> > /*
> >  * lookups on the shrinker will now all fail as refcount has
> >  * fallen to zero. We can now remove it from the lists and
> >  * free it.
> >  */
> > down_write(shrinker_rwsem);
> > list_del_rcu(&shrinker->list);
> > up_write(&shrinker_rwsem);
> > call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
> > }
> > 
> > 
> > 
> > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
> > >   void shrinker_register(struct shrinker *shrinker)
> > >   {
> > > - down_write(&shrinker_rwsem);
> > > - list_add_tail(&shrinker->list, &shrinker_list);
> > > - shrinker->flags |= SHRINKER_REGISTERED;
> > > + refcount_set(&shrinker->refcount, 1);
> > > +
> > > + spin_lock(&shrinker_lock);
> > > + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > > + spin_unlock(&shrinker_lock);
> > > +
> > >   shrinker_debugfs_add(shrinker);
> > > - up_write(&shrinker_rwsem);
> > > + WRITE_ONCE(shrinker->registered, true);
> > >   }
> > >   EXPORT_SYMBOL(shrinker_register);
> > 
> > This just looks wrong - you are trying to use WRITE_ONCE() as a
> > release barrier to indicate that the shrinker is now set up fully.
That's not necessary - the refcount is an atomic and, along with the
rcu locks, it should provide all the barriers we need. i.e.
> 
> The reason I used WRITE_ONCE() here is because the shrinker->registered
> will be read and written concurrently (read in shrinker_try_get() and
> written in shrinker_free()), which is why I added shrinker::registered
> field instead of using SHRINKER_REGISTERED flag (this can reduce the
> addition of WRITE_ONCE()/READ_ONCE()).

Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
use the field like this. You need release/acquire memory ordering
here. i.e. smp_store_release()/smp_load_acquire().

As it is, refcount_inc_not_zero() provides a control dependency (as
documented in include/linux/refcount.h), and refcount_dec_and_test()
provides release memory ordering. The only thing I think we may need
is a write barrier before refcount_set(), such that if
refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
see an initialised structure...

i.e. refcounts provide all the existence and initialisation
guarantees. Hence I don't see the need to use shrinker->registered
like this and it can remain a bit flag protected by the
shrinker_rwsem().


> > void shrinker_register(struct shrinker *shrinker)
> > {
> > down_write(&shrinker_rwsem);
> > list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > shrinker->flags |= SHRINKER_REGISTERED;
> > shrinker_debugfs_add(shrinker);
> > up_write(&shrinker_rwsem);
> > 
> > /*
> >  * now the shrinker is fully set up, take the first
> >  * reference to it to indicate that lookup operations are
> >  * now allowed to use it via shrinker_try_get().
> >  */
> > refcount_set(&shrinker->refcount, 1);
> > }
> > 
> > > diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> > > index f1becf

Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
> 
> 1) When the memory pressure is high and there are many filesystems
>mounted or unmounted at the same time, slab shrink will be affected
>(down_read_trylock() failed).
> 
>Such as the real workload mentioned by Kirill Tkhai:
> 
>```
>One of the real workloads from my experience is start
>of an overcommitted node containing many starting
>containers after node crash (or many resuming containers
>after reboot for kernel update). In these cases memory
>pressure is huge, and the node goes round in long reclaim.
>```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>in [1]) and a writer comes in (such as mount a fs),
>then this writer will be blocked and cause all
>subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there may still be a
> problem. The down_read_trylock() may become a perf hotspot with frequent
> calls to shrink_slab(). Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop in IPC
> (instructions per cycle).
> 
> We used to implement the lockless slab shrink with SRCU [2], but then
> kernel test robot reported -88.8% regression in
> stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
> 
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
> 
> For now, all shrinker instances are converted to being dynamically allocated and
> will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to
> ensure that the shrinker instance is valid.
> 
> And the shrinker instance will not be run again after unregistration. So
> the structure that records the pointer of shrinker instance can be safely
> freed without waiting for the RCU read-side critical section.
> 
> In this way, while we implement the lockless slab shrink, we don't need to
> be blocked in unregister_shrinker().
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>                       (secs)    (secs)    (secs)   (real time) (usr+sys time)
> ramfs        735238     60.00     12.37    363.70     12253.05      1955.08
> for a 60.01s run time:
>1440.27s available CPU time
>  12.36s user time   (  0.86%)
> 363.70s system time ( 25.25%)
> 376.06s total time  ( 26.11%)
> load average: 10.79 4.47 1.69
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> 2) After applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>                       (secs)    (secs)    (secs)   (real time) (usr+sys time)
> ramfs        746677     60.00     12.22    367.75     12443.70      1965.13
> for a 60.01s run time:
>1440.26s available CPU time
>  12.21s user time   (  0.85%)
> 367.75s system time ( 25.53%)
> 379.96s total time  ( 26.38%)
> load average: 8.37 2.48 0.86
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. 
> https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/
> [2]. 
> https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [5]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng 
> ---
>  include/linux/shrinker.h | 19 +++---
>  mm/shrinker.c| 75 ++--
>  mm/shrinker_debug.c  | 52 +---
>  3 files changed, 104 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 36977a70bebb..335da93cccee 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -86,6 +87,10 @@ struct shrinker {
>   long batch; /* reclaim batch size, 0 = default */
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
> + bo

Re: [f2fs-dev] [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
> 
> a) global shrinker instance statically defined in the kernel, such as
>workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel modules, such
>as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
> 
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, so that the memory
> can then be freed on its own by calling kfree_rcu().
> 
> So this commit adds the following new APIs for dynamically allocating
> shrinkers, and adds a private_data field to struct shrinker to record and
> get the original embedded structure.
> 
> 1. shrinker_alloc()
> 
> Used to allocate the shrinker instance itself and related memory; it will
> return a pointer to the shrinker instance on success and NULL on failure.
> 
> 2. shrinker_free_non_registered()
> 
> Used to destroy the non-registered shrinker instance.

This is a bit nasty

> 
> 3. shrinker_register()
> 
> Used to register the shrinker instance, which is the same as the current
> register_shrinker_prepared().
> 
> 4. shrinker_unregister()

Rename this "shrinker_free()" and key the two different freeing
cases on the SHRINKER_REGISTERED bit rather than mostly duplicating
the two paths.

void shrinker_free(struct shrinker *shrinker)
{
struct dentry *debugfs_entry = NULL;
int debugfs_id;

if (!shrinker)
return;

down_write(&shrinker_rwsem);
if (shrinker->flags & SHRINKER_REGISTERED) {
list_del(&shrinker->list);
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
kfree_const(shrinker->name);
}

if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);

if (debugfs_entry)
shrinker_debugfs_remove(debugfs_entry, debugfs_id);

kfree(shrinker->nr_deferred);
kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);
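
To make the intended calling convention concrete, a minimal usage
sketch from a filesystem's point of view (the shrinker_alloc()
arguments and error handling here are assumptions, not part of the
series as posted):

static int example_register_shrinker(struct example_fs *fs)
{
        fs->shrinker = shrinker_alloc(0, "example-%s", fs->name);
        if (!fs->shrinker)
                return -ENOMEM;

        fs->shrinker->count_objects = example_count_objects;
        fs->shrinker->scan_objects = example_scan_objects;
        fs->shrinker->private_data = fs;

        shrinker_register(fs->shrinker);
        return 0;
}

static void example_unregister_shrinker(struct example_fs *fs)
{
        /* handles both the registered and never-registered cases */
        shrinker_free(fs->shrinker);
}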

-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v5 6/8] xfs: switch to multigrain timestamps

2023-07-18 Thread Dave Chinner via Linux-f2fs-devel
On Thu, Jul 13, 2023 at 07:00:55PM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not
> set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
>  fs/xfs/xfs_iops.c   | 4 ++--
>  fs/xfs/xfs_super.c  | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 0c9df8df6d4a..86f5ffce2d89 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG));

Make that an ASSERT(flags & XFS_ICHGTIME_CHG), please. There's no
need to verify this at runtime on production kernels.
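
i.e. something like this (a sketch of the suggested change only):

        ASSERT(tp);
        ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));

        /* If the mtime changes, then the ctime must also change. */
        ASSERT(flags & XFS_ICHGTIME_CHG);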

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH 16/17] block: use iomap for writes to block devices

2023-05-23 Thread Dave Chinner via Linux-f2fs-devel
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> On 4/24/23 07:49, Christoph Hellwig wrote:
> > Use iomap in buffer_head compat mode to write to block devices.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >   block/Kconfig |  1 +
> >   block/fops.c  | 33 +
> >   2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 941b2dca70db73..672b08f0096ab4 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -5,6 +5,7 @@
> >   menuconfig BLOCK
> >  bool "Enable the block layer" if EXPERT
> >  default y
> > +   select IOMAP
> >  select SBITMAP
> >  help
> >  Provide block layer support for the kernel.
> > diff --git a/block/fops.c b/block/fops.c
> > index 318247832a7bcf..7910636f8df33b 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -15,6 +15,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include "blk.h"
> > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, 
> > struct iov_iter *iter)
> > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
> >   }
> > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t 
> > length,
> > +   unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +   struct block_device *bdev = I_BDEV(inode);
> > +   loff_t isize = i_size_read(inode);
> > +
> > +   iomap->bdev = bdev;
> > +   iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> > +   if (WARN_ON_ONCE(iomap->offset >= isize))
> > +   return -EIO;
> 
> I'm hitting this during booting:
> [5.016324]  
> [5.030256]  iomap_iter+0x11a/0x350
> [5.030264]  iomap_readahead+0x1eb/0x2c0
> [5.030272]  read_pages+0x5d/0x220
> [5.030279]  page_cache_ra_unbounded+0x131/0x180
> [5.030284]  filemap_get_pages+0xff/0x5a0

Why is filemap_get_pages() using unbounded readahead? Surely
readahead should be limited to reading within EOF

> [5.030292]  filemap_read+0xca/0x320
> [5.030296]  ? aa_file_perm+0x126/0x500
> [5.040216]  ? touch_atime+0xc8/0x150
> [5.040224]  blkdev_read_iter+0xb0/0x150
> [5.040228]  vfs_read+0x226/0x2d0
> [5.040234]  ksys_read+0xa5/0xe0
> [5.040238]  do_syscall_64+0x5b/0x80
> 
> Maybe we should consider this patch:
> 
> diff --git a/block/fops.c b/block/fops.c
> index 524b8a828aad..d202fb663f25 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode,
> loff_t offset, loff_t length,
> 
> iomap->bdev = bdev;
> iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> -   if (WARN_ON_ONCE(iomap->offset >= isize))
> -   return -EIO;
> -   iomap->type = IOMAP_MAPPED;
> -   iomap->addr = iomap->offset;
> +   if (WARN_ON_ONCE(iomap->offset >= isize)) {
> +   iomap->type = IOMAP_HOLE;
> +   iomap->addr = IOMAP_NULL_ADDR;
> +   } else {
> +   iomap->type = IOMAP_MAPPED;
> +   iomap->addr = iomap->offset;
> +   }

I think Christoph's code is correct. IMO, any attempt to read beyond
the end of the device should throw out a warning and return an
error, not silently return zeros.

If readahead is trying to read beyond the end of the device, then it
really seems to me like the problem here is readahead, not the iomap
code detecting the OOB read request
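
For clarity, the behaviour being argued for keeps the warning and the
error return for out-of-bounds offsets and simply clamps the mapping
to the device size for valid ones. A completed sketch of the helper
along those lines (not necessarily the exact code that will land):

static int blkdev_iomap_begin(struct inode *inode, loff_t offset,
                loff_t length, unsigned int flags, struct iomap *iomap,
                struct iomap *srcmap)
{
        struct block_device *bdev = I_BDEV(inode);
        loff_t isize = i_size_read(inode);

        iomap->bdev = bdev;
        iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
        if (WARN_ON_ONCE(iomap->offset >= isize))
                return -EIO;    /* reads beyond the device are a caller bug */
        iomap->type = IOMAP_MAPPED;
        iomap->addr = iomap->offset;    /* bdev files map 1:1 to disk */
        iomap->length = isize - iomap->offset;
        return 0;
}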

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 00/23] fs-verity support for XFS

2023-04-11 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 11, 2023 at 07:33:19PM -0700, Eric Biggers wrote:
> On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote:
> > Dave is going to hate me for this, but..
> > 
> > I've been looking over some of the interfaces here, and I'm starting
> > to very seriously questioning the design decisions of storing the
> > fsverity hashes in xattrs.
> > 
> > Yes, storing them beyond i_size in the file is a bit of a hack, but
> > it allows to reuse a lot of the existing infrastructure, and much
> > of fsverity is based around it.  So storing them in an xattrs causes
> > a lot of churn in the interface.  And the XFS side with special
> > casing xattr indices also seems not exactly nice.
> 
> It seems it's really just the Merkle tree caching interface that is causing
> problems, as it's currently too closely tied to the page cache?  That is just 
> an
> implementation detail that could be reworked along the lines of what is being
> discussed.
> 
> But anyway, it is up to the XFS folks.  Keep in mind there is also the option 
> of
> doing what btrfs is doing, where it stores the Merkle tree separately from the
> file data stream, but caches it past i_size in the page cache at runtime.

Right. It's not entirely simple to store metadata on disk beyond EOF
in XFS because of all the assumptions throughout the IO path and
allocator interfaces that it can allocate space beyond EOF at will
and something else will clean it up later if it is not needed. This
impacts on truncate, delayed allocation, writeback, IO completion,
EOF block removal on file close, background garbage collection,
ENOSPC/EDQUOT driven space freeing, etc.  Some of these things cross
over into iomap infrastructure, too.

AFAIC, it's far more intricate, complex and risky to try to store
merkle tree data beyond EOF than it is to put it in an xattr
namespace because IO path EOF handling bugs result in user data
corruption. This happens over and over again, no matter how careful
we are about these aspects of user data handling.

OTOH, putting the merkle tree data in a different namespace avoids
these issues completely. Yes, we now have to solve an API mismatch,
but we aren't risking the addition of IO path data corruption bugs
to every non-fsverity filesystem in production...

Hence I think copying the btrfs approach (i.e. only caching the
merkle tree data in the page cache beyond EOF) would be as far as I
think we'd want to go. Realistically, there would be little
practical difference between btrfs storing the merkle tree blocks in
a separate internal btree and XFS storing them in an internal
private xattr btree namespace.

I would, however, prefer not to have to do this at all if we could
simply map the blocks directly out of the xattr buffers as we
already do internally for all the XFS code...

> I guess there is also the issue of encryption, which hasn't come up yet since
> we're talking about fsverity support only.  The Merkle tree (including the
> fsverity_descriptor) is supposed to be encrypted, just like the file contents
> are.  Having it be stored after the file contents accomplishes that easily...
> Of course, it doesn't have to be that way; a separate key could be derived, or
> the Merkle tree blocks could be encrypted with the file contents key using
> indices past i_size, without them physically being stored in the data stream.

I'm expecting that fscrypt for XFS will include encryption of the
xattr names and values (just like we will need to do for directory
names) except for the xattrs that hold the encryption keys
themselves. That means the merkle tree blocks should get encrypted
without any extra work needing to be done anywhere.  This will
simply require the fscrypt keys to be held in a private internal
xattr namespace that isn't encrypted, but that's relatively trivial
to do...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 10:54:06PM +, Eric Biggers wrote:
> On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote:
> > > We could certainly think about moving to a design where fs/verity/ asks 
> > > the
> > > filesystem to just *read* a Merkle tree block, without adding it to a 
> > > cache, and
> > > then fs/verity/ implements the caching itself.  That would require some 
> > > large
> > > changes to each filesystem, though, unless we were to double-cache the 
> > > Merkle
> > > tree blocks which would be inefficient.
> > 
> > No, that's unnecessary.
> > 
> > All we need is for fsverity to require filesystems to pass it byte
> > addressable data buffers that are externally reference counted. The
> > filesystem can take a page reference before mapping the page and
> > passing the kaddr to fsverity, then unmap and drop the reference
> > when the merkle tree walk is done as per Andrey's new drop callout.
> > 
> > fsverity doesn't need to care what the buffer is made from, how it
> > is cached, what its life cycle is, etc. The caching mechanism and
> > reference counting is entirely controlled by the filesystem callout
> > implementations, and fsverity only needs to deal with memory buffers
> > that are guaranteed to live for the entire walk of the merkle
> > tree
> 
> Sure.  Just a couple notes:
> 
> First, fs/verity/ does still need to be able to tell whether the buffer is 
> newly
> instantiated or not.

Boolean flag from the caller.

> Second, fs/verity/ uses the ahash API to do the hashing.  ahash is a
> scatterlist-based API.  Virtual addresses can still be used (see 
> sg_set_buf()),
> but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to 
> work.
> Does XFS use vmalloc'ed memory for these buffers?

Not vmalloc'ed, but vmapped. We allocate the pages individually, but
then call vm_map_ram() to present the higher level code with a
single contiguous memory range if it is a multi-page buffer.

We do have the backing info held in the buffer, and that's what we
use for IO. If fsverity needs a page based scatter/gather list
for hardware offload, it could ask the filesystem to provide it
for that given buffer...

> BTW, converting fs/verity/ from ahash to shash is an option; I've really never
> been a fan of the scatterlist-based crypto APIs!  The disadvantage of doing
> this, though, would be that it would remove support for all the hardware 
> crypto
> drivers.
>
> That *might* actually be okay, as that approach to crypto acceleration
> has mostly fallen out of favor, in favor of CPU-based acceleration.  But I do
> worry about e.g. someone coming out of the woodwork and saying they need to 
> use
> fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, 
> and
> they MUST use their crypto accelerator to get acceptable performance.

True, but we are very unlikely to be using XFS on such small
systems and I don't think we really care about XFS performance on
android sized systems, either.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 05:12:34PM +0200, Andrey Albershteyn wrote:
> Hi Eric,
> 
> On Tue, Apr 04, 2023 at 04:32:24PM -0700, Eric Biggers wrote:
> > Hi Andrey,
> > 
> > On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> > > In the case of a different Merkle tree block size, fs-verity expects
> > > ->read_merkle_tree_page() to return a Merkle tree page filled with
> > > Merkle tree blocks. XFS stores each merkle tree block under an
> > > extended attribute. Those attributes are addressed by block offset
> > > into the Merkle tree.
> > > 
> > > This patch makes ->read_merkle_tree_page() fetch multiple merkle
> > > tree blocks based on the size ratio. Also, the reference to each
> > > xfs_buf is passed with page->private to ->drop_page().
> > > 
> > > Signed-off-by: Andrey Albershteyn 
> > > ---
> > >  fs/xfs/xfs_verity.c | 74 +++--
> > >  fs/xfs/xfs_verity.h |  8 +
> > >  2 files changed, 66 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> > > index a9874ff4efcd..ef0aff216f06 100644
> > > --- a/fs/xfs/xfs_verity.c
> > > +++ b/fs/xfs/xfs_verity.c
> > > @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
> > >   struct page *page = NULL;
> > >   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
> > >   uint32_tbs = 1 << log_blocksize;
> > > + int blocks_per_page =
> > > + (1 << (PAGE_SHIFT - log_blocksize));
> > > + int n = 0;
> > > + int offset = 0;
> > >   struct xfs_da_args  args = {
> > >   .dp = ip,
> > >   .attr_filter= XFS_ATTR_VERITY,
> > > @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
> > >   .valuelen   = bs,
> > >   };
> > >   int error = 0;
> > > + boolis_checked = true;
> > > + struct xfs_verity_buf_list  *buf_list;
> > >  
> > >   page = alloc_page(GFP_KERNEL);
> > >   if (!page)
> > >   return ERR_PTR(-ENOMEM);
> > >  
> > > - error = xfs_attr_get(&args);
> > > - if (error) {
> > > - kmem_free(args.value);
> > > - xfs_buf_rele(args.bp);
> > > + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> > > + if (!buf_list) {
> > >   put_page(page);
> > > - return ERR_PTR(-EFAULT);
> > > + return ERR_PTR(-ENOMEM);
> > >   }
> > >  
> > > - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> > > + /*
> > > +  * Fill the page with Merkle tree blocks. The blocks_per_page is higher
> > > +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> > > +  * PAGE_SIZE
> > > +  */
> > > + for (n = 0; n < blocks_per_page; n++) {
> > > + offset = bs * n;
> > > + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));
> > > + args.name = (const uint8_t *)&name;
> > > +
> > > + error = xfs_attr_get(&args);
> > > + if (error) {
> > > + kmem_free(args.value);
> > > + /*
> > > +  * No more Merkle tree blocks (e.g. this was the last
> > > +  * block of the tree)
> > > +  */
> > > + if (error == -ENOATTR)
> > > + break;
> > > + xfs_buf_rele(args.bp);
> > > + put_page(page);
> > > + kmem_free(buf_list);
> > > + return ERR_PTR(-EFAULT);
> > > + }
> > > +
> > > + buf_list->bufs[buf_list->buf_count++] = args.bp;
> > > +
> > > + /* One of the buffers was dropped */
> > > + if (!(args.bp->b_flags & XBF_VERITY_CHECKED))
> > > + is_checked = false;
> > > +
> > > + memcpy(page_address(page) + offset, args.value, args.valuelen);
> > > + kmem_free(args.value);
> > > + args.value = NULL;
> > > + }
> > 
> > I was really hoping for a solution where the cached data can be used 
> > directly,
> > instead allocating a temporary page and copying the cached data into it 
> > every
> > time the cache is accessed.  The problem with what you have now is that 
> > every
> > time a single 32-byte hash is accessed, a full page (potentially 64KB!) 
> > will be
> > allocated and filled.  That's not very efficient.  The need to allocate the
> > temporary page can also cause ENOMEM (which will get reported as EIO).
> > 
> > Did you consider alternatives that would work more efficiently?  I think it
> > would be worth designing something that works properly with how XFS is 
> > planned
> > to cache the Merkle tree, instead of designing a workaround.
> > ->read_merkle_tree_page was not really designed for what you are doing here.
> > 
> > How about replacing ->read_merkle_tree_page with a function that takes in a
> > Merkle tree block index (not a page index!) and hands back a (page, offset) 
> > pair
> > that identifies where the Merkle tree block's data is located?  Or (folio,
> > offset), 

Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 06:02:47PM +, Eric Biggers wrote:
> And I really hope that you don't want to do DIO to the *Merkle tree*, as that

Not for XFS - the merkle tree is not held as file data.

That said, the merkle tree in XFS is not page cache or block aligned
metadata either, so we really want the same memory buffer based
interface for the merkle tree reading so that the merkle tree code
can read directly from the xfs-buf rather than requiring us to copy
it out of the xfsbuf into temporary pages...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 06:16:00PM +, Eric Biggers wrote:
> On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote:
> > > The merkle tree pages are dropped after verification. When the page is
> > > dropped, the xfs_buf is marked as verified. If fs-verity wants to
> > > verify again it will get the same verified buffer. If the buffer is
> > > evicted it won't have the verified state.
> > > 
> > > So, with enough memory pressure buffers will be dropped and need to
> > > be reverified.
> > 
> > Please excuse me if this was discussed and rejected long ago, but
> > perhaps fsverity should try to hang on to the merkle tree pages that
> > this function returns for as long as possible until reclaim comes for
> > them?
> > 
> > With the merkle tree page lifetimes extended, you then don't need to
> > attach the xfs_buf to page->private, nor does xfs have to extend the
> > buffer cache to stash XBF_VERITY_CHECKED.
> 
> Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs)
> just cache the Merkle tree pages in the inode's page cache.  It's an approach
> that I know some people aren't a fan of, but it's efficient and it works.

Which puts pages beyond EOF in the page cache. Given that XFS also
allows persistent block allocation beyond EOF, having both data in the page
cache and blocks beyond EOF that contain unrelated information is a
Real Bad Idea.

Just because putting metadata in the file data address space works
for one filesystem, it doesn't mean it's a good idea or that it works
for every filesystem.

> We could certainly think about moving to a design where fs/verity/ asks the
> filesystem to just *read* a Merkle tree block, without adding it to a cache, 
> and
> then fs/verity/ implements the caching itself.  That would require some large
> changes to each filesystem, though, unless we were to double-cache the Merkle
> tree blocks which would be inefficient.

No, that's unnecessary.

All we need is for fsverity to require filesystems to pass it byte
addressable data buffers that are externally reference counted. The
filesystem can take a page reference before mapping the page and
passing the kaddr to fsverity, then unmap and drop the reference
when the merkle tree walk is done as per Andrey's new drop callout.

fsverity doesn't need to care what the buffer is made from, how it
is cached, what its life cycle is, etc. The caching mechanism and
reference counting is entirely controlled by the filesystem callout
implementations, and fsverity only needs to deal with memory buffers
that are guaranteed to live for the entire walk of the merkle
tree
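
A rough sketch of the callout shape this implies - the structure and
function names here are illustrative only, not an existing fsverity
API:

/*
 * The filesystem hands fsverity a kernel-addressable buffer for one
 * merkle tree block and keeps it pinned until the drop callout. How
 * the buffer is cached (page cache, xfs_buf, internal btree, ...) is
 * entirely the filesystem's business.
 */
struct fsverity_blockbuf {
        void    *kaddr;         /* mapped merkle tree block */
        bool    verified;       /* fs remembered a prior verification */
        void    *fs_private;    /* e.g. the xfs_buf backing the block */
};

int  (*read_merkle_tree_block)(struct inode *inode, u64 blkno,
                               struct fsverity_blockbuf *buf);
void (*drop_merkle_tree_block)(struct inode *inode,
                               struct fsverity_blockbuf *buf);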

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote:
> > On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote:
> > > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> > > > The direct path is not supported on verity files. Attempts to use direct
> > > > I/O path on such files should fall back to buffered I/O path.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn 
> > > > ---
> > > >  fs/xfs/xfs_file.c | 14 +++---
> > > >  1 file changed, 11 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index 947b5c436172..9e072e82f6c1 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -244,7 +244,8 @@ xfs_file_dax_read(
> > > > struct kiocb*iocb,
> > > > struct iov_iter *to)
> > > >  {
> > > > -   struct xfs_inode*ip = 
> > > > XFS_I(iocb->ki_filp->f_mapping->host);
> > > > +   struct inode*inode = iocb->ki_filp->f_mapping->host;
> > > > +   struct xfs_inode*ip = XFS_I(inode);
> > > > ssize_t ret = 0;
> > > >  
> > > > trace_xfs_file_dax_read(iocb, to);
> > > > @@ -297,10 +298,17 @@ xfs_file_read_iter(
> > > >  
> > > > if (IS_DAX(inode))
> > > > ret = xfs_file_dax_read(iocb, to);
> > > > -   else if (iocb->ki_flags & IOCB_DIRECT)
> > > > +   else if (iocb->ki_flags & IOCB_DIRECT && 
> > > > !fsverity_active(inode))
> > > > ret = xfs_file_dio_read(iocb, to);
> > > > -   else
> > > > +   else {
> > > > +   /*
> > > > +* In case fs-verity is enabled, we also fallback to the
> > > > +* buffered read from the direct read path. Therefore,
> > > > +* IOCB_DIRECT is set and need to be cleared
> > > > +*/
> > > > +   iocb->ki_flags &= ~IOCB_DIRECT;
> > > > ret = xfs_file_buffered_read(iocb, to);
> > > 
> > > XFS doesn't usually allow directio fallback to the pagecache. Why
> > > would fsverity be any different?
> > 
> > Didn't know that; this is what happens on ext4 so I did the same.
> > Then it probably makes sense to just error on DIRECT reads of a
> > verity-sealed file.
> 
> Thinking about this a little more -- I suppose we shouldn't just go
> breaking directio reads from a verity file if we can help it.  Is there
> a way to ask fsverity to perform its validation against some arbitrary
> memory buffer that happens to be fs-block aligned?

The memory buffer doesn't even need to be fs-block aligned - it just
needs to be a pointer to memory the kernel can read...

We also need fsverity to be able to handle being passed mapped
kernel memory rather than pages/folios for the merkle tree
interfaces. That way we can just pass it the mapped buffer memory
straight from the xfs-buf and we don't have to do the whacky "copy
from xattr xfs_bufs into pages so fsverity can take temporary
reference counts on what it thinks are page cache pages" as it walks
the merkle tree.
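
The kind of data-side interface being asked for might look something
like this - purely a hypothetical sketch, not an existing fsverity
call:

/*
 * Verify fs-block-sized chunks of data from an arbitrary kernel
 * buffer (e.g. straight out of a completed direct IO read) rather
 * than from page cache folios. @pos is the file offset the buffer
 * was read from, so the right merkle tree leaves can be selected.
 */
bool fsverity_verify_buffer(struct inode *inode, const void *buf,
                            size_t len, loff_t pos);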

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag

2023-04-04 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 04, 2023 at 03:41:23PM -0700, Eric Biggers wrote:
> Hi Andrey,
> 
> On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote:
> > Add a flag to mark inodes which have fs-verity enabled on them (i.e.
> > the descriptor exists and the tree is built).
> > 
> > Signed-off-by: Andrey Albershteyn 
> > ---
> >  fs/ioctl.c | 4 
> >  fs/xfs/libxfs/xfs_format.h | 4 +++-
> >  fs/xfs/xfs_inode.c | 2 ++
> >  fs/xfs/xfs_iops.c  | 2 ++
> >  include/uapi/linux/fs.h| 1 +
> >  5 files changed, 12 insertions(+), 1 deletion(-)
> [...]
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index b7b56871029c..5172a2eb902c 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -140,6 +140,7 @@ struct fsxattr {
> >  #define FS_XFLAG_FILESTREAM0x4000  /* use filestream 
> > allocator */
> >  #define FS_XFLAG_DAX   0x8000  /* use DAX for IO */
> >  #define FS_XFLAG_COWEXTSIZE0x0001  /* CoW extent size 
> > allocator hint */
> > +#define FS_XFLAG_VERITY0x0002  /* fs-verity sealed 
> > inode */
> >  #define FS_XFLAG_HASATTR   0x8000  /* no DIFLAG for this   */
> >  
> 
> I don't think "xfs: add inode on-disk VERITY flag" is an accurate description 
> of
> a patch that involves adding something to the UAPI.

Well it does that, but it also adds the UAPI for querying the
on-disk flag via the FS_IOC_FSGETXATTR interface.  It
probably should be split up into two patches.

> Should the other filesystems support this new flag too?

I think they should get it automatically now that it has been
defined for FS_IOC_FSGETXATTR and added to the generic fileattr flag
fill functions in fs/ioctl.c.
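
As a quick illustration, once the flag is exported this way userspace
can query it through the existing FS_IOC_FSGETXATTR interface with no
filesystem-specific code at all (sketch; the FS_XFLAG_VERITY value is
the one proposed in this patch):

#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_XFLAG_VERITY
#define FS_XFLAG_VERITY 0x00020000      /* fs-verity sealed inode */
#endif

/* returns 1 if the inode reports the verity xflag, 0 if not, -1 on error */
static int is_verity_sealed(int fd)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) != 0)
                return -1;
        return (fsx.fsx_xflags & FS_XFLAG_VERITY) ? 1 : 0;
}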

> I'd also like all ways of getting the verity flag to continue to be mentioned 
> in
> Documentation/filesystems/fsverity.rst.  The existing methods (FS_IOC_GETFLAGS
> and statx) are already mentioned there.

*nod*

-Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [f2fs-dev] [PATCH v2 06/23] fsverity: add drop_page() callout

2023-04-04 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 04, 2023 at 04:53:02PM +0200, Andrey Albershteyn wrote:
> Allow the filesystem to do additional processing on verified pages
> instead of just dropping a reference. This will be used by XFS for
> internal buffer cache manipulation in further patches. btrfs, ext4,
> and f2fs just drop the reference.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/btrfs/verity.c | 12 
>  fs/ext4/verity.c  |  6 ++
>  fs/f2fs/verity.c  |  6 ++
>  fs/verity/read_metadata.c |  4 ++--
>  fs/verity/verify.c|  6 +++---
>  include/linux/fsverity.h  | 10 ++
>  6 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> index c5ff16f9e9fa..4c2c09204bb4 100644
> --- a/fs/btrfs/verity.c
> +++ b/fs/btrfs/verity.c
> @@ -804,10 +804,22 @@ static int btrfs_write_merkle_tree_block(struct inode 
> *inode, const void *buf,
>  pos, buf, size);
>  }
>  
> +/*
> + * fsverity op that releases the reference obtained by 
> ->read_merkle_tree_page()
> + *
> + * @page:  reference to the page which can be released
> + *
> + */
> +static void btrfs_drop_page(struct page *page)
> +{
> + put_page(page);
> +}
> +
>  const struct fsverity_operations btrfs_verityops = {
>   .begin_enable_verity = btrfs_begin_enable_verity,
>   .end_enable_verity   = btrfs_end_enable_verity,
>   .get_verity_descriptor   = btrfs_get_verity_descriptor,
>   .read_merkle_tree_page   = btrfs_read_merkle_tree_page,
>   .write_merkle_tree_block = btrfs_write_merkle_tree_block,
> + .drop_page   = &btrfs_drop_page,
>  };

Ok, that's a generic put_page() call.


> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index f50e3b5b52c9..c2fc4c86af34 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -210,7 +210,7 @@ verify_data_block(struct inode *inode, struct 
> fsverity_info *vi,
>   if (is_hash_block_verified(vi, hpage, hblock_idx)) {
>   memcpy_from_page(_want_hash, hpage, hoffset, hsize);
>   want_hash = _want_hash;
> - put_page(hpage);
> + inode->i_sb->s_vop->drop_page(hpage);
>   goto descend;

fsverity_drop_page(inode, hpage);

static inline void
fsverity_drop_page(struct inode *inode, struct page *page)
{
if (inode->i_sb->s_vop->drop_page)
inode->i_sb->s_vop->drop_page(page);
else
put_page(page);
}

And then you don't need to add the functions to each of the
filesystems nor make an indirect call just to run put_page().
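
With the helper in place only a filesystem that stashes extra state
behind the page needs its own callout; everyone else leaves .drop_page
unset and gets put_page(). For the XFS scheme described in this series
(buffer reference stored in page->private), a simplified sketch might
look like:

static void xfs_verity_drop_page(struct page *page)
{
        /*
         * Sketch only: release the xfs_buf reference taken by
         * ->read_merkle_tree_page(). The series actually stores a
         * small list of buffers here; this is simplified to one.
         */
        struct xfs_buf  *bp = (struct xfs_buf *)page_private(page);

        if (bp) {
                xfs_buf_rele(bp);
                set_page_private(page, 0);
        }
        put_page(page);
}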

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

