Re: [Cluster-devel] [PATCH v2 00/23] fs-verity support for XFS

2023-04-11 Thread Eric Biggers
On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote:
> Dave is going to hate me for this, but..
> 
> I've been looking over some of the interfaces here, and I'm starting
> to very seriously question the design decisions of storing the
> fsverity hashes in xattrs.
> 
> Yes, storing them beyond i_size in the file is a bit of a hack, but
> it allows reusing a lot of the existing infrastructure, and much
> of fsverity is based around it.  So storing them in xattrs causes
> a lot of churn in the interface.  And the XFS side with special
> casing xattr indices also seems not exactly nice.

It seems it's really just the Merkle tree caching interface that is causing
problems, as it's currently too closely tied to the page cache?  That is just an
implementation detail that could be reworked along the lines of what is being
discussed.

But anyway, it is up to the XFS folks.  Keep in mind there is also the option of
doing what btrfs is doing, where it stores the Merkle tree separately from the
file data stream, but caches it past i_size in the page cache at runtime.

I guess there is also the issue of encryption, which hasn't come up yet since
we're talking about fsverity support only.  The Merkle tree (including the
fsverity_descriptor) is supposed to be encrypted, just like the file contents
are.  Having it be stored after the file contents accomplishes that easily...
Of course, it doesn't have to be that way; a separate key could be derived, or
the Merkle tree blocks could be encrypted with the file contents key using
indices past i_size, without them physically being stored in the data stream.
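
As a rough sketch of that last alternative (illustrative only, not what any
filesystem currently does; merkle_page, offs, and merkle_lblk are hypothetical
locals), fscrypt already has a helper that encrypts a block in place using a
caller-chosen logical block number as the IV input:

	/*
	 * Encrypt one Merkle tree block with the file contents key, using a
	 * logical block number chosen past i_size, so the block never has to
	 * physically live in the data stream.
	 */
	err = fscrypt_encrypt_block_inplace(inode, merkle_page, block_size,
					    offs, merkle_lblk, GFP_NOFS);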

- Eric



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-07 Thread Eric Biggers
On Wed, Apr 05, 2023 at 05:44:36PM -0700, Eric Biggers wrote:
> > Not vmalloc'ed, but vmapped. we allocate the pages individually, but
> > then call vm_map_page() to present the higher level code with a
> > single contiguous memory range if it is a multi-page buffer.
> > 
> > We do have the backing info held in the buffer, and that's what we
> > use for IO. If fsverity needs a page based scatter/gather list
> > for hardware offload, it could ask the filesystem to provide it
> > for that given buffer...
> > 
> > > BTW, converting fs/verity/ from ahash to shash is an option; I've really
> > > never been a fan of the scatterlist-based crypto APIs!  The disadvantage
> > > of doing this, though, would be that it would remove support for all the
> > > hardware crypto drivers.
> > >
> > > That *might* actually be okay, as that approach to crypto acceleration
> > > has mostly fallen out of favor, in favor of CPU-based acceleration.  But
> > > I do worry about e.g. someone coming out of the woodwork and saying they
> > > need to use fsverity on a low-powered ARM board that has a crypto
> > > accelerator like CAAM, and they MUST use their crypto accelerator to get
> > > acceptable performance.
> > 
> > True, but we are very unlikely to be using XFS on such small
> > systems and I don't think we really care about XFS performance on
> > android sized systems, either.
> > 
> 
> FYI, I've sent an RFC patch that converts fs/verity/ from ahash to shash:
> https://lore.kernel.org/r/20230406003714.94580-1-ebigg...@kernel.org
> 
> It would be great if we could do that.  But I need to get a better sense for
> whether anyone will complain...

FWIW, dm-verity went in the other direction.  It started with shash, and then in
2017 it was switched to ahash by https://git.kernel.org/linus/d1ac3ff008fb9a48
("dm verity: switch to using asynchronous hash crypto API").

I think that was part of my motivation for using ahash in fsverity from the
beginning.

Still, it does seem that ahash is more trouble than it's worth these days...

- Eric



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Eric Biggers
On Thu, Apr 06, 2023 at 09:37:53AM +1000, Dave Chinner wrote:
> On Wed, Apr 05, 2023 at 10:54:06PM +0000, Eric Biggers wrote:
> > On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote:
> > > > We could certainly think about moving to a design where fs/verity/ asks
> > > > the filesystem to just *read* a Merkle tree block, without adding it to
> > > > a cache, and then fs/verity/ implements the caching itself.  That would
> > > > require some large changes to each filesystem, though, unless we were
> > > > to double-cache the Merkle tree blocks which would be inefficient.
> > > 
> > > No, that's unnecessary.
> > > 
> > > All we need is for fsverity to require filesystems to pass it byte
> > > addressable data buffers that are externally reference counted. The
> > > filesystem can take a page reference before mapping the page and
> > > passing the kaddr to fsverity, then unmap and drop the reference
> > > when the merkle tree walk is done as per Andrey's new drop callout.
> > > 
> > > fsverity doesn't need to care what the buffer is made from, how it
> > > is cached, what it's life cycle is, etc. The caching mechanism and
> > > reference counting is entirely controlled by the filesystem callout
> > > implementations, and fsverity only needs to deal with memory buffers
> > > that are guaranteed to live for the entire walk of the merkle
> > > tree
> > 
> > Sure.  Just a couple notes:
> > 
> > First, fs/verity/ does still need to be able to tell whether the buffer
> > is newly instantiated or not.
> 
> Boolean flag from the caller.
> 
> > Second, fs/verity/ uses the ahash API to do the hashing.  ahash is a
> > scatterlist-based API.  Virtual addresses can still be used (see
> > sg_set_buf()), but the memory cannot be vmalloc'ed memory, since
> > virt_to_page() needs to work.  Does XFS use vmalloc'ed memory for these
> > buffers?
> 
> Not vmalloc'ed, but vmapped. we allocate the pages individually, but
> then call vm_map_page() to present the higher level code with a
> single contiguous memory range if it is a multi-page buffer.
> 
> We do have the backing info held in the buffer, and that's what we
> use for IO. If fsverity needs a page based scatter/gather list
> for hardware offload, it could ask the filesystem to provide it
> for that given buffer...
> 
> > BTW, converting fs/verity/ from ahash to shash is an option; I've really
> > never been a fan of the scatterlist-based crypto APIs!  The disadvantage of
> > doing this, though, would be that it would remove support for all the
> > hardware crypto drivers.
> >
> > That *might* actually be okay, as that approach to crypto acceleration
> > has mostly fallen out of favor, in favor of CPU-based acceleration.  But I
> > do worry about e.g. someone coming out of the woodwork and saying they need
> > to use fsverity on a low-powered ARM board that has a crypto accelerator
> > like CAAM, and they MUST use their crypto accelerator to get acceptable
> > performance.
> 
> True, but we are very unlikely to be using XFS on such small
> systems and I don't think we really care about XFS performance on
> android sized systems, either.
> 

FYI, I've sent an RFC patch that converts fs/verity/ from ahash to shash:
https://lore.kernel.org/r/20230406003714.94580-1-ebigg...@kernel.org

It would be great if we could do that.  But I need to get a better sense for
whether anyone will complain...

- Eric



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Eric Biggers
On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote:
> > We could certainly think about moving to a design where fs/verity/ asks the
> > filesystem to just *read* a Merkle tree block, without adding it to a
> > cache, and then fs/verity/ implements the caching itself.  That would
> > require some large changes to each filesystem, though, unless we were to
> > double-cache the Merkle tree blocks which would be inefficient.
> 
> No, that's unnecessary.
> 
> All we need is for fsverity to require filesystems to pass it byte
> addressable data buffers that are externally reference counted. The
> filesystem can take a page reference before mapping the page and
> passing the kaddr to fsverity, then unmap and drop the reference
> when the merkle tree walk is done as per Andrey's new drop callout.
> 
> fsverity doesn't need to care what the buffer is made from, how it
> is cached, what it's life cycle is, etc. The caching mechanism and
> reference counting is entirely controlled by the filesystem callout
> implementations, and fsverity only needs to deal with memory buffers
> that are guaranteed to live for the entire walk of the merkle
> tree

Sure.  Just a couple notes:

First, fs/verity/ does still need to be able to tell whether the buffer is newly
instantiated or not.

Second, fs/verity/ uses the ahash API to do the hashing.  ahash is a
scatterlist-based API.  Virtual addresses can still be used (see sg_set_buf()),
but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to work.
Does XFS use vmalloc'ed memory for these buffers?
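
For context on the virt_to_page() constraint: sg_set_buf() is roughly just
the following, which only works for directly-mapped (kmalloc'ed) addresses,
not vmalloc'ed or vmapped ones:

	static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
				      unsigned int buflen)
	{
		/* virt_to_page() is only valid for the direct mapping */
		sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
	}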

BTW, converting fs/verity/ from ahash to shash is an option; I've really never
been a fan of the scatterlist-based crypto APIs!  The disadvantage of doing
this, though, would be that it would remove support for all the hardware crypto
drivers.  That *might* actually be okay, as that approach to crypto acceleration
has mostly fallen out of favor, in favor of CPU-based acceleration.  But I do
worry about e.g. someone coming out of the woodwork and saying they need to use
fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, and
they MUST use their crypto accelerator to get acceptable performance.
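
To make the trade-off concrete, here is a minimal sketch of the two APIs side
by side, assuming an already-allocated tfm/req and a linear buffer:

	/* shash: synchronous, takes a virtual address directly */
	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		err = crypto_shash_digest(desc, data, len, out);
	}

	/* ahash: asynchronous and scatterlist-based, so the buffer must be
	 * usable with virt_to_page() and completion must be waited for */
	{
		struct scatterlist sg;
		DECLARE_CRYPTO_WAIT(wait);

		sg_init_one(&sg, data, len);
		ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
					   crypto_req_done, &wait);
		ahash_request_set_crypt(req, &sg, out, len);
		err = crypto_wait_req(crypto_ahash_digest(req), &wait);
	}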

- Eric



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Eric Biggers
On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote:
> > The merkle tree pages are dropped after verification. When a page is
> > dropped, the xfs_buf is marked as verified. If fs-verity wants to
> > verify again, it will get the same verified buffer. If the buffer is
> > evicted, it won't have the verified state.
> > 
> > So, with enough memory pressure buffers will be dropped and need to
> > be reverified.
> 
> Please excuse me if this was discussed and rejected long ago, but
> perhaps fsverity should try to hang on to the merkle tree pages that
> this function returns for as long as possible until reclaim comes for
> them?
> 
> With the merkle tree page lifetimes extended, you then don't need to
> attach the xfs_buf to page->private, nor does xfs have to extend the
> buffer cache to stash XBF_VERITY_CHECKED.

Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs)
just cache the Merkle tree pages in the inode's page cache.  It's an approach
that I know some people aren't a fan of, but it's efficient and it works.
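
For reference, a simplified sketch of the ext4 approach (the real
ext4_read_merkle_tree_page() in fs/ext4/verity.c also does readahead and a
cache lookup first): the tree is stored past EOF in the file itself, so the
blocks come straight out of the inode's page cache:

	static struct page *ext4_read_merkle_tree_page(struct inode *inode,
						       pgoff_t index,
						       unsigned long num_ra_pages)
	{
		/* Merkle tree pages live past i_size in i_mapping */
		index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
		return read_mapping_page(inode->i_mapping, index, NULL);
	}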

We could certainly think about moving to a design where fs/verity/ asks the
filesystem to just *read* a Merkle tree block, without adding it to a cache, and
then fs/verity/ implements the caching itself.  That would require some large
changes to each filesystem, though, unless we were to double-cache the Merkle
tree blocks which would be inefficient.

So it feels like continuing to have the filesystem (not fs/verity/) be
responsible for the cache is the best way to allow XFS to do things a bit
differently, without regressing the other filesystems.

I'm interested in hearing any other proposals, though.

- Eric



Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Eric Biggers
On Wed, Apr 05, 2023 at 08:50:10AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote:
> > Thinking about this a little more -- I suppose we shouldn't just go
> > breaking directio reads from a verity file if we can help it.  Is there
> > a way to ask fsverity to perform its validation against some arbitrary
> > memory buffer that happens to be fs-block aligned?

You could certainly add such a function that wraps around verify_data_block().
The minimal function prototype needed (without supporting readahead or reusing
the ahash_request) would be something like the following, I think:

bool fsverity_verify_blocks_dio(struct inode *inode, u64 pos,
struct folio *folio,
size_t len, size_t offset);

And I really hope that you don't want to do DIO to the *Merkle tree*, as that
would make the problem significantly harder.  I think DIO for the data, but
handling the Merkle tree in the usual way, would be okay?
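
For illustration, a call site somewhere in the XFS direct read path might look
something like this (fsverity_verify_blocks_dio() is only the prototype
sketched above, not an existing API):

	if (fsverity_active(inode) &&
	    !fsverity_verify_blocks_dio(inode, pos, folio, len, offset))
		return -EIO;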

> 
> That would be my preference as well.  But maybe Eric knows a good reason
> why this hasn't been done yet.
> 

I believe it would be possible, especially if DIO to the Merkle tree is not in
scope.  There just hasn't been a reason to do the work yet.  And ext4 and f2fs
already fall back to buffer I/O for other filesystem features, so there was
precedent for not bothering with DIO, at least in the initial version.

- Eric



Re: [Cluster-devel] [PATCH v2 05/23] fsverity: make fsverity_verify_folio() accept folio's offset and size

2023-04-05 Thread Eric Biggers
On Wed, Apr 05, 2023 at 08:46:45AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 05, 2023 at 12:36:42PM +0200, Andrey Albershteyn wrote:
> > Hi Christoph,
> > 
> > On Tue, Apr 04, 2023 at 08:30:36AM -0700, Christoph Hellwig wrote:
> > > On Tue, Apr 04, 2023 at 04:53:01PM +0200, Andrey Albershteyn wrote:
> > > > Not the whole folio always needs to be verified by fs-verity (e.g.
> > > > with 1k blocks). Use the passed folio's offset and size.
> > > 
> > > Why can't those callers just call fsverity_verify_blocks directly?
> > > 
> > 
> > They can. Calling _verify_folio with explicit offset; size appeared
> > more clear to me. But I'm ok with dropping this patch to have full
> > folio verify function.
> 
> Well, there is no point in a wrapper if it has the exact same signature
> and functionality as the functionality being wrapped.
> 
> That being said, right now fsverity_verify_folio is unused, so it might make sense
> to either rename it, or rename fsverity_verify_blocks to
> fsverity_verify_folio.  But that's really a question for Eric.

I thought it would be confusing for fsverity_verify_folio() to not actually
verify a whole folio.  So, for now we have:

fsverity_verify_page: verify a whole page
fsverity_verify_folio: verify a whole folio
fsverity_verify_blocks: verify a range of blocks in a folio

IMO that makes sense.  Note: fsverity_verify_folio() is currently unused, but
ext4 might use it.

So, just use fsverity_verify_blocks().
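
For reference, the current helpers look roughly like this (check
include/linux/fsverity.h for the exact definitions):

	bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);

	static inline bool fsverity_verify_folio(struct folio *folio)
	{
		return fsverity_verify_blocks(folio, folio_size(folio), 0);
	}

	static inline bool fsverity_verify_page(struct page *page)
	{
		return fsverity_verify_blocks(page_folio(page), PAGE_SIZE, 0);
	}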

- Eric



Re: [Cluster-devel] [PATCH v2 00/23] fs-verity support for XFS

2023-04-04 Thread Eric Biggers
On Tue, Apr 04, 2023 at 04:52:56PM +0200, Andrey Albershteyn wrote:
> The patchset is tested with xfstests -g auto on xfs_1k, xfs_4k,
> xfs_1k_quota, and xfs_4k_quota. Haven't found any major failures.

Just to double check, did you verify that the tests in the "verity" group are
running, and were not skipped?

- Eric



Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-04 Thread Eric Biggers
Hi Andrey,

On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> When the Merkle tree block size differs from the page size, fs-verity
> expects ->read_merkle_tree_page() to return a Merkle tree page filled with
> Merkle tree blocks. XFS stores each Merkle tree block under an
> extended attribute. Those attributes are addressed by block offset
> into the Merkle tree.
> 
> This patch makes ->read_merkle_tree_page() fetch multiple Merkle
> tree blocks based on the size ratio. Also, a reference to each xfs_buf
> is passed with page->private to ->drop_page().
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/xfs_verity.c | 74 +++--
>  fs/xfs/xfs_verity.h |  8 +
>  2 files changed, 66 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> index a9874ff4efcd..ef0aff216f06 100644
> --- a/fs/xfs/xfs_verity.c
> +++ b/fs/xfs/xfs_verity.c
> @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
>   struct page *page = NULL;
>   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
>   uint32_tbs = 1 << log_blocksize;
> + int blocks_per_page =
> + (1 << (PAGE_SHIFT - log_blocksize));
> + int n = 0;
> + int offset = 0;
>   struct xfs_da_args  args = {
>   .dp = ip,
>   .attr_filter= XFS_ATTR_VERITY,
> @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
>   .valuelen   = bs,
>   };
>   int error = 0;
> + boolis_checked = true;
> + struct xfs_verity_buf_list  *buf_list;
>  
>   page = alloc_page(GFP_KERNEL);
>   if (!page)
>   return ERR_PTR(-ENOMEM);
>  
> - error = xfs_attr_get(&args);
> - if (error) {
> - kmem_free(args.value);
> - xfs_buf_rele(args.bp);
> + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> + if (!buf_list) {
>   put_page(page);
> - return ERR_PTR(-EFAULT);
> + return ERR_PTR(-ENOMEM);
>   }
>  
> - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> + /*
> +  * Fill the page with Merkle tree blocks. The blocks_per_page is higher
> +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> +  * PAGE_SIZE
> +  */
> + for (n = 0; n < blocks_per_page; n++) {
> + offset = bs * n;
> + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));
> + args.name = (const uint8_t *)&name;
> +
> + error = xfs_attr_get(&args);
> + if (error) {
> + kmem_free(args.value);
> + /*
> +  * No more Merkle tree blocks (e.g. this was the last
> +  * block of the tree)
> +  */
> + if (error == -ENOATTR)
> + break;
> + xfs_buf_rele(args.bp);
> + put_page(page);
> + kmem_free(buf_list);
> + return ERR_PTR(-EFAULT);
> + }
> +
> + buf_list->bufs[buf_list->buf_count++] = args.bp;
> +
> + /* One of the buffers was dropped */
> + if (!(args.bp->b_flags & XBF_VERITY_CHECKED))
> + is_checked = false;
> +
> + memcpy(page_address(page) + offset, args.value, args.valuelen);
> + kmem_free(args.value);
> + args.value = NULL;
> + }

I was really hoping for a solution where the cached data can be used directly,
instead of allocating a temporary page and copying the cached data into it every
time the cache is accessed.  The problem with what you have now is that every
time a single 32-byte hash is accessed, a full page (potentially 64KB!) will be
allocated and filled.  That's not very efficient.  The need to allocate the
temporary page can also cause ENOMEM (which will get reported as EIO).

Did you consider alternatives that would work more efficiently?  I think it
would be worth designing something that works properly with how XFS is planned
to cache the Merkle tree, instead of designing a workaround.
->read_merkle_tree_page was not really designed for what you are doing here.

How about replacing ->read_merkle_tree_page with a function that takes in a
Merkle tree block index (not a page index!) and hands back a (page, offset) pair
that identifies where the Merkle tree block's data is located?  Or (folio,
offset), I suppose.

With that, would it be possible to directly return the XFS cache?
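
To illustrate, such a callback could look like the following (the names here
are hypothetical, not an existing fs/verity/ interface):

	struct fsverity_blockbuf {
		struct folio	*folio;		/* cache object holding the block */
		size_t		offset;		/* offset of the block in the folio */
		void		*fs_private;	/* e.g. an xfs_buf to release in ->drop */
	};

	int (*read_merkle_tree_block)(struct inode *inode, u64 block_index,
				      unsigned long num_ra_blocks,
				      struct fsverity_blockbuf *buf);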

- Eric



Re: [Cluster-devel] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag

2023-04-04 Thread Eric Biggers
Hi Andrey,

On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote:
> Add flag to mark inodes which have fs-verity enabled on them (i.e.
> descriptor exist and tree is built).
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/ioctl.c | 4 
>  fs/xfs/libxfs/xfs_format.h | 4 +++-
>  fs/xfs/xfs_inode.c | 2 ++
>  fs/xfs/xfs_iops.c  | 2 ++
>  include/uapi/linux/fs.h| 1 +
>  5 files changed, 12 insertions(+), 1 deletion(-)
[...]
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..5172a2eb902c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -140,6 +140,7 @@ struct fsxattr {
>  #define FS_XFLAG_FILESTREAM  0x4000  /* use filestream allocator */
>  #define FS_XFLAG_DAX 0x8000  /* use DAX for IO */
>  #define FS_XFLAG_COWEXTSIZE  0x0001  /* CoW extent size allocator 
> hint */
> +#define FS_XFLAG_VERITY  0x0002  /* fs-verity sealed 
> inode */
>  #define FS_XFLAG_HASATTR 0x8000  /* no DIFLAG for this   */
>  

I don't think "xfs: add inode on-disk VERITY flag" is an accurate description of
a patch that involves adding something to the UAPI.

Should the other filesystems support this new flag too?

I'd also like all ways of getting the verity flag to continue to be mentioned in
Documentation/filesystems/fsverity.rst.  The existing methods (FS_IOC_GETFLAGS
and statx) are already mentioned there.
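
For example, querying the flag from userspace via statx(), one of the two
existing methods mentioned above, is just:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/stat.h>

	/* Returns 1 if the file has fs-verity enabled, 0 if not, -1 on error. */
	static int is_verity(const char *path)
	{
		struct statx stx;

		if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
			return -1;
		return (stx.stx_attributes & STATX_ATTR_VERITY) != 0;
	}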

- Eric



Re: [Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Eric Biggers
On Fri, Feb 25, 2022 at 02:24:35PM -0500, Theodore Ts'o wrote:
> [un]pin_user_pages_remote is dirtying pages without properly warning
> the file system in advance (or faulting in the file data if the page
> is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
> and more recently has resulted in bug reports by Syzbot in various
> Android kernels[2].
> 
> This is technically a bug in the mm/gup.c codepath, but arguably ext4
> is fragile in that a buggy get_user_pages() implementation causes ext4
> to crash, where as other file systems are not crashing (although in
> some cases the user data will be lost since gup code is not properly
> informing the file system to potentially allocate blocks or reserve
> space when writing into a sparse portion of file).  I suspect in real
> life it is rare that people are using RDMA into file-backed memory,
> which is why no one has complained to ext4 developers except fuzzing
> programs.
> 
> So instead of crashing with a BUG, issue a warning (since there may be
> potential data loss) and just mark the page as clean to avoid
> unprivileged denial of service attacks until the problem can be
> properly fixed.  More discussion and background can be found in the
> thread starting at [2].
> 
> [1] https://www.spinics.net/lists/linux-mm/msg142700.html

Can you use a lore link
(https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz/T/#u)?

> + /*
> +  * Should never happen but for buggy code in
> +  * other subsystemsa that call

subsystemsa => subsystems

> +  * set_page_dirty() without properly warning
> +  * the file system first.  See [1] for more
> +  * information.
> +  *
> +  * [1] 
> https://www.spinics.net/lists/linux-mm/msg142700.html

Likewise, lore link here.

- Eric



Re: [Cluster-devel] [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Eric Biggers
On Sun, Oct 11, 2020 at 11:56:35PM -0700, Ira Weiny wrote:
> > 
> > And I still don't really understand.  After this patchset, there is still
> > code nearly identical to the above (doing a temporary mapping just for a
> > memcpy) that would still be using kmap_atomic().
> 
> I don't understand.  You mean there would be other call sites calling:
> 
> kmap_atomic()
> memcpy()
> kunmap_atomic()

Yes, there are tons of places that do this.  Try 'git grep -A6 kmap_atomic'
and look for memcpy().

Hence why I'm asking what will be the "recommended" way to do this...
kmap_thread() or kmap_atomic()?

> And since I don't know the call site details if there are kmap_thread() calls
> which are better off as kmap_atomic() calls I think it is worth converting
> them.  But I made the assumption that kmap users would already be calling
> kmap_atomic() if they could (because it is more efficient).

Not necessarily.  In cases where either one is correct, people might not have
put much thought into which of kmap() and kmap_atomic() they are using.

- Eric



Re: [Cluster-devel] [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-09 Thread Eric Biggers
On Sat, Oct 10, 2020 at 01:39:54AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 09, 2020 at 02:34:34PM -0700, Eric Biggers wrote:
> > On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > @@ -2410,12 +2410,12 @@ static inline struct page 
> > > *f2fs_pagecache_get_page(
> > >  
> > >  static inline void f2fs_copy_page(struct page *src, struct page *dst)
> > >  {
> > > - char *src_kaddr = kmap(src);
> > > - char *dst_kaddr = kmap(dst);
> > > + char *src_kaddr = kmap_thread(src);
> > > + char *dst_kaddr = kmap_thread(dst);
> > >  
> > >   memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > > - kunmap(dst);
> > > - kunmap(src);
> > > + kunmap_thread(dst);
> > > + kunmap_thread(src);
> > >  }
> > 
> > Wouldn't it make more sense to switch cases like this to kmap_atomic()?
> > The pages are only mapped to do a memcpy(), then they're immediately 
> > unmapped.
> 
> Maybe you missed the earlier thread from Thomas trying to do something
> similar for rather different reasons ...
> 
> https://lore.kernel.org/lkml/20200919091751.06...@linutronix.de/

I did miss it.  I'm not subscribed to any of the mailing lists it was sent to.

Anyway, it shouldn't matter.  Patchsets should be standalone, and not require
reading random prior threads on linux-kernel to understand.

And I still don't really understand.  After this patchset, there is still code
nearly identical to the above (doing a temporary mapping just for a memcpy) that
would still be using kmap_atomic().  Is the idea that later, such code will be
converted to use kmap_thread() instead?  If not, why use one over the other?

- Eric



Re: [Cluster-devel] [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-09 Thread Eric Biggers
On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> The kmap() calls in this FS are localized to a single thread.  To avoid
> the over head of global PKRS updates use the new kmap_thread() call.
> 
> Cc: Jaegeuk Kim 
> Cc: Chao Yu 
> Signed-off-by: Ira Weiny 
> ---
>  fs/f2fs/f2fs.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index d9e52a7f3702..ff72a45a577e 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2410,12 +2410,12 @@ static inline struct page *f2fs_pagecache_get_page(
>  
>  static inline void f2fs_copy_page(struct page *src, struct page *dst)
>  {
> - char *src_kaddr = kmap(src);
> - char *dst_kaddr = kmap(dst);
> + char *src_kaddr = kmap_thread(src);
> + char *dst_kaddr = kmap_thread(dst);
>  
>   memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> - kunmap(dst);
> - kunmap(src);
> + kunmap_thread(dst);
> + kunmap_thread(src);
>  }

Wouldn't it make more sense to switch cases like this to kmap_atomic()?
The pages are only mapped to do a memcpy(), then they're immediately unmapped.
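
Something like this, a minimal sketch of the kmap_atomic() variant (note that
kunmap_atomic() takes the mapped address, not the page, and the unmaps must
nest in reverse order):

	static inline void f2fs_copy_page(struct page *src, struct page *dst)
	{
		char *src_kaddr = kmap_atomic(src);
		char *dst_kaddr = kmap_atomic(dst);

		memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
		kunmap_atomic(dst_kaddr);
		kunmap_atomic(src_kaddr);
	}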

- Eric



Re: [Cluster-devel] [PATCH v10 12/25] mm: Move end_index check out of readahead loop

2020-03-23 Thread Eric Biggers
On Mon, Mar 23, 2020 at 01:22:46PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By reducing nr_to_read, we can eliminate this check from inside the loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: John Hubbard 
> Reviewed-by: William Kucharski 
> ---
>  mm/readahead.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index d01531ef9f3c..998fdd23c0b1 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -167,8 +167,6 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
> - unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> @@ -178,22 +176,26 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   ._index = index,
>   };
>   unsigned long i;
> + pgoff_t end_index;  /* The last page we want to read */
>  
>   if (isize == 0)
>   return;
>  
> - end_index = ((isize - 1) >> PAGE_SHIFT);
> + end_index = (isize - 1) >> PAGE_SHIFT;
> + if (index > end_index)
> + return;
> + /* Don't read past the page containing the last byte of the file */
> + if (nr_to_read > end_index - index)
> + nr_to_read = end_index - index + 1;
>  
>   /*
>* Preallocate as many pages as we will need.
>*/
>   for (i = 0; i < nr_to_read; i++) {
> - if (index + i > end_index)
> - break;
> + struct page *page = xa_load(&mapping->i_pages, index + i);
>  
>   BUG_ON(index + i != rac._index + rac._nr_pages);
>  
> - page = xa_load(&mapping->i_pages, index + i);
>   if (page && !xa_is_value(page)) {
>   /*
>* Page already present?  Kick off the current batch of
> -- 

Reviewed-by: Eric Biggers 

- Eric




Re: [Cluster-devel] [PATCH v9 23/25] f2fs: Pass the inode to f2fs_mpage_readpages

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 07:22:29AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This function now only uses the mapping argument to look up the inode,
> and both callers already have the inode, so just pass the inode instead
> of the mapping.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: William Kucharski 
> ---
>  fs/f2fs/data.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 237dff36fe73..c8b042979fc4 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -2159,12 +2159,11 @@ int f2fs_read_multi_pages(struct compress_ctx *cc, 
> struct bio **bio_ret,
>   * use ->readpage() or do the necessary surgery to decouple ->readpages()
>   * from read-ahead.
>   */
> -static int f2fs_mpage_readpages(struct address_space *mapping,
> +static int f2fs_mpage_readpages(struct inode *inode,
>   struct readahead_control *rac, struct page *page)
>  {
>   struct bio *bio = NULL;
>   sector_t last_block_in_bio = 0;
> - struct inode *inode = mapping->host;
>   struct f2fs_map_blocks map;
>  #ifdef CONFIG_F2FS_FS_COMPRESSION
>   struct compress_ctx cc = {
> @@ -2276,7 +2275,7 @@ static int f2fs_read_data_page(struct file *file, 
> struct page *page)
>   if (f2fs_has_inline_data(inode))
>   ret = f2fs_read_inline_data(inode, page);
>   if (ret == -EAGAIN)
> - ret = f2fs_mpage_readpages(page_file_mapping(page), NULL, page);
> + ret = f2fs_mpage_readpages(inode, NULL, page);
>   return ret;
>  }
>  
> @@ -2293,7 +2292,7 @@ static void f2fs_readahead(struct readahead_control 
> *rac)
>   if (f2fs_has_inline_data(inode))
>   return;
>  
> - f2fs_mpage_readpages(rac->mapping, rac, NULL);
> + f2fs_mpage_readpages(inode, rac, NULL);
>  }
>  
>  int f2fs_encrypt_one_page(struct f2fs_io_info *fio)
> -- 

Reviewed-by: Eric Biggers 

- Eric




Re: [Cluster-devel] [PATCH v9 22/25] f2fs: Convert from readpages to readahead

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 07:22:28AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in f2fs
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: William Kucharski 
> ---
>  fs/f2fs/data.c  | 47 +++--
>  include/trace/events/f2fs.h |  6 ++---
>  2 files changed, 22 insertions(+), 31 deletions(-)
> 

Reviewed-by: Eric Biggers 

> @@ -2210,7 +2204,7 @@ static int f2fs_mpage_readpages(struct address_space 
> *mapping,
>   ret = f2fs_read_multi_pages(&cc, &bio,
>   max_nr_pages,
>   &last_block_in_bio,
> - is_readahead);
> + rac);

IMO it would be clearer to write 'rac != NULL' here (and below) since the
argument is actually a bool, but this works too.

- Eric




Re: [Cluster-devel] [PATCH v9 21/25] ext4: Pass the inode to ext4_mpage_readpages

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 07:22:27AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This function now only uses the mapping argument to look up the inode,
> and both callers already have the inode, so just pass the inode instead
> of the mapping.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: William Kucharski 
> ---
>  fs/ext4/ext4.h | 2 +-
>  fs/ext4/inode.c| 4 ++--
>  fs/ext4/readpage.c | 3 +--
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1570a0b51b73..bc1b34ba6eab 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3278,7 +3278,7 @@ static inline void ext4_set_de_type(struct super_block 
> *sb,
>  }
>  
>  /* readpages.c */
> -extern int ext4_mpage_readpages(struct address_space *mapping,
> +extern int ext4_mpage_readpages(struct inode *inode,
>   struct readahead_control *rac, struct page *page);
>  extern int __init ext4_init_post_read_processing(void);
>  extern void ext4_exit_post_read_processing(void);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d674c5f9066c..4f3703c1408d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3226,7 +3226,7 @@ static int ext4_readpage(struct file *file, struct page 
> *page)
>   ret = ext4_readpage_inline(inode, page);
>  
>   if (ret == -EAGAIN)
> - return ext4_mpage_readpages(page->mapping, NULL, page);
> + return ext4_mpage_readpages(inode, NULL, page);
>  
>   return ret;
>  }
> @@ -3239,7 +3239,7 @@ static void ext4_readahead(struct readahead_control 
> *rac)
>   if (ext4_has_inline_data(inode))
>   return;
>  
> - ext4_mpage_readpages(rac->mapping, rac, NULL);
> + ext4_mpage_readpages(inode, rac, NULL);
>  }
>  
>  static void ext4_invalidatepage(struct page *page, unsigned int offset,
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index 66275f25235d..5761e9961682 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -221,13 +221,12 @@ static inline loff_t ext4_readpage_limit(struct inode 
> *inode)
>   return i_size_read(inode);
>  }
>  
> -int ext4_mpage_readpages(struct address_space *mapping,
> +int ext4_mpage_readpages(struct inode *inode,
>   struct readahead_control *rac, struct page *page)
>  {
>   struct bio *bio = NULL;
>   sector_t last_block_in_bio = 0;
>  
> -     struct inode *inode = mapping->host;
>   const unsigned blkbits = inode->i_blkbits;
>   const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
>   const unsigned blocksize = 1 << blkbits;
> -- 

Reviewed-by: Eric Biggers 

- Eric




Re: [Cluster-devel] [PATCH v9 20/25] ext4: Convert from readpages to readahead

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 10:48:48AM -0700, Matthew Wilcox wrote:
> On Fri, Mar 20, 2020 at 10:37:34AM -0700, Eric Biggers wrote:
> > On Fri, Mar 20, 2020 at 07:22:26AM -0700, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 
> > > 
> > > Use the new readahead operation in ext4
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > Reviewed-by: William Kucharski 
> > > ---
> > >  fs/ext4/ext4.h |  3 +--
> > >  fs/ext4/inode.c| 21 +
> > >  fs/ext4/readpage.c | 22 ++++--
> > >  3 files changed, 18 insertions(+), 28 deletions(-)
> > > 
> > 
> > Reviewed-by: Eric Biggers 
> > 
> > > + if (rac) {
> > > + page = readahead_page(rac);
> > >   prefetchw(&page->flags);
> > > - list_del(&page->lru);
> > > - if (add_to_page_cache_lru(page, mapping, page->index,
> > > -   readahead_gfp_mask(mapping)))
> > > - goto next_page;
> > >   }
> > 
> > Maybe the prefetchw(>flags) should be included in readahead_page()?
> > Most of the callers do it.
> 
> I did notice that a lot of callers do that.  I wonder whether it (still)
> helps or whether it's just cargo-cult programming.  It can't possibly
have helped before because we did list_del(&page->lru) as the very next
> instruction after prefetchw(), and they're in the same cacheline.  It'd
> be interesting to take it out and see what happens to performance.

Yeah, it does look like the list_del() made the prefetchw() useless, so it
should just be removed.  The prefetchw() dates all the way back to
mpage_readpages() being added in 2002, but even then the list_del() was
immediately afterwards, and 'flags' and 'lru' were in the same cache line in
'struct page' even then (assuming at least a 32-byte cache line size), so...

- Eric




Re: [Cluster-devel] [PATCH v9 12/25] mm: Move end_index check out of readahead loop

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 11:11:32AM -0700, Matthew Wilcox wrote:
> On Fri, Mar 20, 2020 at 11:00:17AM -0700, Eric Biggers wrote:
> > On Fri, Mar 20, 2020 at 10:30:40AM -0700, Matthew Wilcox wrote:
> > > On Fri, Mar 20, 2020 at 09:58:28AM -0700, Eric Biggers wrote:
> > > > On Fri, Mar 20, 2020 at 07:22:18AM -0700, Matthew Wilcox wrote:
> > > > > + /* Avoid wrapping to the beginning of the file */
> > > > > + if (index + nr_to_read < index)
> > > > > + nr_to_read = ULONG_MAX - index + 1;
> > > > > + /* Don't read past the page containing the last byte of the file */
> > > > > + if (index + nr_to_read >= end_index)
> > > > > + nr_to_read = end_index - index + 1;
> > > > 
> > > > There seem to be a couple off-by-one errors here.  Shouldn't it be:
> > > > 
> > > > /* Avoid wrapping to the beginning of the file */
> > > > if (index + nr_to_read < index)
> > > > nr_to_read = ULONG_MAX - index;
> > > 
> > > I think it's right.  Imagine that index is ULONG_MAX.  We should read one
> > > page (the one at ULONG_MAX).  That would be ULONG_MAX - ULONG_MAX + 1.
> > > 
> > > > /* Don't read past the page containing the last byte of the file */
> > > > if (index + nr_to_read > end_index)
> > > > nr_to_read = end_index - index + 1;
> > > > 
> > > > I.e., 'ULONG_MAX - index' rather than 'ULONG_MAX - index + 1', so that
> > > > 'index + nr_to_read' is then ULONG_MAX rather than overflowed to 0.
> > > > 
> > > > Then 'index + nr_to_read > end_index' rather than 'index + nr_to_read >=
> > > > end_index', since otherwise nr_to_read can be increased by 1 rather than
> > > > decreased or stay the same as expected.
> > > 
> > > Ooh, I missed the overflow case here.  It should be:
> > > 
> > > + if (index + nr_to_read - 1 > end_index)
> > > + nr_to_read = end_index - index + 1;
> > > 
> > 
> > But then if someone passes index=0 and nr_to_read=0, this underflows and the
> > entire file gets read.
> 
> nr_to_read == 0 doesn't make sense ... I thought we filtered that out
> earlier, but I can't find anywhere that does that right now.  I'd
> rather return early from __do_page_cache_readahead() to fix that.
> 
> > The page cache isn't actually supposed to contain a page at index ULONG_MAX,
> > since MAX_LFS_FILESIZE is at most ((loff_t)ULONG_MAX << PAGE_SHIFT), right?
> > So I don't think we need to worry about reading the page with index
> > ULONG_MAX.  I.e. I think it's fine to limit nr_to_read to 'ULONG_MAX -
> > index', if that makes it easier to avoid an overflow or underflow in the
> > next check.
> 
> I think we can get a page at ULONG_MAX on 32-bit systems?  I mean, we can buy
> hard drives which are larger than 16TiB these days:
> https://www.pcmag.com/news/seagate-will-ship-18tb-and-20tb-hard-drives-in-2020
> (even ignoring RAID devices)

The max file size is ((loff_t)ULONG_MAX << PAGE_SHIFT) which means the maximum
page *index* is ULONG_MAX - 1, not ULONG_MAX.

Anyway, I think we may be making this much too complicated.  How about just:

pgoff_t i_nrpages = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

if (index >= i_nrpages)
return;
/* Don't read past the end of the file */
nr_to_read = min(nr_to_read, i_nrpages - index);

That's 2 branches instead of 4.  (Note that assigning to i_nrpages can't
overflow, since the max number of pages is ULONG_MAX not ULONG_MAX + 1.)
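
To sanity-check the boundaries (assuming 4 KiB pages): isize = 1 gives
i_nrpages = 1, so index 0 reads at most one page and index 1 returns early;
isize = PAGE_SIZE behaves the same; isize = PAGE_SIZE + 1 gives i_nrpages = 2,
so index 1 reads at most one page.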

- Eric




Re: [Cluster-devel] [PATCH v9 12/25] mm: Move end_index check out of readahead loop

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 10:30:40AM -0700, Matthew Wilcox wrote:
> On Fri, Mar 20, 2020 at 09:58:28AM -0700, Eric Biggers wrote:
> > On Fri, Mar 20, 2020 at 07:22:18AM -0700, Matthew Wilcox wrote:
> > > + /* Avoid wrapping to the beginning of the file */
> > > + if (index + nr_to_read < index)
> > > + nr_to_read = ULONG_MAX - index + 1;
> > > + /* Don't read past the page containing the last byte of the file */
> > > + if (index + nr_to_read >= end_index)
> > > + nr_to_read = end_index - index + 1;
> > 
> > There seem to be a couple off-by-one errors here.  Shouldn't it be:
> > 
> > /* Avoid wrapping to the beginning of the file */
> > if (index + nr_to_read < index)
> > nr_to_read = ULONG_MAX - index;
> 
> I think it's right.  Imagine that index is ULONG_MAX.  We should read one
> page (the one at ULONG_MAX).  That would be ULONG_MAX - ULONG_MAX + 1.
> 
> > /* Don't read past the page containing the last byte of the file */
> > if (index + nr_to_read > end_index)
> > nr_to_read = end_index - index + 1;
> > 
> > I.e., 'ULONG_MAX - index' rather than 'ULONG_MAX - index + 1', so that
> > 'index + nr_to_read' is then ULONG_MAX rather than overflowed to 0.
> > 
> > Then 'index + nr_to_read > end_index' rather than 'index + nr_to_read >=
> > end_index', since otherwise nr_to_read can be increased by 1 rather than
> > decreased or stay the same as expected.
> 
> Ooh, I missed the overflow case here.  It should be:
> 
> + if (index + nr_to_read - 1 > end_index)
> + nr_to_read = end_index - index + 1;
> 

But then if someone passes index=0 and nr_to_read=0, this underflows and the
entire file gets read.

The page cache isn't actually supposed to contain a page at index ULONG_MAX,
since MAX_LFS_FILESIZE is at most ((loff_t)ULONG_MAX << PAGE_SHIFT), right?  So
I don't think we need to worry about reading the page with index ULONG_MAX.
I.e. I think it's fine to limit nr_to_read to 'ULONG_MAX - index', if that makes
it easier to avoid an overflow or underflow in the next check.

- Eric




Re: [Cluster-devel] [PATCH v9 13/25] mm: Add page_cache_readahead_unbounded

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 07:22:19AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> ext4 and f2fs have duplicated the guts of the readahead code so
> they can read past i_size.  Instead, separate out the guts of the
> readahead code so they can call it directly.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: William Kucharski 

Reviewed-by: Eric Biggers 
Tested-by: Eric Biggers 

- Eric




Re: [Cluster-devel] [PATCH v9 12/25] mm: Move end_index check out of readahead loop

2020-03-20 Thread Eric Biggers
On Fri, Mar 20, 2020 at 07:22:18AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By reducing nr_to_read, we can eliminate this check from inside the loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: John Hubbard 
> Reviewed-by: William Kucharski 
> ---
>  mm/readahead.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index d01531ef9f3c..a37b68f66233 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -167,8 +167,6 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
> - unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> @@ -178,22 +176,29 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   ._index = index,
>   };
>   unsigned long i;
> + pgoff_t end_index;  /* The last page we want to read */
>  
>   if (isize == 0)
>   return;
>  
> - end_index = ((isize - 1) >> PAGE_SHIFT);
> + end_index = (isize - 1) >> PAGE_SHIFT;
> + if (index > end_index)
> + return;
> + /* Avoid wrapping to the beginning of the file */
> + if (index + nr_to_read < index)
> + nr_to_read = ULONG_MAX - index + 1;
> + /* Don't read past the page containing the last byte of the file */
> + if (index + nr_to_read >= end_index)
> + nr_to_read = end_index - index + 1;

There seem to be a couple off-by-one errors here.  Shouldn't it be:

/* Avoid wrapping to the beginning of the file */
if (index + nr_to_read < index)
nr_to_read = ULONG_MAX - index;
/* Don't read past the page containing the last byte of the file */
if (index + nr_to_read > end_index)
nr_to_read = end_index - index + 1;

I.e., 'ULONG_MAX - index' rather than 'ULONG_MAX - index + 1', so that
'index + nr_to_read' is then ULONG_MAX rather than overflowed to 0.

Then 'index + nr_to_read > end_index' rather than 'index + nr_to_read >= end_index',
since otherwise nr_to_read can be increased by 1 rather than decreased or stay
the same as expected.

- Eric




Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation

2020-02-19 Thread Eric Biggers
On Tue, Feb 18, 2020 at 07:10:44PM -0800, Eric Biggers wrote:
> On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> > diff --git a/Documentation/filesystems/vfs.rst 
> > b/Documentation/filesystems/vfs.rst
> > index 7d4d09dd5e6d..81ab30fbe45c 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are 
> > defined:
> > int (*readpage)(struct file *, struct page *);
> > int (*writepages)(struct address_space *, struct 
> > writeback_control *);
> > int (*set_page_dirty)(struct page *page);
> > +   void (*readahead)(struct readahead_control *);
> > int (*readpages)(struct file *filp, struct address_space 
> > *mapping,
> >  struct list_head *pages, unsigned nr_pages);
> > int (*write_begin)(struct file *, struct address_space *mapping,
> > @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are 
> > defined:
> > If defined, it should set the PageDirty flag, and the
> > PAGECACHE_TAG_DIRTY tag in the radix tree.
> >  
> > +``readahead``
> > +   Called by the VM to read pages associated with the address_space
> > +   object.  The pages are consecutive in the page cache and are
> > +   locked.  The implementation should decrement the page refcount
> > +   after starting I/O on each page.  Usually the page will be
> > +   unlocked by the I/O completion handler.  If the function does
> > +   not attempt I/O on some pages, the caller will decrement the page
> > +   refcount and unlock the pages for you.  Set PageUptodate if the
> > +   I/O completes successfully.  Setting PageError on any page will
> > +   be ignored; simply unlock the page if an I/O error occurs.
> > +
> 
> This is unclear about how "not attempting I/O" works and how that affects who
> is responsible for putting and unlocking the pages.  How does the caller know
> which pages were not attempted?  Can any arbitrary subset of pages be not
> attempted, or just the last N pages?
> 
> In the code, the caller actually uses readahead_for_each() to iterate through
> and put+unlock the pages.  That implies that ->readahead() must also use
> readahead_for_each() as well, in order for the iterator to be advanced
> correctly... Right?  And the ownership of each page is transferred to the
> callee when the callee advances the iterator past that page.
> 
> I don't see how ext4_readahead() and f2fs_readahead() can work at all, given
> that they don't advance the iterator.
> 

Yep, this patchset immediately crashes on boot with:

BUG: Bad page state in process swapper/0  pfn:02176
page:ea0751d0 refcount:0 mapcount:0 mapping:88807cba0400 index:0x0
ext4_da_aops name:"systemd"
flags: 0x1020001(locked|mappedtodisk)
raw: 01020001 dead0100 dead0122 88807cba0400
raw:   
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x1(locked)
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2-00019-g7203ed9018cb9 #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
?-20191223_100556-anatol 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x7a/0xaa lib/dump_stack.c:118
 bad_page.cold+0x89/0xba mm/page_alloc.c:649
 free_pages_check_bad+0x5d/0x60 mm/page_alloc.c:1050
 free_pages_check mm/page_alloc.c:1059 [inline]
 free_pages_prepare mm/page_alloc.c:1157 [inline]
 free_pcp_prepare+0x1c1/0x200 mm/page_alloc.c:1198
 free_unref_page_prepare mm/page_alloc.c:3011 [inline]
 free_unref_page+0x16/0x70 mm/page_alloc.c:3060
 __put_single_page mm/swap.c:81 [inline]
 __put_page+0x31/0x40 mm/swap.c:115
 put_page include/linux/mm.h:1029 [inline]
 ext4_mpage_readpages+0x778/0x9b0 fs/ext4/readpage.c:405
 ext4_readahead+0x2f/0x40 fs/ext4/inode.c:3242
 read_pages+0x4c/0x200 mm/readahead.c:126
 page_cache_readahead_limit+0x224/0x250 mm/readahead.c:241
 __do_page_cache_readahead mm/readahead.c:266 [inline]
 ra_submit mm/internal.h:62 [inline]
 ondemand_readahead+0x1df/0x4d0 mm/readahead.c:544
 page_cache_sync_readahead+0x2d/0x40 mm/readahead.c:579
 generic_file_buffered_read+0x77e/0xa90 mm/filemap.c:2029
 generic_file_read_iter+0xd4/0x130 mm/filemap.c:2302
 ext4_file_read_iter fs/ext4/file.c:131 [inline]
 ext4_file_read_iter+0x53/0x180 fs/ext4/file.c:114
 call_read_iter include/linux/fs.h:1897 [inline]
 new_sync_read+0x113/0x1a0 fs/read_write.c:414
 __vfs_read+0x21/0x30 fs/read_write.c:427
 vfs_read+0xcb/0x160 fs/read_write.c:461
 kernel_read+0x2c/0x40 fs/read_write.c:440
 prepare_binprm+0x14f/0x190 fs/exec.c:1

Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead

2020-02-19 Thread Eric Biggers
On Tue, Feb 18, 2020 at 07:47:41PM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 07:28:26PM -0800, Eric Biggers wrote:
> > On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> > > diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> > > index 001f1fcf9836..f4f5e90a6844 100644
> > > --- a/include/linux/mpage.h
> > > +++ b/include/linux/mpage.h
> > > @@ -13,9 +13,9 @@
> > >  #ifdef CONFIG_BLOCK
> > >  
> > >  struct writeback_control;
> > > +struct readahead_control;
> > >  
> > > -int mpage_readpages(struct address_space *mapping, struct list_head 
> > > *pages,
> > > - unsigned nr_pages, get_block_t get_block);
> > > +void mpage_readahead(struct readahead_control *, get_block_t get_block);
> > >  int mpage_readpage(struct page *page, get_block_t get_block);
> > >  int mpage_writepages(struct address_space *mapping,
> > >   struct writeback_control *wbc, get_block_t get_block);
> > 
> > Can you name the 'struct readahead_control *' parameter?
> 
> What good would that do?  I'm sick of seeing 'struct page *page'.
> Well, no shit it's a page.  Unless there's some actual information to
> convey, leave the argument unnamed.  It should be a crime to not name
> an unsigned long, but not naming the struct address_space pointer is
> entirely reasonable.

It's the coding style the community has agreed on, so it's what the tools check for.

I don't care that much myself; it just appeared like this was a mistake rather
than intentional so I thought I'd point it out.

- Eric




Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead

2020-02-19 Thread Eric Biggers
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> index 001f1fcf9836..f4f5e90a6844 100644
> --- a/include/linux/mpage.h
> +++ b/include/linux/mpage.h
> @@ -13,9 +13,9 @@
>  #ifdef CONFIG_BLOCK
>  
>  struct writeback_control;
> +struct readahead_control;
>  
> -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> - unsigned nr_pages, get_block_t get_block);
> +void mpage_readahead(struct readahead_control *, get_block_t get_block);
>  int mpage_readpage(struct page *page, get_block_t get_block);
>  int mpage_writepages(struct address_space *mapping,
>   struct writeback_control *wbc, get_block_t get_block);

Can you name the 'struct readahead_control *' parameter?

checkpatch.pl should warn about this.

- Eric




Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation

2020-02-19 Thread Eric Biggers
On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> diff --git a/Documentation/filesystems/vfs.rst 
> b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..81ab30fbe45c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are 
> defined:
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct 
> writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space 
> *mapping,
>struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are 
> defined:
>   If defined, it should set the PageDirty flag, and the
>   PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> + Called by the VM to read pages associated with the address_space
> + object.  The pages are consecutive in the page cache and are
> + locked.  The implementation should decrement the page refcount
> + after starting I/O on each page.  Usually the page will be
> + unlocked by the I/O completion handler.  If the function does
> + not attempt I/O on some pages, the caller will decrement the page
> + refcount and unlock the pages for you.  Set PageUptodate if the
> + I/O completes successfully.  Setting PageError on any page will
> + be ignored; simply unlock the page if an I/O error occurs.
> +

This is unclear about how "not attempting I/O" works and how that affects who is
responsible for putting and unlocking the pages.  How does the caller know which
pages were not attempted?  Can any arbitrary subset of pages be not attempted,
or just the last N pages?

In the code, the caller actually uses readahead_for_each() to iterate through
and put+unlock the pages.  That implies that ->readahead() must also use
readahead_for_each() as well, in order for the iterator to be advanced
correctly... Right?  And the ownership of each page is transferred to the callee
when the callee advances the iterator past that page.
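
For that model to work, I'd expect an implementation to look roughly like this
(a sketch only, assuming the readahead_for_each(rac, page) form mentioned
above; the exact helper names in this patchset may differ):

	static void ext4_readahead(struct readahead_control *rac)
	{
		struct page *page;

		readahead_for_each(rac, page) {
			/* advancing the iterator hands us this locked page;
			 * we must start I/O on it, or unlock and put it */
		}
	}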

I don't see how ext4_readahead() and f2fs_readahead() can work at all, given
that they don't advance the iterator.

- Eric




Re: [Cluster-devel] [PATCH v6 14/19] ext4: Convert from readpages to readahead

2020-02-19 Thread Eric Biggers
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote:
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index c1769afbf799..e14841ade612 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -7,8 +7,8 @@
>   *
>   * This was originally taken from fs/mpage.c
>   *
> - * The intent is the ext4_mpage_readpages() function here is intended
> - * to replace mpage_readpages() in the general case, not just for
> + * The ext4_mpage_readahead() function here is intended to
> + * replace mpage_readahead() in the general case, not just for
>   * encrypted files.  It has some limitations (see below), where it
>   * will fall back to read_block_full_page(), but these limitations
>   * should only be hit when page_size != block_size.
> @@ -222,8 +222,7 @@ static inline loff_t ext4_readpage_limit(struct inode 
> *inode)
>  }

This says ext4_mpage_readahead(), but it's actually still called
ext4_mpage_readpages().

- Eric