Re: [Cluster-devel] [PATCH 1/4] gfs2: stop using generic_writepages in gfs2_ail1_start_one

2023-01-18 Thread Christoph Hellwig
On Wed, Jan 18, 2023 at 10:22:20PM +0100, Andreas Gruenbacher wrote:
> The above change means that instead of calling generic_writepages(),
> we end up calling filemap_fdatawrite_wbc() -> do_writepages() ->
> mapping->a_ops->writepages(). But that's something completely
> different; the writepages address space operation is outward
> facing, while we really only want to write out the dirty buffers /
> pages in the underlying address space. In case of journaled data
> inodes, gfs2_jdata_writepages() actually ends up trying to create a
> filesystem transaction, which immediately hangs because we're in the
> middle of a log flush.
> 
> So I'm tempted to revert the following two of your commits; luckily
> that's independent from the iomap_writepage() removal:
> 
>   d3d71901b1ea ("gfs2: remove ->writepage")
>   b2b0a5e97855 ("gfs2: stop using generic_writepages in gfs2_ail1_start_one")

generic_writepages is gone in linux-next, and I'd really like to keep
it that way.  So if you have to do this, please open code it
using write_cache_pages and a direct call to the writepage method of
choice.

> I think we could go through iomap_writepages() instead of
> generic_writepages() here as well, but that's for another day.

Well, that would obviously be much better, and actually help with the
goal of removing ->writepage.



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Dave Chinner
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> 2. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.  But I don't know if that statement applies
> generally...

The issue is not truncate/punch/reflink for either DIO or buffered
IO - the issue that leads to stale iomaps is async extent state.
i.e. IO completion doing unwritten extent conversion.

For DIO, AIO doesn't hold the IOLOCK at all when completion is run
(like buffered writeback), but non-AIO DIO writes hold the IOLOCK
shared while waiting for completion. This means that we can have DIO
submission and completion still running concurrently, and so stale
iomaps are a definite possibility.

From my notes when I looked at this:

1. The race condition for a DIO write mapping going stale is an
overlapping DIO completion converting the block from unwritten
to written, and then the DIO write incorrectly issuing sub-block
zeroing because the mapping is now stale.

2. DIO read into a hole or unwritten extent zeroes the entire range
in the user buffer in one operation. If this is a large range, this
could race with small DIO writes within that range that have
completed.

3. There is a window between dio write completion doing unwritten
extent conversion (by ->end_io) and the page cache being
invalidated, providing a window where buffered read maps can be
stale and incorrect read behaviour exposed to userspace before
the page cache is invalidated.

These all stem from IO having overlapping ranges, which is largely
unsupported but can't be entirely prevented (e.g. backup
applications running in the background). Largely the problems are
confined to sub-block IOs. i.e. when sub-block DIO writes to the
same block are being performed, we have the possibility that one
write completes whilst the other is deciding what to zero, unaware
that the range is now MAPPED rather than UNWRITTEN.

We currently avoid issues with sub-block dio writes by using
IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the
unaligned IO fits entirely within a MAPPED extent so no sub-block
zeroing is required. If allocation or sub-block zeroing is required,
then we force the filesystem to fall back to exclusive IO locking
and wait for all concurrent DIO in flight to complete so that it
can't race with any other DIO write that might cause the map to
become stale while we are doing the zeroing.

This does not avoid potential issues with DIO write vs buffered
read, nor DIO write vs mmap IO. It's not totally clear to me
whether we need ->iomap_valid checks in the buffered read paths
to avoid the completion races with DIO writes, but there are windows
there where cached iomaps could be considered stale.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH 1/4] gfs2: stop using generic_writepages in gfs2_ail1_start_one

2023-01-18 Thread Andreas Gruenbacher
Hi Christoph,

On Tue, Jul 19, 2022 at 7:07 AM Christoph Hellwig  wrote:
>
> Use filemap_fdatawrite_wbc instead of generic_writepages in
> gfs2_ail1_start_one so that the function can also cope with address_space
> operations that only implement ->writepages and to properly account
> for cgroup writeback.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Andreas Gruenbacher 
> ---
>  fs/gfs2/log.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index f0ee3ff6f9a87..a66e3b1f6d178 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -131,7 +131,7 @@ __acquires(&sdp->sd_ail_lock)
> if (!mapping)
> continue;
> spin_unlock(&sdp->sd_ail_lock);
> -   ret = generic_writepages(mapping, wbc);
> +   ret = filemap_fdatawrite_wbc(mapping, wbc);

this patch unfortunately breaks journaled data inodes.

We're in function gfs2_ail1_start_one() here, which is usually called
via gfs2_log_flush() -> empty_ail1_list() -> gfs2_ail1_start() ->
gfs2_ail1_flush() -> gfs2_ail1_start_one(), and we're going through
the list of buffer heads in the transaction to be flushed. We used to
submit each dirty buffer for I/O individually, but since commit
5ac048bb7ea6 ("GFS2: Use filemap_fdatawrite() to write back the AIL"),
we're submitting all the dirty pages in the metadata address space
this buffer head belongs to. That's slightly bizarre, but it happens
to catch exactly the same buffer heads that are in the transaction, so
we end up with the same result. From what I'm being told, this was
done as a performance optimization -- but nobody actually knows the
details anymore.

The above change means that instead of calling generic_writepages(),
we end up calling filemap_fdatawrite_wbc() -> do_writepages() ->
mapping->a_ops->writepages(). But that's something completely
different; the writepages address space operation is outward
facing, while we really only want to write out the dirty buffers /
pages in the underlying address space. In case of journaled data
inodes, gfs2_jdata_writepages() actually ends up trying to create a
filesystem transaction, which immediately hangs because we're in the
middle of a log flush.

So I'm tempted to revert the following two of your commits; luckily
that's independent from the iomap_writepage() removal:

  d3d71901b1ea ("gfs2: remove ->writepage")
  b2b0a5e97855 ("gfs2: stop using generic_writepages in gfs2_ail1_start_one")

I think we could go through iomap_writepages() instead of
generic_writepages() here as well, but that's for another day.

As far as cgroup writeback goes, this is journal I/O and I don't have
the faintest idea how the accounting for that is even supposed to
work.

> if (need_resched()) {
> blk_finish_plug(plug);
> cond_resched();
> @@ -222,8 +222,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct 
> writeback_control *wbc)
> spin_unlock(&sdp->sd_ail_lock);
> blk_finish_plug(&plug);
> if (ret) {
> -   gfs2_lm(sdp, "gfs2_ail1_start_one (generic_writepages) "
> -   "returned: %d\n", ret);
> +   gfs2_lm(sdp, "gfs2_ail1_start_one returned: %d\n", ret);
> gfs2_withdraw(sdp);
> }
> trace_gfs2_ail_flush(sdp, wbc, 0);
> --
> 2.30.2
>

Thanks,
Andreas



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Andreas Grünbacher
Am Mi., 18. Jan. 2023 um 20:04 Uhr schrieb Darrick J. Wong :
>
> On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> > On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > > I don't have any objections to pulling everything except patches 8 and
> > > 10 for testing this week.
> >
> > That would be great.  I now have a series to return the ERR_PTR
> > from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.
>
> Ok, done.
>
> > >
> > > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > > don't think it does, but OTOH zone pointer management might complicate
> > > that.
> >
> > Adding Damien.
> >
> > > 2. How about porting the writeback iomap validation to use this
> > > mechanism?  (I suspect Dave might already be working on this...)
> >
> > What is "this mechanism"?  Do you mean the ->iomap_valid removed
> > here?  Writeback calls into ->map_blocks for every block while under
> > the folio lock, so the validation can be (and for XFS currently is)
> > done in that.  Moving it out into a separate method with extra
> > indirect function call overhead and interactions between the methods
> > seems like a retrograde step to me.
>
> Sorry, I should've been more specific -- can xfs writeback use the
> validity cookie in struct iomap and thereby get rid of struct
> xfs_writepage_ctx entirely?

Already asked and answered in the same thread:

https://lore.kernel.org/linux-fsdevel/20230109225453.gq1971...@dread.disaster.area/

> > > 2. Do we need to revalidate mappings for directio writes?  I think the
> > > answer is no (for xfs) because the ->iomap_begin call will allocate
> > > whatever blocks are needed and truncate/punch/reflink block on the
> > > iolock while the directio writes are pending, so you'll never end up
> > > with a stale mapping.
> >
> > Yes.
>
> Er... yes as in "Yes, we *do* need to revalidate directio writes", or
> "Yes, your reasoning is correct"?
>
> --D



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Darrick J. Wong
On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > I don't have any objections to pulling everything except patches 8 and
> > 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.

Ok, done.

> > 
> > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > don't think it does, but OTOH zone pointer management might complicate
> > that.
> 
> Adding Damien.
> 
> > 2. How about porting the writeback iomap validation to use this
> > mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the ->iomap_valid removed
> here?  Writeback calls into ->map_blocks for every block while under
> the folio lock, so the validation can be (and for XFS currently is)
> done in that.  Moving it out into a separate method with extra
> indirect function call overhead and interactions between the methods
> seems like a retrograde step to me.

Sorry, I should've been more specific -- can xfs writeback use the
validity cookie in struct iomap and thereby get rid of struct
xfs_writepage_ctx entirely?

> > 2. Do we need to revalidate mappings for directio writes?  I think the
> > answer is no (for xfs) because the ->iomap_begin call will allocate
> > whatever blocks are needed and truncate/punch/reflink block on the
> > iolock while the directio writes are pending, so you'll never end up
> > with a stale mapping.
> 
> Yes.

Er... yes as in "Yes, we *do* need to revalidate directio writes", or
"Yes, your reasoning is correct"?

--D



Re: [Cluster-devel] [PATCH 7/9] gfs2: handle a NULL folio in gfs2_jhead_process_page

2023-01-18 Thread Christoph Hellwig
On Wed, Jan 18, 2023 at 05:24:32PM +0100, Andreas Gruenbacher wrote:
> We're actually still holding a reference to the folio from the
> find_or_create_page() in gfs2_find_jhead() here, so we know that
> memory pressure can't push the page out and filemap_get_folio() won't
> return NULL.

Ok, I'll drop this patch.



Re: [Cluster-devel] [PATCH 8/9] btrfs: handle a NULL folio in extent_range_redirty_for_io

2023-01-18 Thread Christoph Hellwig
On Wed, Jan 18, 2023 at 04:08:57PM +, Matthew Wilcox wrote:
> On Wed, Jan 18, 2023 at 10:43:28AM +0100, Christoph Hellwig wrote:
> > filemap_get_folio can return NULL, skip those cases.
> 
> Hmm, I'm not sure that's true.  We have one place that calls
> extent_range_redirty_for_io(), and it previously calls
> extent_range_clear_dirty_for_io() which has an explicit
> 
> BUG_ON(!page); /* Pages should be in the extent_io_tree */
> 
> so I'm going to say this one can't happen either.  I haven't delved far
> enough into btrfs to figure out why it can't happen.

I'll drop this patch for now.



Re: [Cluster-devel] [PATCH 9/9] mm: return an ERR_PTR from __filemap_get_folio

2023-01-18 Thread Christoph Hellwig
On Wed, Jan 18, 2023 at 09:39:08PM +0900, Ryusuke Konishi wrote:
> > Also pass the error through to the direct callers:
> 
> > filemap_get_folio, filemap_lock_folio, filemap_grab_folio
> > and filemap_get_incore_folio.
> 
> As for the comments describing the return values of these callers,
> isn't it necessary to rewrite the value from NULL in case of errors?

Yes, thanks for catching this.



Re: [Cluster-devel] [PATCH 4/9] shmem: remove shmem_get_partial_folio

2023-01-18 Thread Brian Foster
On Wed, Jan 18, 2023 at 05:43:58PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 18, 2023 at 08:57:05AM -0500, Brian Foster wrote:
> > This all seems reasonable to me at a glance, FWIW, but I am a little
> > curious why this wouldn't split up into two changes. I.e., switch this
> > over to filemap_get_entry() to minimally remove the FGP_ENTRY dependency
> > without a behavior change, then (perhaps after the next patch) introduce
> > SGP_FIND in a separate patch. That makes it easier to review and
> > potentially undo if it happens to pose a problem in the future. Hm?
> 
> The minimal change to filemap_get_entry would require to add the
> lock, check mapping and retry loop and thus add a fair amount of
> code.  So I looked for ways to avoid that and came up with this
> version.  But if there is a strong preference to first open code
> the logic and then later consolidate it I could do that.
> 

Ok. Not a strong preference from me. I don't think it's worth
complicating that much just to split up.

Brian



Re: [Cluster-devel] [PATCH 4/9] shmem: remove shmem_get_partial_folio

2023-01-18 Thread Christoph Hellwig
On Wed, Jan 18, 2023 at 08:57:05AM -0500, Brian Foster wrote:
> This all seems reasonable to me at a glance, FWIW, but I am a little
> curious why this wouldn't split up into two changes. I.e., switch this
> over to filemap_get_entry() to minimally remove the FGP_ENTRY dependency
> without a behavior change, then (perhaps after the next patch) introduce
> SGP_FIND in a separate patch. That makes it easier to review and
> potentially undo if it happens to pose a problem in the future. Hm?

The minimal change to filemap_get_entry would require to add the
lock, check mapping and retry loop and thus add a fair amount of
code.  So I looked for ways to avoid that and came up with this
version.  But if there is a strong preference to first open code
the logic and then later consolidate it I could do that.



Re: [Cluster-devel] [PATCH 7/9] gfs2: handle a NULL folio in gfs2_jhead_process_page

2023-01-18 Thread Andreas Gruenbacher
[Christoph's email ended up in my spam folder; I hope that was a
one-time-only occurrence.]

On Wed, Jan 18, 2023 at 5:00 PM Matthew Wilcox  wrote:
> On Wed, Jan 18, 2023 at 10:43:27AM +0100, Christoph Hellwig wrote:
> > filemap_get_folio can return NULL, so exit early for that case.
>
> I'm not sure it can return NULL in this specific case.  As I understand
> this code, we're scanning the journal looking for the log head.  We've
> just submitted the bio to read this page.  I suppose memory pressure
> could theoretically push the page out, but if it does, we're doing the
> wrong thing by just returning here; we need to retry reading the page.
>
> Assuming we're not willing to do the work to add that case, I think I'd
> rather see the crash in folio_wait_locked() than get data corruption
> from failing to find the head of the log.
>
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  fs/gfs2/lops.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
> > index 1902413d5d123e..51d4b610127cdb 100644
> > --- a/fs/gfs2/lops.c
> > +++ b/fs/gfs2/lops.c
> > @@ -472,6 +472,8 @@ static void gfs2_jhead_process_page(struct gfs2_jdesc 
> > *jd, unsigned long index,
> >   struct folio *folio;
> >
> >   folio = filemap_get_folio(jd->jd_inode->i_mapping, index);
> > + if (!folio)
> > + return;

We're actually still holding a reference to the folio from the
find_or_create_page() in gfs2_find_jhead() here, so we know that
memory pressure can't push the page out and filemap_get_folio() won't
return NULL.

> >
> >   folio_wait_locked(folio);
> >   if (folio_test_error(folio))
> > --
> > 2.39.0
> >
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 8/9] btrfs: handle a NULL folio in extent_range_redirty_for_io

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:28AM +0100, Christoph Hellwig wrote:
> filemap_get_folio can return NULL, skip those cases.

Hmm, I'm not sure that's true.  We have one place that calls
extent_range_redirty_for_io(), and it previously calls
extent_range_clear_dirty_for_io() which has an explicit

BUG_ON(!page); /* Pages should be in the extent_io_tree */

so I'm going to say this one can't happen either.  I haven't delved far
enough into btrfs to figure out why it can't happen.

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/extent_io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d55e4531ffd212..a54d2cf74ba020 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -230,6 +230,8 @@ void extent_range_redirty_for_io(struct inode *inode, u64 
> start, u64 end)
>  
>   while (index <= end_index) {
>   folio = filemap_get_folio(mapping, index);
> + if (!folio)
> + continue;
>   filemap_dirty_folio(mapping, folio);
>   folio_account_redirty(folio);
>   index += folio_nr_pages(folio);
> -- 
> 2.39.0
> 



Re: [Cluster-devel] [PATCH 7/9] gfs2: handle a NULL folio in gfs2_jhead_process_page

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:27AM +0100, Christoph Hellwig wrote:
> filemap_get_folio can return NULL, so exit early for that case.

I'm not sure it can return NULL in this specific case.  As I understand
this code, we're scanning the journal looking for the log head.  We've
just submitted the bio to read this page.  I suppose memory pressure
could theoretically push the page out, but if it does, we're doing the
wrong thing by just returning here; we need to retry reading the page.

Assuming we're not willing to do the work to add that case, I think I'd
rather see the crash in folio_wait_locked() than get data corruption
from failing to find the head of the log.

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/lops.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
> index 1902413d5d123e..51d4b610127cdb 100644
> --- a/fs/gfs2/lops.c
> +++ b/fs/gfs2/lops.c
> @@ -472,6 +472,8 @@ static void gfs2_jhead_process_page(struct gfs2_jdesc 
> *jd, unsigned long index,
>   struct folio *folio;
>  
>   folio = filemap_get_folio(jd->jd_inode->i_mapping, index);
> + if (!folio)
> + return;
>  
>   folio_wait_locked(folio);
>   if (folio_test_error(folio))
> -- 
> 2.39.0
> 



Re: [Cluster-devel] [PATCH 4/9] shmem: remove shmem_get_partial_folio

2023-01-18 Thread Brian Foster
On Wed, Jan 18, 2023 at 10:43:24AM +0100, Christoph Hellwig wrote:
> Add a new SGP_FIND mode for shmem_get_partial_folio that works like
> SGP_READ, but does not check i_size.  Use that instead of open coding
> the page cache lookup in shmem_get_partial_folio.  Note that this is
> a behavior change in that it reads in swap cache entries for offsets
> outside i_size, possibly causing a little bit of extra work.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/shmem_fs.h |  1 +
>  mm/shmem.c   | 46 
>  2 files changed, 15 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d09d54be4ffd99..7ba160ac066e5e 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -105,6 +105,7 @@ enum sgp_type {
>   SGP_CACHE,  /* don't exceed i_size, may allocate page */
>   SGP_WRITE,  /* may exceed i_size, may allocate !Uptodate page */
>   SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
> + SGP_FIND,   /* like SGP_READ, but also read outside i_size */
>  };
>  
>  int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio 
> **foliop,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9e1015cbad29f9..e9500fea43a8dc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -877,27 +877,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
>   }
>  }
>  
> -static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t 
> index)
> -{
> - struct folio *folio;
> -
> - /*
> -  * At first avoid shmem_get_folio(,,,SGP_READ): that fails
> -  * beyond i_size, and reports fallocated pages as holes.
> -  */
> - folio = __filemap_get_folio(inode->i_mapping, index,
> - FGP_ENTRY | FGP_LOCK, 0);

This all seems reasonable to me at a glance, FWIW, but I am a little
curious why this wouldn't split up into two changes. I.e., switch this
over to filemap_get_entry() to minimally remove the FGP_ENTRY dependency
without a behavior change, then (perhaps after the next patch) introduce
SGP_FIND in a separate patch. That makes it easier to review and
potentially undo if it happens to pose a problem in the future. Hm?

Brian

> - if (!xa_is_value(folio))
> - return folio;
> - /*
> -  * But read a page back from swap if any of it is within i_size
> -  * (although in some cases this is just a waste of time).
> -  */
> - folio = NULL;
> - shmem_get_folio(inode, index, &folio, SGP_READ);
> - return folio;
> -}
> -
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -957,7 +936,8 @@ static void shmem_undo_range(struct inode *inode, loff_t 
> lstart, loff_t lend,
>   goto whole_folios;
>  
>   same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
> - folio = shmem_get_partial_folio(inode, lstart >> PAGE_SHIFT);
> + folio = NULL;
> + shmem_get_folio(inode, lstart >> PAGE_SHIFT, &folio, SGP_FIND);
>   if (folio) {
>   same_folio = lend < folio_pos(folio) + folio_size(folio);
>   folio_mark_dirty(folio);
> @@ -971,14 +951,16 @@ static void shmem_undo_range(struct inode *inode, 
> loff_t lstart, loff_t lend,
>   folio = NULL;
>   }
>  
> - if (!same_folio)
> - folio = shmem_get_partial_folio(inode, lend >> PAGE_SHIFT);
> - if (folio) {
> - folio_mark_dirty(folio);
> - if (!truncate_inode_partial_folio(folio, lstart, lend))
> - end = folio->index;
> - folio_unlock(folio);
> - folio_put(folio);
> + if (!same_folio) {
> + folio = NULL;
> + shmem_get_folio(inode, lend >> PAGE_SHIFT, &folio, SGP_FIND);
> + if (folio) {
> + folio_mark_dirty(folio);
> + if (!truncate_inode_partial_folio(folio, lstart, lend))
> + end = folio->index;
> + folio_unlock(folio);
> + folio_put(folio);
> + }
>   }
>  
>  whole_folios:
> @@ -1900,7 +1882,7 @@ static int shmem_get_folio_gfp(struct inode *inode, 
> pgoff_t index,
>   if (folio_test_uptodate(folio))
>   goto out;
>   /* fallocated folio */
> - if (sgp != SGP_READ)
> + if (sgp != SGP_READ && sgp != SGP_FIND)
>   goto clear;
>   folio_unlock(folio);
>   folio_put(folio);
> @@ -1911,7 +1893,7 @@ static int shmem_get_folio_gfp(struct inode *inode, 
> pgoff_t index,
>* SGP_NOALLOC: fail on hole, with NULL folio, letting caller fail.
>*/
>   *foliop = NULL;
> - if (sgp == SGP_READ)
> + if (sgp == SGP_READ || sgp == SGP_FIND)
>   return 0;
>   if 

Re: [Cluster-devel] [PATCH 3/9] mm: use filemap_get_entry in filemap_get_incore_folio

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:23AM +0100, Christoph Hellwig wrote:
> filemap_get_incore_folio wants to look at the details of xa_is_value
> entries, but doesn't need any of the other logic in filemap_get_folio.
> Switch it to use the lower-level filemap_get_entry interface.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH 2/9] mm: make mapping_get_entry available outside of filemap.c

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:22AM +0100, Christoph Hellwig wrote:
> mapping_get_entry is useful for page cache API users that need to know
> about xa_value internals.  Rename it and make it available in pagemap.h.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH 1/9] mm: don't look at xarray value entries in split_huge_pages_in_file

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:21AM +0100, Christoph Hellwig wrote:
> split_huge_pages_in_file never wants to do anything with the special
> value entries.  Switch to using filemap_get_folio to not even see them.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [syzbot] INFO: task hung in freeze_super (3)

2023-01-18 Thread Tetsuo Handa
Ping?

On 2023/01/04 22:47, Tetsuo Handa wrote:
> I suspect that cleanup is not done appropriately when gfs2_find_jhead() 
> failed.
> Looking into gfs2_make_fs_rw(), I see there are two worrisome things.
> 
> One is that gfs2_make_fs_rw() returns an error without calling 
> gfs2_consist(sdp) when
> gfs2_find_jhead() returned an error whereas gfs2_make_fs_rw() returns -EIO 
> after calling
> gfs2_consist(sdp) when head.lh_flags does not contain GFS2_LOG_HEAD_UNMOUNT 
> flag.
> 
> Since head.lh_flags is assigned by gfs2_find_jhead(), we might want to call 
> gfs2_consist(sdp)
> when gfs2_find_jhead() returned an error. And actually
> 
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -138,7 +138,11 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
> return -EIO;
> 
> error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
> -   if (error || gfs2_withdrawn(sdp))
> +   if (error) {
> +   gfs2_consist(sdp);
> +   return error;
> +   }
> +   if (gfs2_withdrawn(sdp))
> return error;
> 
> if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
> 
> avoids this deadlock. But I don't know when/how to use gfs2_withdraw().
> Please check if this change is appropriate.
> 
> The other worrisome thing is that gfs2_make_fs_rw() is returning 0 (and
> mount operation will continue) when gfs2_withdrawn() == true. Can the caller
> of gfs2_make_fs_rw() survive when gfs2_make_fs_rw() returned 0 without
> processing
> 
> /*  Initialize some head of the log stuff  */
> sdp->sd_log_sequence = head.lh_sequence + 1;
> gfs2_log_pointers_init(sdp, head.lh_blkno);
> 
> lines? Shouldn't the caller of gfs2_make_fs_rw() observe an error when
> gfs2_make_fs_rw() did not execute the
> 
>   set_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
> 
> line due to gfs2_withdrawn() == true?
> 



Re: [Cluster-devel] [PATCH 9/9] mm: return an ERR_PTR from __filemap_get_folio

2023-01-18 Thread Ryusuke Konishi
On Wed, Jan 18, 2023 at 7:41 PM Christoph Hellwig wrote:
>
> Instead of returning NULL for all errors, distinguish between:
>
>  - no entry found and not asked to allocate (-ENOENT)
>  - failed to allocate memory (-ENOMEM)
>  - would block (-EAGAIN)
>
> so that callers don't have to guess the error based on the passed
> in flags.
>
> Also pass the error through to the direct callers:

> filemap_get_folio, filemap_lock_folio, filemap_grab_folio
> and filemap_get_incore_folio.

As for the comments describing the return values of these callers,
isn't it necessary to rewrite the value from NULL in case of errors?

Regards,
Ryusuke Konishi



[Cluster-devel] [PATCH 9/9] mm: return an ERR_PTR from __filemap_get_folio

2023-01-18 Thread Christoph Hellwig
Instead of returning NULL for all errors, distinguish between:

 - no entry found and not asked to allocate (-ENOENT)
 - failed to allocate memory (-ENOMEM)
 - would block (-EAGAIN)

so that callers don't have to guess the error based on the passed
in flags.

Also pass the error through to the direct callers:
filemap_get_folio, filemap_lock_folio, filemap_grab_folio
and filemap_get_incore_folio.

Signed-off-by: Christoph Hellwig 
---
 fs/afs/dir.c | 10 +-
 fs/afs/dir_edit.c|  2 +-
 fs/afs/write.c   |  4 ++--
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c |  2 +-
 fs/ext4/inode.c  |  2 +-
 fs/ext4/move_extent.c|  8 
 fs/gfs2/lops.c   |  2 +-
 fs/hugetlbfs/inode.c |  2 +-
 fs/iomap/buffered-io.c   |  6 +++---
 fs/netfs/buffered_read.c |  4 ++--
 fs/nilfs2/page.c |  6 +++---
 mm/filemap.c | 14 --
 mm/folio-compat.c|  2 +-
 mm/huge_memory.c |  2 +-
 mm/memcontrol.c  |  2 +-
 mm/mincore.c |  2 +-
 mm/shmem.c   |  4 ++--
 mm/swap_state.c  | 15 ---
 mm/swapfile.c|  4 ++--
 mm/truncate.c| 15 ---
 21 files changed, 57 insertions(+), 53 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b7c1f8c84b38aa..41d0b4203870be 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -319,16 +319,16 @@ static struct afs_read *afs_read_dir(struct afs_vnode 
*dvnode, struct key *key)
struct folio *folio;
 
folio = filemap_get_folio(mapping, i);
-   if (!folio) {
+   if (IS_ERR(folio)) {
if (test_and_clear_bit(AFS_VNODE_DIR_VALID, 
&dvnode->flags))
afs_stat_v(dvnode, n_inval);
-
-   ret = -ENOMEM;
folio = __filemap_get_folio(mapping,
i, FGP_LOCK | FGP_CREAT,
mapping->gfp_mask);
-   if (!folio)
+   if (IS_ERR(folio)) {
+   ret = PTR_ERR(folio);
goto error;
+   }
folio_attach_private(folio, (void *)1);
folio_unlock(folio);
}
@@ -524,7 +524,7 @@ static int afs_dir_iterate(struct inode *dir, struct 
dir_context *ctx,
 */
folio = __filemap_get_folio(dir->i_mapping, ctx->pos / 
PAGE_SIZE,
FGP_ACCESSED, 0);
-   if (!folio) {
+   if (IS_ERR(folio)) {
ret = afs_bad(dvnode, afs_file_error_dir_missing_page);
break;
}
diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c
index 0ab7752d1b758e..f0eddccbdd9541 100644
--- a/fs/afs/dir_edit.c
+++ b/fs/afs/dir_edit.c
@@ -115,7 +115,7 @@ static struct folio *afs_dir_get_folio(struct afs_vnode 
*vnode, pgoff_t index)
folio = __filemap_get_folio(mapping, index,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
mapping->gfp_mask);
-   if (!folio)
+   if (IS_ERR(folio))
clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags);
else if (folio && !folio_test_private(folio))
folio_attach_private(folio, (void *)1);
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 2d3b08b7406ca7..cf1eb0d122c275 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -232,7 +232,7 @@ static void afs_kill_pages(struct address_space *mapping,
_debug("kill %lx (to %lx)", index, last);
 
folio = filemap_get_folio(mapping, index);
-   if (!folio) {
+   if (IS_ERR(folio)) {
next = index + 1;
continue;
}
@@ -270,7 +270,7 @@ static void afs_redirty_pages(struct writeback_control *wbc,
_debug("redirty %llx @%llx", len, start);
 
folio = filemap_get_folio(mapping, index);
-   if (!folio) {
+   if (IS_ERR(folio)) {
next = index + 1;
continue;
}
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d5da43a89ee7f..f1035e0bcf8c6a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4034,7 +4034,7 @@ static int wait_dev_supers(struct btrfs_device *device, 
int max_mirrors)
 
folio = filemap_get_folio(device->bdev->bd_inode->i_mapping,
 bytenr >> PAGE_SHIFT);
-   if (!folio) {
+   if (IS_ERR(folio)) {
errors++;
if (i == 0)
primary_failed = true;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a54d2cf74ba020..faaab9fae66d66 

[Cluster-devel] [PATCH 8/9] btrfs: handle a NULL folio in extent_range_redirty_for_io

2023-01-18 Thread Christoph Hellwig
filemap_get_folio can return NULL, so skip those cases.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d55e4531ffd212..a54d2cf74ba020 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -230,6 +230,8 @@ void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
 
while (index <= end_index) {
folio = filemap_get_folio(mapping, index);
+   if (!folio) {
+   index++;
+   continue;
+   }
filemap_dirty_folio(mapping, folio);
folio_account_redirty(folio);
index += folio_nr_pages(folio);
-- 
2.39.0



[Cluster-devel] [PATCH 7/9] gfs2: handle a NULL folio in gfs2_jhead_process_page

2023-01-18 Thread Christoph Hellwig
filemap_get_folio can return NULL, so exit early for that case.

Signed-off-by: Christoph Hellwig 
---
 fs/gfs2/lops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 1902413d5d123e..51d4b610127cdb 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -472,6 +472,8 @@ static void gfs2_jhead_process_page(struct gfs2_jdesc *jd, unsigned long index,
struct folio *folio;
 
folio = filemap_get_folio(jd->jd_inode->i_mapping, index);
+   if (!folio)
+   return;
 
folio_wait_locked(folio);
if (folio_test_error(folio))
-- 
2.39.0



[Cluster-devel] [PATCH 6/9] mm: remove FGP_ENTRY

2023-01-18 Thread Christoph Hellwig
FGP_ENTRY is unused now, so remove it.

Signed-off-by: Christoph Hellwig 
---
 include/linux/pagemap.h | 3 +--
 mm/filemap.c| 7 +--
 mm/folio-compat.c   | 4 ++--
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 24dedf6b12be49..e2208ee36966ea 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -504,8 +504,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOFS   0x0010
 #define FGP_NOWAIT 0x0020
 #define FGP_FOR_MMAP   0x0040
-#define FGP_ENTRY  0x0080
-#define FGP_STABLE 0x0100
+#define FGP_STABLE 0x0080
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
diff --git a/mm/filemap.c b/mm/filemap.c
index ed0583f9e27512..35baadd130795c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1889,8 +1889,6 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * * %FGP_ACCESSED - The folio will be marked accessed.
  * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
  * * %FGP_CREAT - If no page is present then a new page is allocated using
  *   @gfp and added to the page cache and the VM's LRU list.
  *   The page is returned locked and with an increased refcount.
@@ -1916,11 +1914,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 repeat:
folio = filemap_get_entry(mapping, index);
-   if (xa_is_value(folio)) {
-   if (fgp_flags & FGP_ENTRY)
-   return folio;
+   if (xa_is_value(folio))
folio = NULL;
-   }
if (!folio)
goto no_page;
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 18c48b55792635..f3841b4977b68e 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -97,8 +97,8 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
struct folio *folio;
 
folio = __filemap_get_folio(mapping, index, fgp_flags, gfp);
-   if (!folio || xa_is_value(folio))
-   return &folio->page;
+   if (!folio)
+   return NULL;
return folio_file_page(folio, index);
 }
 EXPORT_SYMBOL(pagecache_get_page);
-- 
2.39.0



[Cluster-devel] [PATCH 5/9] shmem: open code the page cache lookup in shmem_get_folio_gfp

2023-01-18 Thread Christoph Hellwig
Use the very low level filemap_get_entry helper to look up the
entry in the xarray, and then:

 - don't bother locking the folio if only doing a userfault notification
 - open code locking the page and checking for truncation in a related
   code block

This will allow us to eventually remove the FGP_ENTRY flag.

Signed-off-by: Christoph Hellwig 
---
 mm/shmem.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e9500fea43a8dc..769107f376562f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1856,12 +1856,10 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
sbinfo = SHMEM_SB(inode->i_sb);
charge_mm = vma ? vma->vm_mm : NULL;
 
-   folio = __filemap_get_folio(mapping, index, FGP_ENTRY | FGP_LOCK, 0);
+   folio = filemap_get_entry(mapping, index);
if (folio && vma && userfaultfd_minor(vma)) {
-   if (!xa_is_value(folio)) {
-   folio_unlock(folio);
+   if (!xa_is_value(folio))
folio_put(folio);
-   }
*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
return 0;
}
@@ -1877,6 +1875,14 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
}
 
if (folio) {
+   folio_lock(folio);
+
+   /* Has the page been truncated? */
+   if (unlikely(folio->mapping != mapping)) {
+   folio_unlock(folio);
+   folio_put(folio);
+   goto repeat;
+   }
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
if (folio_test_uptodate(folio))
-- 
2.39.0



[Cluster-devel] [PATCH 4/9] shmem: remove shmem_get_partial_folio

2023-01-18 Thread Christoph Hellwig
Add a new SGP_FIND mode for shmem_get_folio that works like
SGP_READ, but does not check i_size.  Use that instead of open coding
the page cache lookup in shmem_get_partial_folio.  Note that this is
a behavior change in that it reads in swap cache entries for offsets
outside i_size, possibly causing a little bit of extra work.

Signed-off-by: Christoph Hellwig 
---
 include/linux/shmem_fs.h |  1 +
 mm/shmem.c   | 46 
 2 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d09d54be4ffd99..7ba160ac066e5e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -105,6 +105,7 @@ enum sgp_type {
SGP_CACHE,  /* don't exceed i_size, may allocate page */
SGP_WRITE,  /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
+   SGP_FIND,   /* like SGP_READ, but also read outside i_size */
 };
 
 int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
diff --git a/mm/shmem.c b/mm/shmem.c
index 9e1015cbad29f9..e9500fea43a8dc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -877,27 +877,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
}
 }
 
-static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
-{
-   struct folio *folio;
-
-   /*
-* At first avoid shmem_get_folio(,,,SGP_READ): that fails
-* beyond i_size, and reports fallocated pages as holes.
-*/
-   folio = __filemap_get_folio(inode->i_mapping, index,
-   FGP_ENTRY | FGP_LOCK, 0);
-   if (!xa_is_value(folio))
-   return folio;
-   /*
-* But read a page back from swap if any of it is within i_size
-* (although in some cases this is just a waste of time).
-*/
-   folio = NULL;
-   shmem_get_folio(inode, index, &folio, SGP_READ);
-   return folio;
-}
-
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -957,7 +936,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
goto whole_folios;
 
same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
-   folio = shmem_get_partial_folio(inode, lstart >> PAGE_SHIFT);
+   folio = NULL;
+   shmem_get_folio(inode, lstart >> PAGE_SHIFT, &folio, SGP_FIND);
if (folio) {
same_folio = lend < folio_pos(folio) + folio_size(folio);
folio_mark_dirty(folio);
@@ -971,14 +951,16 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
folio = NULL;
}
 
-   if (!same_folio)
-   folio = shmem_get_partial_folio(inode, lend >> PAGE_SHIFT);
-   if (folio) {
-   folio_mark_dirty(folio);
-   if (!truncate_inode_partial_folio(folio, lstart, lend))
-   end = folio->index;
-   folio_unlock(folio);
-   folio_put(folio);
+   if (!same_folio) {
+   folio = NULL;
+   shmem_get_folio(inode, lend >> PAGE_SHIFT, &folio, SGP_FIND);
+   if (folio) {
+   folio_mark_dirty(folio);
+   if (!truncate_inode_partial_folio(folio, lstart, lend))
+   end = folio->index;
+   folio_unlock(folio);
+   folio_put(folio);
+   }
}
 
 whole_folios:
@@ -1900,7 +1882,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (folio_test_uptodate(folio))
goto out;
/* fallocated folio */
-   if (sgp != SGP_READ)
+   if (sgp != SGP_READ && sgp != SGP_FIND)
goto clear;
folio_unlock(folio);
folio_put(folio);
@@ -1911,7 +1893,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 * SGP_NOALLOC: fail on hole, with NULL folio, letting caller fail.
 */
*foliop = NULL;
-   if (sgp == SGP_READ)
+   if (sgp == SGP_READ || sgp == SGP_FIND)
return 0;
if (sgp == SGP_NOALLOC)
return -ENOENT;
-- 
2.39.0



[Cluster-devel] [PATCH 3/9] mm: use filemap_get_entry in filemap_get_incore_folio

2023-01-18 Thread Christoph Hellwig
filemap_get_incore_folio wants to look at the details of xa_is_value
entries, but doesn't need any of the other logic in filemap_get_folio.
Switch it to use the lower-level filemap_get_entry interface.

Signed-off-by: Christoph Hellwig 
---
 mm/swap_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index cb9aaa00951d99..c39ea34bc4fc10 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -380,7 +380,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 {
swp_entry_t swp;
struct swap_info_struct *si;
-   struct folio *folio = __filemap_get_folio(mapping, index, FGP_ENTRY, 0);
+   struct folio *folio = filemap_get_entry(mapping, index);
 
if (!xa_is_value(folio))
goto out;
-- 
2.39.0



[Cluster-devel] [PATCH 1/9] mm: don't look at xarray value entries in split_huge_pages_in_file

2023-01-18 Thread Christoph Hellwig
split_huge_pages_in_file never wants to do anything with the special
value entries.  Switch to using filemap_get_folio to not even see them.

Signed-off-by: Christoph Hellwig 
---
 mm/huge_memory.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6977dc6b31ba..a2830019aaa017 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3100,11 +3100,10 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
mapping = candidate->f_mapping;
 
for (index = off_start; index < off_end; index += nr_pages) {
-   struct folio *folio = __filemap_get_folio(mapping, index,
-   FGP_ENTRY, 0);
+   struct folio *folio = filemap_get_folio(mapping, index);
 
nr_pages = 1;
-   if (xa_is_value(folio) || !folio)
+   if (!folio)
continue;
 
if (!folio_test_large(folio))
-- 
2.39.0



[Cluster-devel] [PATCH 2/9] mm: make mapping_get_entry available outside of filemap.c

2023-01-18 Thread Christoph Hellwig
mapping_get_entry is useful for page cache API users that need to know
about xa_value internals.  Rename it and make it available in pagemap.h.

Signed-off-by: Christoph Hellwig 
---
 include/linux/pagemap.h | 1 +
 mm/filemap.c| 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9f108168377195..24dedf6b12be49 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -507,6 +507,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_ENTRY  0x0080
 #define FGP_STABLE 0x0100
 
+void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
int fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
diff --git a/mm/filemap.c b/mm/filemap.c
index c915ded191f03f..ed0583f9e27512 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1834,7 +1834,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  */
 
 /*
- * mapping_get_entry - Get a page cache entry.
+ * filemap_get_entry - Get a page cache entry.
  * @mapping: the address_space to search
  * @index: The page cache index.
  *
@@ -1845,7 +1845,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  *
  * Return: The folio, swap or shadow entry, %NULL if nothing is found.
  */
-static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
+void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
 {
XA_STATE(xas, &mapping->i_pages, index);
struct folio *folio;
@@ -1915,7 +1915,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
struct folio *folio;
 
 repeat:
-   folio = mapping_get_entry(mapping, index);
+   folio = filemap_get_entry(mapping, index);
if (xa_is_value(folio)) {
if (fgp_flags & FGP_ENTRY)
return folio;
-- 
2.39.0



[Cluster-devel] return an ERR_PTR from __filemap_get_folio

2023-01-18 Thread Christoph Hellwig
Hi all,

__filemap_get_folio and its wrappers can return NULL for three different
conditions, which in some cases requires the caller to reverse engineer
the decision making.  This is fixed by returning an ERR_PTR instead of
NULL and thus transporting the reason for the failure.  But to make
that work we first need to ensure that no xa_value special case is
returned and thus remove the FGP_ENTRY flag.  It turns out that flag
is barely used and can usually be dealt with in a better way.

Note that the shmem patches in here are non-trivial and need some
careful review and testing.

Diffstat
 fs/afs/dir.c |   10 +++
 fs/afs/dir_edit.c|2 -
 fs/afs/write.c   |4 +-
 fs/btrfs/disk-io.c   |2 -
 fs/btrfs/extent_io.c |2 +
 fs/ext4/inode.c  |2 -
 fs/ext4/move_extent.c|8 ++---
 fs/gfs2/lops.c   |2 +
 fs/hugetlbfs/inode.c |2 -
 fs/iomap/buffered-io.c   |6 ++--
 fs/netfs/buffered_read.c |4 +-
 fs/nilfs2/page.c |6 ++--
 include/linux/pagemap.h  |4 +-
 include/linux/shmem_fs.h |1 
 mm/filemap.c |   27 ---
 mm/folio-compat.c|4 +-
 mm/huge_memory.c |5 +--
 mm/memcontrol.c  |2 -
 mm/mincore.c |2 -
 mm/shmem.c   |   64 +++
 mm/swap_state.c  |   17 ++--
 mm/swapfile.c|4 +-
 mm/truncate.c|   15 +--
 23 files changed, 93 insertions(+), 102 deletions(-)



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Damien Le Moal
On 1/18/23 16:21, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
>> I don't have any objections to pulling everything except patches 8 and
>> 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.
> 
>>
>> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
>> don't think it does, but OTOH zone pointer management might complicate
>> that.
> 
> Adding Damien.

zonefs has a static mapping of file blocks that never changes and is fully
populated up to a file max size from mount. So zonefs is not using the
iomap_valid page operation. In fact, zonefs is not even using struct
iomap_page_ops.

> 
>> 2. How about porting the writeback iomap validation to use this
>> mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the here removed ->iomap_valid
> ?   writeback calls into ->map_blocks for every block while under the
> folio lock, so the validation can (and for XFS currently is) done
> in that.  Moving it out into a separate method with extra indirect
> function call overhead and interactions between the methods seems
> like a retrograde step to me.
> 
>> 2. Do we need to revalidate mappings for directio writes?  I think the
>> answer is no (for xfs) because the ->iomap_begin call will allocate
>> whatever blocks are needed and truncate/punch/reflink block on the
>> iolock while the directio writes are pending, so you'll never end up
>> with a stale mapping.
> 
> Yes.

-- 
Damien Le Moal
Western Digital Research



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Christoph Hellwig
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> I don't have any objections to pulling everything except patches 8 and
> 10 for testing this week. 

That would be great.  I now have a series to return the ERR_PTR
from __filemap_get_folio which will cause a minor conflict, but
I think that's easy enough for Linus to handle.

> 
> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> don't think it does, but OTOH zone pointer management might complicate
> that.

Adding Damien.

> 2. How about porting the writeback iomap validation to use this
> mechanism?  (I suspect Dave might already be working on this...)

What is "this mechanism"?  Do you mean the here removed ->iomap_valid
?   writeback calls into ->map_blocks for every block while under the
folio lock, so the validation can (and for XFS currently is) done
in that.  Moving it out into a separate method with extra indirect
function call overhead and interactions between the methods seems
like a retrograde step to me.

> 2. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.

Yes.