Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-28 Thread Jan Kara
On Wed 28-07-21 09:57:01, Steven Whitehouse wrote:
> On Wed, 2021-07-28 at 08:50 +0200, Andreas Gruenbacher wrote:
> > On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson 
> > wrote:
> > > On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> > > 
> > > Hi,
> > > 
> > > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > 
> > > Before this patch, gfs2 kept its own address space for rgrps, but
> > > this
> > > caused a lockdep problem because vfs assumes a 1:1 relationship
> > > between
> > > address spaces and their inode. One problematic area is this:
> > > 
> > > I don't think that is the case. The reason that the address space
> > > is a
> > > separate structure in the first place is to allow them to exist
> > > without
> > > an inode. Maybe that has changed, but we should see why that is, in
> > > that case rather than just making this change immediately.
> > > 
> > > I can't see any reason why if we have to have an inode here that it
> > > needs to be hashed... what would need to look it up via the hashes?
> > > 
> > > Steve.
> > > 
> > > Hi,
> > > 
> > > The actual use case, which is easily demonstrated with lockdep, is
> > > given
> > > in the patch text shortly after where you placed your comment. This
> > > goes
> > > back to this discussion from April 2018:
> > > 
> > > https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> > > 
> > > in which Jan Kara pointed out that:
> > > 
> > > "The problem is we really do expect mapping->host->i_mapping ==
> > > mapping as
> > > we pass mapping and inode interchangeably in the mm code. The
> > > address_space
> > > and inodes are separate structures because you can have many inodes
> > > pointing to one address space (block devices). However it is not
> > > allowed
> > > for several address_spaces to point to one inode!"
> > 
> > This is fundamentally at adds with how we manage inodes: we have
> > inode->i_mapping which is the logical address space of the inode, and
> > we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
> > address space of the inode. The most important function of the
> > metadata address space is to remove the inode's metadata from memory
> > by truncating the metadata address space (inode_go_inval). We need
> > that when moving an inode to another node. I don't have the faintest
> > idea how we could otherwise achieve that in a somewhat efficient way.
> > 
> > Thanks,
> > Andreas
> > 
> 
> In addition, I'm fairly sure also that we were told to use this
> solution (i.e. a separate address space) back in the day because it was
> expected that they didn't have a 1:1 relationship with inodes. I don't
> think we'd have used that solution otherwise. I've not had enough time
> to go digging back in my email to check, but it might be worth looking
> to see when we introduced the use of the second address space (removing
> a whole additional inode structure) and any discussions around that
> change,

AFAIK in last 20 years it has never been the case that multiple
address_space structs for an inode would be handled by the VFS/MM code
properly. There can be multiple 'struct inode' for one 'struct
address_space'. That is handled just fine and is being used. The trouble is
that once you allow multiple address_space structs pointing to one struct
inode, you have some hard questions to answer (at least for VFS) - e.g. you
get a request to writeback the inode, how you do that when you have
multiple address spaces? Writeback them all? And how do you iterate them?

And there are similar questions regarding how to determine inode's
dirtiness or whether some inode's page is under writeback. Tied to that are
locking questions where inode->i_mapping->i_pages->xa_lock lock is used to
protect some of the inode state transitions. But when you have multiple
address spaces pointing to the inode it is not clear which of these many
locks would be protecting it (and the warning syzbot is hitting exactly
says that these state transitions are not properly serialized).

Of course all this can be decided and implemented but it is not a trivial
task there needs to be good motivation for the added complexity in the code
used by everybody. And AFAIU GFS2 even doesn't strictly need it. It uses a
very limited subset of functions that can operate on address_space for
these special address spaces. Andreas mentions you use metadata
address_space for tracking and evicting cached metadata associated with the
inode. Quickly checking the code, another heavy use seems to be metadata
dirtiness tracking and writeback. I'd note that other filesystems
traditionally use mapping->private_list for exactly these purposes (through
helpers like mark_buffer_dirty_inode, sync_mapping_buffers, etc.). Any
reason why you don't use these?

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-28 Thread Steven Whitehouse
Hi,

On Wed, 2021-07-28 at 08:50 +0200, Andreas Gruenbacher wrote:
> On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson 
> wrote:
> > On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> > 
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > 
> > Before this patch, gfs2 kept its own address space for rgrps, but
> > this
> > caused a lockdep problem because vfs assumes a 1:1 relationship
> > between
> > address spaces and their inode. One problematic area is this:
> > 
> > I don't think that is the case. The reason that the address space
> > is a
> > separate structure in the first place is to allow them to exist
> > without
> > an inode. Maybe that has changed, but we should see why that is, in
> > that case rather than just making this change immediately.
> > 
> > I can't see any reason why if we have to have an inode here that it
> > needs to be hashed... what would need to look it up via the hashes?
> > 
> > Steve.
> > 
> > Hi,
> > 
> > The actual use case, which is easily demonstrated with lockdep, is
> > given
> > in the patch text shortly after where you placed your comment. This
> > goes
> > back to this discussion from April 2018:
> > 
> > https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> > 
> > in which Jan Kara pointed out that:
> > 
> > "The problem is we really do expect mapping->host->i_mapping ==
> > mapping as
> > we pass mapping and inode interchangeably in the mm code. The
> > address_space
> > and inodes are separate structures because you can have many inodes
> > pointing to one address space (block devices). However it is not
> > allowed
> > for several address_spaces to point to one inode!"
> 
> This is fundamentally at adds with how we manage inodes: we have
> inode->i_mapping which is the logical address space of the inode, and
> we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
> address space of the inode. The most important function of the
> metadata address space is to remove the inode's metadata from memory
> by truncating the metadata address space (inode_go_inval). We need
> that when moving an inode to another node. I don't have the faintest
> idea how we could otherwise achieve that in a somewhat efficient way.
> 
> Thanks,
> Andreas
> 

In addition, I'm fairly sure also that we were told to use this
solution (i.e. a separate address space) back in the day because it was
expected that they didn't have a 1:1 relationship with inodes. I don't
think we'd have used that solution otherwise. I've not had enough time
to go digging back in my email to check, but it might be worth looking
to see when we introduced the use of the second address space (removing
a whole additional inode structure) and any discussions around that
change,

Steve.





Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-28 Thread Andreas Gruenbacher
On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson  wrote:
> On 7/13/21 1:26 PM, Steven Whitehouse wrote:
>
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>
> Before this patch, gfs2 kept its own address space for rgrps, but
> this
> caused a lockdep problem because vfs assumes a 1:1 relationship
> between
> address spaces and their inode. One problematic area is this:
>
> I don't think that is the case. The reason that the address space is a
> separate structure in the first place is to allow them to exist without
> an inode. Maybe that has changed, but we should see why that is, in
> that case rather than just making this change immediately.
>
> I can't see any reason why if we have to have an inode here that it
> needs to be hashed... what would need to look it up via the hashes?
>
> Steve.
>
> Hi,
>
> The actual use case, which is easily demonstrated with lockdep, is given
> in the patch text shortly after where you placed your comment. This goes
> back to this discussion from April 2018:
>
> https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
>
> in which Jan Kara pointed out that:
>
> "The problem is we really do expect mapping->host->i_mapping == mapping as
> we pass mapping and inode interchangeably in the mm code. The address_space
> and inodes are separate structures because you can have many inodes
> pointing to one address space (block devices). However it is not allowed
> for several address_spaces to point to one inode!"

This is fundamentally at adds with how we manage inodes: we have
inode->i_mapping which is the logical address space of the inode, and
we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
address space of the inode. The most important function of the
metadata address space is to remove the inode's metadata from memory
by truncating the metadata address space (inode_go_inval). We need
that when moving an inode to another node. I don't have the faintest
idea how we could otherwise achieve that in a somewhat efficient way.

Thanks,
Andreas



Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-13 Thread Steven Whitehouse
Hi,

On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> Before this patch, gfs2 kept its own address space for rgrps, but
> this
> caused a lockdep problem because vfs assumes a 1:1 relationship
> between
> address spaces and their inode. One problematic area is this:
> 
I don't think that is the case. The reason that the address space is a
separate structure in the first place is to allow them to exist without
an inode. Maybe that has changed, but we should see why that is, in
that case rather than just making this change immediately.

I can't see any reason why if we have to have an inode here that it
needs to be hashed... what would need to look it up via the hashes?

Steve.

> gfs2_unpin
>mark_buffer_dirty(bh);
>   mapping = page_mapping(page);
>  __set_page_dirty(page, mapping, memcg, 0);
> xa_lock_irqsave(>i_pages, flags);
>  ^---locks page->mapping->i_pages
> account_page_dirtied(page, mapping)
>  struct inode *inode = mapping->host;
>  ^---assumes the mapping points to an inode
>inode_to_wb(inode)
>   WARN_ON_ONCE !lockdep_is_held(>i_mapping->
> i_pages.xa_lock)
> 
> It manifests as a lockdep warning you see in the last line.
> 
> This patch removes sd_aspace in favor of an entire inode, sd_inode.
> Functions that need to access the address space may use a new
> function
> that follows the inode to its address space. This creates the 1:1
> relation
> between the inode and its address space, so lockdep doesn't complain.
> This is how some other file systems manage their metadata, such as
> btrfs.
> 
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/glops.c  |  4 ++--
>  fs/gfs2/incore.h |  7 ++-
>  fs/gfs2/meta_io.c|  2 +-
>  fs/gfs2/meta_io.h|  2 --
>  fs/gfs2/ops_fstype.c | 27 ---
>  fs/gfs2/super.c  |  2 +-
>  6 files changed, 26 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 744cacd27213..5d755d30d91c 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -162,7 +162,7 @@ void gfs2_ail_flush(struct gfs2_glock *gl, bool
> fsync)
>  static int gfs2_rgrp_metasync(struct gfs2_glock *gl)
>  {
>   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> - struct address_space *metamapping = >sd_aspace;
> + struct address_space *metamapping = gfs2_aspace(sdp);
>   struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
>   const unsigned bsize = sdp->sd_sb.sb_bsize;
>   loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
> @@ -219,7 +219,7 @@ static int rgrp_go_sync(struct gfs2_glock *gl)
>  static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
>  {
>   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> - struct address_space *mapping = >sd_aspace;
> + struct address_space *mapping = gfs2_aspace(sdp);
>   struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
>   const unsigned bsize = sdp->sd_sb.sb_bsize;
>   loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 566c0053b7c5..075e5db1d654 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -797,7 +797,7 @@ struct gfs2_sbd {
>  
>   /* Log stuff */
>  
> - struct address_space sd_aspace;
> + struct inode *sd_inode;
>  
>   spinlock_t sd_log_lock;
>  
> @@ -857,6 +857,11 @@ struct gfs2_sbd {
>   unsigned long sd_log_flush_start;
>  };
>  
> +static inline struct address_space *gfs2_aspace(struct gfs2_sbd
> *sdp)
> +{
> + return sdp->sd_inode->i_mapping;
> +}
> +
>  static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int
> which)
>  {
>   gl->gl_stats.stats[which]++;
> diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
> index 7c9619997355..0123437d9c12 100644
> --- a/fs/gfs2/meta_io.c
> +++ b/fs/gfs2/meta_io.c
> @@ -120,7 +120,7 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock
> *gl, u64 blkno, int create)
>   unsigned int bufnum;
>  
>   if (mapping == NULL)
> - mapping = >sd_aspace;
> + mapping = gfs2_aspace(sdp);
>  
>   shift = PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift;
>   index = blkno >> shift; /* convert block to page */
> diff --git a/fs/gfs2/meta_io.h b/fs/gfs2/meta_io.h
> index 21880d72081a..70b9c41ecb46 100644
> --- a/fs/gfs2/meta_io.h
> +++ b/fs/gfs2/meta_io.h
> @@ -42,8 +42,6 @@ static inline struct gfs2_sbd
> *gfs2_mapping2sbd(struct address_space *mapping)
>   struct inode *inode = mapping->host;
>   if (mapping->a_ops == _meta_aops)
>   return (((struct gfs2_glock *)mapping) - 1)-
> >gl_name.ln_sbd;
> - else if (mapping->a_ops == _rgrp_aops)
> - return container_of(mapping, struct gfs2_sbd,
> sd_aspace);
>   else
>   return inode->i_sb->s_fs_info;
>  }
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index b09e61457b23..3e252cfa7f17 100644
> ---