Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote: > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" 写道: > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > This patchset is a try to resolve the shared 'page cache' problem > > > > > > for > > > > > > fsdax. > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax > > > > > > entry > > > > > > will be associated more than once if is shared. At the second > > > > > > time we > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) > > > > > > when > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > a new address_space for each shared extent. Per page just seems like > > > > > a huge overhead. > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > yet... > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > to obtain this info from FS when associating a page. May be a bit > > > > complicated. Let me think about it... > > > > > > That's why I want the -user of this association- to do a filesystem > > > callout instead of keeping it's own naive tracking infrastructure. > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > from it's own extent tracking infrastructure, and there's zero > > > runtime overhead when there are no errors present. 
> > > > > > At the moment, this "dax association" is used to "report" a storage > > > media error directly to userspace. I say "report" because what it > > > does is kill userspace processes dead. The storage media error > > > actually needs to be reported to the owner of the storage media, > > > which in the case of FS-DAX is the filesytem. > > > > Understood. > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > media range (i.e. the filesystem block it corresponds to) and take > > > appropriate action. e.g. > > > > I tried writing a function to look up all the owners' info of one block in > > xfs for memory-failure use. It was dropped in this patchset because I found > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > by default, rmapbt is disabled. I am not sure if it matters... > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. You're confusing reflink with rmap. :) rmapbt does all the reverse mapping tracking, reflink just does the shared data extent tracking. But given that anyone who wants to use DAX with reflink is going to have to mkfs their filesystem anyway (to turn on reflink) requiring that rmapbt is also turned on is not a big deal. Especially as we can check it at mount time in the kernel... Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
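The mount-time check Dave suggests - refusing a DAX + reflink mount unless rmapbt is also enabled, so that bad-block owners can always be reverse-mapped - can be sketched as a tiny policy helper. This is a hypothetical model, not actual XFS mount code; the function name and flags are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy helper modelling the mount-time check discussed
 * above: if the filesystem is mounted with DAX and has reflink enabled,
 * it must also have the rmapbt feature, otherwise there is no way to
 * reverse-map a failed media range back to its owners. */
static bool dax_reflink_mount_ok(bool has_reflink, bool has_rmapbt, bool dax)
{
    if (dax && has_reflink && !has_rmapbt)
        return false;   /* refuse: cannot reverse-map shared extents */
    return true;
}
```

Under this (assumed) policy, only the DAX + reflink + no-rmapbt combination is rejected; everything else mounts as before.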
Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:
> > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > This patchset is an attempt to resolve the shared 'page cache' problem
> > > for fsdax.
> > >
> > > In order to track multiple mappings and indexes on one page, I
> > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > will be associated more than once if it is shared. The second time we
> > > associate this entry, we create this rb-tree and store its root in
> > > page->private (not used in fsdax). Insert (->mapping, ->index) in
> > > dax_associate_entry() and delete it in dax_disassociate_entry().
> >
> > Do we really want to track all of this on a per-page basis? I would
> > have thought a per-extent basis was more useful. Essentially, create
> > a new address_space for each shared extent. Per page just seems like
> > a huge overhead.
>
> Per-extent tracking is a nice idea for me. I haven't thought of it
> yet...
>
> But the extent info is maintained by the filesystem. I think we need a
> way to obtain this info from the FS when associating a page. It may be
> a bit complicated. Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping its own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from its own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesystem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action, e.g.:

- if it falls in filesystem metadata, shut down the filesystem
- if it falls in user data, call the "kill userspace dead" routines
  for each mapping/index tuple the filesystem finds for the given LBA
  address where the media error occurred.

Right now, if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor, and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
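The callout flow Dave describes - the driver reports a bad media range to the filesystem, which reverse-maps it to owners and acts on each - can be modelled with a small dispatch function. All names here are illustrative stand-ins, not a real kernel API.

```c
#include <assert.h>

/* Hypothetical model of the media-error callout discussed above.
 * The filesystem reverse-maps a failed LBA range to a set of owners;
 * each owner is either filesystem metadata or user data. */
enum owner_kind { OWNER_FS_METADATA, OWNER_USER_DATA };

struct owner {
    enum owner_kind kind;
    unsigned long index;    /* page index into the owning file, if any */
};

enum action { ACT_SHUTDOWN_FS, ACT_KILL_MAPPING };

/* Decide what to do for one owner of the failed media range:
 * metadata corruption shuts the filesystem down; user data gets the
 * "kill userspace dead" treatment on its mapping/index tuple. */
static enum action handle_media_error(const struct owner *o)
{
    if (o->kind == OWNER_FS_METADATA)
        return ACT_SHUTDOWN_FS;
    return ACT_KILL_MAPPING;
}
```

The point of the sketch is only the split in responsibility: the driver knows the bad range, the filesystem knows the owners, and the per-owner action is a filesystem policy decision rather than something the DAX core should hard-code.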
Re: [f2fs-dev] [PATCH 2/4] fs: avoid double-writing the inode on a lazytime expiration
On Wed, Mar 25, 2020 at 01:28:23PM +0100, Christoph Hellwig wrote:
> In the case that an inode has a dirty timestamp for longer than the
> lazytime expiration timeout (or if all such inodes are being flushed
> out due to a sync or syncfs system call), we need to inform the file
> system that the inode is dirty so that the inode's timestamps can be
> copied out to the on-disk data structures. That's because if the file
> system supports lazytime, it will have ignored the dirty_inode(inode,
> I_DIRTY_TIME) notification when the timestamp was modified in memory.
>
> Previously, this was accomplished by calling mark_inode_dirty_sync(),
> but that has the unfortunate side effect of also putting the inode on
> the writeback list, and that's not necessary in this case, since we
> will immediately call write_inode() afterwards. Replace the call to
> mark_inode_dirty_sync() with a new lazytime_expired method to clearly
> separate out this case.

Hmmm. Doesn't this cause issues with both iput() and vfs_fsync_range()?
They call mark_inode_dirty_sync() on I_DIRTY_TIME inodes to move them
onto the writeback list so they are appropriately expired when the
inode is written back.

i.e.:

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2094386af8ac..e5aafd40dd0f 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -612,19 +612,13 @@ xfs_fs_destroy_inode(
>  }
>
>  static void
> -xfs_fs_dirty_inode(
> -	struct inode	*inode,
> -	int		flag)
> +xfs_fs_lazytime_expired(
> +	struct inode	*inode)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_trans	*tp;
>
> -	if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> -		return;
> -	if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> -		return;
> -
>  	if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
>  		return;
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> @@ -1053,7 +1047,7 @@ xfs_fs_free_cached_objects(
>  static const struct super_operations xfs_super_operations = {
>  	.alloc_inode	= xfs_fs_alloc_inode,
>  	.destroy_inode	= xfs_fs_destroy_inode,
> -	.dirty_inode	= xfs_fs_dirty_inode,
> +	.lazytime_expired	= xfs_fs_lazytime_expired,
>  	.drop_inode	= xfs_fs_drop_inode,
>  	.put_super	= xfs_fs_put_super,
>  	.sync_fs	= xfs_fs_sync_fs,

This means XFS no longer updates/logs the current timestamp (because
->dirty_inode(I_DIRTY_SYNC) is no longer called for XFS) before ->fsync
flushes the inode data and metadata changes to the journal. Hence the
current in-memory timestamps are not present in the log before the
fsync is run, and so we violate the fsync guarantees lazytime gives
for timestamp updates.

I haven't quite got it straight in my head if the iput() case has
similar problems, but the fsync case definitely looks broken.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Thu, Mar 12, 2020 at 07:34:45AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 12, 2020 at 11:07:17AM +1100, Dave Chinner wrote:
> > > That's true, but when the timestamps were originally modified,
> > > dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> > > *not* be a no-op; which is to say, XFS will force the timestamps to be
> > > updated on disk when the timestamps are first dirtied, because it
> > > doesn't support I_DIRTY_TIME.
> >
> > We log the initial timestamp change, and then ignore timestamp
> > updates until the dirty time expires and the inode is set
> > I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
> > timestamps that may be 24 hours out of date in memory, and they
> > still need to be flushed to the journal.
> >
> > However, your change does not mark the inode dirty on expiry
> > anymore, so...
> >
> > > So I think we're fine.
> >
> > ... we're not fine. This breaks XFS and any other filesystem that
> > relies on an I_DIRTY_SYNC notification to handle dirty time expiry
> > correctly.
>
> I haven't seen the original mail this replies to,

The original problem was that calling mark_inode_dirty_sync() on
expiry during inode writeback was putting the inode back on the dirty
inode list, and so ext4 was flushing it twice - once on expiry and
once 5 seconds later on the next background writeback pass. This is a
problem that XFS does not have because it does not implement
->write_inode...

> but if we could
> get the lazytime expiry by some other means (e.g. an explicit
> callback), XFS could opt out of all the VFS inode tracking again,
> which would simplify a few things.

Yes, that would definitely make things simpler for XFS, and it would
also solve the problem that the generic lazytime expiry code has.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Wed, Mar 11, 2020 at 08:57:49AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> > Thanks Ted! This fixes the fscrypt test failure.
> >
> > However, are you sure this works correctly on all filesystems? I'm
> > not sure about XFS. XFS only implements ->dirty_inode(), not
> > ->write_inode(), and in its ->dirty_inode() it does:
> ...
> > 	if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> > 		return;
>
> That's true, but when the timestamps were originally modified,
> dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> *not* be a no-op; which is to say, XFS will force the timestamps to be
> updated on disk when the timestamps are first dirtied, because it
> doesn't support I_DIRTY_TIME.

We log the initial timestamp change, and then ignore timestamp
updates until the dirty time expires and the inode is set
I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
timestamps that may be 24 hours out of date in memory, and they
still need to be flushed to the journal.

However, your change does not mark the inode dirty on expiry
anymore, so...

> So I think we're fine.

... we're not fine. This breaks XFS and any other filesystem that
relies on an I_DIRTY_SYNC notification to handle dirty time expiry
correctly.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration
On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> On Fri, Mar 06, 2020 at 09:00:43PM -0500, Theodore Ts'o wrote:
> > In the case that an inode has a dirty timestamp for longer than the
> > lazytime expiration timeout (or if all such inodes are being flushed
> > out due to a sync or syncfs system call), we need to inform the file
> > system that the inode is dirty so that the inode's timestamps can be
> > copied out to the on-disk data structures. That's because if the file
> > system supports lazytime, it will have ignored the dirty_inode(inode,
> > I_DIRTY_TIME) notification when the timestamp was modified in memory.
> >
> > Previously, this was accomplished by calling mark_inode_dirty_sync(),
> > but that has the unfortunate side effect of also putting the inode on
> > the writeback list, and that's not necessary in this case, since we
> > will immediately call write_inode() afterwards.
> >
> > Eric Biggers noticed that this was causing problems for fscrypt after
> > the key was removed[1].
> >
> > [1] https://lore.kernel.org/r/20200306004555.gb225...@gmail.com
> >
> > Reported-by: Eric Biggers
> > Signed-off-by: Theodore Ts'o
> > ---
> >  fs/fs-writeback.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 76ac9c7d32ec..32101349ba97 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1504,8 +1504,9 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> >
> >  	spin_unlock(&inode->i_lock);
> >
> > -	if (dirty & I_DIRTY_TIME)
> > -		mark_inode_dirty_sync(inode);
> > +	/* This was a lazytime expiration; we need to tell the file system */
> > +	if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
> > +		inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);
> >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> >  	if (dirty & ~I_DIRTY_PAGES) {
> >  		int err = write_inode(inode, wbc);
>
> Thanks Ted! This fixes the fscrypt test failure.
>
> However, are you sure this works correctly on all filesystems? I'm not
> sure about XFS. XFS only implements ->dirty_inode(), not
> ->write_inode(), and in its ->dirty_inode() it does:
>
> static void
> xfs_fs_dirty_inode(
> 	struct inode	*inode,
> 	int		flag)
> {
> 	struct xfs_inode	*ip = XFS_I(inode);
> 	struct xfs_mount	*mp = ip->i_mount;
> 	struct xfs_trans	*tp;
>
> 	if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> 		return;
> 	if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> 		return;
>
> 	if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
> 		return;
> 	xfs_ilock(ip, XFS_ILOCK_EXCL);
> 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> 	xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
> 	xfs_trans_commit(tp);
> }
>
> So flag=I_DIRTY_TIME_EXPIRED will be a no-op.
>
> Maybe you should be using I_DIRTY_SYNC instead? Or perhaps XFS should
> be checking for either I_DIRTY_TIME_EXPIRED or I_DIRTY_SYNC?

Right, XFS does not use the VFS inode writeback code at all - we
track all metadata changes internally via journalling. The VFS uses
I_DIRTY_SYNC to indicate an inode is metadata-dirty and a writeback
candidate. Hence if we need to mark an inode dirty for integrity
purposes for _any reason_, then I_DIRTY_SYNC is the correct flag to
be passing to ->dirty_inode.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
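Eric's observation - that passing I_DIRTY_TIME_EXPIRED into XFS's ->dirty_inode is a silent no-op - can be demonstrated by modelling the early-return check quoted above. This is illustrative userspace code, not kernel code; the flag values are arbitrary stand-ins, not the real kernel constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values for illustration only (the kernel's real
 * I_DIRTY_* values differ). */
#define I_DIRTY_SYNC          (1 << 0)
#define I_DIRTY_TIME          (1 << 1)
#define I_DIRTY_TIME_EXPIRED  (1 << 2)

/* Model of the guard in xfs_fs_dirty_inode(): timestamps are only
 * logged when the flag is exactly I_DIRTY_SYNC and the inode still
 * carries I_DIRTY_TIME. Any other flag falls through the early
 * return and nothing is logged. */
static bool xfs_would_log_timestamps(int flag, int inode_state)
{
    if (flag != I_DIRTY_SYNC || !(inode_state & I_DIRTY_TIME))
        return false;   /* the no-op path Eric points out */
    return true;        /* would log XFS_ILOG_TIMESTAMP in a transaction */
}
```

Running the model shows the asymmetry: an I_DIRTY_SYNC notification on a time-dirty inode gets logged, while an I_DIRTY_TIME_EXPIRED notification is dropped, which is exactly why Dave argues I_DIRTY_SYNC is the correct flag to pass.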
Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote: > On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote: > > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote: > > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: > > > [..] > > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > > > > callback that deals with page aligned entries. That change at least > > > > > > makes the error boundary symmetric across copy_from_iter() and the > > > > > > zeroing path. > > > > > > > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take > > > > > page > > > > > aligned start and size and call this interface from > > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > > > > path? > > > > > > > > > > Something like. > > > > > > > > > > __dax_zero_page_range() { > > > > > if(page_aligned_io) > > > > > call_dax_page_zero_range() > > > > > else > > > > > use_direct_access_and_memcpy; > > > > > } > > > > > > > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > > > > to calling dax_zero_page_range() instead. > > > > > > > > > > If yes, I am not seeing what advantage do we get by this change. > > > > > > > > > > - __dax_zero_page_range() seems to be called by only partial block > > > > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > > > > > > > > > > - dax_zero_page_range() will be exact replacement of > > > > > blkdev_issue_zeroout() so filesystems will not gain anything. Just > > > > > that > > > > > it will create a dax specific hook. > > > > > > > > > > In that case it might be simpler to just get rid of > > > > > blkdev_issue_zeroout() > > > > > call from __dax_zero_page_range() and make sure there are no callers > > > > > of > > > > > full block zeroing from this path. > > > > > > > > I think you're right. 
The path I'm concerned about not regressing is > > > > the error clearing on new block allocation and we get that already via > > > > xfs_zero_extent() and sb_issue_zeroout(). > > > > > > Well I was wrong. I found atleast one user which uses > > > __dax_zero_page_range() > > > to zero full PAGE_SIZE blocks. > > > > > > xfs_io -c "allocsp 32K 0" foo.txt > > > > That ioctl interface is deprecated and likely unused by any new > > application written since 1999. It predates unwritten extents (1998) > > and I don't think any native linux applications have ever used it. A > > newly written DAX aware application would almost certainly not use > > this interface. > > > > IOWs, I wouldn't use it as justification for some special case > > behaviour; I'm more likely to say "rip that ancient ioctl out" than > > to jump through hoops because it's implementation behaviour. > > Hi Dave, > > Do you see any other path in xfs using iomap_zero_range() and zeroing > full block. Yes: - xfs_file_aio_write_checks() for zeroing blocks between the existing EOF and the start of the incoming write beyond EOF - xfs_setattr_size() on truncate up for zeroing blocks between the existing EOF and the new EOF. - xfs_reflink_zero_posteof() for zeroing blocks between the old EOF and where the new reflinked extents are going to land beyond EOF And don't think that blocks beyond EOF can't exist when DAX is enabled. We can turn DAX on and off, we can crash between allocation and file size extension, etc. Hence this code must be able to handle zeroing large ranges of blocks beyond EOF... > iomap_zero_range() already skips IOMAP_HOLE and > IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type > IOMAP_HOLE and IOMAP_UNWRITTEN. > > My understanding is that ext4 and xfs both are initializing full blocks > using blkdev_issue_zeroout(). Only partial blocks are being zeroed using > this dax zeroing path. Look at the API, not the callers: iomap_zero_range takes a 64 bit length parameter. 
It can be asked to zero blocks across petabytes of a file. If there's a single block somewhere in that range, it will only zero that block. If the entire range is allocated, it will zero that entire range (yes, it will take forever!), as that is what it is intended to do. It should be pretty clear that it needs to be able to zero entire pages, regardless of how it is currently used/called by filesystems. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
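The dispatch sketched in the thread's pseudocode - page-aligned requests go to a driver-side zeroing path (which can also clear poison), while sub-page requests fall back to direct-access memset - can be modelled in a few lines. The names `pmem_zeroout` and `dax_zero_range` are hypothetical stand-ins for the interfaces under discussion, and a plain buffer stands in for the mapped pmem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stands in for the driver-level zeroing path, which on real pmem
 * would also clear known-poison state for the range. */
static bool pmem_zeroout(char *base, size_t off, size_t len)
{
    memset(base + off, 0, len);
    return true;
}

/* Model of the discussed dispatch: page-aligned I/O takes the driver
 * path; partial-page I/O uses direct access and memset. */
static void dax_zero_range(char *base, size_t off, size_t len)
{
    if ((off % PAGE_SIZE) == 0 && (len % PAGE_SIZE) == 0)
        pmem_zeroout(base, off, len);   /* page-aligned: driver path */
    else
        memset(base + off, 0, len);     /* partial page: direct access */
}
```

The point Dave is making still holds in the model: the caller (iomap_zero_range) may pass arbitrarily large, fully page-aligned ranges, so the driver path must handle whole pages, not just the partial-block tails that current filesystem callers happen to generate.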
Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote: > On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote: > > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote: > > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: > > > [..] > > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > > > > callback that deals with page aligned entries. That change at least > > > > > > makes the error boundary symmetric across copy_from_iter() and the > > > > > > zeroing path. > > > > > > > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take > > > > > page > > > > > aligned start and size and call this interface from > > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > > > > path? > > > > > > > > > > Something like. > > > > > > > > > > __dax_zero_page_range() { > > > > > if(page_aligned_io) > > > > > call_dax_page_zero_range() > > > > > else > > > > > use_direct_access_and_memcpy; > > > > > } > > > > > > > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > > > > to calling dax_zero_page_range() instead. > > > > > > > > > > If yes, I am not seeing what advantage do we get by this change. > > > > > > > > > > - __dax_zero_page_range() seems to be called by only partial block > > > > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > > > > > > > > > > - dax_zero_page_range() will be exact replacement of > > > > > blkdev_issue_zeroout() so filesystems will not gain anything. Just > > > > > that > > > > > it will create a dax specific hook. > > > > > > > > > > In that case it might be simpler to just get rid of > > > > > blkdev_issue_zeroout() > > > > > call from __dax_zero_page_range() and make sure there are no callers > > > > > of > > > > > full block zeroing from this path. > > > > > > > > I think you're right. 
The path I'm concerned about not regressing is > > > > the error clearing on new block allocation and we get that already via > > > > xfs_zero_extent() and sb_issue_zeroout(). > > > > > > Well I was wrong. I found atleast one user which uses > > > __dax_zero_page_range() > > > to zero full PAGE_SIZE blocks. > > > > > > xfs_io -c "allocsp 32K 0" foo.txt > > > > That ioctl interface is deprecated and likely unused by any new > > application written since 1999. It predates unwritten extents (1998) > > and I don't think any native linux applications have ever used it. A > > newly written DAX aware application would almost certainly not use > > this interface. > > > > IOWs, I wouldn't use it as justification for some special case > > behaviour; I'm more likely to say "rip that ancient ioctl out" than > > to jump through hoops because it's implementation behaviour. > > Hi Dave, > > Do you see any other path in xfs using iomap_zero_range() and zeroing > full block. Yes: - xfs_file_aio_write_checks() for zeroing blocks between the existing EOF and the start of the incoming write beyond EOF - xfs_setattr_size() on truncate up for zeroing blocks between the existing EOF and the new EOF. - xfs_reflink_zero_posteof() for zeroing blocks between the old EOF and where the new reflinked extents are going to land beyond EOF And don't think that blocks beyond EOF can't exist when DAX is enabled. We can turn DAX on and off, we can crash between allocation and file size extension, etc. Hence this code must be able to handle zeroing large ranges of blocks beyond EOF... > iomap_zero_range() already skips IOMAP_HOLE and > IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type > IOMAP_HOLE and IOMAP_UNWRITTEN. > > My understanding is that ext4 and xfs both are initializing full blocks > using blkdev_issue_zeroout(). Only partial blocks are being zeroed using > this dax zeroing path. Look at the API, not the callers: iomap_zero_range takes a 64 bit length parameter. 
It can be asked to zero blocks across petabytes of a file. If there's a
single block somewhere in that range, it will only zero that block. If
the entire range is allocated, it will zero that entire range (yes, it
will take forever!) as that is what it is intended to do.

It should be pretty clear that it needs to be able to zero entire pages,
regardless of how it is currently used/called by filesystems.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
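[Editor's note: the page-aligned dispatch Vivek sketches above can be modeled as a tiny user-space decision function. This is purely illustrative — `pick_zero_path()` and the enum names are hypothetical and not kernel API; the real kernel code would route page-aligned ranges to the driver's zero-out callback and everything else through direct access plus memset.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Hypothetical model: page-aligned ranges go to the driver zero-out
 * callback; partial ranges fall back to direct access + memcpy/memset. */
enum zero_path { ZERO_VIA_DRIVER, ZERO_VIA_MEMCPY };

static enum zero_path pick_zero_path(unsigned long long offset, size_t len)
{
	bool page_aligned = (offset % PAGE_SIZE == 0) &&
			    (len % PAGE_SIZE == 0) && len > 0;
	return page_aligned ? ZERO_VIA_DRIVER : ZERO_VIA_MEMCPY;
}
```

As the thread notes, the interesting question is not the dispatch itself but whether any caller actually sends full-block ranges down this path.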
Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner wrote:
> > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > > Anyway, partial page truncate can't ensure that data in the rest of
> > > the page can be read back successfully. Memory can get poisoned after
> > > the write and hence read after truncate will still fail.
> >
> > Which is where the notification requirement comes in. Yes, we may
> > still get errors on read or write, but if memory at rest goes bad,
> > we want to handle that and correct it ASAP, not wait days or months
> > for something to trip over the poisoned page before we find out
> > about it.
> >
> > > Hence, all we are trying to ensure is that if a poison is known at the
> > > time of writing a partial page, then we should return an error to user
> > > space.
> >
> > I think within the FS-DAX infrastructure, any access to the data (read
> > or write) within a poisoned page or a page marked with PageError()
> > should return EIO to the caller, unless it's the specific command to
> > clear the error/poison state on the page. What happens with that
> > error state is then up to the caller.
>
> I agree with most of the above if you replace "device-dax error
> handling" with "System RAM error handling". It's typical memory error
> handling that injects the page->index and page->mapping dependency.

I disagree, but that's beside the point and not worth arguing.

> So you want the FS to have error handling for just pmem errors or all
> memory errors?

Just pmem errors in the specific range the filesystem manages - we
really only care about storage errors because those are the only errors
the filesystem is responsible for handling correctly.

Somebody else can worry about errors that hit page cache pages - page
cache pages require mapping/index pointers on each page anyway, so a
generic mechanism for handling those errors can be built into the page
cache.
And if the error is in general kernel memory, then it's game over for
the entire kernel at that point, not just the filesystem.

> And you want this to be done without the mm core using
> page->index to identify what to unmap when the error happens?

Isn't that exactly what I just said?

We get the page address that failed, the daxdev can turn that into a
sector address and call into the filesystem with a {sector, len, errno}
tuple. We then do a reverse mapping lookup on {sector, len} to find all
the owners of that range in the filesystem. If it's file data, that
owner record gives us the inode and the offset into the file, which then
gives us a {mapping, index} tuple.

Further, the filesystem reverse map is aware that its blocks can be
shared by multiple owners, so it will have a record for every inode and
file offset that maps to that page. Hence we can simply iterate the
reverse map and do that whacky collect_procs/kill_procs dance for every
{mapping, index} pair that references the bad range.

Nothing ever needs to be stored on the struct page...

> Memory
> error scanning is not universally available on all pmem
> implementations, so FS will need to subscribe for memory-error
> handling events.

No. Filesystems interact with the underlying device abstraction, not the
physical storage that lies behind that device abstraction. The
filesystem should not interface with pmem directly in any way (all
direct accesses are hidden inside fs/dax.c!), nor should it care about
the quirks of the pmem implementation it is sitting on. That's what the
daxdev abstraction is supposed to hide from the users of the pmem.

IOWs, the daxdev infrastructure subscribes to the memory-error event
subsystem, and calls out to the filesystem when an error in a page in
the daxdev is reported.
The filesystem only needs to know the {sector, len, errno} tuple related
to the error; it is the device's responsibility to handle the physical
mechanics of listening, filtering and translating MCEs to a format the
filesystem understands.

Another reason it should be provided by the daxdev as a {sector, len,
errno} tuple is that it also allows non-dax block devices to implement
similar error notifications and provide filesystems with exactly the
same information, so the filesystem can start auto-recovery processes.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
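[Editor's note: the daxdev-to-filesystem translation Dave describes — a failed physical page address becoming a device-relative {sector, len, errno} tuple — can be sketched as a small user-space model. The struct and function names here are hypothetical; the kernel has no API with these exact signatures.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Hypothetical tuple the daxdev would hand to the filesystem. */
struct media_error {
	uint64_t sector;	/* first bad sector, device-relative */
	uint64_t len;		/* length of the bad range, in sectors */
	int	 error;		/* errno to report, e.g. EIO */
};

/* dev_phys_base: physical address where the daxdev's storage begins.
 * bad_addr/bad_len: the failed physical range reported by the MCE layer. */
static struct media_error
daxdev_translate_error(uint64_t dev_phys_base, uint64_t bad_addr,
		       uint64_t bad_len)
{
	struct media_error e = {
		.sector = (bad_addr - dev_phys_base) >> SECTOR_SHIFT,
		.len	= bad_len >> SECTOR_SHIFT,
		.error	= EIO,
	};
	return e;
}
```

The filesystem would then feed {sector, len} into its reverse map to find every owner of the bad range, exactly as described above.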
Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> [..]
> > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > callback that deals with page aligned entries. That change at least
> > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > zeroing path.
> > >
> > > IIUC, you are suggesting that we modify dax_zero_page_range() to take
> > > page aligned start and size and call this interface from
> > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > path?
> > >
> > > Something like:
> > >
> > > __dax_zero_page_range() {
> > >     if (page_aligned_io)
> > >         call_dax_page_zero_range()
> > >     else
> > >         use_direct_access_and_memcpy;
> > > }
> > >
> > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > to calling dax_zero_page_range() instead.
> > >
> > > If yes, I am not seeing what advantage we get by this change.
> > >
> > > - __dax_zero_page_range() seems to be called by only partial block
> > >   zeroing code. So the dax_zero_page_range() call will remain unused.
> > >
> > > - dax_zero_page_range() will be an exact replacement of
> > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just
> > >   that it will create a dax specific hook.
> > >
> > > In that case it might be simpler to just get rid of the
> > > blkdev_issue_zeroout() call from __dax_zero_page_range() and make sure
> > > there are no callers of full block zeroing from this path.
> >
> > I think you're right. The path I'm concerned about not regressing is
> > the error clearing on new block allocation and we get that already via
> > xfs_zero_extent() and sb_issue_zeroout().
>
> Well I was wrong. I found at least one user which uses
> __dax_zero_page_range() to zero full PAGE_SIZE blocks.
>
> xfs_io -c "allocsp 32K 0" foo.txt

That ioctl interface is deprecated and likely unused by any new
application written since 1999.
It predates unwritten extents (1998) and I don't think any native Linux
applications have ever used it. A newly written DAX aware application
would almost certainly not use this interface.

IOWs, I wouldn't use it as justification for some special case
behaviour; I'm more likely to say "rip that ancient ioctl out" than to
jump through hoops because of its implementation behaviour.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > [..]
> > > > > Hi Jeff,
> > > > >
> > > > > The new dax zeroing interface (dax_zero_page_range()) can
> > > > > technically pass a range which is less than a sector. Or which is
> > > > > bigger than a sector but start and end are not aligned on sector
> > > > > boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > Create a file foo.txt of size 4K and then truncate it:
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24
> > > to 4095.
> >
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> >
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> >
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> >
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
>
> Hi Dave,
>
> What is expected if I do "truncate -s 512 foo.txt"? Say the first sector
> (0 to 511) is poisoned and the rest don't have poison. Should this fail
> with -EIO?

Yes - the filesystem block still contains bad data.

> In the current implementation it does not.

I'm not surprised - the whole hardware error handling architecture for
FS-DAX is fundamentally broken. It was designed for device-dax, and it
just doesn't work for FS-DAX.
For example, to get the hardware error handling to be able to kill
userspace applications, a 1:1 physical-to-logical association constraint
was added to fs/dax.c:

/*
 * TODO: for reflink+dax we need a way to associate a single page with
 * multiple address_space instances at different linear_page_index()
 * offsets.
 */
static void dax_associate_entry(void *entry, struct address_space *mapping,
		struct vm_area_struct *vma, unsigned long address)

because device-dax only has *linear mappings* and so has a fixed reverse
mapping architecture. i.e. the whole hardware error handling
infrastructure was designed around the constraints of device-dax.

device-dax does not have any structure to serialise access to the
physical storage, so locking was added to the mapping tree. The mapping
tree locking is accessed on hardware error via the reverse mapping
association in the struct page, and that's how device-dax serialises
direct physical storage access against hardware error processing. And
while the page index is locked in the mapping tree, it can walk the
process vmas that have the page mapped to kill them so that they don't
try to access the bad page.

That bad physical storage state is carried in a volatile struct page
flag, hence requiring some mechanism to make it persistent (the device
badblocks register) and some other mechanism to clear the poison state
(direct IO, IIRC). It's also a big, nasty, solid roadblock to
implementing shared data extents in FS-DAX.

We basically have to completely re-architect the hardware error handling
for FS-DAX so that such errors are reported to the filesystem first and
then the filesystem does what is needed to handle the error.

None of this works for filesystems because they need to perform
different operations depending on what the page that went bad contains.
FS-DAX should never trip over an unexpected poisoned page; we do so now
because such physical storage errors are completely hidden from the
filesystem.
What you are trying to do is slap a band-aid over what to do when we hit
an unexpected page containing bad data.

Filesystems expect to find out about bad data in storage when they
marshall the data into or out of memory. They make the assumption that
once it is in memory it remains valid on the physical storage. Hence if
an in-memory error occurs, we can just toss it away and re-read it from
storage, and all is good.

FS-DAX changes that - we are no longer marshalling data into and out of
memory so we don't have a mechanism to get EIO when reading the page
into the page cache or writing it back to disk. We also don't have an
in-memory copy of the data - the physical storage is the in-memory copy.
Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
ch cases.
>
> Do we expect an interface where, if there are any bad blocks in the
> range being zeroed, then they all should be cleared (and hence all I/O
> should be aligned), otherwise an error is returned? If yes, I could
> make that change.
>
> The downside of the current interface is that it will clear as many
> blocks as possible in the given range and leave the starting and end
> blocks poisoned (if it is unaligned) and not return an error. That
> means a reader will get an error on these blocks again and they will
> have to try to clear it again.

Which is solved by having partial page writes always return EIO on
poisoned memory.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
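[Editor's note: Dave's rule — only a full-block write clears the poison state, and any partial write overlapping poison fails with EIO without clearing anything — can be sketched as a toy user-space model. The struct and function names are hypothetical, not kernel API.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE 4096UL

/* Toy model: one filesystem block with a single poison bit. */
struct fsblock { bool poisoned; };

static int fsblock_write(struct fsblock *b, size_t off, size_t len)
{
	bool full_block = (off == 0 && len == BLOCK_SIZE);

	if (b->poisoned && !full_block)
		return -EIO;		/* partial write can't make the block good */
	if (full_block)
		b->poisoned = false;	/* whole block rewritten: poison cleared */
	return 0;
}
```

This keeps the error boundary unambiguous: a reader never sees a block that is half cleared and half poisoned, which is exactly the "downside" of the partially-clearing interface described above.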
Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> > > 	}
> > > 	ret = iomap_readpage_actor(inode, pos + done, length - done,
> > > 			ctx, iomap, srcmap);
> > > +	if (WARN_ON(ret == 0))
> > > +		break;
> >
> > This error case now leaks ctx->cur_page
>
> Yes ... and I see the consequence. I mean, this is a "shouldn't happen",
> so do we want to put effort into cleanup here ...

Well, the normal thing for XFS is that a production kernel cleans up and
handles the error gracefully with a WARN_ON_ONCE, while a debug kernel
build will chuck a tanty and burn the house down so as to make the
developers aware that there is a "should not happen" situation
occurring.

> > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> > > done:
> > > 	if (ctx.bio)
> > > 		submit_bio(ctx.bio);
> > > -	if (ctx.cur_page) {
> > > -		if (!ctx.cur_page_in_bio)
> > > -			unlock_page(ctx.cur_page);
> > > -		put_page(ctx.cur_page);
> > > -	}
> > > +	BUG_ON(ctx.cur_page);
> >
> > And so will now trigger both a warn and a bug
>
> ... or do we just want to run slap bang into this bug?
>
> Option 1: Remove the check for 'ret == 0' altogether, as we had it
> before. That puts us into endless loop territory for a failure mode, and
> it's not parallel with iomap_readpage().
>
> Option 2: Remove the WARN_ON from the check. Then we just hit the
> BUG_ON, but we don't know why we did it.
>
> Option 3: Set cur_page to NULL. We'll hit the WARN_ON, avoid the BUG_ON,
> might end up with a page in the page cache which is never unlocked.

None of these are appealing.

> Option 4: Do the unlock/put page dance before setting the cur_page to
> NULL. We might double-unlock the page.

Why would we double unlock the page?
Oh, the readahead cursor doesn't handle the case of partial page
submission, which would result in IO completion unlocking the page.

Ok, that's what the ctx.cur_page_in_bio check is used to detect, i.e. if
we've got a page that the readahead cursor points at, and we haven't
actually added it to a bio, then we can leave it to read_pages() to
unlock and clean up. If it's in a bio, then IO completion will unlock
it, and so we only have to drop the submission reference and move the
readahead cursor forwards so read_pages() doesn't try to unlock this
page. i.e:

	/* clean up partial page submission failures */
	if (ctx.cur_page && ctx.cur_page_in_bio) {
		put_page(ctx.cur_page);
		readahead_next(rac);
	}

looks to me like it will handle the case of "ret == 0" in the actor
function just fine.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
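[Editor's note: the ownership rule Dave spells out — if the page made it into a bio, IO completion owns the unlock and drops its own reference, so the submitter only drops the submission reference; if it never made it into a bio, it is left locked and referenced for read_pages() to clean up — can be checked with a toy user-space model. All names here are hypothetical, not kernel API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page's state during readahead submission. */
struct page_state {
	int  refcount;
	bool locked;
	bool in_bio;
};

/* What the submission path does when the actor fails mid-range. */
static void submit_cleanup(struct page_state *p)
{
	if (p->in_bio)
		p->refcount--;	/* completion owns the unlock + its own ref */
	/* !in_bio: leave locked and referenced for read_pages() */
}

/* What IO completion does for a page that was in a bio. */
static void io_completion(struct page_state *p)
{
	p->locked = false;
	p->refcount--;
}
```

Under this model neither path double-unlocks: each page is unlocked exactly once, by whichever side owns it.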
Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> > >  	}
> > >  	ret = iomap_readpage_actor(inode, pos + done, length - done,
> > >  			ctx, iomap, srcmap);
> > > +	if (WARN_ON(ret == 0))
> > > +		break;
> >
> > This error case now leaks ctx->cur_page
>
> Yes ... and I see the consequence.  I mean, this is a "shouldn't happen",
> so do we want to put effort into cleanup here ...

Well, the normal thing for XFS is that a production kernel cleans up and
handles the error gracefully with a WARN_ON_ONCE, while a debug kernel
build will chuck a tanty and burn the house down to make the developers
aware that there is a "should not happen" situation occurring.

> > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> > >  done:
> > >  	if (ctx.bio)
> > >  		submit_bio(ctx.bio);
> > > -	if (ctx.cur_page) {
> > > -		if (!ctx.cur_page_in_bio)
> > > -			unlock_page(ctx.cur_page);
> > > -		put_page(ctx.cur_page);
> > > -	}
> > > +	BUG_ON(ctx.cur_page);
> >
> > And so will now trigger both a warn and a bug
>
> ... or do we just want to run slap bang into this bug?
>
> Option 1: Remove the check for 'ret == 0' altogether, as we had it before.
> That puts us into endless loop territory for a failure mode, and it's not
> parallel with iomap_readpage().
>
> Option 2: Remove the WARN_ON from the check.  Then we just hit the BUG_ON,
> but we don't know why we did it.
>
> Option 3: Set cur_page to NULL.  We'll hit the WARN_ON, avoid the BUG_ON,
> might end up with a page in the page cache which is never unlocked.
>
> None of these are appealing.
>
> Option 4: Do the unlock/put page dance before setting the cur_page to NULL.
> We might double-unlock the page.

Why would we double unlock the page?
Oh, the readahead cursor doesn't handle the case of partial page
submission, which would result in IO completion unlocking the page.

Ok, that's what the ctx.cur_page_in_bio check is used to detect.  i.e.
if we've got a page that the readahead cursor points at, and we haven't
actually added it to a bio, then we can leave it to read_pages() to
unlock and clean up.  If it's in a bio, then IO completion will unlock
it and so we only have to drop the submission reference and move the
readahead cursor forwards so read_pages() doesn't try to unlock this
page.  i.e:

	/* clean up partial page submission failures */
	if (ctx.cur_page && ctx.cur_page_in_bio) {
		put_page(ctx.cur_page);
		readahead_next(rac);
	}

looks to me like it will handle the case of "ret == 0" in the actor
function just fine.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > > Latest version in your git tree:
> > > > >
> > > > > $ ▶ glo -n 5 willy/readahead
> > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > > $
> > > > >
> > > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > >
> > > > > [2.423116] [ cut here ]
> > > > > [2.424957] list_add double add: new=ea000efff4c8, prev=8883bfffee60, next=ea000efff4c8.
> > > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > > > [2.457484] Call Trace:
> > > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0
> > > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0
> > > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > > [2.461712] lru_add_drain_cpu+0x8d/0x160
> > > > > [2.462787] lru_add_drain+0x18/0x20
> > > >
> > > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > >
> > > Yes, because it's the only version I've actually merged into my
> > > working tree, compiled and tried to run. :P
> > >
> > > > version pushed to that git tree that did contain a list double-add
> > > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > > probabilistic because it'll depend on the timing between whatever other
> > > > list is being used and the page actually being added to the LRU.
> > >
> > > I'll see if I can reproduce it.
> >
> > Just updated to a current TOT Linus kernel and your latest branch,
> > and so far this is 100% reproducible.
> >
> > Not sure how I'm going to debug it yet, because it's init that is
> > triggering it
>
> Eric found it ...

Yeah, just saw that and am applying his patch to test it...

> still not sure why I don't see it.

No readahead configured on your device?

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > Latest version in your git tree:
> > >
> > > $ ▶ glo -n 5 willy/readahead
> > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > $
> > >
> > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > >
> > > [2.423116] [ cut here ]
> > > [2.424957] list_add double add: new=ea000efff4c8, prev=8883bfffee60, next=ea000efff4c8.
> > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > [2.457484] Call Trace:
> > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0
> > > [2.459376] pagevec_lru_move_fn+0x87/0xd0
> > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > [2.461712] lru_add_drain_cpu+0x8d/0x160
> > > [2.462787] lru_add_drain+0x18/0x20
> >
> > Are you sure that was 4be497096c04 ?  I ask because there was a
>
> Yes, because it's the only version I've actually merged into my
> working tree, compiled and tried to run. :P
>
> > version pushed to that git tree that did contain a list double-add
> > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > it, and 4be497096c04 doesn't have that problem.  I also test with
> > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > probabilistic because it'll depend on the timing between whatever other
> > list is being used and the page actually being added to the LRU.
>
> I'll see if I can reproduce it.

Just updated to a current TOT Linus kernel and your latest branch,
and so far this is 100% reproducible.

Not sure how I'm going to debug it yet, because it's init that is
triggering it.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com

___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path
On Mon, Feb 17, 2020 at 10:46:13AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Ensure that memory allocations in the readahead path do not attempt to
> reclaim file-backed pages, which could lead to a deadlock.  It is
> possible, though unlikely, that this is the root cause of a problem
> observed by Cong Wang.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> Reported-by: Cong Wang
> Suggested-by: Michal Hocko
> ---
>  mm/readahead.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 94d499cfb657..8f9c0dba24e7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -22,6 +22,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "internal.h"
>
> @@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space *mapping,
>  		._nr_pages = 0,
>  	};
>
> +	/*
> +	 * Partway through the readahead operation, we will have added
> +	 * locked pages to the page cache, but will not yet have submitted
> +	 * them for I/O.  Adding another page may need to allocate memory,
> +	 * which can trigger memory reclaim.  Telling the VM we're in
> +	 * the middle of a filesystem operation will cause it to not
> +	 * touch file-backed pages, preventing a deadlock.  Most (all?)
> +	 * filesystems already specify __GFP_NOFS in their mapping's
> +	 * gfp_mask, but let's be explicit here.
> +	 */
> +	unsigned int nofs = memalloc_nofs_save();
> +

So doesn't this largely remove the need for all the gfp flag futzing in
the readahead path?  i.e. almost all readahead allocations are now going
to be GFP_NOFS | GFP_NORETRY | GFP_NOWARN ?

If so, shouldn't we just strip all the gfp flags and masking out of the
readahead path altogether?

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 18/19] iomap: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
> use it.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  fs/iomap/buffered-io.c | 91 +++---
>  fs/iomap/trace.h       |  2 +-
>  fs/xfs/xfs_aops.c      | 13 +++---
>  fs/zonefs/super.c      |  7 ++--
>  include/linux/iomap.h  |  3 +-
>  5 files changed, 42 insertions(+), 74 deletions(-)

All pretty straightforward...

> @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			break;
>  		done += ret;
>  		if (offset_in_page(pos + done) == 0) {
> +			readahead_next(ctx->rac);
>  			if (!ctx->cur_page_in_bio)
>  				unlock_page(ctx->cur_page);
>  			put_page(ctx->cur_page);

Though now I look at the addition here, this might be better
restructured to mention how we handle partial page submission, such as:

		done += ret;
		/*
		 * Keep working on a partially complete page, otherwise
		 * ready the ctx for the next page to be acted on.
		 */
		if (offset_in_page(pos + done))
			continue;
		if (!ctx->cur_page_in_bio)
			unlock_page(ctx->cur_page);
		put_page(ctx->cur_page);
		ctx->cur_page = NULL;
		readahead_next(ctx->rac);
	}

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  fs/iomap/buffered-io.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..44303f370b2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
> -	loff_t done, ret;
> +	loff_t ret, done = 0;
>
> -	for (done = 0; done < length; done += ret) {
> -		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> -			if (!ctx->cur_page_in_bio)
> -				unlock_page(ctx->cur_page);
> -			put_page(ctx->cur_page);
> -			ctx->cur_page = NULL;
> -		}
> +	while (done < length) {
>  		if (!ctx->cur_page) {
>  			ctx->cur_page = iomap_next_page(inode, ctx->pages,
>  					pos, length, );
> @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
>  				ctx, iomap, srcmap);
> +		if (WARN_ON(ret == 0))
> +			break;

This error case now leaks ctx->cur_page

> +		done += ret;
> +		if (offset_in_page(pos + done) == 0) {
> +			if (!ctx->cur_page_in_bio)
> +				unlock_page(ctx->cur_page);
> +			put_page(ctx->cur_page);
> +			ctx->cur_page = NULL;
> +		}
>  	}
>
>  	return done;
> @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  done:
>  	if (ctx.bio)
>  		submit_bio(ctx.bio);
> -	if (ctx.cur_page) {
> -		if (!ctx.cur_page_in_bio)
> -			unlock_page(ctx.cur_page);
> -		put_page(ctx.cur_page);
> -	}
> +	BUG_ON(ctx.cur_page);

And so will now trigger both a warn and a bug

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 16/19] fuse: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Use the new readahead operation in fuse.  Switching away from the
> read_cache_pages() helper gets rid of an implicit call to put_page(),
> so we can get rid of the get_page() call in fuse_readpages_fill().
>
> Signed-off-by: Matthew Wilcox (Oracle)

Looks OK.

Reviewed-by: Dave Chinner
-- 
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 14/19] ext4: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in ext4 > > Signed-off-by: Matthew Wilcox (Oracle) There's nothing I can see in this that would cause that list corruption I saw with ext4. I'll re-introduce the patch and see if it falls over again. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/zdata.c | 29 + > 1 file changed, 9 insertions(+), 20 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/data.c | 39 +--- > fs/erofs/zdata.c | 2 +- > include/trace/events/erofs.h | 6 +++--- > 3 files changed, 18 insertions(+), 29 deletions(-) Looks fine from the perspective of page iteration and error handling. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 11/19] btrfs: Convert from readpages to readahead
On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > - if (nr) { > > > - u64 contig_start = page_offset(pagepool[0]); > > > + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); > > > > Ok, yes it does. :) > > > > I don't see how readahead_for_each_batch() guarantees that, though. > > I ... don't see how it doesn't? We start at rac->_start and iterate > through the consecutive pages in the page cache. readahead_for_each_batch() > does assume that __do_page_cache_readahead() has its current behaviour > of putting the pages in the page cache in order, and kicks off a new > call to ->readahead() every time it has to skip an index for whatever > reason (eg page already in page cache). And there is the comment I was looking for while reading readahead_for_each_batch() :) > > > > - if (bio) > > > - return submit_one_bio(bio, 0, bio_flags); > > > - return 0; > > > + if (bio) { > > > + if (submit_one_bio(bio, 0, bio_flags)) > > > + return; > > > + } > > > } > > > > Shouldn't that just be > > > > if (bio) > > submit_one_bio(bio, 0, bio_flags); > > It should, but some overzealous person decided to mark submit_one_bio() > as __must_check, so I have to work around that. /me looks at code Ngggh. I rather dislike functions that are named in a way that look like they belong to core kernel APIs but in reality are local static functions. I'd ask for this to be fixed if it was generic code, but it's btrfs specific code so they can deal with the ugliness of their own creation. :/ > > Confusing when put alongside rac->_batch_count counting the number > > of pages in the batch, and "batch" being the index into the page > > array, and they aren't the same counts > > Yes. Renamed to 'i'. 
> > > > + XA_STATE(xas, &rac->mapping->i_pages, rac->_start); > > > + struct page *page; > > > + > > > + rac->_batch_count = 0; > > > + xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) { > > > > That just iterates pages in the start,end doesn't it? What > > guarantees that this fills the array with a contiguous page range? > > The behaviour of __do_page_cache_readahead(). Dave Howells also has a > usecase for xas_for_each_contig(), so I'm going to add that soon. > > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + VM_BUG_ON_PAGE(PageTail(page), page); > > > + array[batch++] = page; > > > + rac->_batch_count += hpage_nr_pages(page); > > > + if (PageHead(page)) > > > + xas_set(&xas, rac->_start + rac->_batch_count); > > > > What on earth does this do? Comments please! > > /* > * The page cache isn't using multi-index entries yet, > * so xas_for_each() won't do the right thing for > * large pages. This can be removed once the page cache > * is converted. > */ Oh, it's changing the internal xarray lookup cursor position to point at the correct next page index? Perhaps it's better to say that instead of "won't do the right thing"? > > > +#define readahead_for_each_batch(rac, array, size, nr) > > > \ > > > + for (; (nr = readahead_page_batch(rac, array, size)); \ > > > + readahead_next(rac)) > > > > I had to go look at the caller to work out what "size" referred to > > here. > > > > This is complex enough that it needs proper API documentation. > > How about just: > > -#define readahead_for_each_batch(rac, array, size, nr) \ > - for (; (nr = readahead_page_batch(rac, array, size)); \ > +#define readahead_for_each_batch(rac, array, array_sz, nr) \ > + for (; (nr = readahead_page_batch(rac, array, array_sz)); \ Yup, that's fine - now the macro documents itself. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > ext4 and f2fs have duplicated the guts of the readahead code so > > > they can read past i_size. Instead, separate out the guts of the > > > readahead code so they can call it directly. > > > > Gross and nasty (hosting non-stale data beyond EOF in the page > > cache, that is). > > I thought you meant sneaking changes into the VFS (that were rejected) by > copying VFS code and modifying it ... Well, now that you mention it... :P > > > > +/** > > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > > > + * @mapping: File address space. > > > + * @file: This instance of the open file; used for authentication. > > > + * @offset: First page index to read. > > > + * @end_index: The maximum page index to read. > > > + * @nr_to_read: The number of pages to read. > > > + * @lookahead_size: Where to start the next readahead. > > > + * > > > + * This function is for filesystems to call when they want to start > > > + * readahead potentially beyond a file's stated i_size. If you want > > > + * to start readahead on a normal file, you probably want to call > > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > > > + * > > > + * Context: File is referenced by caller. Mutexes may be held by caller. > > > + * May sleep, but will not reenter filesystem to reclaim memory. > > > */ > > > -void __do_page_cache_readahead(struct address_space *mapping, > > > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > > > - unsigned long lookahead_size) > > > +void page_cache_readahead_limit(struct address_space *mapping, > > > > ... I don't think the function name conveys its purpose. It's > > really a ranged readahead that ignores where i_size lies. 
i.e > > > > page_cache_readahead_range(mapping, start, end, nr_to_read) > > > > seems like a better API to me, and then you can drop the "start > > readahead beyond i_size" comments and replace it with "Range is not > > limited by the inode's i_size and hence can be used to read data > > stored beyond EOF into the page cache." > > I'm concerned that calling it 'range' implies "I want to read between > start and end" rather than "I want to read nr_to_read at start, oh but > don't go past end". > > Maybe the right way to do this is have the three callers cap nr_to_read. > Well, the one caller ... after all, f2fs and ext4 have no desire to > cap the length. Then we can call it page_cache_readahead_exceed() or > page_cache_readahead_dangerous() or something else like that to make it > clear that you shouldn't be calling it. Fair point. And in reading this, it occurred to me that what we are enabling is an "out of bounds" readahead function. so page_cache_readahead_OOB() or *_unbounded() might be a better name > * Like add_to_page_cache_locked, but used to add newly allocated pages: > diff --git a/mm/readahead.c b/mm/readahead.c > index 9dd431fa16c9..cad26287ad8b 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > blk_finish_plug(&plug); > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_exceed - Start unchecked readahead. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @index: First page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. 
> + * > + * This function is for filesystems to call when they want to start > + * readahead beyond a file's stated i_size. This is almost certainly > + * not the function you want to call. Use page_cache_async_readahead() > + * or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. Yup, looks much better. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation
On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > This replaces ->readpages with a saner interface: > > > - Return void instead of an ignored error code. > > > - Pages are already in the page cache when ->readahead is called. > > > > Might read better as: > > > > - Page cache is already populated with locked pages when > >->readahead is called. > > Will do. > > > > - Implementation looks up the pages in the page cache instead of > > >having them passed in a linked list. > > > > Add: > > > > - cleanup of unused readahead handled by ->readahead caller, not > >the method implementation. > > The readpages caller does that cleanup too, so it's not an advantage > to the readahead interface. Right. I kind of read the list as "the reasons the new API is sane" not as "how readahead is different to readpages" > > > ``readpages`` > > > called by the VM to read pages associated with the address_space > > > object. This is essentially just a vector version of readpage. > > > Instead of just one page, several pages are requested. > > > readpages is only used for read-ahead, so read errors are > > > ignored. If anything goes wrong, feel free to give up. > > > + This interface is deprecated; implement readahead instead. > > > > What is the removal schedule for the deprecated interface? > > I mentioned that in the cover letter; once Dave Howells has the fscache > branch merged, I'll do the remaining filesystems. Should be within the > next couple of merge windows. Sure, but I like to see actual release tags with the deprecation notice so that it's obvious to the reader as to whether this is something new and/or when they can expect it to go away. 
> > > +/* The byte offset into the file of this readahead block */
> > > +static inline loff_t readahead_offset(struct readahead_control *rac)
> > > +{
> > > +	return (loff_t)rac->_start * PAGE_SIZE;
> > > +}
> >
> > Urk. Didn't an earlier patch use "offset" for the page index? That's
> > what "mm: Remove 'page_offset' from readahead loop" did, right?
> >
> > That's just going to cause confusion to have different units for
> > readahead "offsets".
>
> We are ... not consistent anywhere in the VM/VFS with our naming.
> Unfortunately.
>
> $ grep -n offset mm/filemap.c
> 391:	 * @start: offset in bytes where the range starts
> ...
> 815:	pgoff_t offset = old->index;
> ...
> 2020:	unsigned long offset; /* offset into pagecache page */
> ...
> 2257:	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
>
> That last one's my favourite. Not to mention the fine distinction you
> and I discussed recently between offset_in_page() and page_offset().
>
> Best of all, even our types encode the ambiguity of an 'offset'. We have
> pgoff_t and loff_t (replacing the earlier off_t).
>
> So, new rule. 'pos' is the number of bytes into a file. 'index' is the
> number of PAGE_SIZE pages into a file. We don't use the word 'offset'
> at all. 'length' as a byte count and 'count' as a page count seem like
> fine names to me.

That sounds very reasonable to me. Another patchset in the making? :)

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)"
> > >
> > > At allocation time, put the pages in the cache unless we're using
> > > ->readpages. Add the readahead_for_each() iterator for the benefit of
> > > the ->readpage fallback. This iterator supports huge pages, even though
> > > none of the filesystems to be converted do yet.
> >
> > This could be better written - took me some time to get my head
> > around it and the code.
> >
> > "When populating the page cache for readahead, mappings that don't
> > use ->readpages need to have their pages added to the page cache
> > before ->readpage is called. Do this insertion earlier so that the
> > pages can be looked up immediately prior to ->readpage calls rather
> > than passing them on a linked list. This early insert functionality
> > is also required by the upcoming ->readahead method that will
> > replace ->readpages.
> >
> > Optimise and simplify the readpage loop by adding a
> > readahead_for_each() iterator to provide the pages we need to read.
> > This iterator also supports huge pages, even though none of the
> > filesystems have been converted to use them yet."
>
> Thanks, I'll use that.
>
> > > +static inline struct page *readahead_page(struct readahead_control *rac)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (!rac->_nr_pages)
> > > +		return NULL;
> >
> > Hmmm.
> >
> > > +
> > > +	page = xa_load(&rac->mapping->i_pages, rac->_start);
> > > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +	rac->_batch_count = hpage_nr_pages(page);
> >
> > So we could have rac->_nr_pages = 2, and then we get an order 2
> > large page returned, and so rac->_batch_count = 4.
>
> Well, no, we couldn't. rac->_nr_pages is incremented by 4 when we add
> an order-2 page to the readahead.

I don't see any code that does that. :) i.e. we aren't actually putting high order pages into the page cache here (page_alloc() allocates order-0 pages), so there's nothing in the patch that tells me how rac->_nr_pages behaves when allocating large pages... IOWs, we have an undocumented assumption in the implementation...

> I can put a
> 	BUG_ON(rac->_batch_count > rac->_nr_pages)
> in here to be sure to catch any logic error like that.

Definitely necessary given that we don't insert large pages for readahead yet. A comment explaining the assumptions that the code makes for large pages is probably in order, too.

> > > -		page->index = offset;
> > > -		list_add(&page->lru, &page_pool);
> > > +		if (use_list) {
> > > +			page->index = offset;
> > > +			list_add(&page->lru, &page_pool);
> > > +		} else if (add_to_page_cache_lru(page, mapping, offset,
> > > +					gfp_mask) < 0) {
> > > +			put_page(page);
> > > +			goto read;
> > > +		}
> >
> > Ok, so that's why you put read code at the end of the loop. To turn
> > the code into spaghetti :/
> >
> > How much does this simplify down when we get rid of ->readpages and
> > can restructure the loop? This really seems like you're trying to
> > flatten two nested loops into one by the use of goto...
>
> I see it as having two failure cases in this loop. One for "page is
> already present" (which already existed) and one for "allocated a page,
> but failed to add it to the page cache" (which used to be done later).
> I didn't want to duplicate the "call read_pages()" code. So I reshuffled
> the code rather than add a nested loop. I don't think the nested loop
> is easier to read (we'll be at 5 levels of indentation for some statements).
> Could do it this way ...

Can we move the update of @rac inside read_pages()? The next start offset^Windex we start at is rac->_start + rac->_nr_pages, right? So read_pages() could do:

	{
		if (readahead_count(rac)) {
			/* do readahead */
		}

		/* advance the readahead cursor */
		rac->_start += rac->_nr_pages;
		rac->_nr_pages = 0;
	}

and then we only need to call read_pages() in these cases, and the need to duplicate code goes away...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 04/19] mm: Rearrange readahead loop
On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)"
> > >
> > > Move the declaration of 'page' to inside the loop and move the 'kick
> > > off a fresh batch' code to the end of the function for easier use in
> > > subsequent patches.
> >
> > Stale? The "kick off" code is moved to the tail of the loop, not the
> > end of the function.
>
> Braino; I meant to write end of the loop.
>
> > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
> > >  		page = xa_load(&mapping->i_pages, page_offset);
> > >  		if (page && !xa_is_value(page)) {
> > >  			/*
> > > -			 * Page already present? Kick off the current batch of
> > > -			 * contiguous pages before continuing with the next
> > > -			 * batch.
> > > +			 * Page already present? Kick off the current batch
> > > +			 * of contiguous pages before continuing with the
> > > +			 * next batch. This page may be the one we would
> > > +			 * have intended to mark as Readahead, but we don't
> > > +			 * have a stable reference to this page, and it's
> > > +			 * not worth getting one just for that.
> > >  			 */
> > > -			if (readahead_count(&rac))
> > > -				read_pages(&rac, &page_pool, gfp_mask);
> > > -			rac._nr_pages = 0;
> > > -			continue;
> > > +			goto read;
> > >  		}
> > >
> > >  		page = __page_cache_alloc(gfp_mask);
> > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
> > >  		if (page_idx == nr_to_read - lookahead_size)
> > >  			SetPageReadahead(page);
> > >  		rac._nr_pages++;
> > > +		continue;
> > > +read:
> > > +		if (readahead_count(&rac))
> > > +			read_pages(&rac, &page_pool, gfp_mask);
> > > +		rac._nr_pages = 0;
> > >  	}
> >
> > Also, why? This adds a goto from branched code that continues, then
> > adds a continue so the unbranched code doesn't execute the code the
> > goto jumps to. In the absence of any explanation, this isn't an
> > improvement and doesn't make any sense...
>
> I thought I was explaining it ... "for easier use in subsequent patches".

Sorry, my braino there. :) I commented on the problem with the first part of the sentence, then the rest of the sentence completely failed to sink in.

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> > > +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> > > +		gfp_t gfp)
> > >  {
> > > +	const struct address_space_operations *aops = rac->mapping->a_ops;
> > >  	struct blk_plug plug;
> > >  	unsigned page_idx;
> >
> > Splitting out the aops rather than the mapping here just looks
> > weird, especially as you need the mapping later in the function.
> > Using aops doesn't even reduce the code size...
>
> It does in subsequent patches ... I agree it looks a little weird here,
> but I think in the final form, it makes sense:

Ok. Perhaps just an additional commit comment to say "read_pages() is changed to be aops centric as @rac abstracts away all other implementation details by the end of the patchset."

> > > +	if (readahead_count(&rac))
> > > +		read_pages(&rac, &page_pool, gfp_mask);
> > > +	rac._nr_pages = 0;
> >
> > Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead
> > control structure - if we have to pass the gfp_mask down all the
> > way alongside the rac, then I think it makes sense to do that...
>
> So we end up removing it later on in this series, but I do wonder if
> it would make sense anyway. By the end of the series, we still have
> this in iomap:
>
> 	if (ctx->rac) /* same as readahead_gfp_mask */
> 		gfp |= __GFP_NORETRY | __GFP_NOWARN;
>
> and we could get rid of that by passing gfp flags down in the rac. On the
> other hand, I don't know why it doesn't just use readahead_gfp_mask()
> here anyway ... Christoph?

mapping->gfp_mask is awful. Is it a mask, or is it a valid set of allocation flags? Or both?

Some callers to mapping_gfp_constraint() use it as a mask, some callers to mapping_gfp_constraint() use it as base flags that context-specific flags get masked out of, and readahead_gfp_mask() callers use it as the entire set of gfp flags for allocation.

That whole API sucks - undocumented as to what it's supposed to do and how it's supposed to be used. Hence it's difficult to use correctly or to understand whether it's being used correctly. And reading callers only leads to more confusion and crazy code like in do_mpage_readpage(), where readahead returns a mask that is used as base flags and normal reads return a masked set of base flags...

The iomap code is obviously correct when it comes to gfp flag manipulation. We start with GFP_KERNEL context, then constrain it via the mask held in mapping->gfp_mask, then if it's readahead we allow the allocation to silently fail. Simple to read and understand code, versus weird code that requires the reader to decipher an undocumented and inconsistent API to understand how the gfp flags have been calculated and are valid.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > > > +static void read_pages(struct readahead_control *rac, struct list_head > > > *pages, > > > + gfp_t gfp) > > > { > > > + const struct address_space_operations *aops = rac->mapping->a_ops; > > > struct blk_plug plug; > > > unsigned page_idx; > > > > Splitting out the aops rather than the mapping here just looks > > weird, especially as you need the mapping later in the function. > > Using aops doesn't even reduce the code side > > It does in subsequent patches ... I agree it looks a little weird here, > but I think in the final form, it makes sense: Ok. Perhaps just an additional commit comment to say "read_pages() is changed to be aops centric as @rac abstracts away all other implementation details by the end of the patchset." > > > + if (readahead_count()) > > > + read_pages(, _pool, gfp_mask); > > > + rac._nr_pages = 0; > > > > Hmmm. Wondering ig it make sense to move the gfp_mask to the readahead > > control structure - if we have to pass the gfp_mask down all the > > way along side the rac, then I think it makes sense to do that... > > So we end up removing it later on in this series, but I do wonder if > it would make sense anyway. By the end of the series, we still have > this in iomap: > > if (ctx->rac) /* same as readahead_gfp_mask */ > gfp |= __GFP_NORETRY | __GFP_NOWARN; > > and we could get rid of that by passing gfp flags down in the rac. On the > other hand, I don't know why it doesn't just use readahead_gfp_mask() > here anyway ... Christoph? mapping->gfp_mask is awful. Is it a mask, or is it a valid set of allocation flags? Or both? 
Some callers to mapping_gfp_constraint() use it as a mask, some callers to mapping_gfp_constraint() use it as base flags that context specific flags get masked out of, readahead_gfp_mask() callers use it as the entire set of gfp flags for allocation. That whole API sucks - undocumented as to what it's supposed to do and how it's supposed to be used. Hence it's difficult to use correctly or understand whether it's being used correctly. And reading callers only leads to more confusion and crazy code like in do_mpage_readpage() where readahead returns a mask that is used as base flags and normal reads return a masked set of base flags... The iomap code is obviously correct when it comes to gfp flag manipulation. We start with GFP_KERNEL context, then constrain it via the mask held in mapping->gfp_mask, then if it's readahead we allow the allocation to silently fail. Simple to read and understand code, versus having weird code that requires the reader to decipher an undocumented and inconsistent API to understand how the gfp flags have been calculated and are valid. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > Latest version in your git tree: > > > > $ ▶ glo -n 5 willy/readahead > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > ff63497fcb98 iomap: Convert from readpages to readahead > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > 8115bcca7312 fuse: Convert from readpages to readahead > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > $ > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > [2.423116] [ cut here ] > > [2.424957] list_add double add: new=ea000efff4c8, > > prev=8883bfffee60, next=ea000efff4c8. > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > __list_add_valid+0x67/0x70 > > [2.457484] Call Trace: > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > [2.462787] lru_add_drain+0x18/0x20 > > Are you sure that was 4be497096c04 ? I ask because there was a Yes, because it's the only version I've actually merged into my working tree, compiled and tried to run. :P > version pushed to that git tree that did contain a list double-add > (due to a mismerge when shuffling patches). I noticed it and fixed > it, and 4be497096c04 doesn't have that problem. I also test with > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > probabilistic because it'll depend on the timing between whatever other > list is being used and the page actually being added to the LRU. I'll see if I can reproduce it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(PageTail(page), page); > + array[batch++] = page; > + rac->_batch_count += hpage_nr_pages(page); > + if (PageHead(page)) > + xas_set(&xas, rac->_start + rac->_batch_count); What on earth does this do? Comments please! > + > + if (batch == size) > + break; > + } > + > + return batch; > +} Seems a bit big for an inline function. > + > +#define readahead_for_each_batch(rac, array, size, nr) \ > + for (; (nr = readahead_page_batch(rac, array, size)); \ > + readahead_next(rac)) I had to go look at the caller to work out what "size" referred to here. This is complex enough that it needs proper API documentation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Implement the new readahead aop and convert all callers (block_dev, > exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, > reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Junxiao Bi # ocfs2 > --- > drivers/staging/exfat/exfat_super.c | 7 +++--- > fs/block_dev.c | 7 +++--- > fs/ext2/inode.c | 10 +++- > fs/fat/inode.c | 7 +++--- > fs/gfs2/aops.c | 23 ++--- > fs/hpfs/file.c | 7 +++--- > fs/iomap/buffered-io.c | 2 +- > fs/isofs/inode.c| 7 +++--- > fs/jfs/inode.c | 7 +++--- > fs/mpage.c | 38 + > fs/nilfs2/inode.c | 15 +++- > fs/ocfs2/aops.c | 34 ++ > fs/omfs/file.c | 7 +++--- > fs/qnx6/inode.c | 7 +++--- > fs/reiserfs/inode.c | 8 +++--- > fs/udf/inode.c | 7 +++--- > include/linux/mpage.h | 4 +-- > mm/migrate.c| 2 +- > 18 files changed, 73 insertions(+), 126 deletions(-) That's actually pretty simple changeover. Nothing really scary there. :) Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > ext4 and f2fs have duplicated the guts of the readahead code so > they can read past i_size. Instead, separate out the guts of the > readahead code so they can call it directly. Gross and nasty (hosting non-stale data beyond EOF in the page cache, that is). Code is pretty simple, but... > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @offset: First page index to read. > + * @end_index: The maximum page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. > + * > + * This function is for filesystems to call when they want to start > + * readahead potentially beyond a file's stated i_size. If you want > + * to start readahead on a normal file, you probably want to call > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. > */ > -void __do_page_cache_readahead(struct address_space *mapping, > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > - unsigned long lookahead_size) > +void page_cache_readahead_limit(struct address_space *mapping, ... I don't think the function name conveys its purpose. It's really a ranged readahead that ignores where i_size lies. 
i.e. page_cache_readahead_range(mapping, start, end, nr_to_read) seems like a better API to me, and then you can drop the "start readahead beyond i_size" comments and replace them with "Range is not limited by the inode's i_size and hence can be used to read data stored beyond EOF into the page cache." Also: "This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead." Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation
agemap.h > index 3613154e79e4..bd4291f78f41 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -665,6 +665,24 @@ static inline void readahead_next(struct > readahead_control *rac) > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); readahead_next(rac)) > > +/* The byte offset into the file of this readahead block */ > +static inline loff_t readahead_offset(struct readahead_control *rac) > +{ > + return (loff_t)rac->_start * PAGE_SIZE; > +} Urk. Didn't an earlier patch use "offset" for the page index? That's what "mm: Remove 'page_offset' from readahead loop" did, right? That's just going to cause confusion to have different units for readahead "offsets" > + > +/* The number of bytes in this readahead block */ > +static inline loff_t readahead_length(struct readahead_control *rac) > +{ > + return (loff_t)rac->_nr_pages * PAGE_SIZE; > +} > + > +/* The index of the first page in this readahead block */ > +static inline unsigned int readahead_index(struct readahead_control *rac) > +{ > + return rac->_start; > +} Based on this, I suspect the earlier patch should use "index" rather than "offset" when walking the page cache indexes... > + > /* The number of pages in this readahead block */ > static inline unsigned int readahead_count(struct readahead_control *rac) > { > diff --git a/mm/readahead.c b/mm/readahead.c > index 9e430daae42f..975ff5e387be 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > > blk_start_plug(&plug); > > - if (aops->readpages) { > + if (aops->readahead) { > + aops->readahead(rac); > + readahead_for_each(rac, page) { > + unlock_page(page); > + put_page(page); > + } This needs a comment to explain the unwinding that needs to be done here. I'm not going to remember in a year's time that this is just for the pages that weren't submitted by ->readahead Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
page->index = offset; > - list_add(&page->lru, &page_pool); > + if (use_list) { > + page->index = offset; > + list_add(&page->lru, &page_pool); > + } else if (add_to_page_cache_lru(page, mapping, offset, > + gfp_mask) < 0) { > + put_page(page); > + goto read; > + } Ok, so that's why you put read code at the end of the loop. To turn the code into spaghetti :/ How much does this simplify down when we get rid of ->readpages and can restructure the loop? This really seems like you're trying to flatten two nested loops into one by the use of goto... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [f2fs-dev] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Change the type of page_idx to unsigned long, and rename it -- it's > just a loop counter, not a page index. > > Suggested-by: John Hubbard > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Eliminate the page_offset variable which was confusing with the
> 'offset' parameter and record the start of each consecutive run of
> pages in the readahead_control.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  mm/readahead.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)

Looks ok, but having the readahead dispatch out of line from the case
that triggers it makes it hard to follow.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v6 04/19] mm: Rearrange readahead loop
On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Move the declaration of 'page' to inside the loop and move the 'kick
> off a fresh batch' code to the end of the function for easier use in
> subsequent patches.

Stale? The "kick off" code is moved to the tail of the loop, not the
end of the function.

> @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = xa_load(&mapping->i_pages, page_offset);
>  		if (page && !xa_is_value(page)) {
>  			/*
> -			 * Page already present? Kick off the current batch of
> -			 * contiguous pages before continuing with the next
> -			 * batch.
> +			 * Page already present? Kick off the current batch
> +			 * of contiguous pages before continuing with the
> +			 * next batch. This page may be the one we would
> +			 * have intended to mark as Readahead, but we don't
> +			 * have a stable reference to this page, and it's
> +			 * not worth getting one just for that.
>  			 */
> -			if (readahead_count(&rac))
> -				read_pages(&rac, &page_pool, gfp_mask);
> -			rac._nr_pages = 0;
> -			continue;
> +			goto read;
>  		}
>
>  		page = __page_cache_alloc(gfp_mask);
> @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> +		continue;
> +read:
> +		if (readahead_count(&rac))
> +			read_pages(&rac, &page_pool, gfp_mask);
> +		rac._nr_pages = 0;
>  	}

Also, why? This adds a goto from branched code that continues, then
adds a continue so the unbranched code doesn't execute the code the
goto jumps to. In the absence of any explanation, this isn't an
improvement and doesn't make any sense...

--
Dave Chinner
da...@fromorbit.com
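The control flow Dave is querying can be modelled in a few lines of userspace C: two logical loops (accumulate a run of missing pages, then dispatch the run) flattened into one loop with a `goto`/`continue` pair. This is an illustrative sketch with made-up names and a fixed-size batch, not the kernel code:

```c
#include <assert.h>

static int flush_count;                 /* how many batches were dispatched */

/* Stands in for submitting the batch as I/O. */
static void read_pages(const int *batch, int n)
{
    (void)batch;
    (void)n;
    flush_count++;
}

/* present[i] != 0 means page i is already in the cache.
 * Returns the number of dispatched batches. */
static int readahead_flattened(const int *present, int nr)
{
    int batch[64], n = 0, i;

    for (i = 0; i < nr; i++) {
        if (present[i])
            goto read;                  /* cached page interrupts the run */
        batch[n++] = i;                 /* "allocated" page joins the batch */
        continue;                       /* skip the dispatch code below */
read:
        if (n)
            read_pages(batch, n);       /* kick off the current batch */
        n = 0;
    }
    if (n)
        read_pages(batch, n);           /* dispatch the final batch */
    return flush_count;
}
```

The restructure Dave suggests would instead use an inner loop that consumes one contiguous run per outer iteration, making the dispatch point obvious without either `goto` or the guarding `continue`.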
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> In this patch, only between __do_page_cache_readahead() and
> read_pages(), but it will be extended in upcoming patches. Also add
> the readahead_count() accessor.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  include/linux/pagemap.h | 17 +++++++++++++++++
>  mm/readahead.c          | 36 +++++++++++++++++++++---------------
>  2 files changed, 38 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..982ecda2d4a2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page,
>  	return error;
>  }
>
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> +	pgoff_t _start;
> +	unsigned int _nr_pages;
> +};
> +
> +/* The number of pages in this readahead block */
> +static inline unsigned int readahead_count(struct readahead_control *rac)
> +{
> +	return rac->_nr_pages;
> +}
> +
>  static inline unsigned long dir_pages(struct inode *inode)
>  {
>  	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 12d13b7792da..15329309231f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
>
>  EXPORT_SYMBOL(read_cache_pages);
>
> -static void read_pages(struct address_space *mapping, struct file *filp,
> -		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> +		gfp_t gfp)
>  {
> +	const struct address_space_operations *aops = rac->mapping->a_ops;
>  	struct blk_plug plug;
>  	unsigned page_idx;

Splitting out the aops rather than the mapping here just looks weird,
especially as you need the mapping later in the function. Using aops
doesn't even reduce the code size.

>  	blk_start_plug(&plug);
>
> -	if (mapping->a_ops->readpages) {
> -		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> +	if (aops->readpages) {
> +		aops->readpages(rac->file, rac->mapping, pages,
> +				readahead_count(rac));
>  		/* Clean up the remaining pages */
>  		put_pages_list(pages);
>  		goto out;
>  	}
>
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> +	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>  		struct page *page = lru_to_page(pages);
>  		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> -			mapping->a_ops->readpage(filp, page);
> +		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> +				gfp))
> +			aops->readpage(rac->file, page);

... it just makes this less readable by splitting the if() over two
lines...

>  		put_page(page);
>  	}
>
> @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> -	unsigned int nr_pages = 0;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	struct readahead_control rac = {
> +		.mapping = mapping,
> +		.file = filp,
> +		._nr_pages = 0,
> +	};

No need to initialise _nr_pages to zero, leaving it out will do the
same thing.

>  	if (isize == 0)
>  		return;
>
> @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 * contiguous pages before continuing with the next
>  	 * batch.
>  	 */
> -	if (nr_pages)
> -		read_pages(mapping, filp, &page_pool, nr_pages,
> -				gfp_mask);
> -	nr_pages = 0;
> +	if (readahead_count(&rac))
> +		read_pages(&rac, &page_pool, gfp_mask);
> +	rac._nr_pages = 0;

Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead
control structure - if we have to pass the gfp_mask down all the way
alongside the rac, then I think it makes sense to do that...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
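Dave's closing suggestion -- carrying the gfp mask inside the readahead control structure instead of passing it alongside -- would look roughly like the userspace sketch below. The struct layout and accessor follow the patch quoted above; the `gfp_mask` field and the `void *` stand-in types are assumptions for illustration, not the merged kernel layout:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;     /* stand-in for the kernel's gfp_t */

/* Readahead is of a block of consecutive pages. */
struct readahead_control {
    void *file;                 /* struct file * in the kernel */
    void *mapping;              /* struct address_space * in the kernel */
    gfp_t gfp_mask;             /* Dave's suggested addition */
/* private: use the readahead_* accessors instead */
    unsigned long _start;       /* pgoff_t in the kernel */
    unsigned int _nr_pages;
};

/* The number of pages in this readahead block. */
static inline unsigned int readahead_count(struct readahead_control *rac)
{
    return rac->_nr_pages;
}
```

With the mask in the struct, `read_pages()` would shrink to a single `rac` parameter, which is the point of the suggestion: every caller already has the `rac` in hand.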
Re: [PATCH v6 00/19] Change readahead API
On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> This series adds a readahead address_space operation to eventually
> replace the readpages operation. The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache. It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yuku...@huawei.com/
>
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Latest version in your git tree:

$ glo -n 5 willy/readahead
4be497096c04 mm: Use memalloc_nofs_save in readahead path
ff63497fcb98 iomap: Convert from readpages to readahead
26aee60e89b5 iomap: Restructure iomap_readpages_actor
8115bcca7312 fuse: Convert from readpages to readahead
3db3d10d9ea1 f2fs: Convert from readpages to readahead

merged into a 5.6-rc2 tree fails at boot on my test vm:

[    2.423116] [ cut here ]
[    2.424957] list_add double add: new=ea000efff4c8, prev=8883bfffee60, next=ea000efff4c8.
[    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
[    2.430617] CPU: 4 PID: 1 Comm: sh Not tainted 5.6.0-rc2-dgc+ #1800
[    2.432405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    2.434744] RIP: 0010:__list_add_valid+0x67/0x70
[    2.436107] Code: c6 4c 89 ca 48 c7 c7 10 41 58 82 e8 55 29 89 ff 0f 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 60 41 58 82 e8 3b 29 89 ff <0f> 0b 31 c7
[    2.441161] RSP: :c900018a3bb0 EFLAGS: 00010082
[    2.442548] RAX: RBX: ea000efff4c0 RCX: 0256
[    2.32] RDX: 0001 RSI: 0086 RDI: 8288a8b0
[    2.446315] RBP: ea000efff4c8 R08: c900018a3a65 R09: 0256
[    2.448199] R10: 0008 R11: c900018a3a65 R12: ea000efff4c8
[    2.450072] R13: 8883bfffee60 R14: 0010 R15: 0001
[    2.451959] FS: () GS:8883b9c0() knlGS:
[    2.454083] CS: 0010 DS: ES: CR0: 80050033
[    2.455604] CR2: CR3: 0003b9a37002 CR4: 00060ee0
[    2.457484] Call Trace:
[    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
[    2.459376]  pagevec_lru_move_fn+0x87/0xd0
[    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
[    2.461712]  lru_add_drain_cpu+0x8d/0x160
[    2.462787]  lru_add_drain+0x18/0x20
[    2.463757]  shift_arg_pages+0xb8/0x180
[    2.464789]  ? vprintk_emit+0x101/0x1c0
[    2.465813]  ? printk+0x58/0x6f
[    2.466659]  setup_arg_pages+0x205/0x240
[    2.467716]  load_elf_binary+0x34a/0x1560
[    2.468789]  ? get_user_pages_remote+0x159/0x280
[    2.470024]  ? selinux_inode_permission+0x10d/0x1e0
[    2.471323]  ? _raw_read_unlock+0xa/0x20
[    2.472375]  ? load_misc_binary+0x2b2/0x410
[    2.473492]  search_binary_handler+0x60/0xe0
[    2.474634]  __do_execve_file.isra.0+0x512/0x850
[    2.475888]  ? rest_init+0xc6/0xc6
[    2.476801]  do_execve+0x21/0x30
[    2.477671]  try_to_run_init_process+0x10/0x34
[    2.478855]  kernel_init+0xe2/0xfa
[    2.479776]  ret_from_fork+0x1f/0x30
[    2.480737] ---[ end trace e77079de9b22dc6a ]---

I just dropped the ext4 conversion from my local tree so I can boot
the machine and test XFS. Might have some more info when that crashes
and burns...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
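For context on the oops above: the `list_add double add` warning fires when lib/list_debug.c's `__list_add_valid()` sees an entry being inserted between neighbours that already point at it. A minimal userspace model of that check (simplified; the kernel version also validates poison values and WARNs rather than returning):

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Returns 1 if inserting @new between @prev and @next is sane,
 * 0 for corruption or the double-add case the kernel warns about. */
static int list_add_valid(struct list_head *new,
                          struct list_head *prev, struct list_head *next)
{
    if (next->prev != prev || prev->next != next)
        return 0;               /* neighbours disagree: corruption */
    if (new == prev || new == next)
        return 0;               /* entry is already on the list here */
    return 1;
}

/* list_add() with the debug check in front. */
static int list_add_checked(struct list_head *new, struct list_head *head)
{
    if (!list_add_valid(new, head, head->next))
        return 0;
    new->next = head->next;
    new->prev = head;
    head->next->prev = new;
    head->next = new;
    return 1;
}
```

In the trace above, `new == next`, which is exactly the signature of the same page being added to an LRU pagevec list twice.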
Re: [Cluster-devel] [PATCH v6 02/19] mm: Ignore return value of ->readpages
On Mon, Feb 17, 2020 at 10:45:43AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> We used to assign the return value to a variable, which we then ignored.
> Remove the pretence of caring.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> Reviewed-by: Christoph Hellwig
> ---
>  mm/readahead.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)

Simple enough.

Reviewed-by: Dave Chinner

--
Dave Chinner
da...@fromorbit.com