Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-05-25 Thread Dave Chinner
On Fri, May 24, 2024 at 09:55:48AM +0200, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 02:47, John Groves  wrote:
> 
> > Apologies, but I'm short on time at the moment - going into a long holiday
> > weekend in the US with family plans. I should be focused again by middle of
> > next week.
> 
> NP.
> 
> Obviously I'll need to test it before anything is merged, other than
> that this is not urgent at all...
> 
> > But can you check /proc/cmdline to see if the memmap arg got through without
> > getting mangled? The '$' tends to get fubar'd. You might need \$, or I've 
> > seen
> > the need for \\\$. If it's un-mangled, there should be a dax device.
> 
> /proc/cmdline shows the option correctly:
> 
> root@kvm:~# cat /proc/cmdline
> root=/dev/vda console=hvc0 memmap=4G$4G
> 
> > If that doesn't work, it's worth trying '!' instead, which I think would 
> > give
> > you a pmem device - if the arg gets through (but ! is less likely to get
> > horked). That pmem device can be converted to devdax...
> 
> That doesn't work either.  No device created in /dev  (dax or pmem).

I think you need to do some ndctl magic to get the memory to be
namespaced correctly for the correct devices to appear.

https://docs.pmem.io/ndctl-user-guide/managing-namespaces

IIRC, you need to set the type to pmem and the mode to fsdax, devdax or
raw to get the relevant device nodes created for the range...
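
Something like this is what I'm thinking of - the subcommands are from
memory, so treat it as a rough sketch and check the ndctl docs for the
exact invocation on your setup:

  # see what regions/namespaces the memmap= range produced
  ndctl list -RN

  # carve a namespace out of the region in fsdax mode -> /dev/pmemX
  ndctl create-namespace --mode=fsdax

  # or in devdax mode -> /dev/daxX.Y
  ndctl create-namespace --mode=devdax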

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures

2024-01-31 Thread Dave Chinner
On Wed, Jan 31, 2024 at 09:58:21AM -0500, Mathieu Desnoyers wrote:
> On 2024-01-30 21:48, Dave Chinner wrote:
> > On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote:
> > > Introduce a generic way to query whether the dcache is virtually aliased
> > > on all architectures. Its purpose is to ensure that subsystems which
> > > are incompatible with virtually aliased data caches (e.g. FS_DAX) can
> > > reliably query this.
> > > 
> > > For dcache aliasing, there are three scenarios depending on the
> > > architecture. Here is a breakdown based on my understanding:
> > > 
> > > A) The dcache is always aliasing:
> > > 
> > > * arc
> > > * csky
> > > * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing 
> > > there.)
> > > * sh
> > > * parisc
> > 
> > /me wonders why the dentry cache aliasing has problems on these
> > systems.
> > 
> > Oh, dcache != fs/dcache.c (the VFS dentry cache).
> > 
> > Can you please rename this function appropriately so us dumb
> > filesystem people don't confuse cpu data cache configurations with
> > the VFS dentry cache aliasing when we read this code? Something like
> > cpu_dcache_is_aliased(), perhaps?
> 
> Good point, will do. I'm planning to rename as follows for v3 to
> eliminate confusion with dentry cache (and with "page cache" in
> general):
> 
> ARCH_HAS_CACHE_ALIASING -> ARCH_HAS_CPU_CACHE_ALIASING
> dcache_is_aliasing() -> cpu_dcache_is_aliasing()
> 
> I noticed that you suggested "aliased" rather than "aliasing",
> but I followed what arm64 did for icache_is_aliasing(). Do you
> have a strong preference one way or another ?

Not really.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures

2024-01-30 Thread Dave Chinner
On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote:
> commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> prevents DAX from building on architectures with virtually aliased
> dcache with:
> 
>   depends on !(ARM || MIPS || SPARC)
> 
> This check is too broad (e.g. recent ARMv7 don't have virtually aliased
> dcaches), and also misses many other architectures with virtually
> aliased dcache.
> 
> This is a regression introduced in the v5.13 Linux kernel where the
> dax mount option is removed for 32-bit ARMv7 boards which have no dcache
> aliasing, and therefore should work fine with FS_DAX.
> 
> This was turned into the following implementation of dax_is_supported()
> by a preparatory change:
> 
> return !IS_ENABLED(CONFIG_ARM) &&
>!IS_ENABLED(CONFIG_MIPS) &&
>!IS_ENABLED(CONFIG_SPARC);
> 
> Use dcache_is_aliasing() instead to figure out whether the environment
> has aliasing dcaches.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> ---
>  include/linux/dax.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index cfc8cd4a3eae..f59e604662e4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  typedef unsigned long dax_entry_t;
>  
> @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct 
> vm_area_struct *vma,
>  }
>  static inline bool dax_is_supported(void)
>  {
> - return !IS_ENABLED(CONFIG_ARM) &&
> -!IS_ENABLED(CONFIG_MIPS) &&
> -!IS_ENABLED(CONFIG_SPARC);
> + return !dcache_is_aliasing();

Yeah, if this is just a one liner should go into
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the
start of the function.

I also noticed that device mapper uses fs_dax_get_by_bdev() to
determine if it can support DAX, but this patch set does not address
that case. Hence it really seems to me like fs_dax_get_by_bdev() is
the right place to put this check.
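
i.e. a rough sketch of what I mean - the surrounding code is quoted
from memory and the helper name is whatever this series ends up
calling it, so don't take this as a tested patch:

struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev,
		u64 *start_off, void *holder,
		const struct dax_holder_operations *ops)
{
	struct dax_device *dax_dev;

	/* DAX is never usable on virtually aliased data caches. */
	if (cpu_dcache_is_aliasing())
		return NULL;

	if (!blk_queue_dax(bdev->bd_disk->queue))
		return NULL;

	/* ... rest of the function unchanged ... */
}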

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures

2024-01-30 Thread Dave Chinner
On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote:
> Introduce a generic way to query whether the dcache is virtually aliased
> on all architectures. Its purpose is to ensure that subsystems which
> are incompatible with virtually aliased data caches (e.g. FS_DAX) can
> reliably query this.
> 
> For dcache aliasing, there are three scenarios depending on the
> architecture. Here is a breakdown based on my understanding:
> 
> A) The dcache is always aliasing:
> 
> * arc
> * csky
> * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing 
> there.)
> * sh
> * parisc

/me wonders why the dentry cache aliasing has problems on these
systems.

Oh, dcache != fs/dcache.c (the VFS dentry cache).

Can you please rename this function appropriately so us dumb
filesystem people don't confuse cpu data cache configurations with
the VFS dentry cache aliasing when we read this code? Something like
cpu_dcache_is_aliased(), perhaps?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH v2 1/8] dax: Introduce dax_is_supported()

2024-01-30 Thread Dave Chinner
On Tue, Jan 30, 2024 at 11:52:48AM -0500, Mathieu Desnoyers wrote:
> Introduce a new dax_is_supported() static inline to check whether the
> architecture supports DAX.
> 
> This replaces the following fs/Kconfig:FS_DAX dependency:
> 
>   depends on !(ARM || MIPS || SPARC)
> 
> This is done in preparation for its use by each filesystem supporting
> the dax mount option to validate whether dax is indeed supported.
> 
> This is done in preparation for using dcache_is_aliasing() in a
> following change which will properly support architectures which detect
> dcache aliasing at runtime.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> ---
>  fs/Kconfig  |  1 -
>  include/linux/dax.h | 10 ++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 42837617a55b..e5efdb3b276b 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -56,7 +56,6 @@ endif # BLOCK
>  config FS_DAX
>   bool "File system based Direct Access (DAX) support"
>   depends on MMU
> - depends on !(ARM || MIPS || SPARC)
>   depends on ZONE_DEVICE || FS_DAX_LIMITED
>   select FS_IOMAP
>   select DAX
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b463502b16e1..cfc8cd4a3eae 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct 
> vm_area_struct *vma,
>   return false;
>   return dax_synchronous(dax_dev);
>  }
> +static inline bool dax_is_supported(void)
> +{
> + return !IS_ENABLED(CONFIG_ARM) &&
> +!IS_ENABLED(CONFIG_MIPS) &&
> +!IS_ENABLED(CONFIG_SPARC);
> +}

Uh, ok. Now I see what dax_is_supported() does.

I think this should be folded into fs_dax_get_by_bdev(), which
currently returns NULL if CONFIG_FS_DAX=n and so should be changed
to return NULL if any of these platform configs is enabled.

Then I don't think you need to change a single line of filesystem
code - they'll all just do what they do now if the block device
doesn't support DAX

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()

2024-01-29 Thread Dave Chinner
On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote:
> Use dax_is_supported() to validate whether the architecture has
> virtually aliased caches at mount time.
> 
> This is relevant for architectures which require a dynamic check
> to validate whether they have virtually aliased data caches
> (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Where's the rest of this patchset? I have no idea what
dax_is_supported() actually does, how it interacts with
CONFIG_FS_DAX, etc.

If you are changing anything to do with FSDAX, then cc-ing the
-entire- patchset to linux-fsdevel is absolutely necessary so the
entire patchset lands in our inboxes and not just a random patch
from the middle of a bigger change.

> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Chandan Babu R 
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> ---
>  fs/xfs/xfs_super.c | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 764304595e8b..b27ecb11db66 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1376,14 +1376,22 @@ xfs_fs_parse_param(
>   case Opt_nodiscard:
>   parsing_mp->m_features &= ~XFS_FEAT_DISCARD;
>   return 0;
> -#ifdef CONFIG_FS_DAX
>   case Opt_dax:
> - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
> - return 0;
> + if (dax_is_supported()) {
> + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
> + return 0;
> + } else {
> + xfs_warn(parsing_mp, "dax option not supported.");
> + return -EINVAL;
> + }
>   case Opt_dax_enum:
> - xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
> - return 0;
> -#endif
> + if (dax_is_supported()) {
> + xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
> + return 0;
> + } else {
> + xfs_warn(parsing_mp, "dax option not supported.");
> + return -EINVAL;
> + }

Assuming that I understand what dax_is_supported() is doing, this
change isn't right.  We're just setting the DAX configuration flags
from the mount options here; we don't validate them until we've
parsed all the options and rejected conflicting ones. We validate
whether the options are appropriate for the underlying hardware
configuration later in the mount process.

dax=always suitability is checked in xfs_setup_dax_always(), called
later in the mount process when we have enough context and support
to open storage devices and check them for DAX support. If the
hardware does not support DAX then we simply turn off DAX
support; we do not reject the mount as this change does.

dax=inode and dax=never are valid options on all configurations,
even those without FSDAX support or with hardware that is not
capable of using DAX. dax=inode only affects how an inode is
instantiated in cache - if the inode has a flag that says "use DAX"
and dax is supportable by the hardware, then we turn on DAX for
that inode. Otherwise we just use the normal non-dax IO paths.

Again, we don't error out the filesystem if DAX is not supported,
we just don't turn it on. This check is done in
xfs_inode_should_enable_dax() and I think all you need to do is
replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported()
call...
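
i.e. something like this - a sketch written from memory of the
current shape of xfs_inode_should_enable_dax(), so the surrounding
checks may not exactly match what's in your tree:

static bool
xfs_inode_should_enable_dax(
	struct xfs_inode *ip)
{
	if (!dax_is_supported())	/* was IS_ENABLED(CONFIG_FS_DAX) */
		return false;
	if (xfs_has_dax_never(ip->i_mount))
		return false;
	if (!xfs_inode_supports_dax(ip))
		return false;
	if (xfs_has_dax_always(ip->i_mount))
		return true;
	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
		return true;
	return false;
}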

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v9 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2023-02-05 Thread Dave Chinner
On Sat, Feb 04, 2023 at 02:58:38PM +, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.

.

> @@ -182,12 +188,24 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_super->s_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + /* invalidate_inode_pages2() invalidates dax mapping */
> + super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
> + up_write(&mp->m_super->s_umount);

I really don't like this.

super_drop_pagecache() doesn't guarantee that everything is removed
from cache. It is racy - it doesn't touch inodes being freed or
being instantiated, nor does it prevent concurrent accesses to the
inodes from re-instantiating page cache pages and dirtying them
after the inode has been scanned by super_drop_pagecache().

If we are about to remove the block device and we want to guarantee
that the filesystem is cleaned and stable before the device gets
yanked out from under running applications, then we have to
guarantee that we stall the running applications trying to modify
the filesystem between the MF_MEM_PRE_REMOVE and the actual removal
event that then shuts down the filesystem. Invalidating the page
cache is not enough to guarantee this.

Keep in mind that we're going to walk the rmap after writing the
data to kill any processes that have mmap()d files in the filesystem
after we've dropped the page cache - the page cache invalidation
doesn't change this at all - and this will kill any active userspace
DAX mappings before the device is unplugged. So I don't actually see
how walking the page cache to invalidate it here actually helps
"invalidate dax mapping" reliably as new write page faults on dax
VMAs can still occur between super_drop_pagecache() and the rmap
walk triggering kills on processes with DAX mapped VMAs.

We also don't care if read-only operations race with device unplug -
they are going to get EIO the moment the device is actually
unplugged or the filesystem is shutdown anyway, so it doesn't matter
if reads race with the device remove event.  Hence all we really
care about here is not dirtying the filesystem after we've started
processing the MF_MEM_PRE_REMOVE event.

Realistically, I think we need to freeze the filesystem here to
prevent racing modifications occurring during the rmap + VMA walk +
proc kill. That could be write() IO dirtying new data or other
transactions running dirtying the journal/metadata. Both
sync_filesystem() and super_drop_pagecache() operate on current
state - they don't prevent future dax mapping instantiation or
dirtying from happening on the device, so they don't prevent this...
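
Conceptually I'm thinking of something like this in the PRE_REMOVE
path - a hand-waving sketch, not a tested patch, and the interaction
with the later rmap walk and the eventual unfreeze/shutdown needs
real thought:

	if (mf_flags & MF_MEM_PRE_REMOVE) {
		xfs_info(mp, "device is about to be removed!");
		/*
		 * Freeze the filesystem so no new write()s, write page
		 * faults or transactions can dirty it while we walk the
		 * rmap and kill processes with DAX mappings. Freezing
		 * also writes back and stabilises dirty data and the log.
		 */
		error = freeze_super(mp->m_super);
		if (error)
			return error;
		/* rmap walk + mf_dax_kill_procs() runs with the fs frozen */
	}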

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-27 Thread Dave Chinner
On Tue, Sep 27, 2022 at 09:02:48AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:
> > 
> > 
> > 在 2022/9/20 5:15, Dave Chinner 写道:
> > > On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> > > > On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > > > > Since reflink can work together now, the last obstacle has been
> > > > > resolved.  It's time to remove restrictions and drop this warning.
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan 
> > > > 
> > > > I haven't looked at reflink+DAX for some time, and I haven't tested
> > > > it for even longer. So I'm currently running a v6.0-rc6 kernel with
> > > > "-o dax=always" fstests run with reflink enabled and it's not
> > > > looking very promising.
> > > > 
> > > > All of the fsx tests are failing with data corruption, several
> > > > reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
> > > > *lots* of tests are leaving stack traces from WARN() conditions in
> > > > DAX operations such as dax_insert_entry(), dax_disassociate_entry(),
> > > > dax_writeback_mapping_range(), iomap_iter() (called from
> > > > dax_dedupe_file_range_compare()), and so on.
> > > > 
> > > > At this point - the tests are still running - I'd guess that there's
> > > > going to be at least 50 test failures by the time it completes -
> > > > in comparison using "-o dax=never" results in just a single test
> > > > failure and a lot more tests actually being run.
> > > 
> > > The end results with dax+reflink were:
> > > 
> > > SECTION   -- xfs_dax
> > > =
> > > 
> > > Failures: generic/051 generic/068 generic/074 generic/075
> > > generic/083 generic/091 generic/112 generic/127 generic/164
> > > generic/165 generic/175 generic/231 generic/232 generic/247
> > > generic/269 generic/270 generic/327 generic/340 generic/388
> > > generic/390 generic/413 generic/447 generic/461 generic/471
> > > generic/476 generic/517 generic/519 generic/560 generic/561
> > > generic/605 generic/617 generic/619 generic/630 generic/649
> > > generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
> > > xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
> > > xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
> > > xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
> > > Failed 61 of 1071 tests
> > > 
> > > Ok, so I did a new no-reflink run as a baseline, because it has been a
> > > while since I've tested DAX at all:
> > > 
> > > SECTION   -- xfs_dax_noreflink
> > > =
> > > Failures: generic/051 generic/068 generic/074 generic/075
> > > generic/083 generic/112 generic/231 generic/232 generic/269
> > > generic/270 generic/340 generic/388 generic/461 generic/471
> > > generic/476 generic/519 generic/560 generic/561 generic/617
> > > generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
> > > xfs/305 xfs/517 xfs/538
> > > Failed 29 of 1071 tests
> > > 
> > > Yeah, there's still lots of warnings from dax_insert_entry() and
> > > friends like:
> > > 
> > > [43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
> > > dax_insert_entry+0x2ab/0x320
> > > [43262.028355] Modules linked in:
> > > [43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
> > > 6.0.0-rc6-dgc+ #1543
> > > [43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS 1.15.0-1 04/01/2014
> > > [43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
> > > [43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 
> > > 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff 
> > > ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
> > > [43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
> > > [43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 
> > > 0001
> > > [43262.046233] RDX: ea00 RSI: 0221 RDI: 
> > > ea0018cd2000
> > > [43262.048518] RBP: 0011 R08:  R09: 
> > > 
> > > [43262.050762] R10: 888241a6d318 R11: 0001 R12: 
> > > c9000a0cbc58
> > > 

Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-19 Thread Dave Chinner
On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 23 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 3830f908e215..5e04ba7fa403 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  struct xfs_failure_info {
>   xfs_agblock_t   startblock;
> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* The device is about to be removed.  Not a really failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   notify->want_shutdown = true;
>   return 0;
>   }
> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_super->s_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + drop_pagecache_sb(mp->m_super, NULL);
> + up_write(&mp->m_super->s_umount);
> + if (error)
> + return error;

If the device is about to go away unexpectedly, shouldn't this shut
down the filesystem after syncing it here?  If the filesystem has
been shut down, then everything will fail before removal finally
triggers, and the act of unmounting the filesystem post device
removal will clean up the page cache and all the other caches.

IOWs, I don't understand why the page cache is considered special
here (as opposed to, say, the inode or dentry caches), nor why we
aren't shutting down the filesystem directly after syncing it to
disk to ensure that we don't end up with applications losing data as
a result of racing with the removal
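
i.e. something more along the lines of this - sketch only, the
shutdown flag choice and the error handling are illustrative:

	if (mf_flags & MF_MEM_PRE_REMOVE) {
		xfs_info(mp, "device is about to be removed!");
		down_write(&mp->m_super->s_umount);
		error = sync_filesystem(mp->m_super);
		up_write(&mp->m_super->s_umount);
		if (error)
			return error;
		/*
		 * Everything is now on stable storage; shut the filesystem
		 * down so nothing can dirty it between now and the actual
		 * removal, and let unmount tear down the caches.
		 */
		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
		return 0;
	}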

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-19 Thread Dave Chinner
On Tue, Sep 20, 2022 at 09:17:07AM +0800, Shiyang Ruan wrote:
> Hi Dave,
> 
> 在 2022/9/20 5:15, Dave Chinner 写道:
> > On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > > > Since reflink can work together now, the last obstacle has been
> > > > resolved.  It's time to remove restrictions and drop this warning.
> > > > 
> > > > Signed-off-by: Shiyang Ruan 
> > > 
> > > I haven't looked at reflink+DAX for some time, and I haven't tested
> > > it for even longer. So I'm currently running a v6.0-rc6 kernel with
> > > "-o dax=always" fstests run with reflink enabled and it's not
> > > looking very promising.
> > > 
> > > All of the fsx tests are failing with data corruption, several
> > > reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
> > > *lots* of tests are leaving stack traces from WARN() conditions in
> > > DAX operations such as dax_insert_entry(), dax_disassociate_entry(),
> > > dax_writeback_mapping_range(), iomap_iter() (called from
> > > dax_dedupe_file_range_compare()), and so on.
> > > 
> > > At this point - the tests are still running - I'd guess that there's
> > > going to be at least 50 test failures by the time it completes -
> > > in comparison using "-o dax=never" results in just a single test
> > > failure and a lot more tests actually being run.
> > 
> > The end results with dax+reflink were:
> > 
> > SECTION   -- xfs_dax
> > =
> > 
> > Failures: generic/051 generic/068 generic/074 generic/075
> > generic/083 generic/091 generic/112 generic/127 generic/164
> > generic/165 generic/175 generic/231 generic/232 generic/247
> > generic/269 generic/270 generic/327 generic/340 generic/388
> > generic/390 generic/413 generic/447 generic/461 generic/471
> > generic/476 generic/517 generic/519 generic/560 generic/561
> > generic/605 generic/617 generic/619 generic/630 generic/649
> > generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
> > xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
> > xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
> > xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
> > Failed 61 of 1071 tests
> > 
> > Ok, so I did a new no-reflink run as a baseline, because it has been a
> > while since I've tested DAX at all:
> > 
> > SECTION   -- xfs_dax_noreflink
> > =
> > Failures: generic/051 generic/068 generic/074 generic/075
> > generic/083 generic/112 generic/231 generic/232 generic/269
> > generic/270 generic/340 generic/388 generic/461 generic/471
> > generic/476 generic/519 generic/560 generic/561 generic/617
> > generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
> > xfs/305 xfs/517 xfs/538
> > Failed 29 of 1071 tests
> > 
> > Yeah, there's still lots of warnings from dax_insert_entry() and
> > friends like:
> > 
> > [43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
> > dax_insert_entry+0x2ab/0x320
> > [43262.028355] Modules linked in:
> > [43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
> > 6.0.0-rc6-dgc+ #1543
> > [43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.15.0-1 04/01/2014
> > [43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
> > [43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 
> > 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff 
> > <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
> > [43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
> > [43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 
> > 0001
> > [43262.046233] RDX: ea00 RSI: 0221 RDI: 
> > ea0018cd2000
> > [43262.048518] RBP: 0011 R08:  R09: 
> > 
> > [43262.050762] R10: 888241a6d318 R11: 0001 R12: 
> > c9000a0cbc58
> > [43262.053020] R13: 888241a6d318 R14: c9000a0cbe20 R15: 
> > 
> > [43262.055309] FS:  7f8ce25e2b80() GS:8885fec8() 
> > knlGS:
> > [43262.057859] CS:  0010 DS:  ES:  CR0: 80050033
> > [43262.059713] CR2: 7f8ce25e1000 CR3: 000152141001 CR4: 
> > 00060ee0
> > [43262.061993] Call Trace:
> > [43262.062836]  
> > [43262.063557]

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-19 Thread Dave Chinner
On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > Since reflink can work together now, the last obstacle has been
> > resolved.  It's time to remove restrictions and drop this warning.
> > 
> > Signed-off-by: Shiyang Ruan 
> 
> I haven't looked at reflink+DAX for some time, and I haven't tested
> it for even longer. So I'm currently running a v6.0-rc6 kernel with
> "-o dax=always" fstests run with reflink enabled and it's not
> looking very promising.
> 
> All of the fsx tests are failing with data corruption, several
> reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
> *lots* of tests are leaving stack traces from WARN() conditions in
> DAX operations such as dax_insert_entry(), dax_disassociate_entry(),
> dax_writeback_mapping_range(), iomap_iter() (called from
> dax_dedupe_file_range_compare()), and so on.
> 
> At this point - the tests are still running - I'd guess that there's
> going to be at least 50 test failures by the time it completes -
> in comparison using "-o dax=never" results in just a single test
> failure and a lot more tests actually being run.

The end results with dax+reflink were:

SECTION   -- xfs_dax
=

Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/091 generic/112 generic/127 generic/164
generic/165 generic/175 generic/231 generic/232 generic/247
generic/269 generic/270 generic/327 generic/340 generic/388
generic/390 generic/413 generic/447 generic/461 generic/471
generic/476 generic/517 generic/519 generic/560 generic/561
generic/605 generic/617 generic/619 generic/630 generic/649
generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
Failed 61 of 1071 tests

Ok, so I did a new no-reflink run as a baseline, because it has been a
while since I've tested DAX at all:

SECTION   -- xfs_dax_noreflink
=
Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/112 generic/231 generic/232 generic/269
generic/270 generic/340 generic/388 generic/461 generic/471
generic/476 generic/519 generic/560 generic/561 generic/617
generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
xfs/305 xfs/517 xfs/538
Failed 29 of 1071 tests

Yeah, there's still lots of warnings from dax_insert_entry() and
friends like:

[43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
dax_insert_entry+0x2ab/0x320
[43262.028355] Modules linked in:
[43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
6.0.0-rc6-dgc+ #1543
[43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.15.0-1 04/01/2014
[43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
[43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 
20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 
70 ff ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
[43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
[43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 0001
[43262.046233] RDX: ea00 RSI: 0221 RDI: ea0018cd2000
[43262.048518] RBP: 0011 R08:  R09: 
[43262.050762] R10: 888241a6d318 R11: 0001 R12: c9000a0cbc58
[43262.053020] R13: 888241a6d318 R14: c9000a0cbe20 R15: 
[43262.055309] FS:  7f8ce25e2b80() GS:8885fec8() 
knlGS:
[43262.057859] CS:  0010 DS:  ES:  CR0: 80050033
[43262.059713] CR2: 7f8ce25e1000 CR3: 000152141001 CR4: 00060ee0
[43262.061993] Call Trace:
[43262.062836]  
[43262.063557]  dax_fault_iter+0x243/0x600
[43262.064802]  dax_iomap_pte_fault+0x199/0x360
[43262.066197]  __xfs_filemap_fault+0x1e3/0x2c0
[43262.067602]  __do_fault+0x31/0x1d0
[43262.068719]  __handle_mm_fault+0xd6d/0x1650
[43262.070083]  ? do_mmap+0x348/0x540
[43262.071200]  handle_mm_fault+0x7a/0x1d0
[43262.072449]  ? __kvm_handle_async_pf+0x12/0xb0
[43262.073908]  exc_page_fault+0x1d9/0x810
[43262.075123]  asm_exc_page_fault+0x22/0x30
[43262.076413] RIP: 0033:0x7f8ce268bc23

So it looks to me like DAX is well and truly broken in 6.0-rc6. And,
yes, I'm running the fixes in the mm-hotfixes-stable branch that allow
xfs/550 to pass.

Who is actually testing this DAX code, and what are they actually
testing on? These are not random failures - I haven't run DAX
testing since ~5.18, and none of these failures were present on the
same DAX test VM running the same configuration back then

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-18 Thread Dave Chinner
2d75  0x5
operation# (mod 256) for the bad data may be 117
0x3a006 0x  0x7541  0x6
operation# (mod 256) for the bad data may be 117
0x3a007 0x  0x4175  0x7
operation# (mod 256) for the bad data may be 117
0x3a008 0x  0x7599  0x8
operation# (mod 256) for the bad data may be 117
0x3a009 0x  0x9975  0x9
operation# (mod 256) for the bad data may be 117
0x3a00a 0x  0x7537  0xa
operation# (mod 256) for the bad data may be 117
0x3a00b 0x  0x3775  0xb
operation# (mod 256) for the bad data may be 117
0x3a00c 0x  0x7510  0xc
operation# (mod 256) for the bad data may be 117
0x3a00d 0x  0x1075  0xd
operation# (mod 256) for the bad data may be 117
0x3a00e 0x  0x75ba  0xe
operation# (mod 256) for the bad data may be 117
0x3a00f 0x  0xba75  0xf
operation# (mod 256) for the bad data may be 117
LOG DUMP (122 total operations):
1(  1 mod 256): WRITE0x39381 thru 0x3   (0x6c7f bytes) HOLE ***
2(  2 mod 256): DEDUPE 0x16000 thru 0x24fff (0xf000 bytes) to 0x3000 thru 
0x11fff
3(  3 mod 256): DEDUPE 0x29000 thru 0x29fff (0x1000 bytes) to 0x19000 thru 
0x19fff
4(  4 mod 256): COPY 0x17cf1 thru 0x203e2   (0x86f2 bytes) to 0x8c59 thru 
0x1134a
5(  5 mod 256): PUNCH0x2f78d thru 0x33379   (0x3bed bytes)
6(  6 mod 256): TRUNCATE DOWN   from 0x4 to 0x28b68 **
7(  7 mod 256): COLLAPSE 0x14000 thru 0x14fff   (0x1000 bytes)
8(  8 mod 256): TRUNCATE UP from 0x27b68 to 0x3a9c4 **
9(  9 mod 256): READ 0x9cb7 thru 0x19799(0xfae3 bytes)
10( 10 mod 256): PUNCH0x1b3a8 thru 0x1dff8  (0x2c51 bytes)


-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH] xfs: on memory failure, only shut down fs after scanning all mappings

2022-08-21 Thread Dave Chinner
On Thu, Aug 18, 2022 at 10:00:17AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> xfs_dax_failure_fn is used to scan the filesystem during a memory
> failure event to look for memory mappings to revoke.  Unfortunately, if
> it encounters an rmap record for filesystem metadata, it will shut down
> the filesystem and the scan immediately.  This means that we don't
> complete the mapping revocation scan and instead leave live mappings to
> failed memory.  Fix the function to defer the shutdown until after we've
> finished culling mappings.
> 
> While we're at it, add the usual "xfs_" prefix to struct failure_info,
> and actually initialize mf_flags.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-11 Thread Dave Chinner
On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"  
> wrote:
> 
> > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams 
> > >  wrote:
> > > 
> > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > really matters where as long as it's merged into the xfs for-next
> > > > > tree so it gets filesystem test coverage...
> > > > 
> > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> > > > baseline to build from?
> > > 
> > > What are we referring to here?  I think a minimal thing would be the
> > > memremap.h and memory-failure.c changes from
> > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com
> > >  ?
> > > 
> > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > would probably be straining things to slip it into 5.19.
> > > 
> > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > right thing, but it's a networking errno.  I suppose livable with if it
> > > never escapes the kernel, but if it can get back to userspace then a
> > > user would be justified in wondering how the heck a filesystem
> > > operation generated a networking errno?
> > 
> >  most filesystems return EOPNOTSUPP rather enthusiastically when
> > they don't know how to do something...
> 
> Can it propagate back to userspace?

Maybe not this one, but the point Darrick is making is that we
really don't care if it does because we've been propagating it to
userspace in documented filesystem APIs for at least 15 years now.

e.g:

$ man 2 fallocate
.
Errors
.
   EOPNOTSUPP
 The filesystem containing the file referred to by fd
 does not support this operation; or the mode is not
 supported by the filesystem containing the file
 referred to by fd.
.

Other random examples:

pwritev2(RWF_NOWAIT) can return -EOPNOTSUPP on buffered writes.
Documented in the man page.

FICLONERANGE on an filesystem that doesn't support reflink will
return -EOPNOTSUPP. Documented in the man page.

mmap(MAP_SYNC) returns -EOPNOTSUPP if the underlying filesystem
and/or storage doesn't support DAX. Documented in the man page.

I could go on, but I think I've made the point already...
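
For the mmap(MAP_SYNC) case, for example, a userspace probe looks
something like this (illustrative sketch only):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>	/* older glibc may need <linux/mman.h> for MAP_SYNC */

/* Returns 1 if MAP_SYNC works on this fd, 0 on EOPNOTSUPP, -errno otherwise. */
static int supports_map_sync(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	if (p == MAP_FAILED)
		return errno == EOPNOTSUPP ? 0 : -errno;
	munmap(p, len);
	return 1;
}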

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Dave Chinner
On Tue, May 10, 2022 at 06:55:50PM -0700, Dan Williams wrote:
> [ add Andrew ]
> 
> 
> On Tue, May 10, 2022 at 6:49 PM Dave Chinner  wrote:
> >
> > On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> > > On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> > > > This is a combination of two patchsets:
> > > >  1.fsdax-rmap: 
> > > > https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
> > > >  2.fsdax-reflink: 
> > > > https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/
> > > >
> > > >  Changes since v13 of fsdax-rmap:
> > > >   1. Fixed mistakes during rebasing code to latest next-
> > > >   2. Rebased to next-20220504
> > > >
> > > >  Changes since v10 of fsdax-reflink:
> > > >   1. Rebased to next-20220504 and fsdax-rmap
> > > >   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
> > > >   iter model'
> > > >   3. Fixed many conflicts during rebasing
> > > >   4. Fixed a dedupe bug in Patch 05: the actual length to compare 
> > > > could be
> > > >   shorter than smap->length or dmap->length.
> > > >   PS: There are many changes during rebasing.  I think it's better to
> > > >   review again.
> > > >
> > > > ==
> > > > Shiyang Ruan (14):
> > > >   fsdax-rmap:
> > > > dax: Introduce holder for dax_device
> > > > mm: factor helpers for memory_failure_dev_pagemap
> > > > pagemap,pmem: Introduce ->memory_failure()
> > > > fsdax: Introduce dax_lock_mapping_entry()
> > > > mm: Introduce mf_dax_kill_procs() for fsdax case
> > >
> > > Hmm.  This patchset touches at least the dax, pagecache, and xfs
> > > subsystems.  Assuming it's too late for 5.19, how should we stage this
> > > for 5.20?
> >
> > Yeah, it's past my "last date for this merge cycle" which was
> > -rc6. I expected stuff might slip a little - as it has with the LARP
> > code - but I don't have the time and bandwidth to start working
> > on merging another feature from scratch before the merge window
> > comes around.
> >
> > Getting the dax+reflink stuff in this cycle was always an optimistic
> > stretch, but I wanted to try so that there was no doubt it would be
> > ready for merge in the next cycle...
> >
> > > I could just add the entire series to iomap-5.20-merge and base the
> > > xfs-5.20-merge off of that?  But I'm not sure what else might be landing
> > > in the other subsystems, so I'm open to input.
> >
> > It'll need to be a stable branch somewhere, but I don't think it
> > really matters where as long as it's merged into the xfs for-next
> > tree so it gets filesystem test coverage...
> 
> So how about let the notify_failure() bits go through -mm this cycle,
> if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> baseline to build from?

Sure, if you want to push them that way I'm not going to complain
or stop you. :)

Anything that makes the eventual XFS feature merge simpler counts as
a win in my books.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Dave Chinner
On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> > This is a combination of two patchsets:
> >  1.fsdax-rmap: 
> > https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
> >  2.fsdax-reflink: 
> > https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/
> > 
> >  Changes since v13 of fsdax-rmap:
> >   1. Fixed mistakes during rebasing code to latest next-
> >   2. Rebased to next-20220504
> > 
> >  Changes since v10 of fsdax-reflink:
> >   1. Rebased to next-20220504 and fsdax-rmap
> >   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
> >   iter model'
> >   3. Fixed many conflicts during rebasing
> >   4. Fixed a dedupe bug in Patch 05: the actual length to compare could be
> >   shorter than smap->length or dmap->length.
> >   PS: There are many changes during rebasing.  I think it's better to
> >   review again.
> > 
> > ==
> > Shiyang Ruan (14):
> >   fsdax-rmap:
> > dax: Introduce holder for dax_device
> > mm: factor helpers for memory_failure_dev_pagemap
> > pagemap,pmem: Introduce ->memory_failure()
> > fsdax: Introduce dax_lock_mapping_entry()
> > mm: Introduce mf_dax_kill_procs() for fsdax case
> 
> Hmm.  This patchset touches at least the dax, pagecache, and xfs
> subsystems.  Assuming it's too late for 5.19, how should we stage this
> for 5.20?

Yeah, it's past my "last date for this merge cycle" which was
-rc6. I expected stuff might slip a little - as it has with the LARP
code - but I don't have the time and bandwidth to start working
on merging another feature from scratch before the merge window
comes around.

Getting the dax+reflink stuff in this cycle was always an optimistic
stretch, but I wanted to try so that there was no doubt it would be
ready for merge in the next cycle...

> I could just add the entire series to iomap-5.20-merge and base the
> xfs-5.20-merge off of that?  But I'm not sure what else might be landing
> in the other subsystems, so I'm open to input.

It'll need to be a stable branch somewhere, but I don't think it
really matters where as long as it's merged into the xfs for-next
tree so it gets filesystem test coverage...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-22 Thread Dave Chinner
On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner  wrote:
> >
> > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > > a single release. However, being unable to cleanly merge code we
> > > > need integrated into our local subsystem tree for integration
> > > > testing because a patch dependency with another subsystem won't gain
> > > > a stable commit ID until the next merge window is  distinctly
> > > > suboptimal.
> > >
> > > Yes.  Which is why we've taken a lot of mm patches through other trees,
> > > sometimes specially crafted for that.  So I guess in this case we'll
> > > just need to take non-trivial dependencies into the XFS tree, and just
> > > deal with small merge conflicts for the trivial ones.
> >
> > OK. As Naoya has pointed out, the first dependency/conflict Ruan has
> > listed looks trivial to resolve.
> >
> > The second dependency, OTOH, is on a new function added in the patch
> > pointed to. That said, at first glance it looks to be independent of
> > the first two patches in that series so I might just be able to pull
> > that one patch in and have that leave us with a working
> > fsdax+reflink tree.
> >
> > Regardless, I'll wait to see how much work the updated XFS/DAX
> > reflink enablement patchset still requires when Ruan posts it before
> > deciding what to do here.  If it isn't going to be a merge
> > candidate, what to do with this patchset is moot because there's
> > little to test without reflink enabled...
> 
> I do have a use case for this work absent the reflink work.  Recall we
> had a conversation about how to communicate "dax-device has been
> ripped away from the fs" events and we ended up on the idea of reusing
> ->notify_failure(), but with the device's entire logical address range
> as the notification span. That will let me unwind and delete the
> PTE_DEVMAP infrastructure for taking extra device references to hold
> off device-removal. Instead ->notify_failure() arranges for all active
> DAX mappings to be invalidated and allow the removal to proceed
> especially since physical removal does not care about software pins.

Sure. My point is that if the reflink enablement isn't ready to go,
then from an XFS POV none of this matters in this cycle and we can
just leave the dependencies to commit via Andrew's tree. Hence by
the time we get to the reflink enablement all the prior dependencies
will have been merged and have stable commit IDs, and we can just
stage this series and the reflink enablement as we normally would in
the next cycle.

However, if we don't get the XFS reflink dax enablement sorted out
in the next week or two, then we don't need this patchset in this
cycle. Hence if you still need this patchset for other code you need
to merge in this cycle, then you're the poor schmuck that has to run
the mm-tree conflict gauntlet to get a stable commit ID for the
dependent patches in this cycle, not me

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-21 Thread Dave Chinner
On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > a single release. However, being unable to cleanly merge code we
> > need integrated into our local subsystem tree for integration
> > testing because a patch dependency with another subsystem won't gain
> > a stable commit ID until the next merge window is  distinctly
> > suboptimal.
> 
> Yes.  Which is why we've taken a lot of mm patches through other trees,
> sometimes specially crafted for that.  So I guess in this case we'll
> just need to take non-trivial dependencies into the XFS tree, and just
> deal with small merge conflicts for the trivial ones.

OK. As Naoya has pointed out, the first dependency/conflict Ruan has
listed looks trivial to resolve.

The second dependency, OTOH, is on a new function added in the patch
pointed to. That said, at first glance it looks to be independent of
the first two patches in that series so I might just be able to pull
that one patch in and have that leave us with a working
fsdax+reflink tree.

Regardless, I'll wait to see how much work the updated XFS/DAX
reflink enablement patchset still requires when Ruan posts it before
deciding what to do here.  If it isn't going to be a merge
candidate, what to do with this patchset is moot because there's
little to test without reflink enabled...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-20 Thread Dave Chinner
On Wed, Apr 20, 2022 at 07:20:07PM -0700, Dan Williams wrote:
> [ add Andrew and Naoya ]
> 
> On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan  wrote:
> >
> > Hi Dave,
> >
> > 在 2022/4/21 9:20, Dave Chinner 写道:
> > > Hi Ruan,
> > >
> > > On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> > >> This patchset is aimed to support shared pages tracking for fsdax.
> > >
> > > Now that this is largely reviewed, it's time to work out the
> > > logistics of merging it.
> >
> > Thanks!
> >
> > >
> > >> Changes since V12:
> > >>- Rebased onto next-20220414
> > >
> > > What does this depend on that is in the linux-next kernel?
> > >
> > > i.e. can this be applied successfully to a v5.18-rc2 kernel without
> > > needing to drag in any other patchsets/commits/trees?
> >
> > Firstly, I tried to apply to v5.18-rc2 but it failed.
> >
> > There are some changes in memory-failure.c, which besides my Patch-02
> >"mm/hwpoison: fix race between hugetlb free/demotion and
> > memory_failure_hugetlb()"
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061
> >
> > Then, why it is on linux-next is: I was told[1] there is a better fix
> > about "pgoff_address()" in linux-next:
> >"mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
> > so I rebased my patches to it and dropped one of mine.
> >
> > [1] https://lore.kernel.org/linux-xfs/ykpuoogd139wp...@infradead.org/
> 
> From my perspective, once something has -mm dependencies it needs to
> go through Andrew's tree, and if it's going through Andrew's tree I
> think that means the reflink side of this needs to wait a cycle as
> there is no stable point that the XFS tree could merge to build on top
> of.

Ngggh. Still? Really?

Sure, I'm not a maintainer and just the stand-in patch shepherd for
a single release. However, being unable to cleanly merge code we
need integrated into our local subsystem tree for integration
testing because a patch dependency with another subsystem won't gain
a stable commit ID until the next merge window is  distinctly
suboptimal.

We know how to do this cleanly, quickly and efficiently - we've been
doing cross-subsystem shared git branch co-ordination for
VFS/fs/block stuff when needed for many, many years. It's pretty
easy to do, just requires clear communication to decide where the
source branch will be kept. It doesn't even matter what order Linus
then merges the trees - they are self contained and git sorts out
the duplicated commits without an issue.

I mean, we've been using git for *17 years* now - this stuff should
be second nature to maintainers by now. So how is it still
considered acceptable for a core kernel subsystem not to have the
ability to provide other subsystems with stable commits/branches
so we can cleanly develop cross-subsystem functionality quickly and
efficiently?

> The last reviewed-by this wants before going through there is Naoya's
> on the memory-failure.c changes.

Naoya? 

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-20 Thread Dave Chinner
Hi Ruan,

On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> This patchset is aimed to support shared pages tracking for fsdax.

Now that this is largely reviewed, it's time to work out the
logistics of merging it.

> Changes since V12:
>   - Rebased onto next-20220414

What does this depend on that is in the linux-next kernel?

i.e. can this be applied successfully to a v5.18-rc2 kernel without
needing to drag in any other patchsets/commits/trees?

What are your plans for the followup patches that enable
reflink+fsdax in XFS? AFAICT that patchset hasn't been posted for
while so I don't know what it's status is. Is that patchset anywhere
near ready for merge in this cycle?

If that patchset is not a candidate for this cycle, then it largely
doesn't matter what tree this is merged through as there shouldn't
be any major XFS or dax dependencies being built on top of it during
this cycle. The filesystem side changes are isolated and won't
conflict with other work in XFS, either, so this could easily go
through Dan's tree.

However, if the reflink enablement is ready to go, then this all
needs to be in the XFS tree so that we can run it through filesystem
level DAX+reflink testing. That will mean we need this in a stable
shared topic branch and tighter co-ordination between the trees.

So before we go any further we need to know if the dax+reflink
enablement patchset is near being ready to merge

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v12 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-13 Thread Dave Chinner
On Tue, Apr 12, 2022 at 07:06:40PM -0700, Dan Williams wrote:
> On Tue, Apr 12, 2022 at 5:04 PM Dave Chinner  wrote:
> > On Mon, Apr 11, 2022 at 12:09:03AM +0800, Shiyang Ruan wrote:
> > > Introduce xfs_notify_failure.c to handle failure related works, such as
> > > implement ->notify_failure(), register/unregister dax holder in xfs, and
> > > so on.
> > >
> > > If the rmap feature of XFS enabled, we can query it to find files and
> > > metadata which are associated with the corrupt data.  For now all we do
> > > is kill processes with that file mapped into their address spaces, but
> > > future patches could actually do something about corrupt metadata.
> > >
> > > After that, the memory failure needs to notify the processes who are
> > > using those files.
...
> > > @@ -1964,8 +1965,8 @@ xfs_alloc_buftarg(
> > >   btp->bt_mount = mp;
> > >   btp->bt_dev =  bdev->bd_dev;
> > >   btp->bt_bdev = bdev;
> > > - btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, 
> > > NULL,
> > > - NULL);
> > > + btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, mp,
> > > + &xfs_dax_holder_operations);
> >
> > I see a problem with this: we are setting up notify callbacks before
> > we've even read in the superblock during mount. i.e. we don't even
> > know yet if we've got an XFS filesystem on this block device.
> > Hence these notifications need to be delayed until after the
> > filesystem is mounted, all the internal structures have been set up
> > and log recovery has completed.
> 
> So I think this gets back to the fact that there will eventually be 2
> paths into this notifier.

I'm not really concerned by how the notifications are generated;
my concern is purely that notifications can be handled safely.

> All that to say, I think it is ok / expected for the filesystem to
> drop notifications on the floor when it is not ready to handle them.

Well, yes. The whole point of notifications is the consumer makes
the decision on what to do with the notification it receives - the
producer of the notification does not (and can not) dictate what
policy the consumer(s) implement...

> For example there are no processes to send SIGBUS to if the filesystem
> has not even finished mount.

There may be not processes to send SIGBUS to even if the filesystem
has finished mount. But we still want the notifications to be
delivered and we still need to handle them safely.

IOWs, while we might start by avoiding notifications during mount,
this doesn't mean we will never have reason to process events during
mount. What we do with this notification is going to evolve over
time as we add new and adapt existing functionality

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v12 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-12 Thread Dave Chinner
On Mon, Apr 11, 2022 at 12:09:03AM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/Makefile |   5 +
>  fs/xfs/xfs_buf.c|   7 +-
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 219 
>  fs/xfs/xfs_super.h  |   1 +
>  6 files changed, 233 insertions(+), 3 deletions(-)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..09f5560e29f2 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -128,6 +128,11 @@ xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
>  
> +# notify failure
> +ifeq ($(CONFIG_MEMORY_FAILURE),y)
> +xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o
> +endif
> +
>  # online scrub/repair
>  ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
>  
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f9ca08398d32..9064b8dfbc66 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -5,6 +5,7 @@
>   */
>  #include "xfs.h"
> #include <linux/backing-dev.h>
> +#include <linux/dax.h>
>  
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
> @@ -1911,7 +1912,7 @@ xfs_free_buftarg(
> list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> - fs_put_dax(btp->bt_daxdev, NULL);
> + fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>   kmem_free(btp);
>  }
> @@ -1964,8 +1965,8 @@ xfs_alloc_buftarg(
>   btp->bt_mount = mp;
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
> - btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
> - NULL);
> + btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, mp,
> + &xfs_dax_holder_operations);

I see a problem with this: we are setting up notify callbacks before
we've even read in the superblock during mount. i.e. we don't even
know yet if we've got an XFS filesystem on this block device.

Hence if we get a notification immediately after registering this
notification callback

[...]

> +
> +static int
> +xfs_dax_notify_ddev_failure(
> + struct xfs_mount*mp,
> + xfs_daddr_t daddr,
> + xfs_daddr_t bblen,
> + int mf_flags)
> +{
> + struct xfs_trans *tp = NULL;
> + struct xfs_btree_cur *cur = NULL;
> + struct xfs_buf  *agf_bp = NULL;
> + int error = 0;
> + xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> + xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
> + xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
> + xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);

... none of this code is going to function correctly because it
is dependent on the superblock having been read, validated and
copied to the in-memory superblock.

> + error = xfs_trans_alloc_empty(mp, &tp);
> + if (error)
> + return error;

... and it's not valid to use transactions (even empty ones) before
log recovery has completed and set the log up correctly.

> +
> + for (; agno <= end_agno; agno++) {
> + struct xfs_rmap_irec ri_low = { };
> + struct xfs_rmap_irec ri_high;
> + struct failure_info notify;
> + struct xfs_agf  *agf;
> + xfs_agblock_t   agend;
> +
> + error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> + if (error)
> + break;
> +
> + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);

... and none of the structures this rmapbt walk is dependent on
(e.g. perag structures) have been initialised yet so there's null
pointer dereferences going to happen here.

Perhaps even worse is that the rmapbt is not guaranteed to be in
consistent state until after log recovery has completed, so this
walk could get stuck forever in a stale on-disk cycle that
recovery would have corrected

Hence these notifications need to be delayed until after the
filesystem is mounted, all the internal structures have been set up
and log recovery has completed.
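
As a rough illustration of the kind of gating I mean (a sketch only;
the state flag and the dax_holder() accessor here are illustrative
assumptions, not the actual patch):

static int
xfs_dax_notify_failure(
	struct dax_device	*dax_dev,
	u64			offset,
	u64			len,
	int			mf_flags)
{
	struct xfs_mount	*mp = dax_holder(dax_dev);

	/* Hypothetical "mount fully completed" state bit. */
	if (!test_bit(XFS_EXAMPLE_MOUNT_DONE, &mp->m_example_state))
		return -EOPNOTSUPP;	/* drop events that arrive too early */

	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
			mf_flags);
}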

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-18 Thread Dave Chinner
On Fri, Apr 16, 2021 at 10:14:39AM +0530, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > 
> > > As an alternative approach, I have this below hack that does lazy
> > > list_lru creation. The memcg-specific list is created and initialized
> > > only when there is a request to add an element to that particular
> > > list. Though I am not sure about the full impact of this change
> > > on the owners of the lists and also the performance impact of this,
> > > the overall savings look good.
> > 
> > Avoiding memory allocation in list_lru_add() was one of the main
> > reasons for up-front static allocation of memcg lists. We cannot do
> > memory allocation while callers are holding multiple spinlocks in
> > core system algorithms (e.g. dentry_kill -> retain_dentry ->
> > d_lru_add -> list_lru_add), let alone while holding an internal
> > spinlock.
> > 
> > Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> > path we know might have memory demand in the *hundreds of GB* range
> > gets an NACK from me. It's a great idea, but it's just not a
> 
> I do understand that GFP_ATOMIC allocations are really not preferrable
> but want to point out that the allocations in the range of hundreds of
> GBs get reduced to tens of MBs when we do lazy list_lru head allocations
> under GFP_ATOMIC.

That does not make GFP_ATOMIC allocations safe or desirable. In
general, using GFP_ATOMIC outside of interrupt context indicates
something is being done incorrectly. Especially if it can be
triggered from userspace, which is likely in this particular case...



> As shown earlier, this is what I see in my experimental setup with
> 10k containers:
> 
> Number of kmalloc-32 allocations
>   Before  During  After
> W/o patch 178176  3442409472  388933632
> W/  patch 190464  468992  468992

SO now we have an additional half million GFP_ATOMIC allocations
when we currently have none. That's not an improvement, that rings
loud alarm bells.

> This does really depend and vary on the type of the container and
> the number of mounts it does, but I suspect we are looking
> at GFP_ATOMIC allocations in the MB range. Also the number of
> GFP_ATOMIC slab allocation requests matter I suppose.

They are slab allocations, which mean every single one of them
could require a new slab backing page (pages!) to be allocated.
Hence the likely memory demand might be a lot higher than the
optimal case you are considering here...

> There are other users of list_lru, but I was just looking at
> dentry and inode list_lru usecase. It appears to me that for both
> dentry and inode, we can tolerate the failure from list_lru_add
> due to GFP_ATOMIC allocation failure. The failure to add dentry
> or inode to the lru list means that they won't be retained in
> the lru list, but would be freed immediately. Is this understanding
> correct?

No. Both retain_dentry() and iput_final() would currently leak
objects that fail insertion into the LRU. They don't check for
insertion success at all.
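
To illustrate with a simplified sketch (not the real retain_dentry()
code): list_lru_add() is reached with the dentry lock already held,
which is exactly why a lazy per-memcg list allocation there would have
to be GFP_ATOMIC, and why a failed insert currently goes unnoticed:

static void example_retain_dentry(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	/* d_lru_add() -> list_lru_add() runs under d_lock */
	if (!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)) {
		/*
		 * Insertion failed (e.g. a lazy GFP_ATOMIC allocation of
		 * the memcg list head failed). Nothing handles this - the
		 * dentry is neither on the LRU nor freed, so it is leaked.
		 */
	}
	spin_unlock(&dentry->d_lock);
}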

But, really, this is irrelevant - GFP_ATOMIC usage is the problem,
and allowing it to fail doesn't avoid the problems that unbound
GFP_ATOMIC allocation can have on the stability of the rest of the
system when low on memory. Being able to handle a GFP_ATOMIC memory
allocation failure doesn't change the fact that you should not be
doing GFP_ATOMIC allocation in the first place...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Dave Chinner
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner  wrote:
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page form the LRU, we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.

You're still looking at micro-scale behaviour, not the larger-scale
batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
shrink_page_list() at a different rate?

When I say "batch of work" when talking about the page cache cycling
*500 thousand pages a second* through the cache, I'm not talking
about batches of 32 pages. I'm talking about the entire batch of
work kswapd does in an invocation cycle.

Is it scanning 100k pages 10 times a second? or 10k pages a hundred
times a second? How long does a batch take to run? How long does it
sleep between processing batches? Is there any change in these
metrics as a result of the multi-gen LRU patches?

Basically, we're looking at how access to the mapping lock is
changing the contention profile, and whether that is signficant or
not. I suspect it is, because when you have highly contended locks
and you do something external that reduces unrelated lock
contention, it's because that external thing is taking more time to
do and so there's less time to spend hitting locks hard...

As such, I don't think this test is a good measure of the multi-gen
LRU patches at all - performance is dominated by the severity of
lock contention external to the LRU scanning algorithm, and it's
hard to infer anything through such lock contention

> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are not to your liking. I still have the raw perf.data.

Which makes the discussion thread just about impossible to follow or
comment on. Please just post the relevant excerpt of the stack
profile that you are commenting on.

> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> &g

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Dave Chinner
On Wed, Apr 14, 2021 at 08:43:36AM -0600, Jens Axboe wrote:
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park 
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> >>>
> >>>> What's new in v2
> >>>> 
> >>>> Special thanks to Jens Axboe for reporting a regression in buffered
> >>>> I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> > 
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> > 
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> > 
> > -   20.06% 0.00%  [kernel]   [k] kswapd
> >- 20.06% kswapd
> >   - 20.05% balance_pgdat
> >  - 20.03% shrink_node
> > - 19.92% shrink_lruvec
> >- 19.91% shrink_inactive_list
> >   - 19.22% shrink_page_list
> >  - 17.51% __remove_mapping
> > - 14.16% _raw_spin_lock_irqsave
> >- 14.14% do_raw_spin_lock
> > __pv_queued_spin_lock_slowpath
> > - 1.56% __delete_from_page_cache
> >  0.63% xas_store
> > - 0.78% _raw_spin_unlock_irqrestore
> >- 0.69% do_raw_spin_unlock

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Dave Chinner
On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > From: SeongJae Park 
> > > >
> > > > Hello,
> > > >
> > > >
> > > > Very interesting work, thank you for sharing this :)
> > > >
> > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> > > >
> > > >> What's new in v2
> > > >> 
> > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > >> I/O and helping test the fix.
> > > >
> > > > Is the discussion open?  If so, could you please give me a link?
> > >
> > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > it's on the google page reclaim list. Not sure if that is public or not.
> > >
> > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > to suck (a lot) for high throughput applications. My test case was
> > > pretty simple:
> > >
> > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > happens when the page cache gets filled up. For this particular test,
> > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06% 0.00%  [kernel]   [k] kswapd
> >- 20.06% kswapd
> >   - 20.05% balance_pgdat
> >  - 20.03% shrink_node
> > - 19.92% shrink_lruvec
> >- 19.91% shrink_inactive_list
> >   - 19.22% shrink_page_list
> >  - 17.51% __remove_mapping
> > - 14.16% _raw_spin_lock_irqsave
> >- 14.14% do_raw_spin_lock
> > __pv_queued_spin_lock_slowpath
> > - 1.56% __delete_from_page_cache
> >  0.63% xas_store
> > - 0.78% _raw_spin_unlock_irqrestore

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Dave Chinner
4   0.0   0:22.70 kswapd1
   1146 root  20   0   0  0  0 S  44.0   0.0   0:21.85 kswapd0
   1152 root  20   0   0  0  0 S  39.7   0.0   0:18.28 kswapd3
   1151 root  20   0   0  0  0 S  15.2   0.0   0:12.14 kswapd2

i.e. when memory reclaim kicks in, the read process has 20% less
time with exclusive access to the mapping tree to insert new pages.
Hence buffered read performance goes down quite substantially when
memory reclaim kicks in, and this really has nothing to do with the
memory reclaim LRU scanning algorithm.

I can actually get this machine to pin those 5 processes to 100% CPU
under certain conditions. Each process is spinning all that extra
time on the mapping tree lock, and performance degrades further.
Changing the LRU reclaim algorithm won't fix this - the workload is
solidly bound by the exclusive nature of the mapping tree lock and
the number of tasks trying to obtain it exclusively...

> The initial posting of this patchset did no better, in fact it did a bit
> worse. Performance dropped to the same levels and kswapd was using as
> much CPU as before, but on top of that we also got excessive swapping.
> Not at a high rate, but 5-10MB/sec continually.
>
> I had some back and forths with Yu Zhao and tested a few new revisions,
> and the current series does much better in this regard. Performance
> still dips a bit when page cache fills, but not nearly as much, and
> kswapd is using less CPU than before.

Profiles would be interesting, because it sounds to me like reclaim
*might* be batching page cache removal better (e.g. fewer, larger
batches) and so spending less time contending on the mapping tree
lock...

IOWs, I suspect this result might actually be a result of less lock
contention due to a change in batch processing characteristics of
the new algorithm rather than it being a "better" algorithm...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: bl_list and lockdep

2021-04-13 Thread Dave Chinner
On Tue, Apr 13, 2021 at 01:18:35AM +0200, Thomas Gleixner wrote:
> Dave,
> 
> On Tue, Apr 13 2021 at 08:15, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 05:20:53PM +0200, Thomas Gleixner wrote:
> >> On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
> >> > And, FWIW, I'm also aware of the problems that RT kernels have with
> >> > the use of bit spinlocks and being unable to turn them into sleeping
> >> > mutexes by preprocessor magic. I don't care about that either,
> >> > because dentry cache...
> >> 
> >> In the dentry cache it's a non-issue.
> >
> > Incorrect.
> 
> I'm impressed about your detailed knowledge of something you do not care
> about in the first place.

There's a difference between "don't care because don't understand"
and "don't care because I know how complex real-time is and I know I
can't validate any code I write to be RT safe".

Indeed, just because I work on filesystems now doesn't mean I don't
know what real-time is - I spent the best part of a decade as an
industrial control engineer building systems that provided water and
electricity to populations of millions of people before I started
working on filesystems

> >> RT does not have a problem with bit spinlocks per se, it depends on how
> >> they are used and what nests inside. Most of them are just kept as bit
> >> spinlocks because the lock held, and therefore preempt disabled times
> >> are small and no other on RT conflicting operations happen inside.
> >> 
> >> In the case at hand this is going to be a problem because inode->i_lock
> >> nests inside the bit spinlock and we can't make inode->i_lock a raw
> >> spinlock because it protects way heavier weight code pathes as well.
> >
> > Yes, that's exactly the "problem" I'm refering to. And I don't care,
> > precisely because, well, dentry cache
> >
> > THat is, the dcache calls wake_up_all() from under the
> > hlist_bl_lock() in __d_lookup_done(). That ends up in
> > __wake_up_common_lock() which takes a spin lock embedded inside a
> > wait_queue_head.  That's not a raw spinlock, either, so we already
> > have this "spinlock inside bit lock" situation with the dcache usage
> > of hlist_bl.
> 
> Sure, but you are missing that RT solves that by substituting the
> wait_queue with a swait_queue, which does not suffer from that. But that
> can't be done for the inode::i_lock case for various reasons.

I didn't know about that specific forklift replacement. But, really
that simply adds weight to my comment below

> > FYI, this dentry cache behaviour was added to the dentry cache in
> > 2016 by commit d9171b934526 ("parallel lookups machinery, part 4
> > (and last)"), so it's not like it's a new thing, either.
> 
> Really? I wasn't aware of that. Thanks for the education.
> 
> > If you want to make hlist_bl RT safe, then re-implement it behind
> > the scenes for RT enabled kernels. All it takes is more memory
> > usage for the hash table + locks, but that's something that non-RT
> > people should not be burdened with caring about

... because if RT devs are willing to forklift replace core kernel
functionality like wait queues to provide RT kernels with a
completely different locking schema to vanilla kernels, then
slightly modifying the hlist-bl structure in a RT compatible way is
child's play

> I'm well aware that anything outside of @fromorbit universe is not
> interesting to you, but I neverless want to take the opportunity to
> express my appreciation for your truly caring and collaborative attitude
> versus interests of others who unfortunately do no share that universe.

I'm being realistic. I don't have the time or mental bandwidth to
solve RT kernel problems. I don't have any way to test RT kernels,
and lockdep is a crock of shit for validating RT locking on vanilla
kernels because of the forklift upgrade games like the above that
give the RT kernel a different locking schema.

Because I have sufficient knowledge of the real-time game, I know
*I'm not an RT expert* these days. I know that I don't know all the
games it plays, nor do I have the time (or patience) to learn about
all of them, nor the resources or knowledge to test whether the code
I write follows all the rules I don't know about, whether I
introduced interrupt hold-offs longer than 50us, etc.

IOWs, I chose not to care about RT because I know I don't know
enough about it to write solid, well tested RT compatible kernel
code. I can write untested shit as well as any other programmer, but
I have much higher professional standards than that.

And I also know there are paid professionals who are RT experts who
a

Re: bl_list and lockdep

2021-04-12 Thread Dave Chinner
On Mon, Apr 12, 2021 at 05:20:53PM +0200, Thomas Gleixner wrote:
> Dave,
> 
> On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
> > On Tue, Apr 06, 2021 at 02:28:34PM +0100, Matthew Wilcox wrote:
> >> On Tue, Apr 06, 2021 at 10:33:43PM +1000, Dave Chinner wrote:
> >> > +++ b/fs/inode.c
> >> > @@ -57,8 +57,7 @@
> >> >  
> >> >  static unsigned int i_hash_mask __read_mostly;
> >> >  static unsigned int i_hash_shift __read_mostly;
> >> > -static struct hlist_head *inode_hashtable __read_mostly;
> >> > -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
> >> > +static struct hlist_bl_head *inode_hashtable __read_mostly;
> >> 
> >> I'm a little concerned that we're losing a lockdep map here.  
> >> 
> >> Nobody seems to have done this for list_bl yet, and I'd be reluctant
> >> to gate your patch on "Hey, Dave, solve this problem nobody else has
> >> done yet".
> >
> > I really don't care about lockdep. Adding lockdep support to
> > hlist_bl is somebody else's problem - I'm just using infrastructure
> > that already exists. Also, the dentry cache usage of hlist_bl is
> > vastly more complex and so if lockdep coverage was really necessary,
> > it would have already been done
> >
> > And, FWIW, I'm also aware of the problems that RT kernels have with
> > the use of bit spinlocks and being unable to turn them into sleeping
> > mutexes by preprocessor magic. I don't care about that either,
> > because dentry cache...
> 
> In the dentry cache it's a non-issue.

Incorrect.

> RT does not have a problem with bit spinlocks per se, it depends on how
> they are used and what nests inside. Most of them are just kept as bit
> spinlocks because the lock held, and therefore preempt disabled times
> are small and no other on RT conflicting operations happen inside.
> 
> In the case at hand this is going to be a problem because inode->i_lock
> nests inside the bit spinlock and we can't make inode->i_lock a raw
> spinlock because it protects way heavier weight code pathes as well.

Yes, that's exactly the "problem" I'm refering to. And I don't care,
precisely because, well, dentry cache

THat is, the dcache calls wake_up_all() from under the
hlist_bl_lock() in __d_lookup_done(). That ends up in
__wake_up_common_lock() which takes a spin lock embedded inside a
wait_queue_head.  That's not a raw spinlock, either, so we already
have this "spinlock inside bit lock" situation with the dcache usage
of hlist_bl.
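
Stripped to its essence, the existing nesting looks like this (a
simplified illustration, not the actual __d_lookup_done() code):

static void example_lookup_done(struct hlist_bl_head *b,
				wait_queue_head_t *wq)
{
	hlist_bl_lock(b);	/* bit spinlock on the hash chain head */
	wake_up_all(wq);	/* takes wq->lock, a regular spinlock */
	hlist_bl_unlock(b);
}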

FYI, this dentry cache behaviour was added to the dentry cache in
2016 by commit d9171b934526 ("parallel lookups machinery, part 4
(and last)"), so it's not like it's a new thing, either.

If you want to make hlist_bl RT safe, then re-implement it behind
the scenes for RT enabled kernels. All it takes is more memory
usage for the hash table + locks, but that's something that non-RT
people should not be burdened with caring about
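
Something like this is all I mean - a sketch only, with made up names,
of trading a little memory for a real per-bucket lock on RT builds:

#ifdef CONFIG_PREEMPT_RT
/*
 * RT variant: a real spinlock beside each list head instead of the
 * bit spinlock packed into the low bit of the head pointer. Costs
 * memory, but the lock then behaves like every other RT spinlock.
 */
struct hlist_bl_head {
	struct hlist_bl_node	*first;
	spinlock_t		lock;
};

static inline void hlist_bl_lock(struct hlist_bl_head *b)
{
	spin_lock(&b->lock);
}

static inline void hlist_bl_unlock(struct hlist_bl_head *b)
{
	spin_unlock(&b->lock);
}
#endif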

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

2021-04-08 Thread Dave Chinner
On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> This commit introduces the bpf page cache iterator. This iterator allows
> users to run a bpf prog against each page in the "page cache".
> Internally, the "page cache" is extremely tied to VFS superblock + inode
> combo. Because of this, iter_pagecache will only examine pages in the
> caller's mount namespace.

No, it does not just examine pages within the caller's mount
namespace, because...

> +static struct inode *goto_next_inode(struct bpf_iter_seq_pagecache_info 
> *info)
> +{
> + struct inode *prev_inode = info->cur_inode;
> + struct inode *inode;
> +
> +retry:
> + BUG_ON(!info->cur_sb);
> + spin_lock(&info->cur_sb->s_inode_list_lock);
> +
> + if (!info->cur_inode) {
> + list_for_each_entry(inode, &info->cur_sb->s_inodes, i_sb_list) {

... this is an "all inodes on the superblock" walk.  This will also
iterate inodes in other mount namespaces that point to the same
superblock.

IOWs, if you have different parts of the same filesystem mounted
into hundreds of container mount namespaces, this script will not
just iterate the local mount name space, it will iterate every inode
in every mount namespace.

And, of course, if the same files are mounted into multiple
containers (think read-only bind mounts using idmapping) then you
have zero indication of which container is actually using them, just
that there are hundreds of paths to the same inode. And every
container will appear to be using exactly the same amount of page cache.

IOWs, the stats this generates provide no insight into page cache
usage across mount namespaces in many situations, and it leaks
information about page cache usage across mount namespace
boundaries.

And that's before I say "iterating all inodes in a superblock is
bad" because it causes lock contention and interrupts normal usage.
We avoid s_inodes lists walks as much as we possibly can, and the
last thing we want is for userspace to be able to trivially
instigate long running walks of the s_inodes list. Remember, we can
have hundreds of millions of inodes on this list

> + spin_lock(&inode->i_lock);
> + if (inode_unusual(inode)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + __iget(inode);
> + spin_unlock(&inode->i_lock);

This can spin long enough to trigger livelock warnings. Even if it's
not held that long, it can cause unexpected long tail latencies in
memory reclaim and inode instantiation. Every s_inodes list walk has
cond_resched() built into it now

> + info->ns = current->nsproxy->mnt_ns;
> + get_mnt_ns(info->ns);
> + INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> +
> + spin_lock(&info->ns->ns_lock);
> + list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> + sb = mnt->mnt.mnt_sb;
> +
> + /* The same mount may be mounted in multiple places */
> + if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> + continue;
> +
> + err = radix_tree_insert(&info->superblocks,
> + (unsigned long)sb, (void *)1);

And just because nobody has pointed it out yet: radix_tree_insert()
will do GFP_KERNEL memory allocations inside the spinlock being held
here.



You said that you didn't take the "walk the LRUs" approach because
walking superblocks "seemed simpler". It's not. Page cache residency
and accounting is managed by memcgs, not by mount namespaces.

That is, containers usually have a memcg associated with them to control
memory usage of the container. The page cache used by a container is
accounted directly to the memcg, and memory reclaim can find all the
file-backed page cache pages associated with a memcg very quickly
(via mem_cgroup_lruvec()).  This will find pages associated directly
with the memcg, so it gives you a fairly accurate picture of the
page cache usage within the container.
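
As a rough sketch of how a per-container page cache count falls
straight out of that infrastructure (illustrative only - error
handling and memcg iteration are left out):

static unsigned long memcg_file_cache_pages(struct mem_cgroup *memcg)
{
	unsigned long pages = 0;
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg,
							  NODE_DATA(nid));

		/* file backed page cache pages charged to this memcg */
		pages += lruvec_lru_size(lruvec, LRU_INACTIVE_FILE,
					 MAX_NR_ZONES);
		pages += lruvec_lru_size(lruvec, LRU_ACTIVE_FILE,
					 MAX_NR_ZONES);
	}
	return pages;
}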

This has none of the issues that arise from "sb != mnt_ns" that
walking superblocks and inode lists have, and it doesn't require you
to play games with mounts, superblocks and inode references

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-06 Thread Dave Chinner
cg awareness and feed that into alloc_super() to set/clear
the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
virtual filesystems that expose system information do not really
need full memcg awareness because they are generally only visible to
a single memcg instance...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: bl_list and lockdep

2021-04-06 Thread Dave Chinner
On Tue, Apr 06, 2021 at 02:28:34PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 10:33:43PM +1000, Dave Chinner wrote:
> > +++ b/fs/inode.c
> > @@ -57,8 +57,7 @@
> >  
> >  static unsigned int i_hash_mask __read_mostly;
> >  static unsigned int i_hash_shift __read_mostly;
> > -static struct hlist_head *inode_hashtable __read_mostly;
> > -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
> > +static struct hlist_bl_head *inode_hashtable __read_mostly;
> 
> I'm a little concerned that we're losing a lockdep map here.  
> 
> Nobody seems to have done this for list_bl yet, and I'd be reluctant
> to gate your patch on "Hey, Dave, solve this problem nobody else has
> done yet".

I really don't care about lockdep. Adding lockdep support to
hlist_bl is somebody else's problem - I'm just using infrastructure
that already exists. Also, the dentry cache usage of hlist_bl is
vastly more complex and so if lockdep coverage was really necessary,
it would have already been done

And, FWIW, I'm also aware of the problems that RT kernels have with
the use of bit spinlocks and being unable to turn them into sleeping
mutexes by preprocessor magic. I don't care about that either,
because dentry cache...

> But maybe we can figure out how to do this?  I think we want one lockdep
> map for the entire hashtable (it's split for performance, not because
> they're logically separate locks).  So something like ...
> 
> static struct lockdep_map inode_hash_map =
>   STATIC_LOCKDEP_MAP_INIT("inode_hash", &inode_hash_map);
> 
> and add:
> 
> static inline void hlist_bl_lock_map(struct hlist_bl_head *b,
>   struct lockdep_map *map)
> {
>   bit_spin_lock(0, (unsigned long *)b);
>   spin_acquire(map, 0, 0, _RET_IP_);
> }
> 
> then use hlist_bl_lock_map() throughout.  Would that work?
> Adding lockdep experts for opinions.

Maybe, but it's kinda messy to have to carry the lockdep map around
externally to the structure. Not to mention it won't support holding
multiple hash chains locked at once if that is ever needed (e.g.
rehashing an object with a different hashval)...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 3/3] vfs: inode cache conversion to hash-bl

2021-04-06 Thread Dave Chinner
From: Dave Chinner 

Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.

Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a 5.10 kernel:

-   52.13% 0.04%  [kernel][k] ext4_create
   - 52.09% ext4_create
  - 41.03% __ext4_new_inode
 - 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
   - do_raw_spin_lock
  - 24.97% __pv_queued_spin_lock_slowpath


-   72.33% 0.02%  [kernel][k] do_filp_open
   - 72.31% do_filp_open
  - 72.28% path_openat
 - 57.03% bch2_create
- 56.46% __bch2_create
   - 40.43% inode_insert5
  - 36.07% _raw_spin_lock
 - do_raw_spin_lock
  35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode

btrfs was tested but it is limited by internal lock contention at
>=2 threads on this workload, so never hammers the inode cache lock
hard enough for this change to matter to its performance.

However, both bcachefs and ext4 demonstrate poor scaling at >=8
threads on concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value that is
used to hash the inode is not generated from the inode itself. i.e.
filesystems can provide this themselves so we have to either store
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...
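
Concretely, the shape of that is something like the following sketch
(field and helper names here just illustrate the idea; this is not an
excerpt of the diff below):

struct inode {
	/* ... */
	struct hlist_bl_node	i_hash;
	struct hlist_bl_head	*i_hash_head;	/* bucket we are hashed on */
	/* ... */
};

static void example_remove_inode_hash(struct inode *inode)
{
	struct hlist_bl_head	*b = inode->i_hash_head;

	hlist_bl_lock(b);
	hlist_bl_del_init(&inode->i_hash);
	inode->i_hash_head = NULL;
	hlist_bl_unlock(b);
}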

Concurrent create with varying thread count (files/s):

                ext4                bcachefs
threads     vanilla  patched    vanilla  patched
2           117k     112k       80k      85k
4           185k     190k       133k     145k
8           303k     346k       185k     255k
16          389k     465k       190k     420k
32          360k     437k       142k     481k

CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
halved on the patched kernel, while performance has increased
marginally on ext4 and massively on bcachefs. Internal filesystem
algorithms now limit performance on these workloads, not the global
inode_hash_lock.

Profile of the workloads on the patched kernels:

-   35.94% 0.07%  [kernel]  [k] ext4_create
   - 35.87% ext4_create
  - 20.45% __ext4_new_inode
...
   3.36% insert_inode_locked

   - 78.43% do_filp_open
  - 78.36% path_openat
 - 53.95% bch2_create
- 47.99% __bch2_create

  - 7.57% inode_insert5
6.94% find_inode

Spinlock contention is largely gone from the inode hash operations
and the filesystems are limited by contention in their internal
algorithms.

Signed-off-by: Dave Chinner 
---
 fs/inode.c | 200 -
 include/linux/fs.h |   9 +-
 2 files changed, 132 insertions(+), 77 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index b8d9eb3454dc..867af386177b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -57,8 +57,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
-static struct hlist_head *inode_hashtable __read_mostly;
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
@@ -70,7 +69,7 @@ static unsigned long hash(struct super_block *sb, unsigned 
long hashval)
return tmp & i_hash_mask;
 }
 
-static inline struct hlist_head *i_hash_head(struct super_block *sb,
+static inline struct hlist_bl_head *i_hash_head(struct super_block *sb,
unsigned int hashval)
 {
return inode_hashtable + hash(sb, hashval);
@@ -407,7 +406,7 @@ EXPORT_SYMBOL(address_space_init_once);
 void inode_init_once(struct inode *inode)
 {
memset(inode, 0, sizeof(*inode));
-   INIT_HLIST_NODE(&inode->i_hash);
+   INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_wb_list);
@@ -491,6 +490,17 @@ static inline void inode_sb_lis

[PATCH 2/3] hlist-bl: add hlist_bl_fake()

2021-04-06 Thread Dave Chinner
From: Dave Chinner 

In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.

Signed-off-by: Dave Chinner 
---
 include/linux/list_bl.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index ae1b541446c9..8ee2bf5af131 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -143,6 +143,28 @@ static inline void hlist_bl_del_init(struct hlist_bl_node 
*n)
}
 }
 
+/**
+ * hlist_bl_add_fake - create a fake list consisting of a single headless node
+ * @n: Node to make a fake list out of
+ *
+ * This makes @n appear to be its own predecessor on a headless hlist.
+ * The point of this is to allow things like hlist_bl_del() to work correctly
+ * in cases where there is no list.
+ */
+static inline void hlist_bl_add_fake(struct hlist_bl_node *n)
+{
+   n->pprev = &n->next;
+}
+
+/**
+ * hlist_bl_fake: Is this node a fake hlist_bl?
+ * @h: Node to check for being a self-referential fake hlist.
+ */
+static inline bool hlist_bl_fake(struct hlist_bl_node *n)
+{
+   return n->pprev == &n->next;
+}
+
 static inline void hlist_bl_lock(struct hlist_bl_head *b)
 {
bit_spin_lock(0, (unsigned long *)b);
-- 
2.31.0
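
A usage sketch of why a filesystem wants this (not part of the patch
itself): an inode that the filesystem indexes privately can be marked
as if it were hashed, so the generic unhash path still does the right
thing:

static void example_init_private_inode(struct inode *inode)
{
	/*
	 * Pretend the inode is hashed so that a later hlist_bl_del()
	 * in the generic teardown path operates on a valid node even
	 * though the inode was never inserted into the VFS inode hash.
	 */
	hlist_bl_add_fake(&inode->i_hash);
}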



[PATCH 1/3] vfs: factor out inode hash head calculation

2021-04-06 Thread Dave Chinner
From: Dave Chinner 

In preparation for changing the inode hash table implementation.

Signed-off-by: Dave Chinner 
---
 fs/inode.c | 44 +---
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a..b8d9eb3454dc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,22 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+   unsigned long tmp;
+
+   tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+   L1_CACHE_BYTES;
+   tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
+   return tmp & i_hash_mask;
+}
+
+static inline struct hlist_head *i_hash_head(struct super_block *sb,
+   unsigned int hashval)
+{
+   return inode_hashtable + hash(sb, hashval);
+}
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -475,16 +491,6 @@ static inline void inode_sb_list_del(struct inode *inode)
}
 }
 
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-   unsigned long tmp;
-
-   tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-   L1_CACHE_BYTES;
-   tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
-   return tmp & i_hash_mask;
-}
-
 /**
  * __insert_inode_hash - hash an inode
  * @inode: unhashed inode
@@ -1073,7 +1079,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned 
long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
 {
-   struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+   struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
struct inode *old;
bool creating = inode->i_state & I_CREATING;
 
@@ -1173,7 +1179,7 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;
 again:
spin_lock(&inode_hash_lock);
@@ -1241,7 +1247,7 @@ EXPORT_SYMBOL(iget_locked);
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-   struct hlist_head *b = inode_hashtable + hash(sb, ino);
+   struct hlist_head *b = i_hash_head(sb, ino);
struct inode *inode;
 
hlist_for_each_entry_rcu(inode, b, i_hash) {
@@ -1328,7 +1334,7 @@ EXPORT_SYMBOL(igrab);
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+   struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode;
 
spin_lock(&inode_hash_lock);
@@ -1383,7 +1389,7 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;
 again:
spin_lock(&inode_hash_lock);
@@ -1432,7 +1438,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
 void *),
void *data)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+   struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode, *ret_inode = NULL;
int mval;
 
@@ -1477,7 +1483,7 @@ EXPORT_SYMBOL(find_inode_nowait);
 struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
 int (*test)(struct inode *, void *), void *data)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+   struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode;
 
RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1515,7 +1521,7 @@ EXPORT_SYMBOL(find_inode_rcu);
 struct inode *find_inode_by_ino_rcu(struct super_block *sb,
unsigned long ino)
 {
-   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;
 
RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1535,7 +1541,7 @@ int insert_inode_locked(struct inode *inode)
 {
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
-   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct hlist_head *head = i_hash_head(sb, ino);
 
while (1) {
 

[RFC PATCH 0/3] vfs: convert inode cache to hlist-bl

2021-04-06 Thread Dave Chinner
Hi folks,

Recently I've been doing some scalability characterisation of
various filesystems, and one of the limiting factors that has
prevented me from exploring filesystem characteristics is the
inode hash table. namely, the global inode_hash_lock that protects
it.

This has long been a problem, but I personally haven't cared about
it because, well, XFS doesn't use it and so it's not a limiting
factor for most of my work. However, in trying to characterise the
scalability boundaries of bcachefs, I kept hitting against VFS
limitations first. bcachefs hits the inode hash table pretty hard
and it becomes a contention point a lot sooner than it does for
ext4. Btrfs also uses the inode hash, but its namespace doesn't
have the capability to stress the inode hash lock due to it hitting
internal contention first.

Long story short, I did what should have been done a decade or more
ago - I converted the inode hash table to use hlist-bl to split up
the global lock. This is modelled on the dentry cache, with one
minor tweak. That is, the inode hash value cannot be calculated from
the inode, so we have to keep a record of either the hash value or a
pointer to the hlist-bl list head that the inode is hashed into so
taht we can lock the corect list on removal.

Other than that, this is mostly just a mechanical conversion from
one list and lock type to another. None of the algorithms have
changed and none of the RCU behaviours have changed. But it removes
the inode_hash_lock from the picture and so performance for bcachefs
goes way up and CPU usage for ext4 halves at 16 and 32 threads. At
higher thread counts, we start to hit filesystem and other VFS locks
as the limiting factors. Profiles and performance numbers are in
patch 3 for those that are curious.

I've been running this in benchmarks and perf testing across
bcachefs, btrfs and ext4 for a couple of weeks, and it passes
fstests on ext4 and btrfs without regressions. So now it needs more
eyes and testing and hopefully merging

Cheers,

Dave.




Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-19 Thread Dave Chinner
On Thu, Mar 18, 2021 at 12:20:35PM -0700, Dan Williams wrote:
> On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner  wrote:
> >
> > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > > Jason wondered why the get_user_pages_fast() path takes references on a
> > > @pgmap object. The rationale was to protect against accessing a 'struct
> > > page' that might be in the process of being removed by the driver, but
> > > he rightly points out that should be solved the same way all gup-fast
> > > synchronization is solved which is invalidate the mapping and let the
> > > gup slow path do @pgmap synchronization [1].
> > >
> > > To achieve that it means that new user mappings need to stop being
> > > created and all existing user mappings need to be invalidated.
> > >
> > > For device-dax this is already the case as kill_dax() prevents future
> > > faults from installing a pte, and the single device-dax inode
> > > address_space can be trivially unmapped.
> > >
> > > The situation is different for filesystem-dax where device pages could
> > > be mapped by any number of inode address_space instances. An initial
> > > thought was to treat the device removal event like a drop_pagecache_sb()
> > > event that walks superblocks and unmaps all inodes. However, Dave points
> > > out that it is not just the filesystem user-mappings that need to react
> > > to global DAX page-unmap events, it is also filesystem metadata
> > > (proposed DAX metadata access), and other drivers (upstream
> > > DM-writecache) that need to react to this event [2].
> > >
> > > The only kernel facility that is meant to globally broadcast the loss of
> > > a page (via corruption or surprise remove) is memory_failure(). The
> > > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > > However, the events that would trigger the need to call memory_failure()
> > > over a full PMEM device should be rare.
> >
> > This is a highly suboptimal design. Filesystems only need a single
> > callout to trigger a shutdown that unmaps every active mapping in
> > the filesystem - we do not need a page-by-page error notification
> > which results in 250 million hwpoison callouts per TB of pmem to do
> > this.
> >
> > Indeed, the moment we get the first hwpoison from this patch, we'll
> > map it to the primary XFS superblock and we'd almost certainly
> > consider losing the storage behind that block to be a shut down
> > trigger. During the shutdown, the filesystem should unmap all the
> > active mappings (we already need to add this to shutdown on DAX
> > regardless of this device remove issue) and so we really don't need
> > a page-by-page notification of badness.
> 
> XFS doesn't, but what about device-mapper and other agents? Even if
> the driver had a callback up the stack memory_failure() still needs to
> be able to trigger failures down the stack for CPU consumed poison.

If the device is gone, then they don't need page by page
notification, either. Tell them the entire device is gone so they
can do what they need (like pass it up to the filesystem as ranges
of badness!).

> > AFAICT, it's going to take minutes, maybe hours to do the page-by-page
> > iteration to hwpoison every page. It's going to take a few seconds
> > for the filesystem shutdown to run a device wide invalidation.
> >
> > SO, yeah, I think this should simply be a single ranged call to the
> > filesystem like:
> >
> > ->memory_failure(dev, 0, -1ULL)
> >
> > to tell the filesystem that the entire backing device has gone away,
> > and leave the filesystem to handle failure entirely at the
> > filesystem level.
> 
> So I went with memory_failure() after our discussion of all the other
> agents in the system that might care about these pfns going offline
> and relying on memory_failure() to route down to each of those. I.e.
> the "reuse the drop_pagecache_sb() model" idea was indeed
> insufficient.

Using drop_pagecache_sb is insufficient because filesystems have
more than just inode indexed caches that pmem failure may affect.
This is not an argument against a "knock everything down at once"
notification model, just that drop_pagecache_sb() is ...
insufficient to do what we need...

> Now I'm trying to reconcile the fact that platform
> poison handling will hit memory_failure() first and may not
> immediately reach the driver, if ever (see the perennially awkward
> firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> even if the ->memory_failu

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Dave Chinner
On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> 
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> 
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> 
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> 
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare.

This is a highly suboptimal design. Filesystems only need a single
callout to trigger a shutdown that unmaps every active mapping in
the filesystem - we do not need a page-by-page error notification
which results in 250 million hwpoison callouts per TB of pmem to do
this.

Indeed, the moment we get the first hwpoison from this patch, we'll
map it to the primary XFS superblock and we'd almost certainly
consider losing the storage behind that block to be a shut down
trigger. During the shutdown, the filesystem should unmap all the
active mappings (we already need to add this to shutdown on DAX
regardless of this device remove issue) and so we really don't need
a page-by-page notification of badness.

AFAICT, it's going to take minutes, maybe hours to do the page-by-page
iteration to hwpoison every page. It's going to take a few seconds
for the filesystem shutdown to run a device wide invalidation.

SO, yeah, I think this should simply be a single ranged call to the
filesystem like:

->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.
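
In interface terms, something along these lines (a sketch of the shape
I'm arguing for - the ops structure and naming are illustrative, not
an existing API):

struct dax_holder_notify_ops {
	/*
	 * Notify the holder (e.g. the filesystem) that the byte range
	 * [offset, offset + len) of the device is gone or poisoned.
	 */
	int (*memory_failure)(struct dax_device *dax_dev, u64 offset,
			      u64 len, int mf_flags);
};

/* Driver side: whole device removal is one call, not one per page. */
static int example_notify_device_gone(struct dax_device *dax_dev,
				const struct dax_holder_notify_ops *ops)
{
	return ops->memory_failure(dax_dev, 0, -1ULL, 0);
}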

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 00/25] Page folios

2021-03-15 Thread Dave Chinner
On Mon, Mar 15, 2021 at 07:40:14PM +, Matthew Wilcox wrote:
> I would agree that the conversion is both straightforward and noisy.
> There are some minor things that crop up, like noticing that we get
> the accounting wrong for writeback of compound pages.  That's not
> entirely unexpected since no filesystem supports both compound pages
> and writeback today.

And this is where the abstraction that the "folio" introduces is
required - filesystem people don't want to have to deal with all the
complexity and subtlety of compound pages when large page support is
added to the page cache. As Willy says, this will be a never-ending
source of data corruption bugs

Hence from the filesystem side of things, I think this abstraction
is absolutely necessary. Especially because handling buffered IO in
4kB pages really, really sucks from a performance perspective and
the sooner we have native high-order page support in the page cache
the better.  These days buffered IO often can't even max out a cheap
NVMe SSD because of the CPU cost of per-page state management in the
page cache. So the sooner we have large page support to mitigate the
worst of the overhead for streaming buffered IO, the better.

FWIW, like others, I'm not a fan of "folio" as a name, but I can live
with it because it so jarringly different to "pages" that we're not
going to confuse the two of them. But if we want a better name, my
suggestion would be for a "struct cage" as in Compound pAGE

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Dave Chinner
On Mon, Mar 08, 2021 at 11:28:41AM +, David Howells wrote:
> Amir Goldstein  wrote:
> 
> > > But after I've written and sync'd the data, I set the xattr to mark the
> > > file not open.  At the moment I'm doing this too lazily, only doing it
> > > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > > really need to add a queue of objects to be sealed as they're closed.  The
> > > balance is working out how often to do the sealing as something like a
> > > shell script can do a lot of consecutive open/write/close ops.
> > 
> > You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> > For example, xfs keeps the "LSN" on each inode, so once the transaction
> > with some LSN has been committed, all the relevant inodes, if not dirty, can
> > be declared as synced, without having to call fsync() on any file and 
> > without
> > having to force transaction commit or any IO at all.
> > 
> > Since fscache takes care of submitting the IO, and it shouldn't care about 
> > any
> > specific time that the data/metadata hits the disk(?), you can make use of 
> > the
> > existing periodic writeback and rolling transaction commit and only ever 
> > need
> > to wait for that to happen before marking cache files "closed".
> > 
> > There was a discussion about fsyncing a range of files on LSFMM [1].
> > In the last comment on the article dchinner argues why we already have that
> > API (and now also with io_uring(), but AFAIK, we do not have a useful
> > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > 
> > [1] https://lwn.net/Articles/789024/
> 
> This sounds like an interesting idea.  Actually, what I probably want is a
> notification to say that a particular object has been completely sync'd to
> disk, metadata and all.

This isn't hard to do yourself in the kernel. All it takes is a
workqueue to run vfs_fsync() calls asynchronously and for the work
to queue a local notification/wakeup when the fsync completes...

That's all aio_fsync() does - the notification it queues on
completion is the AIO completion event for userspace - so I think
you could do this in about 50 lines of code if you really needed
it...
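
Roughly this shape, as a minimal sketch - none of these names exist
in the kernel, and the caller is assumed to wait on the completion
and free the work item itself:

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/completion.h>
#include <linux/workqueue.h>

struct async_fsync {
	struct work_struct	work;
	struct file		*file;
	int			error;
	struct completion	done;
};

static void async_fsync_worker(struct work_struct *work)
{
	struct async_fsync *af = container_of(work, struct async_fsync, work);

	/* flush data and metadata for this file */
	af->error = vfs_fsync(af->file, 0);

	/* local notification: wake whoever queued the work */
	complete(&af->done);
}

static struct async_fsync *async_fsync_queue(struct workqueue_struct *wq,
					     struct file *file)
{
	struct async_fsync *af = kzalloc(sizeof(*af), GFP_KERNEL);

	if (!af)
		return NULL;
	af->file = file;
	init_completion(&af->done);
	INIT_WORK(&af->work, async_fsync_worker);
	queue_work(wq, &af->work);
	return af;
}

/*
 * caller: af = async_fsync_queue(wq, filp); do other work; then
 * wait_for_completion(&af->done), check af->error, kfree(af).
 */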

> However, there are some performance problems are arising in my fscache-iter
> branch:
> 
>  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
>  setxattr).

Async pipelines using unbound workqueues are your friend.
> 
>  (2) It's retaining a lot of open file structs on cache files.  Cachefiles
>  opens the file when it's first asked to access it and retains that till
>  the cookie is relinquished or the cache withdrawn (the file* doesn't
>  contribute to ENFILE/EMFILE but it still eats memory).

Sounds similar to the problem that the NFSd open file cache solves.
(fs/nfsd/filecache.c)

>  (3) Trimming excess data from the end of the cache file.  The problem with
>  using DIO to write to the cache is that the write has to be rounded up to
>  a multiple of the backing fs DIO blocksize,

Actually, a multiple of the logical sector size of the backing
device behind the backing filesystem.

>  but if the file is truncated
>  larger, that excess data now becomes part of the file.

Keep the actual file size in your tracking xattr.
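
i.e. something like this - a userspace-flavoured sketch of the idea
only, and the xattr name is made up:

#include <stdint.h>
#include <sys/xattr.h>

/*
 * Record the real (logical) object size in a private xattr so that
 * rounding the DIO write up, or truncating the cache file larger,
 * can't change what we consider to be valid data.
 */
static int record_object_size(int fd, uint64_t object_size)
{
	return fsetxattr(fd, "user.cachefiles.object_size",
			 &object_size, sizeof(object_size), 0);
}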

>  Possibly it's sufficient to just clear the excess page space before
>  writing, but that doesn't necessarily stop a writable mmap from
>  scribbling on it.

We can't stop mmap from scribbling in it. All filesystems have this
problem, so to prevent data leaks we have to zero the post-eof tail
region on every write of the EOF block, anyway.

>  (4) Committing outstanding cache metadata at cache withdrawal or netfs
>  unmount.  I've previously mentioned this: it ends up with a whole slew of
>  synchronous metadata changes being committed to the cache in one go
>  (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
>  can take quite a long time.  The cache needs to be more proactive in
>  getting stuff committed as it goes along.

Workqueues give you an easy mechanism for async dispatch and
concurrency for synchronous operations. This is a largely solved
problem...

>  (5) Attaching to an object requires a pathwalk to it (normally only two
>  steps) and then reading various xattrs on it - all synchronous, but can
>  be punted to a background threadpool.

a.k.a. punting to a workqueue :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Dave Chinner
On Mon, Mar 08, 2021 at 09:13:55AM +, David Howells wrote:
> Amir Goldstein  wrote:
> 
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >   file for every whole operation (which may combine reads and writes).
> > 
> > I read that NFSv4 supports hole punching, so when using ->bmap() or 
> > SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
> 
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
> 
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
> 
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
> 
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.

Well, that's not the big problem. The issue that makes FIEMAP
unusable for determining if there is user data present in a file is
that on-disk extent maps aren't exactly coherent with in-memory user
data state.

That is, we can have a hole on disk with delalloc user data in
memory.  There's user data in the file, just not on disk. Same goes
for unwritten extents - there can be dirty data in memory over an
unwritten extent, and it won't get converted to written until the
data is written back and the filesystem runs a conversion
transaction.

So, yeah, if you use FIEMAP to determine where data lies in a file
that is being actively modified, you're going get corrupt data
sooner rather than later.  SEEK_HOLE/DATA are coherent with in
memory user data, so don't have this problem.
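
i.e. a trivial userspace walk like this (sketch, minimal error
handling) will always agree with what the page cache says is data:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t data = 0, hole;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* lseek() fails with ENXIO once there is no more data */
	while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data: [%lld, %lld)\n",
		       (long long)data, (long long)hole);
		data = hole;
	}
	close(fd);
	return 0;
}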

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-02 Thread Dave Chinner
 using dax for metadata, to take a fault when pmem is
> offlined.

 if userspace has directly mapped into the cache, and the cache
storage goes away, the userspace app has to be killed because we
have no idea if the device going away has caused data loss or not.
IOWs, if userspace writes direct to the cache device and it hasn't
been written back to other storage when it gets yanked, we have just
caused data corruption to occur.

At minimum, we now have to tell the filesystem that the dirty data
in the cache is now bad, and direct map applications that map those
dirty ranges need to be killed because their backing store is no
longer valid nor does the backup copy contain the data they last
wrote. Nor is it accessible by direct access, which is going to be
interesting because dynamically changing dax to non-dax access can't
be done without forcibly kicking the inode out of the cache. That
requires all references to the inode to go away. And that means the
event really has to go up to the filesystem.

But I think the biggest piece of the puzzle that you haven't grokked
here is that the dm cache device isn't a linear map - it's made up of
random ranges from the underlying devices. Hence the "remove" of a dm
cache device turns into a huge number of small, sparse corrupt
ranges, not a single linear device remove event.

IOWs, device unplug/remove events are not just simple "pass it on"
events in a stacked storage setup. There can be non-trivial mappings
through the layers, and device disappearance may in fact manifest to
the user as data corruption rather than causing data to be
inaccessible.

Hence "remove" notifications just don't work in the storage stack.
They need to be translated to block ranges going bad (i.e.  media
errors), and reported to higher layers as bad ranges, not as device
removal.

The same goes for DAX devices. The moment they can be placed in
storage stacks in non-trivial configurations and/or used as cache
devices that can be directly accessed over traditional block
devices, we end up with error conditions that can only be mapped as
ranges of blocks that have gone bad.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-02 Thread Dave Chinner
On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "range
> > of block device is gone".
> 
> This is the crux of the disagreement. The block_device is going away
> *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying.
You still haven't understood that it's even more basic and generic
than devices going away. At its simplest, all the filesystem
wants is to be notified when *unrecoverable media errors*
occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is built from - PMEM,
flash, corroded spinning disks, MRAM, or any other persistent media
you can think of. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no
longer contain *valid data*. IOWs, the data in that range of the
block device has been lost, cannot be retrieved and/or cannot be
written to any more.

PMEM taking a MCE because ECC tripped is a media error because data
is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and
unrecoverable at that layer.

A device disappearing is a media error because the storage media is
now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of
persistent storage and, as such, is a property of the block devices
used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying
->corrupted_range is not a block device property because only a
pmem/DAX device currently generates it.

You both seem to be NACKing a generic interface because it's only
implemented for the first subsystem that needs it. AFAICT, you
either don't understand or are completely ignoring the architectural
need for it to be provided across the rest of the storage stack that
*block device based filesystems depend on*.

Sure, there might be dax device based filesystems around the corner.
They just require a different pmem device ->corrupted_range callout
to implement the notification - one that directs to the dax device
rather than the block device. That's simple and trivial to
implement, but such functionality for DAX devices does not replace
the need for the same generic functionality to be provided across a
*range of different block devices* as required by *block device
based filesystems*.

And that's fundamentally the problem. XFS is block device based, not
DAX device based. We require errors to be reported through block
device mechanisms. fs-dax does not change this - it is based on pmem
being presented primarily as a block device to the block device
based filesystems and only secondarily as a dax device. Hence if it
can be trivially implemented as a block device interface, that's
where it should go, because then all the other block devices that
the filesystem runs on can provide the same functionality for similar
media error events

> The dax_device removal implies one
> set of actions (direct accessed pfns invalid) the block device removal
> implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block
device notification is a -state change- (i.e. goes offline).

This is exactly what I said was wrong in my last email.

> corrupted_range
> is blurring the notification for 2 different failure domains. Look at
> the nascent idea to mount a filesystem on dax sans a block device.
> Look at the existing plumbing for DM to map dax_operations through a
> device stack.

Ummm, it just maps the direct_access call to the underlying device
and calls its ->direct_access method. All it's doing is LBA
mapping. That's all it needs to do for ->corrupted_range, too.
I have no clue why you think this is a problem for error
notification...

> Look at the pushback Ruan got for adding a new
> block_device operation for corrupted_range().

one person said "no". That's hardly pushback. Especially as I think
Christoph's objection about this being dax specific functionality
is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what
> > action a device driver thinks needs to be taken.
> 
> The driver is only an event producer in this model, whatever the
> consumer does at the other end is not its concern. There may be a
> generic consumer and a filesystem specific consumer.



That's why these are all ops functions that can provide multiple
implementations to different device types. So that when we get a new
use case, the ops function structure can be replaced with one that
directs the notification

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner  wrote:
> > Now we have the filesytem people providing a mechanism for the pmem
> > devices to tell the filesystems about physical device failures so
> > they can handle such failures correctly themselves. Having the
> > device go away unexpectedly from underneath a mounted and active
> > filesystem is a *device failure*, not an "unplug event".
> 
> It's the same difference to the physical page, all mappings to that
> page need to be torn down. I'm happy to call an fs callback and let
> each filesystem do what it wants with a "every pfn in this dax device
> needs to be unmapped".

You keep talking like this is something specific to a DAX device.
It isn't - the filesystem needs to take specific actions if any type
of block device reports that it has a corrupted range, not just DAX.
A DAX device simply adds "and invalidate direct mappings" to the
list of stuff that needs to be done.

And as far as a filesystem is concerned, there is no difference
between "this 4kB range is bad" and "the range of this entire device
is bad". We have to do the same things in both situations.

> I'm looking at the ->corrupted_range() patches trying to map it to
> this use case and I don't see how, for example a realtime-xfs over DM
> over multiple PMEM gets the notification to the right place.
> bd_corrupted_range() uses get_super() which get the wrong answer for
> both realtime-xfs and DM.

I'm not sure I follow your logic. What is generating the wrong
answer?

We already have infrastructure for the block device to look up the
superblock mounted on top of it, and DM already uses that for things
like "dmsetup suspend" to freeze the filesystem before it does
something.  This "superblock lookup" only occurs for the top level
DM device, not for the component pmem devices that make up the DM
device.


IOWs, if there's a DM device that maps multiple pmem devices, then
it should be stacking the bd_corrupted_range() callbacks to map the
physical device range to the range in the higher level DM device
that it belongs to. This mapping of ranges is what DM exists to do -
the filesystem has no clue about what devices make up a DM device,
so the DM device *must* translate ranges for component devices into
the ranges that it maps that device into the LBA range it exposes to
the filesystem.

> I'd flip that arrangement around and have the FS tell the block device
> "if something happens to you, here is the super_block to notify".

We already have a mechanism for this that the block device calls:
get_active_super(bdev). There can be only one superblock per block
device - the superblock has exclusive ownership of the block device
while the filesystem is mounted.

get_active_super() returns the superblock that sits on top of the
bdev with an active reference, allowing the caller to safely access
and operate on the superblock without having to worry about the
superblock going away in the middle of whatever operation the block
device needs to perform.
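
i.e. the pattern is roughly this (sketch only - the
notify_sb_of_bad_range() hook is a placeholder name, not an existing
interface):

#include <linux/fs.h>
#include <linux/blkdev.h>

/* hypothetical filesystem notification hook, named for illustration */
void notify_sb_of_bad_range(struct super_block *sb, loff_t offset,
			    size_t len);

static void bdev_report_bad_range(struct block_device *bdev,
				  loff_t offset, size_t len)
{
	struct super_block *sb = get_active_super(bdev);

	if (!sb)
		return;		/* nothing mounted on this bdev */

	notify_sb_of_bad_range(sb, offset, len);

	/* drop the active reference get_active_super() took */
	deactivate_super(sb);
}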

If this isn't working, then existing storage stack functionality
doesn't work as it should and this needs fixing independently of
the PMEM/DAX stuff we are talking about here.

> So
> to me this looks like a fs_dax_register_super() helper that plumbs the
> superblock through an arbitrary stack of block devices to the leaf
> block-device that might want to send a notification up when a global
> unmap operation needs to be performed.

No, this is just wrong. The filesystem has no clue what block device
is at the leaf level of a block device stack, nor what LBA block
range represents that device within the address space the stacked
block devices present to the filesystem.

> > Please listen when we say "that is
> > not sufficient" because we don't want to be backed into a corner
> > that we have to fix ourselves again before we can enable some basic
> > filesystem functionality that we should have been able to support on
> > DAX from the start...
> 
> That's some revisionist interpretation of how the discovery of the
> reflink+dax+memory-error-handling collision went down.
> 
> The whole point of this discussion is to determine if
> ->corrupted_range() is suitable for this notification, and looking at
> the code as is, it isn't. Perhaps you have a different implementation
> of ->corrupted_range() in mind that allows this to be plumbed
> correctly?

So rather than try to make the generic ->corrupted_range
infrastructure be able to report "DAX range is invalid" (which is
the very definition of a corrupted block device range!), you want
to introduce a DAX specific notification to tell us that a range in
the block device contains invalid/corrupt data?


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

Dan, you made this mistake with the hwpoisoning code that we're
trying to fix that here. Hard coding a 1:1 physical address to
inode/offset into the DAX mapping was a bad mistake. It's also one
that should never have occurred because it's *obviously wrong* to
filesystem developers and has been for a long time.

Now we have the filesystem people providing a mechanism for the pmem
devices to tell the filesystems about physical device failures so
they can handle such failures correctly themselves. Having the
device go away unexpectedly from underneath a mounted and active
filesystem is a *device failure*, not an "unplug event".

The mistake you made was not understanding how filesystems work,
nor actually asking filesystem developers what they actually needed.
You're doing the same thing here - you're telling us what you think
the solution filesystems need is. Please listen when we say "that is
not sufficient" because we don't want to be backed into a corner
that we have to fix ourselves again before we can enable some basic
filesystem functionality that we should have been able to support on
DAX from the start...

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see it. You seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

No.

drop_pagecache_sb() is not a relevant model for telling a filesystem
that the block device underneath has gone away, nor for a device to
ensure that access protections that *are managed by the filesystem*
are enforced/revoked sanely.

drop_pagecache_sb() is a brute-force model for invalidating user
data mappings that the filesystem performs in response to such a
notification. It only needs this brute-force approach if it has no
other way to find active DAX mappings across the range of the device
that has gone away.

But this model doesn't work for direct mapped metadata, journals or
any other internal direct filesystem mappings that aren't referenced
by inodes that the filesystem might be using. The filesystem still
needs to invali

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-28 Thread Dave Chinner
On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > it points to, check if it points to the PMEM that is being removed,
> > grab the page it points to, map that to the relevant struct page,
> > run collect_procs() on that page, then kill the user processes that
> > map that page.
> >
> > So why can't we walk the ptes, check the physical pages that they
> > map to and if they map to a pmem page we go poison that
> > page and that kills any user process that maps it.
> >
> > i.e. I can't see how unexpected pmem device unplug is any different
> > to an MCE delivering a hwpoison event to a DAX mapped page.
> 
> I guess the tradeoff is walking a long list of inodes vs walking a
> large array of pages.

Not really. You're assuming all a filesystem has to do is invalidate
everything if a device goes away, and that's not true. Finding if an
inode has a mapping that spans a specific device in a multi-device
filesystem can be a lot more complex than that. Just walking inodes
is easy - determining which inodes need invalidation is the hard
part.

That's where ->corrupt_range() comes in - the filesystem is already
set up to do reverse mapping from physical range to inode(s)
offsets...

> There's likely always more pages than inodes, but perhaps it's more
> efficient to walk the 'struct page' array than sb->s_inodes?

I really don't see it. You seem to be telling us that invalidation is an
either/or choice. There's more ways to convert physical block
address -> inode file offset and mapping index than brute force
inode cache walks

.

> > IOWs, what needs to happen at this point is very filesystem
> > specific. Assuming that "device unplug == filesystem dead" is not
> > correct, nor is specifying a generic action that assumes the
> > filesystem is dead because a device it is using went away.
> 
> Ok, I think I set this discussion in the wrong direction implying any
> mapping of this action to a "filesystem dead" event. It's just a "zap
> all ptes" event and upper layers recover from there.

Yes, that's exactly what ->corrupt_range() is intended for. It
allows the filesystem to lock out access to the bad range
and then recover the data. Or metadata, if that's where the bad
range lands. If that recovery fails, it can then report a data
loss/filesystem shutdown event to userspace and kill user procs that
span the bad range...

FWIW, is this notification going to occur before or after the device
has been physically unplugged? i.e. what do we do about the
time-of-unplug-to-time-of-invalidation window where userspace can
still attempt to access the missing pmem though the
not-yet-invalidated ptes? It may not be likely that people just yank
pmem nvdimms out of machines, but with NVMe persistent memory
spaces, there's every chance that someone pulls the wrong device...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-27 Thread Dave Chinner
On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
> > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
> > > > > My immediate concern is the issue Jason recently highlighted [1] with
> > > > > respect to invalidating all dax mappings when / if the device is
> > > > > ripped out from underneath the fs. I don't think that will collide
> > > > > with Ruan's implementation, but it does need new communication from
> > > > > driver to fs about removal events.
> > > > >
> > > > > [1]: 
> > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> > > >
> > > > Oh, yay.
> > > >
> > > > The XFS shutdown code is centred around preventing new IO from being
> > > > issued - we don't actually do anything about DAX mappings because,
> > > > well, I don't think anyone on the filesystem side thought they had
> > > > to do anything special if pmem went away from under it.
> > > >
> > > > My understanding -was- that the pmem removal invalidates
> > > > all the ptes currently mapped into CPU page tables that point at
> > > > the dax device across the system. The vmas that manage these
> > > > mappings are not really something the filesystem really manages,
> > > > but a function of the mm subsystem. What the filesystem cares about
> > > > is that it gets page faults triggered when a change of state occurs
> > > > so that it can remap the page to its backing store correctly.
> > > >
> > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > > > CPU ptes, because then when userspace tries to access the
> > > > mapped DAX pages we get a new page fault. In processing the fault, the
> > > > filesystem will try to get direct access to the pmem from the block
> > > > device. This will get an ENODEV error from the block device because
> > > > the backing store (pmem) has been unplugged and is no longer
> > > > there...
> > > >
> > > > AFAICT, as long as pmem removal invalidates all the active ptes that
> > > > point at the pmem being removed, the filesystem doesn't need to
> > > > care about device removal at all, DAX or no DAX...
> > >
> > > How would the pmem removal do that without walking all the active
> > > inodes in the fs at the time of shutdown and call
> > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
> >
> > Which then immediately ends up back at the vmas that manage the ptes
> > to unmap them.
> >
> > Isn't finding the vma(s) that map a specific memory range exactly
> > what the rmap code in the mm subsystem is supposed to address?
> 
> rmap can lookup only vmas from a virt address relative to a given
> mm_struct. The driver has neither the list of mm_struct objects nor
> virt addresses to do a lookup. All it knows is that someone might have
> mapped pages through the fsdax interface.

So there's no physical addr to vma translation in the mm subsystem
at all?

That doesn't make sense. We do exactly this for hwpoison for DAX
mappings. While we don't look at ptes, we get a pfn, grab the page
it points to, check if it points to the PMEM that is being removed,
grab the page it points to, map that to the relevant struct page,
run collect_procs() on that page, then kill the user processes that
map that page.

So why can't we walk the ptes, check the physical pages that they
map to and if they map to a pmem page we go poison that
page and that kills any user process that maps it.

i.e. I can't see how unexpected pmem device unplug is any different
to an MCE delivering a hwpoison event to a DAX mapped page.  Both
indicate a physical address range now contains invalid data and the
filesystem has to take the same action...

IOWs, we could just call ->corrupted_range(0, EOD) here to tell the
filesystem the entire device went away. Then the filesystem can deal
with this however it needs to. However, it would be more efficient
from an invalidation POV to just call it on the pages that have
currently active ptes because once the block device is dead
new page faults on DAX mappings will get a SIGBUS naturally.

> To me this looks like a notifier that fires from memunmap_pages()
> after dev_pagemap_kill() to notify any block_device associated with
> that dev_pagemap() to say that any dax mappings arranged through this
> block_device are now invalid. The reason 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dave Chinner
On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
> >
> > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  
> > > wrote:
> > > >
> > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > > > > Hi, guys
> > > > >
> > > > > Beside this patchset, I'd like to confirm something about the
> > > > > "EXPERIMENTAL" tag for dax in XFS.
> > > > >
> > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> > > > > when we mount a pmem device with dax option, has been existed for a
> > > > > while.  It's a bit annoying when using fsdax feature.  So, my initial
> > > > > intention was to remove this tag.  And I started to find out and solve
> > > > > the problems which prevent it from being removed.
> > > > >
> > > > > As is talked before, there are 3 main problems.  The first one is "dax
> > > > > semantics", which has been resolved.  The rest two are "RMAP for
> > > > > fsdax" and "support dax reflink for filesystem", which I have been
> > > > > working on.
> > > >
> > > > 
> > > >
> > > > > So, what I want to confirm is: does it means that we can remove the
> > > > > "EXPERIMENTAL" tag when the rest two problem are solved?
> > > >
> > > > Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> > > > nothing new pops up, but otherwise the two patchsets you've sent close
> > > > those two big remaining gaps.  Thank you for working on this!
> > > >
> > > > > Or maybe there are other important problems need to be fixed before
> > > > > removing it?  If there are, could you please show me that?
> > > >
> > > > That remains to be seen through QA/validation, but I think that's it.
> > > >
> > > > Granted, I still have to read through the two patchsets...
> > >
> > > I've been meaning to circle back here as well.
> > >
> > > My immediate concern is the issue Jason recently highlighted [1] with
> > > respect to invalidating all dax mappings when / if the device is
> > > ripped out from underneath the fs. I don't think that will collide
> > > with Ruan's implementation, but it does need new communication from
> > > driver to fs about removal events.
> > >
> > > [1]: 
> > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> >
> > Oh, yay.
> >
> > The XFS shutdown code is centred around preventing new IO from being
> > issued - we don't actually do anything about DAX mappings because,
> > well, I don't think anyone on the filesystem side thought they had
> > to do anything special if pmem went away from under it.
> >
> > My understanding -was- that the pmem removal invalidates
> > all the ptes currently mapped into CPU page tables that point at
> > the dax device across the system. The vmas that manage these
> > mappings are not really something the filesystem really manages,
> > but a function of the mm subsystem. What the filesystem cares about
> > is that it gets page faults triggered when a change of state occurs
> > so that it can remap the page to its backing store correctly.
> >
> > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > CPU ptes, because then when userspace tries to access the
> > mapped DAX pages we get a new page fault. In processing the fault, the
> > filesystem will try to get direct access to the pmem from the block
> > device. This will get an ENODEV error from the block device because
> > the backing store (pmem) has been unplugged and is no longer
> > there...
> >
> > AFAICT, as long as pmem removal invalidates all the active ptes that
> > point at the pmem being removed, the filesystem doesn't need to
> > care about device removal at all, DAX or no DAX...
> 
> How would the pmem removal do that without walking all the active
> inodes in the fs at the time of shutdown and call
> unmap_mapping_range(inode->i_mapping, 0, 0, 1)?

Which then immediately ends up back at the vmas that manage the ptes
to unmap them.

Isn't finding the vma(s) that map a specific memory range exactly
what the rmap code in the mm subsystem is supposed to address?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dave Chinner
On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
> >
> > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > > Hi, guys
> > >
> > > Beside this patchset, I'd like to confirm something about the
> > > "EXPERIMENTAL" tag for dax in XFS.
> > >
> > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> > > when we mount a pmem device with dax option, has been existed for a
> > > while.  It's a bit annoying when using fsdax feature.  So, my initial
> > > intention was to remove this tag.  And I started to find out and solve
> > > the problems which prevent it from being removed.
> > >
> > > As is talked before, there are 3 main problems.  The first one is "dax
> > > semantics", which has been resolved.  The rest two are "RMAP for
> > > fsdax" and "support dax reflink for filesystem", which I have been
> > > working on.
> >
> > 
> >
> > > So, what I want to confirm is: does it means that we can remove the
> > > "EXPERIMENTAL" tag when the rest two problem are solved?
> >
> > Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> > nothing new pops up, but otherwise the two patchsets you've sent close
> > those two big remaining gaps.  Thank you for working on this!
> >
> > > Or maybe there are other important problems need to be fixed before
> > > removing it?  If there are, could you please show me that?
> >
> > That remains to be seen through QA/validation, but I think that's it.
> >
> > Granted, I still have to read through the two patchsets...
> 
> I've been meaning to circle back here as well.
> 
> My immediate concern is the issue Jason recently highlighted [1] with
> respect to invalidating all dax mappings when / if the device is
> ripped out from underneath the fs. I don't think that will collide
> with Ruan's implementation, but it does need new communication from
> driver to fs about removal events.
> 
> [1]: 
> http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com

Oh, yay.

The XFS shutdown code is centred around preventing new IO from being
issued - we don't actually do anything about DAX mappings because,
well, I don't think anyone on the filesystem side thought they had
to do anything special if pmem went away from under it.

My understanding -was- that the pmem removal invalidates
all the ptes currently mapped into CPU page tables that point at
the dax device across the system. The vmas that manage these
mappings are not really something the filesystem really manages,
but a function of the mm subsystem. What the filesystem cares about
is that it gets page faults triggered when a change of state occurs
so that it can remap the page to its backing store correctly.

IOWs, all the mm subsystem needs to do when pmem goes away is clear the
CPU ptes, because then when userspace tries to access the
mapped DAX pages we get a new page fault. In processing the fault, the
filesystem will try to get direct access to the pmem from the block
device. This will get an ENODEV error from the block device because
the backing store (pmem) has been unplugged and is no longer
there...

AFAICT, as long as pmem removal invalidates all the active ptes that
point at the pmem being removed, the filesystem doesn't need to
care about device removal at all, DAX or no DAX...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v5 0/4] add simple copy support

2021-02-21 Thread Dave Chinner
On Fri, Feb 19, 2021 at 06:15:13PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> Simple copy command is a copy offloading operation and is  used to copy
> multiple contiguous ranges (source_ranges) of LBA's to a single destination
> LBA within the device reducing traffic between host and device.
> 
> This implementation doesn't add native copy offload support for stacked
> devices rather copy offload is done through emulation. Possible use
> cases are F2FS gc and BTRFS relocation/balance.

It sounds like you are missing the most obvious use case for this:
hooking up filesystem copy_file_range() implementations to allow
userspace to offload user data copies to hardware

Another fs level feature that could use this for hardware
acceleration is fallocate(FALLOC_FL_UNSHARE).

These are probably going to be far easier to hook up than filesystem
GC algorithms, and there is also solid data integrity and stress
testing checking infrastructure for these operations via fstests.

> As SCSI XCOPY can take two different block devices and no of source range is
> equal to 1, this interface can be extended in future to support SCSI XCOPY.

That greatly complicates the implementation. Do we even care
about cross-device XCOPY at this point?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-14 Thread Dave Chinner
On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> > > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner  wrote:
> > > >
> > > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Why are people trying to use copy_file_range on simple /proc and 
> > > > > > > /sys
> > > > > > > files in the first place?  They can not seek (well most can not), 
> > > > > > > so
> > > > > > > that feels like a "oh look, a new syscall, let's use it 
> > > > > > > everywhere!"
> > > > > > > problem that userspace should not do.
> > > > > >
> > > > > > This may have been covered elsewhere, but it's not that people are
> > > > > > saying "let's use copy_file_range on files in /proc."  It's that the
> > > > > > Go language standard library provides an interface to operating 
> > > > > > system
> > > > > > files.  When Go code uses the standard library function io.Copy to
> > > > > > copy the contents of one open file to another open file, then on 
> > > > > > Linux
> > > > > > kernels 5.3 and greater the Go standard library will use the
> > > > > > copy_file_range system call.  That seems to be exactly what
> > > > > > copy_file_range is intended for.  Unfortunately it appears that when
> > > > > > people writing Go code open a file in /proc and use io.Copy the
> > > > > > contents to another open file, copy_file_range does nothing and
> > > > > > reports success.  There isn't anything on the copy_file_range man 
> > > > > > page
> > > > > > explaining this limitation, and there isn't any documented way to 
> > > > > > know
> > > > > > that the Go standard library should not use copy_file_range on 
> > > > > > certain
> > > > > > files.
> > > > >
> > > > > But, is this a bug in the kernel in that the syscall being made is not
> > > > > working properly, or a bug in that Go decided to do this for all types
> > > > > of files not knowing that some types of files can not handle this?
> > > > >
> > > > > If the kernel has always worked this way, I would say that Go is doing
> > > > > the wrong thing here.  If the kernel used to work properly, and then
> > > > > changed, then it's a regression on the kernel side.
> > > > >
> > > > > So which is it?
> > > >
> > > > Both Al Viro and myself have said "copy file range is not a generic
> > > > method for copying data between two file descriptors". It is a
> > > > targeted solution for *regular files only* on filesystems that store
> > > > persistent data and can accelerate the data copy in some way (e.g.
> > > > clone, server side offload, hardware offload, etc). It is not
> > > > intended as a copy mechanism for copying data from one random file
> > > > descriptor to another.
> > > >
> > > > The use of it as a general file copy mechanism in the Go system
> > > > library is incorrect and wrong. It is a userspace bug.  Userspace
> > > > has done the wrong thing, userspace needs to be fixed.
> > > 
> > > OK, we'll take it out.
> > > 
> > > I'll just make one last plea that I think that copy_file_range could
> > > be much more useful if there were some way that a program could know
> > > whether it would work or not.
> 
> Well... we could always implement a CFR_DRYRUN flag that would run
> through all the parameter validation and return 0 just before actually
> starting any real copying logic.  But that wouldn't itself solve the
> problem that there are very old virtual filesystems in Linux that have
> zero-length regular files that behave like a pipe.
> 
> > If you can't tell from userspace that a file has data in it other
> > than by calling read() on it, then you can't use cfr on it.
> 
> I don't know how to do that, Dave. :)

If stat returns 

Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Dave Chinner
On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner  wrote:
> >
> > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH  
> > > > wrote:
> > > > >
> > > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > > files in the first place?  They can not seek (well most can not), so
> > > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > > problem that userspace should not do.
> > > >
> > > > This may have been covered elsewhere, but it's not that people are
> > > > saying "let's use copy_file_range on files in /proc."  It's that the
> > > > Go language standard library provides an interface to operating system
> > > > files.  When Go code uses the standard library function io.Copy to
> > > > copy the contents of one open file to another open file, then on Linux
> > > > kernels 5.3 and greater the Go standard library will use the
> > > > copy_file_range system call.  That seems to be exactly what
> > > > copy_file_range is intended for.  Unfortunately it appears that when
> > > > people writing Go code open a file in /proc and use io.Copy the
> > > > contents to another open file, copy_file_range does nothing and
> > > > reports success.  There isn't anything on the copy_file_range man page
> > > > explaining this limitation, and there isn't any documented way to know
> > > > that the Go standard library should not use copy_file_range on certain
> > > > files.
> > >
> > > But, is this a bug in the kernel in that the syscall being made is not
> > > working properly, or a bug in that Go decided to do this for all types
> > > of files not knowing that some types of files can not handle this?
> > >
> > > If the kernel has always worked this way, I would say that Go is doing
> > > the wrong thing here.  If the kernel used to work properly, and then
> > > changed, then it's a regression on the kernel side.
> > >
> > > So which is it?
> >
> > Both Al Viro and myself have said "copy file range is not a generic
> > method for copying data between two file descriptors". It is a
> > targeted solution for *regular files only* on filesystems that store
> > persistent data and can accelerate the data copy in some way (e.g.
> > clone, server side offload, hardware offload, etc). It is not
> > intended as a copy mechanism for copying data from one random file
> > descriptor to another.
> >
> > The use of it as a general file copy mechanism in the Go system
> > library is incorrect and wrong. It is a userspace bug.  Userspace
> > has done the wrong thing, userspace needs to be fixed.
> 
> OK, we'll take it out.
> 
> I'll just make one last plea that I think that copy_file_range could
> be much more useful if there were some way that a program could know
> whether it would work or not.

If you can't tell from userspace that a file has data in it other
than by calling read() on it, then you can't use cfr on it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Dave Chinner
On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> On Fri, Feb 12, 2021 at 9:49 AM Greg KH  wrote:
> >
> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > > Filesystems such as procfs and sysfs generate their content at
> > > runtime. This implies the file sizes do not usually match the
> > > amount of data that can be read from the file, and that seeking
> > > may not work as intended.
> > >
> > > This will be useful to disallow copy_file_range with input files
> > > from such filesystems.
> > >
> > > Signed-off-by: Nicolas Boichat 
> > > ---
> > > I first thought of adding a new field to struct file_operations,
> > > but that doesn't quite scale as every single file creation
> > > operation would need to be modified.
> >
> > Even so, you missed a load of filesystems in the kernel with this patch
> > series, what makes the ones you did mark here different from the
> > "internal" filesystems that you did not?
> >
> > This feels wrong, why is userspace suddenly breaking?  What changed in
> > the kernel that caused this?  Procfs has been around for a _very_ long
> > time :)
> 
> That would be because of (v5.3):
> 
> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
> 
> The intention of this change (series) was to allow server side copy
> for nfs and cifs via copy_file_range().
> This is mostly work by Dave Chinner that I picked up following requests
> from the NFS folks.
> 
> But the above change also includes this generic change:
> 
> -   /* this could be relaxed once a method supports cross-fs copies */
> -   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> -   return -EXDEV;
> -

This isn't the problem.

The problem is that proc sets the file size to zero length, so
if you attempt to CFR from one proc file to another it will still
report "zero bytes copied" because the source file is zero length.

The other problem is that if the write fails, the generated data
from the /proc file gets thrown away - the splice code treats write
failures like a short read and hence the data sent to the failed
write is consumed and lost.

This has nothing to do with cross-fs cfr support - it's just one
mechanism that can be used to expose the problems that using CFR on
pipes that masquerade as regular files causes.

Userspace can't even tell that CFR failed incorrectly, because the
files that it returns immediate EOF on are zero length. Nor can
userspace know that a short read tossed away data, because it might
actually be a short read rather than a write failure...

> Our option now are:
> - Restore the cross-fs restriction into generic_copy_file_range()
> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does

- Stop trying to use cfr for things it was never intended for.

Anyone using cfr has to be prepared for it to fail and do the copy
manually themselves. If you can't tell from userspace if a file has
data in it without actually calling read(), then you can't use
copy_file_range() on it...
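
i.e. the library has to carry something like this fallback pattern
anyway (sketch, error handling trimmed):

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/*
 * Prefer copy_file_range(), but fall back to plain read()/write()
 * when it fails or immediately reports EOF on a zero-length "file".
 */
static int copy_fd(int fd_in, int fd_out)
{
	char buf[65536];
	ssize_t n;

	for (;;) {
		n = copy_file_range(fd_in, NULL, fd_out, NULL, sizeof(buf), 0);
		if (n > 0)
			continue;
		if (n == 0 || errno == EXDEV || errno == EINVAL ||
		    errno == ENOSYS || errno == EOPNOTSUPP)
			break;			/* fall back */
		return -1;
	}

	while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
		ssize_t done = 0;

		while (done < n) {
			ssize_t w = write(fd_out, buf + done, n - done);

			if (w < 0)
				return -1;
			done += w;
		}
	}
	return n < 0 ? -1 : 0;
}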

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Dave Chinner
On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > On Fri, Feb 12, 2021 at 12:38 AM Greg KH  wrote:
> > >
> > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > files in the first place?  They can not seek (well most can not), so
> > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > problem that userspace should not do.
> > 
> > This may have been covered elsewhere, but it's not that people are
> > saying "let's use copy_file_range on files in /proc."  It's that the
> > Go language standard library provides an interface to operating system
> > files.  When Go code uses the standard library function io.Copy to
> > copy the contents of one open file to another open file, then on Linux
> > kernels 5.3 and greater the Go standard library will use the
> > copy_file_range system call.  That seems to be exactly what
> > copy_file_range is intended for.  Unfortunately it appears that when
> > people writing Go code open a file in /proc and use io.Copy the
> > contents to another open file, copy_file_range does nothing and
> > reports success.  There isn't anything on the copy_file_range man page
> > explaining this limitation, and there isn't any documented way to know
> > that the Go standard library should not use copy_file_range on certain
> > files.
> 
> But, is this a bug in the kernel in that the syscall being made is not
> working properly, or a bug in that Go decided to do this for all types
> of files not knowing that some types of files can not handle this?
> 
> If the kernel has always worked this way, I would say that Go is doing
> the wrong thing here.  If the kernel used to work properly, and then
> changed, then it's a regression on the kernel side.
> 
> So which is it?

Both Al Viro and myself have said "copy file range is not a generic
method for copying data between two file descriptors". It is a
targeted solution for *regular files only* on filesystems that store
persistent data and can accelerate the data copy in some way (e.g.
clone, server side offload, hardware offload, etc). It is not
intended as a copy mechanism for copying data from one random file
descriptor to another.

The use of it as a general file copy mechanism in the Go system
library is incorrect and wrong. It is a userspace bug.  Userspace
has done the wrong thing, userspace needs to be fixed.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

2021-02-08 Thread Dave Chinner
On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > There was a v2 inline that incorporated some directed feedback.
> > > Otherwise there were questions and ideas about making the whole thing
> > > faster, but I've no idea if that addresses the problem or not (if so,
> > > that would be an entirely different set of patches). I'll wait and see
> > > what Darrick thinks about this and rebase/repost if the approach is
> > > agreeable..
> > 
> > There is always the school of thought that says that the best way to
> > get people to focus on this is to rebase and repost.  Otherwise, they
> > are all too likely to assume that you lost interest in this.
> 
> I was hoping that a better solution would emerge for clearing
> PageWriteback on hundreds of thousands of pages, but nothing easy popped
> out.
> 
> The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> to completion workqueue" gives me unease because who's to say if marking
> 262,144 pages on a particular CPU will actually stall it long enough to
> trip the hangcheck?  Is the number lower on (say) some pokey NAS box
> with a lot of storage but a slow CPU?

It's also not the right thing to do given the IO completion
workqueue is a bound workqueue. Anything that is doing large amounts
of CPU intensive work should be on an unbound workqueue so that the
scheduler can bounce it around different CPUs as needed.
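
i.e. at minimum something like this - the queue name here is made up,
not the real XFS workqueue:

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *heavy_ioend_wq;

static int heavy_ioend_wq_init(void)
{
	/* unbound: let the scheduler spread completion work across CPUs */
	heavy_ioend_wq = alloc_workqueue("heavy-ioend",
					 WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	return heavy_ioend_wq ? 0 : -ENOMEM;
}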

Quite frankly, the problem is a huge long ioend chain being built by
the submission code. We need to keep ioend completion overhead down.
It runs in either softirq or bound workqueue context and so
individual items of work that are performed in this context must not
be -unbounded- in size or time. Unbounded ioend chains are bad for
IO latency, they are bad for memory reclaim and they are bad for CPU
scheduling.

As I've said previously, we gain nothing by aggregating ioends past
a few tens of megabytes of submitted IO. The batching gains are
completely diminished once we've got enough IO in flight to keep the
submission queue full. We're talking here about gigabytes of
sequential IOs in a single ioend chain which are 2-3 orders of
magnitude larger than needed for optimal background IO submission
and completion efficiency and throughput. IOWs, we really should be
limiting the ioend chain length at submission time, not trying to
patch over bad completion behaviour that results from sub-optimal IO
submission behaviour...
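
For illustration, the submission-side cap being argued for looks
something like this - the names are made up, not existing XFS/iomap
code:

#include <linux/types.h>

#define IOEND_MAX_BYTES	(16 * 1024 * 1024)	/* a few tens of MB */

struct toy_ioend {
	loff_t	io_offset;	/* file offset the chain starts at */
	size_t	io_size;	/* bytes queued so far */
};

static bool toy_ioend_can_extend(const struct toy_ioend *ioend,
				 loff_t pos, size_t len)
{
	/* must be contiguous with what has already been built... */
	if (pos != ioend->io_offset + ioend->io_size)
		return false;
	/* ...and must not grow the completion work item without bound */
	if (ioend->io_size + len > IOEND_MAX_BYTES)
		return false;
	return true;
}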

> That said, /some/ threshold is probably better than no threshold.  Could
> someone try to confirm if that series of Brian's fixes this problem too?

262144 pages is still too much work to be doing in a single softirq
IO completion callback. It's likely to be too much work for a bound
workqueue, too, especially when you consider that the workqueue
completion code will merge sequential ioends into one ioend, hence
making the IO completion loop counts bigger and latency problems worse
rather than better...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/18] new API for FS_IOC_[GS]ETFLAGS/FS_IOC_FS[GS]ETXATTR

2021-02-07 Thread Dave Chinner
On Wed, Feb 03, 2021 at 04:03:06PM +0100, Miklos Szeredi wrote:
> On Wed, Feb 3, 2021 at 3:56 PM Matthew Wilcox  wrote:
> 
> > But let's talk specifics.  What does CIFS need to contact the server for?
> > Could it be cached earlier?
> 
> I don't understand what CIFS is doing, and I don't really care.   This
> is the sort of operation where adding a couple of network roundtrips
> so that the client can obtain the credentials required to perform the
> operation doesn't really matter.  We won't have thousands of chattr(1)
> calls per second.

Incorrect.

The xfs utilities can do recursive directory traversal to change
things like the project ID across an entire directory tree. Or to
change extent size hints.

We also have 'xfs_io -c "lsattr -R" ...' and 'lsprog -R' which will
do a recursive descent to list the requested attributes of all
directories and files in the tree...

So, yeah, we do indeed do thousands of these fsxattr based
operations a second, sometimes tens of thousands a second or more,
and sometimes they are issued in bulk in performance critical paths
for container build/deployment operations

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

2021-01-26 Thread Dave Chinner
On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> copy_file_range (which calls generic_copy_file_checks) uses the
> inode file size to adjust the copy count parameter. This breaks
> with special filesystems like procfs/sysfs, where the file size
> appears to be zero, but content is actually returned when a read
> operation is performed.
> 
> This commit ignores the source file size, and makes copy_file_range
> match the end of file behaviour documented in POSIX's "read",
> where 0 is returned to mark EOF. This would allow "cp" and other
> standard tools to make use of copy_file_range with the exact same
> behaviour as they had in the past.
> 
> Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> Signed-off-by: Nicolas Boichat 

Nack.

As I've explained, this is intentional and bypassing it is not a
work around for enabling cfr on filesystems that produce ephemeral,
volatile read-once data using seq-file pipes that masquerade as
regular files with zero size. These files are behaving like pipes
and only work because the VFS has to support read() and friends from
pipes that don't publish the amount of data they contain to the VFS
inode.

copy_file_range() does not support such behaviour.

copy_file_range() -writes- data, so we have to check that those
writes do not extend past boundaries that the destination inode
imposes on the operation. e.g. maximum offset limits, whether the
ranges overlap in the same file, etc.

Hence we need to know how much data there is present to copy before
we can check if it is safe to perform the -write- of the data we are
going to read. Hence we cannot safely support data sources that
cannot tell us how much data is present before we start the copy
operation.

IOWs, these source file EOF restrictions are required by the write
side of copy_file_range(), not the read side.

> ---
> This can be reproduced with this simple test case:
>  #define _GNU_SOURCE
>  #include <fcntl.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <sys/stat.h>
>  #include <unistd.h>
> 
>  int
>  main(int argc, char **argv)
>  {
>int fd_in, fd_out;
>loff_t ret;
> 
>fd_in = open("/proc/version", O_RDONLY);
>fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
> 
>do {
>  ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
>  printf("%d bytes copied\n", (int)ret);
>} while (ret > 0);
> 
>return 0;
>  }
> 
> Without this patch, `version` output file is empty, and no bytes
> are copied:
> 0 bytes copied

$ ls -l /proc/version
-r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
$

It's a zero length file.

sysfs does this just fine - its regular files have a size of
at least PAGE_SIZE rather than zero, and so copy_file_range works
just fine on them:

$ ls -l /sys/block/nvme0n1/capability
-r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
$ cat /sys/block/nvme0n1/capability
50
$ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" 
/tmp/foo
$ sudo cat /tmp/foo
50

And the behaviour is exactly as you'd expect a read() loop to copy
the file to behave:

openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3

openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
copy_file_range(4, [0], 3, [0], 4096, 0) = 3
copy_file_range(4, [3], 3, [3], 4093, 0) = 0
close(4)

See? Inode size of 4096 means there's a maximum of 4kB of data that
can be read from this file.  copy_file_range() now behaves exactly
as read() would, returning a short copy and then 0 bytes to indicate
EOF.

If you want ephemeral data pipes masquerading as regular files to
work with copy_file_range, then the filesystem implementation needs
to provide the VFS with a data size that indicates the maximum
amount of data that the pipe can produce in a continuous read loop.
Otherwise we cannot validate the range of the write we may be asked
to perform...
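i.e. something as simple as this in the pseudo filesystem's inode
setup would be enough for the VFS side checks to work (sketch only,
not a patch to procfs/tracefs):

static struct inode *example_new_data_inode(struct super_block *sb)
{
        struct inode *inode = new_inode(sb);

        if (!inode)
                return NULL;
        inode->i_mode = S_IFREG | 0444;
        /*
         * Advertise the maximum amount of data a read loop can return,
         * the same way sysfs regular files report PAGE_SIZE.
         */
        i_size_write(inode, PAGE_SIZE);
        return inode;
}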

> Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> copy operation. However, that fails to copy any content when the input
> file is from sysfs/tracefs, with an apparent size of 0 (but there is
> still content when you `cat` it, of course).

Libraries using copy_file_range() must be prepared for it to fail
and fall back to normal copy mechanisms. Of course, with these
special zero length files that contain ephemeral data, userspace can't
actually tell via stat() that they contain any data. So
as far as userspace is concerned, copy_file_range() correctly
returned zero bytes copied from a zero byte long file and there's
nothing more to do.
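i.e. the library copy loop needs to look something like this rough
userspace sketch (error handling trimmed, chunk size arbitrary):

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Try copy_file_range(); fall back to read/write if it makes no progress. */
static int copy_fd(int fd_in, int fd_out)
{
        char    buf[65536];
        ssize_t n;

        for (;;) {
                n = copy_file_range(fd_in, NULL, fd_out, NULL, 1024 * 1024, 0);
                if (n > 0)
                        continue;
                if (n == 0 || errno == ENOSYS || errno == EXDEV ||
                    errno == EOPNOTSUPP || errno == EINVAL)
                        break;  /* EOF as far as cfr is concerned, or unsupported */
                return -1;
        }

        /* The read/write loop picks up from the current file offsets. */
        while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
                if (write(fd_out, buf, n) != n)
                        return -1;
        }
        return n < 0 ? -1 : 0;
}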

This zero length file behaviour is, fundamentally, a kernel
filesystem implementation bug, not a copy_file_range() bug.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [BUG] copy_file_range with sysfs file as input

2021-01-26 Thread Dave Chinner
On Tue, Jan 26, 2021 at 11:50:50AM +0800, Nicolas Boichat wrote:
> On Tue, Jan 26, 2021 at 9:34 AM Dave Chinner  wrote:
> >
> > On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote:
> > > Hi copy_file_range experts,
> > >
> > > We hit this interesting issue when upgrading Go compiler from 1.13 to
> > > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of
> > > `/sys/kernel/debug/tracing/trace` to a temporary file.
> > >
> > > Under the hood, Go now uses `copy_file_range` syscall to optimize the
> > > copy operation. However, that fails to copy any content when the input
> > > file is from sysfs/tracefs, with an apparent size of 0 (but there is
> > > still content when you `cat` it, of course).
> > >
> > > A repro case is available in comment7 (adapted from the man page),
> > > also copied below [2].
> > >
> > > Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and
> > > 5.10.3 (chromeos))
> > > $ ./copyfrom /sys/kernel/debug/tracing/trace x
> > > 0 bytes copied
> >
> > That's basically telling you that copy_file_range() was unable to
> > copy anything. The man page says:
> >
> > RETURN VALUE
> >Upon  successful  completion,  copy_file_range() will return
> >the number of bytes copied between files.  This could be less
> >than the length originally requested.  If the file offset
> >of fd_in is at or past the end of file, no bytes are copied,
> >and copy_file_range() returns zero.
> >
> > The man page explains it perfectly.
> 
> I'm not that confident the explanation is perfect ,-)
> 
> How does one define "EOF"? The read manpage
> (https://man7.org/linux/man-pages/man2/read.2.html) defines it as a
> zero return value.

And so does copy_file_range(). That's the -API definition-, it does
not define the kernel implementation of how to decide when the file
is at EOF.

> I don't think using the inode file size is
> standard.

It is the standard VFS filesystem definition of EOF.

Indeed:

copy_file_range()
  vfs_copy_file_range()
    generic_copy_file_checks()
      .....

	/* Shorten the copy to EOF */
	size_in = i_size_read(inode_in);
	if (pos_in >= size_in)
		count = 0;
	else
		count = min(count, size_in - (uint64_t)pos_in);

That inode size check for EOF is -exactly- what is triggering here,
and a copy of zero length returns 0 bytes having done nothing.

The page cache read path does similar things in
generic_file_buffered_read() to avoid truncate races exposing
stale/bad data to userspace:


/*
 * i_size must be checked after we know the pages are Uptodate.
 *
 * Checking i_size after the check allows us to calculate
 * the correct value for "nr", which means the zero-filled
 * part of the page is not copied back to userspace (unless
 * another truncate extends the file - this is desired though).
 */
isize = i_size_read(inode);
if (unlikely(iocb->ki_pos >= isize))
	goto put_pages;

> > 'cat' "works" in this situation because it doesn't check the file
> > size and just attempts to read unconditionally from the file. Hence
> > it happily returns non-existent stale data from busted filesystem
> > implementations that allow data to be read from beyond EOF...
> 
> `cp` also works, so does `dd` and basically any other file operation.

They do not use a syscall interface that can offload work to
filesystems, low level block layer software, hardware and/or remote
systems. copy_file_range() is restricted to regular files and does
complex stuff that read() and friends will never do, so we have
strictly enforced rules to prevent people from playing fast and
loose and silently corrupting user data with it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [BUG] copy_file_range with sysfs file as input

2021-01-26 Thread Dave Chinner
On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote:
> Hi copy_file_range experts,
> 
> We hit this interesting issue when upgrading Go compiler from 1.13 to
> 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of
> `/sys/kernel/debug/tracing/trace` to a temporary file.
> 
> Under the hood, Go now uses `copy_file_range` syscall to optimize the
> copy operation. However, that fails to copy any content when the input
> file is from sysfs/tracefs, with an apparent size of 0 (but there is
> still content when you `cat` it, of course).
> 
> A repro case is available in comment7 (adapted from the man page),
> also copied below [2].
> 
> Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and
> 5.10.3 (chromeos))
> $ ./copyfrom /sys/kernel/debug/tracing/trace x
> 0 bytes copied

That's basically telling you that copy_file_range() was unable to
copy anything. The man page says:

RETURN VALUE
   Upon  successful  completion,  copy_file_range() will return
   the number of bytes copied between files.  This could be less
   than the length originally requested.  If the file offset
   of fd_in is at or past the end of file, no bytes are copied,
   and copy_file_range() returns zero.

The man page explains it perfectly. Look at the trace file you are
trying to copy:

$ ls -l /sys/kernel/debug/tracing/trace
-rw-r--r-- 1 root root 0 Jan 19 12:17 /sys/kernel/debug/tracing/trace
$ cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
#              | |       |   ||||      |         |

Yup, the sysfs file reports it's size as zero length, so the CFR
syscall is saying "there's nothing to copy from this empty file" and
so correctly is returning zero without even trying to copy anything
because the file offset is at EOF...

IOWs, there's no copy_file_range() bug here - it's behaving as
documented.

'cat' "works" in this situation because it doesn't check the file
size and just attempts to read unconditionally from the file. Hence
it happily returns non-existent stale data from busted filesystem
implementations that allow data to be read from beyond EOF...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Expense of read_iter

2021-01-19 Thread Dave Chinner
On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> On Thu, 14 Jan 2021, Mikulas wrote:
> For Ext4-dax, the overhead of dax_iomap_rw is significant
> compared to the overhead of struct iov_iter. Although methods
> proposed by Mikulas can eliminate the overhead of iov_iter
> well, they can not be applied in Ext4-dax unless we implement an
> internal "read" method in Ext4-dax.
> 
> For Ext4-dax, there could be two approaches to optimizing:
> 1) implementing the internal "read" method without the complexity
> of iterators and dax_iomap_rw;

Please do not go an re-invent the wheel just for ext4. If there's a
problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
as well, so any improvements to that path benefit all DAX users, not
just ext4.

> 2) optimizing how dax_iomap_rw works.
> Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> the iomap structure and others (e.g., journaling status locks in Ext4),
> we think implementing the internal "read" method would be easier.

Maybe it is, but it's also very selfish. The DAX iomap path was
written to be correct for all users, not necessarily to provide
optimal performance. There will be lots of things that could be done
to optimise it, so rather than creating a special snowflake in ext4
that makes DAX in ext4 much harder to maintain for non-ext4 DAX
developers, please work to improve the common DAX IO path and so
provide the same benefit to all the filesystems that use it.
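FWIW, the shared path really is tiny - a filesystem's DAX ->read_iter()
is basically a wrapper around dax_iomap_rw() with the filesystem's own
iomap ops, something like this sketch (example_iomap_ops stands in for
whatever ops the filesystem provides):

static ssize_t example_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct inode    *inode = file_inode(iocb->ki_filp);
        ssize_t         ret;

        if (!iov_iter_count(to))
                return 0;

        inode_lock_shared(inode);
        /* all the heavy lifting is in the common DAX/iomap code */
        ret = dax_iomap_rw(iocb, to, &example_iomap_ops);
        inode_unlock_shared(inode);

        file_accessed(iocb->ki_filp);
        return ret;
}

So any cycles shaved off dax_iomap_rw() and the iomap plumbing show up
for every filesystem that uses it.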

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-13 Thread Dave Chinner
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > e.g. we run the first transaction into the CIL, it steals the space
> > needed for the cil checkpoint headers for the transaction. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
>
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?

In theory, yes. But like I said, I've made a bunch of assumptions
that this won't happen, and so without actually auditting the code
I'm not actually sure that it won't. i.e. I need to go check what
happens with items being marked stale, how shadow buffer
reallocation interacts with items that end up being smaller than the
existing buffer, etc. I've paged a lot of this detail out of my
brain, so until I spend the time to go over it all again I'm not
going to be able to answer definitively.

> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).

That's my nagging doubt about all this - that there's an interaction
of this nature that I haven't considered due to the assumptions I've
made that allows underflow to occur. That would be much worse than
the current situation of hanging on a missing wakeup when the CIL is
full and used_space goes backwards
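For anyone following along, the per-item accounting being discussed
boils down to something like this - heavily simplified from the CIL
item formatting path, not a verbatim copy of it:

	/* relogged item: subtract what this item consumed last time */
	if (lip->li_lv) {
		len -= lip->li_lv->lv_bytes;
		diff_iovecs -= lip->li_lv->lv_niovecs;
	}

	/* ... reformat the item into the (possibly reused) lv ... */

	/* add back what it consumes now - this can come out smaller */
	len += lv->lv_bytes;
	diff_iovecs += lv->lv_niovecs;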

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-13 Thread Dave Chinner
On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > > If the value goes below the limit while some threads are
> > > > > > already waiting but before the push worker gets to it, these 
> > > > > > threads are
> > > > > > not woken.
> > > > > > 
> > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > > spinlock, which prevents additions to the waitqueue.
> > > > > > 
> > > > > > Signed-off-by: Donald Buczek 
> > > > > > ---
> > > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > > > /*
> > > > > >  * Wake up any background push waiters now this context is 
> > > > > > being pushed.
> > > > > >  */
> > > > > > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > > +   if (waitqueue_active(&cil->xc_push_wait))
> > > > > > wake_up_all(&cil->xc_push_wait);
> > > > > 
> > > > > That just smells wrong to me. It *might* be correct, but this
> > > > > condition should pair with the sleep condition, as space used by a
> > > > > CIL context should never actually decrease
> > > > > 
> > > > 
> > > > ... but I'm a little confused by this assertion. The shadow buffer
> > > > allocation code refers to the possibility of shadow buffers falling out
> > > > that are smaller than currently allocated buffers. Further, the
> > > > _insert_format_items() code appears to explicitly optimize for this
> > > > possibility by reusing the active buffer, subtracting the old size/count
> > > > values from the diff variables and then reformatting the latest
> > > > (presumably smaller) item to the lv.
> > > 
> > > Individual items might shrink, but the overall transaction should
> > > grow. Think of an extent to btree conversion of an inode fork. The
> > > data in the inode fork decreases from a list of extents to a btree
> > > root block pointer, so the inode item shrinks. But then we add a new
> > > btree root block that contains all the extents + the btree block
> > > header, and it gets rounded up to the 128 byte buffer logging chunk
> > > size.
> > > 
> > > IOWs, while the inode item has decreased in size, the overall
> > > space consumed by the transaction has gone up and so the CIL ctx
> > > used_space should increase. Hence we can't just look at individual
> > > log items and whether they have decreased in size - we have to look
> > > at all the items in the transaction to understand how the space used
> > > in that transaction has changed. i.e. it's the aggregation of all
> > > items in the transaction that matter here, not so much the
> > > individual items.
> > > 
> > 
> > Ok, that makes more sense...
> > 
> > > > Of course this could just be implementation detail. I haven't dug into
> > > > the details in the remainder of this thread and I don't have specific
> > > > examples off the top of my head, but perhaps based on the ability of
> > > > various structures to change formats and the ability of log vectors to
> > > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > > shrink in size as well? Just from poking around the CIL it seems like
> > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > > for recalculating split res as well)...
> > > 
> > > Yes, there may be situ

Re: [PATCH] mm: vmscan: support complete shrinker reclaim

2021-01-10 Thread Dave Chinner
On Wed, Jan 06, 2021 at 03:56:02PM -0800, Andrew Morton wrote:
> (cc's added)
> 
> On Tue,  5 Jan 2021 16:43:38 -0800 Sudarshan Rajagopalan 
>  wrote:
> 
> > Ensure that shrinkers are given the option to completely drop
> > their caches even when their caches are smaller than the batch size.
> > This change helps improve memory headroom by ensuring that under
> > significant memory pressure shrinkers can drop all of their caches.
> > This change only attempts to more aggressively call the shrinkers
> > during background memory reclaim, in order to avoid hurting the
> > performance of direct memory reclaim.
> > 

Why isn't the residual scan count accrual (nr_deferred)
triggering the total_scan > freeable condition that is supposed to
allow shrinkers to completely empty under ongoing memory pressure
events?

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -424,6 +424,10 @@ static unsigned long do_shrink_slab(struct 
> > shrink_control *shrinkctl,
> > long batch_size = shrinker->batch ? shrinker->batch
> >   : SHRINK_BATCH;
> > long scanned = 0, next_deferred;
> > +   long min_cache_size = batch_size;
> > +
> > +   if (current_is_kswapd())
> > +   min_cache_size = 0;
> >  
> > if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > nid = 0;
> > @@ -503,7 +507,7 @@ static unsigned long do_shrink_slab(struct 
> > shrink_control *shrinkctl,
> >  * scanning at high prio and therefore should try to reclaim as much as
> >  * possible.
> >  */
> > -   while (total_scan >= batch_size ||
> > +   while (total_scan > min_cache_size ||
> >total_scan >= freeable) {
> > unsigned long ret;
> > unsigned long nr_to_scan = min(batch_size, total_scan);
> 
> I don't really see the need to exclude direct reclaim from this fix.
> 
> And if we're leaving unscanned objects behind in this situation, the
> current code simply isn't working as intended, and 0b1fb40a3b1 ("mm:
> vmscan: shrink all slab objects if tight on memory") either failed to
> achieve its objective or was later broken?

This looks to me like just another symptom of the fact that
nr_deferred needs to be tracked per-memcg. i.e. the deferred work
because total_scan < batch_size is not being aggregated against that
specific memcg and hence the accrual of deferred work over multiple
calls is not occurring correctly. Therefore we never meet the
conditions (total_scan > freeable) where the memcg shrinker can
drain the last few freeable entries in the cache.

i.e. see this patchset which makes the deferral of work be
accounted per-memcg:

https://lore.kernel.org/lkml/20210105225817.1036378-1-shy828...@gmail.com/

and that should also allow accrual of the work skipped on each memcg
be accounted across multiple calls to the shrinkers for the same
memcg. Hence as memory pressure within the memcg goes up, the
repeated calls to direct reclaim within that memcg will result in
all of the freeable items in each cache eventually being freed...
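Conceptually, what needs to happen is something like this, with the
deferred count keyed by (shrinker, memcg, nid) - the helper names
below are invented for illustration, the real implementation is in the
patchset linked above:

	/* look up work deferred for this (shrinker, memcg, nid) tuple */
	nr_deferred = get_nr_deferred(shrinker, sc->memcg, sc->nid);
	total_scan = nr_deferred + delta;

	/* total_scan >= freeable is what lets the cache drain completely */
	while (total_scan >= batch_size || total_scan >= freeable) {
		/* ... scan and free a batch ... */
	}

	/* remember whatever we didn't get to against the same tuple */
	set_nr_deferred(shrinker, sc->memcg, sc->nid, total_scan - scanned);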

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] fs: block_dev: compute nr_vecs hint for improving writeback bvecs allocation

2021-01-08 Thread Dave Chinner
On Fri, Jan 08, 2021 at 03:59:22PM +0800, Ming Lei wrote:
> On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> > > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > > > At least for iomap I think this is the wrong approach.  Between the
> > > > iomap and writeback_control we know the maximum size of the writeback
> > > > request and can just use that.
> > > 
> > > I think writeback_control can tell us nothing about max pages in single
> > > bio:
> > 
> > By definition, the iomap tells us exactly how big the IO is going to
> > be. i.e. an iomap spans a single contiguous range that we are going
> > to issue IO on. Hence we can use that to size the bio exactly
> > right for direct IO.
> 
> When I trace wpc->iomap.length in iomap_add_to_ioend() on the following fio
> randwrite/write, the length is 1GB most of times, maybe because it is
> one fresh XFS.

Yes, that's exactly what I said it would do.

> Another reason is that pages in the range may be contiguous physically,
> so lots of pages may share one single bvec.

The iomap layer does not care about this, and there's no way this
can be detected ahead of time, anyway, because we are only passed a
single page at a time. When we get large pages from the page cache,
we'll still only get one page at a time, but we'll get physically
contiguous pages and so it will still be a 1 page : 1 bvec
relationship at the iomap layer.

> > > - wbc->nr_to_write controls how many pages to writeback, this pages
> > >   usually don't belong to same bio. Also this number is often much
> > >   bigger than BIO_MAX_PAGES.
> > > 
> > > - wbc->range_start/range_end is similar too, which is often much more
> > >   bigger than BIO_MAX_PAGES.
> > > 
> > > Also page/blocks_in_page can be mapped to different extent too, which is
> > > only available when wpc->ops->map_blocks() is returned,
> > 
> > We only allocate the bio -after- calling ->map_blocks() to obtain
> > the iomap for the given writeback range request. Hence we
> > already know how large the BIO could be before we allocate it.
> > 
> > > which looks not
> > > different with mpage_writepages(), in which bio is allocated with
> > > BIO_MAX_PAGES vecs too.
> > 
> > __mpage_writepage() only maps a page at a time, so it can't tell
> > ahead of time how big the bio is going to need to be as it doesn't
> > return/cache a contiguous extent range. So it's actually very
> > different to the iomap writeback code, and effectively does require
> > a BIO_MAX_PAGES vecs allocation all the time...
> > 
> > > Or you mean we can use iomap->length for this purpose? But iomap->length
> > > is still too big in case of xfs.
> > 
> > if we are doing small random writeback into large extents (i.e.
> > iomap->length is large), then it is trivial to detect that we are
> > doing random writes rather than sequential writes by checking if the
> > current page is sequential to the last sector in the current bio.
> > We already do this non-sequential IO checking to determine if a new
> > bio needs to be allocated in iomap_can_add_to_ioend(), and we also
> > know how large the current contiguous range mapped into the current
> > bio chain is (ioend->io_size). Hence we've got everything we need to
> > determine whether we should do a large or small bio vec allocation
> > in the iomap writeback path...
> 
> page->index should tell us if the workload is random or sequential, however
> still not easy to decide how many pages there will be in the next bio
> when iomap->length is large.

page->index doesn't tell us anything about what type of IO is being
done - it just tells us where in the file we need to map to find the
physical block we need to write it to. OTOH, the iomap writeback
context contains all the information about current IO being build -
offset, size, current bio, etc - and the page->index gets compared
against the state in the iomap writepage context.

So, if the wpc->iomap.length is large, current page->index does not
map sequentially to the end of wpc->ioend->io_bio (or
wpc->ioend->io_offset + wpc->ioend->io_size) and
wpc->ioend->io_size == page_size(page) for the currently held bio,
then we are clearly doing random single page writeback into a large
allocated extent. Hence in that case we can do small bvec
allocations for the new bio.

Sure, the first bio in a ->writepages invocation doesn't have this
information, so we're going to have to assume BIO_MAX_PAGES for the
first bio. But for every bio after that in the ->writepages
invocation we have the state of the previous contiguous writeback
range held in the wpc structure and can use that info to optimise
the thousands of random single pages that are written after the
first one...
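i.e. something like this in the writeback bio allocation path (rough
sketch, not a tested patch):

	unsigned int nr_vecs = BIO_MAX_PAGES;

	/*
	 * If the previous ioend in this writepages invocation was a
	 * single page and this page isn't sequential to it, assume
	 * random single page writeback and allocate a minimal bio.
	 */
	if (wpc->ioend && wpc->ioend->io_size == page_size(page) &&
	    offset != wpc->ioend->io_offset + wpc->ioend->io_size)
		nr_vecs = 1;

	bio = bio_alloc(GFP_NOFS, nr_vecs);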

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-07 Thread Dave Chinner
On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
> On 02.01.21 23:44, Dave Chinner wrote:
> > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> > > On 31.12.20 22:59, Dave Chinner wrote:
> > > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > > > On 30.12.20 23:16, Dave Chinner wrote:
> > 
> > > > One could argue that, but one should also understand what the
> > > > design constraints for a particular algorithm are before suggesting
> > > > that their solution is "robust". :)
> > > 
> > > Yes, but an understanding to the extend required by the
> > > argument should be sufficient :-)
> > 
> > And that's the fundamental conceit described by Dunning-Kruger.
> > 
> > I.e. someone thinks they know enough to understand the argument,
> > when in fact they don't understand enough about the subject matter
> > to realise they don't understand what the expert at the other end is
> > saying at all
> > 
> > > > > # seq 29
> > > > > 
> > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX 
> > > > > trigger cil e374c6f1 ctx 4967d650  
> > > > > ctx->space_used=33554656  , push_seq=29, ctx->sequence=29
> > > > 
> > > > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > > > 
> > > > .
> > > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX 
> > > > > wakecil e374c6f1 ctx 4967d650  
> > > > > ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > > > 
> > > > It takes the best part of *50 seconds* before the push work actually
> > > > runs?
> > > > 
> > > > That's  well and truly screwed up - the work should run on that
> > > > CPU on the very next time it yields the CPU. If we're holding the
> > > > CPU without yielding it for that long, hangcheck and RCU warnings
> > > > should be going off...
> > > 
> > > No such warnings.
> > > 
> > > But the load is probably I/O bound to the log:
> > > 
> > > - creates `cp -a` copies of a directory tree with small files (linux 
> > > source repository)
> > > - source tree probably completely cached.
> > > - two copies in parallel
> > > - slow log (on software raid6)
> > > 
> > > Isn't it to be expected that sooner or later you need to wait for
> > > log writes when you write as fast as possible with lots of
> > > metadata updates and not so much data?
> > 
> > No, incoming transactions wait for transaction reservation space,
> > not log writes. Reservation space is freed up by metadata writeback.
> > So if you have new transactions blocking in xfs_trans_reserve(),
> > then we are blocking on metadata writeback.
> > 
> > The CIL worker thread may be waiting for log IO completion before it
> > issues more log IO, and in that case it is waiting on iclog buffer
> > space (i.e. only waiting on internal log state, not metadata
> > writeback).  Can you find that kworker thread stack and paste it? If
> > bound on log IO, it will be blocked in xlog_get_iclog_space().
> > 
> > Please paste the entire stack output, too, not just the bits you
> > think are relevant or understand
> 
> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
> And you mean before the lockup, when the I/O is still active, correct?
> 
> This is the usual stack output from that thread:
> 
> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
> # cat /proc/2080/task/2080/stack
> 
> [<0>] md_flush_request+0x87/0x190
> [<0>] raid5_make_request+0x61b/0xb30
> [<0>] md_handle_request+0x127/0x1a0
> [<0>] md_submit_bio+0xbd/0x100
> [<0>] submit_bio_noacct+0x151/0x410
> [<0>] submit_bio+0x4b/0x1a0
> [<0>] xlog_state_release_iclog+0x87/0xb0
> [<0>] xlog_write+0x452/0x6d0
> [<0>] xlog_cil_push_work+0x2e0/0x4d0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x23f/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> sampled three times with a few seconds in between, stack identical.

Yeah, so that is MD blocking waiting for a running device cache
flush to finish before submitting the iclog IO. IOWs, it's waiting
for hardware to flush volatile caches to stable storage. Every log
IO has a cache flush pre

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-07 Thread Dave Chinner
On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > If the value goes below the limit while some threads are
> > > already waiting but before the push worker gets to it, these threads are
> > > not woken.
> > > 
> > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > optimization. This is possible, because we hold the xc_push_lock
> > > spinlock, which prevents additions to the waitqueue.
> > > 
> > > Signed-off-by: Donald Buczek 
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b0ef071b3cb5..d620de8e217c 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > >   /*
> > >* Wake up any background push waiters now this context is being pushed.
> > >*/
> > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > + if (waitqueue_active(&cil->xc_push_wait))
> > >   wake_up_all(&cil->xc_push_wait);
> > 
> > That just smells wrong to me. It *might* be correct, but this
> > condition should pair with the sleep condition, as space used by a
> > CIL context should never actually decrease
> > 
> 
> ... but I'm a little confused by this assertion. The shadow buffer
> allocation code refers to the possibility of shadow buffers falling out
> that are smaller than currently allocated buffers. Further, the
> _insert_format_items() code appears to explicitly optimize for this
> possibility by reusing the active buffer, subtracting the old size/count
> values from the diff variables and then reformatting the latest
> (presumably smaller) item to the lv.

Individual items might shrink, but the overall transaction should
grow. Think of an extent to btree conversion of an inode fork. The
data in the inode fork decreases from a list of extents to a btree
root block pointer, so the inode item shrinks. But then we add a new
btree root block that contains all the extents + the btree block
header, and it gets rounded up to the 128 byte buffer logging chunk
size.

IOWs, while the inode item has decreased in size, the overall
space consumed by the transaction has gone up and so the CIL ctx
used_space should increase. Hence we can't just look at individual
log items and whether they have decreased in size - we have to look
at all the items in the transaction to understand how the space used
in that transaction has changed. i.e. it's the aggregation of all
items in the transaction that matter here, not so much the
individual items.
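To put some made up numbers on it (purely illustrative, not real item
sizes):

	inode item:	extent list (4 x 16 bytes) replaced by a btree
			root pointer            -> item delta roughly -48 bytes
	new buffer:	btree block holding those extents plus the block
			header, logged in 128 byte chunks -> delta +256 bytes

	net transaction delta: positive, so ctx->space_used still grows
	even though the inode item itself shrank.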

> Of course this could just be implementation detail. I haven't dug into
> the details in the remainder of this thread and I don't have specific
> examples off the top of my head, but perhaps based on the ability of
> various structures to change formats and the ability of log vectors to
> shrink in size, shouldn't we expect the possibility of a CIL context to
> shrink in size as well? Just from poking around the CIL it seems like
> the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> for recalculating split res as well)...

Yes, there may be situations where it decreases. It may be this is
fine, but the assumption *I've made* in lots of the CIL push code is
that ctx->used_space rarely, if ever, will go backwards.

e.g. we run the first transaction into the CIL, it steals the space
needed for the cil checkpoint headers for the transaction. Then if
the space returned by the item formatting is negative (because it is
in the AIL and being relogged), the CIL checkpoint now doesn't have
the space reserved it needs to run a checkpoint. That transaction is
a sync transaction, so it forces the log, and now we push the CIL
without sufficient reservation to write out the log headers and the
items we just formatted

So, yeah, shrinking transaction space usage definitely violates some
of the assumptions the code makes about how relogging works. It's
entirely possible the assumptions I've made are not entirely correct
in some corner cases - those particular cases are what we need to
ferret out here, and then decide if they are correct or not and deal
with it from there...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] fs: block_dev: compute nr_vecs hint for improving writeback bvecs allocation

2021-01-06 Thread Dave Chinner
On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > At least for iomap I think this is the wrong approach.  Between the
> > iomap and writeback_control we know the maximum size of the writeback
> > request and can just use that.
> 
> I think writeback_control can tell us nothing about max pages in single
> bio:

By definition, the iomap tells us exactly how big the IO is going to
be. i.e. an iomap spans a single contiguous range that we are going
to issue IO on. Hence we can use that to size the bio exactly
right for direct IO.

> - wbc->nr_to_write controls how many pages to writeback, this pages
>   usually don't belong to same bio. Also this number is often much
>   bigger than BIO_MAX_PAGES.
> 
> - wbc->range_start/range_end is similar too, which is often much more
>   bigger than BIO_MAX_PAGES.
> 
> Also page/blocks_in_page can be mapped to different extent too, which is
> only available when wpc->ops->map_blocks() is returned,

We only allocate the bio -after- calling ->map_blocks() to obtain
the iomap for the given writeback range request. Hence we
already know how large the BIO could be before we allocate it.

> which looks not
> different with mpage_writepages(), in which bio is allocated with
> BIO_MAX_PAGES vecs too.

__mpage_writepage() only maps a page at a time, so it can't tell
ahead of time how big the bio is going to need to be as it doesn't
return/cache a contiguous extent range. So it's actually very
different to the iomap writeback code, and effectively does require
a BIO_MAX_PAGES vecs allocation all the time...

> Or you mean we can use iomap->length for this purpose? But iomap->length
> is still too big in case of xfs.

if we are doing small random writeback into large extents (i.e.
iomap->length is large), then it is trivial to detect that we are
doing random writes rather than sequential writes by checking if the
current page is sequential to the last sector in the current bio.
We already do this non-sequential IO checking to determine if a new
bio needs to be allocated in iomap_can_add_to_ioend(), and we also
know how large the current contiguous range mapped into the current
bio chain is (ioend->io_size). Hence we've got everything we need to
determine whether we should do a large or small bio vec allocation
in the iomap writeback path...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-02 Thread Dave Chinner
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> On 31.12.20 22:59, Dave Chinner wrote:
> > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > On 30.12.20 23:16, Dave Chinner wrote:

> > One could argue that, but one should also understand what the
> > design constraints for a particular algorithm are before suggesting
> > that their solution is "robust". :)
>
> Yes, but an understanding to the extend required by the
> argument should be sufficient :-)

And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all

> > > # seq 29
> > > 
> > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX 
> > > trigger cil e374c6f1 ctx 4967d650  
> > > ctx->space_used=33554656  , push_seq=29, ctx->sequence=29
> > 
> > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > 
> > .
> > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake 
> > >cil e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 
> > > 67108864, push_seq=29, ctx->sequence=29
> > 
> > It takes the best part of *50 seconds* before the push work actually
> > runs?
> > 
> > That's  well and truly screwed up - the work should run on that
> > CPU on the very next time it yields the CPU. If we're holding the
> > CPU without yielding it for that long, hangcheck and RCU warnings
> > should be going off...
> 
> No such warnings.
> 
> But the load is probably I/O bound to the log:
> 
> - creates `cp -a` copies of a directory tree with small files (linux source 
> repository)
> - source tree probably completely cached.
> - two copies in parallel
> - slow log (on software raid6)
> 
> Isn't it to be expected that sooner or later you need to wait for
> log writes when you write as fast as possible with lots of
> metadata updates and not so much data?

No, incoming transactions wait for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand

Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down than
metadata writeback is...

> I'm a bit concerned, though, that there seem to be a rather
> unlimited (~ 1000) number of kernel worker threads waiting for the
> cil push and indirectly for log writes.

That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.

> > So it dropped by 16 bytes (seems to be common) which is
> > unexpected.  I wonder if it filled a hole in a buffer and so
> > needed one less xlog_op_header()? But then the size would have
> > gone up by at least 128 bytes for the hole that was filled, so
> > it still shouldn't go down in size.
> > 
> > I think you need to instrument xlog_cil_insert_items() and catch
> > a negative length here:
> > 
> > /* account for space used by new iovec headers  */
> > iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> > len += iovhdr_res;
> > ctx->nvecs += diff_iovecs;
> > 
> > (diff_iovecs will be negative if the number of xlog_op_header
> > structures goes down)
> > 
> > And if this happens, then dump the transaction ticket via
> > xlog_print_trans(tp) so we can see all the log items types and
> > vectors that the transaction has formatted...
> 
> I tried that, but the output was difficult to understand, because
> at that point you can only log the complete transaction with the
> items already updated.  And a shrinking item is not switched to the
> shadow vector, so the formatted content is already overwritten and
> not available for analysis.

Yes, that's exactly the information I need to see.

But the fact you think this is something I don't need to know about
is classic Dunning-Kruger in action. You don't understand w

Re: [xfs] db962cd266: Assertion_failed

2021-01-01 Thread Dave Chinner
On Fri, Jan 01, 2021 at 05:10:49PM +0800, Yafang Shao wrote:
> On Thu, Dec 31, 2020 at 10:46 AM kernel test robot
>  wrote:
.
> > [  552.905799] XFS: Assertion failed: !current->journal_info, file: 
> > fs/xfs/xfs_trans.h, line: 280
> > [  553.104459]  xfs_trans_reserve+0x225/0x320 [xfs]
> > [  553.110556]  xfs_trans_roll+0x6e/0xe0 [xfs]
> > [  553.116134]  xfs_defer_trans_roll+0x104/0x2a0 [xfs]
> > [  553.122489]  ? xfs_extent_free_create_intent+0x62/0xc0 [xfs]
> > [  553.129780]  xfs_defer_finish_noroll+0xb8/0x620 [xfs]
> > [  553.136299]  xfs_defer_finish+0x11/0xa0 [xfs]
> > [  553.142017]  xfs_itruncate_extents_flags+0x141/0x440 [xfs]
> > [  553.149053]  xfs_setattr_size+0x3da/0x480 [xfs]
> > [  553.154939]  ? setattr_prepare+0x6a/0x1e0
> > [  553.160250]  xfs_vn_setattr+0x70/0x120 [xfs]
> > [  553.165833]  notify_change+0x364/0x500
> > [  553.170820]  ? do_truncate+0x76/0xe0
> > [  553.175673]  do_truncate+0x76/0xe0
> > [  553.180184]  path_openat+0xe6c/0x10a0
> > [  553.184981]  do_filp_open+0x91/0x100
> > [  553.189707]  ? __check_object_size+0x136/0x160
> > [  553.195493]  do_sys_openat2+0x20d/0x2e0
> > [  553.200481]  do_sys_open+0x44/0x80
> > [  553.204926]  do_syscall_64+0x33/0x40
> > [  553.209588]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Thanks for the report.
> 
> At a first glance, it seems we should make a similar change as we did
> in xfs_trans_context_clear().
> 
> static inline void
> xfs_trans_context_set(struct xfs_trans *tp)
> {
> 	/*
> 	 * We have already handed over the context via xfs_trans_context_swap().
> 	 */
> 	if (current->journal_info)
> 		return;
> 	current->journal_info = tp;
> 	tp->t_pflags = memalloc_nofs_save();
> }

Ah, no.

Remember how I said "split out the wrapping of transaction
context setup in xfs_trans_reserve() from
the lifting of the context setting into xfs_trans_alloc()"?

Well, you did the former and dropped the latter out of the patch
set.

Now when a transaction rolls after xfs_trans_context_swap(), it
calls xfs_trans_reserve() and tries to do transaction context setup
work inside a transaction context that already exists.  IOWs, you
need to put the patch that lifts the context setting up into
xfs_trans_alloc() back into the patchset before adding the
current->journal functionality patch.

Also, you need to test XFS code with CONFIG_XFS_DEBUG=y so that
asserts are actually built into the code and exercised, because this
ASSERT should have fired on the first rolling transaction that the
kernel executes...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-31 Thread Dave Chinner
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads, which committed items to the CIL, wait in the
> > > xc_push_wait waitqueue when used_space in the push context
> > > goes over a limit. These threads need to be woken when the CIL
> > > is pushed.
> > > 
> > > The CIL push worker tries to avoid the overhead of calling
> > > wake_all() when there are no waiters waiting. It does so by
> > > checking the same condition which caused the waits to happen.
> > > This, however, is unreliable, because ctx->space_used can
> > > actually decrease when items are recommitted.
> > 
> > When does this happen?
> > 
> > Do you have tracing showing the operation where the relogged
> > item has actually gotten smaller? By definition, relogging in
> > the CIL should only grow the size of the object in the CIL
> > because it must relog all the existing changes on top of the new
> > changes being made to the object. Hence the CIL reservation
> > should only ever grow.
> 
> I have (very ugly printk based) log (see below), but it only
> shows, that it happened (space_used decreasing), not what caused
> it.
> 
> I only browsed the ( xfs_*_item.c ) code and got the impression
> that the size of a log item is rather dynamic (e.g. number of
> extends in an inode, extended attributes in an inode, continuity
> of chunks in a buffer) and wasn't surprised that a relogged item
> might need less space from time to time.
> 
> > IOWs, returning negative lengths from the formatting code is
> > unexpected and probably a bug and requires further
> > investigation, not papering over the occurrence with broadcast
> > wakeups...
> 
> One could argue that the code is more robust after the change,
> because it wakes up every thread which is waiting on the next push
> to happen when the next push is happening without making
> assumption of why these threads are waiting by duplicating code
> from that waiters side. The proposed waitqueue_active() is inlined
> to two instructions and avoids the call overhead if there are no
> waiters as well.

One could argue that, but one should also understand what the design
constraints for a particular algorithm are before suggesting that
their solution is "robust". :)

> 
> # seq 29
> 
> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger 
> cil e374c6f1 ctx 4967d650  ctx->space_used=33554656  , 
> push_seq=29, ctx->sequence=29

So, at 20:08:15 we get a push trigger and the work is queued. But...

.
> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake
> cil e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 
> 67108864, push_seq=29, ctx->sequence=29

It takes the best part of *50 seconds* before the push work actually
runs?

That's  well and truly screwed up - the work should run on that
CPU on the very next time it yields the CPU. If we're holding the
CPU without yielding it for that long, hangcheck and RCU warnings
should be going off...

> # seq 30
> 
> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger 
> cil e374c6f1 ctx c46ab121  ctx->space_used=33554480  , 
> push_seq=30, ctx->sequence=30

20:09:39 for the next trigger,

> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.076095+01:00 deadbird ker

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-30 Thread Dave Chinner
On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> Threads, which committed items to the CIL, wait in the xc_push_wait
> waitqueue when used_space in the push context goes over a limit. These
> threads need to be woken when the CIL is pushed.
> 
> The CIL push worker tries to avoid the overhead of calling wake_all()
> when there are no waiters waiting. It does so by checking the same
> condition which caused the waits to happen. This, however, is
> unreliable, because ctx->space_used can actually decrease when items are
> recommitted.

When does this happen?

Do you have tracing showing the operation where the relogged item
has actually gotten smaller? By definition, relogging in the CIL
should only grow the size of the object in the CIL because it must
relog all the existing changes on top of the new changes being made
to the object. Hence the CIL reservation should only ever grow.

IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further investigation,
not papering over the occurrence with broadcast wakeups...

> If the value goes below the limit while some threads are
> already waiting but before the push worker gets to it, these threads are
> not woken.
> 
> Always wake all CIL push waiters. Test with waitqueue_active() as an
> optimization. This is possible, because we hold the xc_push_lock
> spinlock, which prevents additions to the waitqueue.
> 
> Signed-off-by: Donald Buczek 
> ---
>  fs/xfs/xfs_log_cil.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..d620de8e217c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,7 +670,7 @@ xlog_cil_push_work(
>   /*
>* Wake up any background push waiters now this context is being pushed.
>*/
> - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> + if (waitqueue_active(&cil->xc_push_wait))
>   wake_up_all(&cil->xc_push_wait);

That just smells wrong to me. It *might* be correct, but this
condition should pair with the sleep condition, as space used by a
CIL context should never actually decrease

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: v5.10.1 xfs deadlock

2020-12-18 Thread Dave Chinner
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> Dear xfs developer,
> 
> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs 
> filesystems on md raid6 raids.
> 
> The stress test was essentially `cp -a`ing a Linux source repository with two 
> threads in parallel on each filesystem.
> 
> After about an hour, the processes to one filesystem (md1) blocked, 30
> minutes later the process to the other filesystem (md0) did.
> 
> root  7322  2167  0 Dec16 pts/100:00:06 cp -a 
> /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
> root  7329  2169  0 Dec16 pts/100:00:05 cp -a 
> /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
> root 13856  2170  0 Dec16 pts/100:00:08 cp -a 
> /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
> root 13899  2168  0 Dec16 pts/100:00:05 cp -a 
> /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> 
> Some info from the system (all stack traces, slabinfo) is available here: 
> https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> 
> It stands out, that there are many (549 for md0, but only 10 for md1)  
> "xfs-conv" threads all with stacks like this
> 
> [<0>] xfs_log_commit_cil+0x6cc/0x7c0
> [<0>] __xfs_trans_commit+0xab/0x320
> [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
> [<0>] xfs_end_ioend+0xc6/0x110
> [<0>] xfs_end_io+0xad/0xe0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> xfs_log_commit_cil+0x6cc is
> 
>   xfs_log_commit_cil()
> xlog_cil_push_background(log)
>   xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> 
> Some other threads, including the four "cp" commands are also blocking at 
> xfs_log_commit_cil+0x6cc
> 
> There are also single "flush" process for each md device with this stack 
> signature:
> 
> [<0>] xfs_map_blocks+0xbf/0x400
> [<0>] iomap_do_writepage+0x15e/0x880
> [<0>] write_cache_pages+0x175/0x3f0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x59/0x80
> [<0>] do_writepages+0x4b/0xe0
> [<0>] __writeback_single_inode+0x42/0x300
> [<0>] writeback_sb_inodes+0x198/0x3f0
> [<0>] __writeback_inodes_wb+0x5e/0xc0
> [<0>] wb_writeback+0x246/0x2d0
> [<0>] wb_workfn+0x26e/0x490
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> xfs_map_blocks+0xbf is the
> 
> xfs_ilock(ip, XFS_ILOCK_SHARED);
> 
> in xfs_map_blocks().

Can you post the entire dmesg output after running
'echo w > /proc/sysrq-trigger' to dump all the blocked threads to
dmesg?

> I have an out of tree driver for the HBA ( smartpqi 2.1.6-005
> pulled from linux-scsi) , but it is unlikely that this blocking is
> related to that, because the md block devices itself are
> responsive (`xxd /dev/md0` )

My bet is that the OOT driver/hardware had dropped a log IO on the
floor - XFS is waiting for the CIL push to complete, and I'm betting
that is stuck waiting for iclog IO completion while writing the CIL
to the journal. The sysrq output will tell us if this is the case,
so that's the first place to look.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

2020-12-17 Thread Dave Chinner
On Thu, Dec 17, 2020 at 04:56:48PM -0800, Yang Shi wrote:
> On Tue, Dec 15, 2020 at 3:07 PM Yang Shi  wrote:
> > > This guarantees that only the shrinker instances that have a
> > > correctly set up memcg attached to them will have the
> > > SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
> > > code, we only ever need to check for SHRINKER_MEMCG_AWARE to
> > > determine what we should do
> >
> > Thanks. I see your point. We could move the memcg specific details
> > into prealloc_memcg_shrinker().
> >
> > It seems we have to acquire shrinker_rwsem before we check and modify
> > SHIRNKER_MEMCG_AWARE bit if we may clear it.
> 
> Hi Dave,
> 
> Is it possible that shrinker register races with shrinker unregister?
> It seems impossible to me by a quick visual code inspection. But I'm
> not a VFS expert so I'm not quite sure.

Uh, if you have a shrinker racing to register and unregister, you've
got a major bug in your object initialisation/teardown code. i.e.
calling register/unregister at the same time for the same shrinker
is a bug, pure and simple.
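i.e. the lifecycle has to look something like this - register before
the cache is published, unregister before teardown frees anything the
shrinker can reach (sketch only, the names are invented):

static int example_cache_init(struct example_cache *ec)
{
	ec->shrinker.count_objects = example_count_objects;
	ec->shrinker.scan_objects = example_scan_objects;
	ec->shrinker.seeks = DEFAULT_SEEKS;
	ec->shrinker.flags = SHRINKER_MEMCG_AWARE;
	return register_shrinker(&ec->shrinker);
}

static void example_cache_destroy(struct example_cache *ec)
{
	/* no new scans can start once this returns */
	unregister_shrinker(&ec->shrinker);
	/* now it is safe to tear down what the shrinker walks */
}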

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v3 4/9] mm, fsdax: Refactor memory-failure handler for dax mapping

2020-12-16 Thread Dave Chinner
On Tue, Dec 15, 2020 at 08:14:09PM +0800, Shiyang Ruan wrote:
> The current memory_failure_dev_pagemap() can only handle single-mapped
> dax page for fsdax mode.  The dax page could be mapped by multiple files
> and offsets if we let reflink feature & fsdax mode work together.  So,
> we refactor current implementation to support handle memory failure on
> each file and offset.
> 
> Signed-off-by: Shiyang Ruan 
> ---
.
>  static const char *action_name[] = {
> @@ -1147,6 +1148,60 @@ static int try_to_split_thp_page(struct page *page, 
> const char *msg)
>   return 0;
>  }
>  
> +int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, 
> int flags)
> +{
> + const bool unmap_success = true;
> + unsigned long pfn, size = 0;
> + struct to_kill *tk;
> + LIST_HEAD(to_kill);
> + int rc = -EBUSY;
> + loff_t start;
> + dax_entry_t cookie;
> +
> + /*
> +  * Prevent the inode from being freed while we are interrogating
> +  * the address_space, typically this would be handled by
> +  * lock_page(), but dax pages do not use the page lock. This
> +  * also prevents changes to the mapping of this pfn until
> +  * poison signaling is complete.
> +  */
> + cookie = dax_lock(mapping, index, &pfn);
> + if (!cookie)
> + goto unlock;

Why do we need to prevent the inode from going away here? This
function gets called by XFS after doing an xfs_iget() call to grab
the inode that owns the block. Hence the inode (and the mapping)
are guaranteed to be referenced and can't go away. Hence for the
filesystem based callers, this whole "dax_lock()" thing can go away.

So, AFAICT, the dax_lock() stuff is only necessary when the
filesystem can't be used to resolve the owner of physical page that
went bad
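
For reference, the XFS-side calling context I'm describing looks
roughly like this - a fragment for illustration only, not the actual
patch code, with mp/ino/index/flags coming from the surrounding rmap
lookup:

	/*
	 * rmap has given us the owning inode number; grab a reference
	 * to it so the mapping can't be torn down while we process
	 * the poisoned range.
	 */
	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
	if (error)
		return error;

	error = mf_dax_mapping_kill_procs(VFS_I(ip)->i_mapping, index, flags);

	xfs_irele(ip);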

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 02:27:18PM -0800, Yang Shi wrote:
> On Mon, Dec 14, 2020 at 6:46 PM Dave Chinner  wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:19PM -0800, Yang Shi wrote:
> > > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's 
> > > nr_deferred
> > > will be used in the following cases:
> > > 1. Non memcg aware shrinkers
> > > 2. !CONFIG_MEMCG
> > > 3. memcg is disabled by boot parameter
> > >
> > > Signed-off-by: Yang Shi 
> >
> > Lots of lines way over 80 columns.
> 
> I thought that has been lifted to 100 columns.

Documentation/process/coding-style.rst still says:

"The preferred limit on the length of a single line is 80 columns."

checkpatch might not warn about > 80 columns anymore, but if the
file you are modifying is almost entirely 80 columns in width, then
by default changes to that file should also stay within 80 columns.

I mostly consider using checkpatch to enforce coding styles to be
harmful

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:51:02PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 4688bff19c20..e8cfaf860149 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, 
> > struct block_device *bdev,
> >  
> > bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> > sb = get_super(bdev);
> > -   if (sb && sb->s_op->corrupted_range) {
> > +   if (!sb) {
> > +   rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, 
> > data);
> > +   goto out;
> > +   } else if (sb->s_op->corrupted_range)
> > rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
> > data);
> > -   drop_super(sb);
> 
> This is out of scope for this patch(set) but do you think that the scsi
> disk driver should intercept media errors from sense data and call
> ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> a patchset to do more with sense data than the upstream kernel currently
> does...

Most definitely!

That's the whole point of layering corrupt range reporting through
the device layers like this - the corrupted range reporting is not
limited specifically to pmem devices and so generic storage failures
(e.g.  RAID failures, hardware media failures, etc) can be reported
back up to the filesystem and we can take immediate, appropriate
action, including reporting to userspace that they just lost data in
file X at offset Y...

Combine that with the proposed "watch_sb()" syscall for reporting
such errors in a generic manner to interested listeners, and we've
got a fairly solid generic path for reporting data loss events to
userspace for an appropriate user-defined action to be taken...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > Hi Jane
> > 
> > On 2020/12/15 4:58 AM, Jane Chu wrote:
> > > Hi, Shiyang,
> > > 
> > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > This patchset is a try to resolve the problem of tracking shared page
> > > > for fsdax.
> > > > 
> > > > Change from v1:
> > > >    - Intorduce ->block_lost() for block device
> > > >    - Support mapped device
> > > >    - Add 'not available' warning for realtime device in XFS
> > > >    - Rebased to v5.10-rc1
> > > > 
> > > > This patchset moves owner tracking from dax_assocaite_entry() to pmem
> > > > device, by introducing an interface ->memory_failure() of struct
> > > > pagemap.  The interface is called by memory_failure() in mm, and
> > > > implemented by pmem device.  Then pmem device calls its ->block_lost()
> > > > to find the filesystem which the damaged page located in, and call
> > > > ->storage_lost() to track files or metadata assocaited with this page.
> > > > Finally we are able to try to fix the damaged data in filesystem and do
> > > 
> > > Does that mean clearing poison? if so, would you mind to elaborate
> > > specifically which change does that?
> > 
> > Recovering data for filesystem (or pmem device) has not been done in
> > this patchset...  I just triggered the handler for the files sharing the
> > corrupted page here.
> 
> Thanks! That confirms my understanding.
> 
> With the framework provided by the patchset, how do you envision it to
> ease/simplify poison recovery from the user's perspective?

At the moment, I'd say no change whatsoever. The behaviour is
necessary so that we can kill whatever user application maps
multiply-shared physical blocks if there's a memory error. The
recovery method from that is unchanged. The only advantage may be
that the filesystem (if rmap enabled) can tell you the exact file
and offset into the file where data was corrupted.

However, it can be worse, too: it may also now completely shut down
the filesystem if the filesystem discovers the error is in metadata
rather than user data. That's much more complex to recover from, and
right now will require downtime to take the filesystem offline and
run fsck to correct the error. That may trash whatever the metadata
that can't be recovered points to, so you still have a user data
recovery process to perform after this...

> And how does it help in dealing with page faults upon poisoned
> dax page?

It doesn't. If the page is poisoned, the same behaviour will occur
as does now. This is simply error reporting infrastructure, not
error handling.

Future work might change how we correct the faults found in the
storage, but I think the user visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > Since memcg_shrinker_map_size just can be changd under holding 
> > > shrinker_rwsem
> > > exclusively, the read side can be protected by holding read lock, so it 
> > > sounds
> > > superfluous to have a dedicated mutex.
> > 
> > I'm not sure this is a good idea. This couples the shrinker
> > infrastructure to internal details of how cgroups are initialised
> > and managed. Sure, certain operations might be done in certain
> > shrinker lock contexts, but that doesn't mean we should share global
> > locks across otherwise independent subsystems
> 
> They're not independent subsystems. Most of the memory controller is
> an extension of core VM operations that is fairly difficult to
> understand outside the context of those operations. Then there are a
> limited number of entry points from the cgroup interface. We used to
> have our own locks for core VM structures (private page lock e.g.) to
> coordinate VM and cgroup, and that was mostly unintelligble.

Yes, but OTOH you can set CONFIG_MEMCG=n and the shrinker infrastructure
and shrinkers all still function correctly.  Ergo, the shrinker
infrastructure is independent of memcgs. Yes, it may have functions
to iterate and manipulate memcgs, but it is not dependent on memcgs
existing for correct behaviour and functionality.

Yet.

> We have since established that those two components coordinate with
> native VM locking and lifetime management. If you need to lock the
> page, you lock the page - instead of having all VM paths that already
> hold the page lock acquire a nested lock to exclude one cgroup path.
> 
> In this case, we have auxiliary shrinker data, subject to shrinker
> lifetime and exclusion rules. It's much easier to understand that
> cgroup creation needs a stable shrinker list (shrinker_rwsem) to
> manage this data, than having an aliased lock that is private to the
> memcg callbacks and obscures this real interdependency.

Ok, so the way to do this is to move all the stuff that needs to be
done under a "subsystem global" lock to the one file, not turn a
static lock into a globally visible lock and spray it around random
source files. There's already way too many static globals to manage
separate shrinker and memcg state..

I certainly agree that shrinkers and memcg need to be more closely
integrated.  I've only been saying that for ... well, since memcgs
essentially duplicated the top level shrinker path so the shrinker
map could be introduced to avoid calling shrinkers that have no work
to do for memcgs. The shrinker map should be generic functionality
for all shrinker invocations because even a non-memcg machine can
have thousands of registered shrinkers that are mostly idle all the
time.

IOWs, I think the shrinker map management is not really memcg
specific - it's just allocation and assignment of a structure, and
the only memcg bit is the map is being stored in a memcg structure.
Therefore, if we are looking towards tighter integration then we
should acutally move the map management to the shrinker code, not
split the shrinker infrastructure management across different files.
There's already a heap of code in vmscan.c under #ifdef
CONFIG_MEMCG, like the prealloc_shrinker() code path:

prealloc_shrinker() vmscan.c
  if (MEMCG_AWARE)  vmscan.c
prealloc_memcg_shrinker vmscan.c
#ifdef CONFIG_MEMCG vmscan.c
  down_write(shrinker_rwsem)vmscan.c
  if (id > shrinker_id_max) vmscan.c
memcg_expand_shrinker_maps  memcontrol.c
  for_each_memcgmemcontrol.c
reallocate shrinker map memcontrol.c
replace shrinker mapmemcontrol.c
shrinker_id_max = idvmscan.c
  down_write(shrinker_rwsem)vmscan.c
#endif

And, really, there's very little code in memcg_expand_shrinker_maps()
here - the only memcg part is the memcg iteration loop, and we
already have them in vmscan.c (e.g. shrink_node_memcgs(),
age_active_anon(), drop_slab_node()) so there's precedence for
moving this memcg iteration for shrinker map management all into
vmscan.c.

Doing so would formalise the shrinker maps as first class shrinker
infrastructure rather than being tacked on to the side of the memcg
infrastructure. At this point it makes total sense to serialise map
manipulations under the shrinker_rwsem.

IOWs, I'm not disagreeing with the direction this patch takes us in,
I'm disagreeing with the implementation as published in the patch
because it 

Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:20PM -0800, Yang Shi wrote:
> Now nr_deferred is available on per memcg level for memcg aware shrinkers, so 
> don't need
> allocate shrinker->nr_deferred for such shrinkers anymore.
> 
> Signed-off-by: Yang Shi 
> ---
>  mm/vmscan.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bce8cf44eca2..8d5bfd818acd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, 
> enum lru_list lru, int zone
>   */
>  int prealloc_shrinker(struct shrinker *shrinker)
>  {
> - unsigned int size = sizeof(*shrinker->nr_deferred);
> + unsigned int size;
> +
> + if (is_deferred_memcg_aware(shrinker)) {
> + if (prealloc_memcg_shrinker(shrinker))
> + return -ENOMEM;
> + return 0;
> + }
> +
> + size = sizeof(*shrinker->nr_deferred);
>  
>   if (shrinker->flags & SHRINKER_NUMA_AWARE)
>   size *= nr_node_ids;
> @@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
>   if (!shrinker->nr_deferred)
>   return -ENOMEM;
>  
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> - if (prealloc_memcg_shrinker(shrinker))
> - goto free_deferred;
> - }
> -
>   return 0;
> -
> -free_deferred:
> - kfree(shrinker->nr_deferred);
> - shrinker->nr_deferred = NULL;
> - return -ENOMEM;
>  }

I'm trying to put my finger on it, but this seems wrong to me. If
memcgs are disabled, then prealloc_memcg_shrinker() needs to fail.
The preallocation code should not care about internal memcg details
like this.

/*
 * If the shrinker is memcg aware and memcgs are not
 * enabled, clear the MEMCG flag and fall back to non-memcg
 * behaviour for the shrinker.
 */
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
error = prealloc_memcg_shrinker(shrinker);
if (!error)
return 0;
if (error != -ENOSYS)
return error;

/* memcgs not enabled! */
shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
}

size = sizeof(*shrinker->nr_deferred);

return 0;
}

This guarantees that only the shrinker instances that have a
correctly set up memcg attached to them will have the
SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
code, we only ever need to check for SHRINKER_MEMCG_AWARE to
determine what we should do

>  void free_prealloced_shrinker(struct shrinker *shrinker)
>  {
> - if (!shrinker->nr_deferred)
> + if (is_deferred_memcg_aware(shrinker)) {
> + unregister_memcg_shrinker(shrinker);
>   return;
> + }
>  
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> - unregister_memcg_shrinker(shrinker);
> + if (!shrinker->nr_deferred)
> + return;
>  
>   kfree(shrinker->nr_deferred);
>   shrinker->nr_deferred = NULL;

e.g. then this function can simply do:

{
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
return unregister_memcg_shrinker(shrinker);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:21PM -0800, Yang Shi wrote:
> Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to 
> parent's
> corresponding nr_deferred when memcg offline.
> 
> Signed-off-by: Yang Shi 
> ---
>  include/linux/shrinker.h |  4 
>  mm/memcontrol.c  | 24 
>  mm/vmscan.c  |  2 +-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 1eac79ce57d4..85cfc910dde4 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -78,6 +78,10 @@ struct shrinker {
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
> +#ifdef CONFIG_MEMCG
> +extern int shrinker_nr_max;
> +#endif
> +
>  /* Flags */
>  #define SHRINKER_REGISTERED  (1 << 0)
>  #define SHRINKER_NUMA_AWARE  (1 << 1)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 321d1818ce3d..1f191a15bee1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -59,6 +59,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  #include 
>  #include 
> @@ -612,6 +613,28 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, 
> int nid, int shrinker_id)
>   }
>  }
>  
> +static void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
> +{
> + int i, nid;
> + long nr;
> + struct mem_cgroup *parent;
> + struct memcg_shrinker_deferred *child_deferred, *parent_deferred;
> +
> + parent = parent_mem_cgroup(memcg);
> + if (!parent)
> + parent = root_mem_cgroup;
> +
> + for_each_node(nid) {
> + child_deferred = memcg->nodeinfo[nid]->shrinker_deferred;
> + parent_deferred = parent->nodeinfo[nid]->shrinker_deferred;
> + for (i = 0; i < shrinker_nr_max; i ++) {
> + nr = atomic_long_read(&child_deferred->nr_deferred[i]);
> + atomic_long_add(nr,
> + &parent_deferred->nr_deferred[i]);
> + }
> + }
> +}

I would place this function in vmscan.c alongside the
shrink_slab_set_nr_deferred_memcg() function so that all the
accounting is in the one place.

> +
>  /**
>   * mem_cgroup_css_from_page - css of the memcg associated with a page
>   * @page: page of interest
> @@ -5543,6 +5566,7 @@ static void mem_cgroup_css_offline(struct 
> cgroup_subsys_state *css)
> + page_counter_set_low(&memcg->memory, 0);
>  
>   memcg_offline_kmem(memcg);
> + memcg_reparent_shrinker_deferred(memcg);
>   wb_memcg_offline(memcg);
>  
>   drain_all_stock(memcg);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8d5bfd818acd..693a41e89969 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -201,7 +201,7 @@ DECLARE_RWSEM(shrinker_rwsem);
>  #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
>  
>  static DEFINE_IDR(shrinker_idr);
> -static int shrinker_nr_max;
> +int shrinker_nr_max;

Then we don't need to make yet another variable global...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:22PM -0800, Yang Shi wrote:
> The number of deferred objects might get windup to an absurd number, and it 
> results in
> clamp of slab objects.  It is undesirable for sustaining workingset.
> 
> So shrink deferred objects proportional to priority and cap nr_deferred to 
> twice of
> cache items.

This completely changes the work accrual algorithm without any
explaination of how it works, what the theory behind the algorithm
is, what the work accrual ramp up and damp down curve looks like,
what workloads it is designed to benefit, how it affects page
cache vs slab cache balance and system performance, what OOM stress
testing has been done to ensure pure slab cache pressure workloads
don't easily trigger OOM kills, etc.

You're going to need a lot more supporting evidence that this is a
well thought out algorithm that doesn't obviously introduce
regressions. The current code might fall down in one corner case,
but there are an awful lot of corner cases where it does work.
Please provide some evidence that it not only works in your corner
case, but also doesn't introduce regressions for other slab cache
intensive and mixed cache intensive workloads...

> 
> Signed-off-by: Yang Shi 
> ---
>  mm/vmscan.c | 40 +---
>  1 file changed, 5 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 693a41e89969..58f4a383f0df 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -525,7 +525,6 @@ static unsigned long do_shrink_slab(struct shrink_control 
> *shrinkctl,
>*/
>   nr = count_nr_deferred(shrinker, shrinkctl);
>  
> - total_scan = nr;
>   if (shrinker->seeks) {
>   delta = freeable >> priority;
>   delta *= 4;
> @@ -539,37 +538,9 @@ static unsigned long do_shrink_slab(struct 
> shrink_control *shrinkctl,
>   delta = freeable / 2;
>   }
>  
> + total_scan = nr >> priority;

When there is low memory pressure, this will throw away a large
amount of the work that is deferred. If we are not deferring in
amounts larger than ~4000 items, every pass through this code will
zero the deferred work.
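
To put numbers on that (DEF_PRIORITY is 12, so "priority 12" is the
light pressure case):

	total_scan = nr >> priority

	priority 12, nr = 4000:     4000 >> 12   = 0
	priority 12, nr = 100000:   100000 >> 12 = 24
	priority 1,  nr = 100000:   100000 >> 1  = 50000

i.e. under light pressure almost the entire deferred count is
discarded on every pass instead of being carried forward.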

Hence when we do get substantial pressure, that deferred work is no
longer being tracked. While it may help your specific corner case,
it's likely to significantly change the reclaim balance of slab
caches, especially under GFP_NOFS intensive workloads where we can
only defer the work to kswapd.

Hence I think this is still a problematic approach as it doesn't
address the reason why deferred counts are increasing out of
control in the first place

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker

2020-12-14 Thread Dave Chinner
r static functions automatically if it makes
sense.

Ok, so you only do the memcg nr_deferred thing if NUMA_AWARE &&
sc->memcg is true. so

static long shrink_slab_set_nr_deferred_memcg(...)
{
int nid = sc->nid;

deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
 true);
return atomic_long_add_return(nr, &deferred->nr_deferred[id]);
}

static long shrink_slab_set_nr_deferred(...)
{
int nid = sc->nid;

if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
nid = 0;
else if (sc->memcg)
return shrink_slab_set_nr_deferred_memcg(, nid);

return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
}

And now there's no duplicated code.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:18PM -0800, Yang Shi wrote:
> Currently the number of deferred objects are per shrinker, but some slabs, 
> for example,
> vfs inode/dentry cache are per memcg, this would result in poor isolation 
> among memcgs.
> 
> The deferred objects typically are generated by __GFP_NOFS allocations, one 
> memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, then other 
> innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
> 
> For example, two workloads run in memcgA and memcgB respectively, workload in 
> B is vfs
> heavy workload.  Workload in A generates excessive deferred objects, then B's 
> vfs cache
> might be hit heavily (drop half of caches) by B's limit reclaim or global 
> reclaim.
> 
> We observed this hit in our production environment which was running vfs 
> heavy workload
> shown as the below tracing log:
> 
> <...>-409454 [016]  28286961.747146: mm_shrink_slab_start: 
> super_cache_scan+0x0/0x1a0 9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags 
> GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022]  28287105.928018: mm_shrink_slab_end: 
> super_cache_scan+0x0/0x1a0 9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 
> total_scan 602
> last shrinker return val 123186855
> 
> The vfs cache and page cache ration was 10:1 on this machine, and half of 
> caches were dropped.
> This also resulted in significant amount of page caches were dropped due to 
> inodes eviction.
> 
> Make nr_deferred per memcg for memcg aware shrinkers would solve the 
> unfairness and bring
> better isolation.
> 
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's 
> nr_deferred
> would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all 
> the time.
> 
> Signed-off-by: Yang Shi 
> ---
>  include/linux/memcontrol.h |   9 +++
>  mm/memcontrol.c| 110 -
>  mm/vmscan.c|   4 ++
>  3 files changed, 120 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 922a7f600465..1b343b268359 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -92,6 +92,13 @@ struct lruvec_stat {
>   long count[NR_VM_NODE_STAT_ITEMS];
>  };
>  
> +
> +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> +struct memcg_shrinker_deferred {
> + struct rcu_head rcu;
> + atomic_long_t nr_deferred[];
> +};

So you're effectively copy and pasting the memcg_shrinker_map
infrastructure and doubling the number of allocations/frees required
to set up/tear down a memcg? Why not add it to the struct
memcg_shrinker_map like this:

struct memcg_shrinker_map {
struct rcu_head rcu;
unsigned long   *map;
atomic_long_t   *nr_deferred;
};

And when you dynamically allocate the structure, set the map and
nr_deferred pointers to the correct offset in the allocated range.

Then this patch is really only changes to the size of the chunk
being allocated, setting up the pointers and copying the relevant
data from the old to new.
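
To be concrete about the single allocation, something along these
lines - a sketch only, not tested, and alloc_shrinker_map() is just an
illustrative name:

	struct memcg_shrinker_map {
		struct rcu_head rcu;
		unsigned long	*map;		/* shrinker_nr_max bits */
		atomic_long_t	*nr_deferred;	/* shrinker_nr_max counters */
		/* both arrays live in the same allocation, after the struct */
	};

	static struct memcg_shrinker_map *alloc_shrinker_map(int nr_max, gfp_t gfp)
	{
		size_t map_sz = DIV_ROUND_UP(nr_max, BITS_PER_LONG) *
						sizeof(unsigned long);
		size_t defer_sz = nr_max * sizeof(atomic_long_t);
		struct memcg_shrinker_map *new;

		new = kvzalloc(sizeof(*new) + map_sz + defer_sz, gfp);
		if (!new)
			return NULL;

		/* carve both arrays out of the one chunk */
		new->map = (unsigned long *)(new + 1);
		new->nr_deferred = (atomic_long_t *)((char *)new->map + map_sz);
		return new;
	}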

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> exclusively, the read side can be protected by holding read lock, so it sounds
> superfluous to have a dedicated mutex.

I'm not sure this is a good idea. This couples the shrinker
infrastructure to internal details of how cgroups are initialised
and managed. Sure, certain operations might be done in certain
shrinker lock contexts, but that doesn't mean we should share global
locks across otherwise independent subsystems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg

2020-12-14 Thread Dave Chinner
On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of 
> CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will 
> see
> memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of 
> processor reordering
> on !x86.
> 
> This seems like the below case:
> 
>CPU A  CPU B
> store shrinker_map  load CSS_ONLINE
> store CSS_ONLINEload shrinker_map
> 
> So the memory ordering could be guaranteed by smp_wmb()/smp_rmb() pair.
> 
> The memory barriers pair will guarantee the ordering between 
> shrinker_deferred and CSS_ONLINE
> for the following patches as well.

This should not require memory barriers in the shrinker code.

The code that sets and checks the CSS_ONLINE flag should have the
memory barriers to ensure that anything that sees an online CSS will
see it completely set up.

That is, the function online_css() that sets the CSS_ONLINE flag needs
a memory barrier to ensure all previous writes are completed before
the CSS_ONLINE flag is set, and the function mem_cgroup_online()
needs a barrier to pair with that.

This is the same existence issue that the superblock shrinkers have
with the shrinkers being registered before the superblock is fully
set up. The SB_BORN flag on the superblock indicates the superblock
is now fully set up ("online" in CSS speak) and the registered
shrinker can run. Please see the smp_wmb() before we set SB_BORN in
vfs_get_tree(), and the big comment about the smp_rmb() -after- we
check SB_BORN in super_cache_count() to understand the details of
the data dependency between the flag and the structures being set up
that the barriers enforce.

IOWs, these memory barriers belong inside the cgroup code to
guarantee anything that sees an online cgroup will always see the
fully initialised cgroup structures. They do not belong in the
shrinker infrastructure...
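
i.e. the pairing should look something like this - a sketch only, I
haven't checked exactly where the cgroup core should grow the wmb, so
take the function bodies as illustrative:

	/* cgroup core: publish the fully initialised css */
	static void online_css(struct cgroup_subsys_state *css)
	{
		/*
		 * All css/memcg initialisation, including the
		 * shrinker_maps allocation, is complete by here.
		 */
		smp_wmb();		/* setup stores before the flag store */
		css->flags |= CSS_ONLINE;
	}

	/* memcg: the check the shrinker path relies on */
	static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
	{
		bool online = memcg->css.flags & CSS_ONLINE;

		/*
		 * Pairs with the smp_wmb() in online_css(): if we saw
		 * CSS_ONLINE, we must also see the shrinker_maps that
		 * were set up before it.
		 */
		smp_rmb();
		return online;
	}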

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 4/6] block/psi: remove PSI annotations from direct IO

2020-12-14 Thread Dave Chinner
On Tue, Dec 15, 2020 at 01:03:45AM +, Pavel Begunkov wrote:
> On 15/12/2020 00:56, Dave Chinner wrote:
> > On Tue, Dec 15, 2020 at 12:20:23AM +, Pavel Begunkov wrote:
> >> As reported, we must not do pressure stall information accounting for
> >> direct IO, because otherwise it tells that it's thrashing a page when
> >> actually doing IO on hot data.
> >>
> >> Apparently, bio_iov_iter_get_pages() is used only by paths doing direct
> >> IO, so just make it avoid setting BIO_WORKINGSET, it also saves us CPU
> >> cycles on doing that. For fs/direct-io.c just clear the flag before
> >> submit_bio(), it's not of much concern performance-wise.
> >>
> >> Reported-by: Christoph Hellwig 
> >> Suggested-by: Christoph Hellwig 
> >> Suggested-by: Johannes Weiner 
> >> Signed-off-by: Pavel Begunkov 
> >> ---
> >>  block/bio.c| 25 -
> >>  fs/direct-io.c |  2 ++
> >>  2 files changed, 18 insertions(+), 9 deletions(-)
> > .
> >> @@ -1099,6 +1103,9 @@ static int __bio_iov_append_get_pages(struct bio 
> >> *bio, struct iov_iter *iter)
> >>   * fit into the bio, or are requested in @iter, whatever is smaller. If
> >>   * MM encounters an error pinning the requested pages, it stops. Error
> >>   * is returned only if 0 pages could be pinned.
> >> + *
> >> + * It also doesn't set BIO_WORKINGSET, so is intended for direct IO. If 
> >> used
> >> + * otherwise the caller is responsible to do that to keep PSI happy.
> >>   */
> >>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >>  {
> >> diff --git a/fs/direct-io.c b/fs/direct-io.c
> >> index d53fa92a1ab6..914a7f600ecd 100644
> >> --- a/fs/direct-io.c
> >> +++ b/fs/direct-io.c
> >> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, 
> >> struct dio_submit *sdio)
> >>unsigned long flags;
> >>  
> >>bio->bi_private = dio;
> >> +  /* PSI is only for paging IO */
> >> +  bio_clear_flag(bio, BIO_WORKINGSET);
> > 
> > Why only do this for the old direct IO path? Why isn't this
> > necessary for the iomap DIO path?
> 
> It's in the description. In short, block and iomap dio use
> bio_iov_iter_get_pages(), which with this patch doesn't use
> [__]bio_add_page() and so doesn't set the flag. 

That is not obvious to someone not intimately familiar with the
patchset you are working on. You described -what- the code is doing,
not -why- the flag needs to be cleared here.

"Direct IO does not operate on the current working set of pages
managed by the kernel, so it should not be accounted as IO to the
pressure stall tracking infrastructure. Only direct IO paths use
bio_iov_iter_get_pages() to build bios, so to avoid PSI tracking of
direct IO don't flag the bio with BIO_WORKINGSET in this function.

fs/direct-io.c uses bio_add_page() to build the bio we
are going to submit and so still flags the bio with BIO_WORKINGSET.
Rather than convert it to use bio_iov_iter_get_pages() to avoid
flagging the bio, we simply clear the BIO_WORKINGSET flag before
submitting the bio."

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 5/6] bio: add a helper calculating nr segments to alloc

2020-12-14 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:00:23PM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 12:20:24AM +, Pavel Begunkov wrote:
> > A preparation patch. It adds a simple helper which abstracts out number
> > of segments we're allocating for a bio from iov_iter_npages().
> 
> Preparation for what? bio_iov_vecs_to_alloc() doesn't seem to be
> used outside this specific patch, so it's not clear what it's
> actually needed for...

Never mind, I'm just blind.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 6/6] block/iomap: don't copy bvec for direct IO

2020-12-14 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:20:25AM +, Pavel Begunkov wrote:
> The block layer spends quite a while in blkdev_direct_IO() to copy and
> initialise bio's bvec. However, if we've already got a bvec in the input
> iterator it might be reused in some cases, i.e. when new
> ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
> performance boost, and it also reduces memory footprint.
> 
> Suggested-by: Matthew Wilcox 
> Signed-off-by: Pavel Begunkov 
> ---
>  Documentation/filesystems/porting.rst |  9 
>  block/bio.c   | 64 +++
>  include/linux/bio.h   |  3 ++
>  3 files changed, 38 insertions(+), 38 deletions(-)

This doesn't touch iomap code, so the title of the patch seems
wrong...

> +For bvec based itererators bio_iov_iter_get_pages() now doesn't copy bvecs 
> but
> +uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec 
> and
> +page references stay until I/O has completed, i.e. until ->ki_complete() has
> +been called or returned with non -EIOCBQUEUED code.

This is hard to follow. Perhaps:

bio_iov_iter_get_pages() uses the bvecs provided for bvec based
iterators rather than copying them. Hence anyone issuing kiocb based
IO needs to ensure the bvecs and pages stay referenced until the
submitted I/O is completed by a call to ->ki_complete() or returns
with an error other than -EIOCBQUEUED.
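
From the caller's side that rule looks something like this - a
fragment for illustration only, with iter/bvecs/nr_segs/count/kiocb
assumed to be set up by the caller:

	/* the bvec array backing the iter must stay alive until completion */
	iov_iter_bvec(&iter, WRITE, bvecs, nr_segs, count);

	ret = call_write_iter(file, &kiocb, &iter);
	if (ret != -EIOCBQUEUED)
		kfree(bvecs);	/* IO completed (or failed) synchronously */
	/*
	 * Otherwise the IO is in flight: free bvecs (and drop the page
	 * references) from our ->ki_complete() handler instead.
	 */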

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 2a9f3f0bbe0a..337f4280b639 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -444,6 +444,9 @@ static inline void bio_wouldblock_error(struct bio *bio)
>  
>  static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
>  {
> + /* reuse iter->bvec */
> + if (iov_iter_is_bvec(iter))
> + return 0;
>   return iov_iter_npages(iter, max_segs);

Ah, I'm a blind idiot... :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 5/6] bio: add a helper calculating nr segments to alloc

2020-12-14 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:20:24AM +, Pavel Begunkov wrote:
> A preparation patch. It adds a simple helper which abstracts out number
> of segments we're allocating for a bio from iov_iter_npages().

Preparation for what? bio_iov_vecs_to_alloc() doesn't seem to be
used outside this specific patch, so it's not clear what it's
actually needed for...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 4/6] block/psi: remove PSI annotations from direct IO

2020-12-14 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:20:23AM +, Pavel Begunkov wrote:
> As reported, we must not do pressure stall information accounting for
> direct IO, because otherwise it tells that it's thrashing a page when
> actually doing IO on hot data.
> 
> Apparently, bio_iov_iter_get_pages() is used only by paths doing direct
> IO, so just make it avoid setting BIO_WORKINGSET, it also saves us CPU
> cycles on doing that. For fs/direct-io.c just clear the flag before
> submit_bio(), it's not of much concern performance-wise.
> 
> Reported-by: Christoph Hellwig 
> Suggested-by: Christoph Hellwig 
> Suggested-by: Johannes Weiner 
> Signed-off-by: Pavel Begunkov 
> ---
>  block/bio.c| 25 -
>  fs/direct-io.c |  2 ++
>  2 files changed, 18 insertions(+), 9 deletions(-)
.
> @@ -1099,6 +1103,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, 
> struct iov_iter *iter)
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>   * MM encounters an error pinning the requested pages, it stops. Error
>   * is returned only if 0 pages could be pinned.
> + *
> + * It also doesn't set BIO_WORKINGSET, so is intended for direct IO. If used
> + * otherwise the caller is responsible to do that to keep PSI happy.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index d53fa92a1ab6..914a7f600ecd 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct 
> dio_submit *sdio)
>   unsigned long flags;
>  
>   bio->bi_private = dio;
> + /* PSI is only for paging IO */
> + bio_clear_flag(bio, BIO_WORKINGSET);

Why only do this for the old direct IO path? Why isn't this
necessary for the iomap DIO path?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-06 Thread Dave Chinner
On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
> 
> On 2020/11/30 6:47 AM, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > > 
> > > The call trace is like this:
> > >   memory_failure()
> > > pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> > >  gendisk->fops->block_lost()   => pmem_block_lost() or
> > >   md_blk_block_lost()
> > >   sb->s_ops->storage_lost()=> xfs_fs_storage_lost()
> > >xfs_rmap_query_range()
> > > xfs_storage_lost_helper()
> > >  mf_recover_controller->recover_fn => \
> > >  memory_failure_dev_pagemap_kill_procs()
> > > 
> > > The collect_procs() and kill_procs() are moved into a callback which
> > > is passed from memory_failure() to xfs_storage_lost_helper().  So we
> > > can call it when a file assocaited is found, instead of creating a
> > > file list and iterate it.
> > > 
> > > The fsdax & reflink support for XFS is not contained in this patchset.
> > 
> > This looks promising - the overall architecture is a lot more
> > generic and less dependent on knowing about memory, dax or memory
> > failures. A few comments that I think would further improve
> > understanding the patchset and the implementation:
> 
> Thanks for your kindly comment.  It gives me confidence.
> 
> > 
> > - the order of the patches is inverted. It should start with a
> >single patch introducing the mf_recover_controller structure for
> >callbacks, then introduce pgmap->ops->memory_failure, then
> >->block_lost, then the pmem and md implementations of ->block
> >list, then ->storage_lost and the XFS implementations of
> >->storage_lost.
> 
> Yes, it will be easier to understand the patchset in this order.
> 
> But I have something unsure: for example, I introduce ->memory_failure()
> firstly, but the implementation of ->memory_failure() needs to call
> ->block_lost() which is supposed to be introduced in the next patch. So, I
> am not sure the code is supposed to be what in the implementation of
> ->memory_failure() in pmem?  To avoid this situation, I committed the
> patches in the inverted order: lowest level first, then its caller, and then
> caller's caller.

Well, there's two things here. The first is the infrastructure, the
second is the drivers that use the infrastructure. You can introduce
a method in one patch, and then the driver that uses it in another.
Or you can introduce a driver skeleton that does nothing until
more infrastructure is added. so...

> 
> I am trying to sort out the order.  How about this:
>  Patch i.
>Introduce ->memory_failure()
>   - just introduce interface, without implementation
>  Patch i++.
>Introduce ->block_lost()
>   - introduce interface and implement ->memory_failure()
>  in pmem, so that it can call ->block_lost()
>  Patch i++.
>(similar with above, skip...)

So this is better, but you don't need to add the pmem driver use of
"->block_lost" in the patch that adds the method. IOWs, something
like:

P1: introduce ->memory_failure API, all the required documentation
and add the call sites in the infrastructure that trigger it

P2: introduce ->corrupted_range to the block device API, all the
required documentation and any generic block infrastructure that
needs to call it.

P3: introduce ->corrupted_range to the superblock ops API, all the
required documentation

P4: add ->corrupted_range() API to the address space ops, all the
required documentation

P5: factor the existing kill procs stuff to be able to be called
via generic_mapping_kill_range()

P6: add dax_mapping_kill_range()

P7: add the pmem driver support for ->memory_failure

P8: add the block device driver support for ->corrupted_range

P9: add filesystem support for sb_ops->corrupted_range.

P10: add filesystem support for aops->corrupted_range.
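
To make the shape of those interfaces concrete, this is roughly what I
have in mind - illustrative prototypes only, the exact names and
argument lists are whatever the patches end up with; the important
part is the {offset,len} pair being passed down at every layer:

	/* P1: dev_pagemap op - report a bad physical range bottom-up */
	int (*memory_failure)(struct dev_pagemap *pgmap, phys_addr_t addr,
			      u64 len, int flags);

	/* P2: block device op - the same range as a device offset */
	int (*corrupted_range)(struct gendisk *disk, struct block_device *bdev,
			       loff_t offset, u64 len, void *data);

	/* P3: super_operations - filesystem maps the range to owners (rmap) */
	int (*corrupted_range)(struct super_block *sb, struct block_device *bdev,
			       loff_t offset, u64 len, void *data);

	/* P4: address_space_operations - per-mapping kill/recover action */
	int (*corrupted_range)(struct address_space *mapping, loff_t offset,
			       u64 len, int flags);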

> >This gets rid of the mf_recover_controller altogether and allows
> >the interface to be used by any sort of block device for any sort
> >of bottom-up reporting of media/device failures.
> 
> Moving the recover function to the address_space ops looks a better idea.
> But I think that the error handler for page cache mapping is finished well
> in memory-failure.  The memory-failure is also reused to handles anonymous
> page.

Yes, anonymous page handling can remain there, we're only concerned
about handling file mapped pages here right now. If we end 

Re: [PATCH V2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-02 Thread Dave Chinner
On Wed, Dec 02, 2020 at 10:04:17PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 03, 2020 at 07:40:45AM +1100, Dave Chinner wrote:
> > On Wed, Dec 02, 2020 at 08:06:01PM +0100, Greg Kroah-Hartman wrote:
> > > On Wed, Dec 02, 2020 at 06:41:43PM +0100, Miklos Szeredi wrote:
> > > > On Wed, Dec 2, 2020 at 5:24 PM David Howells  
> > > > wrote:
> > > > >
> > > > > Miklos Szeredi  wrote:
> > > > >
> > > > > > Stable cc also?
> > > > > >
> > > > > > Cc:  # 5.8
> > > > >
> > > > > That seems to be unnecessary, provided there's a Fixes: tag.
> > > > 
> > > > Is it?
> > > > 
> > > > Fixes: means it fixes a patch, Cc: stable means it needs to be
> > > > included in stable kernels.  The two are not necessarily the same.
> > > > 
> > > > Greg?
> > > 
> > > You are correct.  cc: stable, as is documented in
> > > 
> > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > ensures that the patch will get merged into the stable tree.
> > > 
> > > Fixes: is independent of it.  It's great to have for stable patches so
> > > that I know how far back to backport patches.
> > > 
> > > We do scan all commits for Fixes: tags that do not have cc: stable, and
> > > try to pick them up when we can and have the time to do so.  But it's
> > > not guaranteed at all that this will happen.
> > > 
> > > I don't know why people keep getting confused about this, we don't
> > > document the "Fixes: means it goes to stable" anywhere...
> > 
> > Except that is exactly what happens, sometimes within a day of two
> > of a patch with a Fixes tag hitting Linus' kernel. We have had a
> > serious XFS regression in the 5.9.9 stable kernel that should never
> > have happened as a result of exactly this "Fixes = automatically
> > swept immediately into stable kernels" behaviour. See here for
> > post-mortem analysis:
> > 
> > https://lore.kernel.org/linux-xfs/20201126071323.gf2842...@dread.disaster.area/T/#m26e14ebd28ad306025f4ebf37e2aae9a304345a5
> > 
This happened because these automated Fixes scans seem to occur
> > weekly during -rcX release periods, which means there really is *no
> > practical difference* between the way the stable process treats
> > Fixes tags and cc: stable.
> 
> Sometimes, yes, that is true.  But as it went into Linus's tree at the
> same time, we just ended up with "bug compatible" trees :)
> 
> Not a big deal overall, happens every few releases, we fix it up and
> move on.  The benifits in doing all of this _FAR_ outweigh the very
> infrequent times that kernel developers get something wrong.

I'm not debating that users benefit from backports. I'm talking
about managing risk profiles and how to prevent an entirely
preventable stable kernel regression from happening again.

Talking about risk profiles, the issue here is that the regression
that slipped through to the stable kernels had a -catastrophic- risk
profile. That's exactly the sort of things that the stable kernel is
supposed to avoid exposing users to, and that raises the importance
and priority of ensuring that *never happens again*.

And the cause of this regression slipping through to stable kernel
users? It was a result of the automated "fixes" scan done by the
stable process that results in "fixes" meaning the same thing as
"cc: stable"

> As always, if you do NOT want your subsystem to have fixes: tags picked
> up automatically by us for stable trees, just email us and let us know
> to not do that and we gladly will.

No, that is not an acceptible solution for anyone. The stable
maintainers need to stop suggesting this as a solution to any
criticism that is raised against the stable process. You may as well
just say "shut up, go away, we don't care what you want".

> > It seems like this can all be avoided simply by scheduling the
> > automated fixes scans once the upstream kernel is released, not
> > while it is still being stabilised by -rc releases. That way stable
> > kernels get better tested fixes, they still get the same quantity of
> > fixes, and upstream developers have some margin to detect and
> > correct regressions in fixes before they get propagated to users.
> 
> So the "magic" -final release from Linus would cause this to happen?
> That means that the world would go for 3 months without some known fixes
> being applied to the tree?  That's not acceptable to me, as I started
> doin

Re: [PATCH V2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-02 Thread Dave Chinner
On Wed, Dec 02, 2020 at 08:06:01PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Dec 02, 2020 at 06:41:43PM +0100, Miklos Szeredi wrote:
> > On Wed, Dec 2, 2020 at 5:24 PM David Howells  wrote:
> > >
> > > Miklos Szeredi  wrote:
> > >
> > > > Stable cc also?
> > > >
> > > > Cc:  # 5.8
> > >
> > > That seems to be unnecessary, provided there's a Fixes: tag.
> > 
> > Is it?
> > 
> > Fixes: means it fixes a patch, Cc: stable means it needs to be
> > included in stable kernels.  The two are not necessarily the same.
> > 
> > Greg?
> 
> You are correct.  cc: stable, as is documented in
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> ensures that the patch will get merged into the stable tree.
> 
> Fixes: is independent of it.  It's great to have for stable patches so
> that I know how far back to backport patches.
> 
> We do scan all commits for Fixes: tags that do not have cc: stable, and
> try to pick them up when we can and have the time to do so.  But it's
> not guaranteed at all that this will happen.
> 
> I don't know why people keep getting confused about this, we don't
> document the "Fixes: means it goes to stable" anywhere...

Except that is exactly what happens, sometimes within a day of two
of a patch with a Fixes tag hitting Linus' kernel. We have had a
serious XFS regression in the 5.9.9 stable kernel that should never
have happened as a result of exactly this "Fixes = automatically
swept immediately into stable kernels" behaviour. See here for
post-mortem analysis:

https://lore.kernel.org/linux-xfs/20201126071323.gf2842...@dread.disaster.area/T/#m26e14ebd28ad306025f4ebf37e2aae9a304345a5

This happened because these automated Fixes scans seem to occur
weekly during -rcX release periods, which means there really is *no
practical difference* between the way the stable process treats
Fixes tags and cc: stable.

Hence instead of developers having some control over "urgent, must
backport now" fixes versus fixes that still need the -rcX
stabilisation and integration testing to shake them out fully, the
regular scans result in everything with a fixes tag is treated as an
"urgent, must backport now" category of fix. It effectively
removes the stabilisation and integration testing process from
the changes stable kernel users are being exposed to...

That's not right. It gives upstream developers no margin for error,
and it exposes stable kernel users to bugs that the normal upstream
kernel stabilisation process prevents users from ever seeing in
released kernels. And it is exactly this behaviour that lead people
to understand that "fixes" and "cc: stable" essentially mean the
same thing from a stable kernel perspective.

It seems like this can all be avoided simply by scheduling the
automated fixes scans once the upstream kernel is released, not
while it is still being stabilised by -rc releases. That way stable
kernels get better tested fixes, they still get the same quantity of
fixes, and upstream developers have some margin to detect and
correct regressions in fixes before they get propagated to users.

It also creates a clear demarcation between fixes and cc: stable for
maintainers and developers: only patches with a cc: stable will be
backported immediately to stable. Developers know what patches need
urgent backports and, unlike developers, the automated fixes scan
does not have the subject matter expertise or background to make
that judgement

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2] statx: move STATX_ATTR_DAX attribute handling to filesystems

2020-12-01 Thread Dave Chinner
 /could/ have DAX (but it wasn't currently
> enabled) and a filesystem where it just isn't possible.

I think that's the only thing that makes sense from a userspace
perspective. THe man page explicitly says that:

  stx_attributes_mask
A mask indicating which bits in stx_attributes are supported
by the VFS and the filesystem.

So if DAX can never be turned on for that filesystem instance then,
by definition, it does not support DAX and the bit should never be
set.

e.g. We don't talk about kernels that support reflink - what matters
to userspace is whether the filesystem instance supports reflink.
Think of the useless mess that xfs_info would be if it reported
kernel capabilities instead of filesystem instance capabilities.
i.e. we don't report that a filesystem supports reflink just because
the kernel supports it - it reports whether the filesystem instance
being queried supports reflink. And that also implies the kernel
supports it, because the kernel has to support it to mount the
filesystem...

So, yeah, I think it really does need to be conditional on the
filesystem instance being queried to be actually useful to users
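
i.e. in the filesystem's ->getattr() the logic would be something like
this - a sketch only, and the per-instance capability check is
whatever helper the filesystem already has; the name below is made up:

	/* only advertise DAX if this filesystem instance can actually do it */
	if (fs_instance_supports_dax(inode)) {		/* hypothetical helper */
		stat->attributes_mask |= STATX_ATTR_DAX;
		if (IS_DAX(inode))
			stat->attributes |= STATX_ATTR_DAX;
	}
	/* otherwise neither the mask nor the attribute bit gets set */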

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-11-29 Thread Dave Chinner
On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
> 
> Change from v1:
>   - Intorduce ->block_lost() for block device
>   - Support mapped device
>   - Add 'not available' warning for realtime device in XFS
>   - Rebased to v5.10-rc1
> 
> This patchset moves owner tracking from dax_assocaite_entry() to pmem
> device, by introducing an interface ->memory_failure() of struct
> pagemap.  The interface is called by memory_failure() in mm, and
> implemented by pmem device.  Then pmem device calls its ->block_lost()
> to find the filesystem which the damaged page located in, and call
> ->storage_lost() to track files or metadata assocaited with this page.
> Finally we are able to try to fix the damaged data in filesystem and do
> other necessary processing, such as killing processes who are using the
> files affected.
> 
> The call trace is like this:
>  memory_failure()
>pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> gendisk->fops->block_lost()   => pmem_block_lost() or
>  md_blk_block_lost()
>  sb->s_ops->storage_lost()=> xfs_fs_storage_lost()
>   xfs_rmap_query_range()
>xfs_storage_lost_helper()
> mf_recover_controller->recover_fn => \ 
> memory_failure_dev_pagemap_kill_procs()
> 
> The collect_procs() and kill_procs() are moved into a callback which
> is passed from memory_failure() to xfs_storage_lost_helper().  So we
> can call it when a file assocaited is found, instead of creating a
> file list and iterate it.
> 
> The fsdax & reflink support for XFS is not contained in this patchset.

This looks promising - the overall architecture is a lot more
generic and less dependent on knowing about memory, dax or memory
failures. A few comments that I think would further improve
understanding the patchset and the implementation:

- the order of the patches is inverted. It should start with a
  single patch introducing the mf_recover_controller structure for
  callbacks, then introduce pgmap->ops->memory_failure, then
  ->block_lost, then the pmem and md implementations of ->block
  list, then ->storage_lost and the XFS implementations of
  ->storage_lost.

- I think the names "block_lost" and "storage_lost" are misleading.
  It's more like a "media failure" or a general "data corruption"
  event at a specific physical location. The data may not be "lost"
  but only damaged, so we might be able to recover from it without
  "losing" anything. Hence I think they could be better named,
  perhaps just "->corrupt_range"

- need to pass a {offset,len} pair through the chain, not just a
  single offset. This will allow other types of devices to report
  different ranges of failures, from a single sector to an entire
  device.

- I'm not sure that passing the mf_recover_controller structure
  through the corruption event chain is the right thing to do here.
  A block device could generate this storage failure callback if it
  detects an unrecoverable error (e.g. during a MD media scrub or
  rebuild/resilver failure) and in that case we don't have PFNs or
  memory device failure functions to perform.

  IOWs, I think the action that is taken needs to be independent of
  the source that generated the error. Even for a pmem device, we
  can be using the page cache, so it may be possible to recover the
  pmem error by writing the cached page (if it exists) back over the
  pmem.

  Hence I think that the recover function probably needs to be moved
  to the address space ops, because what we do to recover from the
  error is going to be dependent on type of mapping the filesystem
  is using. If it's a DAX mapping, we call back into a generic DAX
  function that does the vma walk and process kill functions. If it
  is a page cache mapping, then if the page is cached then we can
  try to re-write it to disk to fix the bad data, otherwise we treat
  it like a writeback error and report it on the next
  write/fsync/close operation done on that file.

  This gets rid of the mf_recover_controller altogether and allows
  the interface to be used by any sort of block device for any sort
  of bottom-up reporting of media/device failures.
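
As a sketch of what I mean by moving the recovery decision to the
address space ops - both helper names below are made up for
illustration, they don't exist anywhere yet:

	/*
	 * Called once the filesystem has resolved the corrupt LBA range to
	 * an (inode, offset, len) tuple.  The action depends on the mapping
	 * type, not on which device reported the failure.
	 */
	static int mapping_corrupted_range(struct address_space *mapping,
					   loff_t offset, size_t len)
	{
		if (IS_DAX(mapping->host))
			/* no page cache copy: unmap and kill the users */
			return dax_mapping_kill_range(mapping, offset, len);

		/*
		 * Page cache mapping: if we still have the page cached we
		 * can try to write it back over the bad media; otherwise
		 * record it as a writeback error for the next
		 * write/fsync/close to report.
		 */
		return mapping_rewrite_or_fail_range(mapping, offset, len);
	}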

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 5.9 33/33] xfs: don't allow NOWAIT DIO across extent boundaries

2020-11-25 Thread Dave Chinner
On Wed, Nov 25, 2020 at 06:46:54PM -0500, Sasha Levin wrote:
> On Thu, Nov 26, 2020 at 08:52:47AM +1100, Dave Chinner wrote:
> > We've already had one XFS upstream kernel regression in this -rc
> > cycle propagated to the stable kernels in 5.9.9 because the stable
> > process picked up a bunch of random XFS fixes within hours of them
> > being merged by Linus. One of those commits was a result of a
> > thinko, and despite the fact we found it and reverted it within a
> > few days, users of stable kernels have been exposed to it for a
> > couple of weeks. That *should never have happened*.
> 
> No, what shouldn't have happened is a commit that never went out for a review
> on the public mailing lists nor spending any time in linux-next ending
> up in Linus's tree.



I think you've got your wires crossed somewhere, Sasha, because none
of that happened here.  From the public record, the patch was first
posted here by Darrick:

https://lore.kernel.org/linux-xfs/160494584816.772693.2490433010759557816.stgit@magnolia/

on Nov 9, and was reviewed by Christoph a day later.  It was merged
into the XFS tree on Nov 10, with the rvb tag:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next=6ff646b2ceb0eec916101877f38da0b73e3a5b7f

Which means it should have been in linux-next on Nov 11, 12 and 13,
when Darrick sent the pull request:

https://lore.kernel.org/linux-xfs/20201113231738.GX9695@magnolia/

It was merged into Linus's tree an hour later.

So, in contrast to your claims, the evidence is that the patch was,
in fact, publicly posted, reviewed, and spent time in linux-next
before ending up in Linus's tree.

FWIW, on November 17, GregKH sent the patch to lkml for stable review
after being selected by the stable process for a stable backport.
This was not cc'd to the XFS list, and it was committed without
comment into the 5.9.x tree on November 18.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/xfs?h=linux-5.9.y=0ca9a072112b18efc9ba9d3a9b77e9dae60f93ac

IOWs, the XFS developers didn't ask for it to be backported to
stable kernels - the commit did not contain a "cc:
sta...@kernel.org", nor was the original patch posting cc'd to the
stable list.

The fact is that entire decision to backport this commit was made by
stable maintainers and/or their tools, and the stable maintainers
themselves chose not to tell the XFS list they had selected it for
backport.

Hence:

> It's ridiculous that you see a failure in the maintainership workflow of
> XFS and turn around to blame it somehow on the stable process.

 I think you really need to have another look at the evidence
before you dig yourself a deeper hole and waste more of my time

> > This has happened before, and *again* we were lucky this wasn't
> > worse than it was. We were saved by the flaw being caught by own
> > internal pre-write corruption verifiers (which exist because we
> > don't trust our code to be bug-free, let alone the collections of
> > random, poorly tested backports) so that it only resulted in
> > corruption shutdowns rather than permanent on-disk damage and data
> > loss.
> > 
> > Put simply: the stable process is flawed because it shortcuts the
> > necessary stabilisation testing for new code. It doesn't matter if
> 
> The stable process assumes that commits that ended up upstream were
> reviewed and tested; the stable process doesn't offer much in the way of
> in-depth review of specific patches but mostly focuses on testing the
> product of backporting hundreds of patches into each stable branch.

And I've lost count of the number of times I've told the stable
maintainers that this is an invalid assumption.

Yet here we are again.

How many times do we have to make the same mistake before we learn
from it?

> Release candidate cycles are here to squash the bugs that went in during
> the merge window, not to introduce new "thinkos" in the way of pulling
> patches out of your hip in the middle of the release cycle.

"pulling patches out of your hip"

Nice insult. Avoids profanity filters and everything. But I don't
know why you're trying to insult me over something I played no part
in.

Seriously, merging critical fixes discovered in the -rc cycle
happens *all the time* and has been done for as long as we've had
-rc cycles.  Even Linus himself does this.

The fact is that the -rc process is intended to accommodate merging
fixes quickly whilst still allowing sufficient testing time to be
confident that no regressions were introduced or have been found and
addressed before release.

And that's the whole point of having an iterative integration
testing phase in the release cycle - it can be adapted in duration
to the current state of the code base and the fixes that are being
made late in the cycle.

You *should*

Re: [PATCH AUTOSEL 5.9 33/33] xfs: don't allow NOWAIT DIO across extent boundaries

2020-11-25 Thread Dave Chinner
On Wed, Nov 25, 2020 at 10:35:50AM -0500, Sasha Levin wrote:
> From: Dave Chinner 
> 
> [ Upstream commit 883a790a84401f6f55992887fd7263d808d4d05d ]
> 
> Jens has reported a situation where partial direct IOs can be issued
> and completed yet still return -EAGAIN. We don't want this to report
> a short IO as we want XFS to complete user DIO entirely or not at
> all.
> 
> This partial IO situation can occur on a write IO that is split
> across an allocated extent and a hole, and the second mapping is
> returning EAGAIN because allocation would be required.
> 
> The trivial reproducer:
> 
> $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" 
> /mnt/scr/foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
> pwrite: Resource temporarily unavailable
> $
> 
> The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
> the first 4kB write:
> 
>  xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
>  iomap_apply:  dev 259:1 ino 0x83 pos 0 length 8192 flags 
> WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw 
> actor iomap_dio_actor
>  xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller 
> xfs_ilock_for_iomap
>  xfs_iunlock:  dev 259:1 ino 0x83 flags ILOCK_SHARED caller 
> xfs_direct_write_iomap_begin
>  xfs_iomap_found:  dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 
> fork data startoff 0x0 startblock 24 blockcount 0x1
>  iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 
> length 4096 type MAPPED flags DIRTY
> 
> Here the first iomap loop has mapped the first 4kB of the file and
> issued the IO, and we enter the second iomap_apply loop:
> 
>  iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags 
> WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw 
> actor iomap_dio_actor
>  xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller 
> xfs_ilock_for_iomap
>  xfs_iunlock:  dev 259:1 ino 0x83 flags ILOCK_SHARED caller 
> xfs_direct_write_iomap_begin
> 
> And we exit with -EAGAIN out because we hit the allocate case trying
> to make the second 4kB block.
> 
> Then IO completes on the first 4kB and the original IO context
> completes and unlocks the inode, returning -EAGAIN to userspace:
> 
>  xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 
> offset 0x0 count 4096
>  xfs_iunlock:  dev 259:1 ino 0x83 flags IOLOCK_SHARED caller 
> xfs_file_dio_aio_write
> 
> There are other vectors to the same problem when we re-enter the
> mapping code if we have to make multiple mappings under NOWAIT
> conditions. e.g. failing trylocks, COW extents being found,
> allocation being required, and so on.
> 
> Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
> to go ahead if the mapping we retrieve for the IO spans an entire
> allocated extent. This avoids the possibility of subsequent mappings
> to complete the IO from triggering NOWAIT semantics by any means as
> NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
> 
> Reported-and-tested-by: Jens Axboe 
> Signed-off-by: Dave Chinner 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> Signed-off-by: Sasha Levin 
> ---
>  fs/xfs/xfs_iomap.c | 29 +
>  1 file changed, 29 insertions(+)
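
For context, the rule the commit message describes (only allow the
NOWAIT IO to proceed if the single mapping we found spans the whole
requested range) boils down to a helper of roughly this shape; this is
a sketch of the idea, not the exact upstream diff:

static bool
imap_spans_range(
	struct xfs_bmbt_irec	*imap,
	xfs_fileoff_t		offset_fsb,
	xfs_fileoff_t		end_fsb)
{
	/* The single mapping must cover [offset_fsb, end_fsb) entirely,
	 * so a NOWAIT IO never needs a second mapping that could block
	 * or return -EAGAIN after partial completion. */
	if (imap->br_startoff > offset_fsb)
		return false;
	if (imap->br_startoff + imap->br_blockcount < end_fsb)
		return false;
	return true;
}

	/* in xfs_direct_write_iomap_begin(), roughly: */
	if ((flags & IOMAP_NOWAIT) &&
	    !imap_spans_range(&imap, offset_fsb, end_fsb)) {
		error = -EAGAIN;
		goto out_unlock;
	}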

No, please don't pick this up for stable kernels until at least 5.10
is released. This still needs some integration testing time to
ensure that we haven't introduced any other performance regressions
with this change.

We've already had one XFS upstream kernel regression in this -rc
cycle propagated to the stable kernels in 5.9.9 because the stable
process picked up a bunch of random XFS fixes within hours of them
being merged by Linus. One of those commits was a result of a
thinko, and despite the fact we found it and reverted it within a
few days, users of stable kernels have been exposed to it for a
couple of weeks. That *should never have happened*. 

This has happened before, and *again* we were lucky this wasn't
worse than it was. We were saved by the flaw being caught by our own
internal pre-write corruption verifiers (which exist because we
don't trust our code to be bug-free, let alone the collections of
random, poorly tested backports) so that it only resulted in
corruption shutdowns rather than permanent on-disk damage and data
loss.

Put simply: the stable process is flawed because it shortcuts the
necessary stabilisation testing for new code. It doesn't matter if
the merged commits have a "fixes" tag in them, that tag doesn't mean
the change is ready to be exposed to production sys

Re: [PATCH] fs/stat: set attributes_mask for STATX_ATTR_DAX

2020-11-23 Thread Dave Chinner
On Fri, Nov 20, 2020 at 06:03:18PM -0800, Darrick J. Wong wrote:
> [Adding fsdevel to cc since this is a filesystems question]
> 
> On Fri, Nov 20, 2020 at 04:58:09PM -0800, Randy Dunlap wrote:
> > Hi,
> > 
> > I don't know this code, but:
> > 
> > On 11/20/20 4:33 PM, XiaoLi Feng wrote:
> > > From: Xiaoli Feng 
> > > 
> > > keep attributes and attributes_mask are consistent for
> > > STATX_ATTR_DAX.
> > > ---
> > >  fs/stat.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index dacecdda2e79..914a61d256b0 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -82,7 +82,7 @@ int vfs_getattr_nosec(const struct path *path, struct 
> > > kstat *stat,
> > >  
> > >   if (IS_DAX(inode))
> > >   stat->attributes |= STATX_ATTR_DAX;
> > > -
> > > + stat->attributes_mask |= STATX_ATTR_DAX;
> > 
> > Why shouldn't that be:
> > 
> > if (IS_DAX(inode))
> > stat->attributes_mask |= STATX_ATTR_DAX;
> > 
> > or combine them, like this:
> > 
> > if (IS_DAX(inode)) {
> > stat->attributes |= STATX_ATTR_DAX;
> > stat->attributes_mask |= STATX_ATTR_DAX;
> > }
> > 
> > 
> > and no need to delete that blank line.
> 
> Some filesystems could support DAX but not have it enabled for this
> particular file, so this won't work.
> 
> General question: should filesystems that are /capable/ of DAX signal
> this by setting the DAX bit in the attributes mask?

I think so, yes. The attribute could be set if the right bit on the
inode were set, but it currently isn't, so the bit in the mask is set
while the bit in the attributes is not. i.e. "DAX is a valid status
bit, but it is not set for this file".

> Or is this a VFS
> feature and hence here is the appropriate place to be setting the mask?

Well, in the end it's a filesystem feature bit because it is the
filesystem policy that decides whether DAX is used or not. e.g. if
the block device is not DAX capable or dax=never mount option is
set, we should not ever set STATX_ATTR_DAX in statx for either the
attributes or attributes_mask field because the filesystem is not
DAX capable. And given that we have filesystems with multiple block
devices that can have different DAX capabilities, I think this
statx() attr state (and mask) really has to come from the
filesystem, not VFS...

> Extra question: should we only set this in the attributes mask if
> CONFIG_FS_DAX=y ?

IMO, yes, because it will always be false on CONFIG_FS_DAX=n and so
it may as well not be emitted as a supported bit in the mask.
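
To make the mask-vs-attributes distinction concrete from userspace,
here is a minimal sketch (the helper name and messages are made up for
illustration; statx() needs glibc 2.28+ or a raw syscall):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX	0x00200000
#endif

/* Report whether the fs reports DAX at all, and whether it is in use. */
static int report_dax(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) < 0) {
		perror("statx");
		return -1;
	}

	if (!(stx.stx_attributes_mask & STATX_ATTR_DAX))
		printf("%s: DAX state not reported by this filesystem\n", path);
	else if (stx.stx_attributes & STATX_ATTR_DAX)
		printf("%s: DAX is in use for this file\n", path);
	else
		printf("%s: DAX supported, but not in use for this file\n", path);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	return report_dax(argv[1]) < 0 ? 1 : 0;
}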

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] xfs: show the dax option in mount options.

2020-11-11 Thread Dave Chinner
On Wed, Nov 11, 2020 at 11:28:48AM +0100, Michal Suchánek wrote:
> On Tue, Nov 10, 2020 at 08:08:23AM +1100, Dave Chinner wrote:
> > On Mon, Nov 09, 2020 at 09:27:05PM +0100, Michal Suchánek wrote:
> > > On Mon, Nov 09, 2020 at 11:24:19AM -0800, Darrick J. Wong wrote:
> > > > On Mon, Nov 09, 2020 at 08:10:08PM +0100, Michal Suchanek wrote:
> > > > > xfs accepts both dax and dax_enum but shows only dax_enum. Show both
> > > > > options.
> > > > > 
> > > > > Fixes: 8d6c3446ec23 ("fs/xfs: Make DAX mount option a tri-state")
> > > > > Signed-off-by: Michal Suchanek 
> > > > > ---
> > > > >  fs/xfs/xfs_super.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index e3e229e52512..a3b3840d 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -163,7 +163,7 @@ xfs_fs_show_options(
> > > > >   { XFS_MOUNT_GRPID,  ",grpid" },
> > > > >   { XFS_MOUNT_DISCARD,",discard" },
> > > > >   { XFS_MOUNT_LARGEIO,",largeio" },
> > > > > - { XFS_MOUNT_DAX_ALWAYS, ",dax=always" },
> > > > > + { XFS_MOUNT_DAX_ALWAYS, ",dax,dax=always" },
> > > > 
> > > > NAK, programs that require DAX semantics for files stored on XFS must
> > > > call statx to detect the STATX_ATTR_DAX flag, as outlined in "Enabling
> > > > DAX on xfs" in Documentation/filesystems/dax.txt.
> > > statx can be used to query S_DAX.  NOTE that only regular files will
> > > ever have S_DAX set and therefore statx will never indicate that S_DAX
> > > is set on directories.
> > 
> > Yup, by design.
> > 
> > The application doesn't need to do anything complex to make this
> > work. If the app wants to use DAX, then it should use
> > FS_IOC_FS{GS}ETXATTR to always set the on disk per inode DAX flags
> > for its data dirs and files, and then STATX_ATTR_DAX will *always*
> > tell it whether DAX is actively in use at runtime. It's pretty
> > simple, really.
> > 
> > > The filesystem may not have any files so statx cannot be used.
> > 
> > Really?  The app or installer is about to *write to the fs* and has
> > all the permissions it needs to modify the contents of the fs. It's
> > pretty simple to create a tmpdir, set the DAX flag on the tmpdir,
> > then create a tmpfile in the tmpdir and run STATX_ATTR_DAX on it to
> > see if DAX is active or not.
> 
> Have you ever seen a 'wizard' style installer?

I wrote my first one in 1995 on Windows NT 3.51 using Installshield.

> Like one that firsts asks what to install, and then presents a list of
> suitable locations that have enough space, supported filesystem features
> enabled, and whatnot?

Hold on, 1995 is calling me. The application I was packaging used
ACLs. But the NTFS version created by Windows NT 3.1 was
incompatible as ACL support didn't arrive until NT 3.51 and service
pack 4(?) for NT 3.1. Yes, I had to write code to probe the
filesystems to detect whether ACL support was available or not by
-trying to create an ACL-.

I guess you could say "been there, done that, learnt the lesson".

> So to present a list of mountpoints that support DAX one has to scribble
> over every mountpoint on the system?

If you are filtering storage options presented to the user by
supported features, then you have to probe for them in some way.
And that means you have to consider that many optional filesystem
features that applications use cannot be detected via mount options
or by checking the filesystem config. That is, there are features that can
only be discovered by actually testing whether they work or not.

> That sounds ridiculous.

Reality is a harsh mistress. :/

[snip the rest because you're being ridiculous]

Are you aware of ndctl?

$ ndctl list
[
  {
"dev":"namespace1.0",
"mode":"fsdax",
"map":"mem",
"size":8589934592,
"sector_size":512,
"blockdev":"pmem1"
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":8589934592,
"sector_size":512,
"blockdev":"pmem0"
  }
]

Oh, look there are two block devices on this machine that are
configure

Re: [PATCH 1/2] xfs: show the dax option in mount options.

2020-11-09 Thread Dave Chinner
On Mon, Nov 09, 2020 at 09:27:05PM +0100, Michal Suchánek wrote:
> On Mon, Nov 09, 2020 at 11:24:19AM -0800, Darrick J. Wong wrote:
> > On Mon, Nov 09, 2020 at 08:10:08PM +0100, Michal Suchanek wrote:
> > > xfs accepts both dax and dax_enum but shows only dax_enum. Show both
> > > options.
> > > 
> > > Fixes: 8d6c3446ec23 ("fs/xfs: Make DAX mount option a tri-state")
> > > Signed-off-by: Michal Suchanek 
> > > ---
> > >  fs/xfs/xfs_super.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index e3e229e52512..a3b3840d 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -163,7 +163,7 @@ xfs_fs_show_options(
> > >   { XFS_MOUNT_GRPID,  ",grpid" },
> > >   { XFS_MOUNT_DISCARD,",discard" },
> > >   { XFS_MOUNT_LARGEIO,",largeio" },
> > > - { XFS_MOUNT_DAX_ALWAYS, ",dax=always" },
> > > + { XFS_MOUNT_DAX_ALWAYS, ",dax,dax=always" },
> > 
> > NAK, programs that require DAX semantics for files stored on XFS must
> > call statx to detect the STATX_ATTR_DAX flag, as outlined in "Enabling
> > DAX on xfs" in Documentation/filesystems/dax.txt.
> statx can be used to query S_DAX.  NOTE that only regular files will
> ever have S_DAX set and therefore statx will never indicate that S_DAX
> is set on directories.

Yup, by design.

The application doesn't need to do anything complex to make this
work. If the app wants to use DAX, then it should use
FS_IOC_FS{GS}ETXATTR to always set the on disk per inode DAX flags
for its data dirs and files, and then STATX_ATTR_DAX will *always*
tell it whether DAX is actively in use at runtime. It's pretty
simple, really.

> The filesystem may not have any files so statx cannot be used.

Really?  The app or installer is about to *write to the fs* and has
all the permissions it needs to modify the contents of the fs. It's
pretty simple to create a tmpdir, set the DAX flag on the tmpdir,
then create a tmpfile in the tmpdir and run STATX_ATTR_DAX on it to
see if DAX is active or not.
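
A sketch of that probe using the existing FS_IOC_FSGETXATTR /
FS_IOC_FSSETXATTR ioctls and statx(); the scratch file name and return
convention are made up for illustration, and error cleanup is trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_DAX */

#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX	0x00008000
#endif
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX	0x00200000
#endif

/* Returns 1 if DAX is active for files created in scratch_dir, 0 if not,
 * -1 on error. scratch_dir must be a writable directory on the fs to test. */
static int dax_probe(const char *scratch_dir)
{
	struct fsxattr fsx;
	struct statx stx;
	int dirfd, fd, ret = -1;

	dirfd = open(scratch_dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	/* Set the inherit-on-create DAX flag on the directory itself. */
	if (ioctl(dirfd, FS_IOC_FSGETXATTR, &fsx) < 0)
		goto out;
	fsx.fsx_xflags |= FS_XFLAG_DAX;
	if (ioctl(dirfd, FS_IOC_FSSETXATTR, &fsx) < 0)
		goto out;

	/* New files created in that directory inherit the on-disk flag... */
	fd = openat(dirfd, "dax-probe.tmp", O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		goto out;

	/* ...and STATX_ATTR_DAX says whether DAX is actually in use. */
	if (statx(fd, "", AT_EMPTY_PATH, STATX_BASIC_STATS, &stx) == 0)
		ret = !!(stx.stx_attributes & STATX_ATTR_DAX);

	close(fd);
	unlinkat(dirfd, "dax-probe.tmp", 0);
out:
	close(dirfd);
	return ret;
}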

However, keep in mind that from a system design perspective having
the installer detect DAX properties to make application level
install/config decisions is problematic from a lot of different
angles.

- DAX is a property of the *block device*, not the filesystem, so the
  filesystem can make arbitrary decisions on whether to use DAX or
  not to access data and these can change at any time without
  warning.

- Some filesystems may not have any user-visible signs they are
  using DAX to access data except for STATX_ATTR_DAX because they
  always use DAX and only work on DAX-capable block devices, e.g.
  NVFS.

- For filesystems where DAX is optional, the user can -always-
  change the dax state of the fs (mount options) or even parts of
  the filesystem (per inode flags) at any time after the installer
  has run.

- The application might be storing its data on a different
  filesystem that isn't mounted at install time, so the installer
  has no chance of detecting that the application is going to use
  DAX enabled storage.

IOWs, the installer cannot make decisions based on DAX state on
behalf of applications because it does not know what environment the
application is going to be configured to run in.  DAX can only be
detected reliably by the application at runtime inside its
production execution environment.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

