Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Dan Williams
Verma, Vishal L wrote:
> > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
> > >   if (!victim)
> > >   return -ENXIO;
> > >  
> > > - rc = down_write_killable(&dax_region_rwsem);
> > > - if (rc)
> > > - return rc;
> > > - rc = down_write_killable(&dax_dev_rwsem);
> > > - if (rc) {
> > > - up_write(&dax_region_rwsem);
> > > - return rc;
> > > - }
> > > + device_lock(dev);
> > > + device_lock(victim);
> > >   dev_dax = to_dev_dax(victim);
> > > + rc = down_write_killable(&dax_dev_rwsem);
> > 
> > This begs the question, why down_write_killable(), but not
> > device_lock_interruptible()?
> 
> Do you mean change the device_lock()s to device_lock_interruptible() in
> addition to taking the rwsem (i.e. not instead of the rwsem..)?

I mean convert the rwsem to drop _killable.

> I guess I just restored what was there previously - but the
> interruptible variant makes sense, I can make that change.

So the original code did device_lock(), then the rework added killable
rwsem (deleted device_lock()), and now the fixes add device_lock() back.
Now that there is a mix of killable/interruptible lock usage, all the
locks should agree.

Since there really is no risk of these operations being long running,
there is no driving need to make them killable/interruptible, so go with
the simple option.
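For reference, a minimal sketch of the direction suggested above (all
locks taken unconditionally); the lookup and unwind details are elided
and illustrative, not the actual upstream implementation:

	static ssize_t delete_store(struct device *dev,
			struct device_attribute *attr, const char *buf, size_t len)
	{
		struct device *victim;
		struct dev_dax *dev_dax;
		ssize_t rc = len;

		/* ... victim device lookup elided ... */

		device_lock(dev);	/* heaviest lock: the region device */
		device_lock(victim);
		dev_dax = to_dev_dax(victim);

		down_write(&dax_dev_rwsem);	/* short hold: no _killable needed */
		/* ... capacity check and unregistration elided ... */
		up_write(&dax_dev_rwsem);

		device_unlock(victim);
		device_unlock(dev);
		return rc;
	}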



Re: [PATCH v2 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> In size_show(), the dax_dev_rwsem only needs a read lock, but was
> acquiring a write lock. Change it to down_read_interruptible() so it
> doesn't unnecessarily hold a write lock.
> 
> Cc: Dan Williams 
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



Re: [PATCH v2 3/4] dax/bus.c: Don't use down_write_killable for non-user processes

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> Change an instance of down_write_killable() to a simple down_write() where
> there is no user process that might want to interrupt the operation.
> 
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> was a bit overzealous in eliminating device_lock() usage, and ended up
> removing a couple of lock acquisitions which were needed. Fix the
> conditional locking missteps that the above commit introduced in
> unregister_dev_dax() and unregister_dax_mapping().

I think it makes sense to tell the story a bit about why the
delete_store() conversion was problematic, because the
unregister_dev_dax() changes were just a knock-on effect of fixing the
delete_store() flow.

Something like:

---
commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
aimed to undo device_lock() abuses for protecting changes to dax-driver
internal data-structures like the dax_region resource tree to
device-dax-instance range structures. However, the device_lock() was
legitimately enforcing that devices to be deleted were not currently
attached to any driver nor assigned any capacity from the region.
---

...you can fill in a couple of notes about the knock-on fixups after that
was restored.

> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 44 ++--
>  1 file changed, 10 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 7924dd542a13..4e04b228b080 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>   trim_dev_dax_range(dev_dax);
>  }
>  
> -static void __unregister_dev_dax(void *dev)
> +static void unregister_dev_dax(void *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>  
>   dev_dbg(dev, "%s\n", __func__);
>  
> + down_write(&dax_region_rwsem);
>   kill_dev_dax(dev_dax);
>   device_del(dev);
>   free_dev_dax_ranges(dev_dax);
>   put_device(dev);
> -}
> -
> -static void unregister_dev_dax(void *dev)
> -{
> - if (rwsem_is_locked(&dax_region_rwsem))
> - return __unregister_dev_dax(dev);
> -
> - if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
> - return;
> - __unregister_dev_dax(dev);
>   up_write(&dax_region_rwsem);
>  }
>  
> @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
>   if (!victim)
>   return -ENXIO;
>  
> - rc = down_write_killable(&dax_region_rwsem);
> - if (rc)
> - return rc;
> - rc = down_write_killable(&dax_dev_rwsem);
> - if (rc) {
> - up_write(&dax_region_rwsem);
> - return rc;
> - }
> + device_lock(dev);
> + device_lock(victim);
>   dev_dax = to_dev_dax(victim);
> + rc = down_write_killable(&dax_dev_rwsem);

This begs the question, why down_write_killable(), but not
device_lock_interruptible()?

I do not expect any of this is long running so likely down_write() is
sufficient here, especially since the heaviest locks to acquire are
already held by the time the rwsem is considered.

Other than that this looks good to me:

You can include my Reviewed-by on the next posting.



Re: [PATCH v2 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held, or
> lockdep_assert_held_write(). Replace these as appropriate.
> 
> Link: https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch [1]
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Cc: Andrew Morton 
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-03 Thread Dan Williams
Vishal Verma wrote:
> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held(_write)().
> 
> Replace those, and additionally, replace a couple of other
> WARN_ON_ONCE() introduced in the same patch for actual failure
> cases (i.e. when acquiring a semaphore fails in a remove / unregister
> path) with dev_WARN_ONCE() as is the precedent here.

/me goes to look why failures are warned vs bubbling up the error...

> 
> Recall that previously, unregistration paths were implicitly protected by
> overloading the device lock, which the patch in [1] sought to remove.
> This meant adding a semaphore acquisition in these unregistration paths.
> Since that can fail, and it doesn't make sense to return errors from
> these paths, retain the two instances of (now) dev_WARN_ONCE().
> 
> Link: https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch [1]
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Cc: Andrew Morton 
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 797e1ebff299..d89c8c3b62f7 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
[..]
> @@ -726,7 +728,8 @@ static void unregister_dax_mapping(void *data)

...and realizes that this hunk is confused.

>   if (rwsem_is_locked(&dax_region_rwsem))
>   return __unregister_dax_mapping(data);

This is a bug. Either it is safe to call this without the lock held, or it
must be safe to always acquire the lock. Anything else means the caller
needs to be refactored. Conditional locking is an indicator of a deeper
problem with code organization.

Yes, this was not part of the fixup, but the changelog led me here
because no WARNs should remain at the end of this effort.

I think the confusion results from replacing *all* device_lock() usage
when some device_lock() is still needed.

> - if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
> + if (dev_WARN_ONCE(data, down_write_killable(&dax_region_rwsem) != 0,
> +   "unable to acquire region rwsem\n"))

In a context like this, which cannot return an error directly to the user
process making the request (unlike a sysfs operation handler), the
locking cannot worry about signals. This would become an unconditional
down_write() lock. So down_write_killable() is typically for request
contexts where there is a user process waiting for a result. For
contexts like async driver probing where we might be in a kernel thread,
and definitely in functions that return 'void', it is a bug to use
fallible locks.
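To make the rule concrete, here is an illustrative sketch (the function
names are hypothetical) of the two contexts described above:

	/*
	 * sysfs write handler: a user process is waiting on the result, so
	 * a fallible lock that bails out on a fatal signal is acceptable.
	 */
	static ssize_t example_store(struct device *dev,
			struct device_attribute *attr, const char *buf, size_t len)
	{
		if (down_write_killable(&dax_region_rwsem))
			return -EINTR;
		/* ... modify region configuration ... */
		up_write(&dax_region_rwsem);
		return len;
	}

	/*
	 * void teardown callback: there is no process to report an error
	 * to, so the lock must be unconditional.
	 */
	static void example_unregister(void *data)
	{
		down_write(&dax_region_rwsem);
		/* ... unregister work ... */
		up_write(&dax_region_rwsem);
	}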



Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Dan Williams
Rick Edgecombe wrote:
> The mm_struct contains a function pointer *get_unmapped_area(), which
> is set to either arch_get_unmapped_area() or
> arch_get_unmapped_area_topdown() during the initialization of the mm.
> 
> Since the function pointer only ever points to two functions that are named
> the same across all arch's, a function pointer is not really required. In
> addition future changes will want to add versions of the functions that
> take additional arguments. So to save a pointer's worth of bytes in
> mm_struct, and prevent adding additional function pointers to mm_struct in
> future changes, remove it and keep the information about which
> get_unmapped_area() to use in a flag.
> 
> Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
> mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing
> behavior mm->get_unmapped_area() would get copied to the new mm in
> dup_mm(), so not clobbering the flag preserves the existing behavior
> around inheriting the topdown-ness.
> 
> Introduce a helper, mm_get_unmapped_area(), to easily convert code that
> refers to the old function pointer to instead select and call either
> arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
> flag. Then drop the mm->get_unmapped_area() function pointer. Leave the
> get_unmapped_area() pointer in struct file_operations alone. The main
> purpose of this change is to reorganize in preparation for future changes,
> but it also converts the calls of mm->get_unmapped_area() from indirect
> branches into direct ones.
> 
> The stress-ng bigheap benchmark calls realloc a lot, which calls through
> get_unmapped_area() in the kernel. On x86, the change yielded a ~1%
> improvement there on a retpoline config.
> 
> In testing a few x86 configs, removing the pointer unfortunately didn't
> result in any actual size reductions in the compiled layout of mm_struct.
> But depending on compiler or arch alignment requirements, the change could
> shrink the size of mm_struct.
> 
> Signed-off-by: Rick Edgecombe 
> Acked-by: Dave Hansen 
> Acked-by: Liam R. Howlett 
> Reviewed-by: Kirill A. Shutemov 
> Cc: linux-s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: io-ur...@vger.kernel.org
> Cc: b...@vger.kernel.org
> ---
> v4:
>  - Split out pde_get_unmapped_area() refactor into separate patch
>(Christophe Leroy)
> 
> v3:
>  - Fix comment that still referred to mm->get_unmapped_area()
>  - Resolve trivial rebase conflicts with "mm: thp_get_unmapped_area must
>honour topdown preference"
>  - Spelling fix in log
> 
> v2:
>  - Fix comment on MMF_TOPDOWN (Kirill, rppt)
>  - Move MMF_TOPDOWN to actually unused bit
>  - Add MMF_TOPDOWN to MMF_INIT_MASK so it doesn't get clobbered on fork,
>and result in the children using the search up path.
>  - New lower performance results after above bug fix
>  - Add Reviews and Acks
> ---
>  arch/s390/mm/hugetlbpage.c   |  2 +-
>  arch/s390/mm/mmap.c  |  4 ++--
>  arch/sparc/kernel/sys_sparc_64.c | 15 ++-
>  arch/sparc/mm/hugetlbpage.c  |  2 +-
>  arch/x86/kernel/cpu/sgx/driver.c |  2 +-
>  arch/x86/mm/hugetlbpage.c|  2 +-
>  arch/x86/mm/mmap.c   |  4 ++--
>  drivers/char/mem.c   |  2 +-
>  drivers/dax/device.c |  6 +++---
>  fs/hugetlbfs/inode.c |  4 ++--
>  fs/proc/inode.c  |  3 ++-
>  fs/ramfs/file-mmu.c  |  2 +-
>  include/linux/mm_types.h |  6 +-
>  include/linux/sched/coredump.h   |  5 -
>  include/linux/sched/mm.h |  5 +
>  io_uring/io_uring.c  |  2 +-
>  kernel/bpf/arena.c   |  2 +-
>  kernel/bpf/syscall.c |  2 +-
>  mm/debug.c   |  6 --
>  mm/huge_memory.c |  9 -
>  mm/mmap.c| 21 ++---
>  mm/shmem.c   | 11 +--
>  mm/util.c|  6 +++---
>  23 files changed, 66 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index c2e8242bd15d..219d906fe830 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -328,7 +328,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   goto check_asce_limit;
>   }
>  
> - if (mm->get_unmapped_area == arch_get_unmapped_area)
> + if (!test_bit(MMF_TOPDOWN, &mm->flags))
>   addr = hugetlb_get_unmapped_area_bottomup(file, addr, len,
>   pgoff, flags);
>   else
> diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
> index b14fc0887654..6b2e4436ad4a 100644
> --- a/arch/s390/mm/mmap.c
> +++ b/arch/s390/mm/mmap.c
> @@ 
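For reference, a sketch of the mm_get_unmapped_area() helper the
changelog above describes, reconstructed from the flag usage shown in
the diff (a sketch, not a verbatim copy of the merged code):

	unsigned long
	mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
			     unsigned long addr, unsigned long len,
			     unsigned long pgoff, unsigned long flags)
	{
		/* select the bottom-up or top-down arch helper via the flag */
		if (test_bit(MMF_TOPDOWN, &mm->flags))
			return arch_get_unmapped_area_topdown(file, addr, len,
							      pgoff, flags);
		return arch_get_unmapped_area(file, addr, len, pgoff, flags);
	}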

Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Dan Williams
Vishal Verma wrote:
> The dax driver incorrectly used driver-core device locks to protect
> internal dax region and dax device configuration structures. Replace the
> device lock usage with a local rwsem, one each for dax region
> configuration and dax device configuration. As a result of this
> conversion, no device_lock() usage remains in dax/bus.c.
> 
> Cc: Dan Williams 
> Reported-by: Greg Kroah-Hartman 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 220 ++
>  1 file changed, 157 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..cb148f74ceda 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -12,6 +12,18 @@
>  
>  static DEFINE_MUTEX(dax_bus_lock);
>  
> +/*
> + * All changes to the dax region configuration occur with this lock held
> + * for write.
> + */
> +DECLARE_RWSEM(dax_region_rwsem);
> +
> +/*
> + * All changes to the dax device configuration occur with this lock held
> + * for write.
> + */
> +DECLARE_RWSEM(dax_dev_rwsem);
> +
>  #define DAX_NAME_LEN 30
>  struct dax_id {
>   struct list_head list;
> @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
>   u64 size = 0;
>   int i;
>  
> - device_lock_assert(&dev_dax->dev);
> + WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));

Apologies for the late review, but...

All of these WARN_ON_ONCE() usages should be replaced with
lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
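For illustration, the requested conversion looks like this (a sketch
using the names from the quoted patch):

	/* before: only reports, and only once */
	WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));

	/* after: lockdep knows the acquirer, and the assert compiles away
	 * entirely when CONFIG_LOCKDEP is disabled */
	lockdep_assert_held(&dax_dev_rwsem);		/* read or write held */
	lockdep_assert_held_write(&dax_dev_rwsem);	/* write held */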



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-16 Thread Dan Williams
Ira Weiny wrote:
> Ira Weiny wrote:
> > Jonathan Cameron wrote:
> > > On Wed, 14 Feb 2024 10:23:10 -0500
> > > Steven Rostedt  wrote:
> > > 
> > > > On Wed, 14 Feb 2024 12:11:53 +
> > > > Jonathan Cameron  wrote:
> > > > 
> > > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > > land and
> > > > > assume this will be resolved as well?  
> > > > 
> > > > That pretty much sums up what I was about to say ;-)
> > > > 
> > > > tp_printk is more of a hack and should only be used sparingly. With the right
> > > > trace events it can hang the machine.
> > > > 
> > > > So, you can use your internal patch locally, but I would recommend 
> > > > waiting
> > > > for the new printk changes to land.
> > 
> > Steven, Do you think that will land in 6.9?
> > 
> > > >
> > > > I'm really hoping that will be soon!
> > > > 
> > > > -- Steve
> > > 
> > > Thanks Steve,
> > > 
> > > Ira's fix is needed for other valid locking reasons - this was 'just 
> > > another'
> > > lock debugging report that came up whilst testing it.
> > > 
> > > For this patch (not a potential additional one that we aren't going to do 
> > > ;)
> > > 
> > > Tested-by: Jonathan Cameron 
> > > Reviewed-by: Jonathan Cameron 
> > 
> > Jonathan,
> > 
> > Again thanks for the testing!  However, Dan and I just discussed this and
> > he has an uneasy feeling about going forward with this for 6.8 final.
> > 
> > If we revert the following patch I can squash this fix and wait for the
> > tp_printk() fix to land in 6.9 and resubmit.
> > 
> > Dan here is the patch which backs out the actual bug:
> > 
> > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 
> 
> Unfortunately this is not the only patch.
> 
> We need to revert this too:
> 
> Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") 
> 
> And then revert ...
>   Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 
> 
> ... but there is a conflict.
> 
> Dan, below is the correct revert patch.  Let me know if you need more.
> 
> Ira
> 
> commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab
> Author: Ira Weiny 
> Date:   Wed Feb 14 15:25:24 2024 -0800
> 
> Revert "acpi/ghes: Process CXL Component Events"
> 
> This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5.

Even reverts need changelogs, I can add one. I got conflicts trying to
apply this to current fixes branch. I think I am going to just
surgically backout the drivers/acpi/apei/ghes.c changes.



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-16 Thread Dan Williams
Ira Weiny wrote:
[..]
> > As Steve said, tp_printk is a hack (a very useful one) and
> > hopefully no one runs it in production.
> 
> OMG...  I did not realize what tp_printk() was exactly.  I should have
> looked closer.
> 
> Do we have evidence of its use in production?
> 
> I would love to not have to revert/respin,

The revert is for 2 non-trivial fixes needed in one new feature, let's
just circle back and get it right for v6.9. The tp_printk() was not the
final straw for me.



RE: CPU data cache across reboot/kexec for pmem/dax devices

2024-02-09 Thread Dan Williams
Mathieu Desnoyers wrote:
> Hi Dan,
> 
> In the context of extracting user-space trace data when the kernel crashes,
> the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area
> of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the
> resulting device to create/mount a dax-enabled fs (e.g. ext4).
> 
> We then use this filesystem to mmap() the shared memory files for the tracer.
> 
> I want to make sure that the very last events from the userspace tracer 
> written
> to the memory mapped buffers (mmap()) by userspace are present after a
> warm-reboot (or kexec/kdump).
> 
> Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush
> (or equivalent pmem_persist() from libpmem) for performance reasons: ring 
> buffer
> data is usually overwritten many times before the system actually crashes, and
> the only thing we really need to make sure is that the cache lines are not
> invalidated without write back.
> 
> So I understand that the main use-case for pmem is nvdimm, and that in order 
> to
> guarantee persistence of the data on power off, an explicit pmem_persist() is
> needed after each "transaction", but for the needs of tracing, is there some
> kind of architectural guarantee that the data present in the cpu data cache
> is not invalidated prior to write back in each of those scenarios ?
>
> - reboot with bios explicitly not clearing memory,

This one gives me pause, because a trip through the BIOS typically means
lots of resets and other low level magic, so this would likely require
pushing dirty data out of CPU caches prior to entering the BIOS code
paths.

So this either needs explicit cache flushing or mapping the memory with
write-through semantics. That latter one is not supported in the stack
today.

> - kexec/kdump.

This should maintain the state of CPU caches. As far as the CPU is
concerned it is just long jumping into a new kernel in memory without
resetting any CPU cache state.
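Returning to the explicit-flush option for the reboot-through-BIOS case:
a minimal user-space sketch using libpmem (this assumes the tracer
buffers live on a pmem-backed DAX mapping; it is illustrative only, not
part of LTTng):

	#include <libpmem.h>

	/*
	 * Explicitly write back the tracer ring-buffer cache lines so a
	 * trip through the BIOS cannot lose them. addr/len describe the
	 * mmap()ed region on the DAX filesystem.
	 */
	static void tracer_flush_buffers(const void *addr, size_t len)
	{
		/* issues CLWB/CLFLUSHOPT plus a fence, as appropriate */
		pmem_persist(addr, len);
	}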



Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures

2024-01-30 Thread Dan Williams
Dave Chinner wrote:
> On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote:
> > commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> > prevents DAX from building on architectures with virtually aliased
> > dcache with:
> > 
> >   depends on !(ARM || MIPS || SPARC)
> > 
> > This check is too broad (e.g. recent ARMv7 don't have virtually aliased
> > dcaches), and also misses many other architectures with virtually
> > aliased dcache.
> > 
> > This is a regression introduced in the v5.13 Linux kernel where the
> > dax mount option is removed for 32-bit ARMv7 boards which have no dcache
> > aliasing, and therefore should work fine with FS_DAX.
> > 
> > This was turned into the following implementation of dax_is_supported()
> > by a preparatory change:
> > 
> > return !IS_ENABLED(CONFIG_ARM) &&
> >!IS_ENABLED(CONFIG_MIPS) &&
> >!IS_ENABLED(CONFIG_SPARC);
> > 
> > Use dcache_is_aliasing() instead to figure out whether the environment
> > has aliasing dcaches.
> > 
> > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> > Signed-off-by: Mathieu Desnoyers 
> > Cc: Andrew Morton 
> > Cc: Linus Torvalds 
> > Cc: linux...@kvack.org
> > Cc: linux-a...@vger.kernel.org
> > Cc: Dan Williams 
> > Cc: Vishal Verma 
> > Cc: Dave Jiang 
> > Cc: Matthew Wilcox 
> > Cc: Arnd Bergmann 
> > Cc: Russell King 
> > Cc: nvd...@lists.linux.dev
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > ---
> >  include/linux/dax.h | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index cfc8cd4a3eae..f59e604662e4 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -5,6 +5,7 @@
> > >  #include <linux/ioport.h>
> > >  #include <linux/mm.h>
> > >  #include <linux/radix-tree.h>
> > > +#include <linux/cacheinfo.h>
> >  
> >  typedef unsigned long dax_entry_t;
> >  
> > @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct 
> > vm_area_struct *vma,
> >  }
> >  static inline bool dax_is_supported(void)
> >  {
> > -   return !IS_ENABLED(CONFIG_ARM) &&
> > -  !IS_ENABLED(CONFIG_MIPS) &&
> > -  !IS_ENABLED(CONFIG_SPARC);
> > +   return !dcache_is_aliasing();
> 
> Yeah, if this is just a one liner should go into
> fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the
> start of the function.
> 
> I also noticed that device mapper uses fs_dax_get_by_bdev() to
> determine if it can support DAX, but this patch set does not address
> that case. Hence it really seems to me like fs_dax_get_by_bdev() is
> the right place to put this check.

Oh, good catch. Yes, I agree this can definitely be pushed down, but
then I think it should be pushed down all the way to make alloc_dax()
fail. That will need some additional fixups like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..a35e60e62440 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor)
md->dax_dev = alloc_dax(md, &dm_dax_ops);
if (IS_ERR(md->dax_dev)) {
md->dax_dev = NULL;
-   goto bad;
+   } else {
+   set_dax_nocache(md->dax_dev);
+   set_dax_nomc(md->dax_dev);
+   if (dax_add_host(md->dax_dev, md->disk))
+   goto bad;
}
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
-   goto bad;
}
 
format_dev_t(md->name, MKDEV(_major, minor));

...to make it not fatal to fail to register the dax_dev.



Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression

2024-01-29 Thread Dan Williams
Mathieu Desnoyers wrote:
> This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM,
> even on ARMv7 which does not have virtually aliased dcaches:
> 
> commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> 
> It used to work fine before: I have customers using dax over pmem on
> ARMv7, but this regression will likely prevent them from upgrading their
> kernel.
> 
> The root of the issue here is the fact that DAX was never designed to
> handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It
> touches the pages through their linear mapping, which is not consistent
> with the userspace mappings on virtually aliased dcaches. 
> 
> This patch series introduces cache_is_aliasing() with new Kconfig
> options:
> 
>   * ARCH_HAS_CACHE_ALIASING
>   * ARCH_HAS_CACHE_ALIASING_DYNAMIC
> 
> and implements it for all architectures. The "DYNAMIC" implementation
> implements cache_is_aliasing() as a runtime check, which is what is
> needed on architectures like 32-bit ARMV6 and ARMV6K.
> 
> With this we can basically narrow down the list of architectures which
> are unsupported by DAX to those which are really affected.
> 
> Feedback is welcome,

Hi Mathieu, this looks good overall, just some quibbling about the
ordering.

I would introduce dax_is_supported() with the current overly broad
interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then
fixup the filesystems to use the new helper, and finally go back and
convert dax_is_supported() to use cache_is_aliasing() internally.

Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC
needs to exist. As long as all paths that care are calling
cache_is_aliasing() then whether it is dynamic or not is something only
the compiler cares about. If those dynamic archs do not want to pay the
.text size increase they can always do CONFIG_FS_DAX=n, right?
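In other words, a single helper can hide whether the answer is a
compile-time constant or computed at runtime; something like this sketch
(names follow this thread's proposal, and the arch hook is hypothetical):

	static inline bool cache_is_aliasing(void)
	{
	#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC
		/* e.g. 32-bit ARMv6/v6K decide based on the actual CPU */
		return arch_cache_is_aliasing();
	#else
		return IS_ENABLED(CONFIG_ARCH_HAS_CACHE_ALIASING);
	#endif
	}

Callers such as dax_is_supported() then stay oblivious either way.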



Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread Dan Williams
Greg Kroah-Hartman wrote:
> On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > Use the guard(device) macro to lock a 'struct device', and unlock it
> > automatically when going out of scope using Scope Based Resource
> > Management semantics. A lot of the sysfs attribute writes in
> > drivers/dax/bus.c benefit from a cleanup using these, so change these
> > where applicable.
> 
> Wait, why are you needing to call device_lock() at all here?  Why is dax
> special in needing this when no other subsystem requires it?
> 
> > 
> > Cc: Joao Martins 
> > Cc: Dan Williams 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/dax/bus.c | 143 ++
> >  1 file changed, 59 insertions(+), 84 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..6226de131d17 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> >  {
> > struct dax_region *dax_region = dev_get_drvdata(dev);
> > -   unsigned long long size;
> >  
> > -   device_lock(dev);
> > -   size = dax_region_avail_size(dax_region);
> > -   device_unlock(dev);
> > +   guard(device)(dev);
> 
> You have a valid device here, why are you locking it?  How can it go
> away?  And if it can, shouldn't you have a local lock for it, and not
> abuse the driver core lock?

Yes, this is a driver-core lock abuse written by someone who should have
known better. And yes, a local lock to protect the dax_region resource
tree should replace this. A new rwsem to synchronize all list walks
seems appropriate.
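A sketch of that direction (illustrative; the actual conversion landed
later as the dedicated rwsem patch reviewed elsewhere in this digest):

	static DECLARE_RWSEM(dax_region_rwsem);

	static ssize_t available_size_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		struct dax_region *dax_region = dev_get_drvdata(dev);
		unsigned long long size;

		/* read-side: walking the region resource tree */
		down_read(&dax_region_rwsem);
		size = dax_region_avail_size(dax_region);
		up_read(&dax_region_rwsem);

		return sprintf(buf, "%llu\n", size);
	}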



[PATCH] driver core: Add a guard() definition for the device_lock()

2023-12-13 Thread Dan Williams
At present there are ~200 usages of device_lock() in the kernel. Some of
those usages lead to "goto unlock;" patterns which have proven to be
error prone. Define a "device" guard() definition to allow for those to
be cleaned up and prevent new ones from appearing.

Link: http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
Link: http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
Cc: Vishal Verma 
Cc: Ira Weiny 
Cc: Peter Zijlstra 
Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
Hi Greg,

I wonder if you might include this change in v6.7-rc to ease some patch
sets alternately going through my tree and Andrew's tree. Those
discussions are linked above. Alternately I can can just take it through
my tree with your ack and the other use case can circle back to it in
the v6.9 cycle.

I considered also defining a __free() helper similar to __free(mutex),
but I think "__free(device)" would be a surprising name for something
that drops a lock. Also, I like the syntax of guard(device) over
something like guard(device_lock) since a 'struct device *' is the
argument, not a lock type, but I'm open to your or Peter's thoughts on
the naming.

 include/linux/device.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index d7a72a8749ea..6c83294395ac 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
mutex_unlock(&dev->mutex);
 }
 
+DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
+
 static inline void device_lock_assert(struct device *dev)
 {
lockdep_assert_held(&dev->mutex);

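As a usage sketch (a hypothetical caller, not part of the patch), the
guard converts "goto unlock" patterns into plain returns:

	static ssize_t example_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		guard(device)(dev);	/* device_unlock() runs at any return */

		if (!dev->driver)
			return -ENXIO;
		return sprintf(buf, "%s\n", dev_name(dev));
	}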



Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Dan Williams
Verma, Vishal L wrote:
> On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
> > Vishal Verma  writes:
> > 
> > > Add a sysfs knob for dax devices to control the memmap_on_memory setting
> > > if the dax device were to be hotplugged as system memory.
> > > 
> > > The default memmap_on_memory setting for dax devices originating via
> > > pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> > > preserve legacy behavior. For dax devices via CXL, the default is on.
> > > The sysfs control allows the administrator to override the above
> > > defaults if needed.
> > > 
> > > Cc: David Hildenbrand 
> > > Cc: Dan Williams 
> > > Cc: Dave Jiang 
> > > Cc: Dave Hansen 
> > > Cc: Huang Ying 
> > > Tested-by: Li Zhijian 
> > > Reviewed-by: Jonathan Cameron 
> > > Reviewed-by: David Hildenbrand 
> > > Signed-off-by: Vishal Verma 
> > > ---
> > >  drivers/dax/bus.c   | 47 +
> > >  Documentation/ABI/testing/sysfs-bus-dax | 17 
> > >  2 files changed, 64 insertions(+)
> > > 
> > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > index 1ff1ab5fa105..2871e5188f0d 100644
> > > --- a/drivers/dax/bus.c
> > > +++ b/drivers/dax/bus.c
> > > @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
> > >  }
> > >  static DEVICE_ATTR_RO(numa_node);
> > >  
> > > +static ssize_t memmap_on_memory_show(struct device *dev,
> > > +    struct device_attribute *attr, char *buf)
> > > +{
> > > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > > +
> > > +   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> > > +}
> > > +
> > > +static ssize_t memmap_on_memory_store(struct device *dev,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t len)
> > > +{
> > > +   struct device_driver *drv = dev->driver;
> > > +   struct dev_dax *dev_dax = to_dev_dax(dev);
> > > +   struct dax_region *dax_region = dev_dax->region;
> > > +   struct dax_device_driver *dax_drv = to_dax_drv(drv);
> > > +   ssize_t rc;
> > > +   bool val;
> > > +
> > > +   rc = kstrtobool(buf, &val);
> > > +   if (rc)
> > > +   return rc;
> > > +
> > > +   if (dev_dax->memmap_on_memory == val)
> > > +   return len;
> > > +
> > > +   device_lock(dax_region->dev);
> > > +   if (!dax_region->dev->driver) {
> > > +   device_unlock(dax_region->dev);
> > > +   return -ENXIO;
> > > +   }
> > 
> > I think that it should be OK to write to "memmap_on_memory" if no driver
> > is bound to the device.  We just need to avoid to write to it when kmem
> > driver is bound.
> 
> Oh this is just a check on the region driver, not for a dax driver
> being bound to the device. It's the same as what things like
> align_store(), size_store() etc. do for dax device reconfiguration.
> 
> That said, it might be okay to remove this check, as this operation
> doesn't change any attributes of the dax region (the other interfaces I
> mentioned above can affect regions, so we want to lock the region
> device). If removing the check, we'd drop the region lock acquisition
> as well.
> 
> Dan, any thoughts on this?

Since this is a dev_dax attribute, it would have already been
synchronously shut down when dax_region->dev->driver transitioned to
NULL. I.e. region unbind causes dev_dax deletion.

However, there's a different issue here as dev->driver was referenced
without the device_lock().

Additionally, I think this function also makes it clear that a
device_lock() flavor of guard() would be useful:

DEFINE_GUARD(dev, struct device *, device_lock(_T), device_unlock(_T))

...then I would expect something like:

guard(dev)(dev);
if (dev_dax->memmap_on_memory != val && dev->driver &&
to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE)
return -EBUSY;
dev_dax->memmap_on_memory = val;
return len;

...maybe with some temp variables to reduce the dereference chain, but you
get the idea. Only prevent changes while the device is active under
kmem.
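Fleshed out a little (a sketch only; it assumes the guard(device)
definition proposed earlier in this digest, and the DAXDRV_KMEM_TYPE
check is per the idea above, not merged code):

	static ssize_t memmap_on_memory_store(struct device *dev,
			struct device_attribute *attr, const char *buf,
			size_t len)
	{
		struct dev_dax *dev_dax = to_dev_dax(dev);
		bool val;
		int rc;

		rc = kstrtobool(buf, &val);
		if (rc)
			return rc;

		guard(device)(dev);	/* dev->driver is now stable */

		/* only block changes while actively attached to kmem */
		if (dev_dax->memmap_on_memory != val && dev->driver &&
		    to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE)
			return -EBUSY;

		dev_dax->memmap_on_memory = val;
		return len;
	}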



RE: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()

2023-10-19 Thread Dan Williams
Andy Shevchenko wrote:
> The acpi_evaluate_dsm_typed() provides a way to check the type of the
> object evaluated by _DSM call. Use it instead of open coded variant.

Looks good to me.

Reviewed-by: Dan Williams 



Re: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()

2023-10-19 Thread Dan Williams
Wilczynski, Michal wrote:
> 
> 
> On 10/2/2023 3:54 PM, Andy Shevchenko wrote:
> > The acpi_evaluate_dsm_typed() provides a way to check the type of the
> > object evaluated by _DSM call. Use it instead of open coded variant.
> >
> > Signed-off-by: Andy Shevchenko 
> > ---
> >  drivers/acpi/nfit/core.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> > index f96bf32cd368..280da408c02c 100644
> > --- a/drivers/acpi/nfit/core.c
> > +++ b/drivers/acpi/nfit/core.c
> > @@ -1737,9 +1737,8 @@ __weak void nfit_intel_shutdown_status(struct nfit_mem *nfit_mem)
> > if ((nfit_mem->dsm_mask & (1 << func)) == 0)
> > return;
> >  
> > -   out_obj = acpi_evaluate_dsm(handle, guid, revid, func, &in_obj);
> > -   if (!out_obj || out_obj->type != ACPI_TYPE_BUFFER
> > -   || out_obj->buffer.length < sizeof(smart)) {
> > +   out_obj = acpi_evaluate_dsm_typed(handle, guid, revid, func, &in_obj, ACPI_TYPE_BUFFER);
> 
> This line is 90 characters long, wouldn't it be better to split it ?

100 characters is allowed these days.

> > +   if (!out_obj || out_obj->buffer.length < sizeof(smart)) {
> > dev_dbg(dev->parent, "%s: failed to retrieve initial health\n",
> > dev_name(dev));
> 
> While at it maybe fix alignment ? :-)

Life is too short to fiddle with indentation. For old code leave it
alone, for new code just use clang-format.



Re: [PATCH v3] ACPI: NFIT: Use cleanup.h helpers instead of devm_*()

2023-10-17 Thread Dan Williams
Michal Wilczynski wrote:
> The new cleanup.h facilities that arrived in v6.5-rc1 can replace the
> usage of devm semantics in acpi_nfit_init_interleave_set(). That
> routine appears to only be using devm to avoid goto statements. The
> new __free() annotation at variable declaration time can achieve the same
> effect more efficiently.
> 
> There are no end user visible side effects of this patch, I was
> motivated to send this cleanup to practice using the new helpers.
> 
> Suggested-by: Dave Jiang 
> Suggested-by: Andy Shevchenko 
> Reviewed-by: Dave Jiang 
> Reviewed-by: Andy Shevchenko 
> Signed-off-by: Michal Wilczynski 
> ---
> 
> Dan, would you like me to give you credit for the changelog changes
> with Co-developed-by tag ?

Nope, this looks good to me, thanks for fixing it up.

Reviewed-by: Dan Williams 



RE: [PATCH v2 5/6] ACPI: NFIT: Replace acpi_driver with platform_driver

2023-10-17 Thread Dan Williams
Michal Wilczynski wrote:
> NFIT driver uses struct acpi_driver incorrectly to register itself.
> This is wrong, as the instances of the ACPI devices are not meant
> to be literal devices; they're supposed to describe the ACPI entry of a
> particular device.
> 
> Use platform_driver instead of acpi_driver. In relevant places call
> platform devices instances pdev to make a distinction with ACPI
> devices instances.
> 
> NFIT driver uses devm_*() family of functions extensively. This change
> has no impact on correct functioning of the whole devm_*() family of
> functions, since the lifecycle of the device stays the same. It is still
> being created during the enumeration, and destroyed on platform device
> removal.

I notice this verbiage has the same fundamental misunderstanding of devm
allocation lifetime as the acpi_nfit_init_interleave_set() discussion.
The devm allocation lifetime typically starts in driver->probe() and
ends either with driver->probe() failure, or the driver->remove() call.
Note that the driver->remove() call is invoked not only for
platform-device removal, but also driver "unbind" events. So the
"destroyed on platform device removal" is the least likely way that
these allocations are torn down given ACPI0012 devices are never
removed.

Outside of that, my main concern about this patch is that I expect it
breaks unit tests. The infrastructure in
tools/testing/nvdimm/test/nfit.c emulates an ACPI0012 device that allows
for deeper regression testing given hardware is difficult to come by,
and because QEMU does not implement some of the tricky corner cases that
the unit tests cover.

This needs to pass tests, but fair warning, 
tools/testing/nvdimm/test/nfit.c does some non-idiomatic + "strange"
things to achieve deeper test coverage. So if this breaks the tests, as
I expect it will, the effort needed to fix the emulation could be
significant.

If you want to give running the tests a try the easiest would be to use
"run_qemu.sh" with --nfit-test option [1], or you can try to setup an
environment manually using the ndctl instructions [2].

[1]: https://github.com/pmem/run_qemu
[2]: https://github.com/pmem/ndctl#readme



Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()

2023-10-13 Thread Dan Williams
Wilczynski, Michal wrote:
> 
> 
> On 10/13/2023 7:05 PM, Dan Williams wrote:
> > Wilczynski, Michal wrote:
> >> On 10/13/2023 6:38 PM, Dan Williams wrote:
> >>> Michal Wilczynski wrote:
> >>>> devm_*() family of functions purpose is managing memory attached to a
> >>>> device. So in general it should only be used for allocations that should
> >>>> last for the whole lifecycle of the device. 
> >>> No, this assertion is not accurate, if it were strictly true then
> >>> devm_kfree() should be deleted. This patch is only a cleanup to switch
> >>> the automatic cleanup pattern from devm to the new cleanup.h helpers.
> >> The memory in question is only used locally in a function, so there is no 
> >> reason
> >> to use devm_*() family of functions. I think devm_kfree() is more for 
> >> special
> >> cases where the memory is meant to be used for the whole lifecycle of 
> >> device,
> >> but some special case occurs and it's not and it needs to be freed.
> >>
> >> This is an incorrect API usage. Would you propose to change all memory
> >> allocations currently being done to devm_*() family simply because 
> >> devm_kfree()
> >> exists ?
> > Michal, please work with someone else to get these cleanups upstream, I
> > am done with this thread.
> 
> I'm really sorry if I offended you, I didn't mean to.

Hey, it happens.

I am not offended, just frustrated.

Going forward, my advice for anyone doing advocacy for a patch set is to
be clear about "what happens if upstream does not take this patch?".
When you view it from that angle the patch changelog that would have
received an immediate Reviewed-by with no additional comment from me is
something along the lines of:

"The new cleanup.h facilities that arrived in v6.5-rc1 can replace the
the usage of devm semantics in acpi_nfit_init_interleave_set(). That
routine appears to only be using devm to avoid goto statements. The new
__free() annotation at variable declaration time can achieve the same
effect more efficiently.

There are no end user visible side effects of this patch, I was motivated
to send this cleanup to practice using the new helpers."

As Linus mentions, subtlety is difficult to convey in mailing list
interactions, and you may not have picked up on it, but the frustration
for me began with the claim of a memory leak that turned out to be
false. That was the prompt to consider "what other claims should I be
careful about now that the fundamental justification for touching this
old code has gone away."

So if you want to try again with the justification of the patch limited
to a cleanup, we can move forward.



Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()

2023-10-13 Thread Dan Williams
Wilczynski, Michal wrote:
> On 10/13/2023 6:38 PM, Dan Williams wrote:
> > Michal Wilczynski wrote:
> >> devm_*() family of functions purpose is managing memory attached to a
> >> device. So in general it should only be used for allocations that should
> >> last for the whole lifecycle of the device. 
> > No, this assertion is not accurate, if it were strictly true then
> > devm_kfree() should be deleted. This patch is only a cleanup to switch
> > the automatic cleanup pattern from devm to the new cleanup.h helpers.
> 
> The memory in question is only used locally in a function, so there is no 
> reason
> to use devm_*() family of functions. I think devm_kfree() is more for special
> cases where the memory is meant to be used for the whole lifecycle of device,
> but some special case occurs and it's not and it needs to be freed.
> 
> This is an incorrect API usage. Would you propose to change all memory
> allocations currently being done to devm_*() family simply because 
> devm_kfree()
> exists ?

Michal, please work with someone else to get these cleanups upstream, I
am done with this thread.



Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()

2023-10-13 Thread Dan Williams
Michal Wilczynski wrote:
> devm_*() family of functions purpose is managing memory attached to a
> device. So in general it should only be used for allocations that should
> last for the whole lifecycle of the device. 

No, this assertion is not accurate, if it were strictly true then
devm_kfree() should be deleted. This patch is only a cleanup to switch
the automatic cleanup pattern from devm to the new cleanup.h helpers.

I am all for modernizing code over time, but patches that make
assertions of "memory leaks" and "incorrect API usage" in code that has
been untouched for almost a decade demand more scrutiny than what
transpired here.



RE: [PATCH v1 1/2] ACPI: NFIT: Fix memory leak, and local use of devm_*()

2023-10-12 Thread Dan Williams
Michal Wilczynski wrote:
> devm_*() family of functions purpose is managing memory attached to a
> device. So in general it should only be used for allocations that should
> last for the whole lifecycle of the device. This is not the case for
> acpi_nfit_init_interleave_set(). There are two allocations that are only
> used locally in this function. What's more - if the function exits on
> error path memory is never freed. It's still attached to dev and would
> be freed on device detach, so this leak could be called a 'local leak'.

This analysis is incorrect: devm cleans up on driver ->probe() failure in
addition to ->remove(), and these error returns result in ->probe()
failures. No leak, i.e. this is not a fix.

The conversion to modern probe is ok if you want to resubmit that one
without this intervening change.



RE: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-05 Thread Dan Williams
Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
> 
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
> 
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
> 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 162 
>  1 file changed, 99 insertions(+), 63 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +  u64 start, u64 size, mhp_t mhp_flags)
> +{
> + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn =  PHYS_PFN(start),
> + .end_pfn  =  PHYS_PFN(start + size - 1),
> + };
> + int ret;
> +
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
> +
> + memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));

Isn't this just open coded kmemdup()?
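i.e., a sketch of the kmemdup() form, with the same semantics as the
quoted hunk:

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		mhp_altmap.free = memory_block_memmap_on_memory_pages();
		/* kmemdup() == kmalloc() + memcpy() in one call */
		params.altmap = kmemdup(&mhp_altmap, sizeof(mhp_altmap),
					GFP_KERNEL);
		if (!params.altmap)
			return -ENOMEM;
	}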

Other than that, I am not seeing anything else to comment on, you can add:

Reviewed-by: Dan Williams 



RE: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-05 Thread Dan Williams
Vishal Verma wrote:
> Large amounts of memory managed by the kmem driver may come in via CXL,
> and it is often desirable to have the memmap for this memory on the new
> memory itself.
> 
> Enroll kmem-managed memory for memmap_on_memory semantics if the dax
> region originates via CXL. For non-CXL dax regions, retain the existing
> default behavior of hot adding without memmap_on_memory semantics.
> 
> Add a sysfs override under the dax device to control this behavior and
> override either default.
> 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.h |  1 +
>  drivers/dax/dax-private.h |  1 +
>  drivers/dax/bus.c | 38 ++
>  drivers/dax/cxl.c |  1 +
>  drivers/dax/hmem/hmem.c   |  1 +
>  drivers/dax/kmem.c|  8 +++-
>  drivers/dax/pmem.c|  1 +
>  7 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 1ccd23360124..cbbf64443098 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -23,6 +23,7 @@ struct dev_dax_data {
>   struct dev_pagemap *pgmap;
>   resource_size_t size;
>   int id;
> + bool memmap_on_memory;
>  };
>  
>  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 27cf2d79..446617b73aea 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -70,6 +70,7 @@ struct dev_dax {
>   struct ida ida;
>   struct device dev;
>   struct dev_pagemap *pgmap;
> + bool memmap_on_memory;
>   int nr_range;
>   struct dev_dax_range {
>   unsigned long pgoff;
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 0ee96e6fc426..43be95a231c9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
>   .dax_region = dax_region,
>   .size = 0,
>   .id = -1,
> + .memmap_on_memory = false,
>   };
>   struct dev_dax *dev_dax = devm_create_dev_dax(&data);
>  
> @@ -1269,6 +1270,40 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct dax_region *dax_region = dev_dax->region;
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;

Perhaps:

if (dev_dax->memmap_on_memory == val)
return len;

...and skip the check below when it is going to be a nop

> +
> + device_lock(dax_region->dev);
> + if (!dax_region->dev->driver) {

Is the polarity backwards here? I.e. if the device is already attached to
the kmem driver it is too late to modify memmap_on_memory policy.

> + device_unlock(dax_region->dev);
> + return -ENXIO;

I would expect -EBUSY since disabling the device allows the property to be
set and -ENXIO implies a more permanent error.

> + }
> +
> + dev_dax->memmap_on_memory = val;
> +
> + device_unlock(dax_region->dev);
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);

This new attribute needs a new Documentation/ABI/ entry... in fact all of
these attributes need Documentation/ entries. I can help with that base
document to get things started.

Perhaps split this sysfs ABI into its own patch and, depending on how fast
we can pull the Documentation together, start with the
region-driver-conveyed approach in the meantime.



Re: [PATCH v1 9/9] ACPI: NFIT: Don't use KBUILD_MODNAME for driver name

2023-09-25 Thread Dan Williams
Andy Shevchenko wrote:
> On Mon, Sep 25, 2023 at 05:48:42PM +0300, Michal Wilczynski wrote:
> > Driver name is part of the ABI, so it should be hard-coded, as ABI
> > should always be kept backward compatible. Prevent ABI from changing
> > accidentally in case KBUILD_MODNAME changes.
> 
> This is up to maintainers, probably we won't have any users outside of 
> existing
> model (instantiating via ACPI ID). All the above is "strictly speaking"...

...right, more than 8 years for this "risk" to materialize indicates
it's a non-issue to me.



RE: [PATCH] drivers: nvdimm: fix dereference after free

2023-08-17 Thread Dan Williams
[ add Kajol ]

Konstantin Meskhidze wrote:
> 'nd_pmu->pmu.attr_groups' is dereferenced in the
> 'nvdimm_pmu_free_hotplug_memory' call after it has been freed. Because
> memory pointed to by the fields of 'nd_pmu->pmu.attr_groups' is
> deallocated in 'nvdimm_pmu_free_hotplug_memory', it is necessary to call
> 'kfree' after 'nvdimm_pmu_free_hotplug_memory'.

Another one that would be fixed by static attribute groups.

I do think we should move forward with these fixes as is for ease of backport,
but long term this dynamically allocated attribute groups approach needs to be
jettisoned. ...unless I am missing a concrete reason it needs to remain dynamic?
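For reference, the statically defined shape being suggested looks
roughly like this (a sketch; the attribute contents are elided and the
names are illustrative):

	static struct attribute *nvdimm_pmu_events_attrs[] = {
		/* ... event attributes ... */
		NULL,
	};

	static const struct attribute_group nvdimm_pmu_events_group = {
		.name = "events",
		.attrs = nvdimm_pmu_events_attrs,
	};

	static const struct attribute_group *nvdimm_pmu_attr_groups[] = {
		&nvdimm_pmu_events_group,
		NULL,
	};

	/* then: nd_pmu->pmu.attr_groups = nvdimm_pmu_attr_groups; nothing
	 * needs to be allocated, or kfree()d at unregister time */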



RE: [PATCH] drivers: nvdimm: fix memleak

2023-08-17 Thread Dan Williams
[ add Kajol and Madhavan ]

Konstantin Meskhidze wrote:
> Memory pointed by 'nd_pmu->pmu.attr_groups' is allocated in function
> 'register_nvdimm_pmu' and is lost after 'kfree(nd_pmu)' call in function
> 'unregister_nvdimm_pmu'.

Yes, looks like a real issue, but also completely avoidable by using statically
defined groups.

My fault for not catching this earlier, but Kajol, why is nd_perf not using
statically defined sysfs attribute groups?



Re: [PATCH v5 09/10] acpi/nfit: Move handler installing logic to driver

2023-06-30 Thread Dan Williams
Wilczynski, Michal wrote:
> 
> 
> On 6/29/2023 10:54 PM, Dan Williams wrote:
> > Michal Wilczynski wrote:
> >> Currently logic for installing notifications from ACPI devices is
> >> implemented using notify callback in struct acpi_driver. Preparations
> >> are being made to replace acpi_driver with more generic struct
> >> platform_driver, which doesn't contain notify callback. Furthermore
> >> as of now handlers are being called indirectly through
> >> acpi_notify_device(), which decreases performance.
> >>
> >> Call acpi_dev_install_notify_handler() at the end of .add() callback.
> >> Call acpi_dev_remove_notify_handler() at the beginning of .remove()
> >> callback. Change arguments passed to the notify function to match with
> >> what's required by acpi_install_notify_handler(). Remove .notify
> >> callback initialization in acpi_driver.
> >>
> >> Suggested-by: Rafael J. Wysocki 
> >> Signed-off-by: Michal Wilczynski 
> >> ---
> >>  drivers/acpi/nfit/core.c | 24 ++--
> >>  1 file changed, 18 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> >> index 95930e9d776c..a281bdfee8a0 100644
> >> --- a/drivers/acpi/nfit/core.c
> >> +++ b/drivers/acpi/nfit/core.c
> >> @@ -3312,11 +3312,13 @@ void acpi_nfit_shutdown(void *data)
> >>  }
> >>  EXPORT_SYMBOL_GPL(acpi_nfit_shutdown);
> >>  
> >> -static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
> >> +static void acpi_nfit_notify(acpi_handle handle, u32 event, void *data)
> >>  {
> >> -  device_lock(&adev->dev);
> >> -  __acpi_nfit_notify(&adev->dev, adev->handle, event);
> >> -  device_unlock(&adev->dev);
> >> +  struct acpi_device *device = data;
> >> +
> >> +  device_lock(&device->dev);
> >> +  __acpi_nfit_notify(&device->dev, handle, event);
> >> +  device_unlock(&device->dev);
> >>  }
> >>  
> >>  static int acpi_nfit_add(struct acpi_device *adev)
> >> @@ -3375,12 +3377,23 @@ static int acpi_nfit_add(struct acpi_device *adev)
> >>  
> >>if (rc)
> >>return rc;
> >> -  return devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc);
> >> +
> >> +  rc = devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc);
> >> +  if (rc)
> >> +  return rc;
> >> +
> >> +  return acpi_dev_install_notify_handler(adev,
> >> + ACPI_DEVICE_NOTIFY,
> >> + acpi_nfit_notify);
> >>  }
> >>  
> >>  static void acpi_nfit_remove(struct acpi_device *adev)
> >>  {
> >>/* see acpi_nfit_unregister */
> >> +
> >> +  acpi_dev_remove_notify_handler(adev,
> >> + ACPI_DEVICE_NOTIFY,
> >> + acpi_nfit_notify);
> > Please use devm to trigger this release rather than making
> > acpi_nfit_remove() contain any logic.
> 
> I think adding a separate devm action to remove the event handler is not
> necessary. I'll put the removal in the beginning of acpi_nfit_shutdown() if 
> you
> don't object.

How do you plan to handle an acpi_dev_install_notify_handler() failure?
acpi_nfit_shutdown() will need to have extra logic to know that it can
skip acpi_dev_remove_notify_handler() in some cases and not others..
Maybe it is ok to remove a handler that was never installed, but I would
rather not go look that up. A devm callback for
acpi_dev_remove_notify_handler() avoids that.
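i.e., something like this sketch (the callback name is hypothetical, not
an existing kernel function):

	/* devm action so acpi_nfit_remove() can stay empty */
	static void acpi_nfit_remove_notify(void *data)
	{
		struct acpi_device *adev = data;

		acpi_dev_remove_notify_handler(adev, ACPI_DEVICE_NOTIFY,
					       acpi_nfit_notify);
	}

	/* in acpi_nfit_add(), after a successful install: */
	rc = acpi_dev_install_notify_handler(adev, ACPI_DEVICE_NOTIFY,
					     acpi_nfit_notify);
	if (rc)
		return rc;
	return devm_add_action_or_reset(dev, acpi_nfit_remove_notify, adev);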



RE: [PATCH v5 09/10] acpi/nfit: Move handler installing logic to driver

2023-06-29 Thread Dan Williams
Michal Wilczynski wrote:
> Currently logic for installing notifications from ACPI devices is
> implemented using notify callback in struct acpi_driver. Preparations
> are being made to replace acpi_driver with more generic struct
> platform_driver, which doesn't contain notify callback. Furthermore
> as of now handlers are being called indirectly through
> acpi_notify_device(), which decreases performance.
> 
> Call acpi_dev_install_notify_handler() at the end of .add() callback.
> Call acpi_dev_remove_notify_handler() at the beginning of .remove()
> callback. Change arguments passed to the notify function to match with
> what's required by acpi_install_notify_handler(). Remove .notify
> callback initialization in acpi_driver.
> 
> Suggested-by: Rafael J. Wysocki 
> Signed-off-by: Michal Wilczynski 
> ---
>  drivers/acpi/nfit/core.c | 24 ++--
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 95930e9d776c..a281bdfee8a0 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3312,11 +3312,13 @@ void acpi_nfit_shutdown(void *data)
>  }
>  EXPORT_SYMBOL_GPL(acpi_nfit_shutdown);
>  
> -static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
> +static void acpi_nfit_notify(acpi_handle handle, u32 event, void *data)
>  {
> - device_lock(&adev->dev);
> - __acpi_nfit_notify(&adev->dev, adev->handle, event);
> - device_unlock(&adev->dev);
> + struct acpi_device *device = data;
> +
> + device_lock(&device->dev);
> + __acpi_nfit_notify(&device->dev, handle, event);
> + device_unlock(&device->dev);
>  }
>  
>  static int acpi_nfit_add(struct acpi_device *adev)
> @@ -3375,12 +3377,23 @@ static int acpi_nfit_add(struct acpi_device *adev)
>  
>   if (rc)
>   return rc;
> - return devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc);
> +
> + rc = devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc);
> + if (rc)
> + return rc;
> +
> + return acpi_dev_install_notify_handler(adev,
> +ACPI_DEVICE_NOTIFY,
> +acpi_nfit_notify);
>  }
>  
>  static void acpi_nfit_remove(struct acpi_device *adev)
>  {
>   /* see acpi_nfit_unregister */
> +
> + acpi_dev_remove_notify_handler(adev,
> +ACPI_DEVICE_NOTIFY,
> +acpi_nfit_notify);

Please use devm to trigger this release rather than making
acpi_nfit_remove() contain any logic.

An additional cleanup opportunity with the ->add() path fully devm
instrumented would be to just delete acpi_nfit_remove() since it is
optional and serves no purpose.



RE: [PATCH v5 08/10] acpi/nfit: Improve terminator line in acpi_nfit_ids

2023-06-29 Thread Dan Williams
Michal Wilczynski wrote:
> Currently the terminator line contains redundant characters. Remove them and
> also remove a comma at the end.
> 
> Signed-off-by: Michal Wilczynski 
> ---
>  drivers/acpi/nfit/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index aff79cbc2190..95930e9d776c 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3455,7 +3455,7 @@ EXPORT_SYMBOL_GPL(__acpi_nfit_notify);
>  
>  static const struct acpi_device_id acpi_nfit_ids[] = {
>   { "ACPI0012", 0 },
> - { "", 0 },
> + {}

Looks like a pointless change to me.



Re: [v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-27 Thread Dan Williams
Markus Elfring wrote:
> >> Would you insist on the usage of cover letters also for single patches?
> >
> > I would neither insist on it, nor prohibit it.
> 
> It seems that you can tolerate an extra message here.
> 
> 
> > It simply does not make enough difference.
> 
> Can it occasionally be a bit nicer to receive all relevant information within 
> a single patch
> instead of a combination of two messages?

No, I am the maintainer of this code and everything I needed to judge
this patch was provided. This cover letter only included inter-version
details and anything relevant for the kernel history is included in the
patch itself.

For any code I maintain inter-version details below the --- line or in a
0/1 cover letter are perfectly acceptable.

Please, if the review feedback is arbitrary, as it is in this case,
strongly consider not offering it.



RE: [PATCH v5 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-24 Thread Dan Williams
Jane Chu wrote:
> When multiple processes mmap() a dax file, then at some point,
> a process issues a 'load' and consumes a hwpoison, the process
> receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb
> set for the poison scope. Soon after, any other process issues
> a 'load' to the poisoned page (that is unmapped from the kernel
> side by memory_failure), it receives a SIGBUS with
> si_code = BUS_ADRERR and without valid si_lsb.
> 
> This is confusing to the user, and is different from a page fault due
> to poison in RAM memory; also, some helpful information is lost.
> 
> Channel dax backend driver's poison detection to the filesystem
> such that instead of reporting VM_FAULT_SIGBUS, it could report
> VM_FAULT_HWPOISON.
> 
> If user level block IO syscalls fail due to poison, the errno will
> be converted to EIO to maintain block API consistency.
> 
> Signed-off-by: Jane Chu 

Reviewed-by: Dan Williams 



RE: [PATCH v4 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-08 Thread Dan Williams
Jane Chu wrote:
> When multiple processes mmap() a dax file, then at some point,
> a process issues a 'load' and consumes a hwpoison, the process
> receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb
> set for the poison scope. Soon after, any other process issues
> a 'load' to the poisoned page (that is unmapped from the kernel
> side by memory_failure), it receives a SIGBUS with
> si_code = BUS_ADRERR and without valid si_lsb.
> 
> This is confusing to the user, and is different from a page fault due
> to poison in RAM memory; also, some helpful information is lost.
> 
> Channel dax backend driver's poison detection to the filesystem
> such that instead of reporting VM_FAULT_SIGBUS, it could report
> VM_FAULT_HWPOISON.
> 
> If user level block IO syscalls fail due to poison, the errno will
> be converted to EIO to maintain block API consistency.
> 
> Signed-off-by: Jane Chu 
> ---
>  drivers/dax/super.c  |  5 -
>  drivers/nvdimm/pmem.c|  2 +-
>  drivers/s390/block/dcssblk.c |  3 ++-
>  fs/dax.c | 11 ++-
>  fs/fuse/virtio_fs.c  |  3 ++-
>  include/linux/dax.h  |  5 +
>  include/linux/mm.h   |  2 ++
>  7 files changed, 22 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..0da9232ea175 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -203,6 +203,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, 
> pgoff_t pgoff, void *addr,
>  int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>   size_t nr_pages)
>  {
> + int ret;
> +
>   if (!dax_alive(dax_dev))
>   return -ENXIO;
>   /*
> @@ -213,7 +215,8 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
> pgoff_t pgoff,
>   if (nr_pages != 1)
>   return -EIO;
>  
> - return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
> + ret = dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
> + return dax_mem2blk_err(ret);
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ceea55f621cc..46e094e56159 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device 
> *pmem, pgoff_t pgoff,
>   long actual_nr;
>  
>   if (mode != DAX_RECOVERY_WRITE)
> - return -EIO;
> + return -EHWPOISON;
>  
>   /*
>* Set the recovery stride is set to kernel page size because
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index c09f2e053bf8..ee47ac520cd4 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -54,7 +54,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device 
> *dax_dev,
>   rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
>   &kaddr, NULL);
>   if (rc < 0)
> - return rc;
> + return dax_mem2blk_err(rc);
> +
>   memset(kaddr, 0, nr_pages << PAGE_SHIFT);
>   dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
>   return 0;
> diff --git a/fs/dax.c b/fs/dax.c
> index 2ababb89918d..a26eb5abfdc0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1148,7 +1148,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t 
> length, size_t align_size,
>   if (!zero_edge) {
>   ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
>   if (ret)
> - return ret;
> + return dax_mem2blk_err(ret);
>   }
>  
>   if (copy_all) {
> @@ -1310,7 +1310,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter)
>  
>  out_unlock:
>   dax_read_unlock(id);
> - return ret;
> + return dax_mem2blk_err(ret);
>  }
>  
>  int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> @@ -1342,7 +1342,8 @@ static int dax_memzero(struct iomap_iter *iter, loff_t 
> pos, size_t size)
>   ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, &kaddr,
>   NULL);
>   if (ret < 0)
> - return ret;
> + return dax_mem2blk_err(ret);
> +
>   memset(kaddr + offset, 0, size);
>   if (iomap->flags & IOMAP_F_SHARED)
>   ret = dax_iomap_copy_around(pos, size, PAGE_SIZE, srcmap,
> @@ -1498,7 +1499,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>  
>   map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>   DAX_ACCESS, &kaddr, NULL);
> - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
> + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
>   map_len = dax_direct_access(dax_dev, pgoff,
>   PHYS_PFN(size), DAX_RECOVERY_WRITE,
>   &kaddr, NULL);
> 
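
The include/linux/dax.h hunk that adds the new helper is not visible in
the quote above, but given how it is used in these hunks, and the
explicit -EHWPOISON to -EIO conversion discussed in the earlier
revisions of this patch, it is presumably a one-liner along these lines
(a sketch, not the verbatim merged code):

	/* translate a memory-access errno to a block-API errno at I/O boundaries */
	static inline int dax_mem2blk_err(int err)
	{
		return (err == -EHWPOISON) ? -EIO : err;
	}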

Re: [PATCH] dax/hmem: Fix refcount leak in dax_hmem_probe()

2023-06-02 Thread Dan Williams
Paul Cassella wrote:
> On Fri, 2 Jun 2023, Ira Weiny wrote:
> > Paul Cassella wrote:
> > > On Sat, 3 Dec 2022, Ira Weiny wrote:
> > > > On Sat, Dec 03, 2022 at 09:58:58AM +, Yongqiang Liu wrote:
> 
> > > > > We should always call dax_region_put() whenever devm_create_dev_dax()
> > > > > succeed or fail to avoid refcount leak of dax_region. Move the return
> > > > > value check after dax_region_put().
> 
> > > > I think dax_region_put is called from dax_region_unregister() 
> > > > automatically on
> > > > tear down.
> 
> > > Note the reference dax_region_unregister() will be putting is the one 
> > > devm_create_dev_dax() takes by kref_get(_region->kref).   I think 
> > > dax_hmem_probe() needs to put its reference in the error case, as in the 
> > > successful case.
> 
> > Looking at this again I'm inclined to agree that something is wrong.  But
> > I'm not sure this patch fixes anything.
> 
> > When you say:
> > 
> > > ...   I think 
> > > dax_hmem_probe() needs to put its reference in the error case, as in the 
> > > successful case.
> > 
> > ... which kref_get() is dax_hmem_probe() letting go?
> 
> Isn't it letting go of the initial kref_init() reference from 
> alloc_dax_region()?
> 
> Sorry, I had gone through the code a little carelessly yesterday.  Now I 
> think that kref_init() reference is the one that dax_hmem_probe() is 
> dropping in the success case, and which still needs to be dropped in the 
> error case.
> 
> (If so, I think the alloc_dax_region() path that returns NULL on 
> devm_add_action_or_reset() failure, releasing the kref_get reference, will 
> leak the kref_init reference and the region.)

Yes, *this* one looks valid. All the other patches in this thread had me
asking, "um, did you try it?".

Now, that said, the games that this init path is playing are hard to
follow and it should not be the case that alloc_dax_region() needs to
hold a reference on behalf of some future code path that might need it.

Lets make this clearer by moving the reference count management closer
to the object that needs it dax_region->ida. Where the extra references
were being taken on behalf of "static" regions just in case they needed
to be reference in dev_dax_release().

I also found a dax_mapping lifetime issue while testing this, so I
appreciate the focus here.
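
The leak pattern under discussion, reduced to a generic sketch
(illustrative names, not the actual dax code):

	struct foo *foo_alloc(void)
	{
		struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);

		if (!foo)
			return NULL;
		kref_init(&foo->kref);	/* this reference is owned by the caller */
		return foo;
	}

	static int foo_probe(struct device *dev)
	{
		struct foo *foo = foo_alloc();
		int rc;

		if (!foo)
			return -ENOMEM;
		rc = foo_create_children(foo);	   /* children take their own kref_get() */
		kref_put(&foo->kref, foo_release); /* drop on success *and* on failure */
		return rc;
	}

Dropping the kref_init() reference only on the success path, as
dax_hmem_probe() did, strands the object whenever child creation fails.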



RE: [PATCH] dax: fix missing-prototype warnings

2023-05-18 Thread Dan Williams
Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The declaration for dev_dax_probe() was removed along with the only
> caller outside of device.c. Mark it static to avoid a W=1
> warning:
> drivers/dax/device.c:399:5: error: no previous prototype for 'dev_dax_probe'
> 
> Similarly, run_dax() causes a warning, but this one is because the
> declaration needs to be included:
> 
> drivers/dax/super.c:337:6: error: no previous prototype for 'run_dax'
> 
> Fixes: 83762cb5c7c4 ("dax: Kill DEV_DAX_PMEM_COMPAT")
> Signed-off-by: Arnd Bergmann 

> ---
>  drivers/dax/device.c | 3 +--
>  drivers/dax/super.c  | 1 +
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index af9930c03c9c..30665a3ff6ea 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -396,7 +396,7 @@ static void dev_dax_kill(void *dev_dax)
>   kill_dev_dax(dev_dax);
>  }
>  
> -int dev_dax_probe(struct dev_dax *dev_dax)
> +static int dev_dax_probe(struct dev_dax *dev_dax)
>  {
>   struct dax_device *dax_dev = dev_dax->dax_dev;
> + struct device *dev = &dev_dax->dev;
> @@ -471,7 +471,6 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>   run_dax(dax_dev);
>   return devm_add_action_or_reset(dev, dev_dax_kill, dev_dax);
>  }
> -EXPORT_SYMBOL_GPL(dev_dax_probe);
>  
>  static struct dax_device_driver device_dax_driver = {
>   .probe = dev_dax_probe,

This hunk looks good.

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..8c05dae19bfe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -14,6 +14,7 @@
>  #include <linux/dax.h>
>  #include <linux/fs.h>
>  #include "dax-private.h"
> +#include "bus.h"

This one came up before, and IIRC last time I said let's just move the run_dax()
prototype to dax-private.h.

https://lore.kernel.org/all/YiqF1a9VNiSWI5j0@iweiny-desk3

I can fix that up when applying.



RE: [PATCH v3] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-04 Thread Dan Williams
Jane Chu wrote:
> When multiple processes mmap() a dax file, then at some point,
> a process issues a 'load' and consumes a hwpoison, the process
> receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb
> set for the poison scope. Soon after, any other process issues
> a 'load' to the poisoned page (that is unmapped from the kernel
> side by memory_failure), it receives a SIGBUS with
> si_code = BUS_ADRERR and without valid si_lsb.
> 
> This is confusing to the user, and is different from a page fault due
> to poison in RAM memory; also, some helpful information is lost.
> 
> Channel dax backend driver's poison detection to the filesystem
> such that instead of reporting VM_FAULT_SIGBUS, it could report
> VM_FAULT_HWPOISON.

I do think it is interesting that this will start returning SIGBUS with
BUS_MCEERR_AR for stores whereas it is only signalled for loads in the
direct consumption path, but I can't think of a scenario where that
would confuse existing software.

> Change from v2:
>   Convert -EHWPOISON to -EIO to prevent EHWPOISON errno from leaking
> out to block read(2) - suggested by Matthew.

For next time, these kinds of changelog notes belong...

> Signed-off-by: Jane Chu 
> ---

...here after the "---".

>  drivers/nvdimm/pmem.c | 2 +-
>  fs/dax.c  | 4 ++--
>  include/linux/mm.h| 2 ++
>  3 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ceea55f621cc..46e094e56159 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device 
> *pmem, pgoff_t pgoff,
>   long actual_nr;
>  
>   if (mode != DAX_RECOVERY_WRITE)
> - return -EIO;
> + return -EHWPOISON;
>  
>   /*
>* Set the recovery stride is set to kernel page size because
> diff --git a/fs/dax.c b/fs/dax.c
> index 2ababb89918d..18f9598951f1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1498,7 +1498,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>  
>   map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>   DAX_ACCESS, &kaddr, NULL);
> - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
> + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
>   map_len = dax_direct_access(dax_dev, pgoff,
>   PHYS_PFN(size), DAX_RECOVERY_WRITE,
>   &kaddr, NULL);
> @@ -1506,7 +1506,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>   recovery = true;
>   }
>   if (map_len < 0) {
> - ret = map_len;
> + ret = (map_len == -EHWPOISON) ? -EIO : map_len;

This fixup leaves out several other places where EHWPOISON could leak as
the errno for read(2)/write(2). I think I want to see something like
this:

diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..ec17f9977bcb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,14 +1077,13 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
-   size_t size, void **kaddr, pfn_t *pfnp)
+static int __dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+size_t size, void **kaddr, pfn_t *pfnp)
 {
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-   int id, rc = 0;
long length;
+   int rc = 0;
 
-   id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
   DAX_ACCESS, kaddr, pfnp);
if (length < 0) {
@@ -1113,6 +1112,36 @@ static int dax_iomap_direct_access(const struct iomap 
*iomap, loff_t pos,
return rc;
 }
 
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+  size_t size, void **kaddr, pfn_t *pfnp)
+{
+   int id, rc;
+
+   id = dax_read_lock();
+   rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+   dax_read_unlock(id);
+
+   /* don't leak a memory access error code to I/O syscalls */
+   if (rc == -EHWPOISON)
+   return -EIO;
+   return rc;
+}
+
+static int dax_fault_direct_access(const struct iomap *iomap, loff_t pos,
+  size_t size, void **kaddr, pfn_t *pfnp)
+{
+   int id, rc;
+
+   id = dax_read_lock();
+   rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+   dax_read_unlock(id);
+
+   /* -EHWPOISON return ok */
+   return rc;
+}
+
 /**
  * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page
  * by copying the data before and after the range to be written.
@@ -1682,7 +1711,7 @@ static 

Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Dan Williams
Matthew Wilcox wrote:
> On Thu, Apr 27, 2023 at 06:35:57PM -0700, Dan Williams wrote:
> > Jane Chu wrote:
> > > Hi, Dan,
> > > 
> > > On 4/27/2023 2:36 PM, Dan Williams wrote:
> > > > Jane Chu wrote:
> > > >> When the dax fault handler fails to provision the fault page due to
> > > >> hwpoison, it returns VM_FAULT_SIGBUS, which leads to a sigbus delivered
> > > >> to userspace with .si_code BUS_ADRERR.  Channel dax backend driver's
> > > >> detection on hwpoison to the filesystem to provide the precise reason
> > > >> for the fault.
> > > > 
> > > > It's not yet clear to me by this description why this is an improvement
> > > > or will not cause other confusion. In this case the reason for the
> > > > SIGBUS is because the driver wants to prevent access to poison, not that
> > > > the CPU consumed poison. Can you clarify what is lost by *not* making
> > > > this change?
> > > 
> > > Elsewhere when hwpoison is detected by page fault handler and helpers as 
> > > the direct cause to failure, VM_FAULT_HWPOISON or 
> > > VM_FAULT_HWPOISON_LARGE is flagged to ensure accurate SIGBUS payload is 
> > > produced, such as wp_page_copy() in COW case, do_swap_page() from 
> > > handle_pte_fault(), hugetlb_fault() in hugetlb page fault case where the 
> > > huge fault size would be indicated in the payload.
> > > 
> > > But dax fault has been an exception in that the SIGBUS payload does not 
> > > indicate poison, nor fault size.  I don't see why it should be though,
> > > recall an internal user expressing confusion regarding the different 
> > > SIGBUS payloads.
> > 
> > ...but again this is the typical behavior with block devices. If a block
> > device has badblock that causes page cache page not to be populated
> > that's a SIGBUS without hwpoison information. If the page cache is
> > properly populated and then the CPU consumes poison that's a SIGBUS with
> > the additional hwpoison information.
> 
> I'm not sure that's true when we mmap().  Yes, it's not consistent with
> -EIO from read(), but we have additional information here, and it's worth
> providing it.  You can think of it as *in this instance*, the error is
> found "in the page cache", because that's effectively where the error
> is from the point of view of the application?

It's true there is additional information, and applications mostly
cannot tell the difference between fault on failure to populate and
fault on access after populate.

So while it is inconsistent with what happens for the typical page cache,
you're right there's no downside to conveying the extra information
here.
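
From userspace, the difference between the two SIGBUS payloads under
discussion looks roughly like this (an illustrative handler, not code
from the patch; fprintf is used for brevity and is not
async-signal-safe):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
	{
		if (si->si_code == BUS_MCEERR_AR)
			/* hwpoison consumed: si_addr_lsb conveys the poison scope */
			fprintf(stderr, "poison at %p, lsb %d\n",
				si->si_addr, si->si_addr_lsb);
		else if (si->si_code == BUS_ADRERR)
			/* plain bus error: no poison scope available */
			fprintf(stderr, "bus error at %p\n", si->si_addr);
	}

The handler is installed with sigaction(2) and SA_SIGINFO. With the
patch applied, dax faults report the first payload whether the poison is
hit at populate time or on a later access, rather than only in the
direct-consumption case.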



Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Dan Williams
Jane Chu wrote:
> Hi, Dan,
> 
> On 4/27/2023 2:36 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> When the dax fault handler fails to provision the fault page due to
> >> hwpoison, it returns VM_FAULT_SIGBUS, which leads to a sigbus delivered
> >> to userspace with .si_code BUS_ADRERR.  Channel dax backend driver's
> >> detection on hwpoison to the filesystem to provide the precise reason
> >> for the fault.
> > 
> > It's not yet clear to me by this description why this is an improvement
> > or will not cause other confusion. In this case the reason for the
> > SIGBUS is because the driver wants to prevent access to poison, not that
> > the CPU consumed poison. Can you clarify what is lost by *not* making
> > this change?
> 
> Elsewhere when hwpoison is detected by page fault handler and helpers as 
> the direct cause to failure, VM_FAULT_HWPOISON or 
> VM_FAULT_HWPOISON_LARGE is flagged to ensure accurate SIGBUS payload is 
> produced, such as wp_page_copy() in COW case, do_swap_page() from 
> handle_pte_fault(), hugetlb_fault() in hugetlb page fault case where the 
> huge fault size would be indicated in the payload.
> 
> But dax fault has been an exception in that the SIGBUS payload does not 
> indicate poison, nor fault size.  I don't see why it should be though,
> recall an internal user expressing confusion regarding the different 
> SIGBUS payloads.

...but again this is the typical behavior with block devices. If a block
device has badblock that causes page cache page not to be populated
that's a SIGBUS without hwpoison information. If the page cache is
properly populated and then the CPU consumes poison that's a SIGBUS with
the additional hwpoison information.

Applications should have a consistent error response regardless of pmem
or dax.



RE: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Dan Williams
Jane Chu wrote:
> When the dax fault handler fails to provision the fault page due to
> hwpoison, it returns VM_FAULT_SIGBUS, which leads to a sigbus delivered
> to userspace with .si_code BUS_ADRERR.  Channel dax backend driver's
> detection on hwpoison to the filesystem to provide the precise reason
> for the fault.

It's not yet clear to me by this description why this is an improvement
or will not cause other confusion. In this case the reason for the
SIGBUS is because the driver wants to prevent access to poison, not that
the CPU consumed poison. Can you clarify what is lost by *not* making
this change?

> 
> Signed-off-by: Jane Chu 
> ---
>  drivers/nvdimm/pmem.c | 2 +-
>  fs/dax.c  | 2 +-
>  include/linux/mm.h| 2 ++
>  3 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ceea55f621cc..46e094e56159 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device 
> *pmem, pgoff_t pgoff,
>   long actual_nr;
>  
>   if (mode != DAX_RECOVERY_WRITE)
> - return -EIO;
> + return -EHWPOISON;
>  
>   /*
>* Set the recovery stride is set to kernel page size because
> diff --git a/fs/dax.c b/fs/dax.c
> index 3e457a16c7d1..c93191cd4802 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>  
>   map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>   DAX_ACCESS, &kaddr, NULL);
> - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
> + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
>   map_len = dax_direct_access(dax_dev, pgoff,
>   PHYS_PFN(size), DAX_RECOVERY_WRITE,
>   &kaddr, NULL);

This change results in EHWPOISON leaking to userspace in the case of
read(2), that's not a return code that block I/O applications have ever
had to contend with before. Just as badblocks cause EIO to be returned,
so should poisoned cachelines for pmem.


> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f79667824eb..e4c974587659 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3217,6 +3217,8 @@ static inline vm_fault_t vmf_error(int err)
>  {
>   if (err == -ENOMEM)
>   return VM_FAULT_OOM;
> + else if (err == -EHWPOISON)
> + return VM_FAULT_HWPOISON;
>   return VM_FAULT_SIGBUS;
>  }
>  
> -- 
> 2.18.4
> 
> 
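
For reference, vmf_error() is the common helper drivers use to turn an
errno into a VM_FAULT_* code at the end of a fault handler, so the
include/linux/mm.h hunk above changes behavior for any handler of this
shape (an illustrative sketch; my_dev_map_page() is hypothetical):

	static vm_fault_t my_dev_fault(struct vm_fault *vmf)
	{
		int rc = my_dev_map_page(vmf);	/* backend returns 0 or -errno */

		if (rc < 0)
			/* -EHWPOISON now maps to VM_FAULT_HWPOISON, not VM_FAULT_SIGBUS */
			return vmf_error(rc);
		return VM_FAULT_NOPAGE;
	}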





Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-20 Thread Dan Williams
lizhij...@fujitsu.com wrote:
[..]
> > Now I do think it would be a good idea to fail device_add() if the bus
> > is not registered,
> 
> BTW, below line 369: device_add() didn't fail in practical.
> 

I think that's ok because the device gets added, but never probed due to
this part of that commit I referenced:

@@ -503,20 +517,21 @@ int bus_add_device(struct device *dev)
  */
 void bus_probe_device(struct device *dev)
 {
-   struct bus_type *bus = dev->bus;
+   struct subsys_private *sp = bus_to_subsys(dev->bus);
struct subsys_interface *sif;
 
-   if (!bus)
+   if (!sp)
return;
 

...so it does what you want which is disable the libnvdimm subsystem
from binding to any devices without crashing.



Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-20 Thread Dan Williams
lizhij...@fujitsu.com wrote:
[..]
> >> Dan,
> >>
> >> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
> >> kernel booting parameter
> >> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!
> > 
> > That's expected though,
> 
> Do you mean we just keep it as it is.

Yes.

> 
> 
> > you can't block libnvdimm_init and then expect
> > modules that link to libnvdimm to work.
> Ah, we would rather see it *unable to work* than panic, isn't it.

That part is true, but consider the implications of adding error
handling to all code that can no longer depend on initcall ordering, not
just libnvdimm. This would be a large paradigm shift.

Now I do think it would be a good idea to fail device_add() if the bus
is not registered, but I *think* that happens now as a result of:

5221b82d46f2 driver core: bus: bus_add/probe/remove_device() cleanups

...can you double check if you have that commit in your tests?



Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-17 Thread Dan Williams
lizhij...@fujitsu.com wrote:
> 
> 
> On 16/03/2023 23:54, Dan Williams wrote:
> > Li Zhijian wrote:
> >> nvdimm_bus_register() could be called from other modules, such as nfit,
> >> but it can only be called after the nvdimm_bus_type is registered.
> >>
> >>   BUG: kernel NULL pointer dereference, address: 0000000000000098
> >>   #PF: supervisor read access in kernel mode
> >>   #PF: error_code(0x0000) - not-present page
> >>   PGD 0 P4D 0
> >>   Oops: 0000 [#1] PREEMPT SMP PTI
> >>   CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
> >>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> >> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> >>   RIP: 0010:bus_add_device+0x58/0x150
> >>   Call Trace:
> >> <TASK>
> >>device_add+0x3ac/0x980
> >>nvdimm_bus_register+0x16d/0x1d0
> >>acpi_nfit_init+0xb72/0x1f90 [nfit]
> >>acpi_nfit_add+0x1d5/0x200 [nfit]
> >>acpi_device_probe+0x45/0x160
> > 
> > Can you explain a bit more how to hit this crash? This has not been a
> > problem historically and the explanation above makes it sound like this
> > is a theoretical issue.
> > 
> 
> Dan,
> 
> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
> kernel booting parameter
> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!

That's expected though, you can't block libnvdimm_init and then expect
modules that link to libnvdimm to work. You would also need to block all
modules / initcalls that depend on libnvdimm_init having run. I'll
respond to the other thread with some ideas.



Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-16 Thread Dan Williams
lizhij...@fujitsu.com wrote:
> 
> 
> On 16/03/2023 23:54, Dan Williams wrote:
> > Li Zhijian wrote:
> >> nvdimm_bus_register() could be called from other modules, such as nfit,
> >> but it can only be called after the nvdimm_bus_type is registered.
> >>
> >>   BUG: kernel NULL pointer dereference, address: 0000000000000098
> >>   #PF: supervisor read access in kernel mode
> >>   #PF: error_code(0x0000) - not-present page
> >>   PGD 0 P4D 0
> >>   Oops: 0000 [#1] PREEMPT SMP PTI
> >>   CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
> >>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> >> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> >>   RIP: 0010:bus_add_device+0x58/0x150
> >>   Call Trace:
> >> <TASK>
> >>device_add+0x3ac/0x980
> >>nvdimm_bus_register+0x16d/0x1d0
> >>acpi_nfit_init+0xb72/0x1f90 [nfit]
> >>acpi_nfit_add+0x1d5/0x200 [nfit]
> >>acpi_device_probe+0x45/0x160
> > 
> > Can you explain a bit more how to hit this crash? This has not been a
> > problem historically and the explanation above makes it sound like this
> > is a theoretical issue.
> > 
> 
> Dan,
> 
> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
> kernel booting parameter
> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!
> Theoretically, it will also happen if nvdimm_bus_register() failed.
> 
> 
> For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata 
> in pmem will not be updated again in kdump kernel
> [1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/

Ah, great write up! Let me give that some thought. Apologies for missing
it earlier.

This would have been a good use for:

Link: 
https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com

...in the above changelog.



RE: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-16 Thread Dan Williams
Li Zhijian wrote:
> nvdimm_bus_register() could be called from other modules, such as nfit,
> but it can only be called after the nvdimm_bus_type is registered.
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000098
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] PREEMPT SMP PTI
>  CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>  RIP: 0010:bus_add_device+0x58/0x150
>  Call Trace:
>   <TASK>
>   device_add+0x3ac/0x980
>   nvdimm_bus_register+0x16d/0x1d0
>   acpi_nfit_init+0xb72/0x1f90 [nfit]
>   acpi_nfit_add+0x1d5/0x200 [nfit]
>   acpi_device_probe+0x45/0x160

Can you explain a bit more how to hit this crash? This has not been a
problem historically and the explanation above makes it sound like this
is a theoretical issue.

libnvdimm_init() *should* be done before the nfit driver can attempt
nvdimm_bus_register(). So, something else is broken if
nvdimm_bus_register() can be called before libnvdimm_init(), or after
libnvdimm_exit().



[GIT PULL] Compute Express Link (CXL) for 6.3

2023-02-21 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.3

...to receive the CXL update for v6.3.

To date Linux has been dependent on platform-firmware to map CXL RAM
regions and handle events / errors from devices. With this update Linux
can now parse / update the CXL memory layout, and report events / errors
from devices. This is a precursor for the CXL subsystem to handle the
end-to-end "RAS" flow for CXL memory. I.e. the flow that for
DDR-attached-DRAM is handled by the EDAC driver where it maps system
physical address events to a field-replaceable-unit (FRU / endpoint
device). In general, CXL has the potential to standardize what has
historically been a pile of memory-controller-specific error handling
logic.

Another change of note is the default policy for handling RAM-backed
device-dax instances. Previously the default access mode was "device",
mmap(2) a device special file to access memory. The new default is
"kmem" where the address range is assigned to the core-mm via
add_memory_driver_managed(). This saves typical users from wondering why
their platform memory is not visible via free(1) and stuck behind a
device-file. At the same time it allows expert users to deploy policy
to, for example, get dedicated access to high performance memory, or
hide low performance memory from general purpose kernel allocations.
This affects not only CXL, but also systems with high-bandwidth-memory
that platform-firmware tags with the EFI_MEMORY_SP (special purpose)
designation.

All of these topics have had exposure in linux-next. Where you see
multiple merges of the same topic branch it is for follow-on fixes and
updates discovered after the branch was cut for -next. Each merge has a
commit log. A few tags came in after the branch was cut as well:

f57aec443c24 cxl/pmem: Fix nvdimm registration races
Tested-by: Dave Jiang 
Reviewed-by: Dave Jiang 

e686c32590f4 dax/kmem: Fix leak of memory-hotplug resources
Reviewed-by: Alistair Popple 

After the fixups from Arnd, there have not been any other problem
reports from -next exposure.

---

The following changes since commit 711442e29f16f0d39dd0e2460c9baacfccb9d5a7:

  cxl/region: Fix passthrough-decoder detection (2023-02-07 11:04:30 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.3

for you to fetch changes up to e686c32590f40bffc45f105c04c836ffad3e531a:

  dax/kmem: Fix leak of memory-hotplug resources (2023-02-17 14:58:01 -0800)


cxl for v6.3

- CXL RAM region enumeration: instantiate 'struct cxl_region' objects
  for platform firmware created memory regions

- CXL RAM region provisioning: complement the existing PMEM region
  creation support with RAM region support

- "Soft Reservation" policy change: Online (memory hot-add)
  soft-reserved memory (EFI_MEMORY_SP) by default, but still allow for
  setting aside such memory for dedicated access via device-dax.

- CXL Events and Interrupts: Takeover CXL event handling from
  platform-firmware (ACPI calls this CXL Memory Error Reporting) and
  export CXL Events via Linux Trace Events.

- Convey CXL _OSC results to drivers: Similar to PCI, let the CXL
  subsystem interrogate the result of CXL _OSC negotiation.

- Emulate CXL DVSEC Range Registers as "decoders": Allow for
  first-generation devices that pre-date the definition of the CXL HDM
  Decoder Capability to translate the CXL DVSEC Range Registers into
  'struct cxl_decoder' objects.

- Set timestamp: Per spec, set the device timestamp in case of hotplug,
  or if platform-firwmare failed to set it.

- General fixups: linux-next build issues, non-urgent fixes for
  pre-production hardware, unit test fixes, spelling and debug message
  improvements.


Alison Schofield (2):
  tools/testing/cxl: Remove cxl_test module math loading message
  cxl/mem: Add kdoc param for event log driver state

Arnd Bergmann (3):
  cxl: avoid returning uninitialized error code
  dax: cxl: add CXL_REGION dependency
  dax/hmem: build hmem device support as module if possible

Dan Williams (35):
  cxl/pci: Move tracepoint definitions to drivers/cxl/core/
  tools/testing/cxl: Prevent cxl_test from confusing production modules
  cxl/region: Clarify when a cxld->commit() callback is mandatory
  cxl/port: Link the 'parent_dport' in portX/ and endpointX/ sysfs
  cxl/pci: Fix irq oneshot expectations
  Merge branch 'for-6.3/cxl' into cxl/next
  Merge branch 'for-6.3/cxl-events' into cxl/next
  cxl/memdev: Fix endpoint port removal
  cxl/Documentation: Update references to attributes added in v6.0
  cxl/region: Add a mode attribute for regions
  cxl/region: Support empty uuids for non-pmem regions
  cxl/r

RE: [PATCH] dax: clx: add CXL_REGION dependency

2023-02-14 Thread Dan Williams
Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> There is already a dependency on CXL_REGION, which depends on CXL_BUS,
> but since CXL_REGION is a 'bool' symbol, it's possible to configure
> DAX as built-in even though CXL itself is a loadable module:
> 
> x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_probe':
> cxl.c:(.text+0xb): undefined reference to `to_cxl_dax_region'
> x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_driver_init':
> cxl.c:(.init.text+0x10): undefined reference to `__cxl_driver_register'
> x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_driver_exit':
> cxl.c:(.exit.text+0x9): undefined reference to `cxl_driver_unregister'
> 
> Prevent this with another dependency on the tristate symbol.
> 
> Fixes: 09d09e04d2fc ("cxl/dax: Create dax devices for CXL RAM regions")
> Signed-off-by: Arnd Bergmann 
> ---

Looks good, thanks Arnd.
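
The Kconfig semantics at play: a dependency on a bool symbol carries no
module/built-in constraint, so only a dependency on the tristate symbol
forces =m to propagate. Schematically (a sketch of the rule, not the
exact merged hunk):

	config DEV_DAX_CXL
		tristate "CXL DAX: direct access to CXL RAM regions"
		depends on CXL_REGION	# bool: satisfied even when CXL_BUS=m, allows =y
		depends on CXL_BUS	# tristate: limits DEV_DAX_CXL to =m when CXL_BUS=m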



[GIT PULL] Compute Express Link (CXL) fixes for 6.2-final

2023-02-10 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-6.2

...to receive 2 fixups for CXL in the presence of passthrough decoders. This
primarily helps developers using the QEMU CXL emulation, but with the
impending arrival of CXL switches these types of topologies will be of
interest to end users.

The first commit has appeared in -next for several days with no issues.
The second has not aged as long, but has been verified by new unit
tests.

---


The following changes since commit 6d796c50f84ca79f1722bb131799e5a5710c4700:

  Linux 6.2-rc6 (2023-01-29 13:59:43 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-6.2

for you to fetch changes up to 711442e29f16f0d39dd0e2460c9baacfccb9d5a7:

  cxl/region: Fix passthrough-decoder detection (2023-02-07 11:04:30 -0800)


cxl fixes for 6.2

- Fix a crash when shutting down regions in the presence of passthrough
  decoders

- Fix region creation to understand passthrough decoders instead of the
  narrower definition of passthrough ports


Dan Williams (1):
  cxl/region: Fix passthrough-decoder detection

Fan Ni (1):
  cxl/region: Fix null pointer dereference for resetting decoder

 drivers/cxl/core/region.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)




[GIT PULL] NVDIMM and DAX fixes for 6.2-final

2023-02-10 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/libnvdimm-fixes-6.2

...to receive a fix for an issue that could cause users to
inadvertently reserve too much capacity when debugging KMSAN together
with persistent memory namespaces, a lockdep fix, and a kernel-doc build
warning fix.

This has all appeared in -next with no reported issues.

---


The following changes since commit 88603b6dc419445847923fcb7fe5080067a30f98:

  Linux 6.2-rc2 (2023-01-01 13:53:16 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/libnvdimm-fixes-6.2

for you to fetch changes up to c91d713630848460de8669e6570307b7e559863b:

  nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE (2023-01-28 
15:32:36 -0800)


libnvdimm fixes for 6.2

- Resolve the conflict between KMSAN and NVDIMM with respect to
  reserving pmem namespace / volume capacity for larger
  sizeof(struct page)

- Fix a lockdep warning in the NFIT code

- Fix a kernel-doc build warning

----
Dan Williams (1):
  nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE

Randy Dunlap (1):
  dax: super.c: fix kernel-doc bad line warning

Vishal Verma (1):
  ACPI: NFIT: fix a potential deadlock during NFIT teardown

 drivers/acpi/nfit/core.c  |  2 +-
 drivers/dax/super.c   |  2 +-
 drivers/nvdimm/Kconfig| 19 +++
 drivers/nvdimm/nd.h   |  2 +-
 drivers/nvdimm/pfn_devs.c | 42 +++---
 5 files changed, 49 insertions(+), 18 deletions(-)




[GIT PULL] Compute Express Link (CXL) fixes for 6.2-rc6

2023-01-28 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl 
tags/cxl-fixes-for-6.2-rc6

...to receive a couple fixes for bugs introduced during the merge
window. One is a regression, the other was a bug in the CXL AER handler.

---

The following changes since commit b7bfaa761d760e72a969d116517eaa12e404c262:

  Linux 6.2-rc3 (2023-01-08 11:49:43 -0600)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl 
tags/cxl-fixes-for-6.2-rc6

for you to fetch changes up to 19398821b25a9cde564265262e680ae1c2351be7:

  cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent 
(2023-01-25 15:35:26 -0800)


cxl fixes for 6.2-rc6

- Fix a crash regression due to module load order of cxl_pmem.ko

- Fix wrong register offset read in CXL AER handling path


Dan Williams (1):
  cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent

Dave Jiang (1):
  cxl: fix cxl_report_and_clear() RAS UE addr mis-assignment

 drivers/cxl/acpi.c  |  1 -
 drivers/cxl/core/pmem.c | 42 --
 drivers/cxl/pci.c   |  7 +--
 drivers/cxl/pmem.c  | 24 
 4 files changed, 33 insertions(+), 41 deletions(-)



RE: [PATCH] dax: super.c: fix kernel-doc bad line warning

2023-01-25 Thread Dan Williams
Randy Dunlap wrote:
> Convert an empty line to " *" to avoid a kernel-doc warning:
> 
> drivers/dax/super.c:478: warning: bad line: 
> 
> Signed-off-by: Randy Dunlap 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/dax/super.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -- a/drivers/dax/super.c b/drivers/dax/super.c
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax);
>  /**
>   * dax_holder() - obtain the holder of a dax device
>   * @dax_dev: a dax_device instance
> -
> + *
>   * Return: the holder's data which represents the holder if registered,
>   * otherwize NULL.
>   */

Looks good, applied.



[PATCH v2] nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE

2023-01-25 Thread Dan Williams
Commit 6e9f05dc66f9 ("libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE")

...updated MAX_STRUCT_PAGE_SIZE to account for sizeof(struct page)
potentially doubling in the case of CONFIG_KMSAN=y. Unfortunately this
doubles the amount of capacity stolen from user addressable capacity for
everyone, regardless of whether they are using the debug option. Revert
that change, mandate that MAX_STRUCT_PAGE_SIZE never exceed 64, but
allow debug scenarios to proceed with creating debug-sized page maps
via an explicit compile option.

Note that this only applies to cases where the page map is permanent,
i.e. stored in a reservation of the pmem itself ("--map=dev" in "ndctl
create-namespace" terms). For the "--map=mem" case, since the allocation
is ephemeral for the lifespan of the namespace, there is no explicit
restriction. However, the implicit restriction of having enough
available "System RAM" to store the page map for the typically large
pmem, still applies.
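
To put numbers on the tradeoff: a 1 TiB namespace holds 2^28 4 KiB
pages, so the permanent page map reservation is 16 GiB at 64 bytes per
'struct page' versus 32 GiB at a 128-byte MAX_STRUCT_PAGE_SIZE, i.e. an
extra 16 GiB stranded per terabyte of pmem when the larger size is the
default for everyone.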

Fixes: 6e9f05dc66f9 ("libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE")
Cc: 
Cc: Alexander Potapenko 
Cc: Marco Elver 
Reported-by: Jeff Moyer 
---
Changes since v1 [1]:
* Replace the module option with a compile option and a description of
  the tradeoffs to consider when running with KMSAN enabled in the
  presence of NVDIMM namespaces and their local reservation of capacity
  for a 'struct page' memmap array. (Greg)

[1]: 
https://lore.kernel.org/all/63bc8fec4744a_5178e29...@dwillia2-xfh.jf.intel.com.notmuch/

 drivers/nvdimm/Kconfig|   19 +++
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/pfn_devs.c |   42 +++---
 3 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 79d93126453d..77b06d54cc62 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,6 +102,25 @@ config NVDIMM_KEYS
depends on ENCRYPTED_KEYS
depends on (LIBNVDIMM=ENCRYPTED_KEYS) || LIBNVDIMM=m
 
+config NVDIMM_KMSAN
+   bool
+   depends on KMSAN
+   help
+ KMSAN, and other memory debug facilities, increase the size of
+ 'struct page' to contain extra metadata. This collides with
+ the NVDIMM capability to store a potentially
+ larger-than-"System RAM" size 'struct page' array in a
+ reservation of persistent memory rather than limited /
+ precious DRAM. However, that reservation needs to persist for
+ the life of the given NVDIMM namespace. If you are using KMSAN
+ to debug an issue unrelated to NVDIMMs or DAX then say N to this
+ option. Otherwise, say Y but understand that any namespaces
+ (with the page array stored pmem) created with this build of
+ the kernel will permanently reserve and strand excess
+ capacity compared to the CONFIG_KMSAN=n case.
+
+ Select N if unsure.
+
 config NVDIMM_TEST_BUILD
tristate "Build the unit test core"
depends on m
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 85ca5b4da3cf..ec5219680092 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -652,7 +652,7 @@ void devm_namespace_disable(struct device *dev,
struct nd_namespace_common *ndns);
 #if IS_ENABLED(CONFIG_ND_CLAIM)
 /* max struct page size independent of kernel config */
-#define MAX_STRUCT_PAGE_SIZE 128
+#define MAX_STRUCT_PAGE_SIZE 64
 int nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap);
 #else
 static inline int nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 61af072ac98f..c7655a1fe38c 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -13,6 +13,8 @@
 #include "pfn.h"
 #include "nd.h"
 
+const static bool page_struct_override = IS_ENABLED(CONFIG_NVDIMM_KMSAN);
+
 static void nd_pfn_release(struct device *dev)
 {
struct nd_region *nd_region = to_nd_region(dev->parent);
@@ -758,12 +760,6 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
return -ENXIO;
}
 
-   /*
-* Note, we use 64 here for the standard size of struct page,
-* debugging options may cause it to be larger in which case the
-* implementation will limit the pfns advertised through
-* ->direct_access() to those that are included in the memmap.
-*/
start = nsio->res.start;
size = resource_size(>res);
npfns = PHYS_PFN(size - SZ_8K);
@@ -782,20 +778,33 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
}
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
+   unsigned long page_map_size = MAX_STRUCT_PAGE_SIZE * npfns;
+
/*
 * The altmap should be padded out to the block size used
 * when populating the vmemmap. This *should* be equal to
 * PMD_SIZE 

RE: [PATCH] ACPI: NFIT: fix a potential deadlock during NFIT teardown

2023-01-25 Thread Dan Williams
Vishal Verma wrote:
> Lockdep reports that acpi_nfit_shutdown() may deadlock against an
> opportune acpi_nfit_scrub(). acpi_nfit_scrub () is run from inside a
> 'work' and therefore has already acquired workqueue-internal locks. It
> also acquiires acpi_desc->init_mutex. acpi_nfit_shutdown() first
> acquires init_mutex, and was subsequently attempting to cancel any
> pending workqueue items. This reversed locking order causes a potential
> deadlock:
> 
> ==
> WARNING: possible circular locking dependency detected
> 6.2.0-rc3 #116 Tainted: G   O N
> --
> libndctl/1958 is trying to acquire lock:
> 888129b461c0 
> ((work_completion)(&(_desc->dwork)->work)){+.+.}-{0:0}, at: 
> __flush_work+0x43/0x450
> 
> but task is already holding lock:
> ffff888129b460e8 (&acpi_desc->init_mutex){+.+.}-{3:3}, at: 
> acpi_nfit_shutdown+0x87/0xd0 [nfit]
> 
> which lock already depends on the new lock.
> 
> ...
> 
> Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&acpi_desc->init_mutex);
>                                lock((work_completion)(&(&acpi_desc->dwork)->work));
>                                lock(&acpi_desc->init_mutex);
>   lock((work_completion)(&(&acpi_desc->dwork)->work));
> 
> *** DEADLOCK ***
> 
> Since the workqueue manipulation is protected by its own internal locking,
> the cancellation of pending work doesn't need to be done under
> acpi_desc->init_mutex. Move cancel_delayed_work_sync() outside the
> init_mutex to fix the deadlock. Any work that starts after
> acpi_nfit_shutdown() drops the lock will see ARS_CANCEL, and the
> cancel_delayed_work_sync() will safely flush it out.
> 
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/nfit/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index f1cc5ec6a3b6..4e48d6db05eb 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3297,8 +3297,8 @@ void acpi_nfit_shutdown(void *data)
>  
>   mutex_lock(&acpi_desc->init_mutex);
>   set_bit(ARS_CANCEL, &acpi_desc->scrub_flags);
> - cancel_delayed_work_sync(&acpi_desc->dwork);
>   mutex_unlock(&acpi_desc->init_mutex);
> + cancel_delayed_work_sync(&acpi_desc->dwork);

Looks good, applied.



RE: [PATCH v2] nvdimm: Use kstrtobool() instead of strtobool()

2023-01-25 Thread Dan Williams
Christophe JAILLET wrote:
> strtobool() is the same as kstrtobool().
> However, the latter is more used within the kernel.
> 
> In order to remove strtobool() and slightly simplify kstrtox.h, switch to
> the other function name.
> 
> While at it, include the corresponding header file ()
> 
> Signed-off-by: Christophe JAILLET 
> ---
> This patch was already sent as a part of a serie ([1]) that axed all usages
> of strtobool().
> Most of the patches have been merged in -next.
> 
> I synch'ed with latest -next and re-send the remaining ones as individual
> patches.
> 
> Changes in v2:
>   - synch with latest -next.

Looks good, applied for v6.3.



RE: [PATCH rcu 11/27] drivers/dax: Remove "select SRCU"

2023-01-05 Thread Dan Williams
Paul E. McKenney wrote:
> Now that the SRCU Kconfig option is unconditionally selected, there is
> no longer any point in selecting it.  Therefore, remove the "select SRCU"
> Kconfig statements.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: 

Acked-by: Dan Williams 

Let me know if I should pick this up directly, otherwise I assume this
will go in along with the rest of the set.



[GIT PULL] Compute Express Link (CXL) for 6.2

2022-12-10 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.2

...to receive the CXL update for v6.2.

While it may seem backwards, the CXL update this time around includes
some focus on CXL 1.x enabling where the work to date had been with CXL
2.0 (VH topologies) in mind.  First generation CXL can mostly be
supported via BIOS, similar to DDR, however it became clear there are
use cases for OS native CXL error handling and some CXL 3.0 endpoint
features can be deployed on CXL 1.x hosts (Restricted CXL Host (RCH)
topologies). So, this update brings RCH topologies into the Linux CXL
device model.

In support of the ongoing CXL 2.0+ enabling, 2 new core kernel
facilities are added. One is the ability for the kernel to flag
collisions between userspace access to PCI configuration registers and
kernel accesses. This is brought on by the PCIe Data-Object-Exchange
(DOE) facility, a hardware mailbox over config-cycles. The other is a
cpu_cache_invalidate_memregion() API that maps to wbinvd_on_all_cpus()
on x86. To prevent abuse it is disabled in guest VMs and architectures
that do not support it yet. The CXL paths that need it, dynamic memory
region creation and security commands (erase / unlock), are disabled
when it is not present.

As for CXL 2.0+, this cycle the subsystem gains support for Persistent
Memory Security commands, error handling in response to PCIe AER
notifications, and support for the "XOR" host bridge interleave
algorithm.

That last feature, "XOR" interleave support, is built on top of the
ACPICA update for this cycle [1]. The shortlog and diffstat below are
from a test merge with the pending ACPI updates. So either pull the ACPI
tree first, or understand you will get some unrelated ACPICA updates in
this pull.

This has all appeared in -next with no known outstanding issues. The
x86, ACPI, and PCI touches have acks from their respective maintainers.

[1]: f350c68e3cd5 ("ACPICA: Add CXL 3.0 structures (CXIMS & RDPAS) to the CEDT 
table")

---

The following changes since commit f0c4d9fc9cc9462659728d168387191387e903cc:

  Linux 6.1-rc4 (2022-11-06 15:07:11 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.2

for you to fetch changes up to f04facfb993de47e2133b2b842d72b97b1c50162:

  cxl/region: Fix memdev reuse check (2022-12-08 13:03:47 -0800)


cxl for 6.2

- Add the cpu_cache_invalidate_memregion() API for cache flushing in
  response to physical memory reconfiguration, or memory-side data
  invalidation from operations like secure erase or memory-device unlock.

- Add a facility for the kernel to warn about collisions between kernel
  and userspace access to PCI configuration registers

- Add support for Restricted CXL Host (RCH) topologies (formerly CXL 1.1)

- Add handling and reporting of CXL errors reported via the PCIe AER
  mechanism

- Add support for CXL Persistent Memory Security commands

- Add support for the "XOR" algorithm for CXL host bridge interleave

- Rework / simplify CXL to NVDIMM interactions

- Miscellaneous cleanups and fixes



Adam Manzanares (1):
  cxl: Replace HDM decoder granularity magic numbers

Alison Schofield (3):
  cxl/acpi: Support CXL XOR Interleave Math (CXIMS)
  tools/testing/cxl: Add XOR Math support to cxl_test
  cxl/acpi: Fail decoder add if CXIMS for HBIG is missing

Colin Ian King (1):
  cxl/region: Fix spelling mistake "memergion" -> "memregion"

Dan Williams (34):
  tools/testing/cxl: Add bridge mocking support
  cxl/acpi: Simplify cxl_nvdimm_bridge probing
  cxl/region: Drop redundant pmem region release handling
  cxl/pmem: Refactor nvdimm device registration, delete the workqueue
  cxl/pmem: Remove the cxl_pmem_wq and related infrastructure
  cxl/acpi: Move rescan to the workqueue
  tools/testing/cxl: Make mock CEDT parsing more robust
  cxl/region: Fix missing probe failure
  cxl/pmem: Enforce keyctl ABI for PMEM security
  nvdimm/region: Move cache management to the region driver
  cxl/region: Manage CPU caches relative to DPA invalidation events
  cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
  cxl/pci: Cleanup cxl_map_device_regs()
  cxl/pci: Kill cxl_map_regs()
  cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
  cxl/port: Limit the port driver to just the HDM Decoder Capability
  cxl/pci: Prepare for mapping RAS Capability Structure
  cxl/pci: Find and map the RAS Capability Structure
  cxl/pci: Add (hopeful) error handling support
  Merge "ACPICA: Add CXL 3.0 structures..." into for-6.2/cxl-xor
  cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem
  cxl/port: Add RCD endpo

[GIT PULL] DAX and HMAT fixes for v6.1-rc8

2022-12-02 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/dax-fixes-6.1-rc8

...to receive a few bug fixes around the handling of "Soft Reserved"
memory and memory tiering information. Linux is starting to encounter
more real-world systems that deploy an ACPI HMAT to describe different
performance classes of memory, as well as the "special purpose" (Linux
"Soft Reserved") designation from EFI. These fixes result from that
testing.

It has all appeared in -next for a while with no known issues.

---

The following changes since commit 094226ad94f471a9f19e8f8e7140a09c2625abaa:

  Linux 6.1-rc5 (2022-11-13 13:12:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/dax-fixes-6.1-rc8

for you to fetch changes up to 472faf72b33d80aa8e7a99c9410c1a23d3bf0cd8:

  device-dax: Fix duplicate 'hmem' device registration (2022-11-21 15:34:40 
-0800)


dax fixes for v6.1-rc8

- Fix duplicate overlapping device-dax instances for HMAT described
  "Soft Reserved" Memory

- Fix missing node targets in the sysfs representation of memory tiers

- Remove a confusing variable initialization

--------
Dan Williams (1):
  device-dax: Fix duplicate 'hmem' device registration

Vishal Verma (2):
  ACPI: HMAT: remove unnecessary variable initialization
  ACPI: HMAT: Fix initiator registration for single-initiator systems

 drivers/acpi/numa/hmat.c  | 27 ---
 drivers/dax/hmem/device.c | 24 +++-
 2 files changed, 35 insertions(+), 16 deletions(-)



RE: [PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-02 Thread Dan Williams
Shiyang Ruan wrote:
> fsdax page is used not only when CoW, but also mapread. To make the it
> easily understood, use 'share' to indicate that the dax page is shared
> by more than one extent.  And add helper functions to use it.
> 
> Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 38 ++
>  include/linux/mm_types.h   |  5 -
>  include/linux/page-flags.h |  2 +-
>  3 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..edbacb273ab5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
>   for (pfn = dax_to_pfn(entry); \
>   pfn < dax_end_pfn(entry); pfn++)
>  
> -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> +static inline bool dax_page_is_shared(struct page *page)
>  {
> - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> + return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
>  }
>  
>  /*
> - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> + * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> + * refcount.
>   */
> -static inline void dax_mapping_set_cow(struct page *page)
> +static inline void dax_page_bump_sharing(struct page *page)

Similar to page_ref naming I would call this page_share_get() and the
corresponding function page_share_put().

>  {
> - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> + if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
>   /*
>* Reset the index if the page was already mapped
>* regularly before.
>*/
>   if (page->mapping)
> - page->index = 1;
> - page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> + page->share = 1;
> + page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;

Small nit, You could save a cast here by defining
PAGE_MAPPING_DAX_SHARED as "((void *) 1)".

>   }
> - page->index++;
> + page->share++;
> +}
> +
> +static inline unsigned long dax_page_drop_sharing(struct page *page)
> +{
> + return --page->share;
>  }
>  
>  /*
> - * When it is called in dax_insert_entry(), the cow flag will indicate that
> + * When it is called in dax_insert_entry(), the shared flag will indicate 
> that
>   * whether this entry is shared by multiple files.  If so, set the 
> page->mapping
> - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> + * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>   */
>  static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool cow)
> + struct vm_area_struct *vma, unsigned long address, bool shared)
>  {
>   unsigned long size = dax_entry_size(entry), pfn, index;
>   int i = 0;
> @@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct 
> address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (cow) {
> - dax_mapping_set_cow(page);
> + if (shared) {
> + dax_page_bump_sharing(page);
>   } else {
>   WARN_ON_ONCE(page->mapping);
>   page->mapping = mapping;
> @@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   struct page *page = pfn_to_page(pfn);
>  
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_mapping_is_cow(page->mapping)) {
> - /* keep the CoW flag if this page is still shared */
> - if (page->index-- > 0)
> + if (dax_page_is_shared(page)) {
> + /* keep the shared flag if this page is still shared */
> + if (dax_page_drop_sharing(page) > 0)
>   continue;

I think part of what makes this hard to read is trying to preserve the
same code paths for shared pages and typical pages.

page_share_put() should, in addition to decrementing the share, clear
out page->mapping value.

>   } else
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..f46cac3657ad 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -103,7 +103,10 @@ struct page {
>   };
>   /* See page-flags.h for PAGE_MAPPING_FLAGS */
>   struct address_space *mapping;
> - pgoff_t index;  /* Our offset within mapping. */
> + union {
> +   

RE: [PATCH v2 0/8] fsdax,xfs: fix warning messages

2022-12-02 Thread Dan Williams
Shiyang Ruan wrote:
> Changes since v1:
>  1. Added a snippet of the warning message and some of the failed cases
>  2. Separated the patch for easily review
>  3. Added page->share and its helper functions
>  4. Included the patch[1] that removes the restrictions of fsdax and reflink
> [1] 
> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/
> 
> Many testcases failed in dax+reflink mode with warning message in dmesg.
> Such as generic/051,075,127.  The warning message is like this:
> [  775.509337] [ cut here ]
> [  775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 
> dax_insert_entry.cold+0x2e/0x69
> [  775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash 
> af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
> nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables 
> iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel 
> configfs xfs libcrc32c fuse
> [  775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: GW  
> 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
> [  775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch 
> Linux 1.16.0-3-3 04/01/2014
> [  775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
> [  775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d 
> ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b 
> e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
> [  775.526708] RSP: :c90001d57b30 EFLAGS: 00010082
> [  775.527042] RAX: 002a RBX:  RCX: 
> 0042
> [  775.527396] RDX: ea000a0f6c80 RSI: 81dfab1b RDI: 
> 
> [  775.527819] RBP: ea000a0f6c40 R08:  R09: 
> 820625e0
> [  775.528241] R10: c90001d579d8 R11: 820d2628 R12: 
> 88815fc98320
> [  775.528598] R13: c90001d57c18 R14:  R15: 
> 0001
> [  775.528997] FS:  7f39fc75d740() GS:88817bc8() 
> knlGS:
> [  775.529474] CS:  0010 DS:  ES:  CR0: 80050033
> [  775.529800] CR2: 7f39fc772040 CR3: 000107eb6001 CR4: 
> 003706e0
> [  775.530214] DR0:  DR1:  DR2: 
> 
> [  775.530592] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  775.531002] Call Trace:
> [  775.531230]  
> [  775.531444]  dax_fault_iter+0x267/0x6c0
> [  775.531719]  dax_iomap_pte_fault+0x198/0x3d0
> [  775.532002]  __xfs_filemap_fault+0x24a/0x2d0 [xfs 
> aa8d25411432b306d9554da38096f4ebb86bdfe7]
> [  775.532603]  __do_fault+0x30/0x1e0
> [  775.532903]  do_fault+0x314/0x6c0
> [  775.533166]  __handle_mm_fault+0x646/0x1250
> [  775.533480]  handle_mm_fault+0xc1/0x230
> [  775.533810]  do_user_addr_fault+0x1ac/0x610
> [  775.534110]  exc_page_fault+0x63/0x140
> [  775.534389]  asm_exc_page_fault+0x22/0x30
> [  775.534678] RIP: 0033:0x7f39fc55820a
> [  775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 
> 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9  a4 
> c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
> [  775.535839] RSP: 002b:7ffc66a08118 EFLAGS: 00010202
> [  775.536157] RAX: 7f39fc772001 RBX: 00042001 RCX: 
> 63c1
> [  775.536537] RDX: 6400 RSI: 7f39fac42050 RDI: 
> 7f39fc772040
> [  775.536919] RBP: 6400 R08: 7f39fc772001 R09: 
> 00042000
> [  775.537304] R10: 0001 R11: 0246 R12: 
> 0001
> [  775.537694] R13: 7f39fc772000 R14: 6401 R15: 
> 0003
> [  775.538086]  
> [  775.538333] ---[ end trace  ]---
> 
> This also effects dax+noreflink mode if we run the test after a
> dax+reflink test.  So, the most urgent thing is solving the warning
> messages.
> 
> With these fixes, most warning messages in dax_associate_entry() are
> gone.  But honestly, generic/388 will randomly failed with the warning.
> The case shutdown the xfs when fsstress is running, and do it for many
> times.  I think the reason is that dax pages in use are not able to be
> invalidated in time when fs is shutdown.  The next time dax page to be
> associated, it still remains the mapping value set last time.  I'll keep
> on solving it.

This one also sounds like it is going to be relevant for CXL PMEM, and
the improvements to the reference counting. CXL has a facility where the
driver asserts that no more writes are in-flight to the device so that
the device can assert a clean shutdown. Part of that will be making sure
that page access ends at fs shutdown.

> The warning message in dax_writeback_one() can also be fixed because of
> the dax unshare.
> 
> 
> Shiyang Ruan (8):
>   fsdax: introduce 

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-30 Thread Dan Williams
Darrick J. Wong wrote:
> On Wed, Nov 30, 2022 at 01:48:59PM -0800, Dan Williams wrote:
> > Andrew Morton wrote:
> > > On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams 
> > >  wrote:
> > > 
> > > > [ add Andrew ]
> > > > 
> > > > Shiyang Ruan wrote:
> > > > > Many testcases failed in dax+reflink mode with warning message in 
> > > > > dmesg.
> > > > > This also effects dax+noreflink mode if we run the test after a
> > > > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > > > messages.
> > > > > 
> > > > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > > > Patch 2 adds the implementation of unshare for fsdax.
> > > > > 
> > > > > With these fixes, most warning messages in dax_associate_entry() are
> > > > > gone.  But honestly, generic/388 will randomly failed with the 
> > > > > warning.
> > > > > The case shutdown the xfs when fsstress is running, and do it for many
> > > > > times.  I think the reason is that dax pages in use are not able to be
> > > > > invalidated in time when fs is shutdown.  The next time dax page to be
> > > > > associated, it still remains the mapping value set last time.  I'll 
> > > > > keep
> > > > > on solving it.
> > > > > 
> > > > > The warning message in dax_writeback_one() can also be fixed because 
> > > > > of
> > > > > the dax unshare.
> > > > 
> > > > Thank you for digging in on this, I had been pinned down on CXL tasks
> > > > and worried that we would need to mark FS_DAX broken for a cycle, so
> > > > this is timely.
> > > > 
> > > > My only concern is that these patches look to have significant 
> > > > collisions with
> > > > the fsdax page reference counting reworks pending in linux-next. 
> > > > Although,
> > > > those are still sitting in mm-unstable:
> > > > 
> > > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> > > 
> > > As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat
> > > stuck.  Jan pointed out:
> > > 
> > > https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u
> > > 
> > > or have Jason's issues since been addressed?
> > 
> > No, they have not. I do think the current series is a step forward, but
> > given the urgency remains low for the time being (CXL hotplug use case
> > further out, no known collisions with ongoing folio work, and no
> > MEMORY_DEVICE_PRIVATE users looking to build any conversions on top for
> > 6.2) I am ok to circle back for 6.3 for that follow on work to be
> > integrated.
> > 
> > > > My preference would be to move ahead with both in which case I can help
> > > > rebase these fixes on top. In that scenario everything would go through
> > > > Andrew.
> > > > 
> > > > However, if we are getting too late in the cycle for that path I think
> > > > these dax-fixes take precedence, and one more cycle to let the page
> > > > reference count reworks sit is ok.
> > > 
> > > That sounds a decent approach.  So we go with this series ("fsdax,xfs:
> > > fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup
> > > mistake"?
> > > 
> > 
> > Yeah, that's the path of least hassle.
> 
> Sounds good.  I still want to see patch 1 of this series broken up into
> smaller pieces though.  Once the series goes through review, do you want
> me to push the fixes to Linus, seeing as xfs is the only user of this
> functionality?

Yes, that was my primary feedback as well, and merging through xfs makes
sense to me.



Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-30 Thread Dan Williams
Andrew Morton wrote:
> On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams  
> wrote:
> 
> > [ add Andrew ]
> > 
> > Shiyang Ruan wrote:
> > > Many testcases failed in dax+reflink mode with warning message in dmesg.
> > > This also effects dax+noreflink mode if we run the test after a
> > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > messages.
> > > 
> > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > Patch 2 adds the implementation of unshare for fsdax.
> > > 
> > > With these fixes, most warning messages in dax_associate_entry() are
> > > gone.  But honestly, generic/388 will randomly failed with the warning.
> > > The case shutdown the xfs when fsstress is running, and do it for many
> > > times.  I think the reason is that dax pages in use are not able to be
> > > invalidated in time when fs is shutdown.  The next time dax page to be
> > > associated, it still remains the mapping value set last time.  I'll keep
> > > on solving it.
> > > 
> > > The warning message in dax_writeback_one() can also be fixed because of
> > > the dax unshare.
> > 
> > Thank you for digging in on this, I had been pinned down on CXL tasks
> > and worried that we would need to mark FS_DAX broken for a cycle, so
> > this is timely.
> > 
> > My only concern is that these patches look to have significant collisions 
> > with
> > the fsdax page reference counting reworks pending in linux-next. Although,
> > those are still sitting in mm-unstable:
> > 
> > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> 
> As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat
> stuck.  Jan pointed out:
> 
> https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u
> 
> or have Jason's issues since been addressed?

No, they have not. I do think the current series is a step forward, but
given the urgency remains low for the time being (CXL hotplug use case
further out, no known collisions with ongoing folio work, and no
MEMORY_DEVICE_PRIVATE users looking to build any conversions on top for
6.2) I am ok to circle back for 6.3 for that follow on work to be
integrated.

> > My preference would be to move ahead with both in which case I can help
> > rebase these fixes on top. In that scenario everything would go through
> > Andrew.
> > 
> > However, if we are getting too late in the cycle for that path I think
> > these dax-fixes take precedence, and one more cycle to let the page
> > reference count reworks sit is ok.
> 
> That sounds a decent approach.  So we go with this series ("fsdax,xfs:
> fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup
> mistake"?
> 

Yeah, that's the path of least hassle.



Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-29 Thread Dan Williams
Darrick J. Wong wrote:
> On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote:
> > [ add Andrew ]
> > 
> > Shiyang Ruan wrote:
> > > Many testcases failed in dax+reflink mode with warning message in dmesg.
> > > This also effects dax+noreflink mode if we run the test after a
> > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > messages.
> > > 
> > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > Patch 2 adds the implementation of unshare for fsdax.
> > > 
> > > With these fixes, most warning messages in dax_associate_entry() are
> > > gone.  But honestly, generic/388 will randomly failed with the warning.
> > > The case shutdown the xfs when fsstress is running, and do it for many
> > > times.  I think the reason is that dax pages in use are not able to be
> > > invalidated in time when fs is shutdown.  The next time dax page to be
> > > associated, it still remains the mapping value set last time.  I'll keep
> > > on solving it.
> > > 
> > > The warning message in dax_writeback_one() can also be fixed because of
> > > the dax unshare.
> > 
> > Thank you for digging in on this, I had been pinned down on CXL tasks
> > and worried that we would need to mark FS_DAX broken for a cycle, so
> > this is timely.
> > 
> > My only concern is that these patches look to have significant collisions 
> > with
> > the fsdax page reference counting reworks pending in linux-next. Although,
> > those are still sitting in mm-unstable:
> > 
> > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> > 
> > My preference would be to move ahead with both in which case I can help
> > rebase these fixes on top. In that scenario everything would go through
> > Andrew.
> > 
> > However, if we are getting too late in the cycle for that path I think
> > these dax-fixes take precedence, and one more cycle to let the page
> > reference count reworks sit is ok.
> 
> Well now that raises some interesting questions -- dax and reflink are
> totally broken on 6.1.  I was thinking about cramming them into 6.2 as a
> data corruption fix on the grounds that is not an acceptable state of
> affairs.

I agree it's not an acceptable state of affairs, but for 6.1 the answer
may be to just revert to dax+reflink being forbidden again. The fact
that no end user has noticed is probably a good sign that we can disable
that without any one screaming. That may be the easy answer for 6.2 as
well given how late this all is.

> OTOH we're past -rc7, which is **really late** to be changing core code.
> Then again, there aren't so many fsdax users and nobody's complained
> about 6.0/6.1 being busted, so perhaps the risk of regression isn't so
> bad?  Then again, that could be a sign that this could wait, if you and
> Andrew are really eager to merge the reworks.

The page reference counting has also been languishing for a long time. A
6.2 merge would be nice, it relieves maintenance burden, but they do not
start to have real end user implications until CXL memory hotplug
platforms arrive and the warts in the reference counting start to show
real problems in production.

> Just looking at the stuff that's still broken with dax+reflink -- I
> noticed that xfs/550-552 (aka the dax poison tests) are still regressing
> on reflink filesystems.

That's worrying because the whole point of reworking dax, xfs, and
mm/memory-failure all at once was to handle the collision of poison and
reflink'd dax files.

> So, uh, what would this patchset need to change if the "fsdax page
> reference counting reworks" were applied?  Would it be changing the page
> refcount instead of stashing that in page->index?

Nah, it's things like switching from pages to folios and shifting how
dax goes from pfns to pages.

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable=cca48ba3196

Ideally fsdax would never deal in pfns at all and do everything in terms
of offsets relative to a 'struct dax_device'.

My gut is saying these patches, the refcount reworks, and the
dax+reflink fixes, are important but not end user critical. One more
status quo release does not hurt, and we can circle back to get this all
straightened early in v6.3.

I.e. just revert:

35fcd75af3ed xfs: fail dax mount if reflink is enabled on a partition

...for v6.1-rc8 and get back to this early in the New Year.

> 
> --D
> 
> > > Shiyang Ruan (2):
> > >   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
> > >   fsdax,xfs: port unshare to fsdax
> > > 
> > >  fs/dax.c | 166 ++-
> > >  fs/xfs/xfs_iomap.c   |   6 +-
> > >  fs/xfs/xfs_reflink.c |   8 ++-
> > >  include/linux/dax.h  |   2 +
> > >  4 files changed, 129 insertions(+), 53 deletions(-)
> > > 
> > > -- 
> > > 2.38.1





RE: [PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()

2022-11-29 Thread Dan Williams
Shiyang Ruan wrote:
> This patch fixes the warning message reported in dax_associate_entry()
> and dax_disassociate_entry().

Can you include the xfstest test number and a snippet of the warning
message.

> 1. reset page->mapping and ->index when refcount counting down to 0.
> 2. set IOMAP_F_SHARED flag when iomap read to allow one dax page to be
> associated more than once for not only write but also read.
> 3. should zero the edge (when not aligned) if srcmap is HOLE or
> UNWRITTEN.
> 4. iterator of two files in dedupe should be executed side by side, not
> nested.
> 5. use xfs_dax_write_iomap_ops for xfs zero and truncate. 

Do these all need to be done at once, or is this 5 patches?

> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 114 ++---
>  fs/xfs/xfs_iomap.c |   6 +--
>  2 files changed, 69 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..5ea7c0926b7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>   if (dax_mapping_is_cow(page->mapping)) {
>   /* keep the CoW flag if this page is still shared */
> - if (page->index-- > 0)
> + if (page->index-- > 1)

I think this wants either a helper function to make it clear that
->index is being used as a share count, or go ahead and rename that
field in this context with something like:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 910d880e67eb..1a409288f39d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,10 @@ struct page {
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
-   pgoff_t index;  /* Our offset within mapping. */
+   union {
+   pgoff_t index;  /* Our offset within 
mapping. */
+   unsigned long share;
+   };
/**
 * @private: Mapping-private opaque data.
 * Usually used for buffer_heads if PagePrivate.




RE: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-29 Thread Dan Williams
[ add Andrew ]

Shiyang Ruan wrote:
> Many testcases failed in dax+reflink mode with warning message in dmesg.
> This also effects dax+noreflink mode if we run the test after a
> dax+reflink test.  So, the most urgent thing is solving the warning
> messages.
> 
> Patch 1 fixes some mistakes and adds handling of CoW cases not
> previously considered (srcmap is HOLE or UNWRITTEN).
> Patch 2 adds the implementation of unshare for fsdax.
> 
> With these fixes, most warning messages in dax_associate_entry() are
> gone.  But honestly, generic/388 will randomly failed with the warning.
> The case shutdown the xfs when fsstress is running, and do it for many
> times.  I think the reason is that dax pages in use are not able to be
> invalidated in time when fs is shutdown.  The next time dax page to be
> associated, it still remains the mapping value set last time.  I'll keep
> on solving it.
> 
> The warning message in dax_writeback_one() can also be fixed because of
> the dax unshare.

Thank you for digging in on this, I had been pinned down on CXL tasks
and worried that we would need to mark FS_DAX broken for a cycle, so
this is timely.

My only concern is that these patches look to have significant collisions with
the fsdax page reference counting reworks pending in linux-next. Although,
those are still sitting in mm-unstable:

http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org

My preference would be to move ahead with both in which case I can help
rebase these fixes on top. In that scenario everything would go through
Andrew.

However, if we are getting too late in the cycle for that path I think
these dax-fixes take precedence, and one more cycle to let the page
reference count reworks sit is ok.

> Shiyang Ruan (2):
>   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
>   fsdax,xfs: port unshare to fsdax
> 
>  fs/dax.c | 166 ++-
>  fs/xfs/xfs_iomap.c   |   6 +-
>  fs/xfs/xfs_reflink.c |   8 ++-
>  include/linux/dax.h  |   2 +
>  4 files changed, 129 insertions(+), 53 deletions(-)
> 
> -- 
> 2.38.1



[PATCH v6] memregion: Add cpu_cache_invalidate_memregion() interface

2022-11-14 Thread Dan Williams
From: Davidlohr Bueso 

With CXL security features, and CXL dynamic provisioning, global CPU
cache flushing nvdimm requirements are no longer specific to that
subsystem, even beyond the scope of security_ops. CXL will need such
semantics for features not necessarily limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the erase is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked. Lastly this facility is used when
mapping new devices, or new capacity into an established physical
address range. I.e. when the driver switches DeviceA mapping AddressX to
DeviceB mapping AddressX then any cached data from DeviceA:AddressX
needs to be invalidated.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device. It may
also be used for dynamic CXL region provisioning.

Users must first call cpu_cache_has_invalidate_memregion() to know
whether this functionality is available on the architecture. On x86 this
respects the constraints of when wbinvd() is tolerable. It is already
the case that wbinvd() is problematic to allow in VMs due its global
performance impact and KVM, for example, has been known to just trap and
ignore the call. With confidential computing guest execution of wbinvd()
may even trigger an exception. Given guests should not be messing with
the bare metal address map via CXL configuration changes
cpu_cache_has_invalidate_memregion() returns false in VMs.

While this global cache invalidation facility, is exported to modules,
since NVDIMM and CXL support can be built as a module, it is not for
general use. The intent is that this facility is not available outside
of specific "device-memory" use cases. To make that expectation as clear
as possible the API is scoped to a new "DEVMEM" module namespace that
only the NVDIMM and CXL subsystems are expected to import.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Tested-by: Dave Jiang 
Signed-off-by: Davidlohr Bueso 
Acked-by: Dave Hansen 
Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
---
Changes since v5 [1]:
- A collection of build fixed reported by the 0day build robot

[1]: 
http://lore.kernel.org/r/166698148737.3132474.5901874516011784201.st...@dwillia2-xfh.jf.intel.com

 arch/x86/Kconfig |1 +
 arch/x86/mm/pat/set_memory.c |   18 ++
 drivers/acpi/nfit/intel.c|   43 --
 include/linux/memregion.h|   38 +
 lib/Kconfig  |3 +++
 5 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..e16b2b15d67e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_CACHE_LINE_SIZE
+   select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 2e5a045731de..ef34ba21aa92 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -330,6 +331,23 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+bool cpu_cache_has_invalidate_memregion(void)
+{
+   return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
+
+int cpu_cache_invalidate_memregion(int res_desc)
+{
+   if (WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion()))
+   return -ENXIO;
+   wbinvd_on_all_cpus();
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM);
+#endif
+
 static void __cpa_flush_all(void *arg)
 {
unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..fa0e57e35162 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "intel.h"
 #include "nfit.h"
@@ -190,8 +191,6 @@ static int intel_security_cha

[PATCH v5] memregion: Add cpu_cache_invalidate_memregion() interface

2022-10-28 Thread Dan Williams
From: Davidlohr Bueso 

With CXL security features, and CXL dynamic provisioning, global CPU
cache flushing nvdimm requirements are no longer specific to that
subsystem, even beyond the scope of security_ops. CXL will need such
semantics for features not necessarily limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the erase is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked. Lastly this facility is used when
mapping new devices, or new capacity into an established physical
address range. I.e. when the driver switches DeviceA mapping AddressX to
DeviceB mapping AddressX then any cached data from DeviceA:AddressX
needs to be invalidated.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device. It may
also be used for dynamic CXL region provisioning.

Users must first call cpu_cache_has_invalidate_memregion() to know
whether this functionality is available on the architecture. On x86 this
respects the constraints of when wbinvd() is tolerable. It is already
the case that wbinvd() is problematic to allow in VMs due its global
performance impact and KVM, for example, has been known to just trap and
ignore the call. With confidential computing guest execution of wbinvd()
may even trigger an exception. Given guests should not be messing with
the bare metal address map via CXL configuration changes
cpu_cache_has_invalidate_memregion() returns false in VMs.

While this global cache invalidation facility, is exported to modules,
since NVDIMM and CXL support can be built as a module, it is not for
general use. The intent is that this facility is not available outside
of specific "device-memory" use cases. To make that expectation as clear
as possible the API is scoped to a new "DEVMEM" module namespace that
only the NVDIMM and CXL subsystems are expected to import.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Signed-off-by: Davidlohr Bueso 
Acked-by: Dave Hansen 
Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
---
Changes since v4 [1]:
- Changelog and kdoc wording fixes (Dave)
- Permit x86-32 as there is no functional reason to disallow it, and the
  DEVMEM namespace handles limiting the usage (Dave)
- Similar to the compile time assertion for the
  CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=n case, warn and fail
  cpu_cache_invalidate_memregion() if not supported at runtime (Dave)

[1]: 
http://lore.kernel.org/r/166672803035.280.9244172033971411169.st...@dwillia2-xfh.jf.intel.com

 arch/x86/Kconfig |1 +
 arch/x86/mm/pat/set_memory.c |   17 +
 drivers/acpi/nfit/intel.c|   43 --
 include/linux/memregion.h|   37 
 lib/Kconfig  |3 +++
 5 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..e16b2b15d67e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_CACHE_LINE_SIZE
+   select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..0a735c62fa0d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -330,6 +330,23 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+bool cpu_cache_has_invalidate_memregion(void)
+{
+   return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
+
+int cpu_cache_invalidate_memregion(int res_desc)
+{
+   if (WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion()))
+   return -ENXIO;
+   wbinvd_on_all_cpus();
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM);
+#endif
+
 static void __cpa_flush_all(void *arg)
 {
unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..fa0e57e35162 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +

Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

2022-10-27 Thread Dan Williams
Dave Hansen wrote:
> On 10/25/22 13:05, Dan Williams wrote:
> > Users must first call cpu_cache_has_invalidate_memregion() to know whether
> > this functionality is available on the architecture. Only enable it on
> > x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
> > guests may trigger a virtualization exception and may need proper handling
> > to recover. See:
> That sentence doesn't quite parse correctly.  Does it need to be "and
> may trigger..."?

Yes, but I think this can be said more generally, and with more detail:

"Only enable on bare metal x86-64 vai the wbinvd() hammer. It is already
the case that wbinvd() is problematic to allow in VMs due its global
performance impact and KVM, for example, has been known to just trap and
ignore the call. With confidential computing guest execution of wbinvd
may even trigger an exception. As guests should not be messing with the
bare metal address map cpu_cache_has_invalidate_memregion() returns
false in those environments."

> > This global cache invalidation facility,
> > cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
> > and CXL support can be built as a module. However, the intent is that
> > this facility is not available outside of specific "device-memory" use
> > cases. To that end the API is scoped to a new "DEVMEM" module namespace
> > that only applies to the NVDIMM and CXL subsystems.
> 
> Oh, thank $DEITY you're trying to limit the amount of code that has
> access to this thing.
> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 67745ceab0db..b68661d0633b 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -69,6 +69,7 @@ config X86
> > select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
> > select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
> > select ARCH_HAS_CACHE_LINE_SIZE
> > +   select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION  if X86_64
> 
> What is 64-bit only about this?
> 
> I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit
> kernels, but it would be nice to remove this if it's not strictly
> needed.  Or, to add a changelog nugget that says:
> 
>   Restrict this to X86_64 kernels.  It would probably work on 32-
>   bit, but there is no practical reason to use 32-bit kernels and
>   no one is testing them.

I had to go recall this myself, but it looks like this is unnecessarily
cargo culting the stance of ARCH_HAS_PMEM_API that arose from this
change:

96601adb7451 x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

...that observed that on pure 32-bit x86 CPUs that non-temporal stores
had weaker guarantees about whether writes would bypass the CPU cache.
However, that commit is so old that it even talks about the interactions
with the pcommit instruction. Suffice to say there is no X86_64
dependency for wbinvd, I'll drop the dependency.

> 
> > select ARCH_HAS_CURRENT_STACK_POINTER
> > select ARCH_HAS_DEBUG_VIRTUAL
> > select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 97342c42dda8..8650bb6481a8 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
> >  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
> > +bool cpu_cache_has_invalidate_memregion(void)
> > +{
> > +   return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
> > +
> > +int cpu_cache_invalidate_memregion(int res_desc)
> > +{
> > +   wbinvd_on_all_cpus();
> > +   return 0;
> > +}
> 
> Does this maybe also deserve a:
> 
>   WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion());
> 
> in case one of the cpu_cache_invalidate_memregion() paths missed a
> cpu_cache_has_invalidate_memregion() check?

Yeah, good idea.

> 
> > +/**
> > + * cpu_cache_invalidate_memregion - drop any CPU cached data for
> > + * memregions described by @res_desc
> > + * @res_desc: one of the IORES_DESC_* types
> > + *
> > + * Perform cache maintenance after a memory event / operation that
> > + * changes the contents of physical memory in a cache-incoherent manner.
> > + * For example, device memory technologies like NVDIMM and CXL have
> > + * device secure erase, or dynamic region provision features where such
> > + * semantics.
> 
> s/where/with/ ?

Yes.

> 
> > + * 

Re: [PATCH v1] virtio_pmem: populate numa information

2022-10-26 Thread Dan Williams
Pankaj Gupta wrote:
> > > > Compute the numa information for a virtio_pmem device from the memory
> > > > range of the device. Previously, the target_node was always 0 since
> > > > the ndr_desc.target_node field was never explicitly set. The code for
> > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > drivers/cxl/pmem.c.
> > > >
> > > > Signed-off-by: Michael Sammler 
> > > > ---
> > > >  drivers/nvdimm/virtio_pmem.c | 11 +--
> > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > > index 20da455d2ef6..a92eb172f0e7 100644
> > > > --- a/drivers/nvdimm/virtio_pmem.c
> > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > > >  {
> > > > struct nd_region_desc ndr_desc = {};
> > > > -   int nid = dev_to_node(>dev);
> > > > struct nd_region *nd_region;
> > > > struct virtio_pmem *vpmem;
> > > > struct resource res;
> > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device 
> > > > *vdev)
> > > > dev_set_drvdata(>dev, vpmem->nvdimm_bus);
> > > >
> > > > ndr_desc.res = 
> > > > -   ndr_desc.numa_node = nid;
> > > > +
> > > > +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > > +   ndr_desc.target_node = phys_to_target_node(res.start);
> > > > +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> > > > +   ndr_desc.target_node = ndr_desc.numa_node;
> > > > +   dev_dbg(>dev, "changing target node from %d to 
> > > > %d",
> > > > +   NUMA_NO_NODE, ndr_desc.target_node);
> > > > +   }
> > >
> > > As this memory later gets hotplugged using "devm_memremap_pages". I don't
> > > see if 'target_node' is used for fsdax case?
> > >
> > > It seems to me "target_node" is used mainly for volatile range above
> > > persistent memory ( e.g kmem driver?).
> > >
> > I am not sure if 'target_node' is used in the fsdax case, but it is
> > indeed used by the devdax/kmem driver when hotplugging the memory (see
> > 'dev_dax_kmem_probe' and '__dax_pmem_probe').
> 
> Yes, but not currently for FS_DAX iiuc.

The target_node is only used by the dax_kmem driver. In the FSDAX case
the memory (persistent or otherwise) is mapped behind a block-device.
That block-device has affinity to a CPU initiator, but that memory does
not itself have any NUMA affinity or identity as a target.

So:

block-device NUMA node == closest CPU initiator node to the device

dax-device target node == memory only NUMA node target, after onlining



[PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

2022-10-25 Thread Dan Williams
From: Davidlohr Bueso 

With CXL security features, global CPU cache flushing nvdimm requirements
are no longer specific to that subsystem, even beyond the scope of
security_ops. CXL will need such semantics for features not necessarily
limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the erase is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked. Lastly this facility is used when
mapping new devices, or new capacity into an established physical
address range. I.e. when the driver switches DeviceA mapping AddressX to
DeviceB mapping AddressX then any cached data from DeviceA:AddressX
needs to be invalidated.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device. It may
also be used for dynamic CXL region provisioning.

Users must first call cpu_cache_has_invalidate_memregion() to know whether
this functionality is available on the architecture. Only enable it on
x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
guests may trigger a virtualization exception and may need proper handling
to recover. See:

   e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

This global cache invalidation facility,
cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
and CXL support can be built as a module. However, the intent is that
this facility is not available outside of specific "device-memory" use
cases. To that end the API is scoped to a new "DEVMEM" module namespace
that only applies to the NVDIMM and CXL subsystems.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Signed-off-by: Davidlohr Bueso 
Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
---
Changes since v3: 
https://lore.kernel.org/r/20220919110605.3696-1-d...@stgolabs.net
- Add more details about the use of this facility in the memory
  provisioning flow in the changelog
- Introduce the DEVMEM symbol namespace to indicate that this facility is not
  for general consumption (Dave H)
- Update the kernel-doc for cpu_cache_invalidate_memregion() to clarify
  that it is not to be used outside of the CXL and NVDIMM subsystems (Dave H)

 arch/x86/Kconfig |1 +
 arch/x86/mm/pat/set_memory.c |   15 +++
 drivers/acpi/nfit/intel.c|   43 --
 include/linux/memregion.h|   37 
 lib/Kconfig  |3 +++
 5 files changed, 76 insertions(+), 23 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..b68661d0633b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_CACHE_LINE_SIZE
+   select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION  if X86_64
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..8650bb6481a8 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+bool cpu_cache_has_invalidate_memregion(void)
+{
+   return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
+
+int cpu_cache_invalidate_memregion(int res_desc)
+{
+   wbinvd_on_all_cpus();
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM);
+#endif
+
 static void __cpa_flush_all(void *arg)
 {
unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..fa0e57e35162 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "intel.h"
 #include "nfit.h"
@@ -190,8 +191,6 @@ static int intel_security_change_key(struct nvdimm *nvdimm,
}
 }
 
-static void nvdimm_invalidate_cache(void);
-
 static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
const struct nvdimm_key_data *key_data)
 {
@@ -213,6 +212,9 @@ st

RE: [PATCH v3 -next] memregion: Add cpu_cache_invalidate_memregion() interface

2022-10-22 Thread Dan Williams
[ add the rest of the X86 MM maintainers, Dave and Andy ]

Davidlohr Bueso wrote:
> With CXL security features, global CPU cache flushing nvdimm requirements
> are no longer specific to that subsystem, even beyond the scope of
> security_ops. CXL will need such semantics for features not necessarily
> limited to persistent memory.
> 
> The functionality this is enabling is to be able to instantaneously
> secure erase potentially terabytes of memory at once and the kernel
> needs to be sure that none of the data from before the erase is still
> present in the cache. It is also used when unlocking a memory device
> where speculative reads and firmware accesses could have cached poison
> from before the device was unlocked.
> 
> This capability is typically only used once per-boot (for unlock), or
> once per bare metal provisioning event (secure erase), like when handing
> off the system to another tenant or decommissioning a device. It may
> also be used for dynamic CXL region provisioning.
> 
> Users must first call cpu_cache_has_invalidate_memregion() to know whether
> this functionality is available on the architecture. Only enable it on
> x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
> guests may trigger a virtualization exception and may need proper handling
> to recover. See:
> 
>e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
> 
> Signed-off-by: Davidlohr Bueso 
> ---
> Changes from v2 
> (https://lore.kernel.org/all/20220829212918.4039240-1-d...@stgolabs.net/):
> - Change the names and params (Dan).
> - GPL symbols (Boris).
> - Mentioned VMM check in the changelog (Boris).

Any remaining concerns from x86 maintainers?

Peter had asked whether this requirement to use wbinvd could be
addressed from the CXL device side in the future. I did raise this
question and I will point out that "back-invalidate" (device initiated
requests to invalidate CPU caches) is a prominent capability defined in
the CXL 3.0 specification. So, there is at least line of sight for that
to be used for these flows going forward.

There will also be motivation for this from platforms that do not have
an equivalent to wbinvd once the Linux CXL stack declines to support
dynamic CXL memory region provisioning and secure erase in the absence
of a functional cpu_cache_invalidate_memregion() or a device-side
back-invalidate capability.

> 
> 
>  arch/x86/Kconfig |  1 +
>  arch/x86/mm/pat/set_memory.c | 15 +
>  drivers/acpi/nfit/intel.c| 41 
>  include/linux/memregion.h| 35 ++
>  lib/Kconfig  |  3 +++
>  5 files changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2e8f6fd28e59..fa5cc581315a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -69,6 +69,7 @@ config X86
>   select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION  if X86_64
>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 0656db33574d..7d940ae2fede 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>  #endif
>  
> +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
> +bool cpu_cache_has_invalidate_memregion(void)
> +{
> + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> +}
> +EXPORT_SYMBOL_GPL(cpu_cache_has_invalidate_memregion);
> +
> +int cpu_cache_invalidate_memregion(int res_desc)
> +{
> + wbinvd_on_all_cpus();
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(cpu_cache_invalidate_memregion);
> +#endif
> +
>  static void __cpa_flush_all(void *arg)
>  {
>   unsigned long cache = (unsigned long)arg;
> diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
> index 8dd792a55730..b2bfbf5797da 100644
> --- a/drivers/acpi/nfit/intel.c
> +++ b/drivers/acpi/nfit/intel.c
> @@ -3,6 +3,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "intel.h"
>  #include "nfit.h"
> @@ -190,8 +191,6 @@ static int intel_security_change_key(struct nvdimm 
> *nvdimm,
>   }
>  }
>  
> -static void nvdimm_invalidate_cache(void);
> -
>  static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
>   const struct nvdimm_key_data *key_data)
>  {
> @@ -213,6 +212,9 @@ static int __maybe_unused intel_security_unlock(struct 
> nvdimm *nvdimm,
>   if (!test_bit(NVDIMM_INTEL_UNLOCK_UNIT, _mem->dsm_mask))
>   return -ENOTTY;
>  
> + if (!cpu_cache_has_invalidate_memregion())
> + return -EINVAL;
> 

[GIT PULL] NVDIMM for 6.1

2022-10-14 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/libnvdimm-for-6.1

...to receive some small cleanups and fixes in and around the nvdimm
subsystem. The most significant change is a regression fix for nvdimm
namespace (volume) creation when the namespace size is smaller than 2MB.

It has appeared in linux-next with no reported issues.

---

The following changes since commit 521a547ced6477c54b4b0cc206000406c221b4d6:

  Linux 6.0-rc6 (2022-09-18 13:44:14 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/libnvdimm-for-6.1

for you to fetch changes up to 305a72efa791c826fe84768ca55e31adc4113ea8:

  Merge branch 'for-6.1/nvdimm' into libnvdimm-for-next (2022-10-01 19:16:16 
-0700)


libnvdimm for 6.1

- Fix nvdimm namespace creation on platforms that do not publish
  associated 'DIMM' metadata for a persistent memory region.

- Miscellaneous fixes and cleanups.


Andy Shevchenko (2):
  nvdimm/namespace: return uuid_null only once in nd_dev_to_uuid()
  nvdimm/namespace: drop unneeded temporary variable in size_store()

Bo Liu (1):
  dax: Remove usage of the deprecated ida_simple_xxx API

Dan Williams (1):
  Merge branch 'for-6.1/nvdimm' into libnvdimm-for-next

Jason Wang (1):
  nvdimm/namespace: Fix comment typo

Jiapeng Chong (2):
  nvdimm/region: Fix kernel-doc
  nvdimm: make __nvdimm_security_overwrite_query static

Lin Yujun (1):
  ACPI: HMAT: Release platform device in case of platform_device_add_data() 
fails

Tyler Hicks (1):
  libnvdimm/region: Allow setting align attribute on regions without 
mappings

 drivers/dax/hmem/device.c   |  4 ++--
 drivers/dax/super.c |  6 +++---
 drivers/nvdimm/namespace_devs.c | 24 
 drivers/nvdimm/region_devs.c| 10 --
 drivers/nvdimm/security.c   |  2 +-
 5 files changed, 18 insertions(+), 28 deletions(-)



RE: [PATCH] dax: Remove usage of the deprecated ida_simple_xxx API

2022-09-29 Thread Dan Williams
Bo Liu wrote:
> Use ida_alloc_xxx()/ida_free() instead of
> ida_simple_get()/ida_simple_remove().
> The latter is deprecated and more verbose.

The better justification is:

ida_alloc_max() makes it clear that the
second argument is inclusive, and the alloc/free terminology is more
idiomatic and symmetric then get/remove.

Otherwise, I would not apply the patch is the only justfication was
deprecation and verbosity.

I'll fixup the changelog.



Re: [PATCH v2] libnvdimm/region: Allow setting align attribute on regions without mappings

2022-09-29 Thread Dan Williams
Tyler Hicks wrote:
> On 2022-09-26 16:18:18, Jeff Moyer wrote:
> > Tyler Hicks  writes:
> > 
> > > The alignment constraint for namespace creation in a region was
> > > increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with
> > > commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align'
> > > attribute"). The thought behind the change was that region alignment
> > > should be uniform across all architectures and, since PowerPC had the
> > > largest alignment constraint of 16M, all architectures should conform to
> > > that alignment.
> > >
> > > The change regressed namespace creation in pre-defined regions that
> > > relied on 2M alignment but a workaround was provided in the form of a
> > > sysfs attribute, named 'align', that could be adjusted to a non-default
> > > alignment value.
> > >
> > > However, the sysfs attribute's store function returned an error (-ENXIO)
> > > when userspace attempted to change the alignment of a region that had no
> > > mappings. This affected 2M aligned regions of volatile memory that were
> > > defined in a device tree using "pmem-region" and created by the
> > > of_pmem_region_driver, since those regions do not contain mappings
> > > (ndr_mappings is 0).
> > >
> > > Allow userspace to set the align attribute on pre-existing regions that
> > > do not have mappings so that namespaces can still be within those
> > > regions, despite not being aligned to 16M.
> > >
> > > Link: 
> > > https://lore.kernel.org/lkml/CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4glddd7mxs321b...@mail.gmail.com
> > > Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute")
> > > Signed-off-by: Tyler Hicks 
> > > ---
> > >
> > > While testing with a recent kernel release (6.0-rc3), I rediscovered
> > > this bug and eventually realized that I never followed through with
> > > fixing it upstream. After a year later, here's the v2 that Aneesh
> > > requested. Sorry about that!
> > >
> > > v2:
> > > - Included Aneesh's feedback to ensure the val is a power of 2 and
> > >   greater than PAGE_SIZE even for regions without mappings
> > > - Reused the max_t() trick from default_align() to avoid special
> > >   casing, with an if-else, when regions have mappings and when they
> > >   don't
> > >   + Didn't include Pavel's Reviewed-by since this is a slightly
> > > different approach than what he reviewed in v1
> > > - Added a Link commit tag to Pavel's initial problem description
> > > v1: 
> > > https://lore.kernel.org/lkml/20210326152645.85225-1-tyhi...@linux.microsoft.com/
> > >
> > >  drivers/nvdimm/region_devs.c | 8 +++-
> > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index 473a71bbd9c9..550ea0bd6c53 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -509,16 +509,13 @@ static ssize_t align_store(struct device *dev,
> > >  {
> > >   struct nd_region *nd_region = to_nd_region(dev);
> > >   unsigned long val, dpa;
> > > - u32 remainder;
> > > + u32 mappings, remainder;
> > >   int rc;
> > >  
> > >   rc = kstrtoul(buf, 0, );
> > >   if (rc)
> > >   return rc;
> > >  
> > > - if (!nd_region->ndr_mappings)
> > > - return -ENXIO;
> > > -
> > >   /*
> > >* Ensure space-align is evenly divisible by the region
> > >* interleave-width because the kernel typically has no facility
> > > @@ -526,7 +523,8 @@ static ssize_t align_store(struct device *dev,
> > >* contribute to the tail capacity in system-physical-address
> > >* space for the namespace.
> > >*/
> > > - dpa = div_u64_rem(val, nd_region->ndr_mappings, );
> > > + mappings = max_t(u32, 1, nd_region->ndr_mappings);
> > > + dpa = div_u64_rem(val, mappings, );
> > >   if (!is_power_of_2(dpa) || dpa < PAGE_SIZE
> > >   || val > region_size(nd_region) || remainder)
> > >   return -EINVAL;
> > 
> > The math all looks okay, and this matches what's done in default_align.
> > Unfortunately, I don't know enough about the power architecture to
> > understand how you can have a region with no dimms (ndr_mappings == 0).
> 
> Thanks for having a look!
> 
> FWIW, I need this working on arm64. It previously did before the commit
> mentioned in the Fixes line. ndr_mappings is 0 when defining a
> pmem-region in the device tree. The region is also marked as 'volatile'
> but I don't recall if that contributes to ndr_mappings being 0.

Oh, ndr_mappings being 0 can also happen for cases where the platform
just does not advertise any DIMMs / nmemX devices that map to a region.
It is not possible for ndr_mappings to change at run time as they are
only established at nd_region_create() time. The change looks good to
me.



[GIT PULL] NVDIMM and DAX fixes for 6.0-final

2022-09-24 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/dax-and-nvdimm-fixes-v6.0-final

...to receive a recently discovered one-line fix for devdax that further
addresses a v5.5 regression, and (a bit embarrassing) a small batch of
fixes that have been sitting in my fixes tree for weeks. The older fixes
have soaked in linux-next during that time and address an fsdax infinite
loop and some other minor fixups.

---

The following changes since commit 521a547ced6477c54b4b0cc206000406c221b4d6:

  Linux 6.0-rc6 (2022-09-18 13:44:14 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm 
tags/dax-and-nvdimm-fixes-v6.0-final

for you to fetch changes up to b3bbcc5d1da1b654091dad15980b3d58fdae0fc6:

  Merge branch 'for-6.0/dax' into libnvdimm-fixes (2022-09-24 18:14:12 -0700)


dax-and-nvdimm-fixes-v6.0-final

- Fix a infinite loop bug in fsdax

- Fix memory-type detection for devdax (EINJ regression)

- Small cleanups


Andy Shevchenko (1):
  nvdimm/namespace: drop nested variable in create_namespace_pmem()

Dan Williams (2):
  devdax: Fix soft-reservation memory description
  Merge branch 'for-6.0/dax' into libnvdimm-fixes

Jane Chu (1):
  pmem: fix a name collision

Li Jinlin (1):
  fsdax: Fix infinite loop in dax_iomap_rw()

Shivaprasad G Bhat (1):
  ndtest: Cleanup all of blk namespace specific code

 drivers/dax/hmem/device.c  |  1 +
 drivers/nvdimm/namespace_devs.c|  2 -
 drivers/nvdimm/pmem.c  |  6 +--
 fs/dax.c   |  3 ++
 tools/testing/nvdimm/test/ndtest.c | 77 --
 5 files changed, 7 insertions(+), 82 deletions(-)



RE: [PATCH] dax: Check dev_set_name() return value

2022-09-20 Thread Dan Williams
Bo Liu wrote:
> It's possible that dev_set_name() returns -ENOMEM, catch and handle this.
> 
> Signed-off-by: Bo Liu 
> ---
>  drivers/dax/bus.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1dad813ee4a6..36cf245ee467 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -765,7 +765,12 @@ static int devm_register_dax_mapping(struct dev_dax 
> *dev_dax, int range_id)
>   device_initialize(dev);
>   dev->parent = _dax->dev;
>   dev->type = _mapping_type;
> - dev_set_name(dev, "mapping%d", mapping->id);
> + rc = dev_set_name(dev, "mapping%d", mapping->id);
> + if (rc) {
> + kfree(mapping);

No, this needs to trigger put_device() otherwise it leaks mapping->id.



RE: [PATCH] nvdimm: Call ida_simple_remove() when failed

2022-09-20 Thread Dan Williams
Bo Liu wrote:
> In function nvdimm_bus_register(), when code execution fails, we should
> call ida_simple_remove() to free ida.
> 
> Signed-off-by: Bo Liu 
> ---
>  drivers/nvdimm/bus.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index b38d0355b0ac..3415dc62632b 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -371,6 +371,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device 
> *parent,
>   return nvdimm_bus;
>   err:
>   put_device(_bus->dev);
> + ida_simple_remove(_ida, nvdimm_bus->id);

No, this is a double-free. The put_device() will trigger
nvdimm_bus_release().



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Andrew Morton wrote:
> On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams  
> wrote:
> 
> > Jonathan Cameron wrote:
> > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > Dan Williams  wrote:
> > > 
> > > > Andrew Morton wrote:
> > > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > > both.
> > > > > 
> > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > what this "flush" does.   With bonus points for being more specific 
> > > > > in the 
> > > > > function naming?
> > > > >   
> > > > 
> > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > think this API is purely about invalidate. It just so happens that x86
> > > > has not historically had a global invalidate instruction readily
> > > > available which leads to the overuse of wbinvd.
> > > > 
> > > > It would be nice to make clear that this API is purely about
> > > > invalidating any data cached for a physical address impacted by address
> > > > space management event (secure erase / new region provision). Write-back
> > > > is an unnecessary side-effect.
> > > > 
> > > > So how about:
> > > > 
> > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > > 
> > > Want to indicate it 'might' write back perhaps?
> > > So could be invalidate or clean and invalidate (using arm ARM terms just 
> > > to add
> > > to the confusion ;)
> > > 
> > > Feels like there will be potential race conditions where that matters as 
> > > we might
> > > force stale data to be written back.
> > > 
> > > Perhaps a comment is enough for that. Anyone have the "famous last words" 
> > > feeling?
> > 
> > Is "invalidate" not clear that write-back is optional? Maybe not.
> 
> Yes, I'd say that "invalidate" means "dirty stuff may of may not have
> been written back".  Ditto for invalidate_inode_pages2().
> 
> > Also, I realized that we tried to include the address range to allow for
> > the possibility of flushing by virtual address range, but that
> > overcomplicates the use. I.e. if someone issue secure erase and the
> > region association is not established does that mean that mean that the
> > cache invalidation is not needed? It could be the case that someone
> > disables a device, does the secure erase, and then reattaches to the
> > same region. The cache invalidation is needed, but at the time of the
> > secure erase the HPA was unknown.
> > 
> > All this to say that I feel the bikeshedding will need to continue until
> > morale improves.
> > 
> > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > consistent/coherent for the CPU or the device", so how about an API like
> > 
> > memregion_sync_for_cpu(int res_desc)
> > 
> > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
> 
> "sync" is another of my pet peeves ;) In filesystem land, at least. 
> Does it mean "start writeback and return" or does it mean "start
> writeback, wait for its completion then return".  

Ok, no "sync" :).

/**
 * cpu_cache_invalidate_memregion - drop any CPU cached data for
 * memregions described by @res_des
 * @res_desc: one of the IORES_DESC_* types
 *
 * Perform cache maintenance after a memory event / operation that
 * changes the contents of physical memory in a cache-incoherent manner.
 * For example, memory-device secure erase, or provisioning new CXL
 * regions. This routine may or may not write back any dirty contents
 * while performing the invalidation.
 *
 * Returns 0 on success or negative error code on a failure to perform
 * the cache maintenance.
 */
int cpu_cache_invalidate_memregion(int res_desc)

??



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Jonathan Cameron wrote:
> On Wed, 7 Sep 2022 18:07:31 -0700
> Dan Williams  wrote:
> 
> > Andrew Morton wrote:
> > > I really dislike the term "flush".  Sometimes it means writeback,
> > > sometimes it means invalidate.  Perhaps at other times it means
> > > both.
> > > 
> > > Can we please be very clear in comments and changelogs about exactly
> > > what this "flush" does.   With bonus points for being more specific in 
> > > the 
> > > function naming?
> > >   
> > 
> > That's a good point, "flush" has been cargo-culted along in Linux's
> > cache management APIs to mean write-back-and-invalidate. In this case I
> > think this API is purely about invalidate. It just so happens that x86
> > has not historically had a global invalidate instruction readily
> > available which leads to the overuse of wbinvd.
> > 
> > It would be nice to make clear that this API is purely about
> > invalidating any data cached for a physical address impacted by address
> > space management event (secure erase / new region provision). Write-back
> > is an unnecessary side-effect.
> > 
> > So how about:
> > 
> > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> 
> Want to indicate it 'might' write back perhaps?
> So could be invalidate or clean and invalidate (using arm ARM terms just to 
> add
> to the confusion ;)
> 
> Feels like there will be potential race conditions where that matters as we 
> might
> force stale data to be written back.
> 
> Perhaps a comment is enough for that. Anyone have the "famous last words" 
> feeling?

Is "invalidate" not clear that write-back is optional? Maybe not.

Also, I realized that we tried to include the address range to allow for
the possibility of flushing by virtual address range, but that
overcomplicates the use. I.e. if someone issue secure erase and the
region association is not established does that mean that mean that the
cache invalidation is not needed? It could be the case that someone
disables a device, does the secure erase, and then reattaches to the
same region. The cache invalidation is needed, but at the time of the
secure erase the HPA was unknown.

All this to say that I feel the bikeshedding will need to continue until
morale improves.

I notice that the DMA API uses 'sync' to indicate, "make this memory
consistent/coherent for the CPU or the device", so how about an API like

memregion_sync_for_cpu(int res_desc)

...where the @res_desc would be IORES_DESC_CXL for all CXL and
IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Borislav Petkov wrote:
> On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> > To be clear nfit stuff and CXL does run in guests, but they do not
> > support secure-erase in a guest.
> > 
> > However, the QEMU CXL enabling is building the ability to do *guest
> > physical* address space management, but in that case the driver can be
> > paravirtualized to realize that it is not managing host-physical address
> > space and does not need to flush caches. That will need some indicator
> > to differentiate virtual CXL memory expanders from assigned devices.
> 
> Sounds to me like that check should be improved later to ask
> whether the kernel is managing host-physical address space, maybe
> arch_flush_memregion() should check whether the address it is supposed
> to flush is host-physical and exit early if not...

Even though I raised the possibility of guest passthrough of a CXL
memory expander, I do not think it could work in practice without it
being a gigantic security nightmare. So it is probably safe to just do
the hypervisor check and assume that there's no such thing as guest
management of host physical address space.



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-07 Thread Dan Williams
Andrew Morton wrote:
> I really dislike the term "flush".  Sometimes it means writeback,
> sometimes it means invalidate.  Perhaps at other times it means
> both.
> 
> Can we please be very clear in comments and changelogs about exactly
> what this "flush" does.   With bonus points for being more specific in the 
> function naming?
> 

That's a good point, "flush" has been cargo-culted along in Linux's
cache management APIs to mean write-back-and-invalidate. In this case I
think this API is purely about invalidate. It just so happens that x86
has not historically had a global invalidate instruction readily
available which leads to the overuse of wbinvd.

It would be nice to make clear that this API is purely about
invalidating any data cached for a physical address impacted by address
space management event (secure erase / new region provision). Write-back
is an unnecessary side-effect.

So how about:

s/arch_flush_memregion/cpu_cache_invalidate_memregion/?



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-07 Thread Dan Williams
Davidlohr Bueso wrote:
> On Wed, 07 Sep 2022, Borislav Petkov wrote:
> 
> >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> >> index 1abd5438f126..18463cb704fb 100644
> >> --- a/arch/x86/mm/pat/set_memory.c
> >> +++ b/arch/x86/mm/pat/set_memory.c
> >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
> >>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> >>  #endif
> >>
> >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
> >> +bool arch_has_flush_memregion(void)
> >> +{
> >> +  return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> >
> >This looks really weird. Why does this need to care about HV at all?
> 
> So the context here is:
> 
> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
> 
> >
> >Does that nfit stuff even run in guests?
> 
> No, nor does cxl. This was mostly in general a precautionary check such
> that the api is unavailable in VMs.

To be clear nfit stuff and CXL does run in guests, but they do not
support secure-erase in a guest.

However, the QEMU CXL enabling is building the ability to do *guest
physical* address space management, but in that case the driver can be
paravirtualized to realize that it is not managing host-physical address
space and does not need to flush caches. That will need some indicator
to differentiate virtual CXL memory expanders from assigned devices. Is
there such a thing as a PCIe-virtio extended capability to differentiate
physical vs emulated devices?



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-07 Thread Dan Williams
Davidlohr Bueso wrote:
> Not sure the proper way to route this (akpm?). But unless any remaining
> objections, could this be picked up?

My plan was, barring objections, to take it through the CXL tree with
its first user, the CXL security commands.



RE: [PATCH] nvdimm: Avoid wasting some memory.

2022-09-04 Thread Dan Williams
Christophe JAILLET wrote:
> sizeof(struct btt_sb) is 4096.
> 
> When using devm_kzalloc(), there is a small memory overhead and, on most
> systems, this leads to 40 bytes of extra memory allocation.
> So 5036 bytes are expected to be allocated.
> 
> The memory allocator works with fixed size hunks of memory. In this case,
> it will require 8192 bytes of memory because more than 4096 bytes are
> required.
> 
> In order to avoid wasting 4ko of memory, just use kzalloc() and add a
> devm action to free it when needed.
> 
> Signed-off-by: Christophe JAILLET 
> ---
>  drivers/nvdimm/btt_devs.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
> index fabbb31f2c35..7b79fb0b0338 100644
> --- a/drivers/nvdimm/btt_devs.c
> +++ b/drivers/nvdimm/btt_devs.c
> @@ -332,6 +332,11 @@ static int __nd_btt_probe(struct nd_btt *nd_btt,
>   return 0;
>  }
>  
> +void nd_btt_free(void *data)
> +{
> + kfree(data);
> +}
> +
>  int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
>  {
>   int rc;
> @@ -356,7 +361,17 @@ int nd_btt_probe(struct device *dev, struct 
> nd_namespace_common *ndns)
>   nvdimm_bus_unlock(>dev);
>   if (!btt_dev)
>   return -ENOMEM;
> - btt_sb = devm_kzalloc(dev, sizeof(*btt_sb), GFP_KERNEL);
> +
> + /*
> +  * 'struct btt_sb' is 4096. Using devm_kzalloc() would waste 4 ko of
> +  * memory because, because of a small memory over head, 8192 bytes
> +  * would be allocated. So keep this kzalloc()+devm_add_action_or_reset()
> +  */
> + btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
> + rc = devm_add_action_or_reset(dev, nd_btt_free, btt_sb);
> + if (rc)
> + return rc;

Thanks for the analysis and the patch. However, shouldn't this be
something that is addressed internal to devm_kzalloc() rather than
open-coded at every potential call site?



RE: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-08-29 Thread Dan Williams
[ add Christoph ]

Davidlohr Bueso wrote:
> With CXL security features, global CPU cache flushing nvdimm requirements
> are no longer specific to that subsystem, even beyond the scope of
> security_ops. CXL will need such semantics for features not necessarily
> limited to persistent memory.
> 
> The functionality this is enabling is to be able to instantaneously
> secure erase potentially terabytes of memory at once and the kernel
> needs to be sure that none of the data from before the secure is still
> present in the cache. It is also used when unlocking a memory device
> where speculative reads and firmware accesses could have cached poison
> from before the device was unlocked.
> 
> This capability is typically only used once per-boot (for unlock), or
> once per bare metal provisioning event (secure erase), like when handing
> off the system to another tenant or decommissioning a device.
> 
> Users must first call arch_has_flush_memregion() to know whether this
> functionality is available on the architecture. Only enable it on x86-64
> via the wbinvd() hammer.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
> 
> Changes from v2 
> (https://lore.kernel.org/all/20220819171024.1766857-1-d...@stgolabs.net/):
> - Redid to use memregion based interfaces + VMM check on x86 (Dan)
> - Restricted the flushing to x86-64.
> 
> Note: Since we still are dealing with a physical "range" at this level,
> added the spa range for nfit even though this is unused.

Looks reasonable to me.

Reviewed-by: Dan Williams 



RE: [PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-26 Thread Dan Williams
Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>   -> unbind_store()
>-> ... (skip)
> -> devres_release_all()
>  -> kill_dax()
>   -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>-> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> ==
> Changes since v6:
>1. Rebase on 6.0-rc2 and Darrick's patch[2].
> 
> Changes since v5:
>1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
>2. hold s_umount before sync_filesystem()
>3. do sync_filesystem() after SB_BORN check
>4. Rebased on next-20220714
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Darrick J. Wong 
> ---
>   drivers/dax/super.c |  3 ++-
>   fs/xfs/xfs_notify_failure.c | 15 +++
>   include/linux/mm.h  |  1 +
>   3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>   clear_bit(DAXDEV_ALIVE, _dev->flags);
>   synchronize_srcu(_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 65d5eb20878e..a9769f17e998 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -77,6 +77,9 @@ xfs_dax_failure_fn(
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   notify->want_shutdown = true;
>   return 0;
>   }
> @@ -182,12 +185,22 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {

How are you testing the SB_BORN interactions? I have a fix for this
pending here:

https://lore.kernel.org/nvdimm/166153428094.2758201.7936572520826540019.st...@dwillia2-xfh.jf.intel.com/

>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>   +   if (mf_flags & MF_MEM_PRE_REMOVE) {

It appears this patch is corrupted here. I confirmed that b4 sees the
same when trying to apply it.

> + xfs_info(mp, "device is about to be removed!");
> + down_write(>m_super->s_umount);
> + error = sync_filesystem(mp->m_super);

This syncs to make data persistent, but for DAX this also needs to get
invalidate all current DAX mappings. I do not see that in these changes.

> + up_write(>m_super->s_umount);
> + if (error)
> + return error;
> + }
> +
>   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>   xfs_warn(mp,
>"notify_failure() not supported on realtime device!");
> @@ -196,6 +209,8 @@ xfs_dax_notify_failure(
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 982f2607180b..2c7c132e6512 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3176,6 +3176,7 @@ enum mf_flags {
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
>   MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
>   };
>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> -- 
> 2.37.2
> 
> 





Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-26 Thread Dan Williams
Borislav Petkov wrote:
> On Thu, Aug 25, 2022 at 04:29:47PM +, Jane Chu wrote:
> > Tony has replied.
> 
> Do you really think that I can't look up what field means?
> 
> What I said was
> 
> "What I'm missing from this text here is... "
> 
> IOW, what I'm trying to say is, you should formulate your commit message
> better, more human-friendly. Right now it reads like for insiders only.
> But that's not its purpose.
> 
> Do you catch my drift?

How about:

---

When memory poison consumption machine checks fire,
mce-notifier-handlers like nfit_handle_mce() record the impacted
physical address range. The error information includes data about blast
radius, i.e. how many cachelines did the hardware determine are
impacted. A recent change, commit 7917f9cdb503 ("acpi/nfit: rely on
mce->misc to determine poison granularity"), updated nfit_handle_mce()
to stop hard coding the blast radius value of 1 cacheline, and instead
rely on the blast radius reported in 'struct mce' which can be up to 4K
(64 cachelines).

It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than checking the blast
radius in the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.

---



Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-08-25 Thread Dan Williams
Dan Williams wrote:
> Dan Williams wrote:
> > HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote:
> > > > Shiyang Ruan wrote:
> > > > > This new function is a variant of mf_generic_kill_procs that accepts a
> > > > > file, offset pair instead of a struct to support multiple files 
> > > > > sharing
> > > > > a DAX mapping.  It is intended to be called by the file systems as 
> > > > > part
> > > > > of the memory_failure handler after the file system performed a 
> > > > > reverse
> > > > > mapping from the storage address to the file and file offset.
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan 
> > > > > Reviewed-by: Dan Williams 
> > > > > Reviewed-by: Christoph Hellwig 
> > > > > Reviewed-by: Darrick J. Wong 
> > > > > Reviewed-by: Miaohe Lin 
> > > > > ---
> > > > >  include/linux/mm.h  |  2 +
> > > > >  mm/memory-failure.c | 96 
> > > > > -
> > > > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > > > 
> > > > Unfortunately my test suite was only running the "non-destructive" set
> > > > of 'ndctl' tests which skipped some of the complex memory-failure cases.
> > > > Upon fixing that, bisect flags this commit as the source of the 
> > > > following
> > > > crash regression:
> > > 
> > > Thank you for testing/reporting.
> > > 
> > > > 
> > > >  kernel BUG at mm/memory-failure.c:310!
> > > >  invalid opcode:  [#1] PREEMPT SMP PTI
> > > >  CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G   OE 
> > > > 5.19.0-rc4+ #58
> > > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> > > > 02/06/2015
> > > >  RIP: 0010:add_to_kill+0x304/0x400
> > > > [..]
> > > >  Call Trace:
> > > >   
> > > >   collect_procs.part.0+0x2c8/0x470
> > > >   memory_failure+0x979/0xf30
> > > >   do_madvise.part.0.cold+0x9c/0xd3
> > > >   ? lock_is_held_type+0xe3/0x140
> > > >   ? find_held_lock+0x2b/0x80
> > > >   ? lock_release+0x145/0x2f0
> > > >   ? lock_is_held_type+0xe3/0x140
> > > >   ? syscall_enter_from_user_mode+0x20/0x70
> > > >   __x64_sys_madvise+0x56/0x70
> > > >   do_syscall_64+0x3a/0x80
> > > >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > 
> > > This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift()
> > > was triggered.  I think that BUG_ON is too harsh here because address ==
> > > -EFAULT means that there's no mapping for the address.  The subsequent
> > > code considers "tk->size_shift == 0" as "no mapping" cases, so
> > > dev_pagemap_mapping_shift() can return 0 in such a case?
> > > 
> > > Could the following diff work for the issue?
> > 
> > This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl
> > suite.

So that diff works to avoid the BUG_ON, but it does not work to handle
the error case. I think the problem comes from:

vma->vm_file->f_mapping != folio->mapping

...where page_folio(page)->mapping is likely not setup correctly for DAX
pages. This goes back to the broken nature of DAX page reference
counting which I am fixing now, but this folio association also needs to
be fixed up.



Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-08-24 Thread Dan Williams
Dan Williams wrote:
> HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote:
> > > Shiyang Ruan wrote:
> > > > This new function is a variant of mf_generic_kill_procs that accepts a
> > > > file, offset pair instead of a struct to support multiple files sharing
> > > > a DAX mapping.  It is intended to be called by the file systems as part
> > > > of the memory_failure handler after the file system performed a reverse
> > > > mapping from the storage address to the file and file offset.
> > > > 
> > > > Signed-off-by: Shiyang Ruan 
> > > > Reviewed-by: Dan Williams 
> > > > Reviewed-by: Christoph Hellwig 
> > > > Reviewed-by: Darrick J. Wong 
> > > > Reviewed-by: Miaohe Lin 
> > > > ---
> > > >  include/linux/mm.h  |  2 +
> > > >  mm/memory-failure.c | 96 -
> > > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > > 
> > > Unfortunately my test suite was only running the "non-destructive" set
> > > of 'ndctl' tests which skipped some of the complex memory-failure cases.
> > > Upon fixing that, bisect flags this commit as the source of the following
> > > crash regression:
> > 
> > Thank you for testing/reporting.
> > 
> > > 
> > >  kernel BUG at mm/memory-failure.c:310!
> > >  invalid opcode:  [#1] PREEMPT SMP PTI
> > >  CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G   OE 5.19.0-rc4+ 
> > > #58
> > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >  RIP: 0010:add_to_kill+0x304/0x400
> > > [..]
> > >  Call Trace:
> > >   
> > >   collect_procs.part.0+0x2c8/0x470
> > >   memory_failure+0x979/0xf30
> > >   do_madvise.part.0.cold+0x9c/0xd3
> > >   ? lock_is_held_type+0xe3/0x140
> > >   ? find_held_lock+0x2b/0x80
> > >   ? lock_release+0x145/0x2f0
> > >   ? lock_is_held_type+0xe3/0x140
> > >   ? syscall_enter_from_user_mode+0x20/0x70
> > >   __x64_sys_madvise+0x56/0x70
> > >   do_syscall_64+0x3a/0x80
> > >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > 
> > This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift()
> > was triggered.  I think that BUG_ON is too harsh here because address ==
> > -EFAULT means that there's no mapping for the address.  The subsequent
> > code considers "tk->size_shift == 0" as "no mapping" cases, so
> > dev_pagemap_mapping_shift() can return 0 in such a case?
> > 
> > Could the following diff work for the issue?
> 
> This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl
> suite.
> 
> It then fails on the "device-dax" test with this signature:
> 
>  BUG: kernel NULL pointer dereference, address: 0010
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x) - not-present page
>  PGD 800205073067 P4D 800205073067 PUD 2062b3067 PMD 0 
>  Oops:  [#1] PREEMPT SMP PTI
>  CPU: 22 PID: 4535 Comm: device-dax Tainted: G   OEN 6.0.0-rc2+ 
> #59
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  RIP: 0010:memory_failure+0x667/0xba0
> [..]
>  Call Trace:
>   
>   ? _printk+0x58/0x73
>   do_madvise.part.0.cold+0xaf/0xc5
> 
> Which is:
> 
> (gdb) li *(memory_failure+0x667)
> 0x813b7f17 is in memory_failure (mm/memory-failure.c:1933).
> 1928
> 1929/*
> 1930 * Call driver's implementation to handle the memory failure, 
> otherwise
> 1931 * fall back to generic handler.
> 1932 */
> 1933if (pgmap->ops->memory_failure) {
> 1934rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
> 
> 
> ...I think this is just a simple matter of:
> 
> @@ -1928,7 +1930,7 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>  * Call driver's implementation to handle the memory failure, 
> otherwise
>  * fall back to generic handler.
>  */
> -   if (pgmap->ops->memory_failure) {
> +   if (pgmap->ops && pgmap->ops->memory_failure) {
> rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
> /*
>  * Fall back to generic handler too if operation is not
> 
> 
> ...since device-dax does not implement pagemap ops.
> 
> I will see what else pops up and make sure that this regression always
> runs going forward.

Ok, that was last of the regression fallout that I could find.



Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-08-24 Thread Dan Williams
HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote:
> > Shiyang Ruan wrote:
> > > This new function is a variant of mf_generic_kill_procs that accepts a
> > > file, offset pair instead of a struct to support multiple files sharing
> > > a DAX mapping.  It is intended to be called by the file systems as part
> > > of the memory_failure handler after the file system performed a reverse
> > > mapping from the storage address to the file and file offset.
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > Reviewed-by: Dan Williams 
> > > Reviewed-by: Christoph Hellwig 
> > > Reviewed-by: Darrick J. Wong 
> > > Reviewed-by: Miaohe Lin 
> > > ---
> > >  include/linux/mm.h  |  2 +
> > >  mm/memory-failure.c | 96 -
> > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > 
> > Unfortunately my test suite was only running the "non-destructive" set
> > of 'ndctl' tests which skipped some of the complex memory-failure cases.
> > Upon fixing that, bisect flags this commit as the source of the following
> > crash regression:
> 
> Thank you for testing/reporting.
> 
> > 
> >  kernel BUG at mm/memory-failure.c:310!
> >  invalid opcode:  [#1] PREEMPT SMP PTI
> >  CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G   OE 5.19.0-rc4+ #58
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >  RIP: 0010:add_to_kill+0x304/0x400
> > [..]
> >  Call Trace:
> >   
> >   collect_procs.part.0+0x2c8/0x470
> >   memory_failure+0x979/0xf30
> >   do_madvise.part.0.cold+0x9c/0xd3
> >   ? lock_is_held_type+0xe3/0x140
> >   ? find_held_lock+0x2b/0x80
> >   ? lock_release+0x145/0x2f0
> >   ? lock_is_held_type+0xe3/0x140
> >   ? syscall_enter_from_user_mode+0x20/0x70
> >   __x64_sys_madvise+0x56/0x70
> >   do_syscall_64+0x3a/0x80
> >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift()
> was triggered.  I think that BUG_ON is too harsh here because address ==
> -EFAULT means that there's no mapping for the address.  The subsequent
> code considers "tk->size_shift == 0" as "no mapping" cases, so
> dev_pagemap_mapping_shift() can return 0 in such a case?
> 
> Could the following diff work for the issue?

This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl
suite.

It then fails on the "device-dax" test with this signature:

 BUG: kernel NULL pointer dereference, address: 0010
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 800205073067 P4D 800205073067 PUD 2062b3067 PMD 0 
 Oops:  [#1] PREEMPT SMP PTI
 CPU: 22 PID: 4535 Comm: device-dax Tainted: G   OEN 6.0.0-rc2+ #59
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:memory_failure+0x667/0xba0
[..]
 Call Trace:
  
  ? _printk+0x58/0x73
  do_madvise.part.0.cold+0xaf/0xc5

Which is:

(gdb) li *(memory_failure+0x667)
0x813b7f17 is in memory_failure (mm/memory-failure.c:1933).
1928
1929/*
1930 * Call driver's implementation to handle the memory failure, 
otherwise
1931 * fall back to generic handler.
1932 */
1933if (pgmap->ops->memory_failure) {
1934rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);


...I think this is just a simple matter of:

@@ -1928,7 +1930,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
 * Call driver's implementation to handle the memory failure, otherwise
 * fall back to generic handler.
 */
-   if (pgmap->ops->memory_failure) {
+   if (pgmap->ops && pgmap->ops->memory_failure) {
rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
/*
 * Fall back to generic handler too if operation is not


...since device-dax does not implement pagemap ops.

I will see what else pops up and make sure that this regression always
runs going forward.



RE: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-08-24 Thread Dan Williams
Shiyang Ruan wrote:
> This new function is a variant of mf_generic_kill_procs that accepts a
> file, offset pair instead of a struct to support multiple files sharing
> a DAX mapping.  It is intended to be called by the file systems as part
> of the memory_failure handler after the file system performed a reverse
> mapping from the storage address to the file and file offset.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Dan Williams 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 
> Reviewed-by: Miaohe Lin 
> ---
>  include/linux/mm.h  |  2 +
>  mm/memory-failure.c | 96 -
>  2 files changed, 88 insertions(+), 10 deletions(-)

Unfortunately my test suite was only running the "non-destructive" set
of 'ndctl' tests which skipped some of the complex memory-failure cases.
Upon fixing that, bisect flags this commit as the source of the following
crash regression:

 kernel BUG at mm/memory-failure.c:310!
 invalid opcode:  [#1] PREEMPT SMP PTI
 CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G   OE 5.19.0-rc4+ #58
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:add_to_kill+0x304/0x400
[..]
 Call Trace:
  
  collect_procs.part.0+0x2c8/0x470
  memory_failure+0x979/0xf30
  do_madvise.part.0.cold+0x9c/0xd3
  ? lock_is_held_type+0xe3/0x140
  ? find_held_lock+0x2b/0x80
  ? lock_release+0x145/0x2f0
  ? lock_is_held_type+0xe3/0x140
  ? syscall_enter_from_user_mode+0x20/0x70
  __x64_sys_madvise+0x56/0x70
  do_syscall_64+0x3a/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This is from running:

  meson test -C build dax-ext4.sh

...from the ndctl repo.

I will take look, and posting it here in case I do not find it tonight
and Ruan can take a look.



Re: [PATCH v2] arch/cacheflush: Introduce flush_all_caches()

2022-08-23 Thread Dan Williams
Davidlohr Bueso wrote:
> On Mon, 22 Aug 2022, Dan Williams wrote:
> 
> >Davidlohr Bueso wrote:
> >> On Sun, 21 Aug 2022, Christoph Hellwig wrote:
> >>
> >> >On Fri, Aug 19, 2022 at 10:10:24AM -0700, Davidlohr Bueso wrote:
> >> >> index b192d917a6d0..ac4d4fd4e508 100644
> >> >> --- a/arch/x86/include/asm/cacheflush.h
> >> >> +++ b/arch/x86/include/asm/cacheflush.h
> >> >> @@ -10,4 +10,8 @@
> >> >>
> >> >>  void clflush_cache_range(void *addr, unsigned int size);
> >> >>
> >> >> +/* see comments in the stub version */
> >> >> +#define flush_all_caches() \
> >> >> +   do { wbinvd_on_all_cpus(); } while(0)
> >> >
> >> >Yikes.  This is just a horrible, horrible name and placement for a bad
> >> >hack that should have no generic relevance.
> >>
> >> Why does this have no generic relevance? There's already been discussions
> >> on how much wbinv is hated[0].
> >>
> >> >Please fix up the naming to make it clear that this function is for a
> >> >very specific nvdimm use case, and move it to a nvdimm-specific header
> >> >file.
> >>
> >> Do you have any suggestions for a name? And, as the changelog describes,
> >> this is not nvdimm specific anymore, and the whole point of all this is
> >> volatile memory components for cxl, hence nvdimm namespace is bogus.
> >>
> >> [0] 
> >> https://lore.kernel.org/all/yvtc2u1j%2fqip8...@worktop.programming.kicks-ass.net/
> >
> >While it is not nvdimm specific anymore, it's still specific to "memory
> >devices that can bulk invalidate a physical address space". I.e. it's
> >not as generic as its location in arch/x86/include/asm/cacheflush.h
> >would imply. So, similar to arch_invalidate_pmem(), lets keep it in a
> >device-driver-specific header file, because hch and peterz are right, we
> >need to make this much more clear that it is not for general
> >consumption.
> 
> Fine, I won't argue - although I don't particularly agree, at least wrt
> the naming. Imo my naming does _exactly_ what it should do and is much
> easier to read than arch_has_flush_memregion() which is counter intuitive
> when we are in fact flushing everything. This does not either make it
> any more clearer about virt vs physical mappings either (except that
> it's no longer associated to cacheflush). But, excepting arm cacheflush.h's
> rare arch with braino cache users get way too much credit in their namespace
> usage.


> 
> But yes there is no doubt that my version is more inviting than it should be,
> which made me think of naming it to flush_all_caches_careful() so the user
> is forced to at least check the function (or one would hope).

So I'm not married to arch_has_flush_memregion() or even including the
physical address range to flush, the only aspect of the prototype I want
to see incorporated is something about the target / motivation for the
flush.

"flush_all_caches_careful()" says nothing about what the API is being
"careful" about. It reminds of Linus' comments on memcpy_mcsafe()

https://lore.kernel.org/all/CAHk-=wh1SPyuGkTkQESsacwKTpjWd=_-KwoCK5o=suc3ymd...@mail.gmail.com/

"Naming - like comments - shouldn't be about what some implementation
is, but about the concept."

So "memregion" was meant to represent a memory device backed physical
address range, but that association may only be in my own head.  How
about something even more explicit like:
"flush_after_memdev_invalidate()" where someone would feel icky using it
for anything other than what we have been talking about in this thread.

> Anyway, I'll send a new version based on the below - I particularly agree
> with the hypervisor bits.

Ok, just one more lap around the bikeshed track, but I think we're
converging.



Re: [PATCH v2] arch/cacheflush: Introduce flush_all_caches()

2022-08-22 Thread Dan Williams
Davidlohr Bueso wrote:
> On Sun, 21 Aug 2022, Christoph Hellwig wrote:
> 
> >On Fri, Aug 19, 2022 at 10:10:24AM -0700, Davidlohr Bueso wrote:
> >> index b192d917a6d0..ac4d4fd4e508 100644
> >> --- a/arch/x86/include/asm/cacheflush.h
> >> +++ b/arch/x86/include/asm/cacheflush.h
> >> @@ -10,4 +10,8 @@
> >>
> >>  void clflush_cache_range(void *addr, unsigned int size);
> >>
> >> +/* see comments in the stub version */
> >> +#define flush_all_caches() \
> >> +  do { wbinvd_on_all_cpus(); } while(0)
> >
> >Yikes.  This is just a horrible, horrible name and placement for a bad
> >hack that should have no generic relevance.
> 
> Why does this have no generic relevance? There's already been discussions
> on how much wbinv is hated[0].
> 
> >Please fix up the naming to make it clear that this function is for a
> >very specific nvdimm use case, and move it to a nvdimm-specific header
> >file.
> 
> Do you have any suggestions for a name? And, as the changelog describes,
> this is not nvdimm specific anymore, and the whole point of all this is
> volatile memory components for cxl, hence nvdimm namespace is bogus.
> 
> [0] 
> https://lore.kernel.org/all/yvtc2u1j%2fqip8...@worktop.programming.kicks-ass.net/

While it is not nvdimm specific anymore, it's still specific to "memory
devices that can bulk invalidate a physical address space". I.e. it's
not as generic as its location in arch/x86/include/asm/cacheflush.h
would imply. So, similar to arch_invalidate_pmem(), lets keep it in a
device-driver-specific header file, because hch and peterz are right, we
need to make this much more clear that it is not for general
consumption.

There is already include/linux/memregion.h for identifying memory regions
with a common id across ACPI NFIT, DAX "soft reserved" regions, and CXL.
So how about something like along the same lines as
arch_invalidate_pmem():

diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..0310135d7a42 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -21,3 +21,23 @@ static inline void memregion_free(int id)
 }
 #endif
 #endif /* _MEMREGION_H_ */
+
+/*
+ * Device memory technologies like NVDIMM and CXL have events like
+ * secure erase and dynamic region provision that can invalidate an
+ * entire physical memory address range at once. Limit that
+ * functionality to architectures that have an efficient way to
+ * writeback and invalidate potentially terabytes of memory at once.
+ */
+#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
+void arch_flush_memregion(phys_addr_t phys, resource_size_t size);
+bool arch_has_flush_memregion(void);
+#else
+static inline bool arch_has_flush_memregion(void)
+{
+   return false;
+}
+static void arch_flush_memregion(phys_addr_t phys, resource_size_t size);
+{
+}
+#endif

...where arch_has_flush_memregion() on x86 is:

bool arch_has_flush_memregion(void)
{
return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR)
}

...to make it clear that this API is unavailable in virtual
environments.



Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()

2022-08-16 Thread Dan Williams
Davidlohr Bueso wrote:
> On Tue, 16 Aug 2022, Dan Williams wrote:
> 
> >On Tue, Aug 16, 2022 at 10:30 AM Davidlohr Bueso  wrote:
> >>
> >> On Tue, 16 Aug 2022, Dan Williams wrote:
> >>
> >> >Peter Zijlstra wrote:
> >> >> On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote:
> >> >> > diff --git a/arch/x86/include/asm/cacheflush.h 
> >> >> > b/arch/x86/include/asm/cacheflush.h
> >> >> > index b192d917a6d0..ce2ec9556093 100644
> >> >> > --- a/arch/x86/include/asm/cacheflush.h
> >> >> > +++ b/arch/x86/include/asm/cacheflush.h
> >> >> > @@ -10,4 +10,7 @@
> >> >> >
> >> >> >  void clflush_cache_range(void *addr, unsigned int size);
> >> >> >
> >> >> > +#define flush_all_caches() \
> >> >> > +  do { wbinvd_on_all_cpus(); } while(0)
> >> >> > +
> >> >>
> >> >> This is horrific... we've done our utmost best to remove all WBINVD
> >> >> usage and here you're adding it back in the most horrible form possible
> >> >> ?!?
> >> >>
> >> >> Please don't do this, do *NOT* use WBINVD.
> >> >
> >> >Unfortunately there are a few good options here, and the changelog did
> >> >not make clear that this is continuing legacy [1], not adding new wbinvd
> >> >usage.
> >>
> >> While I was hoping that it was obvious from the intel.c changes that this
> >> was not a new wbinvd, I can certainly improve the changelog with the below.
> >
> >I also think this cache_flush_region() API wants a prominent comment
> >clarifying the limited applicability of this API. I.e. that it is not
> >for general purpose usage, not for VMs, and only for select bare metal
> >scenarios that instantaneously invalidate wide swaths of memory.
> >Otherwise, I can now see how this looks like a potentially scary
> >expansion of the usage of wbinvd.
> 
> Sure.
> 
> Also, in the future we might be able to bypass this hammer in the presence
> of persistent cpu caches.

What would have helped is if the secure-erase and unlock definition in
the specification mandated that the device emit cache invalidations for
everything it has mapped when it is erased. However, that has some
holes, and it also makes me think there is a gap in the current region
provisioning code. If I have device-A mapped at physical-address-X and then
tear that down and instantiate device-B at that same physical address
there needs to be CPU cache invalidation between those 2 events.



Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()

2022-08-16 Thread Dan Williams
On Tue, Aug 16, 2022 at 10:30 AM Davidlohr Bueso  wrote:
>
> On Tue, 16 Aug 2022, Dan Williams wrote:
>
> >Peter Zijlstra wrote:
> >> On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote:
> >> > diff --git a/arch/x86/include/asm/cacheflush.h 
> >> > b/arch/x86/include/asm/cacheflush.h
> >> > index b192d917a6d0..ce2ec9556093 100644
> >> > --- a/arch/x86/include/asm/cacheflush.h
> >> > +++ b/arch/x86/include/asm/cacheflush.h
> >> > @@ -10,4 +10,7 @@
> >> >
> >> >  void clflush_cache_range(void *addr, unsigned int size);
> >> >
> >> > +#define flush_all_caches() \
> >> > +  do { wbinvd_on_all_cpus(); } while(0)
> >> > +
> >>
> >> This is horrific... we've done our utmost best to remove all WBINVD
> >> usage and here you're adding it back in the most horrible form possible
> >> ?!?
> >>
> >> Please don't do this, do *NOT* use WBINVD.
> >
> >Unfortunately there are a few good options here, and the changelog did
> >not make clear that this is continuing legacy [1], not adding new wbinvd
> >usage.
>
> While I was hoping that it was obvious from the intel.c changes that this
> was not a new wbinvd, I can certainly improve the changelog with the below.

I also think this cache_flush_region() API wants a prominent comment
clarifying the limited applicability of this API. I.e. that it is not
for general purpose usage, not for VMs, and only for select bare metal
scenarios that instantaneously invalidate wide swaths of memory.
Otherwise, I can now see how this looks like a potentially scary
expansion of the usage of wbinvd.



Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()

2022-08-16 Thread Dan Williams
Peter Zijlstra wrote:
> On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote:
> > diff --git a/arch/x86/include/asm/cacheflush.h 
> > b/arch/x86/include/asm/cacheflush.h
> > index b192d917a6d0..ce2ec9556093 100644
> > --- a/arch/x86/include/asm/cacheflush.h
> > +++ b/arch/x86/include/asm/cacheflush.h
> > @@ -10,4 +10,7 @@
> > 
> >  void clflush_cache_range(void *addr, unsigned int size);
> > 
> > +#define flush_all_caches() \
> > +   do { wbinvd_on_all_cpus(); } while(0)
> > +
> 
> This is horrific... we've done our utmost best to remove all WBINVD
> usage and here you're adding it back in the most horrible form possible
> ?!?
> 
> Please don't do this, do *NOT* use WBINVD.

Unfortunately there are a few good options here, and the changelog did
not make clear that this is continuing legacy [1], not adding new wbinvd
usage.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the secure is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant. That small scope plus the fact that
none of this is available to a VM limits the potential damage. So,
similar to the mitigation we did in [2] that did not kill off wbinvd
completely, this is limited to specific scenarios and should be disabled
in any scenario where wbinvd is painful / forbidden.

[1]: 4c6926a23b76 ("acpi/nfit, libnvdimm: Add unlock of nvdimm support for 
Intel DIMMs")
[2]: e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")



Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-08 Thread Dan Williams
Jane Chu wrote:
> On 8/3/2022 1:53 AM, Ingo Molnar wrote:
> > 
> > * Jane Chu  wrote:
> > 
> >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> > 
> > s/Commit/commit
> 
> Maintainers,
> Would you prefer a v8, or take care the comment upon accepting the patch?
> 
> > 
> >> poison granularity") that changed nfit_handle_mce() callback to report
> >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> >> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> >> because 0x1000 bytes is 8 512-byte.
> >>
> >> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> >> the LSB field to PAGE_SHIFT instead of consulting the input
> >> struct cper_sec_mem_err record.  So change to rely on hardware whenever
> >> support is available.
> >>
> >> Link: 
> >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> >>
> >> Reviewed-by: Dan Williams 
> >> Reviewed-by: Ingo Molnar 
> >> Signed-off-by: Jane Chu 
> >> ---
> >>   arch/x86/kernel/cpu/mce/apei.c | 13 -
> >>   1 file changed, 12 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
> >> b/arch/x86/kernel/cpu/mce/apei.c
> >> index 717192915f28..8ed341714686 100644
> >> --- a/arch/x86/kernel/cpu/mce/apei.c
> >> +++ b/arch/x86/kernel/cpu/mce/apei.c
> >> @@ -29,15 +29,26 @@
> >>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
> >> *mem_err)
> >>   {
> >>struct mce m;
> >> +  int lsb;
> >>   
> >>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> >>return;
> >>   
> >> +  /*
> >> +   * Even if the ->validation_bits are set for address mask,
> >> +   * to be extra safe, check and reject an error radius '0',
> >> +   * and fall back to the default page size.
> >> +   */
> >> +  if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> >> +  lsb = find_first_bit((void *)_err->physical_addr_mask, 
> >> PAGE_SHIFT);
> >> +  else
> >> +  lsb = PAGE_SHIFT;
> >> +
> >>mce_setup();
> >>m.bank = -1;
> >>/* Fake a memory read error with unknown channel */
> >>m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
> >> MCI_STATUS_MISCV | 0x9f;
> >> -  m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
> >> +  m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
> > 
> > LGTM.
> > 
> > I suppose this wants to go upstream via the tree the bug came from (NVDIMM
> > tree? ACPI tree?), or should we pick it up into the x86 tree?
> 
> No idea.  Maintainers?

There's no real NVDIMM dependency here, just a general cleanup of how
APEI error granularities are managed. So I think it is appropriate for
this to go through the x86 tree via the typical path for mce related
topics.



RE: [PATCH v5 24/32] tools/testing/nvdimm: Convert to printbuf

2022-08-08 Thread Dan Williams
Matthew Wilcox (Oracle) wrote:
> From: Kent Overstreet 
> 
> This converts from seq_buf to printbuf. Here we're using printbuf with
> an external buffer, meaning it's a direct conversion.
> 
> Signed-off-by: Kent Overstreet 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: nvd...@lists.linux.dev

My Acked-by still applies:

https://lore.kernel.org/all/62b61165348f4_a7a2f294d0@dwillia2-xfh.notmuch/

...and Shivaprasad's Tested-by should still apply:

https://lore.kernel.org/all/b299ebe2-88e5-c2bd-bad0-bef62d4ac...@linux.ibm.com/



Re: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Dan Williams
Jane Chu wrote:
> On 8/1/2022 9:44 AM, Dan Williams wrote:
> > Jane Chu wrote:
> >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> >> poison granularity") that changed nfit_handle_mce() callback to report
> >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> >> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> >> because 0x1000 bytes is 8 512-byte.
> >>
> >> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> >> the LSB field to PAGE_SHIFT instead of consulting the input
> >> struct cper_sec_mem_err record.  So change to rely on hardware whenever
> >> support is available.
> >>
> >> Link: 
> >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> >>
> >> Reviewed-by: Dan Williams 
> >> Signed-off-by: Jane Chu 
> >> ---
> >>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
> >>   1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
> >> b/arch/x86/kernel/cpu/mce/apei.c
> >> index 717192915f28..2c7ea0ba9dd7 100644
> >> --- a/arch/x86/kernel/cpu/mce/apei.c
> >> +++ b/arch/x86/kernel/cpu/mce/apei.c
> >> @@ -29,15 +29,27 @@
> >>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
> >> *mem_err)
> >>   {
> >>struct mce m;
> >> +  int lsb = PAGE_SHIFT;
> >>   
> >>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> >>return;
> >>   
> >> +  /*
> >> +   * Even if the ->validation_bits are set for address mask,
> >> +   * to be extra safe, check and reject an error radius '0',
> >> +   * and fallback to the default page size.
> >> +   */
> >> +  if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
> >> +  lsb = __ffs64(mem_err->physical_addr_mask);
> >> +  if (lsb == 1)
> > 
> > This was the reason I recommended hweight64 and min_not_zero() as
> > hweight64 does not have the undefined behavior. However, an even better
> > option is to just do:
> > 
> >  find_first_bit(_err->physical_addr_mask, PAGE_SHIFT)
> > 
> > ...as that trims the result to the PAGE_SHIFT max and handles the zero
> > case.
> 
> Thanks Dan!  However it looks like find_first_bit() could call into 
> __ffs(x) which has the same limitation as __ffs64(x), as Tony pointed out.

Not quite, no. __ffs() behavior is *undefined* if the input is zero.
find_first_bit() is *defined* and returns @size is the input is zero.
Which is the behavior this wants to default to PAGE_SHIFT in the absence
of any smaller granularity information.



RE: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Dan Williams
Jane Chu wrote:
> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> poison granularity") that changed nfit_handle_mce() callback to report
> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> because 0x1000 bytes is 8 512-byte.
> 
> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> the LSB field to PAGE_SHIFT instead of consulting the input
> struct cper_sec_mem_err record.  So change to rely on hardware whenever
> support is available.
> 
> Link: 
> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> 
> Reviewed-by: Dan Williams 
> Signed-off-by: Jane Chu 
> ---
>  arch/x86/kernel/cpu/mce/apei.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index 717192915f28..2c7ea0ba9dd7 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -29,15 +29,27 @@
>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
> *mem_err)
>  {
>   struct mce m;
> + int lsb = PAGE_SHIFT;
>  
>   if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>   return;
>  
> + /*
> +  * Even if the ->validation_bits are set for address mask,
> +  * to be extra safe, check and reject an error radius '0',
> +  * and fallback to the default page size.
> +  */
> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
> + lsb = __ffs64(mem_err->physical_addr_mask);
> + if (lsb == 1)

This was the reason I recommended hweight64 and min_not_zero() as
hweight64 does not have the undefined behavior. However, an even better
option is to just do:

find_first_bit(_err->physical_addr_mask, PAGE_SHIFT)

...as that trims the result to the PAGE_SHIFT max and handles the zero
case.



  1   2   3   4   5   6   7   8   9   10   >