Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Verma, Vishal L wrote:
> > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
> > >  	if (!victim)
> > >  		return -ENXIO;
> > >  
> > > -	rc = down_write_killable(&dax_region_rwsem);
> > > -	if (rc)
> > > -		return rc;
> > > -	rc = down_write_killable(&dax_dev_rwsem);
> > > -	if (rc) {
> > > -		up_write(&dax_region_rwsem);
> > > -		return rc;
> > > -	}
> > > +	device_lock(dev);
> > > +	device_lock(victim);
> > >  	dev_dax = to_dev_dax(victim);
> > > +	rc = down_write_killable(&dax_dev_rwsem);
> >
> > This begs the question, why down_write_killable(), but not
> > device_lock_interruptible()?
>
> Do you mean change the device_lock()s to device_lock_interruptible() in
> addition to taking the rwsem (i.e. not instead of the rwsem)?

I mean convert the rwsem to drop _killable.

> I guess I just restored what was there previously - but the
> interruptible variant makes sense, I can make that change.

So the original code did device_lock(), then the rework added a killable rwsem (and deleted the device_lock()), and now the fixes add device_lock() back. Now that there is a mix of killable/interruptible lock usage, all the locks should agree. Since there is really no risk of these operations being long running, there is no driving need to make them killable/interruptible, so go with the simple option.
Re: [PATCH v2 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show
Vishal Verma wrote: > In size_show(), the dax_dev_rwsem only needs a read lock, but was > acquiring a write lock. Change it to down_read_interruptible() so it > doesn't unnecessarily hold a write lock. > > Cc: Dan Williams > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Looks good, Reviewed-by: Dan Williams
Re: [PATCH v2 3/4] dax/bus.c: Don't use down_write_killable for non-user processes
Vishal Verma wrote: > Change an instance of down_write_killable() to a simple down_write() where > there is no user process that might want to interrupt the operation. > > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 6 +- > 1 file changed, 1 insertion(+), 5 deletions(-) Looks good, Reviewed-by: Dan Williams
Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Vishal Verma wrote:
> Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> was a bit overzealous in eliminating device_lock() usage, and ended up
> removing a couple of lock acquisitions which were needed. As a result,
> fix some of the conditional locking missteps that the above commit
> introduced in unregister_dev_dax() and unregister_dax_mapping().

I think it makes sense to tell the story a bit about why the delete_store() conversion was problematic, because the unregister_dev_dax() changes were just a knock-on effect of fixing the delete_store() flow. Something like:

---
commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") aimed to undo device_lock() abuses for protecting changes to dax-driver internal data structures, from the dax_region resource tree to device-dax-instance range structures. However, the device_lock() was legitimately enforcing that devices to be deleted were not currently attached to any driver nor assigned any capacity from the region.
---

...you can fill in a couple of notes about the knock-on fixups after that was restored.
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Reported-by: Dan Williams
> Signed-off-by: Vishal Verma
> ---
>  drivers/dax/bus.c | 44 ++--
>  1 file changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 7924dd542a13..4e04b228b080 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>  		trim_dev_dax_range(dev_dax);
>  }
>  
> -static void __unregister_dev_dax(void *dev)
> +static void unregister_dev_dax(void *dev)
>  {
>  	struct dev_dax *dev_dax = to_dev_dax(dev);
>  
>  	dev_dbg(dev, "%s\n", __func__);
>  
> +	down_write(&dax_region_rwsem);
>  	kill_dev_dax(dev_dax);
>  	device_del(dev);
>  	free_dev_dax_ranges(dev_dax);
>  	put_device(dev);
> -}
> -
> -static void unregister_dev_dax(void *dev)
> -{
> -	if (rwsem_is_locked(&dax_region_rwsem))
> -		return __unregister_dev_dax(dev);
> -
> -	if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
> -		return;
> -	__unregister_dev_dax(dev);
>  	up_write(&dax_region_rwsem);
>  }
>  
> @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
>  	if (!victim)
>  		return -ENXIO;
>  
> -	rc = down_write_killable(&dax_region_rwsem);
> -	if (rc)
> -		return rc;
> -	rc = down_write_killable(&dax_dev_rwsem);
> -	if (rc) {
> -		up_write(&dax_region_rwsem);
> -		return rc;
> -	}
> +	device_lock(dev);
> +	device_lock(victim);
>  	dev_dax = to_dev_dax(victim);
> +	rc = down_write_killable(&dax_dev_rwsem);

This begs the question: why down_write_killable(), but not device_lock_interruptible()?

I do not expect any of this is long running, so likely down_write() is sufficient here, especially since the heaviest locks to acquire are already held by the time the rwsem is considered.

Other than that this looks good to me; you can include my Reviewed-by on the next posting.
Re: [PATCH v2 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
Vishal Verma wrote: > In [1], Dan points out that all of the WARN_ON_ONCE() usage in the > referenced patch should be replaced with lockdep_assert_held, or > lockdep_held_assert_write(). Replace these as appropriate. > > Link: > https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch > [1] > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Cc: Andrew Morton > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) Looks good, Reviewed-by: Dan Williams
Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
Vishal Verma wrote:
> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held(_write)().
>
> Replace those, and additionally, replace a couple of other
> WARN_ON_ONCE() introduced in the same patch for actual failure
> cases (i.e. when acquiring a semaphore fails in a remove / unregister
> path) with dev_WARN_ONCE() as is the precedent here.

/me goes to look why failures are warned vs bubbling up the error...

> Recall that previously, unregistration paths were implicitly protected by
> overloading the device lock, which the patch in [1] sought to remove.
> This meant adding a semaphore acquisition in these unregistration paths.
> Since that can fail, and it doesn't make sense to return errors from
> these paths, retain the two instances of (now) dev_WARN_ONCE().
>
> Link: https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch [1]
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
> Cc: Andrew Morton
> Reported-by: Dan Williams
> Signed-off-by: Vishal Verma
> ---
>  drivers/dax/bus.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 797e1ebff299..d89c8c3b62f7 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
[..]
> @@ -726,7 +728,8 @@ static void unregister_dax_mapping(void *data)

...and realizes that this hunk is confused.

>  	if (rwsem_is_locked(&dax_region_rwsem))
>  		return __unregister_dax_mapping(data);

This is a bug. Either it is safe to call this without the lock held, or it must be safe to always acquire the lock. Anything else means the caller needs to be refactored. Conditional locking is an indicator of a deeper problem with code organization.

Yes, this was not part of the fixup, but the changelog led me here because no WARNs should remain at the end of this effort.
I think the confusion results from replacing *all* device_lock() usage when some device_lock() is still needed.

> -	if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
> +	if (dev_WARN_ONCE(data, down_write_killable(&dax_region_rwsem) != 0,
> +			  "unable to acquire region rwsem\n"))

In a context like this, which cannot return an error directly to the user process making the request (unlike, say, a sysfs operation handler), the locking cannot worry about signals. This would become an unconditional down_write() lock.

So down_write_killable() is typically for request contexts where there is a user process waiting for a result. For contexts like async driver probing, where we might be in a kernel thread, and definitely in functions that return 'void', it is a bug to use fallible locks.
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
Rick Edgecombe wrote: > The mm_struct contains a function pointer *get_unmapped_area(), which > is set to either arch_get_unmapped_area() or > arch_get_unmapped_area_topdown() during the initialization of the mm. > > Since the function pointer only ever points to two functions that are named > the same across all arch's, a function pointer is not really required. In > addition future changes will want to add versions of the functions that > take additional arguments. So to save a pointers worth of bytes in > mm_struct, and prevent adding additional function pointers to mm_struct in > future changes, remove it and keep the information about which > get_unmapped_area() to use in a flag. > > Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by > mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing > behavior mm->get_unmapped_area() would get copied to the new mm in > dup_mm(), so not clobbering the flag preserves the existing behavior > around inheriting the topdown-ness. > > Introduce a helper, mm_get_unmapped_area(), to easily convert code that > refers to the old function pointer to instead select and call either > arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the > flag. Then drop the mm->get_unmapped_area() function pointer. Leave the > get_unmapped_area() pointer in struct file_operations alone. The main > purpose of this change is to reorganize in preparation for future changes, > but it also converts the calls of mm->get_unmapped_area() from indirect > branches into a direct ones. > > The stress-ng bigheap benchmark calls realloc a lot, which calls through > get_unmapped_area() in the kernel. On x86, the change yielded a ~1% > improvement there on a retpoline config. > > In testing a few x86 configs, removing the pointer unfortunately didn't > result in any actual size reductions in the compiled layout of mm_struct. 
> But depending on compiler or arch alignment requirements, the change could > shrink the size of mm_struct. > > Signed-off-by: Rick Edgecombe > Acked-by: Dave Hansen > Acked-by: Liam R. Howlett > Reviewed-by: Kirill A. Shutemov > Cc: linux-s...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-fsde...@vger.kernel.org > Cc: io-ur...@vger.kernel.org > Cc: b...@vger.kernel.org > --- > v4: > - Split out pde_get_unmapped_area() refactor into separate patch >(Christophe Leroy) > > v3: > - Fix comment that still referred to mm->get_unmapped_area() > - Resolve trivial rebase conflicts with "mm: thp_get_unmapped_area must >honour topdown preference" > - Spelling fix in log > > v2: > - Fix comment on MMF_TOPDOWN (Kirill, rppt) > - Move MMF_TOPDOWN to actually unused bit > - Add MMF_TOPDOWN to MMF_INIT_MASK so it doesn't get clobbered on fork, >and result in the children using the search up path. 
> - New lower performance results after above bug fix > - Add Reviews and Acks > --- > arch/s390/mm/hugetlbpage.c | 2 +- > arch/s390/mm/mmap.c | 4 ++-- > arch/sparc/kernel/sys_sparc_64.c | 15 ++- > arch/sparc/mm/hugetlbpage.c | 2 +- > arch/x86/kernel/cpu/sgx/driver.c | 2 +- > arch/x86/mm/hugetlbpage.c| 2 +- > arch/x86/mm/mmap.c | 4 ++-- > drivers/char/mem.c | 2 +- > drivers/dax/device.c | 6 +++--- > fs/hugetlbfs/inode.c | 4 ++-- > fs/proc/inode.c | 3 ++- > fs/ramfs/file-mmu.c | 2 +- > include/linux/mm_types.h | 6 +- > include/linux/sched/coredump.h | 5 - > include/linux/sched/mm.h | 5 + > io_uring/io_uring.c | 2 +- > kernel/bpf/arena.c | 2 +- > kernel/bpf/syscall.c | 2 +- > mm/debug.c | 6 -- > mm/huge_memory.c | 9 - > mm/mmap.c| 21 ++--- > mm/shmem.c | 11 +-- > mm/util.c| 6 +++--- > 23 files changed, 66 insertions(+), 57 deletions(-) > > diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c > index c2e8242bd15d..219d906fe830 100644 > --- a/arch/s390/mm/hugetlbpage.c > +++ b/arch/s390/mm/hugetlbpage.c > @@ -328,7 +328,7 @@ unsigned long hugetlb_get_unmapped_area(struct file > *file, unsigned long addr, > goto check_asce_limit; > } > > - if (mm->get_unmapped_area == arch_get_unmapped_area) > + if (!test_bit(MMF_TOPDOWN, >flags)) > addr = hugetlb_get_unmapped_area_bottomup(file, addr, len, > pgoff, flags); > else > diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c > index b14fc0887654..6b2e4436ad4a 100644 > --- a/arch/s390/mm/mmap.c > +++ b/arch/s390/mm/mmap.c > @@
Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem
Vishal Verma wrote: > The dax driver incorrectly used driver-core device locks to protect > internal dax region and dax device configuration structures. Replace the > device lock usage with a local rwsem, one each for dax region > configuration and dax device configuration. As a result of this > conversion, no device_lock() usage remains in dax/bus.c. > > Cc: Dan Williams > Reported-by: Greg Kroah-Hartman > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 220 > ++ > 1 file changed, 157 insertions(+), 63 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1ff1ab5fa105..cb148f74ceda 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -12,6 +12,18 @@ > > static DEFINE_MUTEX(dax_bus_lock); > > +/* > + * All changes to the dax region configuration occur with this lock held > + * for write. > + */ > +DECLARE_RWSEM(dax_region_rwsem); > + > +/* > + * All changes to the dax device configuration occur with this lock held > + * for write. > + */ > +DECLARE_RWSEM(dax_dev_rwsem); > + > #define DAX_NAME_LEN 30 > struct dax_id { > struct list_head list; > @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax) > u64 size = 0; > int i; > > - device_lock_assert(_dax->dev); > + WARN_ON_ONCE(!rwsem_is_locked(_dev_rwsem)); Apologies for the late review, but... All of these WARN_ON_ONCE() usages should be replaced with lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote: > Ira Weiny wrote: > > Jonathan Cameron wrote: > > > On Wed, 14 Feb 2024 10:23:10 -0500 > > > Steven Rostedt wrote: > > > > > > > On Wed, 14 Feb 2024 12:11:53 + > > > > Jonathan Cameron wrote: > > > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > > land and > > > > > assume this will be resolved as well? > > > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > > > trace events it can hang the machine. > > > > > > > > So, you can use your internal patch locally, but I would recommend > > > > waiting > > > > for the new printk changes to land. > > > > Steven, Do you think that will land in 6.9? > > > > > > > > > > I'm really hoping that will be soon! > > > > > > > > -- Steve > > > > > > Thanks Steve, > > > > > > Ira's fix is needed for other valid locking reasons - this was 'just > > > another' > > > lock debugging report that came up whilst testing it. > > > > > > For this patch (not a potential additional one that we aren't going to do > > > ;) > > > > > > Tested-by: Jonathan Cameron > > > Reviewed-by: Jonathan Cameron > > > > Jonathan, > > > > Again thanks for the testing! However, Dan and I just discussed this and > > he has an uneasy feeling about going forward with this for 6.8 final. > > > > If we revert the following patch I can squash this fix and wait for the > > tp_printk() fix to land in 6.9 and resubmit. > > > > Dan here is the patch which backs out the actual bug: > > > > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") > > Unfortunately this is not the only patch. > > We need to revert this too: > > Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") > > And then revert ... > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") > > ... but there is a conflict. > > Dan, below is the correct revert patch. Let me know if you need more. 
> >
> > Ira
>
> commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab
> Author: Ira Weiny
> Date:   Wed Feb 14 15:25:24 2024 -0800
>
>     Revert "acpi/ghes: Process CXL Component Events"
>
>     This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5.

Even reverts need changelogs; I can add one.

I got conflicts trying to apply this to the current fixes branch. I think I am going to just surgically back out the drivers/acpi/apei/ghes.c changes.
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote:
[..]
> > As Steve said, tp_printk is a hack (a very useful one) and
> > hopefully no one runs it in production.
>
> OMG... I did not realize what tp_printk() was exactly. I should have
> looked closer.
>
> Do we have evidence of its use in production?
>
> I would love to not have to revert/respin,

The revert is for 2 non-trivial fixes needed in one new feature; let's just circle back and get it right for v6.9. The tp_printk() was not the final straw for me.
RE: CPU data cache across reboot/kexec for pmem/dax devices
Mathieu Desnoyers wrote: > Hi Dan, > > In the context of extracting user-space trace data when the kernel crashes, > the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area > of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the > resulting device to create/mount a dax-enabled fs (e.g. ext4). > > We then use this filesystem to mmap() the shared memory files for the tracer. > > I want to make sure that the very last events from the userspace tracer > written > to the memory mapped buffers (mmap()) by userspace are present after a > warm-reboot (or kexec/kdump). > > Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush > (or equivalent pmem_persist() from libpmem) for performance reasons: ring > buffer > data is usually overwritten many times before the system actually crashes, and > the only thing we really need to make sure is that the cache lines are not > invalidated without write back. > > So I understand that the main use-case for pmem is nvdimm, and that in order > to > guarantee persistence of the data on power off an explicit pmem_persist() is > needed after each "transaction", but for the needs of tracing, is there some > kind of architectural guarantee that the data present in the cpu data cache > is not invalidated prior to write back in each of those scenarios ? > > - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. > - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state.
Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
Dave Chinner wrote: > On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote: > > commit d92576f1167c ("dax: does not work correctly with virtual aliasing > > caches") > > prevents DAX from building on architectures with virtually aliased > > dcache with: > > > > depends on !(ARM || MIPS || SPARC) > > > > This check is too broad (e.g. recent ARMv7 don't have virtually aliased > > dcaches), and also misses many other architectures with virtually > > aliased dcache. > > > > This is a regression introduced in the v5.13 Linux kernel where the > > dax mount option is removed for 32-bit ARMv7 boards which have no dcache > > aliasing, and therefore should work fine with FS_DAX. > > > > This was turned into the following implementation of dax_is_supported() > > by a preparatory change: > > > > return !IS_ENABLED(CONFIG_ARM) && > >!IS_ENABLED(CONFIG_MIPS) && > >!IS_ENABLED(CONFIG_SPARC); > > > > Use dcache_is_aliasing() instead to figure out whether the environment > > has aliasing dcaches. 
> > > > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > > caches") > > Signed-off-by: Mathieu Desnoyers > > Cc: Andrew Morton > > Cc: Linus Torvalds > > Cc: linux...@kvack.org > > Cc: linux-a...@vger.kernel.org > > Cc: Dan Williams > > Cc: Vishal Verma > > Cc: Dave Jiang > > Cc: Matthew Wilcox > > Cc: Arnd Bergmann > > Cc: Russell King > > Cc: nvd...@lists.linux.dev > > Cc: linux-...@vger.kernel.org > > Cc: linux-fsde...@vger.kernel.org > > --- > > include/linux/dax.h | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index cfc8cd4a3eae..f59e604662e4 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -5,6 +5,7 @@ > > #include > > #include > > #include > > +#include > > > > typedef unsigned long dax_entry_t; > > > > @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct > > vm_area_struct *vma, > > } > > static inline bool dax_is_supported(void) > > { > > - return !IS_ENABLED(CONFIG_ARM) && > > - !IS_ENABLED(CONFIG_MIPS) && > > - !IS_ENABLED(CONFIG_SPARC); > > + return !dcache_is_aliasing(); > > Yeah, if this is just a one liner should go into > fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the > start of the function. > > I also noticed that device mapper uses fs_dax_get_by_bdev() to > determine if it can support DAX, but this patch set does not address > that case. Hence it really seems to me like fs_dax_get_by_bdev() is > the right place to put this check. Oh, good catch. Yes, I agree this can definitely be pushed down, but then I think it should be pushed down all the way to make alloc_dax() fail. 
That will need some additional fixups like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..a35e60e62440 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor)
 		md->dax_dev = alloc_dax(md, &dm_dax_ops);
 		if (IS_ERR(md->dax_dev)) {
 			md->dax_dev = NULL;
-			goto bad;
+		} else {
+			set_dax_nocache(md->dax_dev);
+			set_dax_nomc(md->dax_dev);
+			if (dax_add_host(md->dax_dev, md->disk))
+				goto bad;
 		}
-		set_dax_nocache(md->dax_dev);
-		set_dax_nomc(md->dax_dev);
-		if (dax_add_host(md->dax_dev, md->disk))
-			goto bad;
 	}
 
 	format_dev_t(md->name, MKDEV(_major, minor));

...to make it not fatal to fail to register the dax_dev.
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
Mathieu Desnoyers wrote: > This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, > even on ARMv7 which does not have virtually aliased dcaches: > > commit d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > > It used to work fine before: I have customers using dax over pmem on > ARMv7, but this regression will likely prevent them from upgrading their > kernel. > > The root of the issue here is the fact that DAX was never designed to > handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It > touches the pages through their linear mapping, which is not consistent > with the userspace mappings on virtually aliased dcaches. > > This patch series introduces cache_is_aliasing() with new Kconfig > options: > > * ARCH_HAS_CACHE_ALIASING > * ARCH_HAS_CACHE_ALIASING_DYNAMIC > > and implements it for all architectures. The "DYNAMIC" implementation > implements cache_is_aliasing() as a runtime check, which is what is > needed on architectures like 32-bit ARMV6 and ARMV6K. > > With this we can basically narrow down the list of architectures which > are unsupported by DAX to those which are really affected. > > Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right?
Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers
Greg Kroah-Hartman wrote: > On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote: > > Use the guard(device) macro to lock a 'struct device', and unlock it > > automatically when going out of scope using Scope Based Resource > > Management semantics. A lot of the sysfs attribute writes in > > drivers/dax/bus.c benefit from a cleanup using these, so change these > > where applicable. > > Wait, why are you needing to call device_lock() at all here? Why is dax > special in needing this when no other subsystem requires it? > > > > > Cc: Joao Martins > > Cc: Dan Williams > > Signed-off-by: Vishal Verma > > --- > > drivers/dax/bus.c | 143 > > ++ > > 1 file changed, 59 insertions(+), 84 deletions(-) > > > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > > index 1ff1ab5fa105..6226de131d17 100644 > > --- a/drivers/dax/bus.c > > +++ b/drivers/dax/bus.c > > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev, > > struct device_attribute *attr, char *buf) > > { > > struct dax_region *dax_region = dev_get_drvdata(dev); > > - unsigned long long size; > > > > - device_lock(dev); > > - size = dax_region_avail_size(dax_region); > > - device_unlock(dev); > > + guard(device)(dev); > > You have a valid device here, why are you locking it? How can it go > away? And if it can, shouldn't you have a local lock for it, and not > abuse the driver core lock? Yes, this is a driver-core lock abuse written by someone who should have known better. And yes, a local lock to protect the dax_region resource tree should replace this. A new rwsem to synchronize all list walks seems appropriate.
[PATCH] driver core: Add a guard() definition for the device_lock()
At present there are ~200 usages of device_lock() in the kernel. Some of those usages lead to "goto unlock;" patterns which have proven to be error prone. Define a "device" guard() definition to allow for those to be cleaned up and prevent new ones from appearing.

Link: http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
Link: http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
Cc: Vishal Verma
Cc: Ira Weiny
Cc: Peter Zijlstra
Cc: Greg Kroah-Hartman
Cc: Andrew Morton
Signed-off-by: Dan Williams
---
Hi Greg,

I wonder if you might include this change in v6.7-rc to ease some patch sets alternately going through my tree and Andrew's tree. Those discussions are linked above. Alternately I can just take it through my tree with your ack, and the other use case can circle back to it in the v6.9 cycle.

I considered also defining a __free() helper similar to __free(mutex), but I think "__free(device)" would be a surprising name for something that drops a lock. Also, I like the syntax of guard(device) over something like guard(device_lock), since a 'struct device *' is the argument, not a lock type, but I'm open to your or Peter's thoughts on the naming.

 include/linux/device.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index d7a72a8749ea..6c83294395ac 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
 	mutex_unlock(&dev->mutex);
 }
 
+DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
+
 static inline void device_lock_assert(struct device *dev)
 {
 	lockdep_assert_held(&dev->mutex);
Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior
Verma, Vishal L wrote:
> On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
> > Vishal Verma writes:
> >
> > > Add a sysfs knob for dax devices to control the memmap_on_memory setting
> > > if the dax device were to be hotplugged as system memory.
> > >
> > > The default memmap_on_memory setting for dax devices originating via
> > > pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> > > preserve legacy behavior. For dax devices via CXL, the default is on.
> > > The sysfs control allows the administrator to override the above
> > > defaults if needed.
> > >
> > > Cc: David Hildenbrand
> > > Cc: Dan Williams
> > > Cc: Dave Jiang
> > > Cc: Dave Hansen
> > > Cc: Huang Ying
> > > Tested-by: Li Zhijian
> > > Reviewed-by: Jonathan Cameron
> > > Reviewed-by: David Hildenbrand
> > > Signed-off-by: Vishal Verma
> > > ---
> > >  drivers/dax/bus.c                       | 47 +
> > >  Documentation/ABI/testing/sysfs-bus-dax | 17
> > >  2 files changed, 64 insertions(+)
> > >
> > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > index 1ff1ab5fa105..2871e5188f0d 100644
> > > --- a/drivers/dax/bus.c
> > > +++ b/drivers/dax/bus.c
> > > @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
> > >  }
> > >  static DEVICE_ATTR_RO(numa_node);
> > >
> > > +static ssize_t memmap_on_memory_show(struct device *dev,
> > > +		struct device_attribute *attr, char *buf)
> > > +{
> > > +	struct dev_dax *dev_dax = to_dev_dax(dev);
> > > +
> > > +	return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> > > +}
> > > +
> > > +static ssize_t memmap_on_memory_store(struct device *dev,
> > > +		struct device_attribute *attr,
> > > +		const char *buf, size_t len)
> > > +{
> > > +	struct device_driver *drv = dev->driver;
> > > +	struct dev_dax *dev_dax = to_dev_dax(dev);
> > > +	struct dax_region *dax_region = dev_dax->region;
> > > +	struct dax_device_driver *dax_drv = to_dax_drv(drv);
> > > +	ssize_t rc;
> > > +	bool val;
> > > +
> > > +	rc = kstrtobool(buf, &val);
> > > +	if (rc)
> > > +		return rc;
> > > +
> > > +	if (dev_dax->memmap_on_memory == val)
> > > +		return len;
> > > +
> > > +	device_lock(dax_region->dev);
> > > +	if (!dax_region->dev->driver) {
> > > +		device_unlock(dax_region->dev);
> > > +		return -ENXIO;
> > > +	}
> >
> > I think that it should be OK to write to "memmap_on_memory" if no driver
> > is bound to the device. We just need to avoid writing to it when the kmem
> > driver is bound.
>
> Oh this is just a check on the region driver, not for a dax driver
> being bound to the device. It's the same as what things like
> align_store(), size_store() etc. do for dax device reconfiguration.
>
> That said, it might be okay to remove this check, as this operation
> doesn't change any attributes of the dax region (the other interfaces I
> mentioned above can affect regions, so we want to lock the region
> device). If removing the check, we'd drop the region lock acquisition
> as well.
>
> Dan, any thoughts on this?

Since this is a dev_dax attribute, it would have already been synchronously shut down when dax_region->dev->driver transitioned to NULL. I.e. region unbind causes dev_dax deletion. However, there is a different issue here: dev->driver was referenced without the device_lock(). Additionally, I think this function also makes it clear that a device_lock() flavor of guard() would be useful:

DEFINE_GUARD(dev, struct device *, device_lock(_T), device_unlock(_T))

...then I would expect something like:

	guard(dev)(dev);

	if (dev_dax->memmap_on_memory != val && dev->driver &&
	    to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE)
		return -EBUSY;

	dev_dax->memmap_on_memory = val;

	return len;

...maybe with some temp variables to reduce the dereference chain, but you get the idea. Only prevent changes while the device is active under kmem.
RE: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()
Andy Shevchenko wrote: > The acpi_evaluate_dsm_typed() provides a way to check the type of the > object evaluated by _DSM call. Use it instead of open coded variant. Looks good to me. Reviewed-by: Dan Williams
Re: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()
Wilczynski, Michal wrote: > > > On 10/2/2023 3:54 PM, Andy Shevchenko wrote: > > The acpi_evaluate_dsm_typed() provides a way to check the type of the > > object evaluated by _DSM call. Use it instead of open coded variant. > > > > Signed-off-by: Andy Shevchenko > > --- > > drivers/acpi/nfit/core.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > > index f96bf32cd368..280da408c02c 100644 > > --- a/drivers/acpi/nfit/core.c > > +++ b/drivers/acpi/nfit/core.c > > @@ -1737,9 +1737,8 @@ __weak void nfit_intel_shutdown_status(struct > > nfit_mem *nfit_mem) > > if ((nfit_mem->dsm_mask & (1 << func)) == 0) > > return; > > > > - out_obj = acpi_evaluate_dsm(handle, guid, revid, func, _obj); > > - if (!out_obj || out_obj->type != ACPI_TYPE_BUFFER > > - || out_obj->buffer.length < sizeof(smart)) { > > + out_obj = acpi_evaluate_dsm_typed(handle, guid, revid, func, _obj, > > ACPI_TYPE_BUFFER); > > This line is 90 characters long, wouldn't it be better to split it ? 100 characters is allowed these days. > > + if (!out_obj || out_obj->buffer.length < sizeof(smart)) { > > dev_dbg(dev->parent, "%s: failed to retrieve initial health\n", > > dev_name(dev)); > > While at it maybe fix alignment ? :-) Life is too short to fiddle with indentation. For old code leave it alone, for new code just use clang-format.
Re: [PATCH v3] ACPI: NFIT: Use cleanup.h helpers instead of devm_*()
Michal Wilczynski wrote: > The new cleanup.h facilities that arrived in v6.5-rc1 can replace the > usage of devm semantics in acpi_nfit_init_interleave_set(). That > routine appears to only be using devm to avoid goto statements. The > new __free() annotation at variable declaration time can achieve the same > effect more efficiently. > > There are no end-user-visible side effects from this patch; I was > motivated to send this cleanup to practice using the new helpers. > > Suggested-by: Dave Jiang > Suggested-by: Andy Shevchenko > Reviewed-by: Dave Jiang > Reviewed-by: Andy Shevchenko > Signed-off-by: Michal Wilczynski > --- > > Dan, would you like me to give you credit for the changelog changes > with Co-developed-by tag ? Nope, this looks good to me, thanks for fixing it up. Reviewed-by: Dan Williams
RE: [PATCH v2 5/6] ACPI: NFIT: Replace acpi_driver with platform_driver
Michal Wilczynski wrote: > NFIT driver uses struct acpi_driver incorrectly to register itself. > This is wrong as the instances of the ACPI devices are not meant > to be literal devices, they're supposed to describe ACPI entry of a > particular device. > > Use platform_driver instead of acpi_driver. In relevant places call > platform devices instances pdev to make a distinction with ACPI > devices instances. > > NFIT driver uses devm_*() family of functions extensively. This change > has no impact on correct functioning of the whole devm_*() family of > functions, since the lifecycle of the device stays the same. It is still > being created during the enumeration, and destroyed on platform device > removal. I notice this verbiage has the same fundamental misunderstanding of devm allocation lifetime as the acpi_nfit_init_interleave_set() discussion. The devm allocation lifetime typically starts in driver->probe() and ends either with driver->probe() failure, or the driver->remove() call. Note that the driver->remove() call is invoked not only for platform-device removal, but also driver "unbind" events. So the "destroyed on platform device removal" is the least likely way that these allocations are torn down given ACPI0012 devices are never removed. Outside of that, my main concern about this patch is that I expect it breaks unit tests. The infrastructure in tools/testing/nvdimm/test/nfit.c emulates an ACPI0012 device that allows for deeper regression testing given hardware is difficult to come by, and because QEMU does not implement some of the tricky corner cases that the unit tests cover. This needs to pass tests, but fair warning, tools/testing/nvdimm/test/nfit.c does some non-idiomatic + "strange" things to achieve deeper test coverage. So if this breaks the tests, as I expect it will, the effort needed to fix the emulation could be significant.
If you want to give running the tests a try the easiest would be to use "run_qemu.sh" with --nfit-test option [1], or you can try to setup an environment manually using the ndctl instructions [2]. [1]: https://github.com/pmem/run_qemu [2]: https://github.com/pmem/ndctl#readme
Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()
Wilczynski, Michal wrote: > > > On 10/13/2023 7:05 PM, Dan Williams wrote: > > Wilczynski, Michal wrote: > >> On 10/13/2023 6:38 PM, Dan Williams wrote: > >>> Michal Wilczynski wrote: > >>>> devm_*() family of functions purpose is managing memory attached to a > >>>> device. So in general it should only be used for allocations that should > >>>> last for the whole lifecycle of the device. > >>> No, this assertion is not accurate, if it were strictly true then > >>> devm_kfree() should be deleted. This patch is only a cleanup to switch > >>> the automatic cleanup pattern from devm to the new cleanup.h helpers. > >> The memory in question is only used locally in a function, so there is no > >> reason > >> to use devm_*() family of functions. I think devm_kfree() is more for > >> special > >> cases where the memory is meant to be used for the whole lifecycle of > >> device, > >> but some special case occurs and it's not and it needs to be freed. > >> > >> This is an incorrect API usage. Would you propose to change all memory > >> allocations currently being done to devm_*() family simply because > >> devm_kfree() > >> exists ? > > Michal, please work with someone else to get these cleanups upstream, I > > am done with this thread. > > I'm really sorry if I offended you, I didn't mean to. Hey, it happens. I am not offended, just frustrated. Going forward, my advice for anyone doing advocacy for a patch set is to be clear about "what happens if upstream does not take this patch?". When you view it from that angle the patch changelog that would have received an immediate Reviewed-by with no additional comment from me is something along the lines of: "The new cleanup.h facilities that arrived in v6.5-rc1 can replace the the usage of devm semantics in acpi_nfit_init_interleave_set(). That routine appears to only be using devm to avoid goto statements. The new __free() annotation at variable declaration time can achieve the same effect more efficiently. 
There is no end user visible side effects of this patch, I was motivated to send this cleanup to practice using the new helpers." As Linus mentions, subtlety is difficult to convey in mailing list interactions, and you may not have picked up on it, but the frustration for me began with the claim of a memory leak that turned out to be false. That was the prompt to consider "what other claims should I be careful about now that the fundamental justification for touching this old code has gone away." So if you want to try again with the justification of the patch limited to a cleanup, we can move forward.
Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()
Wilczynski, Michal wrote: > On 10/13/2023 6:38 PM, Dan Williams wrote: > > Michal Wilczynski wrote: > >> devm_*() family of functions purpose is managing memory attached to a > >> device. So in general it should only be used for allocations that should > >> last for the whole lifecycle of the device. > > No, this assertion is not accurate, if it were strictly true then > > devm_kfree() should be deleted. This patch is only a cleanup to switch > > the automatic cleanup pattern from devm to the new cleanup.h helpers. > > The memory in question is only used locally in a function, so there is no > reason > to use devm_*() family of functions. I think devm_kfree() is more for special > cases where the memory is meant to be used for the whole lifecycle of device, > but some special case occurs and it's not and it needs to be freed. > > This is an incorrect API usage. Would you propose to change all memory > allocations currently being done to devm_*() family simply because > devm_kfree() > exists ? Michal, please work with someone else to get these cleanups upstream, I am done with this thread.
Re: [PATCH v2] ACPI: NFIT: Fix local use of devm_*()
Michal Wilczynski wrote: > devm_*() family of functions purpose is managing memory attached to a > device. So in general it should only be used for allocations that should > last for the whole lifecycle of the device. No, this assertion is not accurate, if it were strictly true then devm_kfree() should be deleted. This patch is only a cleanup to switch the automatic cleanup pattern from devm to the new cleanup.h helpers. I am all for modernizing code over time, but patches that make assertions of "memory leaks" and "incorrect API usage" in code that has been untouched for almost a decade demand more scrutiny than what transpired here.
RE: [PATCH v1 1/2] ACPI: NFIT: Fix memory leak, and local use of devm_*()
Michal Wilczynski wrote: > devm_*() family of functions purpose is managing memory attached to a > device. So in general it should only be used for allocations that should > last for the whole lifecycle of the device. This is not the case for > acpi_nfit_init_interleave_set(). There are two allocations that are only > used locally in this function. What's more - if the function exits on > error path memory is never freed. It's still attached to dev and would > be freed on device detach, so this leak could be called a 'local leak'. This analysis is incorrect: devm cleans up on driver ->probe() failure in addition to ->remove(), and these error returns result in ->probe() failures. No leak, i.e. this is not a fix. The conversion to modern probe is ok if you want to resubmit that one without this intervening change.
RE: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
Vishal Verma wrote: > The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to > 'memblock_size' chunks of memory being added. Adding a larger span of > memory precludes memmap_on_memory semantics. > > For users of hotplug such as kmem, large amounts of memory might get > added from the CXL subsystem. In some cases, this amount may exceed the > available 'main memory' to store the memmap for the memory being added. > In this case, it is useful to have a way to place the memmap on the > memory being added, even if it means splitting the addition into > memblock-sized chunks. > > Change add_memory_resource() to loop over memblock-sized chunks of > memory if caller requested memmap_on_memory, and if other conditions for > it are met. Teach try_remove_memory() to also expect that a memory > range being removed might have been split up into memblock sized chunks, > and to loop through those as needed. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbrand > Signed-off-by: Vishal Verma > --- > mm/memory_hotplug.c | 162 > > 1 file changed, 99 insertions(+), 63 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index f8d3e7427e32..77ec6f15f943 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned > long size) > return arch_supports_memmap_on_memory(vmemmap_size); > } > > +static int add_memory_create_devices(int nid, struct memory_group *group, > + u64 start, u64 size, mhp_t mhp_flags) > +{ > + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; > + struct vmem_altmap mhp_altmap = { > + .base_pfn = PHYS_PFN(start), > + .end_pfn = PHYS_PFN(start + size - 1), > + }; > + int ret; > + > + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) { > + mhp_altmap.free = memory_block_memmap_on_memory_pages(); > + params.altmap = 
kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); > + if (!params.altmap) > + return -ENOMEM; > + > + memcpy(params.altmap, _altmap, sizeof(mhp_altmap)); Isn't this just open coded kmemdup()? Other than that, I am not seeing anything else to comment on, you can add: Reviewed-by: Dan Williams
RE: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory
Vishal Verma wrote: > Large amounts of memory managed by the kmem driver may come in via CXL, > and it is often desirable to have the memmap for this memory on the new > memory itself. > > Enroll kmem-managed memory for memmap_on_memory semantics if the dax > region originates via CXL. For non-CXL dax regions, retain the existing > default behavior of hot adding without memmap_on_memory semantics. > > Add a sysfs override under the dax device to control this behavior and > override either default. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Reviewed-by: Jonathan Cameron > Reviewed-by: David Hildenbrand > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.h | 1 + > drivers/dax/dax-private.h | 1 + > drivers/dax/bus.c | 38 ++ > drivers/dax/cxl.c | 1 + > drivers/dax/hmem/hmem.c | 1 + > drivers/dax/kmem.c| 8 +++- > drivers/dax/pmem.c| 1 + > 7 files changed, 50 insertions(+), 1 deletion(-) > > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h > index 1ccd23360124..cbbf64443098 100644 > --- a/drivers/dax/bus.h > +++ b/drivers/dax/bus.h > @@ -23,6 +23,7 @@ struct dev_dax_data { > struct dev_pagemap *pgmap; > resource_size_t size; > int id; > + bool memmap_on_memory; > }; > > struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data); > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h > index 27cf2d79..446617b73aea 100644 > --- a/drivers/dax/dax-private.h > +++ b/drivers/dax/dax-private.h > @@ -70,6 +70,7 @@ struct dev_dax { > struct ida ida; > struct device dev; > struct dev_pagemap *pgmap; > + bool memmap_on_memory; > int nr_range; > struct dev_dax_range { > unsigned long pgoff; > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 0ee96e6fc426..43be95a231c9 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct > device_attribute *attr, > 
.dax_region = dax_region, > .size = 0, > .id = -1, > + .memmap_on_memory = false, > }; > struct dev_dax *dev_dax = devm_create_dev_dax(); > > @@ -1269,6 +1270,40 @@ static ssize_t numa_node_show(struct device *dev, > } > static DEVICE_ATTR_RO(numa_node); > > +static ssize_t memmap_on_memory_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + > + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory); > +} > + > +static ssize_t memmap_on_memory_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t len) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + struct dax_region *dax_region = dev_dax->region; > + ssize_t rc; > + bool val; > + > + rc = kstrtobool(buf, ); > + if (rc) > + return rc; Perhaps: if (dev_dax->memmap_on_memory == val) return len; ...and skip the check below when it is going to be a nop > + > + device_lock(dax_region->dev); > + if (!dax_region->dev->driver) { Is the polarity backwards here? I.e. if the device is already attached to the kmem driver it is too late to modify memmap_on_memory policy. > + device_unlock(dax_region->dev); > + return -ENXIO; I would expect -EBUSY since disabling the device allows the property to be set and -ENXIO implies a more permanent error. > + } > + > + dev_dax->memmap_on_memory = val; > + > + device_unlock(dax_region->dev); > + return len; > +} > +static DEVICE_ATTR_RW(memmap_on_memory); This new attribute needs a new Documentation/ABI/ entry... in fact all of these attributes need Documentation/ entries. I can help with that base document to get things started. Perhaps split this sysfs ABI into its own patch and, depending on how fast we can pull the Documentation together, start with the region-driver-conveyed approach in the meantime.
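Putting the review comments together — early return when the value is unchanged, corrected polarity on the driver check, and -EBUSY while the device is bound — the store path might look roughly like this (an untested kernel-style sketch of the suggestions, not the final patch):

```c
static ssize_t memmap_on_memory_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t len)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);
	bool val;
	int rc;

	rc = kstrtobool(buf, &val);
	if (rc)
		return rc;

	/* nop: skip the locking and the driver check entirely */
	if (dev_dax->memmap_on_memory == val)
		return len;

	device_lock(dev);
	/* too late to change policy once the device is active (e.g. kmem) */
	if (dev->driver) {
		device_unlock(dev);
		return -EBUSY;
	}
	dev_dax->memmap_on_memory = val;
	device_unlock(dev);

	return len;
}
```

Note this locks the dev_dax device rather than the region, per the observation that the attribute does not touch region state.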
Re: [PATCH v1 9/9] ACPI: NFIT: Don't use KBUILD_MODNAME for driver name
Andy Shevchenko wrote: > On Mon, Sep 25, 2023 at 05:48:42PM +0300, Michal Wilczynski wrote: > > Driver name is part of the ABI, so it should be hard-coded, as ABI > > should be always kept backward compatible. Prevent ABI from changing > > accidentally in case KBUILD_MODNAME change. > > This is up to maintainers, probably we won't have any users outside of > existing > model (instantiating via ACPI ID). All the above is "strictly speaking"... ...right, more than 8 years for this "risk" to materialize indicates it's a non-issue to me.
RE: [PATCH] drivers: nvdimm: fix dereference after free
[ add Kajol ] Konstantin Meskhidze wrote: > 'nd_pmu->pmu.attr_groups' is dereferenced in function > 'nvdimm_pmu_free_hotplug_memory' call after it has been freed. Because in > function 'nvdimm_pmu_free_hotplug_memory' memory pointed by the fields of > 'nd_pmu->pmu.attr_groups' is deallocated it is necessary to call 'kfree' > after 'nvdimm_pmu_free_hotplug_memory'. Another one that would be fixed by static attribute groups. I do think we should move forward with these fixes as is for ease of backport, but long term this dynamically allocated attribute groups approach needs to be jettisoned. ...unless I am missing a concrete reason it needs to remain dynamic?
RE: [PATCH] drivers: nvdimm: fix memleak
[ add Kajol and Madhavan ] Konstantin Meskhidze wrote: > Memory pointed by 'nd_pmu->pmu.attr_groups' is allocated in function > 'register_nvdimm_pmu' and is lost after 'kfree(nd_pmu)' call in function > 'unregister_nvdimm_pmu'. Yes, looks like a real issue, but also completely avoidable by using statically defined groups. My fault for not catching this earlier, but Kajol, why is nd_perf not using statically defined sysfs attribute groups?
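For context on the statically defined alternative being suggested: sysfs attribute groups are usually declared as static const tables, so there is nothing to allocate at register time and nothing to kfree() at unregister time — the leak and use-after-free classes above simply cannot occur. An illustrative kernel-style sketch (the attribute names are hypothetical placeholders, not the actual nd_perf attributes):

```c
/* static tables: live for the module lifetime, nothing to kfree() */
static struct attribute *nvdimm_pmu_attrs[] = {
	&dev_attr_format.attr,	/* hypothetical attribute */
	&dev_attr_events.attr,	/* hypothetical attribute */
	NULL,
};

static const struct attribute_group nvdimm_pmu_attr_group = {
	.attrs = nvdimm_pmu_attrs,
};

static const struct attribute_group *nvdimm_pmu_attr_groups[] = {
	&nvdimm_pmu_attr_group,
	NULL,
};

/* at registration time, just point at the static table: */
/*	nd_pmu->pmu.attr_groups = nvdimm_pmu_attr_groups;	*/
```

With this shape, unregister_nvdimm_pmu() would only need to free nd_pmu itself.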
Re: [PATCH v5 09/10] acpi/nfit: Move handler installing logic to driver
Wilczynski, Michal wrote: > > > On 6/29/2023 10:54 PM, Dan Williams wrote: > > Michal Wilczynski wrote: > >> Currently logic for installing notifications from ACPI devices is > >> implemented using notify callback in struct acpi_driver. Preparations > >> are being made to replace acpi_driver with more generic struct > >> platform_driver, which doesn't contain notify callback. Furthermore > >> as of now handlers are being called indirectly through > >> acpi_notify_device(), which decreases performance. > >> > >> Call acpi_dev_install_notify_handler() at the end of .add() callback. > >> Call acpi_dev_remove_notify_handler() at the beginning of .remove() > >> callback. Change arguments passed to the notify function to match with > >> what's required by acpi_install_notify_handler(). Remove .notify > >> callback initialization in acpi_driver. > >> > >> Suggested-by: Rafael J. Wysocki > >> Signed-off-by: Michal Wilczynski > >> --- > >> drivers/acpi/nfit/core.c | 24 ++-- > >> 1 file changed, 18 insertions(+), 6 deletions(-) > >> > >> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > >> index 95930e9d776c..a281bdfee8a0 100644 > >> --- a/drivers/acpi/nfit/core.c > >> +++ b/drivers/acpi/nfit/core.c > >> @@ -3312,11 +3312,13 @@ void acpi_nfit_shutdown(void *data) > >> } > >> EXPORT_SYMBOL_GPL(acpi_nfit_shutdown); > >> > >> -static void acpi_nfit_notify(struct acpi_device *adev, u32 event) > >> +static void acpi_nfit_notify(acpi_handle handle, u32 event, void *data) > >> { > >> - device_lock(>dev); > >> - __acpi_nfit_notify(>dev, adev->handle, event); > >> - device_unlock(>dev); > >> + struct acpi_device *device = data; > >> + > >> + device_lock(>dev); > >> + __acpi_nfit_notify(>dev, handle, event); > >> + device_unlock(>dev); > >> } > >> > >> static int acpi_nfit_add(struct acpi_device *adev) > >> @@ -3375,12 +3377,23 @@ static int acpi_nfit_add(struct acpi_device *adev) > >> > >>if (rc) > >>return rc; > >> - return devm_add_action_or_reset(dev, 
acpi_nfit_shutdown, acpi_desc); > >> + > >> + rc = devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc); > >> + if (rc) > >> + return rc; > >> + > >> + return acpi_dev_install_notify_handler(adev, > >> + ACPI_DEVICE_NOTIFY, > >> + acpi_nfit_notify); > >> } > >> > >> static void acpi_nfit_remove(struct acpi_device *adev) > >> { > >>/* see acpi_nfit_unregister */ > >> + > >> + acpi_dev_remove_notify_handler(adev, > >> + ACPI_DEVICE_NOTIFY, > >> + acpi_nfit_notify); > > Please use devm to trigger this release rather than making > > acpi_nfit_remove() contain any logic. > > I think adding separate devm action to remove event handler is not > necessary. I'll put the removal in the beginning of acpi_nfit_shutdown() if > you > don't object. How do you plan to handle an acpi_dev_install_notify_handler() failure? acpi_nfit_shutdown() will need to have extra logic to know that it can skip acpi_dev_remove_notify_handler() in some cases and not others. Maybe it is ok to remove a handler that was never installed, but I would rather not go look that up. A devm callback for acpi_dev_remove_notify_handler() avoids that.
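The devm approach being asked for can be sketched as follows — install the handler, then register a devm action whose callback removes it, so teardown ordering and the install-failure case are both handled without extra state in acpi_nfit_shutdown() (an untested kernel-style sketch of the suggestion, not the merged code):

```c
static void acpi_nfit_remove_notify_handler(void *data)
{
	struct acpi_device *adev = data;

	acpi_dev_remove_notify_handler(adev, ACPI_DEVICE_NOTIFY,
				       acpi_nfit_notify);
}

/* tail of acpi_nfit_add(), after the acpi_nfit_shutdown action: */
	rc = acpi_dev_install_notify_handler(adev, ACPI_DEVICE_NOTIFY,
					     acpi_nfit_notify);
	if (rc)
		return rc;

	/*
	 * If registering the action fails, devm_add_action_or_reset()
	 * runs it immediately, i.e. the just-installed handler is
	 * removed; no path ever has to guess whether install succeeded.
	 */
	return devm_add_action_or_reset(dev, acpi_nfit_remove_notify_handler,
					adev);
```

With the ->add() path fully devm-instrumented this way, acpi_nfit_remove() can be deleted outright, as suggested below.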
RE: [PATCH v5 09/10] acpi/nfit: Move handler installing logic to driver
Michal Wilczynski wrote: > Currently logic for installing notifications from ACPI devices is > implemented using notify callback in struct acpi_driver. Preparations > are being made to replace acpi_driver with more generic struct > platform_driver, which doesn't contain notify callback. Furthermore > as of now handlers are being called indirectly through > acpi_notify_device(), which decreases performance. > > Call acpi_dev_install_notify_handler() at the end of .add() callback. > Call acpi_dev_remove_notify_handler() at the beginning of .remove() > callback. Change arguments passed to the notify function to match with > what's required by acpi_install_notify_handler(). Remove .notify > callback initialization in acpi_driver. > > Suggested-by: Rafael J. Wysocki > Signed-off-by: Michal Wilczynski > --- > drivers/acpi/nfit/core.c | 24 ++-- > 1 file changed, 18 insertions(+), 6 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 95930e9d776c..a281bdfee8a0 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -3312,11 +3312,13 @@ void acpi_nfit_shutdown(void *data) > } > EXPORT_SYMBOL_GPL(acpi_nfit_shutdown); > > -static void acpi_nfit_notify(struct acpi_device *adev, u32 event) > +static void acpi_nfit_notify(acpi_handle handle, u32 event, void *data) > { > - device_lock(>dev); > - __acpi_nfit_notify(>dev, adev->handle, event); > - device_unlock(>dev); > + struct acpi_device *device = data; > + > + device_lock(>dev); > + __acpi_nfit_notify(>dev, handle, event); > + device_unlock(>dev); > } > > static int acpi_nfit_add(struct acpi_device *adev) > @@ -3375,12 +3377,23 @@ static int acpi_nfit_add(struct acpi_device *adev) > > if (rc) > return rc; > - return devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc); > + > + rc = devm_add_action_or_reset(dev, acpi_nfit_shutdown, acpi_desc); > + if (rc) > + return rc; > + > + return acpi_dev_install_notify_handler(adev, > +ACPI_DEVICE_NOTIFY, > 
+acpi_nfit_notify); > } > > static void acpi_nfit_remove(struct acpi_device *adev) > { > /* see acpi_nfit_unregister */ > + > + acpi_dev_remove_notify_handler(adev, > +ACPI_DEVICE_NOTIFY, > +acpi_nfit_notify); Please use devm to trigger this release rather than making acpi_nfit_remove() contain any logic. An additional cleanup opportunity with the ->add() path fully devm instrumented would be to just delete acpi_nfit_remove() since it is optional and serves no purpose.
RE: [PATCH v5 08/10] acpi/nfit: Improve terminator line in acpi_nfit_ids
Michal Wilczynski wrote: > Currently the terminator line contains redundant characters. Remove them and > also remove a comma at the end. > > Signed-off-by: Michal Wilczynski > --- > drivers/acpi/nfit/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index aff79cbc2190..95930e9d776c 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -3455,7 +3455,7 @@ EXPORT_SYMBOL_GPL(__acpi_nfit_notify); > > static const struct acpi_device_id acpi_nfit_ids[] = { > { "ACPI0012", 0 }, > - { "", 0 }, > + {} Looks like a pointless change to me.
Re: [v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Markus Elfring wrote: > >> Would you insist on the usage of cover letters also for single patches? > > > > I would neither insist on it, nor prohibit it. > > It seems that you can tolerate an extra message here. > > > > It simply does not make enough difference. > > Can it occasionally be a bit nicer to receive all relevant information within > a single patch > instead of a combination of two messages? No, I am the maintainer of this code and everything I needed to judge this patch was provided. This cover letter only included inter-version details and anything relevant for the kernel history is included in the patch itself. For any code I maintain inter-version details below the --- line or in a 0/1 cover letter are perfectly acceptable. Please, if the review feedback is arbitrary, as it is in this case, strongly consider not offering it.
RE: [PATCH v5 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Jane Chu wrote: > When multiple processes mmap() a dax file, then at some point, > a process issues a 'load' and consumes a hwpoison, the process > receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb > set for the poison scope. Soon after, any other process issues > a 'load' to the poisoned page (that is unmapped from the kernel > side by memory_failure), it receives a SIGBUS with > si_code = BUS_ADRERR and without valid si_lsb. > > This is confusing to user, and is different from page fault due > to poison in RAM memory, also some helpful information is lost. > > Channel dax backend driver's poison detection to the filesystem > such that instead of reporting VM_FAULT_SIGBUS, it could report > VM_FAULT_HWPOISON. > > If user level block IO syscalls fail due to poison, the errno will > be converted to EIO to maintain block API consistency. > > Signed-off-by: Jane Chu Reviewed-by: Dan Williams
RE: [PATCH v4 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Jane Chu wrote: > When multiple processes mmap() a dax file, then at some point, > a process issues a 'load' and consumes a hwpoison, the process > receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb > set for the poison scope. Soon after, any other process issues > a 'load' to the poisoned page (that is unmapped from the kernel > side by memory_failure), it receives a SIGBUS with > si_code = BUS_ADRERR and without valid si_lsb. > > This is confusing to user, and is different from page fault due > to poison in RAM memory, also some helpful information is lost. > > Channel dax backend driver's poison detection to the filesystem > such that instead of reporting VM_FAULT_SIGBUS, it could report > VM_FAULT_HWPOISON. > > If user level block IO syscalls fail due to poison, the errno will > be converted to EIO to maintain block API consistency. > > Signed-off-by: Jane Chu > --- > drivers/dax/super.c | 5 - > drivers/nvdimm/pmem.c| 2 +- > drivers/s390/block/dcssblk.c | 3 ++- > fs/dax.c | 11 ++- > fs/fuse/virtio_fs.c | 3 ++- > include/linux/dax.h | 5 + > include/linux/mm.h | 2 ++ > 7 files changed, 22 insertions(+), 9 deletions(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index c4c4728a36e4..0da9232ea175 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -203,6 +203,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, > pgoff_t pgoff, void *addr, > int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, > size_t nr_pages) > { > + int ret; > + > if (!dax_alive(dax_dev)) > return -ENXIO; > /* > @@ -213,7 +215,8 @@ int dax_zero_page_range(struct dax_device *dax_dev, > pgoff_t pgoff, > if (nr_pages != 1) > return -EIO; > > - return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages); > + ret = dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages); > + return dax_mem2blk_err(ret); > } > EXPORT_SYMBOL_GPL(dax_zero_page_range); > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 
ceea55f621cc..46e094e56159 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device > *pmem, pgoff_t pgoff, > long actual_nr; > > if (mode != DAX_RECOVERY_WRITE) > - return -EIO; > + return -EHWPOISON; > > /* >* Set the recovery stride is set to kernel page size because > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c > index c09f2e053bf8..ee47ac520cd4 100644 > --- a/drivers/s390/block/dcssblk.c > +++ b/drivers/s390/block/dcssblk.c > @@ -54,7 +54,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device > *dax_dev, > rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS, > , NULL); > if (rc < 0) > - return rc; > + return dax_mem2blk_err(rc); > + > memset(kaddr, 0, nr_pages << PAGE_SHIFT); > dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT); > return 0; > diff --git a/fs/dax.c b/fs/dax.c > index 2ababb89918d..a26eb5abfdc0 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1148,7 +1148,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t > length, size_t align_size, > if (!zero_edge) { > ret = dax_iomap_direct_access(srcmap, pos, size, , NULL); > if (ret) > - return ret; > + return dax_mem2blk_err(ret); > } > > if (copy_all) { > @@ -1310,7 +1310,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter) > > out_unlock: > dax_read_unlock(id); > - return ret; > + return dax_mem2blk_err(ret); > } > > int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len, > @@ -1342,7 +1342,8 @@ static int dax_memzero(struct iomap_iter *iter, loff_t > pos, size_t size) > ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, , > NULL); > if (ret < 0) > - return ret; > + return dax_mem2blk_err(ret); > + > memset(kaddr + offset, 0, size); > if (iomap->flags & IOMAP_F_SHARED) > ret = dax_iomap_copy_around(pos, size, PAGE_SIZE, srcmap, > @@ -1498,7 +1499,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter > *iomi, > > map_len = 
dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), > DAX_ACCESS, , NULL); > - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) { > + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) { > map_len = dax_direct_access(dax_dev, pgoff, > PHYS_PFN(size), DAX_RECOVERY_WRITE, > , NULL); >
Re: [PATCH] dax/hmem: Fix refcount leak in dax_hmem_probe()
Paul Cassella wrote: > On Fri, 2 Jun 2023, Ira Weiny wrote: > > Paul Cassella wrote: > > > On Sat, 3 Dec 2022, Ira Weiny wrote: > > > > On Sat, Dec 03, 2022 at 09:58:58AM +, Yongqiang Liu wrote: > > > > > > We should always call dax_region_put() whenever devm_create_dev_dax() > > > > > succeeds or fails to avoid refcount leak of dax_region. Move the return > > > > > value check after dax_region_put(). > > > > > I think dax_region_put is called from dax_region_unregister() > > > > automatically on > > > > tear down. > > > > Note the reference dax_region_unregister() will be putting is the one > > > devm_create_dev_dax() takes by kref_get(_region->kref). I think > > > dax_hmem_probe() needs to put its reference in the error case, as in the > > > successful case. > > > Looking at this again I'm inclined to agree that something is wrong. But > > I'm not sure this patch fixes anything. > > > When you say: > > > > > ... I think > > > dax_hmem_probe() needs to put its reference in the error case, as in the > > > successful case. > > > > ... which kref_get() is dax_hmem_probe() letting go? > > Isn't it letting go of the initial kref_init() reference from > alloc_dax_region()? > > Sorry, I had gone through the code a little carelessly yesterday. Now I > think that kref_init() reference is the one that dax_hmem_probe() is > dropping in the success case, and which still needs to be dropped in the > error case. > > (If so, I think the alloc_dax_region() path that returns NULL on > devm_add_action_or_reset() failure, releasing the kref_get reference, will > leak the kref_init reference and the region.) Yes, *this* one looks valid. All the other patches in this thread had me asking, "um, did you try it?". Now, that said, the games that this init path is playing are hard to follow and it should not be the case that alloc_dax_region() needs to hold a reference on behalf of some future code path that might need it.
Let's make this clearer by moving the reference count management closer to the object that needs it, dax_region->ida, where the extra references were being taken on behalf of "static" regions just in case they needed to be referenced in dev_dax_release(). I also found a dax_mapping lifetime issue while testing this, so I appreciate the focus here.
RE: [PATCH] dax: fix missing-prototype warnings
Arnd Bergmann wrote: > From: Arnd Bergmann > > dev_dax_probe declaration for this function was removed with the only > caller outside of device.c. Mark it static to avoid a W=1 > warning: > drivers/dax/device.c:399:5: error: no previous prototype for 'dev_dax_probe' > > Similarly, run_dax() causes a warning, but this one is because the > declaration needs to be included: > > drivers/dax/super.c:337:6: error: no previous prototype for 'run_dax' > > Fixes: 83762cb5c7c4 ("dax: Kill DEV_DAX_PMEM_COMPAT") > Signed-off-by: Arnd Bergmann > --- > drivers/dax/device.c | 3 +-- > drivers/dax/super.c | 1 + > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/dax/device.c b/drivers/dax/device.c > index af9930c03c9c..30665a3ff6ea 100644 > --- a/drivers/dax/device.c > +++ b/drivers/dax/device.c > @@ -396,7 +396,7 @@ static void dev_dax_kill(void *dev_dax) > kill_dev_dax(dev_dax); > } > > -int dev_dax_probe(struct dev_dax *dev_dax) > +static int dev_dax_probe(struct dev_dax *dev_dax) > { > struct dax_device *dax_dev = dev_dax->dax_dev; > struct device *dev = _dax->dev; > @@ -471,7 +471,6 @@ int dev_dax_probe(struct dev_dax *dev_dax) > run_dax(dax_dev); > return devm_add_action_or_reset(dev, dev_dax_kill, dev_dax); > } > -EXPORT_SYMBOL_GPL(dev_dax_probe); > > static struct dax_device_driver device_dax_driver = { > .probe = dev_dax_probe, This hunk looks good. > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index c4c4728a36e4..8c05dae19bfe 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -14,6 +14,7 @@ > #include > #include > #include "dax-private.h" > +#include "bus.h" This one came up before, and IIRC last time I said lets just move the run_dax() prototype to dax-private.h. https://lore.kernel.org/all/YiqF1a9VNiSWI5j0@iweiny-desk3 I can fix that up when applying.
RE: [PATCH v3] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Jane Chu wrote: > When multiple processes mmap() a dax file, then at some point, > a process issues a 'load' and consumes a hwpoison, the process > receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb > set for the poison scope. Soon after, any other process issues > a 'load' to the poisoned page (that is unmapped from the kernel > side by memory_failure), it receives a SIGBUS with > si_code = BUS_ADRERR and without valid si_lsb. > > This is confusing to user, and is different from page fault due > to poison in RAM memory, also some helpful information is lost. > > Channel dax backend driver's poison detection to the filesystem > such that instead of reporting VM_FAULT_SIGBUS, it could report > VM_FAULT_HWPOISON. I do think it is interesting that this will start returning SIGBUS with BUS_MCEERR_AR for stores whereas it is only signalled for loads in the direct consumption path, but I can't think of a scenario where that would confuse existing software. > Change from v2: > Convert -EHWPOISON to -EIO to prevent EHWPOISON errno from leaking > out to block read(2) - suggested by Matthew. For next time these kind of changelog notes belong... > Signed-off-by: Jane Chu > --- ...here after the "---". 
> drivers/nvdimm/pmem.c | 2 +- > fs/dax.c | 4 ++-- > include/linux/mm.h| 2 ++ > 3 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index ceea55f621cc..46e094e56159 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device > *pmem, pgoff_t pgoff, > long actual_nr; > > if (mode != DAX_RECOVERY_WRITE) > - return -EIO; > + return -EHWPOISON; > > /* >* Set the recovery stride is set to kernel page size because > diff --git a/fs/dax.c b/fs/dax.c > index 2ababb89918d..18f9598951f1 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1498,7 +1498,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter > *iomi, > > map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), > DAX_ACCESS, , NULL); > - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) { > + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) { > map_len = dax_direct_access(dax_dev, pgoff, > PHYS_PFN(size), DAX_RECOVERY_WRITE, > , NULL); > @@ -1506,7 +1506,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter > *iomi, > recovery = true; > } > if (map_len < 0) { > - ret = map_len; > + ret = (map_len == -EHWPOISON) ? -EIO : map_len; This fixup leaves out several other places where EHWPOISON could leak as the errno for read(2)/write(2). 
I think I want to see something like this:

diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..ec17f9977bcb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,14 +1077,13 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
-		size_t size, void **kaddr, pfn_t *pfnp)
+static int __dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+				     size_t size, void **kaddr, pfn_t *pfnp)
 {
 	pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-	int id, rc = 0;
 	long length;
+	int rc = 0;
 
-	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
 			DAX_ACCESS, kaddr, pfnp);
 	if (length < 0) {
@@ -1113,6 +1112,36 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 	return rc;
 }
 
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+		size_t size, void **kaddr, pfn_t *pfnp)
+{
+	int id, rc;
+
+	id = dax_read_lock();
+	rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+	dax_read_unlock(id);
+
+	/* don't leak a memory access error code to I/O syscalls */
+	if (rc == -EHWPOISON)
+		return -EIO;
+	return rc;
+}
+
+static int dax_fault_direct_access(const struct iomap *iomap, loff_t pos,
+		size_t size, void **kaddr, pfn_t *pfnp)
+{
+	int id, rc;
+
+	id = dax_read_lock();
+	rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+	dax_read_unlock(id);
+
+	/* -EHWPOISON return ok */
+	return rc;
+}
+
 /**
  * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page
  * by copying the data before and after the range to be written.
@@ -1682,7 +1711,7 @@ static
Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Matthew Wilcox wrote: > On Thu, Apr 27, 2023 at 06:35:57PM -0700, Dan Williams wrote: > > Jane Chu wrote: > > > Hi, Dan, > > > > > > On 4/27/2023 2:36 PM, Dan Williams wrote: > > > > Jane Chu wrote: > > > >> When dax fault handler fails to provision the fault page due to > > > >> hwpoison, it returns VM_FAULT_SIGBUS which lead to a sigbus delivered > > > >> to userspace with .si_code BUS_ADRERR. Channel dax backend driver's > > > >> detection on hwpoison to the filesystem to provide the precise reason > > > >> for the fault. > > > > > > > > It's not yet clear to me by this description why this is an improvement > > > > or will not cause other confusion. In this case the reason for the > > > > SIGBUS is because the driver wants to prevent access to poison, not that > > > > the CPU consumed poison. Can you clarify what is lost by *not* making > > > > this change? > > > > > > Elsewhere when hwpoison is detected by page fault handler and helpers as > > > the direct cause to failure, VM_FAULT_HWPOISON or > > > VM_FAULT_HWPOISON_LARGE is flagged to ensure accurate SIGBUS payload is > > > produced, such as wp_page_copy() in COW case, do_swap_page() from > > > handle_pte_fault(), hugetlb_fault() in hugetlb page fault case where the > > > huge fault size would be indicated in the payload. > > > > > > But dax fault has been an exception in that the SIGBUS payload does not > > > indicate poison, nor fault size. I don't see why it should be though, > > > recall an internal user expressing confusion regarding the different > > > SIGBUS payloads. > > > > ...but again this the typical behavior with block devices. If a block > > device has badblock that causes page cache page not to be populated > > that's a SIGBUS without hwpoison information. If the page cache is > > properly populated and then the CPU consumes poison that's a SIGBUS with > > the additional hwpoison information. > > I'm not sure that's true when we mmap(). 
> Yes, it's not consistent with -EIO from read(), but we have additional information here, and it's worth > providing it. You can think of it as *in this instance*, the error is > found "in the page cache", because that's effectively where the error > is from the point of view of the application? It's true there is additional information, and applications mostly cannot tell the difference between fault on failure to populate and fault on access after populate. So while it is inconsistent with what happens for the typical page cache, you're right there's no downside to conveying the extra information here.
Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Jane Chu wrote: > Hi, Dan, > > On 4/27/2023 2:36 PM, Dan Williams wrote: > > Jane Chu wrote: > >> When dax fault handler fails to provision the fault page due to > >> hwpoison, it returns VM_FAULT_SIGBUS which lead to a sigbus delivered > >> to userspace with .si_code BUS_ADRERR. Channel dax backend driver's > >> detection on hwpoison to the filesystem to provide the precise reason > >> for the fault. > > > > It's not yet clear to me by this description why this is an improvement > > or will not cause other confusion. In this case the reason for the > > SIGBUS is because the driver wants to prevent access to poison, not that > > the CPU consumed poison. Can you clarify what is lost by *not* making > > this change? > > Elsewhere when hwpoison is detected by page fault handler and helpers as > the direct cause to failure, VM_FAULT_HWPOISON or > VM_FAULT_HWPOISON_LARGE is flagged to ensure accurate SIGBUS payload is > produced, such as wp_page_copy() in COW case, do_swap_page() from > handle_pte_fault(), hugetlb_fault() in hugetlb page fault case where the > huge fault size would be indicated in the payload. > > But dax fault has been an exception in that the SIGBUS payload does not > indicate poison, nor fault size. I don't see why it should be though, > recall an internal user expressing confusion regarding the different > SIGBUS payloads. ...but again this is the typical behavior with block devices. If a block device has a badblock that causes a page cache page not to be populated, that's a SIGBUS without hwpoison information. If the page cache is properly populated and then the CPU consumes poison that's a SIGBUS with the additional hwpoison information. Applications should have a consistent error response regardless of pmem or dax.
RE: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON
Jane Chu wrote: > When dax fault handler fails to provision the fault page due to > hwpoison, it returns VM_FAULT_SIGBUS which lead to a sigbus delivered > to userspace with .si_code BUS_ADRERR. Channel dax backend driver's > detection on hwpoison to the filesystem to provide the precise reason > for the fault. It's not yet clear to me by this description why this is an improvement or will not cause other confusion. In this case the reason for the SIGBUS is because the driver wants to prevent access to poison, not that the CPU consumed poison. Can you clarify what is lost by *not* making this change? > > Signed-off-by: Jane Chu > --- > drivers/nvdimm/pmem.c | 2 +- > fs/dax.c | 2 +- > include/linux/mm.h| 2 ++ > 3 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index ceea55f621cc..46e094e56159 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device > *pmem, pgoff_t pgoff, > long actual_nr; > > if (mode != DAX_RECOVERY_WRITE) > - return -EIO; > + return -EHWPOISON; > > /* >* Set the recovery stride is set to kernel page size because > diff --git a/fs/dax.c b/fs/dax.c > index 3e457a16c7d1..c93191cd4802 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter > *iomi, > > map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), > DAX_ACCESS, , NULL); > - if (map_len == -EIO && iov_iter_rw(iter) == WRITE) { > + if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) { > map_len = dax_direct_access(dax_dev, pgoff, > PHYS_PFN(size), DAX_RECOVERY_WRITE, > , NULL); This change results in EHWPOISON leaking to userspace in the case of read(2), that's not a return code that block I/O applications have ever had to contend with before. Just as badblocks cause EIO to be returned, so should poisoned cachelines for pmem. 
> diff --git a/include/linux/mm.h b/include/linux/mm.h > index 1f79667824eb..e4c974587659 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3217,6 +3217,8 @@ static inline vm_fault_t vmf_error(int err) > { > if (err == -ENOMEM) > return VM_FAULT_OOM; > + else if (err == -EHWPOISON) > + return VM_FAULT_HWPOISON; > return VM_FAULT_SIGBUS; > } > > -- > 2.18.4 > >
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
lizhij...@fujitsu.com wrote: [..] > > Now I do think it would be a good idea to fail device_add() if the bus > > is not registered, > > BTW, below line 369: device_add() didn't fail in practical. > I think that's ok because the device gets added, but never probed due to this part of that commit I referenced: @@ -503,20 +517,21 @@ int bus_add_device(struct device *dev) */ void bus_probe_device(struct device *dev) { - struct bus_type *bus = dev->bus; + struct subsys_private *sp = bus_to_subsys(dev->bus); struct subsys_interface *sif; - if (!bus) + if (!sp) return; ...so it does what you want which is disable the libnvdimm subsystem from binding to any devices without crashing.
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
lizhij...@fujitsu.com wrote: [..] > >> Dan, > >> > >> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra > >> kernel booting parameter > >> 'initcall_blacklist=libnvdimm_init'. Then kernel panic! > > > > That's expected though, > > Do you mean we just keep it as it is. Yes. > > > > you can't block libnvdimm_init and then expect > > modules that link to libnvdimm to work. > Ah, we would rather see it *unable to work* than panic, isn't it. That part is true, but consider the implications of adding error handling to all code that can no longer depend on initcall ordering, not just libnvdimm. This would be a large paradigm shift. Now I do think it would be a good idea to fail device_add() if the bus is not registered, but I *think* that happens now as a result of: 5221b82d46f2 driver core: bus: bus_add/probe/remove_device() cleanups ...can you double check if you have that commit in your tests?
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
lizhij...@fujitsu.com wrote: > > > On 16/03/2023 23:54, Dan Williams wrote: > > Li Zhijian wrote: > >> nvdimm_bus_register() could be called from other modules, such as nfit, > >> but it can only be called after the nvdimm_bus_type is registered. > >> > >> BUG: kernel NULL pointer dereference, address: 0098 > >> #PF: supervisor read access in kernel mode > >> #PF: error_code(0x) - not-present page > >> PGD 0 P4D 0 > >> Oops: [#1] PREEMPT SMP PTI > >> CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > >> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > >> RIP: 0010:bus_add_device+0x58/0x150 > >> Call Trace: > >> > >>device_add+0x3ac/0x980 > >>nvdimm_bus_register+0x16d/0x1d0 > >>acpi_nfit_init+0xb72/0x1f90 [nfit] > >>acpi_nfit_add+0x1d5/0x200 [nfit] > >>acpi_device_probe+0x45/0x160 > > > > Can you explain a bit more how to hit this crash? This has not been a > > problem historically and the explanation above makes it sound like this > > is a theoretical issue. > > > > Dan, > > Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra > kernel booting parameter > 'initcall_blacklist=libnvdimm_init'. Then kernel panic! That's expected though, you can't block libnvdimm_init and then expect modules that link to libnvdimm to work. You would also need to block all modules / initcalls that depend on libnvdimm_init having run. I'll respond to the other thread with some ideas.
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
lizhij...@fujitsu.com wrote: > > > On 16/03/2023 23:54, Dan Williams wrote: > > Li Zhijian wrote: > >> nvdimm_bus_register() could be called from other modules, such as nfit, > >> but it can only be called after the nvdimm_bus_type is registered. > >> > >> BUG: kernel NULL pointer dereference, address: 0098 > >> #PF: supervisor read access in kernel mode > >> #PF: error_code(0x) - not-present page > >> PGD 0 P4D 0 > >> Oops: [#1] PREEMPT SMP PTI > >> CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > >> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > >> RIP: 0010:bus_add_device+0x58/0x150 > >> Call Trace: > >> > >>device_add+0x3ac/0x980 > >>nvdimm_bus_register+0x16d/0x1d0 > >>acpi_nfit_init+0xb72/0x1f90 [nfit] > >>acpi_nfit_add+0x1d5/0x200 [nfit] > >>acpi_device_probe+0x45/0x160 > > > > Can you explain a bit more how to hit this crash? This has not been a > > problem historically and the explanation above makes it sound like this > > is a theoretical issue. > > > > Dan, > > Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra > kernel booting parameter > 'initcall_blacklist=libnvdimm_init'. Then kernel panic! > Theoretically, it will also happen if nvdimm_bus_register() failed. > > > For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata > in pmem will not be updated again in kdump kernel > [1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/ Ah, great write up! Let me give that some thought. Apologies for missing it earlier. This would have been a good use for: Link: https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com ...in the above changelog.
RE: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
Li Zhijian wrote: > nvdimm_bus_register() could be called from other modules, such as nfit, > but it can only be called after the nvdimm_bus_type is registered. > > BUG: kernel NULL pointer dereference, address: 0098 > #PF: supervisor read access in kernel mode > #PF: error_code(0x) - not-present page > PGD 0 P4D 0 > Oops: [#1] PREEMPT SMP PTI > CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > RIP: 0010:bus_add_device+0x58/0x150 > Call Trace: > > device_add+0x3ac/0x980 > nvdimm_bus_register+0x16d/0x1d0 > acpi_nfit_init+0xb72/0x1f90 [nfit] > acpi_nfit_add+0x1d5/0x200 [nfit] > acpi_device_probe+0x45/0x160 Can you explain a bit more how to hit this crash? This has not been a problem historically and the explanation above makes it sound like this is a theoretical issue. libnvdimm_init() *should* be done before the nfit driver can attempt nvdimm_bus_register(). So, something else is broken if nvdimm_bus_register() can be called before libnvdimm_init(), or after libnvdimm_exit().
[GIT PULL] Compute Express Link (CXL) for 6.3
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.3 ...to receive the CXL update for v6.3. To date Linux has been dependent on platform-firmware to map CXL RAM regions and handle events / errors from devices. With this update Linux can now parse / update the CXL memory layout, and report events / errors from devices. This is a precursor for the CXL subsystem to handle the end-to-end "RAS" flow for CXL memory. I.e. the flow that for DDR-attached-DRAM is handled by the EDAC driver where it maps system physical address events to a field-replaceable-unit (FRU / endpoint device). In general, CXL has the potential to standardize what has historically been a pile of memory-controller-specific error handling logic. Another change of note is the default policy for handling RAM-backed device-dax instances. Previously the default access mode was "device", mmap(2) a device special file to access memory. The new default is "kmem" where the address range is assigned to the core-mm via add_memory_driver_managed(). This saves typical users from wondering why their platform memory is not visible via free(1) and stuck behind a device-file. At the same time it allows expert users to deploy policy to, for example, get dedicated access to high performance memory, or hide low performance memory from general purpose kernel allocations. This affects not only CXL, but also systems with high-bandwidth-memory that platform-firmware tags with the EFI_MEMORY_SP (special purpose) designation. All of these topics have had exposure in linux-next. Where you see multiple merges of the same topic branch it is for follow-on fixes and updates discovered after the branch was cut for -next. Each merge has a commit log. 
A few tags came in after the branch was cut as well: f57aec443c24 cxl/pmem: Fix nvdimm registration races Tested-by: Dave Jiang Reviewed-by: Dave Jiang e686c32590f4 dax/kmem: Fix leak of memory-hotplug resources Reviewed-by: Alistair Popple After the fixups from Arnd, there have not been any other problem reports from -next exposure. --- The following changes since commit 711442e29f16f0d39dd0e2460c9baacfccb9d5a7: cxl/region: Fix passthrough-decoder detection (2023-02-07 11:04:30 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.3 for you to fetch changes up to e686c32590f40bffc45f105c04c836ffad3e531a: dax/kmem: Fix leak of memory-hotplug resources (2023-02-17 14:58:01 -0800) cxl for v6.3 - CXL RAM region enumeration: instantiate 'struct cxl_region' objects for platform firmware created memory regions - CXL RAM region provisioning: complement the existing PMEM region creation support with RAM region support - "Soft Reservation" policy change: Online (memory hot-add) soft-reserved memory (EFI_MEMORY_SP) by default, but still allow for setting aside such memory for dedicated access via device-dax. - CXL Events and Interrupts: Take over CXL event handling from platform-firmware (ACPI calls this CXL Memory Error Reporting) and export CXL Events via Linux Trace Events. - Convey CXL _OSC results to drivers: Similar to PCI, let the CXL subsystem interrogate the result of CXL _OSC negotiation. - Emulate CXL DVSEC Range Registers as "decoders": Allow for first-generation devices that pre-date the definition of the CXL HDM Decoder Capability to translate the CXL DVSEC Range Registers into 'struct cxl_decoder' objects. - Set timestamp: Per spec, set the device timestamp in case of hotplug, or if platform-firmware failed to set it. - General fixups: linux-next build issues, non-urgent fixes for pre-production hardware, unit test fixes, spelling and debug message improvements. 
Alison Schofield (2): tools/testing/cxl: Remove cxl_test module math loading message cxl/mem: Add kdoc param for event log driver state Arnd Bergmann (3): cxl: avoid returning uninitialized error code dax: cxl: add CXL_REGION dependency dax/hmem: build hmem device support as module if possible Dan Williams (35): cxl/pci: Move tracepoint definitions to drivers/cxl/core/ tools/testing/cxl: Prevent cxl_test from confusing production modules cxl/region: Clarify when a cxld->commit() callback is mandatory cxl/port: Link the 'parent_dport' in portX/ and endpointX/ sysfs cxl/pci: Fix irq oneshot expectations Merge branch 'for-6.3/cxl' into cxl/next Merge branch 'for-6.3/cxl-events' into cxl/next cxl/memdev: Fix endpoint port removal cxl/Documentation: Update references to attributes added in v6.0 cxl/region: Add a mode attribute for regions cxl/region: Support empty uuids for non-pmem regions cxl/r
RE: [PATCH] dax: cxl: add CXL_REGION dependency
Arnd Bergmann wrote: > From: Arnd Bergmann > > There is already a dependency on CXL_REGION, which depends on CXL_BUS, > but since CXL_REGION is a 'bool' symbol, it's possible to configure > DAX as built-in even though CXL itself is a loadable module: > > x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_probe': > cxl.c:(.text+0xb): undefined reference to `to_cxl_dax_region' > x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_driver_init': > cxl.c:(.init.text+0x10): undefined reference to `__cxl_driver_register' > x86_64-linux-ld: drivers/dax/cxl.o: in function `cxl_dax_region_driver_exit': > cxl.c:(.exit.text+0x9): undefined reference to `cxl_driver_unregister' > > Prevent this with another depndency on the tristate symbol. > > Fixes: 09d09e04d2fc ("cxl/dax: Create dax devices for CXL RAM regions") > Signed-off-by: Arnd Bergmann > --- Looks good, thanks Arnd.
[GIT PULL] Compute Express Link (CXL) fixes for 6.2-final
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-6.2 ...to receive 2 fixups for CXL in presence of passthrough decoders. This primarily helps developers using the QEMU CXL emulation, but with the impending arrival of CXL switches these types of topologies will be of interest to end users. The first commit has appeared in -next for several days with no issues. The second has not aged as long, but has been verified by new unit tests. --- The following changes since commit 6d796c50f84ca79f1722bb131799e5a5710c4700: Linux 6.2-rc6 (2023-01-29 13:59:43 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-6.2 for you to fetch changes up to 711442e29f16f0d39dd0e2460c9baacfccb9d5a7: cxl/region: Fix passthrough-decoder detection (2023-02-07 11:04:30 -0800) cxl fixes for 6.2 - Fix a crash when shutting down regions in the presence of passthrough decoders - Fix region creation to understand passthrough decoders instead of the narrower definition of passthrough ports Dan Williams (1): cxl/region: Fix passthrough-decoder detection Fan Ni (1): cxl/region: Fix null pointer dereference for resetting decoder drivers/cxl/core/region.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-)
[GIT PULL] NVDIMM and DAX fixes for 6.2-final
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-6.2 ...to receive a fix for an issue that could cause users to inadvertently reserve too much capacity when debugging KMSAN with a persistent memory namespace, a lockdep fix, and a fix for a kernel-doc build warning. This has all appeared in -next with no reported issues. --- The following changes since commit 88603b6dc419445847923fcb7fe5080067a30f98: Linux 6.2-rc2 (2023-01-01 13:53:16 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-6.2 for you to fetch changes up to c91d713630848460de8669e6570307b7e559863b: nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE (2023-01-28 15:32:36 -0800) libnvdimm fixes for 6.2 - Resolve the conflict between KMSAN and NVDIMM with respect to reserving pmem namespace / volume capacity for larger sizeof(struct page) - Fix a lockdep warning in the NFIT code - Fix a kernel-doc build warning ---- Dan Williams (1): nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE Randy Dunlap (1): dax: super.c: fix kernel-doc bad line warning Vishal Verma (1): ACPI: NFIT: fix a potential deadlock during NFIT teardown drivers/acpi/nfit/core.c | 2 +- drivers/dax/super.c | 2 +- drivers/nvdimm/Kconfig| 19 +++ drivers/nvdimm/nd.h | 2 +- drivers/nvdimm/pfn_devs.c | 42 +++--- 5 files changed, 49 insertions(+), 18 deletions(-)
[GIT PULL] Compute Express Link (CXL) fixes for 6.2-rc6
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-for-6.2-rc6 ...to receive a couple fixes for bugs introduced during the merge window. One is a regression, the other was a bug in the CXL AER handler. --- The following changes since commit b7bfaa761d760e72a969d116517eaa12e404c262: Linux 6.2-rc3 (2023-01-08 11:49:43 -0600) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-fixes-for-6.2-rc6 for you to fetch changes up to 19398821b25a9cde564265262e680ae1c2351be7: cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent (2023-01-25 15:35:26 -0800) cxl fixes for 6.2-rc6 - Fix a crash regression due to module load order of cxl_pmem.ko - Fix wrong register offset read in CXL AER handling path Dan Williams (1): cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent Dave Jiang (1): cxl: fix cxl_report_and_clear() RAS UE addr mis-assignment drivers/cxl/acpi.c | 1 - drivers/cxl/core/pmem.c | 42 -- drivers/cxl/pci.c | 7 +-- drivers/cxl/pmem.c | 24 4 files changed, 33 insertions(+), 41 deletions(-)
RE: [PATCH] dax: super.c: fix kernel-doc bad line warning
Randy Dunlap wrote: > Convert an empty line to " *" to avoid a kernel-doc warning: > > drivers/dax/super.c:478: warning: bad line: > > Signed-off-by: Randy Dunlap > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: nvd...@lists.linux.dev > --- > drivers/dax/super.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff -- a/drivers/dax/super.c b/drivers/dax/super.c > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax); > /** > * dax_holder() - obtain the holder of a dax device > * @dax_dev: a dax_device instance > - > + * > * Return: the holder's data which represents the holder if registered, > * otherwize NULL. > */ Looks good, applied.
[PATCH v2] nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE
Commit 6e9f05dc66f9 ("libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE") ...updated MAX_STRUCT_PAGE_SIZE to account for sizeof(struct page) potentially doubling in the case of CONFIG_KMSAN=y. Unfortunately this doubles the amount of capacity stolen from user addressable capacity for everyone, regardless of whether they are using the debug option. Revert that change, mandate that MAX_STRUCT_PAGE_SIZE never exceed 64, but allow debug scenarios to proceed with creating debug-sized page maps via a compile option. Note that this only applies to cases where the page map is permanent, i.e. stored in a reservation of the pmem itself ("--map=dev" in "ndctl create-namespace" terms). For the "--map=mem" case, since the allocation is ephemeral for the lifespan of the namespace, there is no explicit restriction. However, the implicit restriction, of having enough available "System RAM" to store the page map for the typically large pmem, still applies. Fixes: 6e9f05dc66f9 ("libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE") Cc: Cc: Alexander Potapenko Cc: Marco Elver Reported-by: Jeff Moyer --- Changes since v1 [1]: * Replace the module option with a compile option and a description of the tradeoffs to consider when running with KMSAN enabled in the presence of NVDIMM namespaces and their local reservation of capacity for a 'struct page' memmap array. 
(Greg) [1]: https://lore.kernel.org/all/63bc8fec4744a_5178e29...@dwillia2-xfh.jf.intel.com.notmuch/ drivers/nvdimm/Kconfig| 19 +++ drivers/nvdimm/nd.h |2 +- drivers/nvdimm/pfn_devs.c | 42 +++--- 3 files changed, 47 insertions(+), 16 deletions(-) diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index 79d93126453d..77b06d54cc62 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -102,6 +102,25 @@ config NVDIMM_KEYS depends on ENCRYPTED_KEYS depends on (LIBNVDIMM=ENCRYPTED_KEYS) || LIBNVDIMM=m +config NVDIMM_KMSAN + bool + depends on KMSAN + help + KMSAN, and other memory debug facilities, increase the size of + 'struct page' to contain extra metadata. This collides with + the NVDIMM capability to store a potentially + larger-than-"System RAM" size 'struct page' array in a + reservation of persistent memory rather than limited / + precious DRAM. However, that reservation needs to persist for + the life of the given NVDIMM namespace. If you are using KMSAN + to debug an issue unrelated to NVDIMMs or DAX then say N to this + option. Otherwise, say Y but understand that any namespaces + (with the page array stored pmem) created with this build of + the kernel will permanently reserve and strand excess + capacity compared to the CONFIG_KMSAN=n case. + + Select N if unsure. 
+
 config NVDIMM_TEST_BUILD
 	tristate "Build the unit test core"
 	depends on m
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 85ca5b4da3cf..ec5219680092 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -652,7 +652,7 @@ void devm_namespace_disable(struct device *dev,
 		struct nd_namespace_common *ndns);
 #if IS_ENABLED(CONFIG_ND_CLAIM)
 /* max struct page size independent of kernel config */
-#define MAX_STRUCT_PAGE_SIZE 128
+#define MAX_STRUCT_PAGE_SIZE 64
 int nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap);
 #else
 static inline int nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 61af072ac98f..c7655a1fe38c 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -13,6 +13,8 @@
 #include "pfn.h"
 #include "nd.h"
 
+static const bool page_struct_override = IS_ENABLED(CONFIG_NVDIMM_KMSAN);
+
 static void nd_pfn_release(struct device *dev)
 {
 	struct nd_region *nd_region = to_nd_region(dev->parent);
@@ -758,12 +760,6 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		return -ENXIO;
 	}
 
-	/*
-	 * Note, we use 64 here for the standard size of struct page,
-	 * debugging options may cause it to be larger in which case the
-	 * implementation will limit the pfns advertised through
-	 * ->direct_access() to those that are included in the memmap.
-	 */
 	start = nsio->res.start;
 	size = resource_size(&nsio->res);
 	npfns = PHYS_PFN(size - SZ_8K);
@@ -782,20 +778,33 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 	}
 	end_trunc = start + size - ALIGN_DOWN(start + size, align);
 	if (nd_pfn->mode == PFN_MODE_PMEM) {
+		unsigned long page_map_size = MAX_STRUCT_PAGE_SIZE * npfns;
+
 		/*
 		 * The altmap should be padded out to the block size used
 		 * when populating the vmemmap. This *should* be equal to
 		 * PMD_SIZE
RE: [PATCH] ACPI: NFIT: fix a potential deadlock during NFIT teardown
Vishal Verma wrote:
> Lockdep reports that acpi_nfit_shutdown() may deadlock against an
> opportune acpi_nfit_scrub(). acpi_nfit_scrub() is run from inside a
> 'work' and therefore has already acquired workqueue-internal locks. It
> also acquires acpi_desc->init_mutex. acpi_nfit_shutdown() first
> acquires init_mutex, and was subsequently attempting to cancel any
> pending workqueue items. This reversed locking order causes a potential
> deadlock:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.2.0-rc3 #116 Tainted: G O N
> ------------------------------------------------------
> libndctl/1958 is trying to acquire lock:
> 888129b461c0 ((work_completion)(&(&acpi_desc->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x43/0x450
>
> but task is already holding lock:
> 888129b460e8 (&acpi_desc->init_mutex){+.+.}-{3:3}, at: acpi_nfit_shutdown+0x87/0xd0 [nfit]
>
> which lock already depends on the new lock.
>
> ...
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&acpi_desc->init_mutex);
>                                lock((work_completion)(&(&acpi_desc->dwork)->work));
>                                lock(&acpi_desc->init_mutex);
>   lock((work_completion)(&(&acpi_desc->dwork)->work));
>
> *** DEADLOCK ***
>
> Since the workqueue manipulation is protected by its own internal locking,
> the cancellation of pending work doesn't need to be done under
> acpi_desc->init_mutex. Move cancel_delayed_work_sync() outside the
> init_mutex to fix the deadlock. Any work that starts after
> acpi_nfit_shutdown() drops the lock will see ARS_CANCEL, and the
> cancel_delayed_work_sync() will safely flush it out.
>
> Reported-by: Dan Williams
> Signed-off-by: Vishal Verma
> ---
>  drivers/acpi/nfit/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index f1cc5ec6a3b6..4e48d6db05eb 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3297,8 +3297,8 @@ void acpi_nfit_shutdown(void *data)
>
>  	mutex_lock(&acpi_desc->init_mutex);
>  	set_bit(ARS_CANCEL, &acpi_desc->scrub_flags);
> -	cancel_delayed_work_sync(&acpi_desc->dwork);
>  	mutex_unlock(&acpi_desc->init_mutex);
> +	cancel_delayed_work_sync(&acpi_desc->dwork);

Looks good, applied.
RE: [PATCH v2] nvdimm: Use kstrtobool() instead of strtobool()
Christophe JAILLET wrote:
> strtobool() is the same as kstrtobool().
> However, the latter is more used within the kernel.
>
> In order to remove strtobool() and slightly simplify kstrtox.h, switch to
> the other function name.
>
> While at it, include the corresponding header file (<linux/kstrtox.h>)
>
> Signed-off-by: Christophe JAILLET
> ---
> This patch was already sent as part of a series ([1]) that axed all usages
> of strtobool().
> Most of the patches have been merged in -next.
>
> I synced with the latest -next and re-sent the remaining ones as individual
> patches.
>
> Changes in v2:
> - sync with latest -next.

Looks good, applied for v6.3.
RE: [PATCH rcu 11/27] drivers/dax: Remove "select SRCU"
Paul E. McKenney wrote:
> Now that the SRCU Kconfig option is unconditionally selected, there is
> no longer any point in selecting it. Therefore, remove the "select SRCU"
> Kconfig statements.
>
> Signed-off-by: Paul E. McKenney
> Cc: Dan Williams
> Cc: Vishal Verma
> Cc: Dave Jiang
> Cc:

Acked-by: Dan Williams

Let me know if I should pick this up directly, otherwise I assume this will go in along with the rest of the set.
[GIT PULL] Compute Express Link (CXL) for 6.2
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.2

...to receive the CXL update for v6.2.

While it may seem backwards, the CXL update this time around includes some focus on CXL 1.x enabling, where the work to date had been with CXL 2.0 (VH topologies) in mind. First-generation CXL can mostly be supported via BIOS, similar to DDR; however, it became clear there are use cases for OS-native CXL error handling, and some CXL 3.0 endpoint features can be deployed on CXL 1.x hosts (Restricted CXL Host (RCH) topologies). So, this update brings RCH topologies into the Linux CXL device model.

In support of the ongoing CXL 2.0+ enabling, two new core kernel facilities are added. One is the ability for the kernel to flag collisions between userspace access to PCI configuration registers and kernel accesses. This is brought on by the PCIe Data-Object-Exchange (DOE) facility, a hardware mailbox over config-cycles. The other is a cpu_cache_invalidate_memregion() API that maps to wbinvd_on_all_cpus() on x86. To prevent abuse it is disabled in guest VMs and on architectures that do not support it yet. The CXL paths that need it, dynamic memory region creation and security commands (erase / unlock), are disabled when it is not present.

As for CXL 2.0+, this cycle the subsystem gains support for Persistent Memory Security commands, error handling in response to PCIe AER notifications, and support for the "XOR" host bridge interleave algorithm. That last feature, "XOR" interleave support, is built on top of the ACPICA update for this cycle [1].

The shortlog and diffstat below are from a test merge with the pending ACPI updates. So either pull the ACPI tree first, or understand you will get some unrelated ACPICA updates in this pull. This has all appeared in -next with no known outstanding issues. The x86, ACPI, and PCI touches have acks from their respective maintainers.
[1]: f350c68e3cd5 ("ACPICA: Add CXL 3.0 structures (CXIMS & RDPAS) to the CEDT table")

---

The following changes since commit f0c4d9fc9cc9462659728d168387191387e903cc:

  Linux 6.1-rc4 (2022-11-06 15:07:11 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl tags/cxl-for-6.2

for you to fetch changes up to f04facfb993de47e2133b2b842d72b97b1c50162:

  cxl/region: Fix memdev reuse check (2022-12-08 13:03:47 -0800)

----------------------------------------------------------------
cxl for 6.2

- Add the cpu_cache_invalidate_memregion() API for cache flushing in
  response to physical memory reconfiguration, or memory-side data
  invalidation from operations like secure erase or memory-device unlock.

- Add a facility for the kernel to warn about collisions between kernel
  and userspace access to PCI configuration registers

- Add support for Restricted CXL Host (RCH) topologies (formerly CXL 1.1)

- Add handling and reporting of CXL errors reported via the PCIe AER
  mechanism

- Add support for CXL Persistent Memory Security commands

- Add support for the "XOR" algorithm for CXL host bridge interleave

- Rework / simplify CXL to NVDIMM interactions

- Miscellaneous cleanups and fixes

----------------------------------------------------------------
Adam Manzanares (1):
      cxl: Replace HDM decoder granularity magic numbers

Alison Schofield (3):
      cxl/acpi: Support CXL XOR Interleave Math (CXIMS)
      tools/testing/cxl: Add XOR Math support to cxl_test
      cxl/acpi: Fail decoder add if CXIMS for HBIG is missing

Colin Ian King (1):
      cxl/region: Fix spelling mistake "memergion" -> "memregion"

Dan Williams (34):
      tools/testing/cxl: Add bridge mocking support
      cxl/acpi: Simplify cxl_nvdimm_bridge probing
      cxl/region: Drop redundant pmem region release handling
      cxl/pmem: Refactor nvdimm device registration, delete the workqueue
      cxl/pmem: Remove the cxl_pmem_wq and related infrastructure
      cxl/acpi: Move rescan to the workqueue
      tools/testing/cxl: Make mock CEDT parsing more robust
      cxl/region: Fix missing probe failure
      cxl/pmem: Enforce keyctl ABI for PMEM security
      nvdimm/region: Move cache management to the region driver
      cxl/region: Manage CPU caches relative to DPA invalidation events
      cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
      cxl/pci: Cleanup cxl_map_device_regs()
      cxl/pci: Kill cxl_map_regs()
      cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
      cxl/port: Limit the port driver to just the HDM Decoder Capability
      cxl/pci: Prepare for mapping RAS Capability Structure
      cxl/pci: Find and map the RAS Capability Structure
      cxl/pci: Add (hopeful) error handling support
      Merge "ACPICA: Add CXL 3.0 structures..." into for-6.2/cxl-xor
      cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem
      cxl/port: Add RCD endpo
[GIT PULL] DAX and HMAT fixes for v6.1-rc8
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-6.1-rc8

...to receive a few bug fixes around the handling of "Soft Reserved" memory and memory tiering information. Linux is starting to encounter more real-world systems that deploy an ACPI HMAT to describe different performance classes of memory, as well as the "special purpose" (Linux "Soft Reserved") designation from EFI. These fixes result from that testing.

It has all appeared in -next for a while with no known issues.

---

The following changes since commit 094226ad94f471a9f19e8f8e7140a09c2625abaa:

  Linux 6.1-rc5 (2022-11-13 13:12:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-fixes-6.1-rc8

for you to fetch changes up to 472faf72b33d80aa8e7a99c9410c1a23d3bf0cd8:

  device-dax: Fix duplicate 'hmem' device registration (2022-11-21 15:34:40 -0800)

----------------------------------------------------------------
dax fixes for v6.1-rc8

- Fix duplicate overlapping device-dax instances for HMAT described
  "Soft Reserved" Memory

- Fix missing node targets in the sysfs representation of memory tiers

- Remove a confusing variable initialization

----------------------------------------------------------------
Dan Williams (1):
      device-dax: Fix duplicate 'hmem' device registration

Vishal Verma (2):
      ACPI: HMAT: remove unnecessary variable initialization
      ACPI: HMAT: Fix initiator registration for single-initiator systems

 drivers/acpi/numa/hmat.c  | 27 ---
 drivers/dax/hmem/device.c | 24 +++-
 2 files changed, 35 insertions(+), 16 deletions(-)
RE: [PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode
Shiyang Ruan wrote:
> fsdax page is used not only when CoW, but also mapread. To make it
> easily understood, use 'share' to indicate that the dax page is shared
> by more than one extent. And add helper functions to use it.
>
> Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
>
> Signed-off-by: Shiyang Ruan
> ---
>  fs/dax.c | 38 ++
>  include/linux/mm_types.h | 5 -
>  include/linux/page-flags.h | 2 +-
>  3 files changed, 27 insertions(+), 18 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..edbacb273ab5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
>  	for (pfn = dax_to_pfn(entry); \
>  			pfn < dax_end_pfn(entry); pfn++)
>
> -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> +static inline bool dax_page_is_shared(struct page *page)
>  {
> -	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> +	return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
>  }
>
>  /*
> - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> + * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> + * refcount.
>  */
> -static inline void dax_mapping_set_cow(struct page *page)
> +static inline void dax_page_bump_sharing(struct page *page)

Similar to the page_ref naming, I would call this page_share_get() and the corresponding function page_share_put().

>  {
> -	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> +	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
>  		/*
>  		 * Reset the index if the page was already mapped
>  		 * regularly before.
>  		 */
>  		if (page->mapping)
> -			page->index = 1;
> -		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> +			page->share = 1;
> +		page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;

Small nit: you could save a cast here by defining PAGE_MAPPING_DAX_SHARED as "((void *) 1)".
>  	}
> -	page->index++;
> +	page->share++;
> +}
> +
> +static inline unsigned long dax_page_drop_sharing(struct page *page)
> +{
> +	return --page->share;
>  }
>
>  /*
> - * When it is called in dax_insert_entry(), the cow flag will indicate that
> + * When it is called in dax_insert_entry(), the shared flag will indicate that
>  * whether this entry is shared by multiple files. If so, set the page->mapping
> - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> + * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>  */
>  static void dax_associate_entry(void *entry, struct address_space *mapping,
> -		struct vm_area_struct *vma, unsigned long address, bool cow)
> +		struct vm_area_struct *vma, unsigned long address, bool shared)
>  {
>  	unsigned long size = dax_entry_size(entry), pfn, index;
>  	int i = 0;
> @@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>
> -		if (cow) {
> -			dax_mapping_set_cow(page);
> +		if (shared) {
> +			dax_page_bump_sharing(page);
>  		} else {
>  			WARN_ON_ONCE(page->mapping);
>  			page->mapping = mapping;
> @@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  		struct page *page = pfn_to_page(pfn);
>
>  		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> -		if (dax_mapping_is_cow(page->mapping)) {
> -			/* keep the CoW flag if this page is still shared */
> -			if (page->index-- > 0)
> +		if (dax_page_is_shared(page)) {
> +			/* keep the shared flag if this page is still shared */
> +			if (dax_page_drop_sharing(page) > 0)
>  				continue;

I think part of what makes this hard to read is trying to preserve the same code paths for shared pages and typical pages. page_share_put() should, in addition to decrementing the share count, clear out the page->mapping value.
>  		} else
>  			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..f46cac3657ad 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -103,7 +103,10 @@ struct page {
>  	};
>  	/* See page-flags.h for PAGE_MAPPING_FLAGS */
>  	struct address_space *mapping;
> -	pgoff_t index;		/* Our offset within mapping. */
> +	union {
> +
RE: [PATCH v2 0/8] fsdax,xfs: fix warning messages
Shiyang Ruan wrote:
> Changes since v1:
>   1. Added a snippet of the warning message and some of the failed cases
>   2. Separated the patch for easier review
>   3. Added page->share and its helper functions
>   4. Included the patch[1] that removes the restrictions of fsdax and reflink
> [1] https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/
>
> Many testcases failed in dax+reflink mode with a warning message in dmesg,
> such as generic/051,075,127. The warning message is like this:
>
> [ 775.509337] [ cut here ]
> [ 775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69
> [ 775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse
> [ 775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G W 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
> [ 775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014
> [ 775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
> [ 775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
> [ 775.526708] RSP: :c90001d57b30 EFLAGS: 00010082
> [ 775.527042] RAX: 002a RBX: RCX: 0042
> [ 775.527396] RDX: ea000a0f6c80 RSI: 81dfab1b RDI:
> [ 775.527819] RBP: ea000a0f6c40 R08: R09: 820625e0
> [ 775.528241] R10: c90001d579d8 R11: 820d2628 R12: 88815fc98320
> [ 775.528598] R13: c90001d57c18 R14: R15: 0001
> [ 775.528997] FS: 7f39fc75d740() GS:88817bc8() knlGS:
> [ 775.529474] CS: 0010 DS: ES: CR0: 80050033
> [ 775.529800] CR2: 7f39fc772040 CR3: 000107eb6001 CR4: 003706e0
> [ 775.530214] DR0: DR1: DR2:
> [ 775.530592] DR3: DR6: fffe0ff0 DR7: 0400
> [ 775.531002] Call Trace:
> [ 775.531230]
> [ 775.531444] dax_fault_iter+0x267/0x6c0
> [ 775.531719] dax_iomap_pte_fault+0x198/0x3d0
> [ 775.532002] __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7]
> [ 775.532603] __do_fault+0x30/0x1e0
> [ 775.532903] do_fault+0x314/0x6c0
> [ 775.533166] __handle_mm_fault+0x646/0x1250
> [ 775.533480] handle_mm_fault+0xc1/0x230
> [ 775.533810] do_user_addr_fault+0x1ac/0x610
> [ 775.534110] exc_page_fault+0x63/0x140
> [ 775.534389] asm_exc_page_fault+0x22/0x30
> [ 775.534678] RIP: 0033:0x7f39fc55820a
> [ 775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
> [ 775.535839] RSP: 002b:7ffc66a08118 EFLAGS: 00010202
> [ 775.536157] RAX: 7f39fc772001 RBX: 00042001 RCX: 63c1
> [ 775.536537] RDX: 6400 RSI: 7f39fac42050 RDI: 7f39fc772040
> [ 775.536919] RBP: 6400 R08: 7f39fc772001 R09: 00042000
> [ 775.537304] R10: 0001 R11: 0246 R12: 0001
> [ 775.537694] R13: 7f39fc772000 R14: 6401 R15: 0003
> [ 775.538086]
> [ 775.538333] ---[ end trace ]---
>
> This also affects dax+noreflink mode if we run the test after a
> dax+reflink test. So, the most urgent thing is solving the warning
> messages.
>
> With these fixes, most warning messages in dax_associate_entry() are
> gone. But honestly, generic/388 still randomly fails with the warning.
> The case shuts down xfs while fsstress is running, and does so many
> times. I think the reason is that dax pages in use are not able to be
> invalidated in time when the fs is shut down. The next time a dax page
> is to be associated, it still retains the mapping value set last time.
> I'll keep on solving it.
This one also sounds like it is going to be relevant for CXL PMEM and the improvements to the reference counting. CXL has a facility where the driver asserts that no more writes are in-flight to the device, so that the device can assert a clean shutdown. Part of that will be making sure that page access ends at fs shutdown.

> The warning message in dax_writeback_one() can also be fixed because of
> the dax unshare.
>
> Shiyang Ruan (8):
>   fsdax: introduce
Re: [PATCH 0/2] fsdax,xfs: fix warning messages
Darrick J. Wong wrote: > On Wed, Nov 30, 2022 at 01:48:59PM -0800, Dan Williams wrote: > > Andrew Morton wrote: > > > On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams > > > wrote: > > > > > > > [ add Andrew ] > > > > > > > > Shiyang Ruan wrote: > > > > > Many testcases failed in dax+reflink mode with warning message in > > > > > dmesg. > > > > > This also effects dax+noreflink mode if we run the test after a > > > > > dax+reflink test. So, the most urgent thing is solving the warning > > > > > messages. > > > > > > > > > > Patch 1 fixes some mistakes and adds handling of CoW cases not > > > > > previously considered (srcmap is HOLE or UNWRITTEN). > > > > > Patch 2 adds the implementation of unshare for fsdax. > > > > > > > > > > With these fixes, most warning messages in dax_associate_entry() are > > > > > gone. But honestly, generic/388 will randomly failed with the > > > > > warning. > > > > > The case shutdown the xfs when fsstress is running, and do it for many > > > > > times. I think the reason is that dax pages in use are not able to be > > > > > invalidated in time when fs is shutdown. The next time dax page to be > > > > > associated, it still remains the mapping value set last time. I'll > > > > > keep > > > > > on solving it. > > > > > > > > > > The warning message in dax_writeback_one() can also be fixed because > > > > > of > > > > > the dax unshare. > > > > > > > > Thank you for digging in on this, I had been pinned down on CXL tasks > > > > and worried that we would need to mark FS_DAX broken for a cycle, so > > > > this is timely. > > > > > > > > My only concern is that these patches look to have significant > > > > collisions with > > > > the fsdax page reference counting reworks pending in linux-next. 
> > > > Although, > > > > those are still sitting in mm-unstable: > > > > > > > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org > > > > > > As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat > > > stuck. Jan pointed out: > > > > > > https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u > > > > > > or have Jason's issues since been addressed? > > > > No, they have not. I do think the current series is a step forward, but > > given the urgency remains low for the time being (CXL hotplug use case > > further out, no known collisions with ongoing folio work, and no > > MEMORY_DEVICE_PRIVATE users looking to build any conversions on top for > > 6.2) I am ok to circle back for 6.3 for that follow on work to be > > integrated. > > > > > > My preference would be to move ahead with both in which case I can help > > > > rebase these fixes on top. In that scenario everything would go through > > > > Andrew. > > > > > > > > However, if we are getting too late in the cycle for that path I think > > > > these dax-fixes take precedence, and one more cycle to let the page > > > > reference count reworks sit is ok. > > > > > > That sounds a decent approach. So we go with this series ("fsdax,xfs: > > > fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup > > > mistake"? > > > > > > > Yeah, that's the path of least hassle. > > Sounds good. I still want to see patch 1 of this series broken up into > smaller pieces though. Once the series goes through review, do you want > me to push the fixes to Linus, seeing as xfs is the only user of this > functionality? Yes, that was my primary feedback as well, and merging through xfs makes sense to me.
Re: [PATCH 0/2] fsdax,xfs: fix warning messages
Andrew Morton wrote: > On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams > wrote: > > > [ add Andrew ] > > > > Shiyang Ruan wrote: > > > Many testcases failed in dax+reflink mode with warning message in dmesg. > > > This also effects dax+noreflink mode if we run the test after a > > > dax+reflink test. So, the most urgent thing is solving the warning > > > messages. > > > > > > Patch 1 fixes some mistakes and adds handling of CoW cases not > > > previously considered (srcmap is HOLE or UNWRITTEN). > > > Patch 2 adds the implementation of unshare for fsdax. > > > > > > With these fixes, most warning messages in dax_associate_entry() are > > > gone. But honestly, generic/388 will randomly failed with the warning. > > > The case shutdown the xfs when fsstress is running, and do it for many > > > times. I think the reason is that dax pages in use are not able to be > > > invalidated in time when fs is shutdown. The next time dax page to be > > > associated, it still remains the mapping value set last time. I'll keep > > > on solving it. > > > > > > The warning message in dax_writeback_one() can also be fixed because of > > > the dax unshare. > > > > Thank you for digging in on this, I had been pinned down on CXL tasks > > and worried that we would need to mark FS_DAX broken for a cycle, so > > this is timely. > > > > My only concern is that these patches look to have significant collisions > > with > > the fsdax page reference counting reworks pending in linux-next. Although, > > those are still sitting in mm-unstable: > > > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org > > As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat > stuck. Jan pointed out: > > https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u > > or have Jason's issues since been addressed? No, they have not. 
I do think the current series is a step forward, but given the urgency remains low for the time being (CXL hotplug use case further out, no known collisions with ongoing folio work, and no MEMORY_DEVICE_PRIVATE users looking to build any conversions on top for 6.2) I am ok to circle back for 6.3 for that follow on work to be integrated. > > My preference would be to move ahead with both in which case I can help > > rebase these fixes on top. In that scenario everything would go through > > Andrew. > > > > However, if we are getting too late in the cycle for that path I think > > these dax-fixes take precedence, and one more cycle to let the page > > reference count reworks sit is ok. > > That sounds a decent approach. So we go with this series ("fsdax,xfs: > fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup > mistake"? > Yeah, that's the path of least hassle.
Re: [PATCH 0/2] fsdax,xfs: fix warning messages
Darrick J. Wong wrote: > On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote: > > [ add Andrew ] > > > > Shiyang Ruan wrote: > > > Many testcases failed in dax+reflink mode with warning message in dmesg. > > > This also effects dax+noreflink mode if we run the test after a > > > dax+reflink test. So, the most urgent thing is solving the warning > > > messages. > > > > > > Patch 1 fixes some mistakes and adds handling of CoW cases not > > > previously considered (srcmap is HOLE or UNWRITTEN). > > > Patch 2 adds the implementation of unshare for fsdax. > > > > > > With these fixes, most warning messages in dax_associate_entry() are > > > gone. But honestly, generic/388 will randomly failed with the warning. > > > The case shutdown the xfs when fsstress is running, and do it for many > > > times. I think the reason is that dax pages in use are not able to be > > > invalidated in time when fs is shutdown. The next time dax page to be > > > associated, it still remains the mapping value set last time. I'll keep > > > on solving it. > > > > > > The warning message in dax_writeback_one() can also be fixed because of > > > the dax unshare. > > > > Thank you for digging in on this, I had been pinned down on CXL tasks > > and worried that we would need to mark FS_DAX broken for a cycle, so > > this is timely. > > > > My only concern is that these patches look to have significant collisions > > with > > the fsdax page reference counting reworks pending in linux-next. Although, > > those are still sitting in mm-unstable: > > > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org > > > > My preference would be to move ahead with both in which case I can help > > rebase these fixes on top. In that scenario everything would go through > > Andrew. > > > > However, if we are getting too late in the cycle for that path I think > > these dax-fixes take precedence, and one more cycle to let the page > > reference count reworks sit is ok. 
> > Well now that raises some interesting questions -- dax and reflink are > totally broken on 6.1. I was thinking about cramming them into 6.2 as a > data corruption fix on the grounds that is not an acceptable state of > affairs. I agree it's not an acceptable state of affairs, but for 6.1 the answer may be to just revert to dax+reflink being forbidden again. The fact that no end user has noticed is probably a good sign that we can disable that without any one screaming. That may be the easy answer for 6.2 as well given how late this all is. > OTOH we're past -rc7, which is **really late** to be changing core code. > Then again, there aren't so many fsdax users and nobody's complained > about 6.0/6.1 being busted, so perhaps the risk of regression isn't so > bad? Then again, that could be a sign that this could wait, if you and > Andrew are really eager to merge the reworks. The page reference counting has also been languishing for a long time. A 6.2 merge would be nice, it relieves maintenance burden, but they do not start to have real end user implications until CXL memory hotplug platforms arrive and the warts in the reference counting start to show real problems in production. > Just looking at the stuff that's still broken with dax+reflink -- I > noticed that xfs/550-552 (aka the dax poison tests) are still regressing > on reflink filesystems. That's worrying because the whole point of reworking dax, xfs, and mm/memory-failure all at once was to handle the collision of poison and reflink'd dax files. > So, uh, what would this patchset need to change if the "fsdax page > reference counting reworks" were applied? Would it be changing the page > refcount instead of stashing that in page->index? Nah, it's things like switching from pages to folios and shifting how dax goes from pfns to pages. 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable&id=cca48ba3196

Ideally fsdax would never deal in pfns at all and do everything in terms of offsets relative to a 'struct dax_device'. My gut is saying these patches, the refcount reworks, and the dax+reflink fixes, are important but not end-user critical. One more status-quo release does not hurt, and we can circle back to get this all straightened out early in v6.3. I.e. just revert:

  35fcd75af3ed xfs: fail dax mount if reflink is enabled on a partition

...for v6.1-rc8 and get back to this early in the New Year.

>
> --D
>
> > > Shiyang Ruan (2):
> > >   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
> > >   fsdax,xfs: port unshare to fsdax
> > >
> > >  fs/dax.c             | 166 ++-
> > >  fs/xfs/xfs_iomap.c   |   6 +-
> > >  fs/xfs/xfs_reflink.c |   8 ++-
> > >  include/linux/dax.h  |   2 +
> > >  4 files changed, 129 insertions(+), 53 deletions(-)
> > >
> > > --
> > > 2.38.1
RE: [PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
Shiyang Ruan wrote:
> This patch fixes the warning message reported in dax_associate_entry()
> and dax_disassociate_entry().

Can you include the xfstest test number and a snippet of the warning message?

> 1. reset page->mapping and ->index when refcount counting down to 0.
> 2. set IOMAP_F_SHARED flag when iomap read to allow one dax page to be
>    associated more than once for not only write but also read.
> 3. should zero the edge (when not aligned) if srcmap is HOLE or
>    UNWRITTEN.
> 4. iterator of two files in dedupe should be executed side by side, not
>    nested.
> 5. use xfs_dax_write_iomap_ops for xfs zero and truncate.

Do these all need to be done at once, or is this 5 patches?

>
> Signed-off-by: Shiyang Ruan
> ---
>  fs/dax.c | 114 ++---
>  fs/xfs/xfs_iomap.c | 6 +--
>  2 files changed, 69 insertions(+), 51 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..5ea7c0926b7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>  		if (dax_mapping_is_cow(page->mapping)) {
>  			/* keep the CoW flag if this page is still shared */
> -			if (page->index-- > 0)
> +			if (page->index-- > 1)

I think this wants either a helper function to make it clear that ->index is being used as a share count, or go ahead and rename that field in this context with something like:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 910d880e67eb..1a409288f39d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,10 @@ struct page {
 	};
 	/* See page-flags.h for PAGE_MAPPING_FLAGS */
 	struct address_space *mapping;
-	pgoff_t index;		/* Our offset within mapping. */
+	union {
+		pgoff_t index;	/* Our offset within mapping. */
+		unsigned long share;
+	};
 	/**
 	 * @private: Mapping-private opaque data.
 	 * Usually used for buffer_heads if PagePrivate.
RE: [PATCH 0/2] fsdax,xfs: fix warning messages
[ add Andrew ] Shiyang Ruan wrote: > Many testcases failed in dax+reflink mode with warning messages in dmesg. > This also affects dax+noreflink mode if we run the test after a > dax+reflink test. So, the most urgent thing is solving the warning > messages. > > Patch 1 fixes some mistakes and adds handling of CoW cases not > previously considered (srcmap is HOLE or UNWRITTEN). > Patch 2 adds the implementation of unshare for fsdax. > > With these fixes, most warning messages in dax_associate_entry() are > gone. But honestly, generic/388 will randomly fail with the warning. > The case shuts down the xfs filesystem while fsstress is running, and > does so many times. I think the reason is that dax pages in use are not able to be > invalidated in time when the fs is shut down. The next time a dax page is to be > associated, it still retains the mapping value set last time. I'll keep > on solving it. > > The warning message in dax_writeback_one() can also be fixed because of > the dax unshare. Thank you for digging in on this. I had been pinned down on CXL tasks and worried that we would need to mark FS_DAX broken for a cycle, so this is timely. My only concern is that these patches look to have significant collisions with the fsdax page reference counting reworks pending in linux-next. Although those are still sitting in mm-unstable: http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org My preference would be to move ahead with both, in which case I can help rebase these fixes on top. In that scenario everything would go through Andrew. However, if we are getting too late in the cycle for that path, I think these dax-fixes take precedence, and one more cycle to let the page reference count reworks sit is ok.
> Shiyang Ruan (2): > fsdax,xfs: fix warning messages at dax_[dis]associate_entry() > fsdax,xfs: port unshare to fsdax > > fs/dax.c | 166 ++- > fs/xfs/xfs_iomap.c | 6 +- > fs/xfs/xfs_reflink.c | 8 ++- > include/linux/dax.h | 2 + > 4 files changed, 129 insertions(+), 53 deletions(-) > > -- > 2.38.1
[PATCH v6] memregion: Add cpu_cache_invalidate_memregion() interface
From: Davidlohr Bueso With CXL security features, and CXL dynamic provisioning, global CPU cache flushing nvdimm requirements are no longer specific to that subsystem, even beyond the scope of security_ops. CXL will need such semantics for features not necessarily limited to persistent memory. The functionality this is enabling is to be able to instantaneously secure erase potentially terabytes of memory at once and the kernel needs to be sure that none of the data from before the erase is still present in the cache. It is also used when unlocking a memory device where speculative reads and firmware accesses could have cached poison from before the device was unlocked. Lastly this facility is used when mapping new devices, or new capacity, into an established physical address range. I.e. when the driver switches DeviceA mapping AddressX to DeviceB mapping AddressX, then any cached data from DeviceA:AddressX needs to be invalidated. This capability is typically only used once per boot (for unlock), or once per bare metal provisioning event (secure erase), like when handing off the system to another tenant or decommissioning a device. It may also be used for dynamic CXL region provisioning. Users must first call cpu_cache_has_invalidate_memregion() to know whether this functionality is available on the architecture. On x86 this respects the constraints of when wbinvd() is tolerable. It is already the case that wbinvd() is problematic to allow in VMs due to its global performance impact, and KVM, for example, has been known to just trap and ignore the call. With confidential computing, guest execution of wbinvd() may even trigger an exception. Given guests should not be messing with the bare metal address map via CXL configuration changes, cpu_cache_has_invalidate_memregion() returns false in VMs. While this global cache invalidation facility is exported to modules, since NVDIMM and CXL support can be built as modules, it is not for general use.
The intent is that this facility is not available outside of specific "device-memory" use cases. To make that expectation as clear as possible the API is scoped to a new "DEVMEM" module namespace that only the NVDIMM and CXL subsystems are expected to import. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Peter Zijlstra Tested-by: Dave Jiang Signed-off-by: Davidlohr Bueso Acked-by: Dave Hansen Co-developed-by: Dan Williams Signed-off-by: Dan Williams --- Changes since v5 [1]: - A collection of build fixed reported by the 0day build robot [1]: http://lore.kernel.org/r/166698148737.3132474.5901874516011784201.st...@dwillia2-xfh.jf.intel.com arch/x86/Kconfig |1 + arch/x86/mm/pat/set_memory.c | 18 ++ drivers/acpi/nfit/intel.c| 43 -- include/linux/memregion.h| 38 + lib/Kconfig |3 +++ 5 files changed, 80 insertions(+), 23 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 67745ceab0db..e16b2b15d67e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -69,6 +69,7 @@ config X86 select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CACHE_LINE_SIZE + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 2e5a045731de..ef34ba21aa92 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -330,6 +331,23 @@ void arch_invalidate_pmem(void *addr, size_t size) EXPORT_SYMBOL_GPL(arch_invalidate_pmem); #endif +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION +bool cpu_cache_has_invalidate_memregion(void) +{ + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); +} +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM); + +int 
cpu_cache_invalidate_memregion(int res_desc) +{ + if (WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion())) + return -ENXIO; + wbinvd_on_all_cpus(); + return 0; +} +EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM); +#endif + static void __cpa_flush_all(void *arg) { unsigned long cache = (unsigned long)arg; diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index 8dd792a55730..fa0e57e35162 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include "intel.h" #include "nfit.h" @@ -190,8 +191,6 @@ static int intel_security_cha
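The guard pattern in the code above (check availability first, and have the invalidate call fail with a warning rather than silently no-op when unsupported) can be exercised with a small userspace model. This is only a sketch: the `hypervisor` flag is a stand-in for cpu_feature_enabled(X86_FEATURE_HYPERVISOR), and wbinvd_on_all_cpus() is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for cpu_feature_enabled(X86_FEATURE_HYPERVISOR) */
static bool hypervisor;

/* The facility is only claimed on bare metal. */
static bool cpu_cache_has_invalidate_memregion(void)
{
	return !hypervisor;
}

/* Fail loudly when a caller skipped the availability check; the kernel
 * version also fires WARN_ON_ONCE() before returning -ENXIO. */
static int cpu_cache_invalidate_memregion(int res_desc)
{
	(void)res_desc;
	if (!cpu_cache_has_invalidate_memregion())
		return -ENXIO;
	/* wbinvd_on_all_cpus() would run here on bare metal */
	return 0;
}
```

The point of the runtime check, echoing the compile-time stub for the CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=n case, is that a caller path which missed the has() gate fails visibly instead of flushing nothing.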
[PATCH v5] memregion: Add cpu_cache_invalidate_memregion() interface
From: Davidlohr Bueso With CXL security features, and CXL dynamic provisioning, global CPU cache flushing nvdimm requirements are no longer specific to that subsystem, even beyond the scope of security_ops. CXL will need such semantics for features not necessarily limited to persistent memory. The functionality this is enabling is to be able to instantaneously secure erase potentially terabytes of memory at once and the kernel needs to be sure that none of the data from before the erase is still present in the cache. It is also used when unlocking a memory device where speculative reads and firmware accesses could have cached poison from before the device was unlocked. Lastly this facility is used when mapping new devices, or new capacity, into an established physical address range. I.e. when the driver switches DeviceA mapping AddressX to DeviceB mapping AddressX, then any cached data from DeviceA:AddressX needs to be invalidated. This capability is typically only used once per boot (for unlock), or once per bare metal provisioning event (secure erase), like when handing off the system to another tenant or decommissioning a device. It may also be used for dynamic CXL region provisioning. Users must first call cpu_cache_has_invalidate_memregion() to know whether this functionality is available on the architecture. On x86 this respects the constraints of when wbinvd() is tolerable. It is already the case that wbinvd() is problematic to allow in VMs due to its global performance impact, and KVM, for example, has been known to just trap and ignore the call. With confidential computing, guest execution of wbinvd() may even trigger an exception. Given guests should not be messing with the bare metal address map via CXL configuration changes, cpu_cache_has_invalidate_memregion() returns false in VMs. While this global cache invalidation facility is exported to modules, since NVDIMM and CXL support can be built as modules, it is not for general use.
The intent is that this facility is not available outside of specific "device-memory" use cases. To make that expectation as clear as possible the API is scoped to a new "DEVMEM" module namespace that only the NVDIMM and CXL subsystems are expected to import. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Peter Zijlstra Signed-off-by: Davidlohr Bueso Acked-by: Dave Hansen Co-developed-by: Dan Williams Signed-off-by: Dan Williams --- Changes since v4 [1]: - Changelog and kdoc wording fixes (Dave) - Permit x86-32 as there is no functional reason to disallow it, and the DEVMEM namespace handles limiting the usage (Dave) - Similar to the compile time assertion for the CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=n case, warn and fail cpu_cache_invalidate_memregion() if not supported at runtime (Dave) [1]: http://lore.kernel.org/r/166672803035.280.9244172033971411169.st...@dwillia2-xfh.jf.intel.com arch/x86/Kconfig |1 + arch/x86/mm/pat/set_memory.c | 17 + drivers/acpi/nfit/intel.c| 43 -- include/linux/memregion.h| 37 lib/Kconfig |3 +++ 5 files changed, 78 insertions(+), 23 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 67745ceab0db..e16b2b15d67e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -69,6 +69,7 @@ config X86 select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CACHE_LINE_SIZE + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 97342c42dda8..0a735c62fa0d 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -330,6 +330,23 @@ void arch_invalidate_pmem(void *addr, size_t size) EXPORT_SYMBOL_GPL(arch_invalidate_pmem); #endif +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION 
+bool cpu_cache_has_invalidate_memregion(void) +{ + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); +} +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM); + +int cpu_cache_invalidate_memregion(int res_desc) +{ + if (WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion())) + return -ENXIO; + wbinvd_on_all_cpus(); + return 0; +} +EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM); +#endif + static void __cpa_flush_all(void *arg) { unsigned long cache = (unsigned long)arg; diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index 8dd792a55730..fa0e57e35162 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -3,6 +
Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface
Dave Hansen wrote: > On 10/25/22 13:05, Dan Williams wrote: > > Users must first call cpu_cache_has_invalidate_memregion() to know whether > > this functionality is available on the architecture. Only enable it on > > x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX > > guests may trigger a virtualization exception and may need proper handling > > to recover. See: > That sentence doesn't quite parse correctly. Does it need to be "and > may trigger..."? Yes, but I think this can be said more generally, and with more detail: "Only enable on bare metal x86-64 via the wbinvd() hammer. It is already the case that wbinvd() is problematic to allow in VMs due to its global performance impact, and KVM, for example, has been known to just trap and ignore the call. With confidential computing, guest execution of wbinvd may even trigger an exception. As guests should not be messing with the bare metal address map, cpu_cache_has_invalidate_memregion() returns false in those environments." > > This global cache invalidation facility, > > cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM > > and CXL support can be built as a module. However, the intent is that > > this facility is not available outside of specific "device-memory" use > > cases. To that end the API is scoped to a new "DEVMEM" module namespace > > that only applies to the NVDIMM and CXL subsystems. > > Oh, thank $DEITY you're trying to limit the amount of code that has > access to this thing. > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > index 67745ceab0db..b68661d0633b 100644 > > --- a/arch/x86/Kconfig > > +++ b/arch/x86/Kconfig > > @@ -69,6 +69,7 @@ config X86 > > select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE > > select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI > > select ARCH_HAS_CACHE_LINE_SIZE > > + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64 > > What is 64-bit only about this?
> > I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit > kernels, but it would be nice to remove this if it's not strictly > needed. Or, to add a changelog nugget that says: > > Restrict this to X86_64 kernels. It would probably work on 32- > bit, but there is no practical reason to use 32-bit kernels and > no one is testing them. I had to go recall this myself, but it looks like this is unnecessarily cargo culting the stance of ARCH_HAS_PMEM_API that arose from this change: 96601adb7451 x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB ...that observed that on pure 32-bit x86 CPUs that non-temporal stores had weaker guarantees about whether writes would bypass the CPU cache. However, that commit is so old that it even talks about the interactions with the pcommit instruction. Suffice to say there is no X86_64 dependency for wbinvd, I'll drop the dependency. > > > select ARCH_HAS_CURRENT_STACK_POINTER > > select ARCH_HAS_DEBUG_VIRTUAL > > select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE > > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c > > index 97342c42dda8..8650bb6481a8 100644 > > --- a/arch/x86/mm/pat/set_memory.c > > +++ b/arch/x86/mm/pat/set_memory.c > > @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size) > > EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > > #endif > > > > +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION > > +bool cpu_cache_has_invalidate_memregion(void) > > +{ > > + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); > > +} > > +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM); > > + > > +int cpu_cache_invalidate_memregion(int res_desc) > > +{ > > + wbinvd_on_all_cpus(); > > + return 0; > > +} > > Does this maybe also deserve a: > > WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion()); > > in case one of the cpu_cache_invalidate_memregion() paths missed a > cpu_cache_has_invalidate_memregion() check? Yeah, good idea. 
> > > +/** > > + * cpu_cache_invalidate_memregion - drop any CPU cached data for > > + * memregions described by @res_desc > > + * @res_desc: one of the IORES_DESC_* types > > + * > > + * Perform cache maintenance after a memory event / operation that > > + * changes the contents of physical memory in a cache-incoherent manner. > > + * For example, device memory technologies like NVDIMM and CXL have > > + * device secure erase, or dynamic region provision features where such > > + * semantics. > > s/where/with/ ? Yes. > > > + *
Re: [PATCH v1] virtio_pmem: populate numa information
Pankaj Gupta wrote: > > > > Compute the numa information for a virtio_pmem device from the memory > > > > range of the device. Previously, the target_node was always 0 since > > > > the ndr_desc.target_node field was never explicitly set. The code for > > > > computing the numa node is taken from cxl_pmem_region_probe in > > > > drivers/cxl/pmem.c. > > > > > > > > Signed-off-by: Michael Sammler > > > > --- > > > > drivers/nvdimm/virtio_pmem.c | 11 +-- > > > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c > > > > index 20da455d2ef6..a92eb172f0e7 100644 > > > > --- a/drivers/nvdimm/virtio_pmem.c > > > > +++ b/drivers/nvdimm/virtio_pmem.c > > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem) > > > > static int virtio_pmem_probe(struct virtio_device *vdev) > > > > { > > > > struct nd_region_desc ndr_desc = {}; > > > > - int nid = dev_to_node(&vdev->dev); > > > > struct nd_region *nd_region; > > > > struct virtio_pmem *vpmem; > > > > struct resource res; > > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device > > > > *vdev) > > > > dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus); > > > > > > > > ndr_desc.res = &res; > > > > - ndr_desc.numa_node = nid; > > > > + > > > > + ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start); > > > > + ndr_desc.target_node = phys_to_target_node(res.start); > > > > + if (ndr_desc.target_node == NUMA_NO_NODE) { > > > > + ndr_desc.target_node = ndr_desc.numa_node; > > > > + dev_dbg(&vdev->dev, "changing target node from %d to > > > > %d", > > > > + NUMA_NO_NODE, ndr_desc.target_node); > > > > + } > > > > > > As this memory later gets hotplugged using "devm_memremap_pages", I don't > > > see if 'target_node' is used for the fsdax case? > > > > > > It seems to me "target_node" is used mainly for volatile range above > > > persistent memory (e.g. kmem driver?).
> > > > > I am not sure if 'target_node' is used in the fsdax case, but it is > > indeed used by the devdax/kmem driver when hotplugging the memory (see > > 'dev_dax_kmem_probe' and '__dax_pmem_probe'). > > Yes, but not currently for FS_DAX iiuc. The target_node is only used by the dax_kmem driver. In the FSDAX case the memory (persistent or otherwise) is mapped behind a block-device. That block-device has affinity to a CPU initiator, but that memory does not itself have any NUMA affinity or identity as a target. So: block-device NUMA node == closest CPU initiator node to the device dax-device target node == memory only NUMA node target, after onlining
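The node selection described above reduces to the single conditional the virtio_pmem patch adds: take the memory-only target node when the platform reports one, otherwise inherit the hotplug (numa_node) value. A minimal sketch, with a hypothetical helper name standing in for the inline fallback in the probe path:

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Model of the virtio_pmem probe fallback: prefer the memory-only
 * target node reported by phys_to_target_node(); when the platform
 * reports none, fall back to the node computed for hotplug by
 * memory_add_physaddr_to_nid(). */
static int resolve_target_node(int target_node, int numa_node)
{
	return (target_node == NUMA_NO_NODE) ? numa_node : target_node;
}
```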
[PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface
From: Davidlohr Bueso With CXL security features, global CPU cache flushing nvdimm requirements are no longer specific to that subsystem, even beyond the scope of security_ops. CXL will need such semantics for features not necessarily limited to persistent memory. The functionality this is enabling is to be able to instantaneously secure erase potentially terabytes of memory at once and the kernel needs to be sure that none of the data from before the erase is still present in the cache. It is also used when unlocking a memory device where speculative reads and firmware accesses could have cached poison from before the device was unlocked. Lastly this facility is used when mapping new devices, or new capacity into an established physical address range. I.e. when the driver switches DeviceA mapping AddressX to DeviceB mapping AddressX then any cached data from DeviceA:AddressX needs to be invalidated. This capability is typically only used once per-boot (for unlock), or once per bare metal provisioning event (secure erase), like when handing off the system to another tenant or decommissioning a device. It may also be used for dynamic CXL region provisioning. Users must first call cpu_cache_has_invalidate_memregion() to know whether this functionality is available on the architecture. Only enable it on x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX guests may trigger a virtualization exception and may need proper handling to recover. See: e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines") This global cache invalidation facility, cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM and CXL support can be built as a module. However, the intent is that this facility is not available outside of specific "device-memory" use cases. To that end the API is scoped to a new "DEVMEM" module namespace that only applies to the NVDIMM and CXL subsystems. 
Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Peter Zijlstra Signed-off-by: Davidlohr Bueso Co-developed-by: Dan Williams Signed-off-by: Dan Williams --- Changes since v3: https://lore.kernel.org/r/20220919110605.3696-1-d...@stgolabs.net - Add more details about the use of this facility in the memory provisioning flow in the changelog - Introduce the DEVMEM symbol namespace to indicate that this facility is not for general consumption (Dave H) - Update the kernel-doc for cpu_cache_invalidate_memregion() to clarify that it is not to be used outside of the CXL and NVDIMM subsystems (Dave H) arch/x86/Kconfig |1 + arch/x86/mm/pat/set_memory.c | 15 +++ drivers/acpi/nfit/intel.c| 43 -- include/linux/memregion.h| 37 lib/Kconfig |3 +++ 5 files changed, 76 insertions(+), 23 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 67745ceab0db..b68661d0633b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -69,6 +69,7 @@ config X86 select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CACHE_LINE_SIZE + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64 select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 97342c42dda8..8650bb6481a8 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size) EXPORT_SYMBOL_GPL(arch_invalidate_pmem); #endif +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION +bool cpu_cache_has_invalidate_memregion(void) +{ + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); +} +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM); + +int cpu_cache_invalidate_memregion(int res_desc) +{ + wbinvd_on_all_cpus(); + return 0; +} 
+EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM); +#endif + static void __cpa_flush_all(void *arg) { unsigned long cache = (unsigned long)arg; diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index 8dd792a55730..fa0e57e35162 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include "intel.h" #include "nfit.h" @@ -190,8 +191,6 @@ static int intel_security_change_key(struct nvdimm *nvdimm, } } -static void nvdimm_invalidate_cache(void); - static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm, const struct nvdimm_key_data *key_data) { @@ -213,6 +212,9 @@ st
RE: [PATCH v3 -next] memregion: Add cpu_cache_invalidate_memregion() interface
[ add the rest of the X86 MM maintainers, Dave and Andy ] Davidlohr Bueso wrote: > With CXL security features, global CPU cache flushing nvdimm requirements > are no longer specific to that subsystem, even beyond the scope of > security_ops. CXL will need such semantics for features not necessarily > limited to persistent memory. > > The functionality this is enabling is to be able to instantaneously > secure erase potentially terabytes of memory at once and the kernel > needs to be sure that none of the data from before the erase is still > present in the cache. It is also used when unlocking a memory device > where speculative reads and firmware accesses could have cached poison > from before the device was unlocked. > > This capability is typically only used once per-boot (for unlock), or > once per bare metal provisioning event (secure erase), like when handing > off the system to another tenant or decommissioning a device. It may > also be used for dynamic CXL region provisioning. > > Users must first call cpu_cache_has_invalidate_memregion() to know whether > this functionality is available on the architecture. Only enable it on > x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX > guests may trigger a virtualization exception and may need proper handling > to recover. See: > >e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines") > > Signed-off-by: Davidlohr Bueso > --- > Changes from v2 > (https://lore.kernel.org/all/20220829212918.4039240-1-d...@stgolabs.net/): > - Change the names and params (Dan). > - GPL symbols (Boris). > - Mentioned VMM check in the changelog (Boris). Any remaining concerns from x86 maintainers? Peter had asked whether this requirement to use wbinvd could be addressed from the CXL device side in the future. I did raise this question and I will point out that "back-invalidate" (device initiated requests to invalidate CPU caches) is a prominent capability defined in the CXL 3.0 specification. 
So, there is at least line of sight for that to be used for these flows going forward. There will also be motivation for this from platforms that do not have an equivalent to wbinvd once the Linux CXL stack declines to support dynamic CXL memory region provisioning and secure erase in the absence of a functional cpu_cache_invalidate_memregion() or a device-side back-invalidate capability. > > > arch/x86/Kconfig | 1 + > arch/x86/mm/pat/set_memory.c | 15 + > drivers/acpi/nfit/intel.c| 41 > include/linux/memregion.h| 35 ++ > lib/Kconfig | 3 +++ > 5 files changed, 72 insertions(+), 23 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 2e8f6fd28e59..fa5cc581315a 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -69,6 +69,7 @@ config X86 > select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE > select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI > select ARCH_HAS_CACHE_LINE_SIZE > + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64 > select ARCH_HAS_CURRENT_STACK_POINTER > select ARCH_HAS_DEBUG_VIRTUAL > select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c > index 0656db33574d..7d940ae2fede 100644 > --- a/arch/x86/mm/pat/set_memory.c > +++ b/arch/x86/mm/pat/set_memory.c > @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size) > EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > #endif > > +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION > +bool cpu_cache_has_invalidate_memregion(void) > +{ > + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); > +} > +EXPORT_SYMBOL_GPL(cpu_cache_has_invalidate_memregion); > + > +int cpu_cache_invalidate_memregion(int res_desc) > +{ > + wbinvd_on_all_cpus(); > + return 0; > +} > +EXPORT_SYMBOL_GPL(cpu_cache_invalidate_memregion); > +#endif > + > static void __cpa_flush_all(void *arg) > { > unsigned long cache = (unsigned long)arg; > diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c > index 
8dd792a55730..b2bfbf5797da 100644 > --- a/drivers/acpi/nfit/intel.c > +++ b/drivers/acpi/nfit/intel.c > @@ -3,6 +3,7 @@ > #include > #include > #include > +#include > #include > #include "intel.h" > #include "nfit.h" > @@ -190,8 +191,6 @@ static int intel_security_change_key(struct nvdimm > *nvdimm, > } > } > > -static void nvdimm_invalidate_cache(void); > - > static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm, > const struct nvdimm_key_data *key_data) > { > @@ -213,6 +212,9 @@ static int __maybe_unused intel_security_unlock(struct > nvdimm *nvdimm, > if (!test_bit(NVDIMM_INTEL_UNLOCK_UNIT, _mem->dsm_mask)) > return -ENOTTY; > > + if (!cpu_cache_has_invalidate_memregion()) > + return -EINVAL; >
[GIT PULL] NVDIMM for 6.1
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-6.1 ...to receive some small cleanups and fixes in and around the nvdimm subsystem. The most significant change is a regression fix for nvdimm namespace (volume) creation when the namespace size is smaller than 2MB. It has appeared in linux-next with no reported issues. --- The following changes since commit 521a547ced6477c54b4b0cc206000406c221b4d6: Linux 6.0-rc6 (2022-09-18 13:44:14 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-6.1 for you to fetch changes up to 305a72efa791c826fe84768ca55e31adc4113ea8: Merge branch 'for-6.1/nvdimm' into libnvdimm-for-next (2022-10-01 19:16:16 -0700) libnvdimm for 6.1 - Fix nvdimm namespace creation on platforms that do not publish associated 'DIMM' metadata for a persistent memory region. - Miscellaneous fixes and cleanups. Andy Shevchenko (2): nvdimm/namespace: return uuid_null only once in nd_dev_to_uuid() nvdimm/namespace: drop unneeded temporary variable in size_store() Bo Liu (1): dax: Remove usage of the deprecated ida_simple_xxx API Dan Williams (1): Merge branch 'for-6.1/nvdimm' into libnvdimm-for-next Jason Wang (1): nvdimm/namespace: Fix comment typo Jiapeng Chong (2): nvdimm/region: Fix kernel-doc nvdimm: make __nvdimm_security_overwrite_query static Lin Yujun (1): ACPI: HMAT: Release platform device in case of platform_device_add_data() fails Tyler Hicks (1): libnvdimm/region: Allow setting align attribute on regions without mappings drivers/dax/hmem/device.c | 4 ++-- drivers/dax/super.c | 6 +++--- drivers/nvdimm/namespace_devs.c | 24 drivers/nvdimm/region_devs.c| 10 -- drivers/nvdimm/security.c | 2 +- 5 files changed, 18 insertions(+), 28 deletions(-)
RE: [PATCH] dax: Remove usage of the deprecated ida_simple_xxx API
Bo Liu wrote: > Use ida_alloc_xxx()/ida_free() instead of > ida_simple_get()/ida_simple_remove(). > The latter is deprecated and more verbose. The better justification is: ida_alloc_max() makes it clear that the second argument is inclusive, and the alloc/free terminology is more idiomatic and symmetric than get/remove. Otherwise, I would not apply the patch if the only justification was deprecation and verbosity. I'll fixup the changelog.
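The inclusive-vs-exclusive point in the reworked justification can be modeled outside the kernel: ida_simple_get() takes an exclusive 'end' while ida_alloc_max() takes an inclusive 'max', so the same id range is stated directly in the new API. The toy allocators below are hypothetical stand-ins that only model the bound handling, not real id recycling:

```c
#include <assert.h>

/* Toy model of ida_simple_get(): the range is [start, end), i.e. the
 * 'end' argument is exclusive, which is easy to get off by one. */
static int toy_simple_get(int *next, int start, int end)
{
	return (*next >= start && *next < end) ? (*next)++ : -1;
}

/* Toy model of ida_alloc_max(): the range is [0, max], i.e. the 'max'
 * argument is inclusive, so the largest valid id is named directly. */
static int toy_alloc_max(int *next, int max)
{
	return (*next <= max) ? (*next)++ : -1;
}
```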
Re: [PATCH v2] libnvdimm/region: Allow setting align attribute on regions without mappings
Tyler Hicks wrote: > On 2022-09-26 16:18:18, Jeff Moyer wrote: > > Tyler Hicks writes: > > > > > The alignment constraint for namespace creation in a region was > > > increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with > > > commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align' > > > attribute"). The thought behind the change was that region alignment > > > should be uniform across all architectures and, since PowerPC had the > > > largest alignment constraint of 16M, all architectures should conform to > > > that alignment. > > > > > > The change regressed namespace creation in pre-defined regions that > > > relied on 2M alignment, but a workaround was provided in the form of a > > > sysfs attribute, named 'align', that could be adjusted to a non-default > > > alignment value. > > > > > > However, the sysfs attribute's store function returned an error (-ENXIO) > > > when userspace attempted to change the alignment of a region that had no > > > mappings. This affected 2M aligned regions of volatile memory that were > > > defined in a device tree using "pmem-region" and created by the > > > of_pmem_region_driver, since those regions do not contain mappings > > > (ndr_mappings is 0). > > > > > > Allow userspace to set the align attribute on pre-existing regions that > > > do not have mappings so that namespaces can still be created within those > > > regions, despite not being aligned to 16M. > > > > > > Link: > > > https://lore.kernel.org/lkml/CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4glddd7mxs321b...@mail.gmail.com > > > Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute") > > > Signed-off-by: Tyler Hicks > > > --- > > > > > > While testing with a recent kernel release (6.0-rc3), I rediscovered > > > this bug and eventually realized that I never followed through with > > > fixing it upstream. A year later, here's the v2 that Aneesh > > > requested. Sorry about that!
> > > > > > v2: > > > - Included Aneesh's feedback to ensure the val is a power of 2 and > > > greater than PAGE_SIZE even for regions without mappings > > > - Reused the max_t() trick from default_align() to avoid special > > > casing, with an if-else, when regions have mappings and when they > > > don't > > > + Didn't include Pavel's Reviewed-by since this is a slightly > > > different approach than what he reviewed in v1 > > > - Added a Link commit tag to Pavel's initial problem description > > > v1: > > > https://lore.kernel.org/lkml/20210326152645.85225-1-tyhi...@linux.microsoft.com/ > > > > > > drivers/nvdimm/region_devs.c | 8 +++- > > > 1 file changed, 3 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > > > index 473a71bbd9c9..550ea0bd6c53 100644 > > > --- a/drivers/nvdimm/region_devs.c > > > +++ b/drivers/nvdimm/region_devs.c > > > @@ -509,16 +509,13 @@ static ssize_t align_store(struct device *dev, > > > { > > > struct nd_region *nd_region = to_nd_region(dev); > > > unsigned long val, dpa; > > > - u32 remainder; > > > + u32 mappings, remainder; > > > int rc; > > > > > > rc = kstrtoul(buf, 0, &val); > > > if (rc) > > > return rc; > > > > > > - if (!nd_region->ndr_mappings) > > > - return -ENXIO; > > > - > > > /* > > > * Ensure space-align is evenly divisible by the region > > > * interleave-width because the kernel typically has no facility > > > @@ -526,7 +523,8 @@ static ssize_t align_store(struct device *dev, > > > * contribute to the tail capacity in system-physical-address > > > * space for the namespace. > > > */ > > > - dpa = div_u64_rem(val, nd_region->ndr_mappings, &remainder); > > > + mappings = max_t(u32, 1, nd_region->ndr_mappings); > > > + dpa = div_u64_rem(val, mappings, &remainder); > > > if (!is_power_of_2(dpa) || dpa < PAGE_SIZE > > > || val > region_size(nd_region) || remainder) > > > return -EINVAL; > > > > The math all looks okay, and this matches what's done in default_align. 
> > Unfortunately, I don't know enough about the power architecture to > > understand how you can have a region with no dimms (ndr_mappings == 0). > > Thanks for having a look! > > FWIW, I need this working on arm64. It previously did before the commit > mentioned in the Fixes line. ndr_mappings is 0 when defining a > pmem-region in the device tree. The region is also marked as 'volatile' > but I don't recall if that contributes to ndr_mappings being 0. Oh, ndr_mappings being 0 can also happen for cases where the platform just does not advertise any DIMMs / nmemX devices that map to a region. It is not possible for ndr_mappings to change at run time as they are only established at nd_region_create() time. The change looks good to me.
[GIT PULL] NVDIMM and DAX fixes for 6.0-final
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-and-nvdimm-fixes-v6.0-final ...to receive a recently discovered one-line fix for devdax that further addresses a v5.5 regression, and (a bit embarrassing) a small batch of fixes that have been sitting in my fixes tree for weeks. The older fixes have soaked in linux-next during that time and address an fsdax infinite loop and some other minor fixups. --- The following changes since commit 521a547ced6477c54b4b0cc206000406c221b4d6: Linux 6.0-rc6 (2022-09-18 13:44:14 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/dax-and-nvdimm-fixes-v6.0-final for you to fetch changes up to b3bbcc5d1da1b654091dad15980b3d58fdae0fc6: Merge branch 'for-6.0/dax' into libnvdimm-fixes (2022-09-24 18:14:12 -0700) dax-and-nvdimm-fixes-v6.0-final - Fix an infinite loop bug in fsdax - Fix memory-type detection for devdax (EINJ regression) - Small cleanups Andy Shevchenko (1): nvdimm/namespace: drop nested variable in create_namespace_pmem() Dan Williams (2): devdax: Fix soft-reservation memory description Merge branch 'for-6.0/dax' into libnvdimm-fixes Jane Chu (1): pmem: fix a name collision Li Jinlin (1): fsdax: Fix infinite loop in dax_iomap_rw() Shivaprasad G Bhat (1): ndtest: Cleanup all of blk namespace specific code drivers/dax/hmem/device.c | 1 + drivers/nvdimm/namespace_devs.c | 2 - drivers/nvdimm/pmem.c | 6 +-- fs/dax.c | 3 ++ tools/testing/nvdimm/test/ndtest.c | 77 -- 5 files changed, 7 insertions(+), 82 deletions(-)
RE: [PATCH] dax: Check dev_set_name() return value
Bo Liu wrote: > It's possible that dev_set_name() returns -ENOMEM, catch and handle this. > > Signed-off-by: Bo Liu > --- > drivers/dax/bus.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1dad813ee4a6..36cf245ee467 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -765,7 +765,12 @@ static int devm_register_dax_mapping(struct dev_dax > *dev_dax, int range_id) > device_initialize(dev); > dev->parent = &dev_dax->dev; > dev->type = &dax_mapping_type; > - dev_set_name(dev, "mapping%d", mapping->id); > + rc = dev_set_name(dev, "mapping%d", mapping->id); > + if (rc) { > + kfree(mapping); No, this needs to trigger put_device() otherwise it leaks mapping->id.
RE: [PATCH] nvdimm: Call ida_simple_remove() when failed
Bo Liu wrote: > In function nvdimm_bus_register(), when code execution fails, we should > call ida_simple_remove() to free ida. > > Signed-off-by: Bo Liu > --- > drivers/nvdimm/bus.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c > index b38d0355b0ac..3415dc62632b 100644 > --- a/drivers/nvdimm/bus.c > +++ b/drivers/nvdimm/bus.c > @@ -371,6 +371,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device > *parent, > return nvdimm_bus; > err: > put_device(&nvdimm_bus->dev); > + ida_simple_remove(&nd_ida, nvdimm_bus->id); No, this is a double-free. The put_device() will trigger nvdimm_bus_release().
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Andrew Morton wrote: > On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams > wrote: > > > Jonathan Cameron wrote: > > > On Wed, 7 Sep 2022 18:07:31 -0700 > > > Dan Williams wrote: > > > > > > > Andrew Morton wrote: > > > > > I really dislike the term "flush". Sometimes it means writeback, > > > > > sometimes it means invalidate. Perhaps at other times it means > > > > > both. > > > > > > > > > > Can we please be very clear in comments and changelogs about exactly > > > > > what this "flush" does. With bonus points for being more specific > > > > > in the > > > > > function naming? > > > > > > > > > > > > > That's a good point, "flush" has been cargo-culted along in Linux's > > > > cache management APIs to mean write-back-and-invalidate. In this case I > > > > think this API is purely about invalidate. It just so happens that x86 > > > > has not historically had a global invalidate instruction readily > > > > available which leads to the overuse of wbinvd. > > > > > > > > It would be nice to make clear that this API is purely about > > > > invalidating any data cached for a physical address impacted by address > > > > space management event (secure erase / new region provision). Write-back > > > > is an unnecessary side-effect. > > > > > > > > So how about: > > > > > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/? > > > > > > Want to indicate it 'might' write back perhaps? > > > So could be invalidate or clean and invalidate (using arm ARM terms just > > > to add > > > to the confusion ;) > > > > > > Feels like there will be potential race conditions where that matters as > > > we might > > > force stale data to be written back. > > > > > > Perhaps a comment is enough for that. Anyone have the "famous last words" > > > feeling? > > > > Is "invalidate" not clear that write-back is optional? Maybe not. > > Yes, I'd say that "invalidate" means "dirty stuff may or may not have > been written back". Ditto for invalidate_inode_pages2(). 
> > > Also, I realized that we tried to include the address range to allow for > > the possibility of flushing by virtual address range, but that > > overcomplicates the use. I.e. if someone issues secure erase and the > > region association is not established does that mean that the > > cache invalidation is not needed? It could be the case that someone > > disables a device, does the secure erase, and then reattaches to the > > same region. The cache invalidation is needed, but at the time of the > > secure erase the HPA was unknown. > > > > All this to say that I feel the bikeshedding will need to continue until > > morale improves. > > > > I notice that the DMA API uses 'sync' to indicate, "make this memory > > consistent/coherent for the CPU or the device", so how about an API like > > > > memregion_sync_for_cpu(int res_desc) > > > > ...where the @res_desc would be IORES_DESC_CXL for all CXL and > > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case. > > "sync" is another of my pet peeves ;) In filesystem land, at least. > Does it mean "start writeback and return" or does it mean "start > writeback, wait for its completion then return". Ok, no "sync" :). /** * cpu_cache_invalidate_memregion - drop any CPU cached data for * memregions described by @res_desc * @res_desc: one of the IORES_DESC_* types * * Perform cache maintenance after a memory event / operation that * changes the contents of physical memory in a cache-incoherent manner. * For example, memory-device secure erase, or provisioning new CXL * regions. This routine may or may not write back any dirty contents * while performing the invalidation. * * Returns 0 on success or negative error code on a failure to perform * the cache maintenance. */ int cpu_cache_invalidate_memregion(int res_desc) ??
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Jonathan Cameron wrote: > On Wed, 7 Sep 2022 18:07:31 -0700 > Dan Williams wrote: > > > Andrew Morton wrote: > > > I really dislike the term "flush". Sometimes it means writeback, > > > sometimes it means invalidate. Perhaps at other times it means > > > both. > > > > > > Can we please be very clear in comments and changelogs about exactly > > > what this "flush" does. With bonus points for being more specific in > > > the > > > function naming? > > > > > > > That's a good point, "flush" has been cargo-culted along in Linux's > > cache management APIs to mean write-back-and-invalidate. In this case I > > think this API is purely about invalidate. It just so happens that x86 > > has not historically had a global invalidate instruction readily > > available which leads to the overuse of wbinvd. > > > > It would be nice to make clear that this API is purely about > > invalidating any data cached for a physical address impacted by address > > space management event (secure erase / new region provision). Write-back > > is an unnecessary side-effect. > > > > So how about: > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/? > > Want to indicate it 'might' write back perhaps? > So could be invalidate or clean and invalidate (using arm ARM terms just to > add > to the confusion ;) > > Feels like there will be potential race conditions where that matters as we > might > force stale data to be written back. > > Perhaps a comment is enough for that. Anyone have the "famous last words" > feeling? Is "invalidate" not clear that write-back is optional? Maybe not. Also, I realized that we tried to include the address range to allow for the possibility of flushing by virtual address range, but that overcomplicates the use. I.e. if someone issues secure erase and the region association is not established does that mean that the cache invalidation is not needed? 
It could be the case that someone disables a device, does the secure erase, and then reattaches to the same region. The cache invalidation is needed, but at the time of the secure erase the HPA was unknown. All this to say that I feel the bikeshedding will need to continue until morale improves. I notice that the DMA API uses 'sync' to indicate, "make this memory consistent/coherent for the CPU or the device", so how about an API like memregion_sync_for_cpu(int res_desc) ...where the @res_desc would be IORES_DESC_CXL for all CXL and IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Borislav Petkov wrote: > On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote: > > To be clear nfit stuff and CXL does run in guests, but they do not > > support secure-erase in a guest. > > > > However, the QEMU CXL enabling is building the ability to do *guest > > physical* address space management, but in that case the driver can be > > paravirtualized to realize that it is not managing host-physical address > > space and does not need to flush caches. That will need some indicator > > to differentiate virtual CXL memory expanders from assigned devices. > > Sounds to me like that check should be improved later to ask > whether the kernel is managing host-physical address space, maybe > arch_flush_memregion() should check whether the address it is supposed > to flush is host-physical and exit early if not... Even though I raised the possibility of guest passthrough of a CXL memory expander, I do not think it could work in practice without it being a gigantic security nightmare. So it is probably safe to just do the hypervisor check and assume that there's no such thing as guest management of host physical address space.
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Andrew Morton wrote: > I really dislike the term "flush". Sometimes it means writeback, > sometimes it means invalidate. Perhaps at other times it means > both. > > Can we please be very clear in comments and changelogs about exactly > what this "flush" does. With bonus points for being more specific in the > function naming? > That's a good point, "flush" has been cargo-culted along in Linux's cache management APIs to mean write-back-and-invalidate. In this case I think this API is purely about invalidate. It just so happens that x86 has not historically had a global invalidate instruction readily available which leads to the overuse of wbinvd. It would be nice to make clear that this API is purely about invalidating any data cached for a physical address impacted by address space management event (secure erase / new region provision). Write-back is an unnecessary side-effect. So how about: s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Davidlohr Bueso wrote: > On Wed, 07 Sep 2022, Borislav Petkov wrote: > > >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote: > >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c > >> index 1abd5438f126..18463cb704fb 100644 > >> --- a/arch/x86/mm/pat/set_memory.c > >> +++ b/arch/x86/mm/pat/set_memory.c > >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size) > >> EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > >> #endif > >> > >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE > >> +bool arch_has_flush_memregion(void) > >> +{ > >> + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); > > > >This looks really weird. Why does this need to care about HV at all? > > So the context here is: > > e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines") > > > > >Does that nfit stuff even run in guests? > > No, nor does cxl. This was mostly in general a precautionary check such > that the api is unavailable in VMs. To be clear nfit stuff and CXL does run in guests, but they do not support secure-erase in a guest. However, the QEMU CXL enabling is building the ability to do *guest physical* address space management, but in that case the driver can be paravirtualized to realize that it is not managing host-physical address space and does not need to flush caches. That will need some indicator to differentiate virtual CXL memory expanders from assigned devices. Is there such a thing as a PCIe-virtio extended capability to differentiate physical vs emulated devices?
Re: [PATCH -next] memregion: Add arch_flush_memregion() interface
Davidlohr Bueso wrote: > Not sure the proper way to route this (akpm?). But unless any remaining > objections, could this be picked up? My plan was, barring objections, to take it through the CXL tree with its first user, the CXL security commands.
RE: [PATCH] nvdimm: Avoid wasting some memory.
Christophe JAILLET wrote: > sizeof(struct btt_sb) is 4096. > > When using devm_kzalloc(), there is a small memory overhead and, on most > systems, this leads to 40 bytes of extra memory allocation. > So 4136 bytes are expected to be allocated. > > The memory allocator works with fixed-size chunks of memory. In this case, > it will require 8192 bytes of memory because more than 4096 bytes are > required. > > In order to avoid wasting 4 KB of memory, just use kzalloc() and add a > devm action to free it when needed. > > Signed-off-by: Christophe JAILLET > --- > drivers/nvdimm/btt_devs.c | 17 - > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c > index fabbb31f2c35..7b79fb0b0338 100644 > --- a/drivers/nvdimm/btt_devs.c > +++ b/drivers/nvdimm/btt_devs.c > @@ -332,6 +332,11 @@ static int __nd_btt_probe(struct nd_btt *nd_btt, > return 0; > } > > +void nd_btt_free(void *data) > +{ > + kfree(data); > +} > + > int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns) > { > int rc; > @@ -356,7 +361,17 @@ int nd_btt_probe(struct device *dev, struct > nd_namespace_common *ndns) > nvdimm_bus_unlock(&ndns->dev); > if (!btt_dev) > return -ENOMEM; > - btt_sb = devm_kzalloc(dev, sizeof(*btt_sb), GFP_KERNEL); > + > + /* > + * 'struct btt_sb' is 4096. Using devm_kzalloc() would waste 4 KB of > + * memory because, due to a small memory overhead, 8192 bytes > + * would be allocated. So keep this kzalloc()+devm_add_action_or_reset() > + */ > + btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL); > + rc = devm_add_action_or_reset(dev, nd_btt_free, btt_sb); > + if (rc) > + return rc; Thanks for the analysis and the patch. However, shouldn't this be something that is addressed internal to devm_kzalloc() rather than open-coded at every potential call site?
RE: [PATCH -next] memregion: Add arch_flush_memregion() interface
[ add Christoph ] Davidlohr Bueso wrote: > With CXL security features, global CPU cache flushing nvdimm requirements > are no longer specific to that subsystem, even beyond the scope of > security_ops. CXL will need such semantics for features not necessarily > limited to persistent memory. > > The functionality this is enabling is to be able to instantaneously > secure erase potentially terabytes of memory at once and the kernel > needs to be sure that none of the data from before the secure is still > present in the cache. It is also used when unlocking a memory device > where speculative reads and firmware accesses could have cached poison > from before the device was unlocked. > > This capability is typically only used once per-boot (for unlock), or > once per bare metal provisioning event (secure erase), like when handing > off the system to another tenant or decommissioning a device. > > Users must first call arch_has_flush_memregion() to know whether this > functionality is available on the architecture. Only enable it on x86-64 > via the wbinvd() hammer. > > Signed-off-by: Davidlohr Bueso > --- > > Changes from v2 > (https://lore.kernel.org/all/20220819171024.1766857-1-d...@stgolabs.net/): > - Redid to use memregion based interfaces + VMM check on x86 (Dan) > - Restricted the flushing to x86-64. > > Note: Since we still are dealing with a physical "range" at this level, > added the spa range for nfit even though this is unused. Looks reasonable to me. Reviewed-by: Dan Williams
RE: [PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
Shiyang Ruan wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > (or mapped device) on it to unmap all files in use and notify processes > who are using those files. > > Call trace: > trigger unbind > -> unbind_store() >-> ... (skip) > -> devres_release_all() > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) >-> xfs_dax_notify_failure() > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. So do not shutdown filesystem directly if something not > supported, or if failure range includes metadata area. Make sure all > files and processes are handled correctly. > > == > Changes since v6: >1. Rebase on 6.0-rc2 and Darrick's patch[2]. > > Changes since v5: >1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE >2. hold s_umount before sync_filesystem() >3. do sync_filesystem() after SB_BORN check >4. Rebased on next-20220714 > > [1]: > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/ > [2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ > > Signed-off-by: Shiyang Ruan > Reviewed-by: Darrick J. 
Wong > --- > drivers/dax/super.c | 3 ++- > fs/xfs/xfs_notify_failure.c | 15 +++ > include/linux/mm.h | 1 + > 3 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 9b5e2a5eb0ae..cf9a64563fbe 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > return; > if (dax_dev->holder_data != NULL) > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > + dax_holder_notify_failure(dax_dev, 0, U64_MAX, > + MF_MEM_PRE_REMOVE); > clear_bit(DAXDEV_ALIVE, &dax_dev->flags); > synchronize_srcu(&dax_srcu); > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > index 65d5eb20878e..a9769f17e998 100644 > --- a/fs/xfs/xfs_notify_failure.c > +++ b/fs/xfs/xfs_notify_failure.c > @@ -77,6 +77,9 @@ xfs_dax_failure_fn( > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { > + /* Do not shutdown so early when device is to be removed */ > + if (notify->mf_flags & MF_MEM_PRE_REMOVE) > + return 0; > notify->want_shutdown = true; > return 0; > } > @@ -182,12 +185,22 @@ xfs_dax_notify_failure( > struct xfs_mount *mp = dax_holder(dax_dev); > u64 ddev_start; > u64 ddev_end; > + int error; > if (!(mp->m_sb.sb_flags & SB_BORN)) { How are you testing the SB_BORN interactions? I have a fix for this pending here: https://lore.kernel.org/nvdimm/166153428094.2758201.7936572520826540019.st...@dwillia2-xfh.jf.intel.com/ > xfs_warn(mp, "filesystem is not ready for notify_failure()!"); > return -EIO; > } > + if (mf_flags & MF_MEM_PRE_REMOVE) { It appears this patch is corrupted here. I confirmed that b4 sees the same when trying to apply it. > + xfs_info(mp, "device is about to be removed!"); > + down_write(&mp->m_super->s_umount); > + error = sync_filesystem(mp->m_super); This syncs to make data persistent, but for DAX this also needs to invalidate all current DAX mappings. 
I do not see that in these changes. > + up_write(&mp->m_super->s_umount); > + if (error) > + return error; > + } > + > if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { > xfs_warn(mp, > "notify_failure() not supported on realtime device!"); > @@ -196,6 +209,8 @@ xfs_dax_notify_failure( > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && > mp->m_logdev_targp != mp->m_ddev_targp) { > + if (mf_flags & MF_MEM_PRE_REMOVE) > + return 0; > xfs_err(mp, "ondisk log corrupt, shutting down fs!"); > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); > return -EFSCORRUPTED; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 982f2607180b..2c7c132e6512 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3176,6 +3176,7 @@ enum mf_flags { > MF_UNPOISON = 1 << 4, > MF_SW_SIMULATED = 1 << 5, > MF_NO_RETRY = 1 << 6, > + MF_MEM_PRE_REMOVE = 1 << 7, > }; > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, > unsigned long count, int mf_flags); > -- > 2.37.2 > >
Re: [PATCH v7] x86/mce: retrieve poison range from hardware
Borislav Petkov wrote: > On Thu, Aug 25, 2022 at 04:29:47PM +, Jane Chu wrote: > > Tony has replied. > > Do you really think that I can't look up what field means? > > What I said was > > "What I'm missing from this text here is... " > > IOW, what I'm trying to say is, you should formulate your commit message > better, more human-friendly. Right now it reads like for insiders only. > But that's not its purpose. > > Do you catch my drift? How about: --- When memory poison consumption machine checks fire, mce-notifier-handlers like nfit_handle_mce() record the impacted physical address range. The error information includes data about blast radius, i.e. how many cachelines did the hardware determine are impacted. A recent change, commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity"), updated nfit_handle_mce() to stop hard coding the blast radius value of 1 cacheline, and instead rely on the blast radius reported in 'struct mce' which can be up to 4K (64 cachelines). It turns out that apei_mce_report_mem_error() had a similar problem in that it hard coded a blast radius of 4K rather than checking the blast radius in the error information. Fix apei_mce_report_mem_error() to convey the proper poison granularity. ---
Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case
Dan Williams wrote: > Dan Williams wrote: > > HORIGUCHI NAOYA(堀口 直也) wrote: > > > On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote: > > > > Shiyang Ruan wrote: > > > > > This new function is a variant of mf_generic_kill_procs that accepts a > > > > > file, offset pair instead of a struct to support multiple files > > > > > sharing > > > > > a DAX mapping. It is intended to be called by the file systems as > > > > > part > > > > > of the memory_failure handler after the file system performed a > > > > > reverse > > > > > mapping from the storage address to the file and file offset. > > > > > > > > > > Signed-off-by: Shiyang Ruan > > > > > Reviewed-by: Dan Williams > > > > > Reviewed-by: Christoph Hellwig > > > > > Reviewed-by: Darrick J. Wong > > > > > Reviewed-by: Miaohe Lin > > > > > --- > > > > > include/linux/mm.h | 2 + > > > > > mm/memory-failure.c | 96 > > > > > - > > > > > 2 files changed, 88 insertions(+), 10 deletions(-) > > > > > > > > Unfortunately my test suite was only running the "non-destructive" set > > > > of 'ndctl' tests which skipped some of the complex memory-failure cases. > > > > Upon fixing that, bisect flags this commit as the source of the > > > > following > > > > crash regression: > > > > > > Thank you for testing/reporting. > > > > > > > > > > > kernel BUG at mm/memory-failure.c:310! > > > > invalid opcode: [#1] PREEMPT SMP PTI > > > > CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G OE > > > > 5.19.0-rc4+ #58 > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 > > > > 02/06/2015 > > > > RIP: 0010:add_to_kill+0x304/0x400 > > > > [..] > > > > Call Trace: > > > > > > > > collect_procs.part.0+0x2c8/0x470 > > > > memory_failure+0x979/0xf30 > > > > do_madvise.part.0.cold+0x9c/0xd3 > > > > ? lock_is_held_type+0xe3/0x140 > > > > ? find_held_lock+0x2b/0x80 > > > > ? lock_release+0x145/0x2f0 > > > > ? lock_is_held_type+0xe3/0x140 > > > > ? 
syscall_enter_from_user_mode+0x20/0x70 > > > > __x64_sys_madvise+0x56/0x70 > > > > do_syscall_64+0x3a/0x80 > > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > > > This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift() > > > was triggered. I think that BUG_ON is too harsh here because address == > > > -EFAULT means that there's no mapping for the address. The subsequent > > > code considers "tk->size_shift == 0" as "no mapping" cases, so > > > dev_pagemap_mapping_shift() can return 0 in such a case? > > > > > > Could the following diff work for the issue? > > > > This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl > > suite. So that diff works to avoid the BUG_ON, but it does not work to handle the error case. I think the problem comes from: vma->vm_file->f_mapping != folio->mapping ...where page_folio(page)->mapping is likely not setup correctly for DAX pages. This goes back to the broken nature of DAX page reference counting which I am fixing now, but this folio association also needs to be fixed up.
Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case
Dan Williams wrote: > HORIGUCHI NAOYA(堀口 直也) wrote: > > On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote: > > > Shiyang Ruan wrote: > > > > This new function is a variant of mf_generic_kill_procs that accepts a > > > > file, offset pair instead of a struct to support multiple files sharing > > > > a DAX mapping. It is intended to be called by the file systems as part > > > > of the memory_failure handler after the file system performed a reverse > > > > mapping from the storage address to the file and file offset. > > > > > > > > Signed-off-by: Shiyang Ruan > > > > Reviewed-by: Dan Williams > > > > Reviewed-by: Christoph Hellwig > > > > Reviewed-by: Darrick J. Wong > > > > Reviewed-by: Miaohe Lin > > > > --- > > > > include/linux/mm.h | 2 + > > > > mm/memory-failure.c | 96 - > > > > 2 files changed, 88 insertions(+), 10 deletions(-) > > > > > > Unfortunately my test suite was only running the "non-destructive" set > > > of 'ndctl' tests which skipped some of the complex memory-failure cases. > > > Upon fixing that, bisect flags this commit as the source of the following > > > crash regression: > > > > Thank you for testing/reporting. > > > > > > > > kernel BUG at mm/memory-failure.c:310! > > > invalid opcode: [#1] PREEMPT SMP PTI > > > CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G OE 5.19.0-rc4+ > > > #58 > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > > > RIP: 0010:add_to_kill+0x304/0x400 > > > [..] > > > Call Trace: > > > > > > collect_procs.part.0+0x2c8/0x470 > > > memory_failure+0x979/0xf30 > > > do_madvise.part.0.cold+0x9c/0xd3 > > > ? lock_is_held_type+0xe3/0x140 > > > ? find_held_lock+0x2b/0x80 > > > ? lock_release+0x145/0x2f0 > > > ? lock_is_held_type+0xe3/0x140 > > > ? 
syscall_enter_from_user_mode+0x20/0x70 > > > __x64_sys_madvise+0x56/0x70 > > > do_syscall_64+0x3a/0x80 > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift() > > was triggered. I think that BUG_ON is too harsh here because address == > > -EFAULT means that there's no mapping for the address. The subsequent > > code considers "tk->size_shift == 0" as "no mapping" cases, so > > dev_pagemap_mapping_shift() can return 0 in such a case? > > > > Could the following diff work for the issue? > > This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl > suite. > > It then fails on the "device-dax" test with this signature: > > BUG: kernel NULL pointer dereference, address: 0010 > #PF: supervisor read access in kernel mode > #PF: error_code(0x) - not-present page > PGD 800205073067 P4D 800205073067 PUD 2062b3067 PMD 0 > Oops: [#1] PREEMPT SMP PTI > CPU: 22 PID: 4535 Comm: device-dax Tainted: G OEN 6.0.0-rc2+ > #59 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > RIP: 0010:memory_failure+0x667/0xba0 > [..] > Call Trace: > > ? _printk+0x58/0x73 > do_madvise.part.0.cold+0xaf/0xc5 > > Which is: > > (gdb) li *(memory_failure+0x667) > 0x813b7f17 is in memory_failure (mm/memory-failure.c:1933). > 1928 > 1929/* > 1930 * Call driver's implementation to handle the memory failure, > otherwise > 1931 * fall back to generic handler. > 1932 */ > 1933if (pgmap->ops->memory_failure) { > 1934rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags); > > > ...I think this is just a simple matter of: > > @@ -1928,7 +1930,7 @@ static int memory_failure_dev_pagemap(unsigned long > pfn, int flags, > * Call driver's implementation to handle the memory failure, > otherwise > * fall back to generic handler. 
> */ > - if (pgmap->ops->memory_failure) { > + if (pgmap->ops && pgmap->ops->memory_failure) { > rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags); > /* > * Fall back to generic handler too if operation is not > > > ...since device-dax does not implement pagemap ops. > > I will see what else pops up and make sure that this regression always > runs going forward. Ok, that was last of the regression fallout that I could find.
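The shape of that fix -- probe the optional ops table itself before dereferencing one of its callbacks -- can be sketched in a standalone userspace model. All type and function names below are illustrative stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

#define EOPNOTSUPP 95	/* local stand-in for the errno value */

/* Illustrative mirror of a dev_pagemap-style optional ops table. */
struct pagemap_ops {
	int (*memory_failure)(unsigned long pfn, int flags);
};

struct pagemap {
	const struct pagemap_ops *ops;	/* device-dax-style users leave this NULL */
};

static int sample_memory_failure(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	return 0;	/* pretend the driver handled the poison */
}

static const struct pagemap_ops fsdax_like_ops = {
	.memory_failure = sample_memory_failure,
};

/*
 * The guard: check the ops pointer before its member, so providers
 * that register no ops at all take the fallback path instead of a
 * NULL-pointer dereference.
 */
static int handle_memory_failure(const struct pagemap *pgmap,
				 unsigned long pfn, int flags)
{
	if (pgmap->ops && pgmap->ops->memory_failure)
		return pgmap->ops->memory_failure(pfn, flags);
	return -EOPNOTSUPP;	/* caller falls back to the generic handler */
}
```

With `ops = NULL` the call degrades to the "not supported" fallback rather than crashing, which is exactly the device-dax case above.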
Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case
HORIGUCHI NAOYA(堀口 直也) wrote: > On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote: > > Shiyang Ruan wrote: > > > This new function is a variant of mf_generic_kill_procs that accepts a > > > file, offset pair instead of a struct to support multiple files sharing > > > a DAX mapping. It is intended to be called by the file systems as part > > > of the memory_failure handler after the file system performed a reverse > > > mapping from the storage address to the file and file offset. > > > > > > Signed-off-by: Shiyang Ruan > > > Reviewed-by: Dan Williams > > > Reviewed-by: Christoph Hellwig > > > Reviewed-by: Darrick J. Wong > > > Reviewed-by: Miaohe Lin > > > --- > > > include/linux/mm.h | 2 + > > > mm/memory-failure.c | 96 - > > > 2 files changed, 88 insertions(+), 10 deletions(-) > > > > Unfortunately my test suite was only running the "non-destructive" set > > of 'ndctl' tests which skipped some of the complex memory-failure cases. > > Upon fixing that, bisect flags this commit as the source of the following > > crash regression: > > Thank you for testing/reporting. > > > > > kernel BUG at mm/memory-failure.c:310! > > invalid opcode: [#1] PREEMPT SMP PTI > > CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G OE 5.19.0-rc4+ #58 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > > RIP: 0010:add_to_kill+0x304/0x400 > > [..] > > Call Trace: > > > > collect_procs.part.0+0x2c8/0x470 > > memory_failure+0x979/0xf30 > > do_madvise.part.0.cold+0x9c/0xd3 > > ? lock_is_held_type+0xe3/0x140 > > ? find_held_lock+0x2b/0x80 > > ? lock_release+0x145/0x2f0 > > ? lock_is_held_type+0xe3/0x140 > > ? syscall_enter_from_user_mode+0x20/0x70 > > __x64_sys_madvise+0x56/0x70 > > do_syscall_64+0x3a/0x80 > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > This stacktrace shows that VM_BUG_ON_VMA() in dev_pagemap_mapping_shift() > was triggered. I think that BUG_ON is too harsh here because address == > -EFAULT means that there's no mapping for the address. 
The subsequent > code considers "tk->size_shift == 0" as "no mapping" cases, so > dev_pagemap_mapping_shift() can return 0 in such a case? > > Could the following diff work for the issue? This passes the "dax-ext4.sh" and "dax-xfs.sh" tests from the ndctl suite. It then fails on the "device-dax" test with this signature: BUG: kernel NULL pointer dereference, address: 0010 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 800205073067 P4D 800205073067 PUD 2062b3067 PMD 0 Oops: [#1] PREEMPT SMP PTI CPU: 22 PID: 4535 Comm: device-dax Tainted: G OEN 6.0.0-rc2+ #59 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:memory_failure+0x667/0xba0 [..] Call Trace: ? _printk+0x58/0x73 do_madvise.part.0.cold+0xaf/0xc5 Which is: (gdb) li *(memory_failure+0x667) 0x813b7f17 is in memory_failure (mm/memory-failure.c:1933). 1928 1929 /* 1930 * Call driver's implementation to handle the memory failure, otherwise 1931 * fall back to generic handler. 1932 */ 1933 if (pgmap->ops->memory_failure) { 1934 rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags); ...I think this is just a simple matter of: @@ -1928,7 +1930,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, * Call driver's implementation to handle the memory failure, otherwise * fall back to generic handler. */ - if (pgmap->ops->memory_failure) { + if (pgmap->ops && pgmap->ops->memory_failure) { rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags); /* * Fall back to generic handler too if operation is not ...since device-dax does not implement pagemap ops. I will see what else pops up and make sure that this regression always runs going forward.
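Naoya's suggestion -- have dev_pagemap_mapping_shift() return 0 instead of hitting VM_BUG_ON_VMA() when the address lookup fails, since callers already treat size_shift == 0 as "no mapping" -- is a sentinel-return pattern. A minimal userspace sketch, with a hypothetical lookup helper standing in for the real VMA walk:

```c
#include <assert.h>

#define EFAULT		14	/* local stand-in for the errno value */
#define PAGE_SHIFT	12

/*
 * Hypothetical stand-in for the address lookup: returns the mapped
 * address, or -EFAULT when the page is not mapped in the VMA.
 */
static long lookup_address(int mapped)
{
	return mapped ? 0x1000 : -EFAULT;
}

/*
 * Instead of a BUG_ON for a failed lookup, return 0 and let the
 * caller treat a zero shift as "no mapping", matching how the
 * existing code already interprets tk->size_shift == 0.
 */
static int mapping_shift(int mapped)
{
	long addr = lookup_address(mapped);

	if (addr == -EFAULT)
		return 0;	/* no mapping: skip this task, don't crash */
	return PAGE_SHIFT;	/* e.g. a PTE-mapped page */
}
```

The point of the pattern is that an unmapped address is an expected state here, not a kernel invariant violation, so it should flow through the normal "skip" path.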
RE: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case
Shiyang Ruan wrote: > This new function is a variant of mf_generic_kill_procs that accepts a > file, offset pair instead of a struct to support multiple files sharing > a DAX mapping. It is intended to be called by the file systems as part > of the memory_failure handler after the file system performed a reverse > mapping from the storage address to the file and file offset. > > Signed-off-by: Shiyang Ruan > Reviewed-by: Dan Williams > Reviewed-by: Christoph Hellwig > Reviewed-by: Darrick J. Wong > Reviewed-by: Miaohe Lin > --- > include/linux/mm.h | 2 + > mm/memory-failure.c | 96 - > 2 files changed, 88 insertions(+), 10 deletions(-) Unfortunately my test suite was only running the "non-destructive" set of 'ndctl' tests which skipped some of the complex memory-failure cases. Upon fixing that, bisect flags this commit as the source of the following crash regression: kernel BUG at mm/memory-failure.c:310! invalid opcode: [#1] PREEMPT SMP PTI CPU: 26 PID: 1252 Comm: dax-pmd Tainted: G OE 5.19.0-rc4+ #58 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:add_to_kill+0x304/0x400 [..] Call Trace: collect_procs.part.0+0x2c8/0x470 memory_failure+0x979/0xf30 do_madvise.part.0.cold+0x9c/0xd3 ? lock_is_held_type+0xe3/0x140 ? find_held_lock+0x2b/0x80 ? lock_release+0x145/0x2f0 ? lock_is_held_type+0xe3/0x140 ? syscall_enter_from_user_mode+0x20/0x70 __x64_sys_madvise+0x56/0x70 do_syscall_64+0x3a/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This is from running: meson test -C build dax-ext4.sh ...from the ndctl repo. I will take a look, and am posting it here in case I do not find it tonight, so Ruan can take a look.
Re: [PATCH v2] arch/cacheflush: Introduce flush_all_caches()
Davidlohr Bueso wrote: > On Mon, 22 Aug 2022, Dan Williams wrote: > > >Davidlohr Bueso wrote: > >> On Sun, 21 Aug 2022, Christoph Hellwig wrote: > >> > >> >On Fri, Aug 19, 2022 at 10:10:24AM -0700, Davidlohr Bueso wrote: > >> >> index b192d917a6d0..ac4d4fd4e508 100644 > >> >> --- a/arch/x86/include/asm/cacheflush.h > >> >> +++ b/arch/x86/include/asm/cacheflush.h > >> >> @@ -10,4 +10,8 @@ > >> >> > >> >> void clflush_cache_range(void *addr, unsigned int size); > >> >> > >> >> +/* see comments in the stub version */ > >> >> +#define flush_all_caches() \ > >> >> + do { wbinvd_on_all_cpus(); } while(0) > >> > > >> >Yikes. This is just a horrible, horrible name and placement for a bad > >> >hack that should have no generic relevance. > >> > >> Why does this have no generic relevance? There's already been discussions > >> on how much wbinv is hated[0]. > >> > >> >Please fix up the naming to make it clear that this function is for a > >> >very specific nvdimm use case, and move it to a nvdimm-specific header > >> >file. > >> > >> Do you have any suggestions for a name? And, as the changelog describes, > >> this is not nvdimm specific anymore, and the whole point of all this is > >> volatile memory components for cxl, hence nvdimm namespace is bogus. > >> > >> [0] > >> https://lore.kernel.org/all/yvtc2u1j%2fqip8...@worktop.programming.kicks-ass.net/ > > > >While it is not nvdimm specific anymore, it's still specific to "memory > >devices that can bulk invalidate a physical address space". I.e. it's > >not as generic as its location in arch/x86/include/asm/cacheflush.h > >would imply. So, similar to arch_invalidate_pmem(), lets keep it in a > >device-driver-specific header file, because hch and peterz are right, we > >need to make this much more clear that it is not for general > >consumption. > > Fine, I won't argue - although I don't particularly agree, at least wrt > the naming. 
Imo my naming does _exactly_ what it should do and is much > easier to read than arch_has_flush_memregion() which is counterintuitive > when we are in fact flushing everything. Nor does it make anything > clearer about virt vs physical mappings (except that > it's no longer associated with cacheflush). But, excepting arm's cacheflush.h, > rare arches with broken-cache users get way too much credit in their namespace > usage. > > But yes there is no doubt that my version is more inviting than it should be, > which made me think of naming it to flush_all_caches_careful() so the user > is forced to at least check the function (or one would hope). So I'm not married to arch_has_flush_memregion() or even including the physical address range to flush, the only aspect of the prototype I want to see incorporated is something about the target / motivation for the flush. "flush_all_caches_careful()" says nothing about what the API is being "careful" about. It reminds me of Linus' comments on memcpy_mcsafe() https://lore.kernel.org/all/CAHk-=wh1SPyuGkTkQESsacwKTpjWd=_-KwoCK5o=suc3ymd...@mail.gmail.com/ "Naming - like comments - shouldn't be about what some implementation is, but about the concept." So "memregion" was meant to represent a memory device backed physical address range, but that association may only be in my own head. How about something even more explicit like: "flush_after_memdev_invalidate()" where someone would feel icky using it for anything other than what we have been talking about in this thread. > Anyway, I'll send a new version based on the below - I particularly agree > with the hypervisor bits. Ok, just one more lap around the bikeshed track, but I think we're converging.
Re: [PATCH v2] arch/cacheflush: Introduce flush_all_caches()
Davidlohr Bueso wrote: > On Sun, 21 Aug 2022, Christoph Hellwig wrote: > > >On Fri, Aug 19, 2022 at 10:10:24AM -0700, Davidlohr Bueso wrote: > >> index b192d917a6d0..ac4d4fd4e508 100644 > >> --- a/arch/x86/include/asm/cacheflush.h > >> +++ b/arch/x86/include/asm/cacheflush.h > >> @@ -10,4 +10,8 @@ > >> > >> void clflush_cache_range(void *addr, unsigned int size); > >> > >> +/* see comments in the stub version */ > >> +#define flush_all_caches() \ > >> + do { wbinvd_on_all_cpus(); } while(0) > > > >Yikes. This is just a horrible, horrible name and placement for a bad > >hack that should have no generic relevance. > > Why does this have no generic relevance? There's already been discussions > on how much wbinv is hated[0]. > > >Please fix up the naming to make it clear that this function is for a > >very specific nvdimm use case, and move it to a nvdimm-specific header > >file. > > Do you have any suggestions for a name? And, as the changelog describes, > this is not nvdimm specific anymore, and the whole point of all this is > volatile memory components for cxl, hence nvdimm namespace is bogus. > > [0] > https://lore.kernel.org/all/yvtc2u1j%2fqip8...@worktop.programming.kicks-ass.net/ While it is not nvdimm specific anymore, it's still specific to "memory devices that can bulk invalidate a physical address space". I.e. it's not as generic as its location in arch/x86/include/asm/cacheflush.h would imply. So, similar to arch_invalidate_pmem(), lets keep it in a device-driver-specific header file, because hch and peterz are right, we need to make this much more clear that it is not for general consumption. There is already include/linux/memregion.h for identifying memory regions with a common id across ACPI NFIT, DAX "soft reserved" regions, and CXL. 
So how about something along the same lines as arch_invalidate_pmem(): diff --git a/include/linux/memregion.h b/include/linux/memregion.h index c04c4fd2e209..0310135d7a42 100644 --- a/include/linux/memregion.h +++ b/include/linux/memregion.h @@ -21,3 +21,23 @@ static inline void memregion_free(int id) } #endif #endif /* _MEMREGION_H_ */ + +/* + * Device memory technologies like NVDIMM and CXL have events like + * secure erase and dynamic region provision that can invalidate an + * entire physical memory address range at once. Limit that + * functionality to architectures that have an efficient way to + * writeback and invalidate potentially terabytes of memory at once. + */ +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE +void arch_flush_memregion(phys_addr_t phys, resource_size_t size); +bool arch_has_flush_memregion(void); +#else +static inline bool arch_has_flush_memregion(void) +{ + return false; +} +static inline void arch_flush_memregion(phys_addr_t phys, resource_size_t size) +{ +} +#endif ...where arch_has_flush_memregion() on x86 is: bool arch_has_flush_memregion(void) { return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); } ...to make it clear that this API is unavailable in virtual environments.
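The #else branch above is the usual static-inline stub pattern, which this standalone userspace sketch mirrors. A plain ARCH_HAS_MEMREGION_INVALIDATE define stands in for the Kconfig symbol, and the typedefs are local stand-ins for kernel types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;		/* local stand-ins for kernel types */
typedef uint64_t resource_size_t;

/*
 * ARCH_HAS_MEMREGION_INVALIDATE stands in for the proposed Kconfig
 * symbol; leave it undefined and the stubs compile the capability
 * away to "not supported" / no-op.
 */
#ifdef ARCH_HAS_MEMREGION_INVALIDATE
bool arch_has_flush_memregion(void);
void arch_flush_memregion(phys_addr_t phys, resource_size_t size);
#else
static inline bool arch_has_flush_memregion(void)
{
	return false;
}
static inline void arch_flush_memregion(phys_addr_t phys, resource_size_t size)
{
	(void)phys;
	(void)size;
}
#endif

/* Caller pattern: probe the capability first, then flush. */
static bool flush_if_possible(phys_addr_t phys, resource_size_t size)
{
	if (!arch_has_flush_memregion())
		return false;
	arch_flush_memregion(phys, size);
	return true;
}
```

The stub pair lets callers be written unconditionally while the expensive wbinvd-backed implementation only exists on architectures (and bare-metal environments) that opt in.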
Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()
Davidlohr Bueso wrote: > On Tue, 16 Aug 2022, Dan Williams wrote: > > >On Tue, Aug 16, 2022 at 10:30 AM Davidlohr Bueso wrote: > >> > >> On Tue, 16 Aug 2022, Dan Williams wrote: > >> > >> >Peter Zijlstra wrote: > >> >> On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote: > >> >> > diff --git a/arch/x86/include/asm/cacheflush.h > >> >> > b/arch/x86/include/asm/cacheflush.h > >> >> > index b192d917a6d0..ce2ec9556093 100644 > >> >> > --- a/arch/x86/include/asm/cacheflush.h > >> >> > +++ b/arch/x86/include/asm/cacheflush.h > >> >> > @@ -10,4 +10,7 @@ > >> >> > > >> >> > void clflush_cache_range(void *addr, unsigned int size); > >> >> > > >> >> > +#define flush_all_caches() \ > >> >> > + do { wbinvd_on_all_cpus(); } while(0) > >> >> > + > >> >> > >> >> This is horrific... we've done our utmost best to remove all WBINVD > >> >> usage and here you're adding it back in the most horrible form possible > >> >> ?!? > >> >> > >> >> Please don't do this, do *NOT* use WBINVD. > >> > > >> >Unfortunately there are a few good options here, and the changelog did > >> >not make clear that this is continuing legacy [1], not adding new wbinvd > >> >usage. > >> > >> While I was hoping that it was obvious from the intel.c changes that this > >> was not a new wbinvd, I can certainly improve the changelog with the below. > > > >I also think this cache_flush_region() API wants a prominent comment > >clarifying the limited applicability of this API. I.e. that it is not > >for general purpose usage, not for VMs, and only for select bare metal > >scenarios that instantaneously invalidate wide swaths of memory. > >Otherwise, I can now see how this looks like a potentially scary > >expansion of the usage of wbinvd. > > Sure. > > Also, in the future we might be able to bypass this hammer in the presence > of persistent cpu caches. 
What would have helped is if the secure-erase and unlock definition in the specification mandated that the device emit cache invalidations for everything it has mapped when it is erased. However, that has some holes, and it also makes me think there is a gap in the current region provisioning code. If I have device-A mapped at physical-address-X and then tear that down and instantiate device-B at that same physical address there needs to be CPU cache invalidation between those 2 events.
Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()
On Tue, Aug 16, 2022 at 10:30 AM Davidlohr Bueso wrote: > > On Tue, 16 Aug 2022, Dan Williams wrote: > > >Peter Zijlstra wrote: > >> On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote: > >> > diff --git a/arch/x86/include/asm/cacheflush.h > >> > b/arch/x86/include/asm/cacheflush.h > >> > index b192d917a6d0..ce2ec9556093 100644 > >> > --- a/arch/x86/include/asm/cacheflush.h > >> > +++ b/arch/x86/include/asm/cacheflush.h > >> > @@ -10,4 +10,7 @@ > >> > > >> > void clflush_cache_range(void *addr, unsigned int size); > >> > > >> > +#define flush_all_caches() \ > >> > + do { wbinvd_on_all_cpus(); } while(0) > >> > + > >> > >> This is horrific... we've done our utmost best to remove all WBINVD > >> usage and here you're adding it back in the most horrible form possible > >> ?!? > >> > >> Please don't do this, do *NOT* use WBINVD. > > > >Unfortunately there are a few good options here, and the changelog did > >not make clear that this is continuing legacy [1], not adding new wbinvd > >usage. > > While I was hoping that it was obvious from the intel.c changes that this > was not a new wbinvd, I can certainly improve the changelog with the below. I also think this cache_flush_region() API wants a prominent comment clarifying the limited applicability of this API. I.e. that it is not for general purpose usage, not for VMs, and only for select bare metal scenarios that instantaneously invalidate wide swaths of memory. Otherwise, I can now see how this looks like a potentially scary expansion of the usage of wbinvd.
Re: [PATCH] arch/cacheflush: Introduce flush_all_caches()
Peter Zijlstra wrote: > On Mon, Aug 15, 2022 at 09:07:06AM -0700, Davidlohr Bueso wrote: > > diff --git a/arch/x86/include/asm/cacheflush.h > > b/arch/x86/include/asm/cacheflush.h > > index b192d917a6d0..ce2ec9556093 100644 > > --- a/arch/x86/include/asm/cacheflush.h > > +++ b/arch/x86/include/asm/cacheflush.h > > @@ -10,4 +10,7 @@ > > > > void clflush_cache_range(void *addr, unsigned int size); > > > > +#define flush_all_caches() \ > > + do { wbinvd_on_all_cpus(); } while(0) > > + > > This is horrific... we've done our utmost best to remove all WBINVD > usage and here you're adding it back in the most horrible form possible > ?!? > > Please don't do this, do *NOT* use WBINVD. Unfortunately there are few good options here, and the changelog did not make clear that this is continuing legacy [1], not adding new wbinvd usage. The functionality this is enabling is to be able to instantaneously secure erase potentially terabytes of memory at once and the kernel needs to be sure that none of the data from before the secure erase is still present in the cache. It is also used when unlocking a memory device where speculative reads and firmware accesses could have cached poison from before the device was unlocked. This capability is typically only used once per-boot (for unlock), or once per bare metal provisioning event (secure erase), like when handing off the system to another tenant. That small scope plus the fact that none of this is available to a VM limits the potential damage. So, similar to the mitigation we did in [2] that did not kill off wbinvd completely, this is limited to specific scenarios and should be disabled in any scenario where wbinvd is painful / forbidden. [1]: 4c6926a23b76 ("acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs") [2]: e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
Re: [PATCH v7] x86/mce: retrieve poison range from hardware
Jane Chu wrote: > On 8/3/2022 1:53 AM, Ingo Molnar wrote: > > > > * Jane Chu wrote: > > > >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine > > > > s/Commit/commit > > Maintainers, > Would you prefer a v8, or take care the comment upon accepting the patch? > > > > >> poison granularity") that changed nfit_handle_mce() callback to report > >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been > >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting > >> 2 back-to-back poisons and the driver ends up logging 8 badblocks, > >> because 0x1000 bytes is 8 512-byte. > >> > >> Dan Williams noticed that apei_mce_report_mem_error() hardcode > >> the LSB field to PAGE_SHIFT instead of consulting the input > >> struct cper_sec_mem_err record. So change to rely on hardware whenever > >> support is available. > >> > >> Link: > >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com > >> > >> Reviewed-by: Dan Williams > >> Reviewed-by: Ingo Molnar > >> Signed-off-by: Jane Chu > >> --- > >> arch/x86/kernel/cpu/mce/apei.c | 13 - > >> 1 file changed, 12 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/x86/kernel/cpu/mce/apei.c > >> b/arch/x86/kernel/cpu/mce/apei.c > >> index 717192915f28..8ed341714686 100644 > >> --- a/arch/x86/kernel/cpu/mce/apei.c > >> +++ b/arch/x86/kernel/cpu/mce/apei.c > >> @@ -29,15 +29,26 @@ > >> void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err > >> *mem_err) > >> { > >>struct mce m; > >> + int lsb; > >> > >>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) > >>return; > >> > >> + /* > >> + * Even if the ->validation_bits are set for address mask, > >> + * to be extra safe, check and reject an error radius '0', > >> + * and fall back to the default page size. 
> >> + */ > >> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) > >> + lsb = find_first_bit((void *)&mem_err->physical_addr_mask, > >> PAGE_SHIFT); > >> + else > >> + lsb = PAGE_SHIFT; > >> + > >>mce_setup(&m); > >>m.bank = -1; > >>/* Fake a memory read error with unknown channel */ > >>m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | > >> MCI_STATUS_MISCV | 0x9f; > >> - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT; > >> + m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb; > > > > LGTM. > > > > I suppose this wants to go upstream via the tree the bug came from (NVDIMM > > tree? ACPI tree?), or should we pick it up into the x86 tree? > > No idea. Maintainers? There's no real NVDIMM dependency here, just a general cleanup of how APEI error granularities are managed. So I think it is appropriate for this to go through the x86 tree via the typical path for mce related topics.
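For reference, the m.misc handshake between apei_mce_report_mem_error() (producer) and nfit_handle_mce() (consumer) can be modeled in userspace. The field macros below mirror the ones in arch/x86/include/asm/mce.h: the low 6 bits carry the least significant valid address bit, bits 8:6 the address mode:

```c
#include <assert.h>
#include <stdint.h>

/* Field helpers mirroring arch/x86/include/asm/mce.h */
#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)	/* least significant valid address bit */
#define MCI_MISC_ADDR_MODE(m)	(((m) >> 6) & 7)
#define MCI_MISC_ADDR_PHYS	2		/* physical address mode */
#define PAGE_SHIFT		12

/* What apei_mce_report_mem_error() packs into m.misc for a given lsb. */
static uint64_t encode_misc(unsigned int lsb)
{
	return ((uint64_t)MCI_MISC_ADDR_PHYS << 6) | lsb;
}

/* The poison granularity the nfit driver derives on the consuming side. */
static uint64_t poison_granularity(uint64_t misc)
{
	return 1ULL << MCI_MISC_ADDR_LSB(misc);
}
```

This makes the bug report concrete: a hardcoded PAGE_SHIFT lsb decodes to a 0x1000-byte radius (eight 512-byte badblocks per poison), while an lsb of 6 from physical_addr_mask would decode to a single cacheline.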
RE: [PATCH v5 24/32] tools/testing/nvdimm: Convert to printbuf
Matthew Wilcox (Oracle) wrote: > From: Kent Overstreet > > This converts from seq_buf to printbuf. Here we're using printbuf with > an external buffer, meaning it's a direct conversion. > > Signed-off-by: Kent Overstreet > Cc: Dan Williams > Cc: Dave Hansen > Cc: nvd...@lists.linux.dev My Acked-by still applies: https://lore.kernel.org/all/62b61165348f4_a7a2f294d0@dwillia2-xfh.notmuch/ ...and Shivaprasad's Tested-by should still apply: https://lore.kernel.org/all/b299ebe2-88e5-c2bd-bad0-bef62d4ac...@linux.ibm.com/
Re: [PATCH v5] x86/mce: retrieve poison range from hardware
Jane Chu wrote: > On 8/1/2022 9:44 AM, Dan Williams wrote: > > Jane Chu wrote: > >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine > >> poison granularity") that changed nfit_handle_mce() callback to report > >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been > >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting > >> 2 back-to-back poisons and the driver ends up logging 8 badblocks, > >> because 0x1000 bytes is 8 512-byte. > >> > >> Dan Williams noticed that apei_mce_report_mem_error() hardcode > >> the LSB field to PAGE_SHIFT instead of consulting the input > >> struct cper_sec_mem_err record. So change to rely on hardware whenever > >> support is available. > >> > >> Link: > >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com > >> > >> Reviewed-by: Dan Williams > >> Signed-off-by: Jane Chu > >> --- > >> arch/x86/kernel/cpu/mce/apei.c | 14 +- > >> 1 file changed, 13 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/x86/kernel/cpu/mce/apei.c > >> b/arch/x86/kernel/cpu/mce/apei.c > >> index 717192915f28..2c7ea0ba9dd7 100644 > >> --- a/arch/x86/kernel/cpu/mce/apei.c > >> +++ b/arch/x86/kernel/cpu/mce/apei.c > >> @@ -29,15 +29,27 @@ > >> void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err > >> *mem_err) > >> { > >>struct mce m; > >> + int lsb = PAGE_SHIFT; > >> > >>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) > >>return; > >> > >> + /* > >> + * Even if the ->validation_bits are set for address mask, > >> + * to be extra safe, check and reject an error radius '0', > >> + * and fallback to the default page size. > >> + */ > >> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) { > >> + lsb = __ffs64(mem_err->physical_addr_mask); > >> + if (lsb == 1) > > > > This was the reason I recommended hweight64 and min_not_zero() as > > hweight64 does not have the undefined behavior. 
However, an even better > option is to just do: > > find_first_bit(&mem_err->physical_addr_mask, PAGE_SHIFT) > > ...as that trims the result to the PAGE_SHIFT max and handles the zero > case. > Thanks Dan! However it looks like find_first_bit() could call into > __ffs(x) which has the same limitation as __ffs64(x), as Tony pointed out. Not quite, no. __ffs() behavior is *undefined* if the input is zero. find_first_bit() is *defined* and returns @size if the input is zero. Which is the behavior this wants to default to PAGE_SHIFT in the absence of any smaller granularity information.
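The distinction can be modeled in userspace. The helper below is a stand-in for the kernel's find_first_bit() over a single 64-bit word, written to show the defined return-@size-on-zero behavior (and the implicit clamp to PAGE_SHIFT) that makes it preferable to __ffs64() here:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Userspace stand-in for the kernel's find_first_bit() on one 64-bit
 * word: scans bits [0, size) and, unlike __ffs()/__ffs64() whose
 * result is undefined for a zero input, has a defined result for an
 * all-zero mask -- it returns @size.
 */
static unsigned int find_first_bit64(uint64_t mask, unsigned int size)
{
	for (unsigned int bit = 0; bit < size; bit++)
		if (mask & (1ULL << bit))
			return bit;
	return size;	/* no bit set below @size */
}
```

A zero physical_addr_mask (or one whose lowest set bit is above bit 11) thus falls out as PAGE_SHIFT automatically, with no special-case branch needed in the caller.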
RE: [PATCH v5] x86/mce: retrieve poison range from hardware
Jane Chu wrote: > With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine > poison granularity") that changed nfit_handle_mce() callback to report > badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been > discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting > 2 back-to-back poisons and the driver ends up logging 8 badblocks, > because 0x1000 bytes is 8 512-byte. > > Dan Williams noticed that apei_mce_report_mem_error() hardcode > the LSB field to PAGE_SHIFT instead of consulting the input > struct cper_sec_mem_err record. So change to rely on hardware whenever > support is available. > > Link: > https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com > > Reviewed-by: Dan Williams > Signed-off-by: Jane Chu > --- > arch/x86/kernel/cpu/mce/apei.c | 14 +- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c > index 717192915f28..2c7ea0ba9dd7 100644 > --- a/arch/x86/kernel/cpu/mce/apei.c > +++ b/arch/x86/kernel/cpu/mce/apei.c > @@ -29,15 +29,27 @@ > void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err > *mem_err) > { > struct mce m; > + int lsb = PAGE_SHIFT; > > if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) > return; > > + /* > + * Even if the ->validation_bits are set for address mask, > + * to be extra safe, check and reject an error radius '0', > + * and fallback to the default page size. > + */ > + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) { > + lsb = __ffs64(mem_err->physical_addr_mask); > + if (lsb == 1) This was the reason I recommended hweight64 and min_not_zero() as hweight64 does not have the undefined behavior. However, an even better option is to just do: find_first_bit(&mem_err->physical_addr_mask, PAGE_SHIFT) ...as that trims the result to the PAGE_SHIFT max and handles the zero case.