On 21/08/2025 07:14, Alison Schofield wrote:
> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Dan and Smita,
>>
>> On 24/07/2025 00:13, dan.j.willi...@intel.com wrote:
>>> dan.j.williams@ wrote:
>>> [..]
>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>> and likely regresses the non-CXL case.
>>>>
>>>> -- 8< --
>>>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>> From: Dan Williams <dan.j.willi...@intel.com>
>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>
>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>> safely fails in devdax only / fallback mode, but something to
>>> investigate when respinning on top of this.
>>>
>>
>> Thank you for your RFC; I find your proposal compelling, as it directly
>> addresses the issues I am currently facing.
>>
>> That said, I still encountered several issues with your patch
>> (considering the patch is at the RFC stage, I think it is already quite
>> solid):
>
> Hi Zhijian,
>
> Like you, I tried this RFC out. It resolved the issue of soft reserved
> resources preventing teardown and replacement of a region in place.
>
> I looked at the issues you found, and have some questions and comments
> included below.
>
>>
>> 1. Some resources described by SRAT are wrongly identified as System RAM
>> (kmem), such as the following: 200000000-5bfffffff.
>>
>> ```
>> 200000000-5bfffffff : dax6.0
>>   200000000-5bfffffff : System RAM (kmem)
>> 5c0001128-5c00011b7 : port1
>> 5d0000000-64fffffff : CXL Window 0
>>   5d0000000-64fffffff : region0
>>     5d0000000-64fffffff : dax0.0
>>       5d0000000-64fffffff : System RAM (kmem)
>> 680000000-e7fffffff : PCI Bus 0000:00
>>
>> [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
>> [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
>> [    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
>> [    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
>> [    0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
>> ```
>
> Is that range also labelled as soft reserved?
> I ask, because I'm trying to draw a parallel between our test platforms.
No, it's not a soft reserved range. This can simply be simulated with QEMU
using the `maxmem=192G` option (see the full QEMU command line below).

In my environment, `0x200000000-0x5bfffffff` is something like
[DRAM_END + 1, DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE], where
DRAM_END is the end of the installed DRAM in Node 3. This range is
reserved for DRAM hot-add. In my case, it will be registered into "HMEM
devices" by calling hmem_register_resource() from HMAT
(drivers/acpi/numa/hmat.c):

static void hmat_register_target_devices(struct memory_target *target)
{
        struct resource *res;

        /*
         * Do not bother creating devices if no driver is available to
         * consume them.
         */
        if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
                return;

        for (res = target->memregions.child; res; res = res->sibling) {
                int target_nid = pxm_to_node(target->memory_pxm);

                hmem_register_resource(target_nid, res);
        }
}

$ dmesg | grep -i -e soft -e hotplug -e Node
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
[    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[    0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[    0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[    0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[    0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[    0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[    0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[    0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[    0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[    0.086077] Movable zone start for each node
[    0.087054] Early memory node ranges
[    0.087890]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.089264]   node   0: [mem 0x0000000000100000-0x000000007ffdefff]
[    0.090631]   node   1: [mem 0x0000000100000000-0x000000017fffffff]
[    0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[    0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[    0.095164] Initmem setup node 2 as memoryless
[    0.096281] Initmem setup node 3 as memoryless
[    0.097397] Initmem setup node 4 as memoryless
[    0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[    0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs

=================================
Please note that this is a modified QEMU.

/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
  -name guest-rdma-server -nographic -boot c \
  -m size=6G,slots=2,maxmem=19922944k \
  -hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
  -object memory-backend-memfd,share=on,size=2G,id=m0 \
  -object memory-backend-memfd,share=on,size=2G,id=m1 \
  -numa node,nodeid=0,cpus=0-1,memdev=m0 \
  -numa node,nodeid=1,cpus=2-3,memdev=m1 \
  -smp 4,sockets=2,cores=2 \
  -device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
  -device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
  -device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
  -device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
  -object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
  -M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
  -nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
  -bios \
  /home/lizhijian/seabios/out/bios.bin \
  -object memory-backend-memfd,share=on,size=1G,id=m2 \
  -object memory-backend-memfd,share=on,size=1G,id=m3 \
  -numa node,memdev=m2,nodeid=2 \
  -numa node,memdev=m3,nodeid=3 \
  -numa dist,src=0,dst=0,val=10 \
  -numa dist,src=0,dst=1,val=21 \
  -numa dist,src=0,dst=2,val=21 \
  -numa dist,src=0,dst=3,val=21 \
  -numa dist,src=1,dst=0,val=21 \
  -numa dist,src=1,dst=1,val=10 \
  -numa dist,src=1,dst=2,val=21 \
  -numa dist,src=1,dst=3,val=21 \
  -numa dist,src=2,dst=0,val=21 \
  -numa dist,src=2,dst=1,val=21 \
  -numa dist,src=2,dst=2,val=10 \
  -numa dist,src=2,dst=3,val=21 \
  -numa dist,src=3,dst=0,val=21 \
  -numa dist,src=3,dst=1,val=21 \
  -numa dist,src=3,dst=2,val=21 \
  -numa dist,src=3,dst=3,val=10 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
  -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
  -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
  -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
  -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
  -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
  -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
  -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
  -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
  -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
  -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
  -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M

> I see -
>
> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
>
> /proc/iomem - as expected
> 24080000000-5f77fffffff : CXL Window 0
>   24080000000-4407fffffff : region0
>     24080000000-4407fffffff : dax0.0
>       24080000000-4407fffffff : System RAM (kmem)
>
> I'm also seeing this message:
> [] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff]
> conflicts with [mem 0x24080000000-0x4407fffffff]
>
>>
>> 2. Triggers dev_warn and dev_err:
>>
>> ```
>> [root@rdma-server ~]# journalctl -p err -p warning --dmesg
>> ...snip...
>> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17fffffff could not reserve region
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
>
> I see the kmem dax messages also. It seems the kmem probe is going after
> every range (except hotplug) in the SRAT, and failing.

Yes, that's true, because the current RFC removed the code that filters out
non-soft-reserved resources. As a result, it tries to register dax/kmem for
all of them, while some of them have already been marked busy in
iomem_resource.
>> -	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> -			       IORES_DESC_SOFT_RESERVED);
>> -	if (rc != REGION_INTERSECTS)
>> -		return 0;

This is another example on my real *CXL HOST*:

Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error -16
Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID

lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
  6fffb000-8fffffff : Reserved
  100000000-10000ffff : Reserved
  106ccc0000-106fffffff : Reserved

This issue can be resolved by re-introducing soft_reserved_region_intersects(...), I guess.

>
>> ```
>>
>> 3. When CXL_REGION is disabled, there is a failure to fall back to dax_hmem,
>> in which case only CXL Window X is visible.
>
> Haven't tested !CXL_REGION yet.
>
>>
>> On failure:
>>
>> ```
>> 100000000-27fffffff : System RAM
>> 5c0001128-5c00011b7 : port1
>> 5c0011128-5c00111b7 : port2
>> 5d0000000-6cfffffff : CXL Window 0
>> 6d0000000-7cfffffff : CXL Window 1
>> 7000000000-700000ffff : PCI Bus 0000:0c
>>   7000000000-700000ffff : 0000:0c:00.0
>>     7000001080-70000010d7 : mem1
>> ```
>>
>> On success:
>>
>> ```
>> 5d0000000-7cfffffff : dax0.0
>>   5d0000000-7cfffffff : System RAM (kmem)
>> 5d0000000-6cfffffff : CXL Window 0
>> 6d0000000-7cfffffff : CXL Window 1
>> ```
>>
>> In terms of issues 1 and 2, this arises because hmem_register_device()
>> attempts to register the resources of all "HMEM devices," whereas we only
>> need to register the IORES_DESC_SOFT_RESERVED resources. I believe
>> resolving the current TODO will address this.
>>
>> ```
>> -	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> -			       IORES_DESC_SOFT_RESERVED);
>> -	if (rc != REGION_INTERSECTS)
>> -		return 0;
>> +	/* TODO: insert "Soft Reserved" into iomem here */
>> ```
>
> Above makes sense.

I think the subroutine add_soft_reserved() in your previous patchset[1] is
able to cover this TODO.

> I'll probably wait for an update from Smita to test again, but if you
> or Smita have anything you want me to try out on my hardware in the
> meantime, let me know.
>

Here is my local fixup based on Dan's RFC; it can resolve issues 1 and 2.
-- 8< --
commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <lizhij...@fujitsu.com>
Date:   Fri Aug 20 11:07:15 2025 +0800

    Fix probe-order TODO

    Signed-off-by: Li Zhijian <lizhij...@fujitsu.com>

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
 	walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
 }
 
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+			     unsigned long flags)
+{
+	struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+	int rc;
+
+	if (!res)
+		return -ENOMEM;
+
+	*res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+				     flags | IORESOURCE_MEM,
+				     IORES_DESC_SOFT_RESERVED);
+
+	rc = insert_resource(&iomem_resource, res);
+	if (rc)
+		kfree(res);
+
+	return rc;
+}
+
 static int hmem_register_device(struct device *host, int target_nid,
 				const struct resource *res)
 {
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
 	long id;
 	int rc;
 
+	if (soft_reserve_res_intersects(res->start, resource_size(res),
+			IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+		return 0;
+
 	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
 	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
 			      IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
 		}
 	}
 
-	/* TODO: insert "Soft Reserved" into iomem here */
+	/*
+	 * This is a verified Soft Reserved region that CXL is not claiming (or
+	 * is being overridden). Add it to the main iomem tree so it can be
+	 * properly reserved by the DAX driver.
+	 */
+	rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+	if (rc) {
+		dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+			 res, rc);
+		return rc;
+	}
 
 	id = memregion_alloc(GFP_KERNEL);
 	if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
 int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
 		      unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+				unsigned long desc);
 
 /* Support for virtually mapped pages */
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
 				     arg, func);
 }
 EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+			       size_t size, unsigned long flags,
+			       unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+				unsigned long desc)
+{
+	int ret;
+
+	read_lock(&resource_lock);
+	ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+	read_unlock(&resource_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
 #endif

[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofi...@intel.com/

> --
> Alison
>
>
>>
>> Regarding issue 3 (which exists in the current situation), this could be
>> because it cannot be ensured that dax_hmem_probe() executes prior to
>> cxl_acpi_probe() when CXL_REGION is disabled.
>>
>> I am pleased that you have pushed the patch to the
>> cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its
>> integration upstream during the v6.18 merge window.
>> Besides the current TODO, you also mentioned that this RFC patch must be
>> further subdivided into several patches, so there remains significant
>> work to be done.
>> If my understanding is correct, you will be personally continuing to
>> push this patch forward, right?
>>
>> Smita,
>>
>> Do you have any additional thoughts on this proposal from your side?
>>
>> Thanks
>> Zhijian
>>
> snip
>