On 21/08/2025 07:14, Alison Schofield wrote:
> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Dan and Smita,
>>
>>
>> On 24/07/2025 00:13, dan.j.willi...@intel.com wrote:
>>> dan.j.williams@ wrote:
>>> [..]
>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>> and likely regresses the non-CXL case.
>>>>
>>>> -- 8< --
>>>>   From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>> From: Dan Williams <dan.j.willi...@intel.com>
>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>
>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>> safely fails in devdax only / fallback mode, but something to
>>> investigate when respinning on top of this.
>>>
>>
>> Thank you for your RFC; I find your proposal remarkably compelling, as it 
>> adeptly addresses the issues I am currently facing.
>>
>>
>> To begin with, I still encountered several issues with your patch 
>> (considering the patch at the RFC stage, I think it is already quite 
>> commendable):
> 
> Hi Zhijian,
> 
> Like you, I tried this RFC out. It resolved the issue of soft reserved
> resources preventing teardown and replacement of a region in place.
> 
> I looked at the issues you found, and have some questions and comments
> included below.
> 
>>
>> 1. Some resources described by SRAT are wrongly identified as System RAM 
>> (kmem), such as the following: 200000000-5bfffffff.
>>      
>>      ```
>>      200000000-5bfffffff : dax6.0
>>        200000000-5bfffffff : System RAM (kmem)
>>      5c0001128-5c00011b7 : port1
>>      5d0000000-64fffffff : CXL Window 0
>>        5d0000000-64fffffff : region0
>>          5d0000000-64fffffff : dax0.0
>>            5d0000000-64fffffff : System RAM (kmem)
>>      680000000-e7ffffff : PCI Bus 0000:00
>>
>>      [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
>>      [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
>>      [    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
>>      [    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
>>      [    0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
>>      ```
> 
> Is that range also labelled as soft reserved?
> I ask, because I'm trying to draw a parallel between our test platforms.

No, it's not a soft reserved range. This can be reproduced simply with QEMU 
using the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bfffffff` is something like [DRAM_END + 1, 
DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE], where
DRAM_END is the end of the installed DRAM in Node 3.

This range is reserved for DRAM hot-add. In my case, it is registered as one 
of the 'HMEM devices' by the call to hmem_register_resource() in the HMAT 
code (drivers/acpi/numa/hmat.c):

static void hmat_register_target_devices(struct memory_target *target)
{
        struct resource *res;

        /*
         * Do not bother creating devices if no driver is available to
         * consume them.
         */
        if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
                return;

        for (res = target->memregions.child; res; res = res->sibling) {
                int target_nid = pxm_to_node(target->memory_pxm);

                hmem_register_resource(target_nid, res);
        }
}


$ dmesg | grep -i -e soft -e hotplug -e Node
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
[    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[    0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[    0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[    0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[    0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[    0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[    0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[    0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[    0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[    0.086077] Movable zone start for each node
[    0.087054] Early memory node ranges
[    0.087890]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.089264]   node   0: [mem 0x0000000000100000-0x000000007ffdefff]
[    0.090631]   node   1: [mem 0x0000000100000000-0x000000017fffffff]
[    0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[    0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[    0.095164] Initmem setup node 2 as memoryless
[    0.096281] Initmem setup node 3 as memoryless
[    0.097397] Initmem setup node 4 as memoryless
[    0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[    0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs

=================================

Please note that this is a modified QEMU.

/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
-name guest-rdma-server -nographic -boot c \
-m size=6G,slots=2,maxmem=19922944k \
-hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
-object memory-backend-memfd,share=on,size=2G,id=m0 \
-object memory-backend-memfd,share=on,size=2G,id=m1 \
-numa node,nodeid=0,cpus=0-1,memdev=m0 \
-numa node,nodeid=1,cpus=2-3,memdev=m1 \
-smp 4,sockets=2,cores=2 \
-device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
-device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
-device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
-device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
-object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
-M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
-nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
-bios /home/lizhijian/seabios/out/bios.bin \
-object memory-backend-memfd,share=on,size=1G,id=m2 \
-object memory-backend-memfd,share=on,size=1G,id=m3 \
-numa node,memdev=m2,nodeid=2 \
-numa node,memdev=m3,nodeid=3 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=21 \
-numa dist,src=0,dst=3,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa dist,src=1,dst=2,val=21 \
-numa dist,src=1,dst=3,val=21 \
-numa dist,src=2,dst=0,val=21 \
-numa dist,src=2,dst=1,val=21 \
-numa dist,src=2,dst=2,val=10 \
-numa dist,src=2,dst=3,val=21 \
-numa dist,src=3,dst=0,val=21 \
-numa dist,src=3,dst=1,val=21 \
-numa dist,src=3,dst=2,val=21 \
-numa dist,src=3,dst=3,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M



> I see -
> 
> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
> 
> /proc/iomem - as expected
> 24080000000-5f77fffffff : CXL Window 0
>    24080000000-4407fffffff : region0
>      24080000000-4407fffffff : dax0.0
>        24080000000-4407fffffff : System RAM (kmem)
> 
> 
> I'm also seeing this message:
> [] resource: Unaddressable device  [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
> 
>>
>> 2. Triggers dev_warn and dev_err:
>>      
>>      ```
>>      [root@rdma-server ~]# journalctl -p err -p warning --dmesg
>>      ...snip...
>>      Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
>>      Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
>>      Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
>>      Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17fffffff could not reserve region
>>      Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
> 
> I see the kmem dax messages also. It seems the kmem probe is going after
> every range (except hotplug) in the SRAT, and failing.

Yes, that's true, because the current RFC removed the code that filters out 
non-soft-reserved resources. As a result, it tries to register dax/kmem for 
all of them, even though some of them have already been marked as busy in 
iomem_resource.

>> -   rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> -                          IORES_DESC_SOFT_RESERVED);
>> -   if (rc != REGION_INTERSECTS)
>> -       return 0;
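
To illustrate what the removed filter did, here is a minimal userspace sketch 
(not kernel code). It assumes the two Soft Reserved ranges reported by 
BIOS-e820 in my QEMU guest; `struct range`, ranges_overlap() and 
intersects_soft_reserved() are hypothetical names used only for this example. 
A candidate that does not touch any Soft Reserved range is simply skipped 
instead of being handed to dax/kmem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a resource range (inclusive end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Soft Reserved ranges as reported by BIOS-e820 in the QEMU guest. */
static const struct range soft_reserved[] = {
	{ 0x180000000ULL, 0x1ffffffffULL },
	{ 0x5d0000000ULL, 0x64fffffffULL },
};

/* Two inclusive ranges overlap when each one starts before the other ends. */
static bool ranges_overlap(const struct range *a, const struct range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* Return true if the candidate touches any Soft Reserved range. */
static bool intersects_soft_reserved(const struct range *candidate)
{
	size_t i;

	for (i = 0; i < sizeof(soft_reserved) / sizeof(soft_reserved[0]); i++) {
		if (ranges_overlap(candidate, &soft_reserved[i]))
			return true;
	}
	return false;
}
```

With this check, the hotplug range 0x200000000-0x5bfffffff and the Node 1 
range 0x100000000-0x17fffffff are both rejected, while 
0x5d0000000-0x64fffffff is accepted.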


This is another example on my real *CXL HOST*:
Aug 19 17:59:05  kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
Aug 19 17:59:09  kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 19 17:59:09  kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
Aug 19 17:59:09  kernel: kmem dax2.0: probe with driver kmem failed with error -16
Aug 19 17:59:09  kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
Aug 19 17:59:09  kernel: kmem dax3.0: probe with driver kmem failed with error -16
Aug 19 17:59:09  kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
Aug 19 17:59:09  kernel: kmem dax4.0: probe with driver kmem failed with error -16
Aug 19 17:59:19  kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27  kernel: block nvme1n1: No UUID available providing old NGUID
lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved
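
For reference when reading these logs: the probe failures carry plain 
negative errno values, so -16 is -EBUSY (the range is already claimed in 
iomem), -12 is -ENOMEM and -2 is -ENOENT. A small userspace sketch of that 
mapping, assuming Linux errno numbering; rc_name() is a hypothetical helper, 
not a kernel API:

```c
#include <errno.h>

/*
 * Map a negative kernel-style return code to its errno name.
 * Only the three codes seen in the logs in this thread are covered.
 */
static const char *rc_name(int rc)
{
	switch (-rc) {
	case ENOENT:
		return "ENOENT";	/* 2: no such file or directory */
	case ENOMEM:
		return "ENOMEM";	/* 12: out of memory */
	case EBUSY:
		return "EBUSY";		/* 16: device or resource busy */
	default:
		return "unknown";
	}
}
```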


I guess this issue can be resolved by re-introducing the 
soft_reserved_region_intersects(...) check.



> 
>>      ```
>>
>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, 
>> in which case only CXL Window X is visible.
> 
> Haven't tested !CXL_REGION yet.
> 
>>      
>>      On failure:
>>      
>>      ```
>>      100000000-27ffffff : System RAM
>>      5c0001128-5c00011b7 : port1
>>      5c0011128-5c00111b7 : port2
>>      5d0000000-6cffffff : CXL Window 0
>>      6d0000000-7cffffff : CXL Window 1
>>      7000000000-700000ffff : PCI Bus 0000:0c
>>        7000000000-700000ffff : 0000:0c:00.0
>>          7000001080-70000010d7 : mem1
>>      ```
>>
>>      On success:
>>      
>>      ```
>>      5d0000000-7cffffff : dax0.0
>>        5d0000000-7cffffff : System RAM (kmem)
>>          5d0000000-6cffffff : CXL Window 0
>>          6d0000000-7cffffff : CXL Window 1
>>      ```
>>
>> In terms of issues 1 and 2, this arises because hmem_register_device() 
>> attempts to register resources of all "HMEM devices," whereas we only need 
>> to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the 
>> current TODO will address this.
>>
>> ```
>> -   rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> -                          IORES_DESC_SOFT_RESERVED);
>> -   if (rc != REGION_INTERSECTS)
>> -       return 0;
>> +   /* TODO: insert "Soft Reserved" into iomem here */
>> ```
> 
> Above makes sense.

I think the add_soft_reserved() subroutine in your previous patchset [1] is 
able to cover this TODO.

> 
> I'll probably wait for an update from Smita to test again, but if you
> or Smita have anything you want me to try out on my hardware in the
> meantime, let me know.
> 

Here is my local fixup based on Dan's RFC; it resolves issues 1 and 2.


-- 8< --
  commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <lizhij...@fujitsu.com>
Date:   Fri Aug 20 11:07:15 2025 +0800

     Fix probe-order TODO
     
     Signed-off-by: Li Zhijian <lizhij...@fujitsu.com>

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
        walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
  }
  
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+                            unsigned long flags)
+{
+       struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+       int rc;
+
+       if (!res)
+               return -ENOMEM;
+
+       *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+                                    flags | IORESOURCE_MEM,
+                                    IORES_DESC_SOFT_RESERVED);
+
+       rc = insert_resource(&iomem_resource, res);
+       if (rc)
+               kfree(res);
+
+       return rc;
+}
+
  static int hmem_register_device(struct device *host, int target_nid,
                                const struct resource *res)
  {
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
        long id;
        int rc;
  
+       if (soft_reserve_res_intersects(res->start, resource_size(res),
+                     IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+               return 0;
+
        if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
            region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
                              IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
                }
        }
  
-       /* TODO: insert "Soft Reserved" into iomem here */
+       /*
+        * This is a verified Soft Reserved region that CXL is not claiming (or
+        * is being overridden). Add it to the main iomem tree so it can be
+        * properly reserved by the DAX driver.
+        */
+       rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+       if (rc) {
+               dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+                        res, rc);
+               return rc;
+       }
  
        id = memregion_alloc(GFP_KERNEL);
        if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
  int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
                      unsigned long desc);
  
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+                     unsigned long desc);
  /* Support for virtually mapped pages */
  struct page *vmalloc_to_page(const void *addr);
  unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
                             arg, func);
  }
  EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+                              size_t size, unsigned long flags,
+                              unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+                     unsigned long desc)
+{
+       int ret;
+
+       read_lock(&resource_lock);
+       ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+       read_unlock(&resource_lock);
+
+       return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
  #endif
  
  /*



[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofi...@intel.com/


> -- Alison
> 
> 
>>
>> Regarding issue 3 (which exists in the current situation), this could be 
>> because it cannot ensure that dax_hmem_probe() executes prior to 
>> cxl_acpi_probe() when CXL_REGION is disabled.
>>
>> I am pleased that you have pushed the patch to the 
>> cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its 
>> integration into the upstream during the v6.18 merge window.
>> Besides the current TODO, you also mentioned that this RFC PATCH must be 
>> further subdivided into several patches, so there remains significant work 
>> to be done.
>> If my understanding is correct, you would be personally continuing to push 
>> forward this patch, right?
>>
>>
>> Smita,
>>
>> Do you have any additional thoughts on this proposal from your side?
>>
>>
>> Thanks
>> Zhijian
>>
> snip
> 
