Hi Dan,

Thanks for the detailed feedback. I have put together a pseudo-code sketch incorporating all your points. Could you confirm I have understood the direction correctly, and let me know if I'm missing anything?

On 3/11/2026 7:28 PM, Dan Williams wrote:
Smita Koralahalli wrote:
The current probe time ownership check for Soft Reserved memory based
solely on CXL window intersection is insufficient. dax_hmem probing is not
always guaranteed to run after CXL enumeration and region assembly, which
can lead to incorrect ownership decisions before the CXL stack has
finished publishing windows and assembling committed regions.

Introduce deferred ownership handling for Soft Reserved ranges that
intersect CXL windows. When such a range is encountered during dax_hmem
probe, schedule deferred work and wait for the CXL stack to complete
enumeration and region assembly before deciding ownership.

Evaluate ownership of Soft Reserved ranges based on CXL region
containment.

    - If all Soft Reserved ranges are fully contained within committed CXL
      regions, DROP handling Soft Reserved ranges from dax_hmem and allow
      dax_cxl to bind.

    - If any Soft Reserved range is not fully claimed by committed CXL
      region, REGISTER the Soft Reserved ranges with dax_hmem.

Use dax_cxl_mode to coordinate ownership decisions for Soft Reserved
ranges. Once ownership resolution is complete, flush the deferred work
from dax_cxl before allowing dax_cxl to bind.

This enforces strict ownership: either CXL fully claims the Soft
Reserved ranges or it relinquishes them entirely.

We have had multiple suggestions during the course of developing this
state machine, but I cannot see, reading this changelog or the
implementation, that the full / final state machine is laid out with all
the old ideas cleaned out of the implementation.

For example, I think this has my "untested!" suggestion from:

http://lore.kernel.org/[email protected]

...but it does not have the explanation of why it turned out to be
suitable and fits the end goal state machine.

It also has the original definition of "enum dax_cxl_mode". However,
with the recent simplification proposal to stop doing total CXL unwind I
think it allows for a more straightforward state machine. For example,
the "drop" state is now automatic simply by losing the race with
dax_hmem, right?

I think we are close, just some final complexity shaving.

So, with the decision to stop tearing down CXL, this state machine only
has 3 requirements.

1/ CXL enumeration needs to start before dax_hmem invokes
    wait_for_device_probe().

2/ dax_cxl driver registration needs to be postponed until after
    dax_hmem has dispositioned all its regions.

3/ No probe path can flush the work because of the wait_for_device_probe().

Requirement 1/ is met by patch1. Requirement 2/ is partially met: there
is a proposal here around flushing the work from a separate workqueue
invocation, but I think you want the dependency directly on the dax_hmem
module (if enabled). Requirement 3/ is not achieved.

For 3/ I think we can borrow what cxl_mem_probe() does and do:

if (work_pending(&dax_hmem_work))
        return -EBUSY;

...if for some reason someone really wants to rebind the dax_hmem driver
they need to flush the queue, and that can be achieved by a flush_work()
in the module_exit() path.

This does mean that patch 7 in this series disappears because bus.c has
no role to play in this mess. It is just dax_hmem and dax_cxl getting
their ordering straight.

Some notes on the implications follow:

Co-developed-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Smita Koralahalli <[email protected]>
---
  drivers/dax/bus.c       |  3 ++
  drivers/dax/bus.h       | 19 ++++++++++
  drivers/dax/cxl.c       |  1 +
  drivers/dax/hmem/hmem.c | 78 +++++++++++++++++++++++++++++++++++++++--
  4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 92b88952ede1..81985bcc70f9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -25,6 +25,9 @@ DECLARE_RWSEM(dax_region_rwsem);
   */
  DECLARE_RWSEM(dax_dev_rwsem);
+enum dax_cxl_mode dax_cxl_mode = DAX_CXL_MODE_DEFER;
+EXPORT_SYMBOL_NS_GPL(dax_cxl_mode, "CXL");
+
  static DEFINE_MUTEX(dax_hmem_lock);
  static dax_hmem_deferred_fn hmem_deferred_fn;
  static void *dax_hmem_data;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index b58a88e8089c..82616ff52fd1 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -41,6 +41,25 @@ struct dax_device_driver {
        void (*remove)(struct dev_dax *dev);
  };
+/*
+ * enum dax_cxl_mode - State machine to determine ownership for CXL
+ * tagged Soft Reserved memory ranges.
+ * @DAX_CXL_MODE_DEFER: Ownership resolution pending. Set while waiting
+ * for CXL enumeration and region assembly to complete.
+ * @DAX_CXL_MODE_REGISTER: CXL regions do not fully cover Soft Reserved
+ * ranges. Fall back to registering those ranges via dax_hmem.
+ * @DAX_CXL_MODE_DROP: All Soft Reserved ranges intersecting CXL windows
+ * are fully contained within committed CXL regions. Drop HMEM handling
+ * and allow dax_cxl to bind.

With the above, dax_cxl_mode disappears.

diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index a2136adfa186..3ab39b77843d 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -44,6 +44,7 @@ static struct cxl_driver cxl_dax_region_driver = {
static void cxl_dax_region_driver_register(struct work_struct *work)
  {
+       dax_hmem_flush_work();

Looks ok, as long as that symbol is from dax_hmem.ko which gets you the
load dependency (requirement 2/).

Might also want to make sure that all this deferral mess disappears when
CONFIG_DEV_DAX_HMEM=n.

        cxl_driver_register(&cxl_dax_region_driver);
  }
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1e3424358490..85854e25254b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -3,6 +3,7 @@
  #include <linux/memregion.h>
  #include <linux/module.h>
  #include <linux/dax.h>
+#include <cxl/cxl.h>
  #include "../bus.h"
static bool region_idle;
@@ -69,8 +70,18 @@ static int hmem_register_device(struct device *host, int target_nid,
        if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
            region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
                              IORES_DESC_CXL) != REGION_DISJOINT) {
-               dev_dbg(host, "deferring range to CXL: %pr\n", res);
-               return 0;
+               switch (dax_cxl_mode) {
+               case DAX_CXL_MODE_DEFER:
+                       dev_dbg(host, "deferring range to CXL: %pr\n", res);
+                       dax_hmem_queue_work();

This case is just a flag that determines if the work queue has completed
its one run. So I expect this to be something like:

if (!dax_hmem_initial_probe) {
        queue_work();
        return;
}

Otherwise just go ahead and register because dax_cxl by this time has
had a chance to have a say and the system falls back to "first come /
first served" mode. In other words the simplification of not cleaning up
goes both ways. dax_hmem naturally fails if dax_cxl already claimed the
address range.

+                       return 0;
+               case DAX_CXL_MODE_REGISTER:
+                       dev_dbg(host, "registering CXL range: %pr\n", res);
+                       break;
+               case DAX_CXL_MODE_DROP:
+                       dev_dbg(host, "dropping CXL range: %pr\n", res);
+                       return 0;
+               }
        }
rc = region_intersects_soft_reserve(res->start, resource_size(res));
@@ -123,8 +134,70 @@ static int hmem_register_device(struct device *host, int target_nid,
        return rc;
  }
+static int hmem_register_cxl_device(struct device *host, int target_nid,
+                                   const struct resource *res)
+{
+       if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+                             IORES_DESC_CXL) != REGION_DISJOINT)
+               return hmem_register_device(host, target_nid, res);
+
+       return 0;
+}
+
+static int soft_reserve_has_cxl_match(struct device *host, int target_nid,
+                                     const struct resource *res)
+{
+       if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+                             IORES_DESC_CXL) != REGION_DISJOINT) {
+               if (!cxl_region_contains_soft_reserve((struct resource *)res))
+                       return 1;
+       }
+
+       return 0;
+}
+
+static void process_defer_work(void *data)
+{
+       struct platform_device *pdev = data;
+       int rc;
+
+       /* relies on cxl_acpi and cxl_pci having had a chance to load */
+       wait_for_device_probe();
+
+       rc = walk_hmem_resources(&pdev->dev, soft_reserve_has_cxl_match);
+
+       if (!rc) {
+               dax_cxl_mode = DAX_CXL_MODE_DROP;
+               dev_dbg(&pdev->dev, "All Soft Reserved ranges claimed by CXL\n");
+       } else {
+               dax_cxl_mode = DAX_CXL_MODE_REGISTER;
+               dev_warn(&pdev->dev,
+                        "Soft Reserved not fully contained in CXL; using HMEM\n");
+       }
+
+       walk_hmem_resources(&pdev->dev, hmem_register_cxl_device);

I do not think we need to do 2 passes. Just do one
hmem_register_cxl_device() pass that skips a range when
cxl_region_contains_resource() has it covered, otherwise register an
hmem device.

+}
+
+static void kill_defer_work(void *data)
+{
+       struct platform_device *pdev = data;
+
+       dax_hmem_flush_work();
+       dax_hmem_unregister_work(process_defer_work, pdev);
+}
+
  static int dax_hmem_platform_probe(struct platform_device *pdev)
  {
+       int rc;

This wants a work_pending() return -EBUSY per above.

+       rc = dax_hmem_register_work(process_defer_work, pdev);
+       if (rc)
+               return rc;

The work does not need to be registered every time. Remember this is
only a one-shot problem at first kernel boot, not every time this
platform device is probed. After the workqueue has run at least once it
never needs to be invoked again if dax_hmem is reloaded.

A flag for "dax_hmem flushed initial device probe at least once" needs
to live in drivers/dax/hmem/device.c and be cleared by
process_defer_work().

+
+       rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, pdev);
+       if (rc)
+               return rc;
+
        return walk_hmem_resources(&pdev->dev, hmem_register_device);
  }

The hunk that is missing is that dax_hmem_exit() should flush the work,
and process_defer_work() should give up if the device has been unbound
before it runs. Hopefully that last suggestion does not make lockdep
unhappy about running process_defer_work under the hmem_platform
device_lock(). I *think* it should be ok and solves the TOCTOU race in
hmem_register_device() around whether we are in the pre or post initial
probe world.

/* hmem.c */

+static struct platform_device *hmem_pdev;
+static void process_defer_work(struct work_struct *work);
+static DECLARE_WORK(dax_hmem_work, process_defer_work);

+void dax_hmem_flush_work(void)
+{
+       flush_work(&dax_hmem_work);
+}
+EXPORT_SYMBOL_GPL(dax_hmem_flush_work);

static int hmem_register_device(..)
{
        if (IS_ENABLED(CONFIG_DEV_DAX_CXL) && .. {
+               if (!dax_hmem_initial_probe_done) {
+                       queue_work(system_long_wq, &dax_hmem_work);
+                       return 0;
+               }
...
}

+static int hmem_register_cxl_device(..)
+{
+       if (region_intersects(..IORES_CXL..) == REGION_DISJOINT)
+               return 0;
+
+       if (cxl_region_contains_soft_reserve(res))
+               return 0;
+
+       return hmem_register_device(host, target_nid, res);
+}

+static void process_defer_work(struct work_struct *work)
+{
+       wait_for_device_probe();
+       /* Flag lives in device.c */
+       dax_hmem_initial_probe_done = true;
+       walk_hmem_resources(&hmem_pdev->dev, hmem_register_cxl_device);
+}

static int dax_hmem_platform_probe(struct platform_device *pdev)
{
+       if (work_pending(&dax_hmem_work))
+               return -EBUSY;

+       hmem_pdev = pdev;
        return walk_hmem_resources(&pdev->dev, hmem_register_device);
}

static void __exit dax_hmem_exit(void)
{
+       flush_work(&dax_hmem_work);
..

/* cxl.c */

static void cxl_dax_region_driver_register(struct work_struct *work)
{
+       dax_hmem_flush_work();
        cxl_driver_register(&cxl_dax_region_driver);
}

/* bus.h */

+#if IS_ENABLED(CONFIG_DEV_DAX_HMEM)
+void dax_hmem_flush_work(void);
+#else
+static inline void dax_hmem_flush_work(void) { }
+#endif

A few things I want to confirm:

1. Patch 7 (bus.c helpers) drops entirely — no register/unregister API, no mutex, no typedef. Everything lives in hmem.c.

2. enum dax_cxl_mode drops — replaced by the single bool dax_hmem_initial_probe_done in device.c.

3. dax_hmem_flush_work() exported from dax_hmem.ko so cxl.c gets the module dependency for requirement 2.

Thanks
Smita
