On Wed, Jun 17, 2026 at 10:52:04PM -0700, Anisa Su wrote:
> On Wed, Jun 17, 2026 at 12:10:07AM -0700, Alison Schofield wrote:
> > On Thu, Jun 04, 2026 at 10:43:10PM -0700, Alison Schofield wrote:
> > > On Sat, May 23, 2026 at 02:50:35AM -0700, Anisa Su wrote:
> > > > CXL Dynamic Capacity Device (DCD) support has continued to evolve in the
> > > > upstream kernel since Ira's v5 posting [1]. The kernel side has settled
> > > > on a uuid-driven claim model for sparse DAX devices: dax_resources carry
> > > > the tag delivered with each extent, and userspace selects which ones to
> > > > claim by writing a UUID to the dax device's sysfs 'uuid' attribute (or
> > > > "0" to claim a single untagged resource). Size on a sparse region is
> > > > determined by the claim, not requested up-front.
> > > >
> > > > This series brings cxl-cli and daxctl in line with that model and
> > > > extends cxl_test to exercise the new paths end-to-end.
> > >
> > > Hi Anisa,
> > >
> > > I just now picked this up with the kernel side and took it for a quick
> > > test drive. Based on what's been touched, first meaningful finding is
> > > all the DAX unit tests pass, and then for CXL unit tests, all but these
> > > 2 pass: cxl-security.sh and cxl-dcd.sh
> > >
> > > Please let me know if there are known problems with either of those
> > > before I explore further.
> >
> > Hi Anisa,
> >
> > Good news, DCD exposed a long hidden bug that made cxl-security.sh
> > fail. It is not an issue w DCD patches.
> >
> > Found that DCD set changes which mock memdev the test happens to
> > land on, and that's enough to uncover a latent hex/decimal bug in
> > CXL nvdimm code. We use to always land on id '1', but now this patch:
> >
> > tools/testing/cxl: Add DC Regions to mock mem data
> >
> > reorders the sorted dimm list, so the test selects a dimm with
> > serial 10 (0xa), and there's the hex/decimal mismatch.
> >
> > The renumbering is harmless in itself but it just changed the
> > serial the test exercises and tripped over the old bug.
> >
> > I'll send a separate fixup patch for the hex/dec cleanup.
> >
> > (No answer on cxl-dcd.sh yet)
> >
> > -- Alison
> >
> Thanks for looking into this! I can also look into what might be going
> on with cxl-dcd.sh if you let me know the base commit you applied the
> dcd patches onto? :)

The base commit was indeed the key to the cxl-dcd.sh failure.

I'm seeing a probe-ordering race that you may not see unless you're
using v7.1-rc1 or later. The branch linked in the kernel patchset does
not include this commit:

39aa1d4be12b ("dax/cxl, hmem: Initialize hmem early and defer dax_cxl binding")

Dan changed cxl_dax_region to PROBE_PREFER_ASYNCHRONOUS in support of
the DAX and HMEM synchronization, so I'm guessing that undoing that is
not an option. Before that change, cxl_dax_region probed synchronously
and created the zero-sized seed dax device before cxlr_add_existing_extents()
ran, so no race existed.

Move to 7.1 and you *should* see cxl-dcd.sh start failing. Since it's a
timing issue, you may need to dial down any dynamic debug and do
repeated runs.

The race is on the dax_region device's devres_head, between:

(a) the asynchronous cxl_dax_region probe reaching really_probe()
and
(b) cxlr_add_existing_extents() attaching devres to the same device

really_probe() rejects probing devices that already have resources
attached. If (b) wins, probe fails with -EBUSY, cxl_dax_region never
binds, and the seed dax device is never created.

One possible fixup would be to move existing-extent processing into
cxl_dax_region_probe() so that the resource attachment happens
within the probe itself. That looked like more restructuring than I
could quickly test out, so I'm sending it back to you.

Below is a reproducer using cxl_test and cxl-cli. It creates a DC region
and checks immediately if its dax_region driver bound and a seed dax
device exists. An 'unbound' dax_region is the bug.

#!/bin/bash
set -u
CXL=${CXL:-cxl}; NDCTL=${NDCTL:-ndctl}; TRIALS=${1:-10}
bound=0 unbound=0
for t in $(seq 1 "$TRIALS"); do
	$NDCTL disable-region -b cxl_test all >/dev/null 2>&1
	modprobe -r cxl_test 2>/dev/null; modprobe cxl_test
	udevadm settle 2>/dev/null; dmesg -C 2>/dev/null

	# first non-sharable memdev with a dynamic_ram_a partition
	# (serial 56540 == 0xDCDC is the mock's sharable fixture)
	mem=$($CXL list -b cxl_test -Mi \
		| jq -r '.[] | select(.dynamic_ram_a_size != null)
			| select(.serial != 56540) | .memdev' | head -1)
	reg=$($CXL create-region -t dynamic_ram_a -d decoder0.0 -m "$mem" \
		2>/dev/null | jq -r .region)
	rnum=${reg#region}

	# sample immediately, no sleep (what the test does via daxctl)
	daxreg=$(readlink -f /sys/bus/cxl/devices/"$reg"/dax_region"$rnum" 2>/dev/null)
	drv=$([ -e "$daxreg/driver" ] && echo bound || echo UNBOUND)
	seed=$([ -e /sys/bus/dax/devices/dax"$rnum".0/uuid ] && echo yes || echo NO)
	ebusy=$(dmesg 2>/dev/null | grep -c "Resources present before probing")

	printf 'trial %2d: %s drv=%-7s seed=%-3s ebusy_msgs=%s\n' \
		"$t" "$reg" "$drv" "$seed" "$ebusy"
	[ "$drv" = bound ] && bound=$((bound+1)) || unbound=$((unbound+1))
done
echo "SUMMARY: bound=$bound unbound(FAIL)=$unbound of $TRIALS"
[ "$unbound" -eq 0 ] || exit 1

Sample output on a failing kernel:

trial  1: region9 drv=bound   seed=yes ebusy_msgs=0
trial  2: region9 drv=UNBOUND seed=NO  ebusy_msgs=1
trial  3: region9 drv=bound   seed=yes ebusy_msgs=0
trial  4: region9 drv=UNBOUND seed=NO  ebusy_msgs=1
...
SUMMARY: bound=4 unbound(FAIL)=4 of 8
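
If it helps to separate "probe is just slow" from "probe lost the race and
will never bind", a bounded poll could stand in for the immediate sample in
the reproducer. This is only a sketch: wait_for_path is a name made up for
illustration (it is not part of ndctl, cxl-cli, or cxl_test), and the retry
count and delay are arbitrary guesses.

```shell
#!/bin/bash
# wait_for_path PATH [TRIES] [DELAY]
# Hypothetical helper: poll until PATH exists.
# Returns 0 if PATH showed up within TRIES polls, non-zero otherwise.
wait_for_path() {
	local path=$1 tries=${2:-50} delay=${3:-0.1}
	local i
	for ((i = 0; i < tries; i++)); do
		[ -e "$path" ] && return 0
		sleep "$delay"
	done
	# one last look after the final sleep
	[ -e "$path" ]
}

# Possible use inside the trial loop, replacing the immediate check:
#   wait_for_path "$daxreg/driver" && drv=bound || drv=UNBOUND
```

With the -EBUSY failure described above, the driver link should never
appear, so the poll would time out rather than eventually succeed.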
>
> Thanks,
> Anisa
>
snip