On Wed, Jun 03, 2026 at 12:04:00PM -0700, Vishal Verma wrote:
> On Fri, 2026-04-24 at 16:36 -0700, Alison Schofield wrote:
> > The btt-check unit test exposed data mismatches during BTT I/O in a
> > CI environment, indicating a race in lane acquisition that can lead
> > to silent data corruption. The failure was not reliably reproduced
> > under typical test conditions.
> >
> > Add a targeted stress test that repeatedly writes, reads, and verifies
> > data on a BTT namespace while background readers contend for BTT lanes
> > and CPU loops increase preemption pressure.
> >
> > The test reproduces the race on an unfixed kernel and passes with the
> > lane ownership fix applied.
> >
> > Assisted-By: Claude Sonnet 4.5
> > Signed-off-by: Alison Schofield <[email protected]>
>
> Just one nit below, otherwise looks good.
>
> Reviewed-by: Vishal Verma <[email protected]>
>
> Also, side note, looks like this ran for ~45 seconds on my laptop making
> it the second longest running test in ndctl - maybe that's fine. Also
> all the 'while :; do :;' processes pegged half my CPUs to 100% for the
> entire time, but maybe that's also fine :) - I see that's intentional of
> course.
Thanks for the review Vishal!
ICYMI I take unit test time hogs very seriously, so aside from the
obvious finding here: you deserve a laptop upgrade, let's look at it:
Alone:
1/1 ndctl:ndctl / btt-stress.sh OK 10.71s
En Suite:
1/24 ndctl:ndctl / libndctl OK 1.76s
2/24 ndctl:ndctl / dsm-fail OK 0.30s
3/24 ndctl:ndctl / create.sh OK 0.41s
4/24 ndctl:ndctl / clear.sh OK 14.70s
5/24 ndctl:ndctl / pmem-errors.sh OK 13.61s
6/24 ndctl:ndctl / btt-check.sh OK 2.71s
7/24 ndctl:ndctl / label-compat.sh OK 0.58s
8/24 ndctl:ndctl / sector-mode.sh OK 0.67s
9/24 ndctl:ndctl / inject-error.sh OK 13.99s
10/24 ndctl:ndctl / btt-errors.sh OK 24.14s
11/24 ndctl:ndctl / btt-stress.sh OK 11.15s
12/24 ndctl:ndctl / hugetlb OK 0.32s
13/24 ndctl:ndctl / btt-pad-compat.sh OK 0.95s
14/24 ndctl:ndctl / ack-shutdown-count-set OK 0.21s
15/24 ndctl:ndctl / rescan-partitions.sh OK 6.76s
16/24 ndctl:ndctl / inject-smart.sh OK 1.41s
17/24 ndctl:ndctl / monitor.sh OK 49.71s
18/24 ndctl:ndctl / max_extent_ns OK 0.79s
19/24 ndctl:ndctl / pfn-meta-errors.sh OK 14.79s
20/24 ndctl:ndctl / track-uuid.sh OK 0.55s
21/24 ndctl:ndctl / firmware-update.sh OK 10.32s
22/24 ndctl:ndctl / pmem-ns OK 1.04s
23/24 ndctl:ndctl / align.sh OK 1.13s
24/24 ndctl:ndctl / nfit-security.sh OK 1.27s
To be complete - that btt race condition took a lot for me to
reproduce. I was stripping out all things that slowed the kernel
until I got it to hit, and then came up with that level of stress
in the unit test.
I do wonder if you would even hit the problem with the monster config
you likely are using based on your runtimes. Did you test this with
and without the fix applied?
> > +
> > +create() {
>
> Usually for other scripts in ndctl, the convention has been to be more
> C-like and put the opening { on the next line.
>
Yeah, this is a thing. I've started using shfmt and that brace placement
for function bodies is not a style choice. And when I think about
enforcing it across the unit test scripts I do see the C like convention
you note.
Let me take a look at how the kernel scripts in tools/testing deal
with formatting. I expect it is some flavor of checkpatch.
I'm after uniformity and no exceptions! Same thing for shellcheck.
> > + json=$($NDCTL create-namespace -b "$NFIT_TEST_BUS0" -t pmem -m sector)
> > + rc=2
> > + eval "$(echo "$json" | json2var)"
>
> I was going to complain about the eval and the json2var 'sed' parsing to
> get stuff out of json, but I see this has been around since forever and
> used in a bunch of other places.
>
> Ideally this should just be multiple jq -r <filter> invocations, to
> extract each variable - more wordy, but avoids the eval.. But this can
> be a future treewide cleanup.
Agreed the eval/json2var pattern isn't ideal, but since it's the convention
across the whole test/ suite, I'd rather not have this one test diverge.
Happy to do a treewide jq conversion as a separate series.
snip