On Wed, Jun 03, 2026 at 12:04:00PM -0700, Vishal Verma wrote:
> On Fri, 2026-04-24 at 16:36 -0700, Alison Schofield wrote:
> > The btt-check unit test exposed data mismatches during BTT I/O in a
> > CI environment, indicating a race in lane acquisition that can lead
> > to silent data corruption. The failure was not reliably reproduced
> > under typical test conditions.
> > 
> > Add a targeted stress test that repeatedly writes, reads, and verifies
> > data on a BTT namespace while background readers contend for BTT lanes
> > and CPU loops increase preemption pressure.
> > 
> > The test reproduces the race on an unfixed kernel and passes with the
> > lane ownership fix applied.
> > 
> > Assisted-By: Claude Sonnet 4.5
> > Signed-off-by: Alison Schofield <[email protected]>
> 
> Just one nit below, otherwise looks good.
> 
> Reviewed-by: Vishal Verma <[email protected]>
> 
> Also, side note, looks like this ran for ~45 seconds on my laptop making
> it the second longest running test in ndctl - maybe that's fine. Also
> all the 'while :; do :;' processes pegged half my CPUs to 100% for the
> entire time, but maybe that's also fine :) - I see that's intentional of
> course.

Thanks for the review Vishal!

ICYMI I take unit test time hogs very seriously, so aside from the
obvious finding here: you deserve a laptop upgrade, let's look at it:

Alone:
1/1 ndctl:ndctl / btt-stress.sh        OK              10.71s

En Suite:
1/24 ndctl:ndctl / libndctl                       OK               1.76s
 2/24 ndctl:ndctl / dsm-fail                       OK               0.30s
 3/24 ndctl:ndctl / create.sh                      OK               0.41s
 4/24 ndctl:ndctl / clear.sh                       OK              14.70s
 5/24 ndctl:ndctl / pmem-errors.sh                 OK              13.61s
 6/24 ndctl:ndctl / btt-check.sh                   OK               2.71s
 7/24 ndctl:ndctl / label-compat.sh                OK               0.58s
 8/24 ndctl:ndctl / sector-mode.sh                 OK               0.67s
 9/24 ndctl:ndctl / inject-error.sh                OK              13.99s
10/24 ndctl:ndctl / btt-errors.sh                  OK              24.14s
11/24 ndctl:ndctl / btt-stress.sh                  OK              11.15s
12/24 ndctl:ndctl / hugetlb                        OK               0.32s
13/24 ndctl:ndctl / btt-pad-compat.sh              OK               0.95s
14/24 ndctl:ndctl / ack-shutdown-count-set         OK               0.21s
15/24 ndctl:ndctl / rescan-partitions.sh           OK               6.76s
16/24 ndctl:ndctl / inject-smart.sh                OK               1.41s
17/24 ndctl:ndctl / monitor.sh                     OK              49.71s
18/24 ndctl:ndctl / max_extent_ns                  OK               0.79s
19/24 ndctl:ndctl / pfn-meta-errors.sh             OK              14.79s
20/24 ndctl:ndctl / track-uuid.sh                  OK               0.55s
21/24 ndctl:ndctl / firmware-update.sh             OK              10.32s
22/24 ndctl:ndctl / pmem-ns                        OK               1.04s
23/24 ndctl:ndctl / align.sh                       OK               1.13s
24/24 ndctl:ndctl / nfit-security.sh               OK               1.27s

To be complete - that btt race condition took a lot for me to
reproduce. I was stripping out all things that slowed the kernel
until I got it to hit, and then came up with that level of stress
in the unit test.

I do wonder if you would even hit the problem with the monster config
you likely are using based on your runtimes. Did you test this with
and without the fix applied?

> > +
> > +create() {
> 
> Usually for other scripts in ndctl, the convention has been to be more
> C-like and put the opening { on the next line.
> 

Yeah, this is a thing. I've started using shfmt and that brace placement
for function bodies is not a style choice. And when I think about
enforcing it across the unit test scripts I do see the C like convention
you note.

Let me take a look at how the kernel scripts in tools/testing deal
with formatting. I expect it is some flavor of checkpatch.

I'm after uniformity and no exceptions! Same thing for shellcheck.


> > +   json=$($NDCTL create-namespace -b "$NFIT_TEST_BUS0" -t pmem -m sector)
> > +   rc=2
> > +   eval "$(echo "$json" | json2var)"
> 
> I was going to complain about the eval and the json2var 'sed' parsing to
> get stuff out of json, but I see this has been around since forever and
> used in a bunch of other places.
> 
> Ideally this should just be multiple jq -r <filter> invocations, to
> extract each variable - more wordy, but avoids the eval.. But this can
> be a future treewide cleanup.

Agreed the eval/json2var pattern isn't ideal, but since it's the convention
across the whole test/ suite, I'd rather not have this one test diverge. 
Happy to do a treewide jq conversion as a separate series.

snip

Reply via email to