> On 2025-11-04, 09:18, "Konstantin Ananyev" <[email protected]> wrote:
>
> > On Mon, 3 Nov 2025 18:06:05 +0000
> > Konstantin Ananyev <[email protected]> wrote:
> > >
> > > > On 11/3/25 11:07, Stephen Hemminger wrote:
> > > > > On Mon, 3 Nov 2025 09:12:39 -0600
> > > > > Wathsala Vithanage <[email protected]> wrote:
> > > > >
> > > > >> MCS lock is broken, it's just a matter of time it will run into a
> > > > >> deadlock.
> > > > >>
> > > > >> drivers/dma/cnxk is a user of MCS lock.
> > > > > I am surprised that a driver would use mcslock.
> > > > > MCSlock is targeted at case of large number of CPU's with lots of contention.
> > > > > It will likely be slower than spinlock or ticketlock for the use case
> > > > > of driver.
> > > > It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c, perhaps the
> > > > maintainer can clarify.
> > > >
> > > If MCS lock is really broken, it needs to be fixed anyway.
> > > It might be used by other third-party libs/apps that do use on DPDK.
> >
> > 100% agree it must be fixed.
> > It would be good to have test or static analysis tool that could validate
> > all the lock types.
> >
> +1
> Another option would be to have sort of stress test for all locking types we
> have in our UT.
> At least for ring I found it very useful.

A stress test only tests the specific hardware and compiler you are using.
E.g. the LDAPR problem in rte_ring was hidden for years because compilers
(GCC) did not support generating LDAPR instead of LDAR for load-acquire. Arm
processors have implemented LDAPR since the Neoverse N1, which was released
in 2019. By comparison, it was only in 2023, with GCC 13, that support for
FEAT_RCPC (i.e. the LDAPR instruction) was added, and you also must specify
+rcpc for -march or specify a CPU that supports FEAT_RCPC; building with the
defaults won't generate LDAPR and the bug would remain hidden. A stress test
that is run on the target every time DPDK is rebuilt for some target would
cover some of that. But a stress test (like any test) is not guaranteed to
provoke all code paths and thread interleavings, so there are still no
guarantees.
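To make the compiler dependence concrete, here is a minimal sketch. This is
my own illustration, not rte_ring or DPDK code, and the file name and gcc
invocations are only examples for an AArch64 toolchain:

/* acq_load.c - illustration only, not DPDK code. */
#include <stdatomic.h>

int
load_acq(atomic_int *p)
{
    /* C11 load-acquire. On AArch64 this is normally emitted as LDAR;
     * a compiler that knows the target has FEAT_RCPC (GCC >= 13 with
     * +rcpc, or a CPU that implies it) may emit LDAPR instead. */
    return atomic_load_explicit(p, memory_order_acquire);
}

$ gcc -O2 -S acq_load.c                     # default -march: LDAR
$ gcc -O2 -mcpu=neoverse-n1 -S acq_load.c   # FEAT_RCPC known: LDAPR possible

So the same source, built with the defaults, never exercises the LDAPR code
path at all, which is why a stress test on such a build could not find the
bug.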
I claim that it is important to verify that the code follows the C/C++ memory
model and implements all the necessary happens-before relations
(synchronize-with edges). Then it will work on all hardware and compilers,
now and in the future, that implement the C/C++ memory model (and Java, which
is quite similar AFAIK, all atomics being seq_cst).

One such verifier exists in Progress64. It verifies the actual implementation
by executing all possible interleavings (permutations) of two threads running
a test program and ensures that all required synchronize-with edges exist (as
implemented by load-acquire/store-release). Lack of synchronize-with between
two accesses to the same location by different threads means a data race has
been found.

As an example, make that load-acquire in the MCS unlock function a
load-relaxed instead (as it was before that patch) and run the verifier:

$ ./verify -am mcslock
Verifying mcslock
(lots of output of failed verifications)
<CTRL-C>
Interrupted
Histogram over number of steps:
20: 594
21: 0
22: 0
23: 0
24: 0
25: 0
26: 566
27: 0
28: 684
29: 0
30: 155
31: 0
32: 593
33: 0
34: 420
35: 0
36: 155
Summary:
succeeded: 3167
interrupted: 0
failed: 5871
total: 9038 (0x234e)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step

The first failed permutation is 0x7. Re-run that with the verbose flag.
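Before diving into the trace, a minimal MCS lock sketch in C11 may help as a
reference. This is my own illustration, not the Progress64 source (names,
structure and line numbers differ from p64_mcslock.c); the load whose memory
order is at issue is commented in the unlock path.

/* mcs_sketch.c - minimal MCS lock sketch for illustration only.
 * Not the Progress64 implementation; names and line numbers differ
 * from p64_mcslock.c. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcsnode {
    struct mcsnode *_Atomic next;   /* successor in the waiter queue */
    atomic_bool wait;               /* true while waiting for the lock */
};

typedef struct mcsnode *_Atomic mcslock_t;  /* tail of the waiter queue */

static void
mcs_lock(mcslock_t *lock, struct mcsnode *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->wait, true, memory_order_relaxed);
    /* Swap ourselves in as the new tail. */
    struct mcsnode *pred = atomic_exchange_explicit(lock, node,
                                                    memory_order_acq_rel);
    if (pred != NULL) {
        /* Link behind the predecessor; the release publishes our node
         * initialisation (including wait = true) to the lock owner. */
        atomic_store_explicit(&pred->next, node, memory_order_release);
        /* Spin until the owner hands the lock over. */
        while (atomic_load_explicit(&node->wait, memory_order_acquire))
            ;
    }
}

static void
mcs_unlock(mcslock_t *lock, struct mcsnode *node)
{
    /* This load must be acquire: it synchronizes with the store-release
     * in mcs_lock() that linked the successor, so the successor's node
     * initialisation happens-before the hand-off store below. With a
     * relaxed load here, the write of wait = false is unordered with
     * the successor's wait = true; in p64_mcslock.c, where the initial
     * write is a plain store, the verifier reports this as a data race. */
    struct mcsnode *succ = atomic_load_explicit(&node->next,
                                                memory_order_acquire);
    if (succ == NULL) {
        /* No successor seen: try to swing the tail back to NULL. */
        struct mcsnode *expected = node;
        if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                                                    memory_order_release,
                                                    memory_order_relaxed))
            return;
        /* Someone is between the exchange and the link; wait for it. */
        while ((succ = atomic_load_explicit(&node->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to the successor. */
    atomic_store_explicit(&succ->wait, false, memory_order_release);
}

Each thread spins only on its own node's wait flag, which is what makes MCS
attractive under heavy contention in the first place.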
$ ./verify -avm -p7 mcslock
Verifying mcslock using permutation 0x7
Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
Regular read_8 on step 6 matches regular write_8 from other thread on step 0
Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
Regular read_1 on step 16 matches regular write_1 from same thread on step 13
Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
Regular read_1 on step 25 matches regular write_1 from other thread on step 4
ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
Module mcslock permutation 0x7 step 26: Verification failed

Change the load-relaxed back to a load-acquire in unlock and rerun the
verifier. By default it runs 2^32 (about 4 billion) permutations, but this
can be changed.

$ ./verify -am mcslock
Verifying mcslock
Verifying permutation 0...
...
Histogram over number of steps:
20: 281018368
21: 0
22: 0
23: 0
24: 143130624
25: 0
26: 906690560
27: 137625600
28: 1200701440
29: 252887040
30: 752590848
31: 181182464
32: 279252992
33: 74387456
34: 59373568
35: 17340416
36: 6410240
37: 2017280
38: 268800
39: 89600
Summary:
succeeded: 4294967296
interrupted: 0
failed: 0
total: 4294967296 (0x100000000)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
No data races detected
Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step

The verifier is still work in progress and has some limitations. It
implements an SC (sequential consistency)-like memory model but then analyses
all regular accesses to ensure that the necessary synchronize-with
(acquire/release) edges are present, or a data race is reported. One problem
is that atomic accesses by definition never cause data races, so the verifier
needs some help, which is done using additional regular (non-atomic)
accesses. I think more work can be done here to detect data races without
requiring additional regular accesses. Possibly an RCpc-like model could also
be simulated, e.g. a load-acquire could return not the SC (current) value of
a memory location but the earliest value that existing synchronize-with edges
allow. This would require more intrusive changes to the verifier.

Thread fences are not supported. I don't mind this too much, as I don't think
thread fences should be used unless really necessary; their semantics are
often misunderstood (as we have seen). However, that means code like the
seqlock can't be verified. I have a herd7 litmus test for the seqlock
(p64_rwsync) instead. Matching thread fences with (relaxed) atomic loads and
stores to identify synchronize-with edges is a possibility, so perhaps they
will be supported at some point.

Only two threads are currently supported, but extending this to more threads
is trivial. Verification speed will suffer though. I am planning to make the
verifier multithreaded so it can use all processor cores in a system.

I have many more ideas on how to improve the verifier.

- Ola

