> On 2025-11-04, 09:18, "Konstantin Ananyev" <[email protected]> wrote:
>
> > On Mon, 3 Nov 2025 18:06:05 +0000
> > Konstantin Ananyev <[email protected]> wrote:
> > >
> > > > On 11/3/25 11:07, Stephen Hemminger wrote:
> > > > > On Mon, 3 Nov 2025 09:12:39 -0600
> > > > > Wathsala Vithanage <[email protected]> wrote:
> > > > >
> > > > >> MCS lock is broken, it's just a matter of time it will run into a
> > > > >> deadlock.
> > > > >>
> > > > >> drivers/dma/cnxk is a user of MCS lock.
> > > > > I am surprised that a driver would use mcslock.
> > > > > MCSlock is targeted at case of large number of CPU's with lots of contention.
> > > > > It will likely be slower than spinlock or ticketlock for the use case
> > > > > of driver.
> > > > It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c, perhaps the
> > > > maintainer can clarify.
> > > >
> > > If MCS lock is really broken, it needs to be fixed anyway.
> > > It might be used by other third-party libs/apps that do use on DPDK.
> >
> > 100% agree it must be fixed.
> > It would be good to have test or static analysis tool that could validate
> > all the lock types.
> >
> +1
> Another option would be to have sort of stress test for all locking types we
> have in our UT.
> At least for ring I found it very useful.

A stress test only tests the specific hardware and compiler you are using.
E.g. the LDAPR problem in rte_ring was hidden for years because compilers
(GCC) did not support generating LDAPR instead of LDAR for load-acquire. Arm
processors have implemented LDAPR since the Neoverse N1, which was released
in 2019. By comparison, it was only in 2023, with GCC 13, that support for
FEAT_RCPC (i.e. the LDAPR instruction) was added, and you also must specify
+rcpc for -march or specify a CPU that supports FEAT_RCPC; building with the
defaults won't generate LDAPR and the bug would remain hidden. A stress test
that is run on the target every time DPDK is rebuilt for some target would
cover some of that. But a stress test (like any test) is not guaranteed to
provoke all code paths and thread interleavings, so there are still no
guarantees.
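To make the compiler dependence concrete, here is a minimal sketch. This is
my own illustration, not rte_ring or DPDK code, and the file name and gcc
invocations are only examples for an AArch64 toolchain:

/* acq_load.c - illustration only, not DPDK code. */
#include <stdatomic.h>

int
load_acq(atomic_int *p)
{
    /* C11 load-acquire. On AArch64 this is normally emitted as LDAR;
     * a compiler that knows the target has FEAT_RCPC (GCC >= 13 with
     * +rcpc, or a CPU that implies it) may emit LDAPR instead. */
    return atomic_load_explicit(p, memory_order_acquire);
}

$ gcc -O2 -S acq_load.c                     # default -march: LDAR
$ gcc -O2 -mcpu=neoverse-n1 -S acq_load.c   # FEAT_RCPC known: LDAPR possible

So the same source, built with the defaults, never exercises the LDAPR code
path at all, which is why a stress test on such a build could not find the
bug.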
I claim that it is important to verify that the code follows the C/C++ memory
model and implements all the necessary happens-before relations
(synchronize-with edges). Then it will work on all hardware and compilers,
now and in the future, that implement the C/C++ memory model (and Java, which
is quite similar AFAIK, all atomics being seq_cst).

One such verifier exists in Progress64. It verifies the actual implementation
by executing all possible interleavings (permutations) of two threads running
a test program and ensures that all required synchronize-with edges exist (as
implemented by load-acquire/store-release). Lack of synchronize-with between
two accesses to the same location by different threads means a data race has
been found.

As an example, make that load-acquire in the MCS unlock function a
load-relaxed instead (as it was before that patch) and run the verifier:

$ ./verify -am mcslock
Verifying mcslock
(lots of output of failed verifications)
<CTRL-C>
Interrupted
Histogram over number of steps:
20: 594
21: 0
22: 0
23: 0
24: 0
25: 0
26: 566
27: 0
28: 684
29: 0
30: 155
31: 0
32: 593
33: 0
34: 420
35: 0
36: 155
Summary:
succeeded: 3167
interrupted: 0
failed: 5871
total: 9038 (0x234e)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step

The first failed permutation is 0x7. Re-run that with the verbose flag.
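Before diving into the trace, a minimal MCS lock sketch in C11 may help as a
reference. This is my own illustration, not the Progress64 source (names,
structure and line numbers differ from p64_mcslock.c); the load whose memory
order is at issue is commented in the unlock path.

/* mcs_sketch.c - minimal MCS lock sketch for illustration only.
 * Not the Progress64 implementation; names and line numbers differ
 * from p64_mcslock.c. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcsnode {
    struct mcsnode *_Atomic next;   /* successor in the waiter queue */
    atomic_bool wait;               /* true while waiting for the lock */
};

typedef struct mcsnode *_Atomic mcslock_t;  /* tail of the waiter queue */

static void
mcs_lock(mcslock_t *lock, struct mcsnode *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->wait, true, memory_order_relaxed);
    /* Swap ourselves in as the new tail. */
    struct mcsnode *pred = atomic_exchange_explicit(lock, node,
                                                    memory_order_acq_rel);
    if (pred != NULL) {
        /* Link behind the predecessor; the release publishes our node
         * initialisation (including wait = true) to the lock owner. */
        atomic_store_explicit(&pred->next, node, memory_order_release);
        /* Spin until the owner hands the lock over. */
        while (atomic_load_explicit(&node->wait, memory_order_acquire))
            ;
    }
}

static void
mcs_unlock(mcslock_t *lock, struct mcsnode *node)
{
    /* This load must be acquire: it synchronizes with the store-release
     * in mcs_lock() that linked the successor, so the successor's node
     * initialisation happens-before the hand-off store below. With a
     * relaxed load here, the write of wait = false is unordered with
     * the successor's wait = true; in p64_mcslock.c, where the initial
     * write is a plain store, the verifier reports this as a data race. */
    struct mcsnode *succ = atomic_load_explicit(&node->next,
                                                memory_order_acquire);
    if (succ == NULL) {
        /* No successor seen: try to swing the tail back to NULL. */
        struct mcsnode *expected = node;
        if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                                                    memory_order_release,
                                                    memory_order_relaxed))
            return;
        /* Someone is between the exchange and the link; wait for it. */
        while ((succ = atomic_load_explicit(&node->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to the successor. */
    atomic_store_explicit(&succ->wait, false, memory_order_release);
}

Each thread spins only on its own node's wait flag, which is what makes MCS
attractive under heavy contention in the first place.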
$ ./verify -avm -p7 mcslock
Verifying mcslock using permutation 0x7
Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
Regular read_8 on step 6 matches regular write_8 from other thread on step 0
Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
Regular read_1 on step 16 matches regular write_1 from same thread on step 13
Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
Regular read_1 on step 25 matches regular write_1 from other thread on step 4
ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
Module mcslock permutation 0x7 step 26: Verification failed

Change the load-relaxed back to a load-acquire in unlock and rerun the
verifier. By default it runs 2^32 (about 4 billion) permutations, but this
can be changed.

$ ./verify -am mcslock
Verifying mcslock
Verifying permutation 0...
...
Histogram over number of steps:
20: 281018368
21: 0
22: 0
23: 0
24: 143130624
25: 0
26: 906690560
27: 137625600
28: 1200701440
29: 252887040
30: 752590848
31: 181182464
32: 279252992
33: 74387456
34: 59373568
35: 17340416
36: 6410240
37: 2017280
38: 268800
39: 89600
Summary:
succeeded: 4294967296
interrupted: 0
failed: 0
total: 4294967296 (0x100000000)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
No data races detected
Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step

The verifier is still work in progress and has some limitations. It
implements an SC (sequential consistency)-like memory model but then analyses
all regular accesses to ensure that the necessary synchronize-with
(acquire/release) edges are present, or a data race is reported. One problem
is that atomic accesses by definition never cause data races, so the verifier
needs some help, which is done using additional regular (non-atomic)
accesses. I think more work can be done here to detect data races without
requiring additional regular accesses. Possibly an RCpc-like model could also
be simulated, e.g. a load-acquire could return not the SC (current) value of
a memory location but the earliest value that existing synchronize-with edges
allow. This would require more intrusive changes to the verifier.

Thread fences are not supported. I don't mind this too much, as I don't think
thread fences should be used unless really necessary; their semantics are
often misunderstood (as we have seen). However, that means code like the
seqlock can't be verified. I have a herd7 litmus test for the seqlock
(p64_rwsync) instead. Matching thread fences with (relaxed) atomic loads and
stores to identify synchronize-with edges is a possibility, so perhaps they
will be supported at some point.

Only two threads are currently supported, but extending this to more threads
is trivial. Verification speed will suffer though. I am planning to make the
verifier multithreaded so it can use all processor cores in a system.

I have many more ideas on how to improve the verifier.

- Ola

