https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=265974
Bug ID: 265974
Summary: SMR has several missing barriers
Product: Base System
Version: CURRENT
Hardware: arm64
OS: Any
Status: New
Severity: Affects Many People
Priority: ---
Component: kern
Assignee: [email protected]
Reporter: [email protected]
Created attachment 236039
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=236039&action=edit
tarball of litmus tests and pdfs
1. smr_enter() ordering is incorrect on arm64 (and probably other weakly ordered
systems)
```C
static inline void
smr_enter(smr_t smr)
{
[...]
/* This is an add because we do not have atomic_store_acq_int */
atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
}
```
The comment here has a point: there is no atomic_store_acq_int, because what is
really needed here is a StoreLoad barrier, and `atomic_add_acq_int` is not
sufficient to provide one.
Here is a litmus test showing why the code above isn't correct. Assume two
threads running something akin to this:
start:
global = 0
rd_seq = 123
wr_seq = 123
thread 0:
store(&global, 2);
smr_synchronize();
thread 1:
smr_enter();
load(&global);
It should NOT be possible for thread 1 to load `0` while thread 0's
smr_synchronize() concludes that thread 1 was not in a critical section pinning
`123`.
Moreover, smr_poll() absolutely needs a full memory barrier on entry; the
`atomic_load_acq_int` loads performed by smr_poll() are not sufficient.
```
AArch64 MP
{
global = 0;
wr_seq = 123;
p1_rd_seq = 0;
0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
P0                  | P1            ;
MOV X8, #2          | LDR X8, [X1]  ;
STR X8, [X0]        | STR X8, [X2]  ;
LDADDL X8, X9, [X1] | LDR X10, [X0] ;
LDR X10, [X2]       |               ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```
The litmus test above shows that the condition is unfortunately reachable. The
one below, which adds a full barrier in smr_poll_scan() and a full barrier in
smr_enter(), passes:
```
AArch64 MP
{
global = 0;
wr_seq = 123;
p1_rd_seq = 0;
0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
P0                  | P1            ;
MOV X8, #2          | LDR X8, [X1]  ;
STR X8, [X0]        | STR X8, [X2]  ;
LDADDL X8, X9, [X1] | DMB SY        ;
DMB SY              | LDR X10, [X0] ;
LDR X10, [X2]       |               ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```
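For readers less fluent in arm64 assembly, here is the same store-buffering
pattern as a small, self-contained C11 sketch. It is purely illustrative and
not the FreeBSD code: writer() and reader() are made-up names, and the C11
atomics stand in for the kernel primitives; the globals mirror the litmus test
variables.
```C
#include <stdatomic.h>

static _Atomic int global;		/* protected object state */
static _Atomic unsigned wr_seq = 123;	/* shared write sequence */
static _Atomic unsigned p1_rd_seq;	/* thread 1's per-CPU c_seq */

/* Thread 0: store to the object, then advance wr_seq and scan (models P0). */
void
writer(void)
{
	atomic_store_explicit(&global, 2, memory_order_relaxed);
	atomic_fetch_add_explicit(&wr_seq, 2, memory_order_release);
	/*
	 * StoreLoad: the store to global must be visible before we read the
	 * other CPU's c_seq; release/acquire alone cannot order a store
	 * against a later load.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	(void)atomic_load_explicit(&p1_rd_seq, memory_order_relaxed);
}

/* Thread 1: smr_enter() followed by a read of the object (models P1). */
void
reader(void)
{
	unsigned seq = atomic_load_explicit(&wr_seq, memory_order_relaxed);
	atomic_store_explicit(&p1_rd_seq, seq, memory_order_relaxed);
	/* StoreLoad: publish our critical section before reading the object. */
	atomic_thread_fence(memory_order_seq_cst);
	(void)atomic_load_explicit(&global, memory_order_relaxed);
}
```
Dropping either fence (or weakening it to release/acquire) makes the "both
sides miss each other's store" outcome legal again under the C11 memory model,
which is exactly the outcome the exists clause describes.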
Note that I think the code works on Intel today, since locked read-modify-write
instructions act as full barriers there.
2. Similarly, smr_poll_scan() needs a full barrier after the scan, _before_ it
updates the global rd_seq. This barrier serializes the fast path of smr_poll()
with CPUs that were not in a critical section, while the barrier on entry of
smr_poll() (demonstrated in (1)) synchronizes with CPUs inside an active SMR
critical section.
I confess this litmus test is much more painful to write and is left as an
exercise to the reader ;)
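As a rough illustration of where this second barrier sits, here is a simplified
C11 sketch of the scan/advance shape. The names (NCPU, cpu_c_seq, rd_seq,
scan_then_advance, wr_seq_goal) are invented, and the comparison is trivialised;
the real logic lives in smr_poll_scan() and uses wraparound-safe sequence
comparisons and a cmpset to publish rd_seq.
```C
#include <stdatomic.h>

#define	NCPU	4

static _Atomic unsigned rd_seq;			/* global read sequence */
static _Atomic unsigned cpu_c_seq[NCPU];	/* per-CPU critical section seqs */

static unsigned
scan_then_advance(unsigned wr_seq_goal)
{
	unsigned new_rd = wr_seq_goal;

	/* Barrier from (1): order us against concurrent smr_enter()s. */
	atomic_thread_fence(memory_order_seq_cst);

	for (int i = 0; i < NCPU; i++) {
		unsigned c = atomic_load_explicit(&cpu_c_seq[i],
		    memory_order_acquire);
		if (c != 0 && c < new_rd)	/* 0 == not in a section */
			new_rd = c;		/* CPU i pins an older seq */
	}

	/*
	 * Barrier from (2): every c_seq load above must complete before the
	 * new rd_seq can be observed, otherwise another CPU's smr_poll()
	 * fast path could trust a rd_seq published before the scan was done.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	atomic_store_explicit(&rd_seq, new_rd, memory_order_relaxed);
	return (new_rd);
}
```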
3. smr_deferred_advance() is similarly incorrect as it doesn't guarantee proper
visibility in this case:
start:
global = 0
rd_seq = 123
wr_seq = 123
thread 0:
store(&global, 2);
smr_deferred_advance(); // returns 125 but didn't update wr_seq
thread 1:
smr_synchronize(); // 123 -> 125
load(&global);
We should never hit a case where thread 0's deferred advance returned 125 and
thread 1's smr_synchronize() advanced wr_seq to that same 125, yet thread 1 did
not observe the global being 2.
Here is a model of this sequence, with the `dmb sy` from (1) added in
smr_poll():
```
AArch64 MP
{
global = 0;
wr_seq = 123;
0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
P0             | P1                  ;
STR X2, [X0]   | LDADDL X2, X9, [X1] ;
               | DMB SY              ;
LDR X9, [X1]   | LDR X10, [X0]       ;
ADD X9, X9, X2 |                     ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```
This unfortunately tells us that the condition is reachable, and intuitively it
makes sense because nothing in P0 orders anything. The fix is that when a
deferred advance is made, a full barrier must be issued (it has to be a
StoreLoad barrier between the store to the global and the load of the write
sequence), and indeed the following litmus test passes:
```
AArch64 MP
{
global = 0;
wr_seq = 123;
0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
P0             | P1                  ;
STR X2, [X0]   | LDADDL X2, X9, [X1] ;
DMB SY         | DMB SY              ;
LDR X9, [X1]   | LDR X10, [X0]       ;
ADD X9, X9, X2 |                     ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```
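The same fixed pattern can be written as a C11 sketch rather than assembly.
Again this is illustrative only: deferred_advance_after_store() and
synchronize_then_read() are made-up names, and the SMR_SEQ_INCR of 2 simply
matches the 123 -> 125 numbers used above; the real fix is the subr_smr.c diff
below.
```C
#include <stdatomic.h>

#define	SMR_SEQ_INCR	2

static _Atomic int global;
static _Atomic unsigned wr_seq = 123;

/* Thread 0: store, then a deferred advance that merely reads wr_seq. */
unsigned
deferred_advance_after_store(void)
{
	atomic_store_explicit(&global, 2, memory_order_relaxed);
	/*
	 * StoreLoad: without this fence the wr_seq load may be satisfied
	 * before the store above becomes visible, reproducing the reachable
	 * outcome of the previous litmus test.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&wr_seq, memory_order_relaxed) +
	    SMR_SEQ_INCR);
}

/* Thread 1: a real advance (smr_synchronize()) followed by a read. */
unsigned
synchronize_then_read(void)
{
	unsigned old = atomic_fetch_add_explicit(&wr_seq, SMR_SEQ_INCR,
	    memory_order_release);
	/* Barrier on the poll side, as added in (1). */
	atomic_thread_fence(memory_order_seq_cst);
	(void)atomic_load_explicit(&global, memory_order_relaxed);
	return (old + SMR_SEQ_INCR);
}
```
With both fences in place, if both functions return 125 then thread 1 is
guaranteed to have observed global == 2, which is exactly the outcome the
forbidden exists clause rules out.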
Attached is a tarball with the litmus tests and nice PDF renderings of why
those barriers are needed on arm64.
My tentative fix is something like the following, though I don't use FreeBSD
and don't even know how to build it, so it might have mistakes:
```diff
diff --git a/sys/kern/subr_smr.c b/sys/kern/subr_smr.c
index cbbf185fee79..f7c7ccef8d6e 100644
--- a/sys/kern/subr_smr.c
+++ b/sys/kern/subr_smr.c
@@ -307,8 +307,10 @@ static smr_seq_t
 smr_deferred_advance(smr_t smr, smr_shared_t s, smr_t self)
 {
-	if (++self->c_deferred < self->c_limit)
+	if (++self->c_deferred < self->c_limit) {
+		atomic_thread_fence_seq_cst();
 		return (smr_shared_current(s) + SMR_SEQ_INCR);
+	}
 	self->c_deferred = 0;
 	return (smr_default_advance(smr, s));
 }
@@ -427,6 +429,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 	CRITICAL_ASSERT(curthread);
 	counter_u64_add_protected(poll_scan, 1);
+	atomic_thread_fence_seq_cst();
+
 	/*
 	 * The read sequence can be no larger than the write sequence at
 	 * the start of the poll.
@@ -450,6 +454,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 		rd_seq = SMR_SEQ_MIN(rd_seq, c_seq);
 	}
+	atomic_thread_fence_seq_cst();
+
 	/*
	 * Advance the rd_seq as long as we observed a more recent value.
	 */
diff --git a/sys/sys/smr.h b/sys/sys/smr.h
index c110be9a66c2..3d543d1979ae 100644
--- a/sys/sys/smr.h
+++ b/sys/sys/smr.h
@@ -133,7 +133,12 @@ smr_enter(smr_t smr)
 	 * is detected and handled there.
 	 */
 	/* This is an add because we do not have atomic_store_acq_int */
+#if __x86_64__ || __i386__
 	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+#else
+	atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+	atomic_thread_fence_seq_cst();
+#endif
 }
 
 /*
```