[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-09 Thread amacleod at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

Andrew Macleod amacleod at redhat dot com changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #9 from Andrew Macleod amacleod at redhat dot com ---
GCC emits the fence on stores, and thus loads don't need a fence.


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread mikulas at artax dot karlin.mff.cuni.cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

--- Comment #5 from mikulas at artax dot karlin.mff.cuni.cz ---
So, please pinpoint specific parahraph(s) in the standard that specify that
this behavior is acceptable.


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread mikulas at artax dot karlin.mff.cuni.cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

--- Comment #7 from mikulas at artax dot karlin.mff.cuni.cz ---
#include stdatomic.h

atomic_int a = ATOMIC_VAR_INIT(0);
atomic_int b = ATOMIC_VAR_INIT(0);
atomic_int p = ATOMIC_VAR_INIT(0);

int thread_1(void)
{
atomic_store_explicit(b, 1, memory_order_relaxed);
atomic_load_explicit(p, memory_order_seq_cst);
return atomic_load_explicit(a, memory_order_relaxed);
}

int thread_2(void)
{
atomic_store_explicit(a, 1, memory_order_relaxed);
atomic_load_explicit(p, memory_order_seq_cst);
return atomic_load_explicit(b, memory_order_relaxed);
}

See for example this. Suppose that thread_1 and thread_2 are executed
concurrently. If memory_order_seq_cst were a proper full memory barrier, it
would be impossible that both functions return 0. Because you omit the barrier
on read of variable p, it is possible that both functions return 0.

thread_1 is compiled into
movl$1, b(%rip)
movlp(%rip), %eax
movla(%rip), %eax
ret
thread_2 is compiled into
movl$1, a(%rip)
movlp(%rip), %eax
movlb(%rip), %eax
ret
... and the processor is free to move the writes past reads, resulting in both
functions returning zero.

Does the standard allow this behavior? I don't really know. I don't understand
the standard. Please tell me - how do you decide, by interpreting claims in the
section 7.17.3 of the C11 standard, whether the above outcome is allowed or
not?


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread amacleod at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

--- Comment #4 from Andrew Macleod amacleod at redhat dot com ---
I'm not sure where the problem is.  We interacted quite a bit as the model was
being developed. As I recall, it started with the standard, but they
strengthened some of the problem spots for a complete testable model. 

 To the best of my knowledge, this is the most efficient implementation that we
*know* to be safe, and that is why we implement it.


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread amacleod at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

--- Comment #6 from Andrew Macleod amacleod at redhat dot com ---
The standard doesn't define what machines should generate what code. It defines
terms for observing effects that need to be adhered to. Their machine model was
created over a few years during the early stages of the memory model to help
observe and test those effects. To the best of our knowledge we think the
results from this model are an efficient and correct implementation within the
standards definition.

If you can provide a test case which demonstrates that this is not a correct
implementation based on observable effects (ie the load is observed out of
order somewhere), then we'd look at fixing it in GCC.

If you want to discuss whether parts are or aren't compliant without a test
case that fails, then you should contact the authors with your questions since
they can give you a far better answer than I ever could.


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #8 from torvald at gcc dot gnu.org ---
(In reply to mikulas from comment #7)
 #include stdatomic.h
 
 atomic_int a = ATOMIC_VAR_INIT(0);
 atomic_int b = ATOMIC_VAR_INIT(0);
 atomic_int p = ATOMIC_VAR_INIT(0);
 
 int thread_1(void)
 {
 atomic_store_explicit(b, 1, memory_order_relaxed);
 atomic_load_explicit(p, memory_order_seq_cst);
 return atomic_load_explicit(a, memory_order_relaxed);
 }
 
 int thread_2(void)
 {
 atomic_store_explicit(a, 1, memory_order_relaxed);
 atomic_load_explicit(p, memory_order_seq_cst);
 return atomic_load_explicit(b, memory_order_relaxed);
 }
 
 See for example this. Suppose that thread_1 and thread_2 are executed
 concurrently. If memory_order_seq_cst were a proper full memory barrier, it
 would be impossible that both functions return 0.

memory_order_seq_cst is a memory order in the Standard's terminology.  Fences
are something else (ie, atomic_thread_fence()) , and parametrized by a memory
order.  A memory_order_seq_cst *memory access* does not have the same effects
as a memory_order_seq_cst fence.  See C++14 29.3p4-7; those paragraphs talk
about memory_order_seq_cst fences specifically, not about memory_order_seq_cst
operations in general.

If you want to make this example of Dekker synchronization correct, you need to
use fences instead of the accesses to p; alternatively, you need to use seq-cst
accesses for all the stores and loads to a and b, in which case there will be
HW fences added via the stores (as Andrew already pointed out).


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread amacleod at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

Andrew Macleod amacleod at redhat dot com changed:

   What|Removed |Added

 CC||amacleod at redhat dot com

--- Comment #2 from Andrew Macleod amacleod at redhat dot com ---
I do not believe the lock is needed.  See
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

GCC will emit the fence on all stores which then synchronize with loads.  If we
did emit a fence on the loads, then we wouldn't need the fence on stores.   It
is likely there will be more loads than stores, so the current approach is
usually more efficient.


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread mikulas at artax dot karlin.mff.cuni.cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

--- Comment #3 from mikulas at artax dot karlin.mff.cuni.cz ---
The problem here is that we don't really know what does the standard specify.

People often suggest the Batty's paper Mathematizing C++ Concurrency (
http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf ) as the explanation,
but the paper really describes a different memory model than the C++ standard
(citing section 4: Unfortunately N3092 allows the following nonsequentially
consistent execution of the SB example with SC atomics (initialisation writes,
such as (a) and (b), are non-atomic so that they need not be compiled with
memory fences): - We devised a stronger restriction on the values that may be
read by SC atomics, stated in ยง2.7, that does provide sequential consistency
here. - so it doesn't describe the standard, but something stronger that
authors have devised)

You can look at this example https://gcc.gnu.org/ml/gcc/2014-02/msg00053.html .
The assert can fail - so it implies that __ATOMIC_SEQ_CST is not a full
barrier, it is somehow weaker. But how much weaker is it? Who knows?

I look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html and I would
really like to know why is it so? Where does the standard specify this
mapping? I couldn't really find an answer for that.

[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-02 Thread jamrial at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

James Almer jamrial at gmail dot com changed:

   What|Removed |Added

 CC||jamrial at gmail dot com

--- Comment #1 from James Almer jamrial at gmail dot com ---
This still happens with gcc 4.9 and gcc 5.

int foo(int *bar)
{
return *bar;
}

and

int foo(int *bar)
{
return __atomic_load_n(bar, __ATOMIC_SEQ_CST);
}

Generate the exact same assembly. The latter should add an mfence before the
mov instruction.