Hi Richard,

On Wed, Oct 31, 2018 at 10:27:59AM +0000, Richard Henderson wrote:
> On 10/30/18 8:32 PM, James Greenhalgh wrote:
> > On Tue, Oct 02, 2018 at 11:19:09AM -0500, Richard Henderson wrote:
> >> When the result of an operation is not used, we can ignore the
> >> result by storing to XZR.  For two of the memory models, using
> >> XZR with LD<op> has a preferred assembler alias, ST<op>.
> > 
> > ST<op> has different semantics to LD<op>, in particular, ST<op> is not
> > ordered by a DMB LD; so this could weaken the LDADD and break C11 semantics.
> > 
> > The relevant Arm Arm text is:
> > 
> >   If the destination register is not one of WZR or XZR, LDADDA and
> >   LDADDAL load from memory with acquire semantics
> > 
> >   LDADDL and LDADDAL store to memory with release semantics.
> > 
> >   LDADD has no memory ordering requirements.
> > 
> > I'm taking this to mean that even if the result is unused, using XZR is not
> > a valid transformation; it weakens the expected acquire semantics to
> > unordered.
> > 
> > The example I have from Will Deacon on an internal bug database is:
> > 
> >   P0 (atomic_int* y,atomic_int* x) {
> >     atomic_store_explicit(x,1,memory_order_relaxed);
> >     atomic_thread_fence(memory_order_release);
> >     atomic_store_explicit(y,1,memory_order_relaxed);
> >   }
> > 
> >   P1 (atomic_int* y,atomic_int* x) {
> >     int r0 = atomic_fetch_add_explicit(y,1,memory_order_relaxed);
> >     atomic_thread_fence(memory_order_acquire);
> >     int r1 = atomic_load_explicit(x,memory_order_relaxed);
> >   }
> > 
> >   The outcome where y == 2 and P1 has r0 = 1 and r1 = 0 is illegal.
> > 
> > This example comes from a while back in my memory; so copying Will for
> > any more detailed questions.
> > 
> > My impression is that this transformation is not safe, and so the patch is
> > not OK.
> 
> Here's a revised patch.
> 
> Use ST<op> for relaxed and release orderings, retain the (non-xzr) scratch
> register for other orderings.  But the scratch need not be early-clobber, 
> since
> there's no mid-point of half-consumed inputs.

The example test above uses relaxed atomics in conjunction with an acquire
fence, so I don't think we can actually use ST<op> at all without a change
to the language specification. I previouslyyallocated P0861 for this purpose
but never got a chance to write it up...

Perhaps the issue is a bit clearer with an additional thread (not often I
say that!):


P0 (atomic_int* y,atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}

P1 (atomic_int* y,atomic_int* x) {
  atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

P2 (atomic_int* y) {
  int r1 = atomic_load_explicit(y,memory_order_relaxed);
}


My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
this test has executed. However, if the relaxed add in P1 compiles to
STADD and the subsequent acquire fence is compiled as DMB LD, then we
don't have any ordering guarantees in P1 and the forbidden result could
be observed.

Will

Reply via email to