Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-31 Thread Alexander Monakov
On Mon, 31 Jul 2017, Jeff Law wrote:
> > Please consider that expand_mem_thread_fence is used to place fences around
> > seq-cst atomic loads when the backend doesn't provide a direct pattern.
> > With compiler barriers on both sides of the machine barrier, the generated
> > sequence for a seq-cst atomic load will be 7 insns:
> > 
> >   asm volatile ("":::"memory");
> >   machine_seq_cst_fence ();
> >   asm volatile ("":::"memory");
> >   dst = mem[src];
> >   asm volatile ("":::"memory");
> >   machine_seq_cst_fence ();
> >   asm volatile ("":::"memory");
> > 
> > I can easily imagine people looking at RTL dumps being unhappy about this
> > overkill fencing.
> But the extra fences aren't actually going to impact anything except
> perhaps an unmeasurable compile-time hit.  ie, it may look bad, but I'd
> have a hard time believing it matters in practice.

I agree it doesn't matter that much from a compile-time/memory standpoint.
I meant it matters from the standpoint of a person working with the RTL dumps,
i.e. having to wade through all that, especially if they need to work with
non-slim dumps.

> > I'd be more happy with detecting empty expansion via get_last_insn ().
> That seems like an unnecessary complication to me.

I think it's quite simple by GCC standards:

  {
    rtx_insn *last = get_last_insn ();
    emit_insn (targetm.gen_mem_thread_fence (GEN_INT (model)));
    if (last == get_last_insn () && !is_mm_relaxed (model))
      expand_asm_memory_barrier ();
  }

> I'd prefer instead to just emit the necessary fencing in the generic code
> and update the MD docs so that maintainers know they don't have to emit
> the RTL fences themselves.

I agree the docs need an update, but hopefully not in this way.  The legacy
__sync_synchronize barrier has always been required to be a compiler barrier
when expanded to RTL, and it's quite natural to use the same RTL structure
for new atomic fences as the old memory_barrier expansion.  The only problem
in current practice seen so far is with empty expansion for new patterns.

Thanks.
Alexander


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-31 Thread Jeff Law
On 07/31/2017 11:02 AM, Alexander Monakov wrote:
> On Mon, 31 Jul 2017, Jeff Law wrote:
>>>> In the middle end patch, do we need a barrier before the fence as well?
>>>> The post-fence barrier prevents reordering the fence with anything which
>>>> follows the fence.  But do we have to also prevent reordering the fence
>>>> with prior instructions with any of the memory models?  This isn't my
>>>> area of expertise, so if it's a dumb question, don't hesitate to let me
>>>> know :-)
>>>
>>> That depends on how pessimistic we want to be with respect to backend
>>> getting it wrong.  My expectation here is that if a backend emits non-empty
>>> RTL, the produced sequence for the fence itself acts as a compiler memory
>>> barrier.
>> Perhaps. But do we really want to rely on that?  Emitting a scheduling
>> barrier prior to these atomics is virtually free.
> 
> Please consider that expand_mem_thread_fence is used to place fences around
> seq-cst atomic loads when the backend doesn't provide a direct pattern.
> With compiler barriers on both sides of the machine barrier, the generated
> sequence for a seq-cst atomic load will be 7 insns:
> 
>   asm volatile ("":::"memory");
>   machine_seq_cst_fence ();
>   asm volatile ("":::"memory");
>   dst = mem[src];
>   asm volatile ("":::"memory");
>   machine_seq_cst_fence ();
>   asm volatile ("":::"memory");
> 
> I can easily imagine people looking at RTL dumps being unhappy about this
> overkill fencing.
But the extra fences aren't actually going to impact anything except
perhaps an unmeasurable compile-time hit.  ie, it may look bad, but I'd
have a hard time believing it matters in practice.

> 
> I'd be more happy with detecting empty expansion via get_last_insn ().
That seems like an unnecessary complication to me.  I'd prefer instead
to just emit the necessary fencing in the generic code and update the MD
docs so that maintainers know they don't have to emit the RTL fences
themselves.

Jeff


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-31 Thread Alexander Monakov
On Mon, 31 Jul 2017, Jeff Law wrote:
> >> In the middle end patch, do we need a barrier before the fence as well?
> >> The post-fence barrier prevents reordering the fence with anything which
> >> follows the fence.  But do we have to also prevent reordering the fence
> >> with prior instructions with any of the memory models?  This isn't my
> >> area of expertise, so if it's a dumb question, don't hesitate to let me
> >> know :-)
> > 
> > That depends on how pessimistic we want to be with respect to backend
> > getting it wrong.  My expectation here is that if a backend emits non-empty
> > RTL, the produced sequence for the fence itself acts as a compiler memory
> > barrier.
> Perhaps. But do we really want to rely on that?  Emitting a scheduling
> barrier prior to these atomics is virtually free.

Please consider that expand_mem_thread_fence is used to place fences around
seq-cst atomic loads when the backend doesn't provide a direct pattern.
With compiler barriers on both sides of the machine barrier, the generated
sequence for a seq-cst atomic load will be 7 insns:

  asm volatile ("":::"memory");
  machine_seq_cst_fence ();
  asm volatile ("":::"memory");
  dst = mem[src];
  asm volatile ("":::"memory");
  machine_seq_cst_fence ();
  asm volatile ("":::"memory");

I can easily imagine people looking at RTL dumps being unhappy about this
overkill fencing.

I'd be more happy with detecting empty expansion via get_last_insn ().

Thanks.
Alexander


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-31 Thread Jeff Law
On 07/26/2017 12:13 PM, Alexander Monakov wrote:
> On Wed, 26 Jul 2017, Jeff Law wrote:
>> I'm not sure what you mean by extraneous compiler barriers -- isn't the
>> worst case scenario here that the target emits them as well?  So there
>> would be an extraneous one in that case, but that ought to be a "don't
>> care".
> 
> Yes, exactly this.
> 
>> In the middle end patch, do we need a barrier before the fence as well?
>> The post-fence barrier prevents reordering the fence with anything which
>> follows the fence.  But do we have to also prevent reordering the fence
>> with prior instructions with any of the memory models?  This isn't my
>> area of expertise, so if it's a dumb question, don't hesitate to let me
>> know :-)
> 
> That depends on how pessimistic we want to be with respect to backend
> getting it wrong.  My expectation here is that if a backend emits non-empty
> RTL, the produced sequence for the fence itself acts as a compiler memory
> barrier.
Perhaps. But do we really want to rely on that?  Emitting a scheduling
barrier prior to these atomics is virtually free.

Jeff


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-31 Thread Jeff Law
On 07/26/2017 02:02 PM, Alexander Monakov wrote:
> On Wed, 26 Jul 2017, Alexander Monakov wrote:
> 
>> On Wed, 26 Jul 2017, Jeff Law wrote:
>>> I'm not sure what you mean by extraneous compiler barriers -- isn't the
>>> worst case scenario here that the target emits them as well?  So there
>>> would be an extraneous one in that case, but that ought to be a "don't
>>> care".
>>
>> Yes, exactly this.
> 
> I've just realized that we can detect if the backend produced empty RTL
> sequence by looking at get_last_insn () before/after gen_mem_thread_fence.
> This way we can emit a compiler barrier iff the backend didn't emit a
> machine barrier.
We could.  But I suspect just emitting the barriers in the generic code
is the way to go.  If we have a redundant barrier, the compile time cost
is minimal and I don't think the additional code to avoid creating the
extra barrier is worth it.

jeff


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-26 Thread Alexander Monakov
On Wed, 26 Jul 2017, Alexander Monakov wrote:

> On Wed, 26 Jul 2017, Jeff Law wrote:
> > I'm not sure what you mean by extraneous compiler barriers -- isn't the
> > worst case scenario here that the target emits them as well?  So there
> > would be an extraneous one in that case, but that ought to be a "don't
> > care".
> 
> Yes, exactly this.

I've just realized that we can detect if the backend produced empty RTL
sequence by looking at get_last_insn () before/after gen_mem_thread_fence.
This way we can emit a compiler barrier iff the backend didn't emit a
machine barrier.

Alexander


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-26 Thread Alexander Monakov
On Wed, 26 Jul 2017, Jeff Law wrote:
> I'm not sure what you mean by extraneous compiler barriers -- isn't the
> worst case scenario here that the target emits them as well?  So there
> would be an extraneous one in that case, but that ought to be a "don't
> care".

Yes, exactly this.

> In the middle end patch, do we need a barrier before the fence as well?
> The post-fence barrier prevents reordering the fence with anything which
> follows the fence.  But do we have to also prevent reordering the fence
> with prior instructions with any of the memory models?  This isn't my
> area of expertise, so if it's a dumb question, don't hesitate to let me
> know :-)

That depends on how pessimistic we want to be with respect to backend
getting it wrong.  My expectation here is that if a backend emits non-empty
RTL, the produced sequence for the fence itself acts as a compiler memory
barrier.

Thanks!
Alexander


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-26 Thread Jeff Law
On 07/26/2017 11:19 AM, Alexander Monakov wrote:
> On Wed, 26 Jul 2017, Jeff Law wrote:
>> So I think this is up to the target maintainers.  I have no concerns
>> with enabling use of expand_asm_memory_barrier to be used outside of
>> optabs.  So if the s390/x86 maintainers want to go forward, the optabs
>> changes are pre-approved.
> 
> Please see the alternative middle-end patch:
>  https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00549.html
I almost made the comment that we should look at whether we could do this
at a higher level.  But then I figured that the expander could be called
from multiple locations, thus requiring that all those points be fixed or
at least routed to a common function.

But it looks like all the fence stuff is routed into
expand_mem_thread_fence, so indeed that's a better place.

[ And if there were only a good way to enforce that we can't
  directly call the backend expander from anywhere other than
  the blessed location.  ]



> and the recently reported related bug:
>  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81316
> Since we got it wrong for both fences and loads/stores on two different
> targets, perhaps fixing it once and for all in the middle-end is more
> appropriate.  I hope extraneous compiler barriers can be tolerated.
Agreed on middle-end being more appropriate.

I'm not sure what you mean by extraneous compiler barriers -- isn't the
worst case scenario here that the target emits them as well?  So there
would be an extraneous one in that case, but that ought to be a "don't
care".

In the middle end patch, do we need a barrier before the fence as well?
The post-fence barrier prevents reordering the fence with anything which
follows the fence.  But do we have to also prevent reordering the fence
with prior instructions with any of the memory models?  This isn't my
area of expertise, so if it's a dumb question, don't hesitate to let me
know :-)


> 
> Yep - the same applies to atomic loads/stores too.  They can be expanded
> as volatile accesses, but we need compiler barrier(s) anyway (except for
> those with relaxed memory model).
Right.
jeff


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-26 Thread Alexander Monakov
On Wed, 26 Jul 2017, Jeff Law wrote:
> So I think this is up to the target maintainers.  I have no concerns
> with enabling use of expand_asm_memory_barrier to be used outside of
> optabs.  So if the s390/x86 maintainers want to go forward, the optabs
> changes are pre-approved.

Please see the alternative middle-end patch:
 https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00549.html
and the recently reported related bug:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81316

Since we got it wrong for both fences and loads/stores on two different targets,
perhaps fixing it once and for all in the middle-end is more appropriate.  I
hope extraneous compiler barriers can be tolerated.

> The reasoning seems sound to me -- we may not need fencing at the
> hardware level because it gives a certain set of guarantees, but we may
> still need them at the compiler level to prevent undesirable code
> motions across the fence point in the compiler.

Yep - the same applies to atomic loads/stores too.  They can be expanded
as volatile accesses, but we need compiler barrier(s) anyway (except for
those with relaxed memory model).

Thanks.
Alexander


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-26 Thread Jeff Law
On 05/26/2017 04:58 AM, Alexander Monakov wrote:
> On Wed, 17 May 2017, Alexander Monakov wrote:
> 
>> Ping.
> 
> Ping^2?
> 
>> (to be clear, patch 2/2 is my previous followup in this thread, I forgot to
>> adjust the subject line; it should have said:
>> "[PATCH 2/2] x86: add compiler memory barriers when expanding atomic_load").
>>
>> On Wed, 10 May 2017, Alexander Monakov wrote:
>>
>>> Hi,
>>>
>>> When expanding __atomic_thread_fence(x) to RTL, the i386 backend doesn't
>>> emit any instruction except for x==__ATOMIC_SEQ_CST (which emits 'mfence').
>>> This is incorrect: although no machine barrier is needed, the compiler still
>>> must emit a compiler barrier into the IR to prevent propagation and code
>>> motion across the fence.  The testcase added with the patch shows how it can
>>> lead to a miscompilation.
>>>
>>> The proposed patch fixes it by handling non-seq-cst fences exactly like
>>> __atomic_signal_fence is expanded, by emitting asm volatile("":::"memory").
>>>
>>> The s390 backend uses a similar mem_thread_fence expansion, so the patch
>>> fixes both backends in the same manner.
>>>
>>> Bootstrapped and regtested on x86_64; also checked that s390-linux cc1
>>> successfully builds after the change.  OK for trunk?
>>>
>>> (the original source code in the PR was misusing atomic fences by doing
>>> something like
>>>
>>>   void f(int *p)
>>>   {
>>> while (*p)
>>>   __atomic_thread_fence(__ATOMIC_ACQUIRE);
>>>   }
>>>
>>> but since *p is not atomic, a concurrent write to *p would cause a data race
>>> and thus invoke undefined behavior; also, if *p is false prior to entering
>>> the loop, execution does not encounter the fence; new test here has code
>>> usable without UB)
>>>
>>> Alexander
>>>
>>> * config/i386/sync.md (mem_thread_fence): Emit a compiler barrier for
>>> non-seq-cst fences.  Adjust comment.
>>> * config/s390/s390.md (mem_thread_fence): Likewise.
>>> * optabs.c (expand_asm_memory_barrier): Export.
>>> * optabs.h (expand_asm_memory_barrier): Declare.
>>> testsuite/
>>> * gcc.target/i386/pr80640-1.c: New testcase.
So I think this is up to the target maintainers.  I have no concerns
with enabling use of expand_asm_memory_barrier to be used outside of
optabs.  So if the s390/x86 maintainers want to go forward, the optabs
changes are pre-approved.

The reasoning seems sound to me -- we may not need fencing at the
hardware level because it gives a certain set of guarantees, but we may
still need them at the compiler level to prevent undesirable code
motions across the fence point in the compiler.

Jeff


Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-07-11 Thread Alexander Monakov
On Thu, 8 Jun 2017, Alexander Monakov wrote:
> Ping^3.

Ping^4: https://gcc.gnu.org/ml/gcc-patches/2017-05/msg00782.html

This is a wrong-code issue with C11 atomics: even if no machine barrier is
needed for a given fence type on this architecture, a compiler barrier must
be present in RTL.

Alternatively, it's possible to have a more complete and future-proof solution
by explicitly emitting a compiler barrier from the middle-end, leaving it up
to the backend to emit a machine barrier if needed.  The following patch could
achieve that (but at the cost of creating slightly redundant RTL on targets
that always emit some kind of memory barrier).

* optabs.c (expand_mem_thread_fence): Always emit a compiler barrier
if using mem_thread_fence expansion.

diff --git a/gcc/optabs.c b/gcc/optabs.c
index 8fd5d91..92080c3 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -6297,7 +6297,11 @@ void
 expand_mem_thread_fence (enum memmodel model)
 {
   if (targetm.have_mem_thread_fence ())
-    emit_insn (targetm.gen_mem_thread_fence (GEN_INT (model)));
+    {
+      emit_insn (targetm.gen_mem_thread_fence (GEN_INT (model)));
+      if (!is_mm_relaxed (model))
+	expand_asm_memory_barrier ();
+    }
   else if (!is_mm_relaxed (model))
 {
   if (targetm.have_memory_barrier ())




Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-06-07 Thread Alexander Monakov
On Fri, 26 May 2017, Alexander Monakov wrote:
> > Ping.
> 
> Ping^2?

Ping^3.

> > (to be clear, patch 2/2 is my previous followup in this thread, I forgot to
> > adjust the subject line; it should have said:
> > "[PATCH 2/2] x86: add compiler memory barriers when expanding atomic_load").
> > 
> > On Wed, 10 May 2017, Alexander Monakov wrote:
> > 
> > > Hi,
> > > 
> > > When expanding __atomic_thread_fence(x) to RTL, the i386 backend doesn't
> > > emit any instruction except for x==__ATOMIC_SEQ_CST (which emits
> > > 'mfence').  This is incorrect: although no machine barrier is needed, the
> > > compiler still must emit a compiler barrier into the IR to prevent
> > > propagation and code motion across the fence.  The testcase added with the
> > > patch shows how it can lead to a miscompilation.
> > > 
> > > The proposed patch fixes it by handling non-seq-cst fences exactly like
> > > __atomic_signal_fence is expanded, by emitting asm volatile("":::"memory").
> > > 
> > > The s390 backend uses a similar mem_thread_fence expansion, so the patch
> > > fixes both backends in the same manner.
> > > 
> > > Bootstrapped and regtested on x86_64; also checked that s390-linux cc1
> > > successfully builds after the change.  OK for trunk?
> > > 
> > > (the original source code in the PR was misusing atomic fences by doing
> > > something like
> > > 
> > >   void f(int *p)
> > >   {
> > > while (*p)
> > >   __atomic_thread_fence(__ATOMIC_ACQUIRE);
> > >   }
> > > 
> > > but since *p is not atomic, a concurrent write to *p would cause a data
> > > race and thus invoke undefined behavior; also, if *p is false prior to
> > > entering the loop, execution does not encounter the fence; new test here
> > > has code usable without UB)
> > > 
> > > Alexander
> > > 
> > >   * config/i386/sync.md (mem_thread_fence): Emit a compiler barrier for
> > >   non-seq-cst fences.  Adjust comment.
> > >   * config/s390/s390.md (mem_thread_fence): Likewise.
> > >   * optabs.c (expand_asm_memory_barrier): Export.
> > >   * optabs.h (expand_asm_memory_barrier): Declare.
> > > testsuite/
> > >   * gcc.target/i386/pr80640-1.c: New testcase.
> > > ---
> > >  gcc/config/i386/sync.md   |  7 ++-
> > >  gcc/config/s390/s390.md   | 11 +--
> > >  gcc/optabs.c  |  2 +-
> > >  gcc/optabs.h  |  3 +++
> > >  gcc/testsuite/gcc.target/i386/pr80640-1.c | 12 
> > >  5 files changed, 31 insertions(+), 4 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr80640-1.c
> > > 
> > > diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> > > index 20d46fe..619d53b 100644
> > > --- a/gcc/config/i386/sync.md
> > > +++ b/gcc/config/i386/sync.md
> > > @@ -108,7 +108,7 @@ (define_expand "mem_thread_fence"
> > >enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
> > >  
> > >/* Unless this is a SEQ_CST fence, the i386 memory model is strong
> > > - enough not to require barriers of any kind.  */
> > > + enough not to require a processor barrier of any kind.  */
> > >if (is_mm_seq_cst (model))
> > >  {
> > >rtx (*mfence_insn)(rtx);
> > > @@ -124,6 +124,11 @@ (define_expand "mem_thread_fence"
> > >  
> > >emit_insn (mfence_insn (mem));
> > >  }
> > > +  else if (!is_mm_relaxed (model))
> > > +{
> > > +  /* However, a compiler barrier is still required.  */
> > > +  expand_asm_memory_barrier ();
> > > +}
> > >DONE;
> > >  })
> > >  
> > > diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
> > > index c9fd19a..65e54c4 100644
> > > --- a/gcc/config/s390/s390.md
> > > +++ b/gcc/config/s390/s390.md
> > > @@ -10109,14 +10109,21 @@ (define_expand "mem_thread_fence"
> > >[(match_operand:SI 0 "const_int_operand")] ;; model
> > >""
> > >  {
> > > +  enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
> > > +
> > >/* Unless this is a SEQ_CST fence, the s390 memory model is strong
> > > - enough not to require barriers of any kind.  */
> > > -  if (is_mm_seq_cst (memmodel_from_int (INTVAL (operands[0]
> > > + enough not to require a processor barrier of any kind.  */
> > > +  if (is_mm_seq_cst (model))
> > >  {
> > >rtx mem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
> > >MEM_VOLATILE_P (mem) = 1;
> > >emit_insn (gen_mem_thread_fence_1 (mem));
> > >  }
> > > +  else if (!is_mm_relaxed (model))
> > > +{
> > > +  /* However, a compiler barrier is still required.  */
> > > +  expand_asm_memory_barrier ();
> > > +}
> > >DONE;
> > >  })
> > >  
> > > diff --git a/gcc/optabs.c b/gcc/optabs.c
> > > index 48e37f8..1f1fbc3 100644
> > > --- a/gcc/optabs.c
> > > +++ b/gcc/optabs.c
> > > @@ -6269,7 +6269,7 @@ expand_atomic_compare_and_swap (rtx *ptarget_bool, rtx *ptarget_oval,
> > >  
> > >  /* Generate asm volatile("" : : : "memory") as the memory barrier.  */

Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-05-26 Thread Alexander Monakov
On Wed, 17 May 2017, Alexander Monakov wrote:

> Ping.

Ping^2?

> (to be clear, patch 2/2 is my previous followup in this thread, I forgot to
> adjust the subject line; it should have said:
> "[PATCH 2/2] x86: add compiler memory barriers when expanding atomic_load").
> 
> On Wed, 10 May 2017, Alexander Monakov wrote:
> 
> > Hi,
> > 
> > When expanding __atomic_thread_fence(x) to RTL, the i386 backend doesn't
> > emit any instruction except for x==__ATOMIC_SEQ_CST (which emits 'mfence').
> > This is incorrect: although no machine barrier is needed, the compiler still
> > must emit a compiler barrier into the IR to prevent propagation and code
> > motion across the fence.  The testcase added with the patch shows how it can
> > lead to a miscompilation.
> > 
> > The proposed patch fixes it by handling non-seq-cst fences exactly like
> > __atomic_signal_fence is expanded, by emitting asm volatile("":::"memory").
> > 
> > The s390 backend uses a similar mem_thread_fence expansion, so the patch
> > fixes both backends in the same manner.
> > 
> > Bootstrapped and regtested on x86_64; also checked that s390-linux cc1
> > successfully builds after the change.  OK for trunk?
> > 
> > (the original source code in the PR was misusing atomic fences by doing
> > something like
> > 
> >   void f(int *p)
> >   {
> > while (*p)
> >   __atomic_thread_fence(__ATOMIC_ACQUIRE);
> >   }
> > 
> > but since *p is not atomic, a concurrent write to *p would cause a data
> > race and thus invoke undefined behavior; also, if *p is false prior to
> > entering the loop, execution does not encounter the fence; new test here
> > has code usable without UB)
> > 
> > Alexander
> > 
> > * config/i386/sync.md (mem_thread_fence): Emit a compiler barrier for
> > non-seq-cst fences.  Adjust comment.
> > * config/s390/s390.md (mem_thread_fence): Likewise.
> > * optabs.c (expand_asm_memory_barrier): Export.
> > * optabs.h (expand_asm_memory_barrier): Declare.
> > testsuite/
> > * gcc.target/i386/pr80640-1.c: New testcase.
> > ---
> >  gcc/config/i386/sync.md   |  7 ++-
> >  gcc/config/s390/s390.md   | 11 +--
> >  gcc/optabs.c  |  2 +-
> >  gcc/optabs.h  |  3 +++
> >  gcc/testsuite/gcc.target/i386/pr80640-1.c | 12 
> >  5 files changed, 31 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr80640-1.c
> > 
> > diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> > index 20d46fe..619d53b 100644
> > --- a/gcc/config/i386/sync.md
> > +++ b/gcc/config/i386/sync.md
> > @@ -108,7 +108,7 @@ (define_expand "mem_thread_fence"
> >enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
> >  
> >/* Unless this is a SEQ_CST fence, the i386 memory model is strong
> > - enough not to require barriers of any kind.  */
> > + enough not to require a processor barrier of any kind.  */
> >if (is_mm_seq_cst (model))
> >  {
> >rtx (*mfence_insn)(rtx);
> > @@ -124,6 +124,11 @@ (define_expand "mem_thread_fence"
> >  
> >emit_insn (mfence_insn (mem));
> >  }
> > +  else if (!is_mm_relaxed (model))
> > +{
> > +  /* However, a compiler barrier is still required.  */
> > +  expand_asm_memory_barrier ();
> > +}
> >DONE;
> >  })
> >  
> > diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
> > index c9fd19a..65e54c4 100644
> > --- a/gcc/config/s390/s390.md
> > +++ b/gcc/config/s390/s390.md
> > @@ -10109,14 +10109,21 @@ (define_expand "mem_thread_fence"
> >[(match_operand:SI 0 "const_int_operand")]   ;; model
> >""
> >  {
> > +  enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
> > +
> >/* Unless this is a SEQ_CST fence, the s390 memory model is strong
> > - enough not to require barriers of any kind.  */
> > -  if (is_mm_seq_cst (memmodel_from_int (INTVAL (operands[0]
> > + enough not to require a processor barrier of any kind.  */
> > +  if (is_mm_seq_cst (model))
> >  {
> >rtx mem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
> >MEM_VOLATILE_P (mem) = 1;
> >emit_insn (gen_mem_thread_fence_1 (mem));
> >  }
> > +  else if (!is_mm_relaxed (model))
> > +{
> > +  /* However, a compiler barrier is still required.  */
> > +  expand_asm_memory_barrier ();
> > +}
> >DONE;
> >  })
> >  
> > diff --git a/gcc/optabs.c b/gcc/optabs.c
> > index 48e37f8..1f1fbc3 100644
> > --- a/gcc/optabs.c
> > +++ b/gcc/optabs.c
> > @@ -6269,7 +6269,7 @@ expand_atomic_compare_and_swap (rtx *ptarget_bool, 
> > rtx *ptarget_oval,
> >  
> >  /* Generate asm volatile("" : : : "memory") as the memory barrier.  */
> >  
> > -static void
> > +void
> >  expand_asm_memory_barrier (void)
> >  {
> >rtx asm_op, clob;
> > diff --git a/gcc/optabs.h b/gcc/optabs.h
> > index b2dd31a..aca6755 100644
> > --- a/gcc/optabs.h

Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-05-17 Thread Alexander Monakov
Ping.

(to be clear, patch 2/2 is my previous followup in this thread, I forgot to
adjust the subject line; it should have said:
"[PATCH 2/2] x86: add compiler memory barriers when expanding atomic_load").

On Wed, 10 May 2017, Alexander Monakov wrote:

> Hi,
> 
> When expanding __atomic_thread_fence(x) to RTL, the i386 backend doesn't emit
> any instruction except for x==__ATOMIC_SEQ_CST (which emits 'mfence').  This 
> is incorrect: although no machine barrier is needed, the compiler still must
> emit a compiler barrier into the IR to prevent propagation and code motion
> across the fence.  The testcase added with the patch shows how it can lead
> to a miscompilation.
> 
> The proposed patch fixes it by handling non-seq-cst fences exactly like
> __atomic_signal_fence is expanded, by emitting asm volatile("":::"memory").
> 
> The s390 backend uses a similar mem_thread_fence expansion, so the patch
> fixes both backends in the same manner.
> 
> Bootstrapped and regtested on x86_64; also checked that s390-linux cc1
> successfully builds after the change.  OK for trunk?
> 
> (the original source code in the PR was misusing atomic fences by doing
> something like
> 
>   void f(int *p)
>   {
> while (*p)
>   __atomic_thread_fence(__ATOMIC_ACQUIRE);
>   }
> 
> but since *p is not atomic, a concurrent write to *p would cause a data race
> and thus invoke undefined behavior; also, if *p is false prior to entering the
> loop, execution does not encounter the fence; new test here has code usable
> without UB)
> 
> Alexander
> 
>   * config/i386/sync.md (mem_thread_fence): Emit a compiler barrier for
>   non-seq-cst fences.  Adjust comment.
>   * config/s390/s390.md (mem_thread_fence): Likewise.
>   * optabs.c (expand_asm_memory_barrier): Export.
>   * optabs.h (expand_asm_memory_barrier): Declare.
> testsuite/
>   * gcc.target/i386/pr80640-1.c: New testcase.
> ---
>  gcc/config/i386/sync.md   |  7 ++-
>  gcc/config/s390/s390.md   | 11 +--
>  gcc/optabs.c  |  2 +-
>  gcc/optabs.h  |  3 +++
>  gcc/testsuite/gcc.target/i386/pr80640-1.c | 12 
>  5 files changed, 31 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr80640-1.c
> 
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index 20d46fe..619d53b 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -108,7 +108,7 @@ (define_expand "mem_thread_fence"
>enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
>  
>/* Unless this is a SEQ_CST fence, the i386 memory model is strong
> - enough not to require barriers of any kind.  */
> + enough not to require a processor barrier of any kind.  */
>if (is_mm_seq_cst (model))
>  {
>rtx (*mfence_insn)(rtx);
> @@ -124,6 +124,11 @@ (define_expand "mem_thread_fence"
>  
>emit_insn (mfence_insn (mem));
>  }
> +  else if (!is_mm_relaxed (model))
> +{
> +  /* However, a compiler barrier is still required.  */
> +  expand_asm_memory_barrier ();
> +}
>DONE;
>  })
>  
> diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
> index c9fd19a..65e54c4 100644
> --- a/gcc/config/s390/s390.md
> +++ b/gcc/config/s390/s390.md
> @@ -10109,14 +10109,21 @@ (define_expand "mem_thread_fence"
>[(match_operand:SI 0 "const_int_operand")] ;; model
>""
>  {
> +  enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
> +
>/* Unless this is a SEQ_CST fence, the s390 memory model is strong
> - enough not to require barriers of any kind.  */
> -  if (is_mm_seq_cst (memmodel_from_int (INTVAL (operands[0]
> + enough not to require a processor barrier of any kind.  */
> +  if (is_mm_seq_cst (model))
>  {
>rtx mem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
>MEM_VOLATILE_P (mem) = 1;
>emit_insn (gen_mem_thread_fence_1 (mem));
>  }
> +  else if (!is_mm_relaxed (model))
> +{
> +  /* However, a compiler barrier is still required.  */
> +  expand_asm_memory_barrier ();
> +}
>DONE;
>  })
>  
> diff --git a/gcc/optabs.c b/gcc/optabs.c
> index 48e37f8..1f1fbc3 100644
> --- a/gcc/optabs.c
> +++ b/gcc/optabs.c
> @@ -6269,7 +6269,7 @@ expand_atomic_compare_and_swap (rtx *ptarget_bool, rtx *ptarget_oval,
>  
>  /* Generate asm volatile("" : : : "memory") as the memory barrier.  */
>  
> -static void
> +void
>  expand_asm_memory_barrier (void)
>  {
>rtx asm_op, clob;
> diff --git a/gcc/optabs.h b/gcc/optabs.h
> index b2dd31a..aca6755 100644
> --- a/gcc/optabs.h
> +++ b/gcc/optabs.h
> @@ -322,6 +322,9 @@ extern bool expand_atomic_compare_and_swap (rtx *, rtx *, rtx, rtx, rtx, bool,
>  extern void expand_mem_thread_fence (enum memmodel);
>  extern void expand_mem_signal_fence (enum memmodel);
>  
> +/* Generate a compile-time memory barrier.  */
> +extern void 

Re: [PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-05-10 Thread Alexander Monakov
While fixing the fences issue of PR80640 I've noticed that a similar oversight
is present in expansion of atomic loads on x86: they become volatile loads, but
that is insufficient: a compiler memory barrier is still needed.  Volatility
prevents tearing the load (preserves non-divisibility of atomic access), but
does not prevent propagation and code motion of non-volatile accesses across the
load.  The testcase below demonstrates that even SEQ_CST loads are mishandled.

It's less clear how to fix this one.  AFAICS there are two possible approaches:
either emit explicit compiler barriers on either side of the load (a barrier
before the load is needed for SEQ_CST loads), or change how atomic loads are
expressed in RTL: atomic stores already use UNSPEC_STA, so they don't
suffer from this issue.  The asymmetry seems odd to me.

The former approach is easier and exposes non-seq-cst loads upwards for
optimization, so that's what the proposed patch implements.

The s390 backend suffers from the same issue, except there the atomic stores are
affected too.

Alexander

* config/i386/sync.md (atomic_load&lt;mode&gt;): Emit a compiler barrier
before (only for SEQ_CST order) and after the load.
testsuite/
* gcc.target/i386/pr80640-2.c: New testcase.

---
 gcc/config/i386/sync.md   |  7 +++
 gcc/testsuite/gcc.target/i386/pr80640-2.c | 11 +++
 2 files changed, 18 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr80640-2.c

diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index 619d53b..35a7b38 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -155,6 +155,11 @@ (define_expand "atomic_load<mode>"
   UNSPEC_LDA))]
   ""
 {
+  enum memmodel model = memmodel_from_int (INTVAL (operands[2]));
+
+  if (is_mm_seq_cst (model))
+    expand_asm_memory_barrier ();
+
   /* For DImode on 32-bit, we can use the FPU to perform the load.  */
  if (<MODE>mode == DImode && !TARGET_64BIT)
 emit_insn (gen_atomic_loaddi_fpu
@@ -173,6 +178,8 @@ (define_expand "atomic_load<mode>"
   if (dst != operands[0])
emit_move_insn (operands[0], dst);
 }
+  if (!is_mm_relaxed (model))
+    expand_asm_memory_barrier ();
   DONE;
 })
 
diff --git a/gcc/testsuite/gcc.target/i386/pr80640-2.c b/gcc/testsuite/gcc.target/i386/pr80640-2.c
new file mode 100644
index 0000000..f0a050c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr80640-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-ter" } */
+/* { dg-final { scan-assembler-not "xorl\[\\t \]*\\\%eax,\[\\t \]*%eax" } } */
+
+int f(int *p, volatile _Bool *p1, _Atomic _Bool *p2)
+{
+  int v = *p;
+  *p1 = !v;
+  while (__atomic_load_n (p2, __ATOMIC_SEQ_CST));
+  return v - *p;
+}
-- 
1.8.3.1



[PATCH 1/2] x86,s390: add compiler memory barriers when expanding atomic_thread_fence (PR 80640)

2017-05-10 Thread Alexander Monakov
Hi,

When expanding __atomic_thread_fence(x) to RTL, the i386 backend doesn't emit
any instruction except for x==__ATOMIC_SEQ_CST (which emits 'mfence').  This 
is incorrect: although no machine barrier is needed, the compiler still must
emit a compiler barrier into the IR to prevent propagation and code motion
across the fence.  The testcase added with the patch shows how it can lead
to a miscompilation.

The proposed patch fixes this by expanding non-seq-cst fences exactly like
__atomic_signal_fence, i.e. by emitting asm volatile("":::"memory").

The s390 backend uses a similar mem_thread_fence expansion, so the patch
fixes both backends in the same manner.

Bootstrapped and regtested on x86_64; also checked that s390-linux cc1
successfully builds after the change.  OK for trunk?

(the original source code in the PR was misusing atomic fences by doing
something like

  void f(int *p)
  {
while (*p)
  __atomic_thread_fence(__ATOMIC_ACQUIRE);
  }

but since *p is not atomic, a concurrent write to *p would cause a data race and
thus invoke undefined behavior; also, if *p is false prior to entering the loop,
execution does not encounter the fence; new test here has code usable without 
UB)

Alexander

* config/i386/sync.md (mem_thread_fence): Emit a compiler barrier for
non-seq-cst fences.  Adjust comment.
* config/s390/s390.md (mem_thread_fence): Likewise.
* optabs.c (expand_asm_memory_barrier): Export.
* optabs.h (expand_asm_memory_barrier): Declare.
testsuite/
* gcc.target/i386/pr80640-1.c: New testcase.
---
 gcc/config/i386/sync.md   |  7 ++-
 gcc/config/s390/s390.md   | 11 +--
 gcc/optabs.c  |  2 +-
 gcc/optabs.h  |  3 +++
 gcc/testsuite/gcc.target/i386/pr80640-1.c | 12 
 5 files changed, 31 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr80640-1.c

diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index 20d46fe..619d53b 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -108,7 +108,7 @@ (define_expand "mem_thread_fence"
   enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
 
   /* Unless this is a SEQ_CST fence, the i386 memory model is strong
- enough not to require barriers of any kind.  */
+ enough not to require a processor barrier of any kind.  */
   if (is_mm_seq_cst (model))
 {
   rtx (*mfence_insn)(rtx);
@@ -124,6 +124,11 @@ (define_expand "mem_thread_fence"
 
   emit_insn (mfence_insn (mem));
 }
+  else if (!is_mm_relaxed (model))
+    {
+      /* However, a compiler barrier is still required.  */
+      expand_asm_memory_barrier ();
+    }
   DONE;
 })
 
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index c9fd19a..65e54c4 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -10109,14 +10109,21 @@ (define_expand "mem_thread_fence"
   [(match_operand:SI 0 "const_int_operand")]   ;; model
   ""
 {
+  enum memmodel model = memmodel_from_int (INTVAL (operands[0]));
+
   /* Unless this is a SEQ_CST fence, the s390 memory model is strong
- enough not to require barriers of any kind.  */
-  if (is_mm_seq_cst (memmodel_from_int (INTVAL (operands[0]
+ enough not to require a processor barrier of any kind.  */
+  if (is_mm_seq_cst (model))
 {
   rtx mem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
   MEM_VOLATILE_P (mem) = 1;
   emit_insn (gen_mem_thread_fence_1 (mem));
 }
+  else if (!is_mm_relaxed (model))
+    {
+      /* However, a compiler barrier is still required.  */
+      expand_asm_memory_barrier ();
+    }
   DONE;
 })
 
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 48e37f8..1f1fbc3 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -6269,7 +6269,7 @@ expand_atomic_compare_and_swap (rtx *ptarget_bool, rtx *ptarget_oval,
 
 /* Generate asm volatile("" : : : "memory") as the memory barrier.  */
 
-static void
+void
 expand_asm_memory_barrier (void)
 {
   rtx asm_op, clob;
diff --git a/gcc/optabs.h b/gcc/optabs.h
index b2dd31a..aca6755 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -322,6 +322,9 @@ extern bool expand_atomic_compare_and_swap (rtx *, rtx *, rtx, rtx, rtx, bool,
 extern void expand_mem_thread_fence (enum memmodel);
 extern void expand_mem_signal_fence (enum memmodel);
 
+/* Generate a compile-time memory barrier.  */
+extern void expand_asm_memory_barrier (void);
+
 rtx expand_atomic_load (rtx, rtx, enum memmodel);
 rtx expand_atomic_store (rtx, rtx, enum memmodel, bool);
rtx expand_atomic_fetch_op (rtx, rtx, rtx, enum rtx_code, enum memmodel, bool);
diff --git a/gcc/testsuite/gcc.target/i386/pr80640-1.c b/gcc/testsuite/gcc.target/i386/pr80640-1.c
new file mode 100644
index 0000000..f1d1e55
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr80640-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
/* { dg-options "-O2 -fno-tree-ter" } */