[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-12 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

David Edelsohn dje at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||WONTFIX

--- Comment #15 from David Edelsohn dje at gcc dot gnu.org 2012-01-12 
20:33:57 UTC ---
It looks like POWER needs to accept and deal with the sequentially consistent
semantics now specified for the __sync_* intrinsics, so I am closing this bug
as WONTFIX.  But we still need to fix the libstdc++ regression created by this
change.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #4 from Richard Guenther rguenth at gcc dot gnu.org 2012-01-10 
09:43:16 UTC ---
(In reply to comment #3)
  It says above them In most cases, these
  builtins are considered a full barrier. and only __sync_lock_test_and_set 
  and
  __sync_lock_release specify different barrier semantics.
 
 The next sentence is:
 
 That is, no memory operand will be moved across the operation, either forward
 or
 backward.
 
 Note that this refers to memory operands, not memory operations -- memory
 stores and memory loads referenced in documentation of the other sync 
 builtins.
  In other words, one could interpret full memory barrier as:
 
 asm volatile ( : : : memory);
 
 that refers to a GCC scheduling barrier.
 
 The GCC documentation references Intel processors, which do not have have a
 distinction between instructions for RELEASE, ACQ_REL and SEQ_CST semantics.
 
 The basic problem is that the GCC builtins and atomic instruction semantics
 were designed for Intel processors that do not provide the level of 
 granularity
 implemented in POWER processors.  The POWER port implemented lighter weight
 ACQ_REL semantics. Retrofitting the original builtins on the new C++11 memory
 model semantics and imposing SEQ_CST interpretation has changed the behavior
 and performance on POWER, but not on other targets.

But for more precise semantics we now have the __atomic_* builtins, right?
And the __sync_* ones are deprecated.  I don't see how we can preserve
old behavior for the __sync_* ones without adding a new target hook.
The documentation would need to be adjusted, of course, I'm not sure
that different atomic semantics are useful for these portable
synchronization primitives?

Thus, fixing the libstdc++ side seems worthwhile, but I'm not sure
with respect to the deprecated __sync builtins?


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #5 from David Edelsohn dje at gcc dot gnu.org 2012-01-10 14:39:16 
UTC ---
I understand that fixing __sync_* is a hassle.  This is why I opened a separate
bug for libstdc++.

While __sync_* is deprecated in favor of __atomic_*, use of __sync_* for
portability is fairly pervasive in FOSS applications that need it because of
its implementation in GCC.  Most programmers do not know about memory models
and do not care about memory models.  And it will take time for programmers to
switch to __atomic_*, if they even bother to choose a memory model and don't
introduce a bug.

The basic problem is MEMMODEL_SEQ_CST only makes a performance difference for
POWER and developers are going to continue to use __sync_* builtins for a
while.  This change in default behavior only hurts performance for applications
on POWER relative to all other architectures, which sucks. :-(


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #6 from Richard Guenther rguenth at gcc dot gnu.org 2012-01-10 
14:48:54 UTC ---
(In reply to comment #5)
 I understand that fixing __sync_* is a hassle.  This is why I opened a 
 separate
 bug for libstdc++.
 
 While __sync_* is deprecated in favor of __atomic_*, use of __sync_* for
 portability is fairly pervasive in FOSS applications that need it because of
 its implementation in GCC.  Most programmers do not know about memory models
 and do not care about memory models.  And it will take time for programmers to
 switch to __atomic_*, if they even bother to choose a memory model and don't
 introduce a bug.
 
 The basic problem is MEMMODEL_SEQ_CST only makes a performance difference for
 POWER and developers are going to continue to use __sync_* builtins for a
 while.  This change in default behavior only hurts performance for 
 applications
 on POWER relative to all other architectures, which sucks. :-(

Yes, I see that.  But my question is - did a developer reading the
documentation
get _correct_ code on POWER (which uses a laxer memory model than documented!)
in all cases?  Or can you construct a testcase that works fine on IA64
while surprisingly (after reading docs) does not work on POWER?  Thus, didn't
we simply fix a wrong-code bug (albeit by producing slower code)?


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #7 from Jonathan Wakely redi at gcc dot gnu.org 2012-01-10 
15:29:09 UTC ---
(In reply to comment #3)
  It says above them In most cases, these
  builtins are considered a full barrier. and only __sync_lock_test_and_set 
  and
  __sync_lock_release specify different barrier semantics.
 
 The next sentence is:
 
 That is, no memory operand will be moved across the operation, either forward
 or
 backward.

It continues:
Further, instructions will be issued as necessary to prevent the processor
from speculating loads across the operation and from queuing stores after the
operation.

Doesn't that rule out lwsync, which allows loads to be moved to before the
store preceding the lwsync?


Isn't the fact users will continue to use the __sync builtins all the more
reason they should work as documented?


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #8 from David Edelsohn dje at gcc dot gnu.org 2012-01-10 18:08:23 
UTC ---
For the way that programmers use __sync_* builtins, release or acquire-release
semantics are sufficient.  As we see in libstdc++, release semantics are overly
strict when incrementing the reference, as opposed to destroying an object.

Again, there is no cost to Intel specifying sequential consistency as opposed
to a slightly weaker memory model.  Intel chose a memory model that matched and
benefited their architecture.  IBM should have the freedom to choose memory
models that benefit its architectures.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #9 from Jonathan Wakely redi at gcc dot gnu.org 2012-01-10 
18:20:00 UTC ---
(In reply to comment #8)
 For the way that programmers use __sync_* builtins, release or acquire-release
 semantics are sufficient.

Are you claiming you know how all programmers have used those builtins?

  As we see in libstdc++, release semantics are overly
 strict when incrementing the reference, as opposed to destroying an object.
 
 Again, there is no cost to Intel specifying sequential consistency as opposed
 to a slightly weaker memory model.  Intel chose a memory model that matched 
 and
 benefited their architecture.  IBM should have the freedom to choose memory
 models that benefit its architectures.

How or why Intel chose those semantics is not really relevant, the fact is that
with the exception of lock_test_and_set and lock_release the __sync builtins
are documented to be full barriers, and always have been documented as full
barriers, both in GCC's and Intel's docs. If someone uses them in their code
relying on the fact they'll get a full barrier, then someone else runs that
code on POWER and there's a bug because it isn't a full barrier, who is to
blame for the bug?

That said, I don't have any personal interest in what they do on POWER, as long
as I'm not asked to deal with any libstdc++ bugs resulting from the builtins
not behaving as documented.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #10 from Jonathan Wakely redi at gcc dot gnu.org 2012-01-10 
18:22:01 UTC ---
(In reply to comment #8)
 IBM should have the freedom to choose memory
 models that benefit its architectures.

I meant to add that the new __atomic builtins give that freedom, by design.

Changing the documented behaviour of the __sync builtins to do something they
weren't designed for doesn't seem like a good idea - if they're inefficient
then convince people not to use them.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #11 from David Edelsohn dje at gcc dot gnu.org 2012-01-10 
18:33:56 UTC ---
Jonathan,

I understand that the new __atomic_* builtins provide that flexibility.

I also am pointing out that GCC followed Intel semantics and Intel chose
semantics to benefit them.  Why should GCC prefer and impose semantics that
advantage a particular architecture or vendor?

You can argue that the __sync_* definition is the playing field and the rule
and POWER has to deal with it.  And I am responding that the builtins were
implemented appropriate for POWER and now GCC has moved the goal post.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #12 from Jakub Jelinek jakub at gcc dot gnu.org 2012-01-10 
18:46:59 UTC ---
If the PPC implementation of __sync_fetch_and_* and __sync_*_and_fetch was
intentionally implemented to violate the documentation, then the documentation
should have been adjusted at that point to note that on PPC it behaves
differently.  Otherwise it should be just considered a bug in the
implementation that it didn't implement the documented semantics.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread amacleod at redhat dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

Andrew Macleod amacleod at redhat dot com changed:

   What|Removed |Added

 CC||amacleod at redhat dot com

--- Comment #13 from Andrew Macleod amacleod at redhat dot com 2012-01-10 
19:13:44 UTC ---
I don't think it matters what the original rationale was for implementing
__sync. They are documented as being seq-cst.

I would argue that power implementing them not as seq-cst is not truly
compliant.  It shouldn't be too difficult to write a test case using __sync
which expects seq-cst as documented that would pass on all targets except
power.

It may be that power-specific programmers have been taking advantage of this,
but doesn't it has to have the potential to break generic code?... (lucky TM
only needed relaxed atomics!)

That said... it would be possible to simply have an optional macro with a
default definition of 
  #define __SYNC_LEGACY_MODE   MEMMODEL_SEQ_CST
and if a port like power wants something different, it can define
  #define __SYNC_LEGACY_MODE   MEMMODEL_ACQ_REL

then use that value instead of MEMMODEL_SEQ_CST for legacy __sync routines.

Would that satisfy all your requirements?  shouldn't really need any other
changes then.  Although as Jakub says, it should then be clearly documented.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #14 from David Edelsohn dje at gcc dot gnu.org 2012-01-10 
20:25:57 UTC ---
Jakub,

The POWER implementation of __sync_* was not written intentionally to violate
the documentation.  I don't think those of us who implemented the feature on
POWER realized the documentation was trying to require sequential consistency.

Given this clarification, the POWER maintainers need to figure out what
implementation is appropriate.

The lwsync implementation was more than sufficient for the common use of atomic
ops, like fetch_and_op.  That other architectures provide stronger semantics
is a bonus.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-09 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P1

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2012-01-09 
15:40:10 UTC ---
The docs of __sync_* say

This builtin is not a full barrier, but rather an @dfn{acquire barrier}.
This means that references after the builtin cannot move to (or be
speculated to) before the builtin, but previous memory stores may not
be globally visible yet, and previous memory loads may not yet be
satisfied.

But it is not exactly clear to which builtins this applies.  Thus, is
the intended behavior actually target depedent?


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-09 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #2 from Jonathan Wakely redi at gcc dot gnu.org 2012-01-09 
15:45:06 UTC ---
(In reply to comment #1)
 The docs of __sync_* say
 
 This builtin is not a full barrier, but rather an @dfn{acquire barrier}.
 This means that references after the builtin cannot move to (or be
 speculated to) before the builtin, but previous memory stores may not
 be globally visible yet, and previous memory loads may not yet be
 satisfied.
 
 But it is not exactly clear to which builtins this applies.  Thus, is
 the intended behavior actually target depedent?

It refers to __sync_lock_test_and_set only (it says this builtin and follows
that one)

And This builtin is not a full barrier, but rather a release barrier. refers
to __sync_lock_release.

All the others are full barriers.  It says above them In most cases, these
builtins are considered a full barrier. and only __sync_lock_test_and_set and
__sync_lock_release specify different barrier semantics.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-09 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

--- Comment #3 from David Edelsohn dje at gcc dot gnu.org 2012-01-09 16:49:10 
UTC ---
 It says above them In most cases, these
 builtins are considered a full barrier. and only __sync_lock_test_and_set and
 __sync_lock_release specify different barrier semantics.

The next sentence is:

That is, no memory operand will be moved across the operation, either forward
or
backward.

Note that this refers to memory operands, not memory operations -- memory
stores and memory loads referenced in documentation of the other sync builtins.
 In other words, one could interpret full memory barrier as:

asm volatile ( : : : memory);

that refers to a GCC scheduling barrier.

The GCC documentation references Intel processors, which do not have have a
distinction between instructions for RELEASE, ACQ_REL and SEQ_CST semantics.

The basic problem is that the GCC builtins and atomic instruction semantics
were designed for Intel processors that do not provide the level of granularity
implemented in POWER processors.  The POWER port implemented lighter weight
ACQ_REL semantics. Retrofitting the original builtins on the new C++11 memory
model semantics and imposing SEQ_CST interpretation has changed the behavior
and performance on POWER, but not on other targets.


[Bug middle-end/51766] [4.7 regression] sync_fetch_and_xxx atomicity

2012-01-05 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51766

David Edelsohn dje at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2012-01-05
 CC||rth at gcc dot gnu.org
  Known to work||4.6.0, 4.6.1
   Target Milestone|--- |4.7.0
 Ever Confirmed|0   |1
  Known to fail||4.7.0