[Bug c/84563] GCC interpretation of C11 atomics (DR 459)

2018-02-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #4 from torvald at gcc dot gnu.org ---
Closing after further discussion.

[Bug c/84563] GCC interpretation of C11 atomics (DR 459)

2018-02-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #3 from torvald at gcc dot gnu.org ---
This is not a bug; see https://gcc.gnu.org/ml/gcc/2018-02/msg00224.html for
further discussion.

[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning

2018-02-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #7 from torvald at gcc dot gnu.org ---
(In reply to Michael Poole from comment #0)
> When compiling for x86-64 with the -mcx16 flag, there is no diagnostic for
> code like this:
> 
> #include <sys/mman.h>
> __int128 test(void)
> {
> const void *ptr = mmap(0, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS,
> -1, 0);
> const __int128 *p_v = (const __int128 *)ptr;
> return __atomic_load_n(p_v, __ATOMIC_SEQ_CST);
> }
> 
> The CMPXCHG16B instruction that is generated unconditionally attempts to
> write to the address, which causes a fault at runtime.  When
> __atomic_load_n() uses that instruction, it seems prudent to reject a const
> pointer as the first argument.

I guess we haven't spelled that out in the docs, but the thought is that one
would use the __atomic builtins like one would use C11/C++11 atomics.  In that
world, only "address-free" atomic types are guaranteed to work when mapped from
another process, for example, and only lock-free types are address-free.  (That
doesn't quite cover all cases, such as process-private memory mapped read-only,
but it gives at least some idea of the intent.)

Users should ensure that the particular type is lock-free (or that atomic ops
on that type are) to know that it's safe to use atomic loads on read-only
memory.  The most recent GCC shouldn't declare the types to be lock-free just
because cmpxchg16b is available.

It might still use it under the covers, but the fact that it's not lock-free
indicates that the implementation is not just native atomic operations as
supported by the hardware, but something different.

I guess we need to spell this out in the docs.

[Bug target/80640] Missing memory side effect with __atomic_thread_fence (2)

2017-10-20 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80640

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #11 from torvald at gcc dot gnu.org ---
(In reply to Alexander Monakov from comment #7)
> I've submitted a patch [1] for the missing compiler barrier, but however
> please note that the original ompi code and the example in comment #3 are
> wrong: in a pattern like
> 
>   while (*foo)
> __atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> I think there are two issues; first, if *foo is a non-atomic, non-volatile
> object, a concurrent modification from another thread would invoke undefined
> behavior due to a data race;

It would be a data race even if *foo were volatile but still non-atomic.

I can see how making atomic_thread_fence a compiler barrier for non-atomics
might help existing buggy code, but it would be sufficient if it prevented
certain reordering of atomics.

Some more background and examples:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

[Bug middle-end/26461] liveness of thread local references across function calls

2017-03-31 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

--- Comment #15 from torvald at gcc dot gnu.org ---
From an ISO C/C++ conformance perspective, this is not a bug:

* Pre-C11/C++11, threading was basically undefined.

* Starting with C11 and C++11, "threads of execution" (as they're called in
C++11 and more recent, often abbreviated in the standard as "threads") are the
lowest level of how execution is modelled.  They are defined as single flows of
control.  Nothing is promised about any resources that may be used to implement
threads of execution (e.g., similar to the "execution context" notion mentioned
in comment #10).

* Thread-specific state is bound to a particular thread of execution (e.g.,
regarding lifetime).  A thread of execution accessing a __thread variable
accesses the thread-specific state of this thread of execution in the abstract
machine.  (Of course, one can still access other threads' thread-specific
state through pointers initially provided by those threads.)

* Only the standards' mechanisms can create threads of execution.  There is
nothing in these standards that would break up the concept of a single flow of
control (ie, that what "looks" like a single flow of control in a program is
actually not always the same thread of execution when executed).  (Also note
that fork/join parallelism is not a counter-example to this.)


That said, I can see that this doesn't match nicely with the fact that we have
things like swapcontext elsewhere.  Do we have any data on what the performance
impact would be if the compiler assumed that calls to functions it
cannot analyze could switch the thread of execution?  Data for several
architectures and different TLS models would be helpful.

Coming back to C++, currently I think there is only one Technical Specification
(TS) that allows breaking up a single flow of control: .then() in the
Concurrency TS (whose specification is certainly not ready for the standard). 
Maybe the Networking TS has something similar, but I can't remember right now. 
There are a few proposals that either allow it (Task Blocks, targeting
Parallelism TS version 2) or require it for good performance ("stackful"
coroutines).
The "stackless" coroutines in the upcoming Coroutines TS are not really
affected by thread-specific state; it's not the coroutines code that would
potentially switch threads, but any runtime that would supply a particular
Awaitable implementation that then may switch threads (e.g., if using .then()).
 The Coroutines does not specify any actual runtime.

However, I don't think the existence of some C++ proposals that may switch
threads necessarily means that the compiler would have to take this into
account if those proposals become part of the standard.  The other way
this could play out is that the standard simply makes using thread-specific
state undefined for those threads of execution that can use these proposals.

Overall, I think it may be useful to experiment with a command-line switch or
something similar, primarily to assess how big the performance degradation
would be if the compiler assumed that threads can switch at
function calls.

(In reply to Jakub Jelinek from comment #14)
> Even if we have an option that avoids CSE of TLS addresses across function
> calls (or attribute for specific function), what would you expect to happen
> when user takes address of TLS variables himself:
> __thread int a;
> void
> foo ()
> {
>   int *p = &a;
>   *p = 10;
>   bar (); // Changes threads
>   *p += 10;
> }

I think that p would still point to the initial thread's TLS, even after
bar().  The user would be wrong to assume that p still refers to the executing
thread's object "a" after having called bar().

[Bug sanitizer/78158] Strange data race detection with thread sanitizer

2017-03-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78158

--- Comment #19 from torvald at gcc dot gnu.org ---
(In reply to Dmitry Vyukov from comment #8)
> We need to modify tsan runtime to ignore (zero) __ATOMIC_HLE_ACQUIRE/RELEASE
> bits, right? It's only an optimization and we can pretend that elision never
> succeeds under tsan.

I agree.  The actual memory order that callers have to provide in addition to
the HLE bits is what should determine the synchronization semantics.

(IMHO, exposing HLE instructions through memory order bits is a weird
design.)

[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm

2016-09-22 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991

--- Comment #12 from torvald at gcc dot gnu.org ---
Thanks for reporting that you are affected as well.  Regarding your question
about bumping the priority, one problem for the TM support in general has been
judging how many users are out there, and whether they are
interested in TM for in-production software or are just toying around with it
because it's something new.  This holds for GCC's implementation, but also for
the ISO C++ TM Technical Specification.

Thus, if you or anyone else is interested in using TM for real, please share
your feedback with me personally or ISO C++ Study Group 5
(https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/tm ; sending
email to t...@isocpp.org might work even if you're not subscribed to this list,
or it should at least end up in a moderation queue).  Sharing your feedback
will help us and Study Group 5.

[Bug libgcc/71744] Concurrently throwing exceptions is not scalable

2016-09-14 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744

--- Comment #28 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #20)
> (In reply to Gleb Natapov from comment #19)
> > (In reply to Jakub Jelinek from comment #18)
> > > (In reply to Gleb Natapov from comment #16)
> > > > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > > > Allocate per thread storage dynamically somehow?
> > > 
> > > If we want to use TLS (which I hope we don't), then e.g. a single __thread
> > > pointer with some registered destructor that would free it on process exit
> > > could do the job, and on the first exception it would try to allocate 
> > > memory
> > > for the cache and other stuff and use that (otherwise, if memory 
> > > allocation
> > > fails, just take a lock and be non-scalable).
> > >
> > I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do
> > the same here?
> 
> Can? Yes.  Want?  Nope.  It is worse than TLS.
> 
> > > Another alternative, perhaps much better, if Torvald is going to improve
> > > rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> > > etc. too, and perhaps somewhat enlarge the cache (either statically, or
> > > allow extending it through allocation).
> > > I'd expect that usually these apps that use exceptions too much only care
> > > about a couple of shared libraries, so writes to the cache ought to be 
> > > rare.
> > >
> > As I said in my previous reply, I tested the new rwlock and in congested 
> > case
> > it still slows does the system significantly, not the implementation fault,
> > cpu just does not like locked instruction much. Not having a lock will be
> > significantly better.
> 
> You still need at least one lock, the array of locks is definitely a bad
> idea.

I don't think it is necessarily a bad idea -- spreading out the data used for
synchronization to avoid cache conflicts when there is an asymmetrical workload
is a commonly used technique.  But we need to find a balance between compute
and space overhead that works in this case.  Also, spreading out the
data should ideally happen in a NUMA-aware way, so there may be more space
overhead if we consider that too.

I've been thinking about more scalable ways of locking in glibc that would use
per-thread data, which could be useful for more than one lock.  If we had
that, we could use it to deal with the exception scalability problem without
paying space overhead just for exceptions.

[Bug libgcc/71744] Concurrently throwing exceptions is not scalable

2016-09-14 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744

--- Comment #21 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #17)
> (In reply to torvald from comment #15)
> > > Similarly, the 64 recursive locks in libc, again, significant amount of
> > > memory
> > > per process that doesn't care about exceptions,
> > 
> > That might be reducable if we build custom locks and don't use pthread ones,
> > but we'll at least have 64b per lock.
> 
> Right now the lock is I think 40 bytes per lock, which is 64 * 40 bytes per
> process.  That is just too much (but of course the 64x locking a recursive
> lock is even worse).

Remembering more of the discussion we had about this in the past: even
with the improved rwlock, Gleb reports a slowdown in his tests
because of cache contention -- which would mean that we may have to use one
cache line per lock.

[Bug libgcc/71744] Concurrently throwing exceptions is not scalable

2016-09-14 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744

--- Comment #15 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #14)
> Looking at that patchset, I just want to say that the use of TLS in libgcc
> is very much undesirable, it affects every program even when not using
> exceptions (significant portion of them), especially when it uses that much
> (6 * sizeof(void *) * 8 + sizeof(void *) + 2 * 8 = 408 bytes per thread is
> significant, plus it uses the slow local dynamic TLS model.

I agree.

> Similarly, the 64 recursive locks in libc, again, significant amount of
> memory
> per process that doesn't care about exceptions,

That might be reducible if we build custom locks and don't use pthread ones,
but we'll at least have 64b per lock.

> and especially signficiant
> slowdown on dlopen/dlclose.  While some programs abuse exceptions for normal
> control flow, others abuse dlopen/dlclose.  
> 
> A rwlock ought to be better, but I don't think we have recursive rwlock in
> libc

rdlock can be recursive, wrlock cannot.  Recursive wrlock is something we could
add without much effort in a custom rwlock, I think.

> yet (all we need is being able to recursively lock it for writing, for
> reading
> we might not need to allow recursion in this case).

Not having to deal with recursive rdlock can make things easier for the rwlock
implementation.

> Plus the current rwlock
> implementation still takes uses a per-process lock to guard the inner
> operations
> on the rwlock, so while rwlock helps with longer critical sections, for very
> short ones it likely doesn't scale all that well.

This will change as soon as my new rwlock is upstream.  Especially short rdlock
sections will be much faster.

[Bug libstdc++/65033] C++11 atomics: is_lock_free result does not always match the real lock-free property

2016-08-03 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65033

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #10 from torvald at gcc dot gnu.org ---
The difference is unfortunate, but C11 specifies that atomic_is_lock_free is
*per object*.  I suppose that any change there would have to go through the ISO
C process.

[Bug libstdc++/71133] msp430-elf -mlarge FTBFS in libstdc++-v3

2016-05-17 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71133

--- Comment #2 from torvald at gcc dot gnu.org ---
I think it's easiest to disable the TM support on targets with
less-than-32-bit pointers.

[Bug middle-end/70638] transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap

2016-04-13 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638

--- Comment #3 from torvald at gcc dot gnu.org ---
(In reply to Hillel Avni from comment #2)
> On gcc-linaro-4.9-2014.11, I must declare the wrapper as pure. 

But using that version the wrapper was indeed used and not the original
function, right?  In that case, this part of this bug report would be a
regression.

[Bug middle-end/70638] transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap

2016-04-12 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638

--- Comment #1 from torvald at gcc dot gnu.org ---
Created attachment 38244
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38244&action=edit
test case

[Bug middle-end/70638] New: transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap

2016-04-12 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638

Bug ID: 70638
   Summary: transaction_wrap: too strict compatibility check and
transaction_pure wrappers fail to wrap
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org
  Target Milestone: ---

The attached test case shows two issues with transaction_wrap:

1) If the original function is not transaction_safe, but the wrapper is
declared safe, this incorrectly results in a compilation error about the
original and the wrapper function being incompatible.
I believe this used to work before, but haven't checked.

2) If a wrapper is declared transaction_pure, the original function is
incorrectly considered to be transaction_pure.

The fourth case in the test might be something to consider adding support
for.  OTOH, perhaps it's safer to require programmers to explicitly select
whether a wrapper is pure or safe, given that there seem to be these two
separate use cases (ie, making it pure to customize instrumentation, and making
it safe to use a transaction-friendly wrapper that still needs to be
instrumented).

[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from torvald at gcc dot gnu.org ---
(In reply to Andrew Macleod from comment #24)
> Author: amacleod
> Date: Wed Jan 14 13:58:35 2015
> New Revision: 219601
> 
> URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev
> Log:
> 
> 2015-01-14  Andrew MacLeod  <amacl...@redhat.com>
> 
>   PR middle-end/59448
>   * builtins.c (get_memmodel): Promote consume to acquire always.
>   * testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal
>   consume in an atomic_exchange.

To summarize, Jeff Preshing confirmed that Andrew's patch fixes the problem
(see Comment 23).  Everyone in ISO C++ seems to agree that promoting consume to
acquire is the right approach until we have a different language-level facility
that is actually implementable.  Therefore, I'm closing this as fixed.

[Bug c/59218] atomic transactions: accesses to volatiles not disallowed in transaction_safe code

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |6.0

--- Comment #3 from torvald at gcc dot gnu.org ---
Fixed by Jason.

[Bug c++/63472] transaction_atomic within while loop causes ICE

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63472

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #8 from torvald at gcc dot gnu.org ---
Seems to be the same problem as bug 60908.

*** This bug has been marked as a duplicate of bug 60908 ***

[Bug c++/60908] compiler bug related to trans-mem.c

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60908

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||spear at cse dot lehigh.edu

--- Comment #5 from torvald at gcc dot gnu.org ---
*** Bug 63472 has been marked as a duplicate of this bug. ***

[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550

--- Comment #2 from torvald at gcc dot gnu.org ---
Created attachment 37422
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37422&action=edit
Test case part 2/3

[Bug libitm/69018] libitm/testsuite/libitm.c++/c++.exp: libstdc++ include paths don't work if tests set options

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69018

torvald at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P4
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-01-21
 Ever confirmed|0   |1

--- Comment #2 from torvald at gcc dot gnu.org ---
Downgraded because this is not user-visible and we have a work-around.

[Bug c++/60908] compiler bug related to trans-mem.c

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60908

--- Comment #4 from torvald at gcc dot gnu.org ---
Still happens with r232693.

[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550

--- Comment #3 from torvald at gcc dot gnu.org ---
Created attachment 37423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37423&action=edit
Test case part 3/3

[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991

--- Comment #9 from torvald at gcc dot gnu.org ---
Still fails with r232693.

This seems to be another case of difficult ordering between TM passes and other
passes.  It makes sense that we don't inline tm_pure because we must not lose
that information.  always_inline is specified to produce an error when not
inlining, but this shouldn't be much of a problem when considering code
instrumented for transactions I suppose (can there be a case where lack of
inlining causes a correctness problem?).

Perhaps it's easiest if we clone the function if we see such a case, so that
the solution can be different for TM-instrumented clones and normal code.

[Bug c/57855] passing unsafe function as transaction_safe function pointer does not generate error

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57855

torvald at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P4
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-01-21
  Component|libitm  |c
Version|unknown |6.0
 Ever confirmed|0   |1

--- Comment #3 from torvald at gcc dot gnu.org ---
I can confirm that this still happens with r232693 if the test program is
compiled as a C program; a warning is produced, but there is no error.

However, if compiling the reproducer as a C++ program, an error is returned, as
required by the TM TS:

pr57855.c:28:14: error: invalid conversion from ‘void (*)(const char*, int,
const char*, int, const void*)’ to ‘ADD_STAT {aka void (*)(const char*, int,
const char*, int, const void*) transaction_safe}’ [-fpermissive]
   bar(my_func);

I'm not sure whether the initial TM specification required an error, but I
think it's better to keep this open given that ISO C++ SG5 people are working
on / interested in a variant of the TM TS for C (ie, so that we don't forget
about it).  Eventually, I suppose we would phase out support for the old
TM specification.  Therefore, I've also reduced the priority of this bug and
moved it to the C component.

[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550

--- Comment #1 from torvald at gcc dot gnu.org ---
Created attachment 37421
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37421&action=edit
Test case part 1/3

[Bug middle-end/57348] [TM] ICE for transaction expression in gimplify_expr

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57348

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #3 from torvald at gcc dot gnu.org ---
Works with r232693.

[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm

2016-01-21 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-01-22
 CC||jason at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #4 from torvald at gcc dot gnu.org ---
Confirmed with r232693.  std::vector's constructor isn't considered
transaction_safe, but see the attached test case.
I'm not sure making the clones weak is the right approach; the clones should
probably mirror whatever is done for the original functions.

[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-20 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #14 from torvald at gcc dot gnu.org ---
Author: torvald
Date: Wed Jan 20 17:47:03 2016
New Revision: 232628

URL: https://gcc.gnu.org/viewcvs?rev=232628&root=gcc&view=rev
Log:
libstdc++: Darwin does not support weak refs without definition.

PR libstdc++/69310
* config/os/bsd/darwin/os_defines.h (_GLIBCXX_USE_WEAK_REF): Define.

Modified:
trunk/libstdc++-v3/ChangeLog
trunk/libstdc++-v3/config/os/bsd/darwin/os_defines.h

[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-20 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #15 from torvald at gcc dot gnu.org ---
Fixed on trunk.

[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-20 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #11 from torvald at gcc dot gnu.org ---
(In reply to Jack Howarth from comment #10)
> It is unclear if the changes in r232454, to avoid the explicit linkage on
> libitm, can ever be made darwin-friendly. On darwin, every single executable
> linked against libstdc++ would require -Wl,-undefined,dynamic_lookup on the
> linkage to avoid linkage errors from the undefined symbols from libitm
> within libstdc++.

Is Darwin declaring __GXX_WEAK__?  I suppose that it does because after r232539
the failing code isn't compiled unless __GXX_WEAK__ is supported.  But if it
is, weak references must be supported.  Or are you saying that specifically the
alias attribute is a problem?

If indeed weak references to symbols not declared are a problem, then we
probably should disable the transactional clones the same way we did on AIX,
which doesn't support weak references without definitions either.

What about putting this:

// No support for referencing weak symbols without a definition.
#define _GLIBCXX_USE_WEAK_REF 0

into ./config/os/bsd/darwin/os_defines.h?  This would disable the transactional
clones altogether.  Could you give this a try?

[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-19 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #7 from torvald at gcc dot gnu.org ---
Does this patch fix the issue on Darwin?
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01452.html

[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-19 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #6 from torvald at gcc dot gnu.org ---
(In reply to Jack Howarth from comment #5)
> (In reply to torvald from comment #4)
> 
> > See https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01203.html
> > (Not linked to this bug because this bug is about the darwin breakage.)
> 
> The commit of that patch didn't resolve the bootstrap failures on
> x86_64-apple-darwin15 . The "only weak aliases are supported in this
> configuration" errors still block the bootstrap on darwin.

Of course it didn't.  That patch was a reply to a comment about another issue
not actually related to this bug, but posted as a comment on it.

[Bug bootstrap/69312] [6 Regression] libstdc++ unconditionally refers to TM symbols

2016-01-19 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69312

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from torvald at gcc dot gnu.org ---
Fixed in r232539

[Bug bootstrap/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #2 from torvald at gcc dot gnu.org ---
(In reply to İsmail Dönmez from comment #1)
> r232454 also breaks bootstrap for mingw-w64:
> 
> libtool: compile:  x86_64-w64-mingw32-c++
> -L/havana/mingw-w64-6.0.0/x86_64-w64-mingw32/lib
> -L/havana/mingw-w64-6.0.0/mingw/lib -isystem
> /havana/mingw-w64-6.0.0/x86_64-w64-mingw32/include -isystem
> /havana/mingw-w64-6.0.0/mingw/include
> -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/../libgcc
> -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/
> include/x86_64-w64-mingw32
> -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/
> include -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/libsupc++
> -std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra
> -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once
> -ffunction-sections -fdata-sections -frandom-seed=cow-stdexcept.lo -g -O2 -c
> ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc -o
> cow-stdexcept.o
> ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:235:70:
> warning: unused parameter 'exc' [-Wunused-parameter]
>  _txnal_cow_string_C1_for_exceptions(void* that, const char* s, void *exc)
>   ^
> ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc: In
> function 'void* txnal_read_ptr(void* const*)':
> ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39:
> error: expected ',' before ')' token
>|| sizeof(uint32_t) == sizeof(void*));
>^
> ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39:
> error: expected string-literal before ')' token

Is mingw's static_assert any different?  Right now, I can't see how this could
produce this error:

  static_assert(sizeof(uint64_t) == sizeof(void*)
|| sizeof(uint32_t) == sizeof(void*));

[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe

2016-01-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199

torvald at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |6.0

[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe

2016-01-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from torvald at gcc dot gnu.org ---
Fixed in 230634.

[Bug bootstrap/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15

2016-01-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310

--- Comment #4 from torvald at gcc dot gnu.org ---
(In reply to İsmail Dönmez from comment #3)
> (In reply to torvald from comment #2)
> > (In reply to İsmail Dönmez from comment #1)
> > > r232454 also breaks bootstrap for mingw-w64:
> > > 
> > > libtool: compile:  x86_64-w64-mingw32-c++
> > > -L/havana/mingw-w64-6.0.0/x86_64-w64-mingw32/lib
> > > -L/havana/mingw-w64-6.0.0/mingw/lib -isystem
> > > /havana/mingw-w64-6.0.0/x86_64-w64-mingw32/include -isystem
> > > /havana/mingw-w64-6.0.0/mingw/include
> > > -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/../libgcc
> > > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/
> > > include/x86_64-w64-mingw32
> > > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/
> > > include -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/libsupc++
> > > -std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra
> > > -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once
> > > -ffunction-sections -fdata-sections -frandom-seed=cow-stdexcept.lo -g -O2 
> > > -c
> > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc -o
> > > cow-stdexcept.o
> > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:235:70:
> > > warning: unused parameter 'exc' [-Wunused-parameter]
> > >  _txnal_cow_string_C1_for_exceptions(void* that, const char* s, void *exc)
> > >   ^
> > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc: In
> > > function 'void* txnal_read_ptr(void* const*)':
> > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39:
> > > error: expected ',' before ')' token
> > >|| sizeof(uint32_t) == sizeof(void*));
> > >^
> > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39:
> > > error: expected string-literal before ')' token
> > 
> > is mingw's static_assert any different?  Right now, I can't see how this
> > could produce this error:
> > 
> >   static_assert(sizeof(uint64_t) == sizeof(void*)
> > || sizeof(uint32_t) == sizeof(void*));
> 
> g++-5 on Linux does not compile that either; clang is more helpful as to why:
> 
> t.cpp:5:85: warning: static_assert with no message is a C++1z extension
> [-Wc++1z-extensions]
> 
> So, just adding a message to static_assert fixes the issue here.

Ah, right.

See https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01203.html
(Not linked to this bug because this bug is about the darwin breakage.)

[Bug libitm/69018] New: libitm/testsuite/libitm.c++/c++.exp: libstdc++ include paths don't work if tests set options

2015-12-22 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69018

Bug ID: 69018
   Summary: libitm/testsuite/libitm.c++/c++.exp: libstdc++ include
paths don't work if tests set options
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libitm
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org
  Target Milestone: ---

If a test sets an option (eg, // { dg-options "-fgnu-tm" }), then the include
paths for C++ std headers set by c++.exp don't work anymore.  Swapping the
last two arguments of the "dg-runtest" command near the end of c++.exp
makes it work, but I don't know whether that's a fix or a workaround.

[Bug libstdc++/68921] [5/6 Regression] std::future::wait() makes invalid futex calls and spins

2015-12-15 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68921

--- Comment #3 from torvald at gcc dot gnu.org ---
LGTM, thanks.  Would be nice to backport this.

[Bug libitm/68591] trans-mem: missing transactional read in transaction when using -O1

2015-12-15 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591

--- Comment #3 from torvald at gcc dot gnu.org ---
This might be related to bug 68616.

[Bug middle-end/68616] miscompilation in multi-threaded code

2015-12-01 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68616

--- Comment #2 from torvald at gcc dot gnu.org ---
I basically don't know anything about IPA, so I'll just assume that by
"barrier" you mean conditions prohibiting transformations.  I'm also not sure
whether CSE alone is the problem here, so I'll try to give a broad, if
unspecific, answer.

I'll assume we're looking at cases of nonatomic accesses to foo (such as ptr in
the test case) and calls to the (new) atomic builtins on different variables
bar1, bar2, ... as synchronization points.  Mixed atomic/nonatomic accesses to
the same memory location are in most cases a bug, and I believe we discourage
them, but they are not generally incorrect (e.g., if loading through an
atomic op and observing a value that indicates we're the only thread accessing
it, subsequent nonatomic accesses might be fine); maybe we should just let an
atomic access to bar1 be a barrier for all movement/merging across/involving
this access.  Any old atomic builtins (ie, __sync) can probably be handled
like the equivalent calls to the new builtins.

I believe we don't need to include volatiles, because even if they are used
in old-style code, they should have asm compiler barriers around them -- and I
hope we're handling those correctly.

Because foo and bar are distinct memory locations, atomic stores to bar must be
__ATOMIC_RELEASE or stronger, and atomic loads of bar must be __ATOMIC_ACQUIRE
or stronger; otherwise, there's no implied ordering relationship between the
foo and bar accesses.

A good way to find out what transformations are or are not allowed is to
consider the data-race-freedom (DRF) requirement and which regions an
implementation would be allowed to execute atomically.

For example, "foo = 1; bar.store(1, release); foo = 2;": The implementation is
allowed to execute that in one atomic step, so there cannot be a nonatomic read
of foo in another thread that's triggered by the store to bar, because there
would be a data race in this case.  So, foo=1 can be removed if we assume DRF.
(Note that old synchronization code may not be written to guarantee DRF;
perhaps this should be conditional on -std=c11 or such.)

"temp=foo; if (bar.load(acquire)) temp2=foo;" is similar.

However, this example here (basically the test case) doesn't allow for
optimizing across the synchronization:
"foo = 1; temp=foo; bar.store(1, release); if (bar.load(acquire)==2)
temp2=foo;"

This is because both loads of foo cannot be executed in one atomic step, as it
needs another thread to increment bar to the value 2; also, there is no
nonatomic access to foo between the two atomic accesses, so we cannot derive
that those other threads that execute between the atomics don't access foo
because of DRF.

I hope this helps, though I'm aware it's pretty general.  If you have a list of
specific transformations, I could look through that and say which ones are
correct or not.  Perhaps it's okay for now if we assume that all non-relaxed
atomics are transformation barriers, as we do elsewhere AFAIK.

[Bug middle-end/68122] ICE in gcc/toplev.c:353 with -fsanitize=undefined and -fgnu-tm

2015-12-01 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68122

--- Comment #11 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #10)
> There are lots of internal functions in GCC 6 (older versions had fewer).
> Many of them are ECF_CONST, which is also treated like txn_pure, right?

I think it should be (ie, it won't access global memory, so it is a black box
and the only output/input is values) -- but I don't know whether we do this
everywhere.

> Other than that, e.g. the remaining UBSAN* handlers are pretty much just
> report some diagnostics and either terminate the program or return, but in a
> valid program they should never be called.  If we treat them as ECF_TM_PURE,
> that would worst case mean the diagnostics could be printed multiple times
> for the same spot in a transaction, if the transaction is restarted, right?

Yes.

> Most of them are not reporting the same diagnostics for the same source code
> location, so I suppose it would be no harm to treat them as tm pure.

Agreed.  Are we running the TM pass before the pass that replaces UBSAN* with
real code?

[Bug middle-end/68122] ICE in gcc/toplev.c:353 with -fsanitize=undefined and -fgnu-tm

2015-12-01 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68122

--- Comment #9 from torvald at gcc dot gnu.org ---
Marek, just skipping the handling of internal functions is not correct, I
think.  In the worst case, it should treat them as txn-unsafe, unless they are
considered txn-pure.  Do we have some idea of how many built-ins would be
affected?  For example, I believe it wouldn't be critical if TM weren't
compatible with UBSan for now, but if there are builtins for code that would be
intuitively txn-safe, supporting them would be good.

Regarding UBSan specifically, I believe many of the checks could be handled in
a straight-forward way: All values returned by transactional memory accesses
will form a consistent snapshot, always.  So, for example, an
addition-with-check call that takes values as arguments should be tm_pure
because it will check values that could have happened in some execution, even
if the transaction has to be rolled back later on.

[Bug other/68616] New: miscompilation in multi-threaded code

2015-11-30 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68616

Bug ID: 68616
   Summary: miscompilation in multi-threaded code
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org
  Target Milestone: ---

Created attachment 36871
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36871&action=edit
test case

In the attached test case, the compiler seems to assume that in main_thread,
ptr still points to x despite there being a call to arrive_and_wait that
contains synchronization.  The compiled code on x86_64 with -O1 has just an
unconditional store to x after the arrive_and_wait call.  It must keep the
conditional because, in fact, the interference function waits for
arrive_and_wait to increment barrier, then assigns zero to ptr, and then
unblocks arrive_and_wait, which returns to main_thread.  So, in a correct
execution, main_thread will always see ptr==0 and must not assign to x.

This may or may not be a duplicate of PR 68591; I don't know so I created a new
PR.  Also, while just a transactional memory bug may be considered less severe,
this one here is a severe problem for multi-threaded code.

Also note that in a variation of this I didn't see the miscompilation: In that
case, main_thread() was a part of main() and the pthread_create call for
interference() was directly before the arrive_and_wait.  The code generated for
that case had the load and check of ptr before assigning to ptr.

[Bug libitm/68591] trans-mem: missing transactional read in transaction when using -O1

2015-11-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rth at gcc dot gnu.org

--- Comment #1 from torvald at gcc dot gnu.org ---
Filed as libitm, but this is really a compiler problem, and not related to
libitm.

[Bug libitm/68591] New: trans-mem: missing transactional read in transaction when using -O1

2015-11-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591

Bug ID: 68591
   Summary: trans-mem: missing transactional read in transaction
when using -O1
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libitm
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org
  Target Milestone: ---

Created attachment 36860
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36860&action=edit
test case

The attached test case is miscompiled in that the load from ptr in the
conditional inside the transaction is not transformed into a transactional
load.  Instead, the compiler seems to assume that ptr's value is fixed and not
shared with other threads.  This changes when the synchronization that could
potentially make it shared gets inlined into the function containing the
transaction.  For simplicity, this is a single-threaded test, but I discovered
the problem in a multi-threaded setting.

[Bug libitm/66453] In a deadlock libitm allocates all available RAM until oom is called

2015-11-26 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66453

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #1 from torvald at gcc dot gnu.org ---
Created attachment 36848
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36848&action=edit
test case of original bug report

I can't reproduce this with the attached test case on current trunk.  The
only difference from your test case is that the non-synchronizing while loop is
technically undefined behavior, and to try to prevent this, I added a hack
that keeps the loop from being optimized out.  Please test again, and reopen
the bug if you still see the OOM situation, ideally with more data on your test
environment.

[Bug libitm/66453] In a deadlock libitm allocates all available RAM until oom is called

2015-11-26 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66453

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #2 from torvald at gcc dot gnu.org ---
Can't reproduce, so closing this for now.

[Bug target/67281] HTM builtins aren't treated as compiler barriers on powerpc

2015-08-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67281

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #6 from torvald at gcc dot gnu.org ---
I think tbegin needs to have the same semantics as a lock acquisition, and the
compiler must not assume to know anything about tbegin's return value; tend
must have the same semantics as a lock release.

See the libc-alpha discussion for why I think this is the case:
https://sourceware.org/ml/libc-alpha/2015-08/msg00963.html

Thus, I don't think a full compiler barrier is necessary, but if we don't have
something finer-grained to capture the semantics of a lock acquisition, then we
need the compiler barrier (GCC currently assumes atomics to be compiler
barriers AFAIK).  We should in any case agree on a semantics and document it in
the GCC sources.  Documenting that we need a full compiler barrier is not
correct in that it's not a necessary condition (even though it should be a
sufficient condition).


[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier

2015-07-06 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #8 from torvald at gcc dot gnu.org ---
(In reply to mikulas from comment #7)
 #include <stdatomic.h>
 
 atomic_int a = ATOMIC_VAR_INIT(0);
 atomic_int b = ATOMIC_VAR_INIT(0);
 atomic_int p = ATOMIC_VAR_INIT(0);
 
 int thread_1(void)
 {
 atomic_store_explicit(&b, 1, memory_order_relaxed);
 atomic_load_explicit(&p, memory_order_seq_cst);
 return atomic_load_explicit(&a, memory_order_relaxed);
 }
 
 int thread_2(void)
 {
 atomic_store_explicit(&a, 1, memory_order_relaxed);
 atomic_load_explicit(&p, memory_order_seq_cst);
 return atomic_load_explicit(&b, memory_order_relaxed);
 }
 
 See for example this. Suppose that thread_1 and thread_2 are executed
 concurrently. If memory_order_seq_cst were a proper full memory barrier, it
 would be impossible that both functions return 0.

memory_order_seq_cst is a memory order in the Standard's terminology.  Fences
are something else (ie, atomic_thread_fence()), and are parametrized by a
memory order.  A memory_order_seq_cst *memory access* does not have the same
effects as a memory_order_seq_cst fence.  See C++14 29.3p4-7; those paragraphs
talk about memory_order_seq_cst fences specifically, not about
memory_order_seq_cst operations in general.

If you want to make this example of Dekker synchronization correct, you need to
use fences instead of the accesses to p; alternatively, you need to use seq-cst
accesses for all the stores and loads to a and b, in which case there will be
HW fences added via the stores (as Andrew already pointed out).


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-05-01 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #49 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #43)
 (In reply to torvald from comment #37)
  (In reply to James Greenhalgh from comment #35)
   So by the strict letter of the specification, no memory references to
   visible data should be allowed to move from after the entire body of the
   intrinsic to before it. That is to say in:
   
 __sync_lock_test_and_set (foo, 1)
 bar = 1
   
   an observer should not be able to observe the write to bar before the 
   write
   to foo. This is a difference from the C++11 semantics.
  
  Can you clarify how this observer would look like?  I think we should look
  at both code examples that just use __sync for concurrent accesses, and
  examples that also use normal memory accesses as if those would be
  guaranteed to be atomic.  None of the __sync docs nor the psABI guarantee
  any of the latter AFAIK.  I don't think it would be unreasonable to argue
  that __sync doesn't make promises about non-DRF normal accesses, and so,
  strictly speaking, maybe programs can't in fact rely on any of the behaviors
  that are complicated to implement for ARM.  That's why I'd like to
  distinguish those two cases.
 
 Sure, it would look much like your cppmem example above, though you can add
 some
 barriers to the observer if you want to make a stronger example
 
 bar = 0, foo = 0;
 
 thread_a {
   __sync_lock_test_and_set (&foo, 1);
   bar = 1;
 }
 
 thread_b {
   /* If we can see the write to bar, the write
      to foo must also have happened.  */
   if (bar)         /* Reads 1.  */
     assert (foo);  /* Should never read 0.  */
 }

This is the case of allowing non-DRF normal accesses.  The *other* case I was
thinking about is how the test would have to look when *not* allowing
them.  One way to do it would be:

thread_a {
  __sync_lock_test_and_set (&foo, 1);
  __sync_lock_test_and_set (&bar, 1);  // or __sync_lock_release, or a __sync RMW
}

thread_b {
  if (__sync_fetch_and_add (&bar, 0))
    assert (foo);  // DRF if thread_a's write is the final one
}

In this case, would the current ARM implementation still produce insufficient
code?  If not, at least in this test case, we could argue that there's nothing
wrong with what ARM does.  (The question of whether we want to require DRF
strictly for __sync usage is of course still open.)

   I'm not worried about __sync_lock_release, I think the documentation is
   strong enough and unambiguous.
  
  Are you aware that the GCC's __sync disallow store-store reordering across
  __sync_lock_release, whereas the psABI docs don't?
 
 No I was not, and even looking exactly for the text you were referring to,
 it took
 me three attempts to spot it. Yes, I understand now why you are concerned
 about
 the GCC wording. Perhaps this is just an artefact of a mistake transcribing
 the psABI?

I suppose so, but can't say for sure.  Somebody might have had x86 TSO in mind
when writing that, and perhaps not just accidentally.

However, we say psABI is the reference spec, so I agree we should change the
GCC __sync_lock_release docs accordingly.


[Bug libstdc++/51617] [C++0x] async(f) isn't.

2015-04-27 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51617

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #11 from torvald at gcc dot gnu.org ---
ISO C++ SG1 was close to deprecating std::async for other reasons, but there
was too much objection to this from other parts of the committee; the main
concern was that there is no replacement for the functionality.  Even though
the executors proposal is still being discussed and not part of any published
TS yet, the parallelism TS V1 is being published.  Thus, I'm not sure whether
spending too much time on this is justified, given the potentially short-lived
importance of the current async().


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-17 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #38 from torvald at gcc dot gnu.org ---
(In reply to Andrew Macleod from comment #34)
  However, I guess some people relying on data races in their programs could
  (mis?)understand the __sync_lock_release semantics to mean that it is a
  means to get the equivalent of a C11 release *fence* -- which it is not
  because the fence would apply to the (erroneously non-atomic) store after
  the barrier, which could one lead to believe that if one observes the store
  after the barrier, the fence must also be in effect.  Thoughts?
 
 before we get too carried away, maybe we should return to what we *think*
 __sync are suppose to do. It represents a specific definition by intel..
 From the original documentation for __sync back in the day, and all legacy
 uses of sync should expect this behaviour:

The problem I see with that is that I don't think that just looking at the
psABI gives you enough information to really reason about what you are allowed
to do or not.  Strictly speaking, the psABI doesn't give you guarantees about
normal memory accesses that are not data-race-free (through use of the __sync
builtins).  Nonetheless, legacy code does use them in a combination with the
__sync builtins.

Also, if you look at the IA-64 __sync_lock_release vs. GCC docs'
__sync_lock_release, the latter is like x86/TSO.  Do you have any info on which
other semantics __sync was supposed to adhere to?

One potential way to solve it would be to just require code that uses __sync to
more or less implement an IA-64 or x86 memory model, modulo allowing
compiler-reordering and optimization between adjacent non-__sync memory
accesses.  This could be inefficient on ARM (see James' examples) and perhaps
Power too (or not -- see Jakub's comments).


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-17 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #37 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #35)
 (In reply to torvald from comment #32)
  (In reply to James Greenhalgh from comment #28)
   This also gives us an easier route to fixing any issues with the
   acquire/release __sync primitives (__sync_lock_test_and_set and
   __sync_lock_release) if we decide that these also need to be stronger than
   their C++11 equivalents.
  
  I don't think we have another case of different __sync vs. __atomics
  semantics in case of __sync_lock_test_and_set.  The current specification
  makes it clear that this is an acquire barrier, and how it describes the
  semantics (ie, loads and stores that are program-order before the acquire op
  can move to after it) , this seems to be consistent with the effects C11
  specifies for acquire MO (with perhaps the distinction that C11 is clear
  that acquire needs to be paired with some release op to create an ordering
  constraint).
 
 I think that the question is which parts of a RMW operation with
 MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11
 MEMMODEL_ACQUIRE only applies to the load half of the operation. So an
 observer to:
 
   atomic_flag_test_and_set_explicit(&foo, memory_order_acquire);
   atomic_store_explicit(&bar, 1, memory_order_relaxed);

That depends on your observer.  When reasoning about C11, please let's use
complete examples (and ideally, run them through cppmem).  For example, if the
observation is done by a pair of relaxed-MO loads, bar==1 && foo==0 can be
observed:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
r1 = cas_strong(foo, 0, 1);
bar.store(1, memory_order_relaxed);
  } ||| {
r2 = bar.load(memory_order_relaxed).readsvalue(1); 
r3 = foo.load(memory_order_relaxed).readsvalue(0); 
  } }}};
  return 0; }

 Is permitted to observe a write to bar before a write to foo (but not before
 the read from foo).

How does one observe a load in C11 (ie, so that "before the read from foo" is
defined)?  You can constrain the reads-from of a load, but the load itself is
not observable as an effect.

I think I get what you're trying to say regarding the load half, but I would
phrase it differently: If there could be another release store to foo that the
CAS/TAS reads-from, then moving the store to bar before the load would risk
creating a cycle between inter-thread happens-before and
sequenced-before-created intra-thread happens-before.  (Note that the latter
isn't visible to other threads, though, except through additional inter-thread
synchronization.)

Perhaps one could also say that moving the store to bar before the store to foo
could result in the CAS/TAS not being atomic -- but this is only reliably
observable if the foo observation is a CAS/TAS by itself (so it's totally
ordered with the other CAS), or if there is an inter-thread happens-before
between the observation of bar and the store to bar (which means using a
release store for bar).

I'm discussing these aspects because I think it matters which observations are
actually possible in a reliable way.  It matters for C11, and it may matter for
the __sync semantics as well because while the __sync functions promise TSO, we
don't really do so for all (nonatomic) loads or stores anywhere else in the
program...

IOW, can we actually come up with reliable and correct counter-examples (either
using the C11 or __sync interfaces) for the guarantees we think ARM might not
provide?

 My reading of the Itanium ABI is that the acquire barrier applies to the
 entire operation (Andrew, I think you copied these over exactly backwards in
 comment 34 ;) ):
 
   Disallows the movement of memory references to visible data from
after the intrinsic (in program order) to before the intrinsic (this
behavior is desirable at lock-acquire operations, hence the name).
 
 The definition of __sync_lock_test_and_set is:
 
   Behavior:
• Atomically store the supplied value in *ptr and return the old value
  of *ptr. (i.e.)
{ tmp = *ptr; *ptr = value; return tmp; }
• Acquire barrier.

Hmm.  I don't think this list is meant to be a sequence; __sync_lock_release
has two items, first the assignment to the memory location, and *second* a
release barrier.  If this were supposed to be a sequence, it wouldn't be
correct for lock releases.  It rather seems that, somehow, the whole intrinsic
is supposed to be a barrier, and be atomic.  Similarly, the __sync RMW ops just
have a single full barrier listed under behavior.

 So by the strict letter of the specification, no memory references to
 visible data should be allowed to move from after the entire body of the
 intrinsic to before it. That is to say in:
 
   __sync_lock_test_and_set (foo, 1)
   bar = 1
 
 an observer should not be able to observe the write to bar before the write
 to foo. This is a difference from the C++11 semantics.

Can you clarify how

[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #32 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #28)
 (In reply to torvald from comment #24)
  3) We could do something just on ARM (and scan other arcs for similar
  issues).  That's perhaps the cleanest option.
 
 Which leaves 3). From Andrew's two proposed solutions:

3) Also seems best to me.  2) is worst, 1) is too much of a stick.

 This also gives us an easier route to fixing any issues with the
 acquire/release __sync primitives (__sync_lock_test_and_set and
 __sync_lock_release) if we decide that these also need to be stronger than
 their C++11 equivalents.

I don't think we have another case of different __sync vs. __atomic semantics
in case of __sync_lock_test_and_set.  The current specification makes it clear
that this is an acquire barrier, and the way it describes the semantics (ie,
loads and stores that are program-order before the acquire op can move to after
it) seems to be consistent with the effects C11 specifies for acquire MO (with
perhaps the distinction that C11 is clear that acquire needs to be paired with
some release op to create an ordering constraint).

I'd say this basically also applies to __sync_lock_release, with the exception
that the current documentation does not mention that stores can be speculated
to before the barrier.  That seems to be an artefact of a TSO model.
However, I don't think this matters much because what the release barrier
allows one to do is reasoning that if one sees the barrier to have taken place
(eg, observe that the lock has been released), then also all ops before the
barrier will be visible.
It does not guarantee that if one observes an effect that is after the barrier
in program order, that the barrier itself will necessarily have taken effect. 
To be able to make this observation, one would have to ensure using __sync ops
that the other effect after the barrier is indeed after the barrier, which
would mean using an release op for the other effect -- which would take care of
things.

If everyone agrees with this reasoning, we probably should add documentation
explaining this.

However, I guess some people relying on data races in their programs could
(mis?)understand the __sync_lock_release semantics to mean that it is a means
to get the equivalent of a C11 release *fence* -- which it is not, because the
fence would apply to the (erroneously non-atomic) store after the barrier,
which could lead one to believe that if one observes the store after the
barrier, the fence must also be in effect.  Thoughts?


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-15 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #21 from torvald at gcc dot gnu.org ---
(In reply to Andrew Haley from comment #20)
 (In reply to mwahab from comment #19)
  (In reply to Andrew Haley from comment #18)
  
  It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) Further, if
  the comparison is true, memory is affected according to the value of
  success, and if the comparison is false, memory is affected according to the
  value of failure. (where success and failure are the memory model
  arguments.) In this case, the write to *exp should be memory_order_seq_cst.
 
 But no store actually takes place, so the only effect is that of the read.
 You can't have a sequentially consistent store without a store.

I agree. If you continue reading in the C++11 paragraph that you cited, you'll
see that if just one MO is provided and the CAS fails, an acq_rel MO is
downgraded to acquire and a release MO to relaxed.  This is consistent with no
update of the atomic variable (note that expected is not atomic, so applying an
MO to accesses to it is not meaningful in any case).  However, if the provided
MO is seq_cst, I think a failed CAS still needs to be equivalent to a seq_cst
load.


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-15 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #22 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #12)
 There are two problems here, one of which concerns me more in the real
 world, and both of which rely on races if you are in the C/C++11 model -
 there isn't much I can do about that as the __sync primitives are legacy and
 give stronger guarantees about visible memory references (which includes
 ALL global data).
 
 Consider:
 
 int x = 0;
 int y = 0;
 int z = 0;
 
 void foo (void)
 {
   x = 1;
   __sync_fetch_and_add (y, 1);
   z = 1;
 }
 
 My reading of what a full barrier implies would suggest that we should never
 have a modification order which is any of:
 
   z = 1, x = 1, y = 0+1
   z = 1, y = 0+1, x = 1
   x = 1, z = 1, y = 0+1

At least in C11/C++11, modification orders are per memory location.  If we want
to have something that orders accesses to x, y, and z, we need to model the
observer side too (which the C11 model does, for example).  I'm mentioning that
because, at the very least, we have compiler reordering happening on the
observer side; if a programmer needs to use, for example, __sync builtins
to order the observers, then we just have to consider combinations of those on
the modification and the observation side.

(And because I'm not sure what you think modification order is, I can't
comment further.)

 GCC 5.0-ish (recent trunk) will emit:
 
 foo:
   adrpx3, .LANCHOR0
   mov w1, 1
   add x0, x3, :lo12:.LANCHOR0
   add x2, x0, 4
   str w1, [x3, #:lo12:.LANCHOR0]
 .L2:
   ldaxr   w3, [x2]
   add w3, w3, w1
   stlxr   w4, w3, [x2]
   cbnzw4, .L2
   str w1, [x0, 8]
   ret
 
 Dropping some of the context and switching to pseudo-asm we have:
 
   str 1, [x]
 .L2:
   ldaxr   tmp, [y]
   add tmp, tmp, 1
   stlxr   flag, tmp, [y]
   cbnzflag, .L2
   str 1, [z]
 
 As far as I understand it, the memory ordering guarantees of the
 half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
 an execution which looks like:
 
   ldaxr   tmp, [y]
   str 1, [z]
   str 1, [x]
   add tmp, tmp, 1
   stlxr   flag, tmp, [y]
   cbnzflag, .L2

 which gives one of the modification orders we wanted to forbid above.
 Similar reordering can give the other undesirable modification orders.

(I'm not familiar with the ARM model, so please bear with me.)

This is reordering the HW can do?  Or are you concerned about the compiler
backend?
Would the reordered str always become visible atomically with the stlxr?
Would it become visible even if the LLSC fails, thus potentially storing more
than once to z?  This would be surprising in that the ldaxr would then
potentially have to be reloaded after the store to z, I believe (at least for a
strong CAS) -- which would break the acquire MO on the load.

 As mentioned, this reordering is permitted under C11, as the stores to x and
 z are racy - this permits the CPPMEM/Cambridge documentation of what an
 SEQ_CST must do.

That's true in this example, but at the instruction level (assuming HW
reordering is what you're concerned about), an atomic relaxed-MO load isn't
distinguishable from a normal memory access, right?  So, strictly speaking,
this is not about DRF, but about the difference between C11 seq_cst fences and
seq_cst RMW ops.

 However, my feeling is that it is at the very least
 *surprising* for the __sync primitives to allow this ordering, and more
 likely points to a break in the AArch64 implementation.

With the caveat that, given that __sync isn't documented in great detail, a
lot of interpretation happens in practice, so there may be a few surprises for
some people :)

 Fixing this requires a hard memory barrier - but I really don't want to see
 us penalise C11 SEQ_CST to get the right behaviour out of
 __sync_fetch_and_add...

I agree.

[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-15 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #24 from torvald at gcc dot gnu.org ---
I think we need to at least clarify the documentation of __atomic, probably
also of __sync; we might also have to change the implementation of __sync
builtins on some archs.

First, I think the __atomic builtins should strictly follow the C11 model --
one major reason being that it's not worthwhile for us to support a
GCC-specific memory model, which is complex.  Therefore, I think we should
clarify in the __atomic docs that C++11 is the reference semantics, and that
any additional text we provide (e.g., to describe what MEMMODEL_SEQ_CST does)
is just a hand-wavy attempt at making things look simpler.  I'd choose
C++11 over C11 because, although the models are supposed to be equal, C++11 has
more detailed docs in the area of atomics and synchronization in general, IMO. 
There are also more people involved in C++ standardization than C
standardization.

The only exception that I see is that we might want to give more guarantees wrt
data-race-freedom (DRF) as far as compiler reordering is concerned.  I'm
mentioning just the compiler because I'm not aware of GCC supporting C11/C++11
on any archs whose ISA distinguishes between atomic and nonatomic accesses (ie,
where DRF is visible at the ISA level); however, I've heard of upcoming archs
that may do that.  Trying to promise some support for non-DRF programs that
synchronize might make transition of legacy code to __atomic builtins easier,
but it's a slippery slope; I don't think it's worthwhile for us to try to give
really precise and hard guarantees, just because of the complexity involved. 
I'm not quite sure what's best here.

Second, in C11 there is a difference between a memory access or
read-modify-write op (RMW) that is seq-cst, and a seq-cst fence.  For
reference, try this example in cppmem:

int main() {
  atomic_int x = 0; 
  atomic_int y = 0; 
  atomic_int d1 = 0; 
  atomic_int d2 = 0; 
  {{{ {
x.store(1, memory_order_relaxed);
// atomic_thread_fence(memory_order_seq_cst);
r3=cas_strong(d1, 0, 1);
r1=y.load(memory_order_relaxed).readsvalue(0);
  }
  ||| {
y.store(1, memory_order_relaxed);
// atomic_thread_fence(memory_order_seq_cst);
r4=cas_strong(d2, 0, 1);
r2=x.load(memory_order_relaxed).readsvalue(0);
  }
  }}};
  return 0; }

This is Dekker synchronization: It doesn't work (ie, both threads can read
value 0) with seq-cst RMWs to separate locations (d1 and d2) even though those
RMWs will be totally ordered; however, it does work when using fences (see the
comments).  Therefore, for __atomic, it's not only the MEMMODEL_* value that
counts, but also which operation it is applied to.  Perhaps we need to point
this out in the docs.


However, our current docs for __sync indicate that most of the __sync RMWs are
full barriers.  The term "full barrier" has a very TSO-like description, and my
guess would be that most people understand it this way too (ie, that it would at
least be as strong as a seq-cst C11 fence).  Also, Andrew mentioned offline
that prior documentation seemed to have described the __sync RMWs as
  __sync_synchronize();
  operation...
  __sync_synchronize();

For x86 and similar, a C11-seq-cst RMW will be a full barrier; Jakub says the
same holds for PowerPC currently.  AFAIU, on ARM it doesn't (but it kind of
surprises me that HW would reorder into an LLSC region; my guess would be that
this complicates matters, and I can't see a benefit).

So, at least on ARM, there seems to be a difference between a __sync RMW that
is documented to be at least as strong as a C11 seq-cst fence, and how we
implement a C11 seq-cst fence.  This difference is observable by a program, but
doing so requires a non-DRF program (but there are no relaxed-MO __sync
variants, so programs that used __sync might just have relied on non-DRF
accesses).

I would guess that programs that actually contain synchronization code affected
by this difference are rather rare.  Normal locks and such are not affected,
nor are many common synchronization algorithms such as those for linked lists
(eg, ones that work fine with acq_rel MO ops).  I'd say that the full fences
are really
necessary mostly if one wants to synchronize using separate locations and
without a global common order for those accesses (eg, using several mem
locations and asymmetric access patterns to tune contention favorably for
common cases (eg, Dekker-like stuff if made asymmetric; libitm uses that)).  I
guess it's much more common to route the synchronization through centralized
or hierarchical locations so that eventually there is a tie-breaker.  For
example, AFAIR there's no use of a seq-cst fence in current glibc.

Thus, we need to at least improve the __sync docs.  If we want to really make
them C11-like, we need to update the docs, and there might be legacy code
assuming different semantics that breaks.  If we don't, we need to update

[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-09 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #6 from torvald at gcc dot gnu.org ---
(In reply to torvald from comment #5)
 (In reply to Matthew Wahab from comment #0)
  while a __sync full
  barrier should block all movement
  (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.
  html#_005f_005fsync-Builtins).
 
 But in your code example below, you didn't use __sync_synchronize.  The
 other __sync operations do not document that they are a full barrier,
 AFAICS.  (I'm not sure whether there is a wide assumption that they do.)

No, you're right.  They are in most cases considered to be a full barrier.


[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins

2015-04-09 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #5 from torvald at gcc dot gnu.org ---
(In reply to Matthew Wahab from comment #0)
 The __sync builtins are implemented by expanding them to the equivalent
 __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
 This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
 some data references to move across the operation (see
 https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)

I don't think that's a precise characterization.  The difference discussed in
that thread is that seq-cst fences have a stronger effect than seq-cst stores. 
This is not really a question of MEMMODEL_SEQ_CST but what you apply it to.

 while a __sync full
 barrier should block all movement
 (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.
 html#_005f_005fsync-Builtins).

But in your code example below, you didn't use __sync_synchronize.  The other
__sync operations do not document that they are a full barrier, AFAICS.  (I'm
not sure whether there is a wide assumption that they do.)

 This is problem for Aarch64 where data references after the barrier could be
 speculated ahead of it.
 
 For example, compiling with aarch64-none-linux-gnu at -O2,
 -
 int foo = 0;
 int bar = 0;
 
 int T1(void)
 {
   int x = __sync_fetch_and_add(foo, 1);
   return bar;
 }

This test case is problematic in that it has a data race on bar unless bar is
never modified by another thread, in which case it would be fine to load bar
before the fetch_add.  It also loads bar irrespective of which value the
fetch_add actually returns.  I'm aware that the __sync builtins offer no
MEMMODEL_RELAXED, and so lots of code simply has a plain memory access.  

Nonetheless, I think you should also look at a test case that is properly
synchronized, data-race-free C11 code with the builtins, and see whether that
is compiled differently (ie, either use a relaxed MO load for bar or make the
load of bar conditional on the outcome of the fetch_add).

To be on the safe side, you can also use the cppmem tool to verify what the C11
model requires the compiled code to guarantee.  Here's what you probably want
to achieve:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
r1=cas_strong(foo, 0, 1);
r2=bar.load(memory_order_relaxed);
  }
  ||| {
bar.store(42, memory_order_relaxed);
foo.store(1, memory_order_release);
  }
  }}};
  return 0; }

This tries to test whether ordered stores to bar and foo are observed in this
order in the first thread (it uses a CAS instead of fetch_add).  This gives me
3 consistent executions, of which one is the one in which the CAS succeeds (ie,
like fetch_add would do), and this one returns a value of 42 for the load of
bar.

 
 produces
 
 T1:
   adrpx0, .LANCHOR0
   add x0, x0, :lo12:.LANCHOR0
 .L2:
   ldaxr   w1, [x0]   ; load-acquire (__sync_fetch_and_add)
   add w1, w1, 1
   stlxr   w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
   cbnzw2, .L2
   ldr w0, [x0, 4]; load (return bar)
   ret
 
 With this code, the load can be executed ahead of the store-release which
 ends the __sync_fetch_and_add. A correct implementation should emit a dmb
 instruction after the cbnz.

I'm not sufficiently familiar with ARM to assess whether that's correct.  Is
this ARMv8?  It seems to be consistent with the mapping to machine code that
GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), which
does not include a DMB.  However, it also shows no difference between an
acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself (ie,
the machine code for it) prevents the reordering you are concerned about at the
HW level?

[Bug testsuite/64930] [5 regression] FAIL: gcc.target/powerpc/atomic-p7.c scan-assembler-times isync 12

2015-02-12 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64930

--- Comment #12 from torvald at gcc dot gnu.org ---
(In reply to Alan Modra from comment #9)
 My point was that if you write a testcase that specifically tests for
 consume and get acquire code then that is a fail.  The code generated is
 using a bigger hammer than necessary.

The consensus in ISO C++ SG1 (though we had no straw poll on this particular
thing, so this is my conclusion from the discussions and the opinions voiced by
other compiler vendors) is that implementing C/C++ memory_order_consume means,
in practice, promoting to memory_order_acquire.  This is not a GCC-specific
solution or deficiency.  It is rather the realization that the standard's
intent can't be implemented in practice without too much costs elsewhere (e.g.,
because of how tracking dependencies properly would require points-to analysis
or conservatively adding barriers in likely many places in the code).

See this paper for more background:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4321.pdf


[Bug testsuite/64930] [5 regression] FAIL: gcc.target/powerpc/atomic-p7.c scan-assembler-times isync 12

2015-02-11 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64930

--- Comment #5 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #3)
 Created attachment 34731 [details]
 gcc5-pr64930.patch
 
 Thus I'm proposing this untested patch.

I think expecting the consume-to-acquire promotion is the right thing to do (I
haven't tested the patch though).  The consume specification in the standard
has issues, and based on the discussions in ISO C++ SG1, it seems unlikely that
it can or will be fixed.  Thus, I believe the promotion will stay with us
forever.

Therefore, using an XFAIL doesn't seem right to me.


[Bug c++/63472] transaction_atomic within while loop causes ICE

2015-01-30 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63472

--- Comment #6 from torvald at gcc dot gnu.org ---
(In reply to ak from comment #4)
 copy_bbs:
 
 (illegal code due to goto into transaction?)
 
 g_56[];
 fn1() {
   int *p_79;
   if (g_56[7])
 __transaction_atomic {
 lbl_196:
   *p_79 = 1;
 }
   else
 goto lbl_196;
 }

Yes, the goto into the transaction is illegal code.

 copy_loops:
 
 func_65() {
   __transaction_atomic {
 for (;;)
   func_65();
   }
 }

I think that technically, at least for C++, this is illegal code because you
have an infinite loop with no side effects (atomics, volatile, IO) in it.  Does
this also fail when you make the recursion stop eventually?


[Bug libitm/61594] ICE (assertion failure) in trans-mem.c

2015-01-26 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594

--- Comment #6 from torvald at gcc dot gnu.org ---
We can keep it open, but my guess is that it would need some volunteer to
actively drive the backporting process. I.e., with a patch and testing.  Given
that TM is experimental, I think backporting fixes for TM has low priority on
most people's todo list.


[Bug libitm/52482] libitm INVALID MNEMONIC in .S (powerpc asm)

2015-01-26 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52482

--- Comment #11 from torvald at gcc dot gnu.org ---
Thanks for confirming.  However, that fails before libitm, for which you should
file a separate bug report.  Thanks.


[Bug libitm/61594] ICE (assertion failure) in trans-mem.c

2015-01-26 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594

torvald at gcc dot gnu.org changed:

   What|Removed |Added

Version|5.0 |4.9.1

--- Comment #7 from torvald at gcc dot gnu.org ---
Changed version to 4.9.1.


[Bug libitm/52482] libitm INVALID MNEMONIC in .S (powerpc asm)

2015-01-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52482

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #8 from torvald at gcc dot gnu.org ---
Can somebody with access to a powerpc-apple-darwin8 check whether this bug is
still present in SVN trunk?  Thanks!


[Bug libitm/61594] ICE (assertion failure) in trans-mem.c

2015-01-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||torvald at gcc dot gnu.org
 Resolution|--- |WORKSFORME

--- Comment #3 from torvald at gcc dot gnu.org ---
I can't reproduce this on SVN trunk (r220039), so I'll close this for now.

I didn't spot anything bad when briefly looking at the generated code.


[Bug libitm/52695] libitm/config/x86/cacheline.h: '__m64' does not name a type

2015-01-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52695

--- Comment #8 from torvald at gcc dot gnu.org ---
Can this bug be closed?  It sounds as if this has been fixed in 4.8 already;
given that TM is experimental, backporting the fix is probably not worth it.


[Bug libitm/56801] Internal Compiler Error when compiling relaxed transaction

2015-01-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56801

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||torvald at gcc dot gnu.org
 Resolution|--- |WORKSFORME

--- Comment #6 from torvald at gcc dot gnu.org ---
I can't reproduce the ICE with SVN trunk (r220039), so I'll close this for now.

The produced code looks correct, although it's a little odd: Both the
instrumented and the uninstrumented code paths are generated; on the
instrumented path, there's no explicit call to request going irrevocable, but
the flags to _ITM_beginTransaction state that only the uninstrumented path is
available.


[Bug libitm/53113] Build fails in x86_avx.cc if AVX disabled by -mno-avx but supported by as

2015-01-23 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53113

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #26 from torvald at gcc dot gnu.org ---
(In reply to h...@gcc.gnu.org from comment #24)
 Author: hjl
 Date: Mon Jan 13 19:36:17 2014
 New Revision: 206587
 
 URL: http://gcc.gnu.org/viewcvs?rev=206587&root=gcc&view=rev
 Log:
 Make sure that -msse/-mavx are appended at the end
 
   PR libitm/53113
   * Makefile.am (x86_sse.lo): Append -msse to CXXFLAGS.
   (x86_avx.lo): Append -mavx to CXXFLAGS.
   * Makefile.in: Regenerate.
 
 Modified:
 trunk/libitm/ChangeLog
 trunk/libitm/Makefile.am
 trunk/libitm/Makefile.in

H.J., can this bug be closed?

[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2014-10-30 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

--- Comment #21 from torvald at gcc dot gnu.org ---
(In reply to torvald from comment #17)
 (In reply to Andrew Macleod from comment #15)
  So have we concluded that we should promote memory_order_consume to
  memory_order_acquire for now?  
 
 I also think that this is the best way forward.  I believe everyone in ISO
 C++ SG1 agreed that this is basically a defect in the standard.

To clarify, my impression from the discussion was that there was general
consensus that memory_order_consume as specified now wouldn't achieve what it
was intended to in practice, due to the implementation issues.  IOW, it's not
useful even though it's not a defect in the sense of being inconsistent or
plain wrong.


[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2014-10-29 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

--- Comment #20 from torvald at gcc dot gnu.org ---
(In reply to jos...@codesourcery.com from comment #19)
 * carries_dependency is about increasing optimization, not decreasing it.  
 If it affects when barriers are added, it does so by allowing some 
 barriers to be omitted that would otherwise be required.

That's not quite true, unfortunately, AFAIU.  I agree that generally,
attributes are supposed to be ignorable -- but in the case of
carries_dependency, C++11 explicitly requires that all declarations of a
function either have the attribute or none has (see 7.6.4p2).  This is because
you need that to actually exploit the attribute; it's a contract between
callers and callees.  If a compiler does try to preserve dependencies (or not
across function boundaries), then ignoring the attribute is fine.  But if there
should be a compiler out there that doesn't, and GCC-generated code is supposed
to link to that other compiler's code, then we need to do something to make
this work.


[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2014-10-28 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

--- Comment #17 from torvald at gcc dot gnu.org ---
(In reply to Andrew Macleod from comment #15)
 So have we concluded that we should promote memory_order_consume to
 memory_order_acquire for now?  

I also think that this is the best way forward.  I believe everyone in ISO C++
SG1 agreed that this is basically a defect in the standard.

What I haven't thought through is how to deal with with carries_dependency
(7.6.4 in C++11): For GCC code generated after we promote consume to acquire,
it can safely be ignored; but should GCC code be linked to code generated by
another compiler that does not promote and expects the code to preserve
dependencies, this won't work.

I am not aware of any shipping compiler that would actually try to preserve
dependencies, and nobody else mentioned any during the discussion of this topic
in ISO C++ SG1.  Thus, we could assume that there are no such other compilers,
and make it part of the ABI (assumptions) that consume is promoted to acquire
in a correct compiler.

Alternatively, we could try to be conservative and add an acquire barrier
before the function body if any parameter of the function has
carries_dependency; and, likewise, add an acquire barrier after every call to a
function which has carries_dependency.

I don't have more input from the ISO C side, but I would guess that the
situation there is similar.

Thoughts?


[Bug ipa/61393] [trans-mem] O3 optimization level constant propagation problem

2014-08-06 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61393

--- Comment #9 from torvald at gcc dot gnu.org ---
Alex, can you confirm that this is fixed?


[Bug middle-end/61199] New: [trans-mem] _cxa_free_exception is not transaction safe

2014-05-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199

Bug ID: 61199
   Summary: [trans-mem] _cxa_free_exception is not transaction
safe
   Product: gcc
   Version: 4.10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org

Created attachment 32804
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=32804&action=edit
test case

See attached test case.  I'm not quite sure how we need to handle it: Is this
only called when another transaction has taken over, and we're freeing up
garbage?  Or can a call to this create a situation in which we canceled a
previous exception throw?  If the latter case holds, I guess we also need to
reset the transaction-in-flight state that libitm keeps when
cxa_allocate_exception is called.  Depending on this, we either need to make it
TM-pure or add a wrapper in libitm.

Thoughts?


[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe

2014-05-16 Thread torvald at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2014-05-16
 CC||aldyh at gcc dot gnu.org,
   ||jason at gcc dot gnu.org,
   ||rth at gcc dot gnu.org
 Ever confirmed|0   |1


[Bug c++/60272] atomic::compare_exchange_weak has spurious store and can cause race conditions

2014-02-19 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272

--- Comment #2 from torvald at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #1)
 So, either we'd need to change this function, so that it sets oldval to
 NULL_RTX
 first, and passes ..., oldval, mem, expected, ... and needs to also always
 ask for target, then conditionally on target store to expected, or perhaps
 add extra parameter to expand_atomic_compare_and_swap and do the store only
 conditionally in that case.  Richard/Torvald?

I'm not sure what's better.  But getting this fixed in 4.9.0 would be good! :)


[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2014-02-17 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

--- Comment #12 from torvald at gcc dot gnu.org ---
(In reply to algrant from comment #11)
 Where do you get that this is racy if the access to data is not atomic?

In threadB(), you do:

  f = flag.load(std::memory_order_consume); // B.A
  d = *(data + f - f); // B.B

That reads from data irrespective of the value of f, so will read when data is
actually not owned by threadB.  You can either make the accesses to data
atomic, or move the load from data to after checking that flag equals 1.

 By
 design, release/acquire and release/consume sequences don't require
 wholesale changes to the way the data payload (in the general case, multiple
 fields within a structure) is first constructed and then used.  1.10#13
 makes clear that as a result of the intra-thread sequencing between atomic
 and non-atomic operations (1.9#14), and the inter-thread ordering between
 atomic operations (1.10 various), there is a resulting ordering on
 operations to ordinary (sic) objects.  Please see the references to the
 C++ standard in the source example, for the chain of reasoning here.

I'm very much aware of the rules, I believe.  But as far as you can see,
there's a data race in your test program, which results in undefined behavior.


[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency

2014-01-23 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rth at gcc dot gnu.org

--- Comment #10 from torvald at gcc dot gnu.org ---
(In reply to algrant from comment #6)
 Here is a complete C++11 test case.  The code generated for the two loads
 in thread B doesn't maintain the required ordering:

The C++ test has a race condition because the nonatomic load from data is not
ordered wrt. the nonatomic store to data in the other thread.  Please fix that
and report back whether this changes anything.  (I believe the use of
relaxed-MO atomic accesses should be sufficient.)

Nonetheless, if there is a fixed (data-race-free) similar test, then I believe
this fixed test case to be valid in the sense that AFAIU the standards'
requirements (both C++11's and C11's; they are equal in this aspect), code as
in this test case is supposed to be carrying the dependencies forward.

However, I cannot see a *practical* implementation of this requirement.  AFAIU,
dependencies are carried forward through loads/stores too (5.1.2.4#14):
"An evaluation A carries a dependency to an evaluation B if: [...] A writes a
scalar object or bit-field M, B reads from M the value written by A, and A is
sequenced before B."
Because a compiler cannot, in general, analyze which piece of code stored last
to M, it would have to conservatively assume that a dependency is carried
through M.  In turn, it could never optimize expressions such as (x-x), at
least if those affect atomic operations on M in some way, irrespective of how
remotely that might have happened.  (I'm restricting my statement to atomic
accesses because I'm not quite sure whether there's some other reason why
nonatomic accesses couldn't be affected by this.)  OTOH, maybe there are
factors that I'm not aware of and that effectively constrain
carries-a-dependency to something that doesn't widely prohibit optimizations.

I don't have significant background information on the design of this part of
the memory model (nor the implementability of this part), but I'll ask around
at the C++ meeting in two weeks.

 According to the architecture specification, to achieve the ordering it's
 sufficient to use the result of the first load in the calculation of the
 address of the second, even if it's done in such a way as to have no
 dependence on the value.

And that's certainly a feasible *architecture* specification.  ISTM that the
language standard tried to model this, but then didn't give the compiler enough
leeway to still be able to optimize code.  The carries-a-dependency rules seem
like an unusual shortcut from the language level to requirements on the
generated code, without the usual indirection using the observable behavior of
the abstract machine and as-if.  For example -- and this isn't a precise
definition of course -- if the standard would require that f (in your test
case) is needed to process the evaluation, then things like f-f could be
optimized out, and the compiler might just do the right thing with standard
optimizations.

As a short-term fix, I believe we can promote all memory_order_consume atomics
to memory_order_acquire, when generating code with consume-MO atomics.  GCC is
at fault here, not the code in the builtins or in libatomic.  Thus, this would
prevent errors when GCC compiled the code and a correct atomics implementation
is used (e.g., third-party libatomic).

This won't help when another (potentially correct) compiler generated the code
with the consume-MO atomics, and GCC-compiled code fails to carry the
dependencies forward.  We could try to change libatomic's consume-MO
implementation to prevent that, but this would only catch cases in which
libatomic is actually used, not ones using the other compiler's builtins or
similar.  Thus, changing libatomic does not seem to be sufficient.

Richard, Andrew: Thoughts?


[Bug c/59218] New: atomic transactions: accesses to volatiles not disallowed in transaction_safe code

2013-11-20 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218

Bug ID: 59218
   Summary: atomic transactions: accesses to volatiles not
disallowed in transaction_safe code
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torvald at gcc dot gnu.org
CC: aldyh at gcc dot gnu.org

Accesses to volatiles are disallowed in transaction-safe code.  The following
should produce an error, but doesn't:

volatile int volatile *b;

void foo()
{
  __transaction_atomic
{
  b[10] = b[10] + 1;
}
}

If accessing a volatile int a instead, an error is reported (as expected).


[Bug c/59218] atomic transactions: accesses to volatiles not disallowed in transaction_safe code

2013-11-20 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218

torvald at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P4


[Bug libitm/57643] libitm.c/reentrant.c hangs on POWER8 with HTM

2013-06-19 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57643

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2013-06-19
 CC||torvald at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |torvald at gcc dot 
gnu.org
 Ever confirmed|0   |1


[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm

2013-05-21 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991

--- Comment #7 from torvald at gcc dot gnu.org ---
A piece of code is tm_pure if, roughly, it doesn't need any instrumentation
(e.g., in contrast to memory loads/stores).  In the test case, I suppose that
the compiler detects that it is tm_pure, but we also allow programmers to
declare it.

Ideally, tm_pure should be a property of a region of code that is preserved
across optimizations (but where we don't move code into or out of tm_pure
regions).  That may require too much implementation effort (but perhaps we
could reuse the TM regions for that, as a no-TM region?)

Alternatively, we could stop automatically marking always_inline functions as
tm_pure, and warn if an always_inline function is also annotated as tm_pure by
the programmer.

Other thoughts?


[Bug c++/56419] New: [4.8 regression] transactions in for-loops disappear

2013-02-21 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56419

 Bug #: 56419
   Summary: [4.8 regression] transactions in for-loops disappear
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: torv...@gcc.gnu.org
CC: al...@gcc.gnu.org, sp...@cse.lehigh.edu

Transaction statements in for-loops don't show up in Gimple.  For example:

int x = 0;

int inc_func(int i) {
 for (int j = 0; j < i; ++j)
 {
 __transaction_atomic { x+=1; }
 }
 return 0;
}

g++ -v -O0 test.cc -c -fgnu-tm -fdump-tree-all -Wall

leads to the following test.cc.004t.gimple:

int inc_func(int) (int i)
{
  int D.2360;

  {
    int j;

    j = 0;
    goto D.2358;
    D.2357:
    j = j + 1;
    D.2358:
    if (j < i) goto D.2357; else goto D.2355;
    D.2355:
  }
  D.2360 = 0;
  return D.2360;
}

This does work when using the C frontend.  Seems to also work when the
transaction is in while loops or on an if-branch.  Doesn't seem to be sensitive
to which C++ standard is requested.



Originally reported by Mike Spear.  Mike says this works with 4.7.2, but
doesn't with g++ (GCC) 4.8.0 20130221 (experimental) (revision 196192).
I've confirmed with an early-December build.


[Bug libitm/55693] [4.8 Regression] libitm.c++/eh-1.C execution test fails on darwin from r193271

2013-01-11 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55693

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #7 from torvald at gcc dot gnu.org 2013-01-11 22:54:04 UTC ---
The gdb trace looks alright to me.  My guess is that this is a bug in the
uninstrumented code path, not in libitm.

Could you try again after adding the following to the top of
GTM::gtm_thread::begin_transaction in beginend.cc, please:

prop &= ~pr_uninstrumentedCode;

Then, libitm shouldn't take the uninstrumented code path anymore.  If it works
with this change, this likely is a bug on the compiler side.


[Bug rtl-optimization/51771] trans-mem: abnormal edges get lost or corrupted

2012-12-05 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51771

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution||FIXED

--- Comment #11 from torvald at gcc dot gnu.org 2012-12-05 19:18:42 UTC ---
I can't reproduce this anymore.  I also had a brief look at the binary code
that showed the problem previously and didn't see the problematic pattern
anymore.  So it seems that this is fixed.


[Bug sanitizer/48076] Unsafe double checked locking in __emutls_get_address

2012-11-28 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48076

--- Comment #7 from torvald at gcc dot gnu.org 2012-11-28 14:29:47 UTC ---
(In reply to comment #6)
 There seems to be a similar bug in code generated for function static
 variables.
 The fast-path load is a plain load rather than atomic acquire load.

Haven't looked at the details, but this indeed looks similar.


[Bug c++/54893] unable to access volatile variable within relaxed transaction

2012-10-11 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54893

--- Comment #3 from torvald at gcc dot gnu.org 2012-10-11 19:54:11 UTC ---
I agree with Michael.  Accesses to volatile vars are disallowed in safe code,
but relaxed transactions can run unsafe code (after going irrevocable).  The
test case in bug 46654 is an atomic transaction.


[Bug tree-optimization/52558] write introduction incorrect wrt the C++11 memory model

2012-05-31 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558

--- Comment #20 from torvald at gcc dot gnu.org 2012-05-31 20:23:51 UTC ---
(In reply to comment #19)
 Fixed on mainline.  I doubt this will be fixed on 4.7, as this is too
 intrusive -- unless one of the RMs really wants it on 4.7.

OTOH, it would be good if we could avoid those races in 4.7 too.  Can you
gather the RMs' opinions please?


[Bug libitm/53008] abort in _ITM_getTMCloneSafe

2012-05-15 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53008

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #4 from torvald at gcc dot gnu.org 2012-05-15 16:18:00 UTC ---
So, is there still a libitm issue even with your GCC patch?


[Bug libstdc++/52839] double free or corruption running tr1/.../default_weaktoshared.exe

2012-04-12 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52839

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 CC||torvald at gcc dot gnu.org

--- Comment #25 from torvald at gcc dot gnu.org 2012-04-12 08:24:16 UTC ---
If we make an ABI switch at some point, should we move over to relying just on
atomics and the libatomic fallbacks (assuming/hoping libatomic exists by then)?

Also, refcounting in basic_string has bugs too.  Do you see other areas of
libstdc++ that could use a review of the synchronization bits?


[Bug c++/52558] write introduction incorrect wrt the C++11 memory model

2012-03-13 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558

torvald at gcc dot gnu.org changed:

   What|Removed |Added

  Component|tree-optimization   |c++

--- Comment #12 from torvald at gcc dot gnu.org 2012-03-13 21:17:48 UTC ---
(In reply to comment #11)
 Just one remark: in this case the write introduction is incorrect wrt the 
 C++11
 memory model because there is no previous write to the same location.  If
 there had been a previous write to the same location, then the discriminating
 context would have been racy and the code generated by -O2 sound.
 
 I believe that the above argument generalises to all write introductions but I
 don't yet have a proof of this, so take it with a pinch of salt.

We follow a similar line of reasoning when thinking about which transformations
are allowed in transactional code (or other concurrent code).  In principle,
this also applies to loads to some extent even though racy loads are supposed
to be okay for as-if as long as a value obtained from a racy read is never
used.  Having a more precise yet practical understanding of this would
certainly help.


[Bug libitm/52526] libitm: stuck in futex_wait

2012-03-13 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526

--- Comment #3 from torvald at gcc dot gnu.org 2012-03-13 22:01:41 UTC ---
Author: torvald
Date: Tue Mar 13 22:01:34 2012
New Revision: 185358

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185358
Log:
libitm: Fix lost wake-up in serial lock.

PR libitm/52526
* config/linux/rwlock.cc (GTM::gtm_rwlock::read_lock): Fix lost
wake-up.

Modified:
trunk/libitm/ChangeLog
trunk/libitm/config/linux/rwlock.cc


[Bug libitm/52526] libitm: stuck in futex_wait

2012-03-13 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526

--- Comment #4 from torvald at gcc dot gnu.org 2012-03-13 22:11:52 UTC ---
Author: torvald
Date: Tue Mar 13 22:11:46 2012
New Revision: 185360

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185360
Log:
libitm: Fix lost wake-up in serial lock.

PR libitm/52526
Backported from mainline.
* config/linux/rwlock.cc (GTM::gtm_rwlock::read_lock): Fix lost
wake-up.

Modified:
branches/gcc-4_7-branch/libitm/ChangeLog
branches/gcc-4_7-branch/libitm/config/linux/rwlock.cc


[Bug libitm/52526] libitm: stuck in futex_wait

2012-03-13 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED

--- Comment #5 from torvald at gcc dot gnu.org 2012-03-13 22:13:16 UTC ---
Fixed in both trunk and 4.7.


[Bug libitm/52526] libitm: stuck in futex_wait

2012-03-10 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526

--- Comment #1 from torvald at gcc dot gnu.org 2012-03-10 17:46:12 UTC ---
Patrick, I posted a fix to gcc-patches.  Please have a look if you can.


[Bug libitm/52526] libitm: stuck in futex_wait

2012-03-08 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526

torvald at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2012-03-08
 AssignedTo|unassigned at gcc dot   |torvald at gcc dot gnu.org
   |gnu.org |
 Ever Confirmed|0   |1


[Bug middle-end/51752] trans-mem: publication safety violated

2012-02-09 Thread torvald at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51752

--- Comment #6 from torvald at gcc dot gnu.org 2012-02-09 18:40:49 UTC ---
(In reply to comment #4)
 But isn't with
 
   __transaction_atomic
 {
   for (i = 0; i < 10; i++)
 if (x[i])
   x[i] += data;
 }
 
 and
 

prepend with the following for thread B:

  data = 123;

   __transaction_atomic { x[9] = 1; }
 
 occurring concurrently the loop transaction terminated?  Thus,
 
   __transaction_atomic
 {
   tem = data;
   for (i = 0; i < 10; i++)
 if (x[i])
   x[i] += tem;
 }
 
 equivalent?

No.  It might be for typical HTMs, but we can't expect this in the general
case.  The interleaving that can then happen is (with A and B being the two
concurrent threads):

  B: tem = data;  // reads initial value of zero
  A: data = 123;
  A: __transaction_atomic { x[9] = 1; }
  B: if (x[i])
  B:   x[i] += tem;  // adds zero, not 123.

The problem here is that B's store to data is nontransactional, and we cannot
add synchronization code for those (it might be library code, for example).

We could add a partial workaround to libitm that would reduce this case here to
just a racing load.  However, that would kill performance because basically,
B's transaction would have to wait for all earlier-started transactions before
it can start executing.  The racing load can still be a problem if we hoist
dereferencing a pointer.

We don't see this example with other concurrent code because there, the load of
x[] would have to be a synchronizing operation, and we don't hoist across
these.

