[Bug c/84563] GCC interpretation of C11 atomics (DR 459)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #4 from torvald at gcc dot gnu.org --- Closing after further discussion.
[Bug c/84563] GCC interpretation of C11 atomics (DR 459)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #3 from torvald at gcc dot gnu.org --- This is not a bug, see https://gcc.gnu.org/ml/gcc/2018-02/msg00224.html for further discussion.
[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #7 from torvald at gcc dot gnu.org --- (In reply to Michael Poole from comment #0) > When compiling for x86-64 with the -mcx16 flag, there is no diagnostic for > code like this: > > #include > __int128 test(void) > { > const void *ptr = mmap(0, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, > -1, 0); > const __int128 *p_v = (const __int128 *)ptr; > return __atomic_load_n(p_v, __ATOMIC_SEQ_CST); > } > > The CMPXCHG16B instruction that is generated unconditionally attempts to > write to the address, which causes a fault at runtime. When > __atomic_load_n() uses that instruction, it seems prudent to reject a const > pointer as the first argument. I guess we haven't spelled that out in the docs, but the thought is that one would use the __atomic builtins like one would use C11/C++11 atomics. In that world, only "address-free" atomic types are guaranteed to work when mapped from another process, for example, and only lock-free types are address-free. (That doesn't quite cover all cases, such as when making process-private read-only, but it gives at least some idea of what's the intent.) Users should ensure that the particular type is lock-free (or that atomic ops on that type are) to know that it's safe to use atomic loads on read-only memory. The most recent GCC shouldn't declare the types to be lock-free just because cmpxchg16b is available. It might still use it under the covers, but the fact that it's not lock-free indicates that the implementation is not just native atomic operations as supported by the hardware, but something different. I guess we need to spell this out in the docs.
[Bug target/80640] Missing memory side effect with __atomic_thread_fence (2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80640 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #11 from torvald at gcc dot gnu.org --- (In reply to Alexander Monakov from comment #7) > I've submitted a patch [1] for the missing compiler barrier, but however > please note that the original ompi code and the example in comment #3 are > wrong: in a pattern like > > while (*foo) > __atomic_thread_fence(__ATOMIC_ACQUIRE); > > I think there are two issues; first, if *foo is a non-atomic, non-volatile > object, a concurrent modification from another thread would invoke undefined > behavior due to a data race; It would be a data race even if it would be volatile but still non-atomic. I can see how making atomic_thread_fence a compiler barrier for non-atomics might help existing buggy code, but it would be sufficient it prevents certain reordering of atomics. Some more background and examples: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461 --- Comment #15 from torvald at gcc dot gnu.org --- From a ISO C/C++ conformance perspective, this is not a bug: * Pre-C11/C++11, threading was basically undefined. * Starting with C11 and C++11, "threads of execution" (as they're called in C++11 and more recent, often abbreviated in the standard as "threads") are the lowest level of how execution is modelled. They are defined as single flows of control. Nothing is promised about any resources that may be used to implement threads of execution (e.g., similar to the "execution context" notion mentioned in comment #10). * Thread-specific state is bound to a particular thread of execution (e.g., regarding lifetime). A thread of execution accessing a __thread variable accesses the thread-specific state of this thread of execution in the abstract machine. (Of course, one can still access other threads's thread-specific state through pointers initially provided by those threads.) * Only the standards' mechanisms can create threads of execution. There is nothing in these standards that would break up the concept of a single flow of control (ie, that what "looks" like a single flow of control in a program is actually not always the same thread of execution when executed). (Also note that fork/join parallelism is not a counter-example to this.) That said, I can see that this doesn't match nicely with the fact that we have things like swapcontext elsewhere. Do we have any data on what the performance impact where if the compiler would assume that function calls to functions it cannot analyze could switch the thread of execution. Data for several architectures and different TLS models would be helpful. Coming back to C++, currently I think there is only one Technical Specification (TS) that allows breaking up a single flow of control: .then() in the Concurrency TS (whose specification is certainly not ready for the standard). Maybe the Networking TS has something similar, but I can't remember right now. There are a few proposals that either allow it (Task Blocks, targeting Parallelism TS version 2) or require it for good performance ("stackful" coroutines). The "stackless" coroutines in the upcoming Coroutines TS are not really affected by thread-specific state; it's not the coroutines code that would potentially switch threads, but any runtime that would supply a particular Awaitable implementation that then may switch threads (e.g., if using .then()). The Coroutines does not specify any actual runtime. However, I don't think the existence of some C++ proposals that may switch threads necessarily means that the compiler would have to take this into account when those proposals should become part of the standard. The other way this could play out is that the standard simply makes using thread-specific state undefined for those threads of execution that can use these proposals. Overall, I think it may be useful to experiment with a command line switch or something like that, primarily to assess how big the performance degradation would be caused by having the compiler assume that threads can switch on function calls. (In reply to Jakub Jelinek from comment #14) > Even if we have an option that avoids CSE of TLS addresses across function > calls (or attribute for specific function), what would you expect to happen > when user takes address of TLS variables himself: > __thread int a; > void > foo () > { > int *p = > *p = 10; > bar (); // Changes threads > *p += 10; > } I think that p would point to the initial thread's TLS, even after bar(). The user would be wrong to assume that it still is the initial thread's object "a" after having called bar().
[Bug sanitizer/78158] Strange data race detection with thread sanitizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78158 --- Comment #19 from torvald at gcc dot gnu.org --- (In reply to Dmitry Vyukov from comment #8) > We need to modify tsan runtime to ignore (zero) __ATOMIC_HLE_ACQUIRE/RELEASE > bits, right? It's only an optimization and we can pretend that elision never > succeeds under tsan. I agree. The actual memory order that callers have to provide in addition to the HLE bits is what should determine the synchronization semantics. (IMHO, exposing HLE instructions through memory order bits is a a weird design.)
[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991 --- Comment #12 from torvald at gcc dot gnu.org --- Thanks for reporting that you are affected as well. Regarding your question about bumping the priority, one problem for the TM support in general has been to be able to judge how many users are out there, and whether they are interested in TM for in-production software or just toying around with it because it's something new. This holds for GCC's implementation, but also for the ISO C++ TM Technical Specification. Thus, if you or anyone else is interested in using TM for real, please share your feedback with me personally or ISO C++ Study Group 5 (https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/tm ; sending email to t...@isocpp.org might work even if you're not subscribed to this list, or it should at least end up in a moderation queue). Sharing your feedback will help us and Study Group 5.
[Bug libgcc/71744] Concurrently throwing exceptions is not scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 --- Comment #28 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #20) > (In reply to Gleb Natapov from comment #19) > > (In reply to Jakub Jelinek from comment #18) > > > (In reply to Gleb Natapov from comment #16) > > > > Can you suggest an alternative to libgcc patch? Use other TLS model? > > > > Allocate per thread storage dynamically somehow? > > > > > > If we want to use TLS (which I hope we don't), then e.g. a single __thread > > > pointer with some registered destructor that would free it on process exit > > > could do the job, and on the first exception it would try to allocate > > > memory > > > for the cache and other stuff and use that (otherwise, if memory > > > allocation > > > fails, just take a lock and be non-scalable). > > > > > I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do > > the same here? > > Can? Yes. Want? Nope. It is worse than TLS. > > > > Another alternative, perhaps much better, if Torvald is going to improve > > > rwlocks sufficiently, would be to use rwlock to guard writes to the cache > > > etc. too, and perhaps somewhat enlarge the cache (either statically, or > > > allow extending it through allocation). > > > I'd expect that usually these apps that use exceptions too much only care > > > about a couple of shared libraries, so writes to the cache ought to be > > > rare. > > > > > As I said in my previous reply, I tested the new rwlock and in congested > > case > > it still slows does the system significantly, not the implementation fault, > > cpu just does not like locked instruction much. Not having a lock will be > > significantly better. > > You still need at least one lock, the array of locks is definitely a bad > idea. I don't think it is necessarily a bad idea -- spreading out the data used for synchronization to avoid cache conflicts when there is an asymmetrical workload is a commonly used technique. But we need to find a balance between compute and space overhead that works for us and in this case. Also, spreading out the data should ideally happen in a NUMA-aware way, so there may be more space overhead if we consider that too. I've been thinking about using more scalable ways of locking in glibc which would use per-thread data, which could be useful for more than one lock. So if we had that, we could just use that to deal with the exception scalability problem, without having space overhead just for exceptions.b
[Bug libgcc/71744] Concurrently throwing exceptions is not scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 --- Comment #21 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #17) > (In reply to torvald from comment #15) > > > Similarly, the 64 recursive locks in libc, again, significant amount of > > > memory > > > per process that doesn't care about exceptions, > > > > That might be reducable if we build custom locks and don't use pthread ones, > > but we'll at least have 64b per lock. > > Right now the lock is I think 40 bytes per lock, which is 64 * 40 bytes per > process. That is just too much (but of course the 64x locking a recursive > lock is even worse). Remembering more of the discussion we had about this in the past, then even with the improved rwlock, Gleb reports that there is a slowdown in Gleb's tests because of cache contention -- which would mean that we may have to use one cacheline per lock.
[Bug libgcc/71744] Concurrently throwing exceptions is not scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 --- Comment #15 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #14) > Looking at that patchset, I just want to say that the use of TLS in libgcc > is very much undesirable, it affects every program even when not using > exceptions (significant portion of them), especially when it uses that much > (6 * sizeof(void *) * 8 + sizeof(void *) + 2 * 8 = 408 bytes per thread is > significant, plus it uses the slow local dynamic TLS model. I agree. > Similarly, the 64 recursive locks in libc, again, significant amount of > memory > per process that doesn't care about exceptions, That might be reducable if we build custom locks and don't use pthread ones, but we'll at least have 64b per lock. > and especially signficiant > slowdown on dlopen/dlclose. While some programs abuse exceptions for normal > control flow, others abuse dlopen/dlclose. > > A rwlock ought to be better, but I don't think we have recursive rwlock in > libc rdlock can be recursive, rwlock cannot. It is something we could add with not much effort in a custom rwlock, I think. > yet (all we need is being able to recursively lock it for writing, for > reading > we might not need to allow recursion in this case). Not having to deal with recursive rdlock can make things easier for the rwlock implementation. > Plus the current rwlock > implementation still takes uses a per-process lock to guard the inner > operations > on the rwlock, so while rwlock helps with longer critical sections, for very > short ones it likely doesn't scale all that well. This will change as soon as my new rwlock is upstream. Especially short rdlock sections will be much faster.
[Bug libstdc++/65033] C++11 atomics: is_lock_free result does not always match the real lock-free property
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65033 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #10 from torvald at gcc dot gnu.org --- The difference is unfortunate, but C11 specifies that atomic_is_lock_free is *per object*. I suppose that any change there would have to go through the ISO C process.
[Bug libstdc++/71133] msp430-elf -mlarge FTBFS in libstdc++-v3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71133 --- Comment #2 from torvald at gcc dot gnu.org --- I think it's easiest to disable the TM support on less-than-32b targets.
[Bug middle-end/70638] transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638 --- Comment #3 from torvald at gcc dot gnu.org --- (In reply to Hillel Avni from comment #2) > On gcc-linaro-4.9-2014.11, I must declare the wrapper as pure. But using that version the wrapper was indeed used and not the original function, right? In that case, this part of this bug report would be a regression.
[Bug middle-end/70638] transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638 --- Comment #1 from torvald at gcc dot gnu.org --- Created attachment 38244 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38244=edit test case
[Bug middle-end/70638] New: transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70638 Bug ID: 70638 Summary: transaction_wrap: too strict compatibility check and transaction_pure wrappers fail to wrap Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org Target Milestone: --- The attached test case shows two issues with transaction_wrap: 1) If the original function is not transaction_safe, but the wrapper is declared safe, this incorrectly results in an compilation error about the original and the wrapper function being incompatible. I believe this used to work before, but haven't checked. 2) If a wrapper is declared transaction_pure, the original function is incorrectly considered to be transaction_pure. The fourth case in the test might be something to be considered adding support for. OTOH, perhaps it's safer for programmers having to explicitly select whether a wrapper is pure or safe, given that there seem to be these two separate use cases (ie, making it pure to customize instrumentation, and making it safe to use a transaction-friendly wrapper that still needs to be instrumented).
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #25 from torvald at gcc dot gnu.org --- (In reply to Andrew Macleod from comment #24) > Author: amacleod > Date: Wed Jan 14 13:58:35 2015 > New Revision: 219601 > > URL: https://gcc.gnu.org/viewcvs?rev=219601=gcc=rev > Log: > > 2015-01-14 Andrew MacLeod <amacl...@redhat.com> > > PR middle-end/59448 > * builtins.c (get_memmodel): Promote consume to acquire always. > * testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal > consume in an atomic_exchange. To summarize, Jeff Preshing confirmed that Andrew's patch fixes the problem (see Comment 23). Everyone in ISO C++ seems to agree that promoting consume to acquire is the right approach until we have a different language-level facility that is actually implementable. Therefore, I'm closing this as fixed.
[Bug c/59218] atomic transactions: accesses to volatiles not disallowed in transaction_safe code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED Target Milestone|--- |6.0 --- Comment #3 from torvald at gcc dot gnu.org --- Fixed by Jason.
[Bug c++/63472] transaction_atomic within while loop causes ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63472 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #8 from torvald at gcc dot gnu.org --- Seems to be the same problem as bug 60908. *** This bug has been marked as a duplicate of bug 60908 ***
[Bug c++/60908] compiler bug related to trans-mem.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60908 torvald at gcc dot gnu.org changed: What|Removed |Added CC||spear at cse dot lehigh.edu --- Comment #5 from torvald at gcc dot gnu.org --- *** Bug 63472 has been marked as a duplicate of this bug. ***
[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550 --- Comment #2 from torvald at gcc dot gnu.org --- Created attachment 37422 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37422=edit Test case part 2/3
[Bug libitm/69018] libitm/testsuite/libitm.c++/c++.exp: libstdc++ include paths don't work if tests set options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69018 torvald at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P4 Status|UNCONFIRMED |NEW Last reconfirmed||2016-01-21 Ever confirmed|0 |1 --- Comment #2 from torvald at gcc dot gnu.org --- Downgraded because this is not user-visible and we have a work-around.
[Bug c++/60908] compiler bug related to trans-mem.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60908 --- Comment #4 from torvald at gcc dot gnu.org --- Still happens with r232693.
[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550 --- Comment #3 from torvald at gcc dot gnu.org --- Created attachment 37423 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37423=edit Test case part 3/3
[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991 --- Comment #9 from torvald at gcc dot gnu.org --- Still fails with r232693. This seems to be another case of difficult ordering between TM passes and other passes. It makes sense that we don't inline tm_pure because we must not loose that information. always_inline is specified to produce an error when not inlining, but this shouldn't be much of a problem when considering code instrumented for transactions I suppose (can there be a case where lack of inlining causes a correctness problem?). Perhaps it's easiest if we clone the function if we see such a case, so that the solution can be different for TM-instrumented clones and normal code.
[Bug c/57855] passing unsafe function as transaction_safe function pointer does not generate error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57855 torvald at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P4 Status|UNCONFIRMED |NEW Last reconfirmed||2016-01-21 Component|libitm |c Version|unknown |6.0 Ever confirmed|0 |1 --- Comment #3 from torvald at gcc dot gnu.org --- I can confirm that this still happens with r232693 if the test program is compiled as a C program; a warning is produced, but there is no error. However, if compiling the reproducer as a C++ program, an error is returned, as required by the TM TS: pr57855.c:28:14: error: invalid conversion from ‘void (*)(const char*, int, const char*, int, const void*)’ to ‘ADD_STAT {aka void (*)(const char*, int, const char*, int, const void*) transaction_safe}’ [-fpermissive] bar(my_func); I'm not sure whether the initial TM specification required an error, but I think it's better to keep this open given that ISO C++ SG5 people are working on / interested in a variant of the TM TS for C (ie, so that we don't forget about it). Eventually, I would suppose that we phase out support for the old TM specification. Therefore, I've also reduced the priority of this and associated this with the C frontend.
[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550 --- Comment #1 from torvald at gcc dot gnu.org --- Created attachment 37421 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37421=edit Test case part 1/3
[Bug middle-end/57348] [TM] ICE for transaction expression in gimplify_expr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57348 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |WORKSFORME --- Comment #3 from torvald at gcc dot gnu.org --- Works with r232693.
[Bug c++/63550] Multiple definition errors occur only with -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63550 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2016-01-22 CC||jason at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #4 from torvald at gcc dot gnu.org --- Confirmed with r232693. std::vector's constructor isn't considered transaction_safe, but see the attached test case. I'm not sure making the clones weak is the right approach; the clones should probably mirror whatever is done for the original functions.
[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #14 from torvald at gcc dot gnu.org --- Author: torvald Date: Wed Jan 20 17:47:03 2016 New Revision: 232628 URL: https://gcc.gnu.org/viewcvs?rev=232628=gcc=rev Log: libstdc++: Darwin does not support weak refs without definition. PR libstdc++/69310 * config/os/bsd/darwin/os_defines.h (_GLIBCXX_USE_WEAK_REF): Define. Modified: trunk/libstdc++-v3/ChangeLog trunk/libstdc++-v3/config/os/bsd/darwin/os_defines.h
[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #15 from torvald at gcc dot gnu.org --- Fixed on trunk.
[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #11 from torvald at gcc dot gnu.org --- (In reply to Jack Howarth from comment #10) > It is unclear if the changes in r232454, to avoid the explicit linkage on > libitm, can ever be made darwin-friendly. On darwin, every single executable > linked against libstdc++ would require -Wl,-undefined,dynamic_lookup on the > linkage to avoid linkage errors from the undefined symbols from libitm > within libstdc++. Is Darwin declaring __GXX_WEAK__? I suppose that it does because after r232539 the failing code isn't compiled unless __GXX_WEAK__ is supported. But if it is, weak references must be supported. Or are you saying that specifically the alias attribute is a problem? If indeed weak references to symbols not declared are a problem, then we probably should disable the transactional clones the same way we did on AIX, which doesn't support weak references without definitions either. What about putting this: // No support for referencing weak symbols without a definition. #define _GLIBCXX_USE_WEAK_REF 0 into ./config/os/bsd/darwin/os_defines.h? This would disable the transactional clones altogether. Could you give this a try?
[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #7 from torvald at gcc dot gnu.org --- Does this patch fix the issue on Darwin? https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01452.html
[Bug libstdc++/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #6 from torvald at gcc dot gnu.org --- (In reply to Jack Howarth from comment #5) > (In reply to torvald from comment #4) > > > See https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01203.html > > (Not linked to this bug because this bug is about the darwin breakage.) > > The commit of that patch didn't resolve the bootstrap failures on > x86_64-apple-darwin15 . The "only weak aliases are supported in this > configuration" errors still block the bootstrap on darwin. Of course they don't. This was a reply to a comment about another issue not actually related to the bug, but posted as comment on this bug.
[Bug bootstrap/69312] [6 Regression] libstdc++ unconditionally refers to TM symbols
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69312 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from torvald at gcc dot gnu.org --- Fixed in r232539
[Bug bootstrap/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #2 from torvald at gcc dot gnu.org --- (In reply to İsmail Dönmez from comment #1) > r232454 also breaks bootstrap for mingw-w64: > > libtool: compile: x86_64-w64-mingw32-c++ > -L/havana/mingw-w64-6.0.0/x86_64-w64-mingw32/lib > -L/havana/mingw-w64-6.0.0/mingw/lib -isystem > /havana/mingw-w64-6.0.0/x86_64-w64-mingw32/include -isystem > /havana/mingw-w64-6.0.0/mingw/include > -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/../libgcc > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/ > include/x86_64-w64-mingw32 > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/ > include -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/libsupc++ > -std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra > -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once > -ffunction-sections -fdata-sections -frandom-seed=cow-stdexcept.lo -g -O2 -c > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc -o > cow-stdexcept.o > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:235:70: > warning: unused parameter 'exc' [-Wunused-parameter] > _txnal_cow_string_C1_for_exceptions(void* that, const char* s, void *exc) > ^ > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc: In > function 'void* txnal_read_ptr(void* const*)': > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39: > error: expected ',' before ')' token >|| sizeof(uint32_t) == sizeof(void*)); >^ > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39: > error: expected string-literal before ')' token is mingw's static_assert any different? Right now, I can't see how this could produce this error: static_assert(sizeof(uint64_t) == sizeof(void*) || sizeof(uint32_t) == sizeof(void*));
[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199 torvald at gcc dot gnu.org changed: What|Removed |Added Target Milestone|--- |6.0
[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199 torvald at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #2 from torvald at gcc dot gnu.org --- Fixed in 230634.
[Bug bootstrap/69310] [6 Regression] Revision r232454 breaks bootstrap on x86_64-apple-darwin15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69310 --- Comment #4 from torvald at gcc dot gnu.org --- (In reply to İsmail Dönmez from comment #3) > (In reply to torvald from comment #2) > > (In reply to İsmail Dönmez from comment #1) > > > r232454 also breaks bootstrap for mingw-w64: > > > > > > libtool: compile: x86_64-w64-mingw32-c++ > > > -L/havana/mingw-w64-6.0.0/x86_64-w64-mingw32/lib > > > -L/havana/mingw-w64-6.0.0/mingw/lib -isystem > > > /havana/mingw-w64-6.0.0/x86_64-w64-mingw32/include -isystem > > > /havana/mingw-w64-6.0.0/mingw/include > > > -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/../libgcc > > > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/ > > > include/x86_64-w64-mingw32 > > > -I/havana/mingw-w64-build/build-6.0.0/x86_64-w64-mingw32/libstdc++-v3/ > > > include -I/havana/mingw-w64-build/combined-6.0.0/libstdc++-v3/libsupc++ > > > -std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra > > > -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once > > > -ffunction-sections -fdata-sections -frandom-seed=cow-stdexcept.lo -g -O2 > > > -c > > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc -o > > > cow-stdexcept.o > > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:235:70: > > > warning: unused parameter 'exc' [-Wunused-parameter] > > > _txnal_cow_string_C1_for_exceptions(void* that, const char* s, void *exc) > > > ^ > > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc: In > > > function 'void* txnal_read_ptr(void* const*)': > > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39: > > > error: expected ',' before ')' token > > >|| sizeof(uint32_t) == sizeof(void*)); > > >^ > > > ../../../../../combined-6.0.0/libstdc++-v3/src/c++11/cow-stdexcept.cc:281:39: > > > error: expected string-literal before ')' token > > > > is mingw's static_assert any different? Right now, I can't see how this > > could produce this error: > > > > static_assert(sizeof(uint64_t) == sizeof(void*) > > || sizeof(uint32_t) == sizeof(void*)); > > g++-5 on Linux does not compile that either, clang is more helpful to why: > > t.cpp:5:85: warning: static_assert with no message is a C++1z extension > [-Wc++1z-extensions] > > So, just adding a message to static_assert fixes the issue here. Ah, right. See https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01203.html (Not linked to this bug because this bug is about the darwin breakage.)
[Bug libitm/69018] New: libitm/testsuite/libitm.c++/c++.exp: libstdc++ include paths don't work if tests set options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69018 Bug ID: 69018 Summary: libitm/testsuite/libitm.c++/c++.exp: libstdc++ include paths don't work if tests set options Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libitm Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org Target Milestone: --- If a tests sets an option (eg, // { dg-options "-fgnu-tm" }), then the include paths for C++ std headers set by c++.exp don't work anymore. Switching the last two arguments to the "dg-runtest" command close to the end of c++.exp makes it work, but I don't know whether that's a fix or a workaround.
[Bug libstdc++/68921] [5/6 Regression] std::future::wait() makes invalid futex calls and spins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68921 --- Comment #3 from torvald at gcc dot gnu.org --- LGTM, thanks. Would be nice to backport this.
[Bug libitm/68591] trans-mem: missing transactional read in transaction when using -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591 --- Comment #3 from torvald at gcc dot gnu.org --- This might be related to bug 68616.
[Bug middle-end/68616] miscompilation in multi-threaded code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68616 --- Comment #2 from torvald at gcc dot gnu.org --- I basically don't know anything about IPA, so I'll just assume that by "barrier" you mean conditions prohibiting transformations. I'm also not sure whether just CSE is a problem here, so I'll try to give an unspecific but broad answer. I'll assume we're looking at cases of nonatomic accesses to foo (such as ptr in the test case) and calls to the (new) atomic builtins on different variables bar1,bar2,... as synchronization points. Mixed atomic/nonatomic accesses to the same memory location are in most cases a bug, and I believe we discourage it, but they might not generally be incorrect (e.g., if loading through an atomic op and observing a value that indicates we're the only thread accessing it, subsequent nonatomic accesses might be fine); maybe we should just let an atomic access to bar1 be a barrier for all movement/merging across/involving this access. Any old atomic builtins (ie, __synch) can probably be handled like the equivalent calls to the new builtins. I believe we don't need to do includes volatiles, because even if they are used in old-style code, they should have asm compiler barriers around them -- and I hope we're handling those correctly. Because != , atomic stores to bar must be __ATOMIC_RELEASE or stronger, and atomic loads to bar must be __ATOMIC_ACQUIRE or stronger; otherwise, there's no implied ordering relationship between the foo and bar accesses. A good way to find out what transformations are or are not allowed is to consider the data-race-freedom (DRF) requirement and which regions an implementation would be allowed to execute atomically. For example, "foo = 1; bar.store(1, release); foo = 2;": The implementation is allowed to execute that in one atomic step, so there cannot be a nonatomic read of fooin another thread that's triggered by the store to bar because there would be a data race in this case. So, foo=1 can be removed if we assume DRF. (Note that old synchronization code may not be written to guarantee DRF; so perhaps this should be conditional on -std=c11 or such.) "temp=foo; if (bar.load(acquire)) temp2=foo;" is similar. However, this example here (basically the test case) doesn't allow for optimizing across the synchronization: "foo = 1; temp=foo; bar.store(1, release); if (bar.load(acquire)==2) temp2=foo;" This is because both loads of foo cannot be executed in one atomic step, as it needs another thread to increment bar to the value 2; also, there is no nonatomic access to foo between the two atomic accesses, so we cannot derive that those other threads that execute between the atomics don't access foo because of DRF. I hope this helps, though I'm aware it's pretty general. If you have a list of specific transformations, I could look through that and say which ones are correct or not. Perhaps it's okay for now if we assume that all non-relaxed atomics are transformation barriers, as we do elsewhere AFAIK.
[Bug middle-end/68122] ICE in gcc/toplev.c:353 with -fsanitize=undefined and -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68122 --- Comment #11 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #10) > There are lots of internal functions in GCC 6 (older versions had fewer). > Many of them are ECF_CONST, which is also treated like txn_pure, right? I think it should be (ie, won't access global memory, so is a black box an only output/input is values) -- but I don't know whether we do this everywhere. > Other than that, e.g. the remaining UBSAN* handlers are pretty much just > report some diagnostics and either terminate the program or return, but in a > valid program they should never be called. If we treat them as ECF_TM_PURE, > that would worst case mean the diagnostics could be printed multiple times > for the same spot in a transaction, if the transaction is restarted, right? Yes. > Most of them are not reporting the same diagnostics for the same source code > location, so I suppose it would be no harm to treat them as tm pure. Agreed. Are we running the TM pass before the pass that replaces UBSAN* with real code?
[Bug middle-end/68122] ICE in gcc/toplev.c:353 with -fsanitize=undefined and -fgnu-tm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68122 --- Comment #9 from torvald at gcc dot gnu.org --- Marek, just skipping handling of internal functions is not correct I think. In the worst case, it should treat them as txn-unsafe, unless they are considered txn-pure. Do we have some idea of how many built-ins would be affected? For example I believe it wouldn't be critical if TM wouldn't be compatible with UBSan for now, but if there are builtins for code that would be intuitively txn-safe, supporting them would be good. Regarding UBSan specifically, I believe many of the checks could be handled in a straight-forward way: All values returned by transactional memory accesses will form a consistent snapshot, always. So, for example, an addition-with-check call that takes values as arguments should be tm_pure because it will check values that could have happened in some execution, even if the transaction has to be rolled back later on.
[Bug other/68616] New: miscompilation in multi-threaded code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68616 Bug ID: 68616 Summary: miscompilation in multi-threaded code Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: major Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org Target Milestone: --- Created attachment 36871 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36871=edit test case In the attached case, the compiler seems to assume that in main_thread, ptr still points to x despite there being a call to arrive_and_wait that contains synchronization. The compiled code on x86_64 with -O1 has just an unconditional store to x after the arrive_and_wait call. It must keep the conditional because in fact the interference function waits for arrive_and_wait to increment barrier, then assigns zero to ptr, and the unblocks arrive_and_wait, which returns to main_thread. So, in a correct execution, main_thread will always see ptr==0 and do not assign to x. This may or may not be a duplicate of PR 68591; I don't know so I created a new PR. Also, while just a transactional memory bug may be considered less severe, this one here is a severe problem for multi-threaded code. Also note that in a variation of this I didn't see the miscompilation: In that case, main_thread() was a part of main() and the pthread_create call for interference() was directly before the arrive_and_wait. The code generated for that case had the load and check of ptr before assigning to ptr.
[Bug libitm/68591] trans-mem: missing transactional read in transaction when using -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591 torvald at gcc dot gnu.org changed: What|Removed |Added CC||rth at gcc dot gnu.org --- Comment #1 from torvald at gcc dot gnu.org --- Filed as libitm, but this is really a compiler problem, and not related to libitm.
[Bug libitm/68591] New: trans-mem: missing transactional read in transaction when using -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68591 Bug ID: 68591 Summary: trans-mem: missing transactional read in transaction when using -O1 Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libitm Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org Target Milestone: --- Created attachment 36860 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36860=edit test case The attached test case miscompiles in that it does not transform the load from ptr in the conditional in the transaction into a transactional load. Instead, it seems to assume that ptr's value is fixed and not shared with other threads. This changes when the synchronization that could potentially make it share gets inlined into the function with the transaction. For simplicity, this is a single-threaded test, but I discovered that in a multi-threaded setting.
[Bug libitm/66453] In a deadlock libitm allocates all available RAM until oom is called
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66453 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #1 from torvald at gcc dot gnu.org --- Created attachment 36848 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36848=edit test case of original bug report I can't reproduce this with the attached test case with current trunl. The only difference to your test case is that the non-synchronization while loop is technically undefined behavior, and to try to prevent this, I added the hack that prevents the loop from being optimized out. Please test again, and reopen the bug if you still see the OOM situation, ideally with more data on your test environment.
[Bug libitm/66453] In a deadlock libitm allocates all available RAM until oom is called
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66453 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WORKSFORME --- Comment #2 from torvald at gcc dot gnu.org --- Can't reproduce, so close this for now.
[Bug target/67281] HTM builtins aren't treated as compiler barriers on powerpc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67281 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #6 from torvald at gcc dot gnu.org --- I think tbegin needs to have same semantics as a lock acquisition and the compiler must not assume to know anything about tbegin's return value; tend must have same semantics as a lock release. See the libc-alpha discussion for why I think this is the case: https://sourceware.org/ml/libc-alpha/2015-08/msg00963.html Thus, I don't think a full compiler barrier is necessary, but if we don't have something finer-grained to capture the semantics of a lock acquisition, then we need the compiler barrier (GCC currently assumes atomics to be compiler barriers AFAIK). We should in any case agree on a semantics and document it in the GCC sources. Documenting that we need a full compiler barrier is not correct in that it's not a necessary condition (even though it should be a sufficient condition).
[Bug target/59767] __atomic_load_n with __ATOMIC_SEQ_CST should result in a memory barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59767 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #8 from torvald at gcc dot gnu.org --- (In reply to mikulas from comment #7) #include stdatomic.h atomic_int a = ATOMIC_VAR_INIT(0); atomic_int b = ATOMIC_VAR_INIT(0); atomic_int p = ATOMIC_VAR_INIT(0); int thread_1(void) { atomic_store_explicit(b, 1, memory_order_relaxed); atomic_load_explicit(p, memory_order_seq_cst); return atomic_load_explicit(a, memory_order_relaxed); } int thread_2(void) { atomic_store_explicit(a, 1, memory_order_relaxed); atomic_load_explicit(p, memory_order_seq_cst); return atomic_load_explicit(b, memory_order_relaxed); } See for example this. Suppose that thread_1 and thread_2 are executed concurrently. If memory_order_seq_cst were a proper full memory barrier, it would be impossible that both functions return 0. memory_order_seq_cst is a memory order in the Standard's terminology. Fences are something else (ie, atomic_thread_fence()) , and parametrized by a memory order. A memory_order_seq_cst *memory access* does not have the same effects as a memory_order_seq_cst fence. See C++14 29.3p4-7; those paragraphs talk about memory_order_seq_cst fences specifically, not about memory_order_seq_cst operations in general. If you want to make this example of Dekker synchronization correct, you need to use fences instead of the accesses to p; alternatively, you need to use seq-cst accesses for all the stores and loads to a and b, in which case there will be HW fences added via the stores (as Andrew already pointed out).
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #49 from torvald at gcc dot gnu.org --- (In reply to James Greenhalgh from comment #43) (In reply to torvald from comment #37) (In reply to James Greenhalgh from comment #35) So by the strict letter of the specification, no memory references to visible data should be allowed to move from after the entire body of the intrinsic to before it. That is to say in: __sync_lock_test_and_set (foo, 1) bar = 1 an observer should not be able to observe the write to bar before the write to foo. This is a difference from the C++11 semantics. Can you clarify how this observer would look like? I think we should look at both code examples that just use __sync for concurrent accesses, and examples that also use normal memory accesses as if those would be guaranteed to be atomic. None of the __sync docs nor the psABI guarantee any of the latter AFAIK. I don't think it would be unreasonable to argue that __sync doesn't make promises about non-DRF normal accesses, and so, strictly speaking, maybe programs can't in fact rely on any of the behaviors that are complicated to implement for ARM. That's why I'd like to distinguish those two cases. Sure, it would look much like your cppmem example above, though you can add some barriers to the observer if you want to make a stronger example bar = 0, foo = 0; thread_a { __sync_lock_test_and_set (foo, 1) bar = 1 } thread_b { /* If we can see the write to bar, the write to foo must also have happened. */ if (bar) /* Reads 1. */ assert (foo) /* Should never read 0. */ } This is the case of allowing non-DRF normal accesses. The *other* case I was thinking about is how the test would have to look like when *not* allowing them. One way to do it would be: thread_a { __sync_lock_test_and_set (foo, 1) __sync_lock_test_and_set (bar, 1) // or __sync_lock_release, or __sync RMW } thread_b { if (__sync_fetch_and_add (bar, 0)) assert (foo) // DRF if thread_a's write is the final one } In this case, would the current ARM implementation still produce insufficient code? If not, at least in this test case, we could argue that there's nothing wrong with what ARM does. (The question whether we wan't to require DRF strictly for __sync usage is of course still open.) I'm not worried about __sync_lock_release, I think the documentation is strong enough and unambiguous. Are you aware that the GCC's __sync disallow store-store reordering across __sync_lock_release, whereas the psABI docs don't? No I was not, and even looking exactly for the text you were referring to, it took me three attempts to spot it. Yes, I understand now why you are concerned about the GCC wording. Perhaps this is just an artefact of a mistake transcribing the psABI? I suppose so, but can't say for sure. Somebody might have had x86 TSO in mind when writing that, and perhaps not just accidentally. However, we say psABI is the reference spec, so I agree we should change the GCC __sync_lock_release docs accordingly.
[Bug libstdc++/51617] [C++0x] async(f) isn't.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51617 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #11 from torvald at gcc dot gnu.org --- ISO C++ SG1 was close to deprecating std::async for other reasons, but there was too much objection to this from other parts of the committee; the main concern was that there is no replacement for the functionality. Even though the executors proposal is still being discussed and not part of any published TS yet, the parallelism TS V1 is being published. Thus, I'm not sure whether spending too much time is justified, given the potentially short-lived importance of current async().
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #38 from torvald at gcc dot gnu.org --- (In reply to Andrew Macleod from comment #34) However, I guess some people relying on data races in their programs could (mis?)understand the __sync_lock_release semantics to mean that it is a means to get the equivalent of a C11 release *fence* -- which it is not because the fence would apply to the (erroneously non-atomic) store after the barrier, which could one lead to believe that if one observes the store after the barrier, the fence must also be in effect. Thoughts? before we get too carried away, maybe we should return to what we *think* __sync are suppose to do. It represents a specific definition by intel.. From the original documentation for __sync back in the day, and all legacy uses of sync should expect this behaviour: The problem I see with that is that I don't think that just looking at the psABI gives you enough information to really reason about what you are allowed to do or not. Strictly speaking, the psABI doesn't give you guarantees about normal memory accesses that are not data-race-free (through use of the __sync builtins). Nonetheless, legacy code does use them in a combination with the __sync builtins. Also, if you look at the IA-64 __sync_lock_release vs. GCC docs' __sync_lock_release, the latter is like x86/TSO. Do you have any info on which other semantics __sync was supposed to adhere to? One potential way to solve it would be to just require code that uses __sync to more or less implement an IA-64 or x86 memory model, modulo allowing compiler-reordering and optimization between adjacent non-__sync memory accesses. This could be inefficient on ARM (see James' examples) and perhaps Power too (or not -- see Jakub's comments).
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #37 from torvald at gcc dot gnu.org --- (In reply to James Greenhalgh from comment #35) (In reply to torvald from comment #32) (In reply to James Greenhalgh from comment #28) This also gives us an easier route to fixing any issues with the acquire/release __sync primitives (__sync_lock_test_and_set and __sync_lock_release) if we decide that these also need to be stronger than their C++11 equivalents. I don't think we have another case of different __sync vs. __atomics semantics in case of __sync_lock_test_and_set. The current specification makes it clear that this is an acquire barrier, and how it describes the semantics (ie, loads and stores that are program-order before the acquire op can move to after it) , this seems to be consistent with the effects C11 specifies for acquire MO (with perhaps the distinction that C11 is clear that acquire needs to be paired with some release op to create an ordering constraint). I think that the question is which parts of a RMW operation with MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11 MEMMODEL_ACQUIRE only applies to the load half of the operation. So an observer to: atomic_flag_test_and_set_explicit(foo, memory_order_acquire) atomic_store_exlicit (bar, 1, memory_model_relaxed) That depends on your observer. When reasoning about C11, please let's use complete examples (and ideally, run them through cppmem). For example, if the observation is done by a pair of relaxed-MO loads, bar==1 foo==0 can be observed: int main() { atomic_int foo = 0; atomic_int bar = 0; {{{ { r1 = cas_strong(foo, 0, 1); bar.store(1, memory_order_relaxed); } ||| { r2 = bar.load(memory_order_relaxed).readsvalue(1); r3 = foo.load(memory_order_relaxed).readsvalue(0); } }}}; return 0; } Is permitted to observe a write to bar before a write to foo (but not before the read from foo). How does one observe a load in C11 (ie, so that before the read from foo is defined)? You can constrain the reads-from of a load, but the load itself is not observable as an effect. I think I get what you're trying to say regarding the load half but I would phrase it differently: If there could be another release store to foo that the CAS/TAS reads-from, then moving the store to bar before the load would risk creating a cycle between inter-thread happens-before and sequenced-before-created intra-thread happens-before. (Note that the later isn't visible to other threads though, except through additional inter-thread synchronization.) Perhaps one could also say that moving the store to bar before the store to foo could result in the CAS/TAS not being atomic -- but this is just reliably observable if the foo observation is a CAS/TAS by itself (so it's totally ordered with the other CAS), or if there is a inter-thread happens-before between the observation of bar and the store to bar (which means using a release store for bar). I'm discussing these aspects because I think it matters which observations are actually possible in a reliable way. It matters for C11, and it may matter for the __sync semantics as well because while the __sync functions promise TSO, we don't really do so for all (nonatomic) loads or stores anywhere else in the program... IOW, can we actually come up with reliable and correct counter-examples (either using the C11 or __sync interfaces) for the guarantees we think ARM might not provide? My reading of the Itanium ABI is that the acquire barrier applies to the entire operation (Andrew, I think you copied these over exactly backwards in comment 34 ;) ): Disallows the movement of memory references to visible data from after the intrinsic (in program order) to before the intrinsic (this behavior is desirable at lock-acquire operations, hence the name). The definition of __sync_lock_test_and_set is: Behavior: • Atomically store the supplied value in *ptr and return the old value of *ptr. (i.e.) { tmp = *ptr; *ptr = value; return tmp; } • Acquire barrier. Hmm. I don't think this list is meant to be a sequence; __sync_lock_release has two items, first the assignment to the memory location, and *second* a release barrier. If this were supposed to be a sequence, it wouldn't be correct for lock releases. It rather seems that, somehow, the whole intrinsic is supposed to be a barrier, and be atomic. Similarly, the __sync RMW ops just have a single full barrier listed under behavior. So by the strict letter of the specification, no memory references to visible data should be allowed to move from after the entire body of the intrinsic to before it. That is to say in: __sync_lock_test_and_set (foo, 1) bar = 1 an observer should not be able to observe the write to bar before the write to foo. This is a difference from the C++11 semantics. Can you clarify how
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #32 from torvald at gcc dot gnu.org --- (In reply to James Greenhalgh from comment #28) (In reply to torvald from comment #24) 3) We could do something just on ARM (and scan other arcs for similar issues). That's perhaps the cleanest option. Which leaves 3). From Andrew's two proposed solutions: 3) Also seems best to me. 2) is worst, 1) is too much of a stick. This also gives us an easier route to fixing any issues with the acquire/release __sync primitives (__sync_lock_test_and_set and __sync_lock_release) if we decide that these also need to be stronger than their C++11 equivalents. I don't think we have another case of different __sync vs. __atomics semantics in case of __sync_lock_test_and_set. The current specification makes it clear that this is an acquire barrier, and how it describes the semantics (ie, loads and stores that are program-order before the acquire op can move to after it) , this seems to be consistent with the effects C11 specifies for acquire MO (with perhaps the distinction that C11 is clear that acquire needs to be paired with some release op to create an ordering constraint). I'd say this basically also applies to __sync_lock_release, with the exception that the current documentation does not mention that stores can be speculated to before the barrier. That seems to be an artefact of a TSO model. However, I don't think this matters much because what the release barrier allows one to do is reasoning that if one sees the barrier to have taken place (eg, observe that the lock has been released), then also all ops before the barrier will be visible. It does not guarantee that if one observes an effect that is after the barrier in program order, that the barrier itself will necessarily have taken effect. To be able to make this observation, one would have to ensure using __sync ops that the other effect after the barrier is indeed after the barrier, which would mean using an release op for the other effect -- which would take care of things. If everyone agrees with this reasoning, we probably should add documentation explaining this. However, I guess some people relying on data races in their programs could (mis?)understand the __sync_lock_release semantics to mean that it is a means to get the equivalent of a C11 release *fence* -- which it is not because the fence would apply to the (erroneously non-atomic) store after the barrier, which could one lead to believe that if one observes the store after the barrier, the fence must also be in effect. Thoughts?
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #21 from torvald at gcc dot gnu.org --- (In reply to Andrew Haley from comment #20) (In reply to mwahab from comment #19) (In reply to Andrew Haley from comment #18) It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) Further, if the comparison is true, memory is affected according to the value of success, and if the comparison is false, memory is affected according to the value of failure. (where success and failure are the memory model arguments.) In this case, the write to *exp should be memory_order_seq_cst. But no store actually takes place, so the only effect is that of the read. You can't have a sequentially consistent store without a store. I agree. If you continue reading in the C++11 paragraph that you cited, you'll see that if just one MO is provided and the CAS fails, an acq_rel MO is downgraded to acquire and a release MO to relaxed. This is consistent with no update of the atomic variable (note that expected is not atomic, so applying an MO to accesses to it is not meaningful in any case). However, if the provided MO is seq_cst, I think a failed CAS still needs to be equivalent to a seq_cst load.
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #22 from torvald at gcc dot gnu.org --- (In reply to James Greenhalgh from comment #12) There are two problems here, one of which concerns me more in the real world, and both of which rely on races if you are in the C/C++11 model - there isn't much I can do about that as the __sync primitives are legacy and give stronger guarantees about visible memory references (which includes ALL global data). Consider: int x = 0; int y = 0; int z = 0; void foo (void) { x = 1; __sync_fetch_and_add (y, 1); z = 1; } My reading of what a full barrier implies would suggest that we should never have a modification order which is any of: z = 1, x = 1, y = 0+1 z = 1, y = 0+1, x = 1 x = 1, z = 1, y = 0+1 At least in C11/C++11, modification orders are per memory location. If we want to have something that orders accesses to x, y, and z, we need to model the observer side too (which the C11 model does, for example). I'm mentioning that because at the very least, we have compiler reordering happening at the observer side; if a programmer would need to use, for example, __sync builtins to order the observers, then this means we just have to consider combinations of those on the modification and the observation side. (And because I'm not sure what you think modification order is, I can't comment further.) GCC 5.0-ish (recent trunk) will emit: foo: adrpx3, .LANCHOR0 mov w1, 1 add x0, x3, :lo12:.LANCHOR0 add x2, x0, 4 str w1, [x3, #:lo12:.LANCHOR0] .L2: ldaxr w3, [x2] add w3, w3, w1 stlxr w4, w3, [x2] cbnzw4, .L2 str w1, [x0, 8] ret Dropping some of the context and switching to pseudo-asm we have: str 1, [x] .L2: ldaxr tmp, [y] add tmp, tmp, 1 stlxr flag, tmp, [y] cbnzflag, .L2 str 1, [z] As far as I understand it, the memory ordering guarantees of the half-barrier LDAXR/STLXR instructions do not prevent this being reordered to an execution which looks like: ldaxr tmp, [y] str 1, [z] str 1, [x] add tmp, tmp, 1 stlxr flag, tmp, [y] cbnzflag, .L2 which gives one of the modification orders we wanted to forbid above. Similar reordering can give the other undesirable modification orders. (I'm not familiar with the ARM model, so please bear with me.) This is reordering the HW can do? Or are you concerned about the compiler backend? Would the reordered str always become visible atomically with the stlxr? Would it become visible even if the LLSC fails, thus potentially storing more than once to z? This would be surprising in that the ldaxr would then have to be reloaded too potentially after the store to z, I believe (at least for a strong CAS) -- which would break the acquire MO on the load. As mentioned, this reordering is permitted under C11, as the stores to x and z are racy - this permits the CPPMEM/Cambridge documentation of what an SEQ_CST must do. That's true in this example, but at the instruction level (assuming HW reordering is what you're concerned about), a atomic relaxed-MO load isn't distinguishable from a normal memory access, right? So, it's not DRF what this is strictly about, but the difference between C11 seq_cst fences and seq_cst RMW ops. However, my feeling is that it is at the very least *surprising* for the __sync primitives to allow this ordering, and more likely points to a break in the AArch64 implementation. With the caveat that given that __sync isn't documented in great detail, a lot of interpretations might happen in practice, so there might be a few surprises to some people :) Fixing this requires a hard memory barrier - but I really don't want to see us penalise C11 SEQ_CST to get the right behaviour out of __sync_fetch_and_add... I agree.
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #24 from torvald at gcc dot gnu.org --- I think we need to at least clarify the documentation of __atomic, probably also of __sync; we might also have to change the implementation of __sync builtins on some archs. First, I think the __atomic builtins should strictly follow the C11 model -- one major reason being that it's not worthwhile for us to support a GCC-specific memory model, which is complex. Therefore, I think we should clarify in the __atomic docs that C++11 is the reference semantics, and that any additional text we provide (e.g., to describe what MEMMODEL_SEQ_CST does) is just some hand-wavy attempt of makings things look simpler. I'd choose C++11 over C11 because, although the models are supposed to be equal, C++11 has more detailed docs in the area of atomics and synchronization in general, IMO. There are also more people involved in C++ standardization than C standardization. The only exception that I see is that we might want to give more guarantees wrt data-race-freedom (DRF) as far as compiler reordering is concerned. I'm mentioning just the compiler because I'm not aware of GCC supporting C11/C++11 on any archs whose ISA distinguishes between atomic and nonatomic accesses (ie, where DRF is visible at the ISA level); however, I've heard of upcoming archs that may do that. Trying to promise some support for non-DRF programs that synchronize might make transition of legacy code to __atomic builtins easier, but it's a slippery slope; I don't think it's worthwhile for us to try to give really precise and hard guarantees, just because of the complexity involved. I'm not quite sure what's best here. Second, in C11 there is a difference between a memory access or read-modify-write op (RMW) that is seq-cst, and a seq-cst fence. For reference, try this example in cppmem: int main() { atomic_int x = 0; atomic_int y = 0; atomic_int d1 = 0; atomic_int d2 = 0; {{{ { x.store(1, memory_order_relaxed); // atomic_thread_fence(memory_order_seq_cst); r3=cas_strong(d1, 0, 1); r1=y.load(memory_order_relaxed).readsvalue(0); } ||| { y.store(1, memory_order_relaxed); // atomic_thread_fence(memory_order_seq_cst); r4=cas_strong(d2, 0, 1); r2=x.load(memory_order_relaxed).readsvalue(0); } }}}; return 0; } This is Dekker synchronization: It doesn't work (ie, both threads can read value 0) with seq-cst RMWs to separate locations (d1 and d2) even though those RMWs will be totally ordered; however, it does work when using fences (see the comments). Therefore, for __atomic, it's not only the MEMMODEL_* value that counts, but also which operation it is applied to. Perhaps we need to point this out in the docs. However, our current docs for __sync indicate that most of the __sync RMWs are full barriers. The latter has a very TSO-like description, and my guess would be that most people might understand it this way too (ie, that it would at least be as strong as a seq-cst C11 fence). Also, Andrew mentioned offline that prior documentation seemed to have described the __sync RMWs as __sync_synchronize(); operation... __sync_synchronize(); For x86 and similar, a C11-seq-cst RMW will be a full barrier; Jakub says the same holds for PowerPC currently. AFAIU, on ARM it doesn't (but it kind of surprises me that HW would reorder into an LLSC region; my guess would that this complicates matters, and I can't see a benefit). So, at least on ARM, there seems to be a difference between a __sync RMW that is documented to be at least as strong as a C11 seq-cst fence, and how we implement a C11 seq-cst fence. This difference is observable by a program, but doing so requires a non-DRF program (but there are no relaxed-MO __sync variants, so programs that used __sync might just relied on non-DRF accesses). I would guess that programs that actually contain synchronization code affected by this difference are rather rare. Normal locks and such are not affected, as are many common synchronization algorithms such as for linked lists etc (eg, which work fine with acq_rel MO ops). I'd say that the full fences are really necessary mostly if one wants to synchronize using separate locations and without a global common order for those accesses (eg, using several mem locations and asymmetric access patterns to tune contention favorably for common cases (eg, Dekker-like stuff if made asymmetric; libitm uses that)). I guess it's much more common to route the synchronization through centralized or hierarchical locations so that eventually there is a tie-breaker. For example, AFAIR there's no use of a seq-cst fence in current glibc. Thus, we need to at least improve the __sync docs. If we want to really make them C11-like, we need to update the docs, and there might be legacy code assuming different semantics that breaks. If we don't, we need to update
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #6 from torvald at gcc dot gnu.org --- (In reply to torvald from comment #5) (In reply to Matthew Wahab from comment #0) while a __sync full barrier should block all movement (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins. html#_005f_005fsync-Builtins). But in your code example below, you didn't use __synch_synchronize. The other __sync operations do not document that they are a full barrier, AFAICS. (I'm not sure whether there is a wide assumption that they do.) No, you're right. They are in most cases considered to be a full barrier.
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 --- Comment #5 from torvald at gcc dot gnu.org --- (In reply to Matthew Wahab from comment #0) The __sync builtins are implemented by expanding them to the equivalent __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers. This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow some data references to move across the operation (see https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) I don't think that's a precise characterization. The difference discussed in that thread is that seq-cst fences have a stronger effect than seq-cst stores. This is not really a question of MEMMODEL_SEQ_CST but what you apply it to. while a __sync full barrier should block all movement (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins. html#_005f_005fsync-Builtins). But in your code example below, you didn't use __synch_synchronize. The other __sync operations do not document that they are a full barrier, AFAICS. (I'm not sure whether there is a wide assumption that they do.) This is problem for Aarch64 where data references after the barrier could be speculated ahead of it. For example, compiling with aarch64-none-linux-gnu at -O2, - int foo = 0; int bar = 0; int T1(void) { int x = __sync_fetch_and_add(foo, 1); return bar; } This test case is problematic in that it has a data race on bar unless bar is never modified by another thread, in which case it would be fine to load bar before the fetch_add. It also loads bar irrespective of which value the fetch_add actually returns. I'm aware that the __sync builtins offer no MEMMODEL_RELAXED, and so lots of code simply has a plain memory access. Nonetheless, I think you should also look at a test case that is properly synchronized, data-race-free C11 code with the builtins, and see whether that is compiled differently (ie, either use a relaxed MO load for bar or make the load of bar conditional on the outcome of the fetch_add). To be on the safe side, you can also use the cppmem tool to verify what the C11 model requires the compiled code to guarantee. Here's what you probably want to achieve: int main() { atomic_int foo = 0; atomic_int bar = 0; {{{ { r1=cas_strong(foo, 0, 1); r2=bar.load(memory_order_relaxed); } ||| { bar.store(42, memory_order_relaxed); foo.store(1, memory_order_release); } }}}; return 0; } This tries to test whether ordered stores to bar and foo are observed in this order in the first thread (it uses a CAS instead of fetch_add). This gives me 3 consistent executions, of which one is the one in which the CAS succeeds (ie, like fetch_add would do), and this one returns a value of 42 for the load of bar. produces T1: adrpx0, .LANCHOR0 add x0, x0, :lo12:.LANCHOR0 .L2: ldaxr w1, [x0] ; load-acquire (__sync_fetch_and_add) add w1, w1, 1 stlxr w2, w1, [x0] ; store-release (__sync_fetch_and_add) cbnzw2, .L2 ldr w0, [x0, 4]; load (return bar) ret With this code, the load can be executed ahead of the store-release which ends the __sync_fetch_and_add. A correct implementation should emit a dmb instruction after the cbnz. I'm not sufficiently familiar with ARM to assess whether that's correct. Is this ARMv8? It seems to be consistent with the mapping to machine code that GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), which does not include a DMB. However, it also shows no difference between an acq_rel CAS and a seq_cst CAS. Could it be that the CAS sequence itself (ie, the machine code for it) prevents the reordering you are concerned about at the HW level?
[Bug testsuite/64930] [5 regression] FAIL: gcc.target/powerpc/atomic-p7.c scan-assembler-times isync 12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64930 --- Comment #12 from torvald at gcc dot gnu.org --- (In reply to Alan Modra from comment #9) My point was that if you write a testcase that specifically tests for consume and get acquire code then that is a fail. The code generated is using a bigger hammer than necessary. The consensus in ISO C++ SG1 (though we had no straw poll on this particular thing, so this is my conclusion from the discussions and the opinions voiced by other compiler vendors) is that implementing C/C++ memory_order_consume means, in practice, promoting to memory_order_acquire. This is not a GCC-specific solution or deficiency. It is rather the realization that the standard's intent can't be implemented in practice without too much costs elsewhere (e.g., because of how tracking dependencies properly would require points-to analysis or conservatively adding barriers in likely many places in the code). See this paper for more background: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4321.pdf
[Bug testsuite/64930] [5 regression] FAIL: gcc.target/powerpc/atomic-p7.c scan-assembler-times isync 12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64930 --- Comment #5 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #3) Created attachment 34731 [details] gcc5-pr64930.patch Thus I'm proposing this untested patch. I think expecting the consume-to-acquire promotion is the right thing to do (I haven't tested the patch though). The consume specification in the standard has issues, and based on the discussions in ISO C++ SG1, it seems unlikely that it can or will be fixed. Thus, I believe the promotion will stay with us forever. Therefore, using an XFAIL doesn't seem right to me.
[Bug c++/63472] transaction_atomic within while loop causes ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63472 --- Comment #6 from torvald at gcc dot gnu.org --- (In reply to ak from comment #4) copy_bbs: (illegal code due to goto into transaction?) g_56[]; fn1() { int *p_79; if (g_56[7]) __transaction_atomic { lbl_196: *p_79 = 1; } else goto lbl_196; } Yes, the goto into the transaction is illegal code. copy_loops: func_65() { __transaction_atomic { for (;;) func_65(); } } I think that technically, at least for C++, this is illegal code because you have an infinite loop with no side effects (atomics, volatile, IO) in it. Does this also fail when you make the recursion stop eventually?
[Bug libitm/61594] ICE (assertion failure) in trans-mem.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594 --- Comment #6 from torvald at gcc dot gnu.org --- We can keep it open, but my guess is that it would need some volunteer to actively drive the backporting process. I.e., with a patch and testing. Given that TM is experimental, I think backporting fixes for TM has low priority on most people's todo list.
[Bug libitm/52482] libitm INVALID MNEMONIC in .S (powerpc asm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52482 --- Comment #11 from torvald at gcc dot gnu.org --- Thanks for confirming. However, that fails before libitm, for which you should file a separate bug report. Thanks.
[Bug libitm/61594] ICE (assertion failure) in trans-mem.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594 torvald at gcc dot gnu.org changed: What|Removed |Added Version|5.0 |4.9.1 --- Comment #7 from torvald at gcc dot gnu.org --- Changed version to 4.9.1.
[Bug libitm/52482] libitm INVALID MNEMONIC in .S (powerpc asm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52482 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #8 from torvald at gcc dot gnu.org --- Can somebody with access to a powerpc-apple-darwin8 check whether this bug is still present in SVN trunk? Thanks!
[Bug libitm/61594] ICE (assertion failure) in trans-mem.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61594 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||torvald at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #3 from torvald at gcc dot gnu.org --- I can't reproduce this on SVN trunk (r220039), so I'll close this for now. I didn't spot anything bad when briefly looking at the generated code.
[Bug libitm/52695] libitm/config/x86/cacheline.h: '__m64' does not name a type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52695 --- Comment #8 from torvald at gcc dot gnu.org --- Can this bug be closed? It sounds as if this has been fixed in 4.8 already; given that TM is experimental, backporting the fix is probably not worth it.
[Bug libitm/56801] Internal Compiler Error when compiling relaxed transaction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56801 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||torvald at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #6 from torvald at gcc dot gnu.org --- I can't reproduce the ICE with SVN trunk (r220039), so I'll close this for now. The produced code looks correct, although it's a little odd: Both the instrumented and the uinstrumented code path are generated; on instrumented, there's no explicit call to request going irrevocable, but the flags to _ITM_beginTransaction state that only the uninstrumented path is available.
[Bug libitm/53113] Build fails in x86_avx.cc if AVX disabled by -mno-avx but supported by as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53113 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #26 from torvald at gcc dot gnu.org --- (In reply to h...@gcc.gnu.org from comment #24) Author: hjl Date: Mon Jan 13 19:36:17 2014 New Revision: 206587 URL: http://gcc.gnu.org/viewcvs?rev=206587root=gccview=rev Log: Make sure that -msse/-mavx are appended at the end PR libitm/53113 * Makefile.am (x86_sse.lo): Append -msse to CXXFLAGS. (x86_avx.lo): Append -mavx to CXXFLAGS. * Makefile.in: Regenerate. Modified: trunk/libitm/ChangeLog trunk/libitm/Makefile.am trunk/libitm/Makefile.in H.J., can this bug be closed?
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 --- Comment #21 from torvald at gcc dot gnu.org --- (In reply to torvald from comment #17) (In reply to Andrew Macleod from comment #15) So have we concluded that we should promote memory_order_consume to memory_order_acquire for now? I also think that this is the best way forward. I believe everyone in ISO C++ SG1 agreed that this is basically a defect in the standard. To clarify, my impression from the discussion was that there was general consensus that memory_order_consume as specified now wouldn't achieve what it was intended to in practice, due to the implementation issues. IOW, it's not useful even though it's not a defect in the sense of being inconsistent or plain wrong.
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 --- Comment #20 from torvald at gcc dot gnu.org --- (In reply to jos...@codesourcery.com from comment #19) * carries_dependency is about increasing optimization, not decreasing it. If it affects when barriers are added, it does so by allowing some barriers to be omitted that would otherwise be required. That's not quite true, unfortunately, AFAIU. I agree that generally, attributes are supposed to be ignorable -- but in the case of carries_dependency, C++11 explicitly requires that all declarations of a function either have the attribute or none has (see 7.6.4p2). This is because you need that to actually exploit the attribute; it's a contract between callers and callees. If a compiler does try to preserve dependencies (or not across function boundaries), then ignoring the attribute is fine. But if there should be a compiler out there that doesn't, and GCC-generated code is supposed to link to that other compiler's code, then we need to do something to make this work.
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 --- Comment #17 from torvald at gcc dot gnu.org --- (In reply to Andrew Macleod from comment #15) So have we concluded that we should promote memory_order_consume to memory_order_acquire for now? I also think that this is the best way forward. I believe everyone in ISO C++ SG1 agreed that this is basically a defect in the standard. What I haven't thought through is how to deal with with carries_dependency (7.6.4 in C++11): For GCC code generated after we promote consume to acquire, it can safely be ignored; but should GCC code be linked to code generated by another compiler that does not promote and expects the code to preserve dependencies, this won't work. I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1. Thus, we could assume that there are no such other compilers, and make it part of the ABI (assumptions) that consume is promoted to acquire in a correct compiler. Alternatively, we could try to be conservative and add an acquire barrier before the function body if any parameter of the function has carries_dependency; and, likewise, add an acquire barrier after every call to a function which has carries_dependency. I don't have more input from the ISO C side, but I would guess that the situation there is similar. Thoughts?
[Bug ipa/61393] [trans-mem] O3 optimization level constant propagation problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61393 --- Comment #9 from torvald at gcc dot gnu.org --- Alex, can you confirm that this is fixed?
[Bug middle-end/61199] New: [trans-mem] _cxa_free_exception is not transaction safe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199 Bug ID: 61199 Summary: [trans-mem] _cxa_free_exception is not transaction safe Product: gcc Version: 4.10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org Created attachment 32804 -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=32804action=edit test case See attached test case. I'm not quite sure how we need to handle it: Is this only called when another transaction has taken over, and we're freeing up garbage? Or can a call to this create a situation in which we canceled a previous exception throw? If the latter case holds, I guess we also need to reset the transaction-in-flight state that libitm keeps when cxa_allocate_exception is called. Depending on this, we either need to make it TM-pure or add a wrapper in libitm. Thoughts?
[Bug middle-end/61199] [trans-mem] _cxa_free_exception is not transaction safe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61199 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2014-05-16 CC||aldyh at gcc dot gnu.org, ||jason at gcc dot gnu.org, ||rth at gcc dot gnu.org Ever confirmed|0 |1
[Bug c++/60272] atomic::compare_exchange_weak has spurious store and can cause race conditions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272 --- Comment #2 from torvald at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #1) So, either we'd need to change this function, so that it sets oldval to NULL_RTX first, and passes ..., oldval, mem, expected, ... and needs to also always ask for target, then conditionally on target store to expected, or perhaps add extra parameter to expand_atomic_compare_and_swap and do the store only conditionally in that case. Richard/Torvald? I'm not sure what's better. But getting this fixed in 4.9.0 would be good! :)
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 --- Comment #12 from torvald at gcc dot gnu.org --- (In reply to algrant from comment #11) Where do you get that this is racy if the access to data is not atomic? In threadB(), you do: f = flag.load(std::memory_order_consume); // B.A d = *(data + f - f); // B.B That reads from data irrespective of the value of f, so will read when data is actually not owned by threadB. You can either make the accesses to data atomic, or move the load from data to after checking that flag equals 1. By design, release/acquire and release/consume sequences don't require wholesale changes to the way the data payload (in the general case, multiple fields within a structure) is first constructed and then used. 1.10#13 makes clear that as a result of the intra-thread sequencing between atomic and non-atomic operations (1.9#14), and the inter-thread ordering between atomic operations (1.10 various), there is a resulting ordering on operations to ordinary (sic) objects. Please see the references to the C++ standard in the source example, for the chain of reasoning here. I'm very much aware of the rules, I believe. But as far as you can see, there's a data race in your test program, which results in undefined behavior.
[Bug middle-end/59448] Code generation doesn't respect C11 address-dependency
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 torvald at gcc dot gnu.org changed: What|Removed |Added CC||rth at gcc dot gnu.org --- Comment #10 from torvald at gcc dot gnu.org --- (In reply to algrant from comment #6) Here is a complete C++11 test case. The code generated for the two loads in thread B doesn't maintain the required ordering: The C++ test has a race condition because the nonatomic load from data is not ordered wrt. the nonatomic store to data in the other thread. Please fix that and report back whether this changes anything. (I believe the use of relaxed-MO atomic accesses should be sufficient.) Nonetheless, if there is a fixed (data-race-free) similar test, then I believe this fixed test case to be valid in the sense that AFAIU the standards' requirements (both C++11's and C11's; they are equal in this aspect), code as in this test case is supposed to be carrying the dependencies forward. However, I cannot see a *practical* implementation of this requirement. AFAIU, dependencies are carried forward through loads/stores too (5.1.2.4#14): An evaluation A carries a dependency to an evaluation B if: [...]A writes a scalar object or bit-field M, B reads from M the value written by A, and A is sequenced before B. Because a compiler cannot, in general, analyze which piece of code stored last to M, it would have to conservatively assume that a dependency is carried through M. In turn, it could never optimize expressions such as (x-x), at least if those affect atomic operations on M in some way, irrespective of how remotely that might have happened. (I'm restricting my statement to atomic accesses because I'm not quite sure whether there's some other reason why nonatomic accesses couldn't be affected by this.) OTOH, maybe there are factors that I'm not aware of and that effectively constrain carries-a-dependency to something that doesn't widely prohibit optimizations. I don't have significant background information on the design of this part of the memory model (nor the implementability of this part), but I'll ask around at the C++ meeting in two weeks. According to the architecture specification, to achieve the ordering it's sufficient to use the result of the first load in the calculation of the address of the second, even if it's done in such a way as to have no dependence on the value. And that's certainly a feasible *architecture* specification. ISTM that the language standard tried to model this, but then didn't give the compiler enough leeway to still be able to optimize code. The carries-a-depency rules seem like an unusual shortcut from the language level to requirements on the generated code, without the usual indirection using the observable behavior of the abstract machine and as-if. For example -- and this isn't a precise definition of course -- if the standard would require that f (in your test case) is needed to process the evaluation, then things like f-f could be optimized out, and the compiler might just do the right thing with standard optimizations. As a short-term fix, I believe we can promote all memory_order_consume atomics to memory_order_acquire, when generating code with consume-MO atomics. GCC is at fault here, not the code in the builtins or in libatomic. Thus, this would prevent errors when GCC compiled the code and a correct atomics implementation is used (e.g., third-party libatomic). This won't help when another (potentially correct) compiler generated the code with the consume-MO atomics, and GCC-compiled code fails to carry the dependencies forward. We could try to change libatomic's consume-MO implementation to prevent that, but this would any catch cases in which libatomic is actually used and not the other compiler's builtins or similar. Thus, changing libatomic does not seem to be necessary. Richard, Andrew: Thoughts?
[Bug c/59218] New: atomic transactions: accesses to volatiles not disallowed in transaction_safe code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218 Bug ID: 59218 Summary: atomic transactions: accesses to volatiles not disallowed in transaction_safe code Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: minor Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: torvald at gcc dot gnu.org CC: aldyh at gcc dot gnu.org Accesses to volatiles are disallowed in transaction-safe code. The following should produce an error, but doesn't: volatile int volatile *b; void foo() { __transaction_atomic { b[10] = b[10] + 1; } } If accessing a volatile int a instead, an error is reported (as expected).
[Bug c/59218] atomic transactions: accesses to volatiles not disallowed in transaction_safe code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59218 torvald at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P4
[Bug libitm/57643] libitm.c/reentrant.c hangs on POWER8 with HTM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57643 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2013-06-19 CC||torvald at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |torvald at gcc dot gnu.org Ever confirmed|0 |1
[Bug tree-optimization/53991] _mm_popcnt_u64 fails with -O3 -fgnu-tm
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53991 --- Comment #7 from torvald at gcc dot gnu.org --- A piece of code is tm_pure if, roughly, it doesn't need any instrumentation (e.g., in contrast to memory loads/stores). In the test case, I suppose that the compiler detects that it is tm_pure, but we also allow programmers to declare it. Ideally, tm_pure should be a property of a region of code that is preserved across optimizations (but where we don't move code into or out of tm_pure regions). That may require too much implementation effort (but perhaps we could reuse the TM regions for that, as a no-TM region?) Alternatively, we could not automatically mark always_inline functions also as tm_pure, and warn if always_inline is also annotated as tm_pure by the programmer. Other thoughts?
[Bug c++/56419] New: [4.8 regression] transactions in for-loops disappear
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56419 Bug #: 56419 Summary: [4.8 regression] transactions in for-loops disappear Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ AssignedTo: unassig...@gcc.gnu.org ReportedBy: torv...@gcc.gnu.org CC: al...@gcc.gnu.org, sp...@cse.lehigh.edu Transaction statements in for-loops don't show up in Gimple. For example: int x = 0; int inc_func(int i) { for (int j = 0; j i; ++j) { __transaction_atomic { x+=1; } } return 0; } g++ -v -O0 test.cc -c -fgnu-tm -fdump-tree-all -Wall leads to the following test.cc.004t.gimple: int inc_func(int) (int i) { int D.2360; { int j; j = 0; goto D.2358; D.2357: j = j + 1; D.2358: if (j i) goto D.2357; else goto D.2355; D.2355: } D.2360 = 0; return D.2360; } This does work when using the C frontend. Seems to also work when the transaction is in while loops or on an if-branch. Doesn't seem to be sensitive to which C++ standard is requested. Originally reported by Mike Spear. Mike says this works with 4.7.2, but doesn't with g++ (GCC) 4.8.0 20130221 (experimental) (revision 196192). I've confirmed with an early-December build.
[Bug libitm/55693] [4.8 Regression] libitm.c++/eh-1.C execution test fails on darwin from r193271
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55693 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #7 from torvald at gcc dot gnu.org 2013-01-11 22:54:04 UTC --- The gdb trace looks alright to me. My guess is that this is a bug in the uninstrumented code path, not in libitm. Could you try again after adding the following to the top of GTM::gtm_thread::begin_transaction in beginend.cc, please: prop = ~pr_uninstrumentedCode; Then, libitm shouldn't take the uninstrumented code path anymore. If it works with this change, this likely is a bug on the compiler side.
[Bug rtl-optimization/51771] trans-mem: abnormal edges get lost or corrupted
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51771 torvald at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |RESOLVED Resolution||FIXED --- Comment #11 from torvald at gcc dot gnu.org 2012-12-05 19:18:42 UTC --- I can't reproduce this anymore. I also had a brief look at the binary code that showed the problem previously and didn't see the problematic pattern anymore. So it seems that this is fixed.
[Bug sanitizer/48076] Unsafe double checked locking in __emutls_get_address
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48076 --- Comment #7 from torvald at gcc dot gnu.org 2012-11-28 14:29:47 UTC --- (In reply to comment #6) There seems to be a similar bug in code generated for function static variables. The fast-path load is a plain load rather than atomic acquire load. Haven't looked at the details, but this indeed looks similar.
[Bug c++/54893] unable to access volatile variable within relaxed transaction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54893 --- Comment #3 from torvald at gcc dot gnu.org 2012-10-11 19:54:11 UTC --- I agree with Michael. Accesses to volative vars are disallowed in safe code, but relaxed transactions can run unsafe code (after going irrevocable). The test case in bug 46654 is an atomic transaction.
[Bug tree-optimization/52558] write introduction incorrect wrt the C++11 memory model
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558 --- Comment #20 from torvald at gcc dot gnu.org 2012-05-31 20:23:51 UTC --- (In reply to comment #19) Fixed on mainline. I doubt this will be fixed on 4.7, as this is too intrusive-- unless one of the RMs really wants it on 4.7. OTOH, it would be good if we could avoid those races in 4.7 too. Can you gather the RMs' opinions please?
[Bug libitm/53008] abort in _ITM_getTMCloneSafe
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53008 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #4 from torvald at gcc dot gnu.org 2012-05-15 16:18:00 UTC --- So, is there still a libitm issue even with your GCC patch?
[Bug libstdc++/52839] double free or corruption running tr1/.../default_weaktoshared.exe
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52839 torvald at gcc dot gnu.org changed: What|Removed |Added CC||torvald at gcc dot gnu.org --- Comment #25 from torvald at gcc dot gnu.org 2012-04-12 08:24:16 UTC --- If we make an ABI switch at some point, should we move over to relying just on atomics and the libatomic fallbacks (assuming/hoping libatomic exists by then)? Also, refcounting in basic_string has bugs too. Do you see other areas of libstdc++ that could use a review of the synchronization bits?
[Bug c++/52558] write introduction incorrect wrt the C++11 memory model
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558 torvald at gcc dot gnu.org changed: What|Removed |Added Component|tree-optimization |c++ --- Comment #12 from torvald at gcc dot gnu.org 2012-03-13 21:17:48 UTC --- (In reply to comment #11) Just one remark: in this case the write introduction is incorrect wrt the C++11 memory model because there are no previous write to the same location. If there had been a previous write to the same location, then the discriminating context would have been racy and the code generated by -O2 sound. I believe that the above argument generalises to all write introductions but I don't yet have a proof of this, so take it with a pinch of salt. We follow a similar line of reasoning when thinking about which transformations are allowed in transactional code (or other concurrent code). In principle, this also applies to loads to some extent even though racy loads are supposed to be okay for as--if as long as a value obtained from a racy read is never used. Having a more precise yet practical understanding of this would certainly help.
[Bug libitm/52526] libitm: stuck in futex_wait
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526 --- Comment #3 from torvald at gcc dot gnu.org 2012-03-13 22:01:41 UTC --- Author: torvald Date: Tue Mar 13 22:01:34 2012 New Revision: 185358 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=185358 Log: libitm: Fix lost wake-up in serial lock. PR libitm/52526 * config/linux/rwlock.cc (GTM::gtm_rwlock::read_lock): Fix lost wake-up. Modified: trunk/libitm/ChangeLog trunk/libitm/config/linux/rwlock.cc
[Bug libitm/52526] libitm: stuck in futex_wait
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526 --- Comment #4 from torvald at gcc dot gnu.org 2012-03-13 22:11:52 UTC --- Author: torvald Date: Tue Mar 13 22:11:46 2012 New Revision: 185360 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=185360 Log: libitm: Fix lost wake-up in serial lock. PR libitm/52526 Backported from mainline. * config/linux/rwlock.cc (GTM::gtm_rwlock::read_lock): Fix lost wake-up. Modified: branches/gcc-4_7-branch/libitm/ChangeLog branches/gcc-4_7-branch/libitm/config/linux/rwlock.cc
[Bug libitm/52526] libitm: stuck in futex_wait
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526 torvald at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution||FIXED --- Comment #5 from torvald at gcc dot gnu.org 2012-03-13 22:13:16 UTC --- Fixed in both trunk and 4.7.
[Bug libitm/52526] libitm: stuck in futex_wait
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526 --- Comment #1 from torvald at gcc dot gnu.org 2012-03-10 17:46:12 UTC --- Patrick, I posted a fix to gcc-patches. Please have a look if you can.
[Bug libitm/52526] libitm: stuck in futex_wait
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52526 torvald at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2012-03-08 AssignedTo|unassigned at gcc dot |torvald at gcc dot gnu.org |gnu.org | Ever Confirmed|0 |1
[Bug middle-end/51752] trans-mem: publication safety violated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51752 --- Comment #6 from torvald at gcc dot gnu.org 2012-02-09 18:40:49 UTC --- (In reply to comment #4) But isn't with __transaction_atomic { for (i = 0; i 10; i++) if (x[i]) x[i] += data; } and prepend with the following for thread B: data = 123; __transaction_atomic { x[9] = 1; } occuring concurrently the loop transaction terminated? Thus, __transaction_atomic { tem = data; for (i = 0; i 10; i++) if (x[i]) x[i] += tem; } equivalent? No. It might be for typical HTMs, but we can't expect this in the general case. The interleaving that can then happen is (with A and B being the two concurrent threads): B: tem = data; // reads initial value of zero A: data = 123; A: __transaction_atomic { x[9] = 1; } B: if (x[i]) B: x[i] += tem; // adds zero, not 123. The problem here is that B's store to data is nontransactional, and we cannot add synchronization code for those (it might be library code, for example). We could add a partial workaround to libitm that would reduce this case here to just a racing load. However, that would kill performance because basically, B's transaction would have to wait for all earlier-started transactions before it can start executing. The racing load can still be a problem if we hoist dereferencing a pointer. We don't see this example with other concurrent code because there, the load of x[] would have to be a synchronizing operation, and we don't hoist across these.