[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461 cbcode at gmail dot com changed: What|Removed |Added CC||cbcode at gmail dot com --- Comment #19 from cbcode at gmail dot com --- As a compromise, I would like to suggest that '__thread volatile' or 'volatile __thread' always reloads the thread-local storage while __thread without volatile keeps the current caching behavior. The C and C++ standards recognize that stack-switching exists and indicate existing situations where variables need to be volatile-qualified in order to survive a task-switch, see e.g. http://en.cppreference.com/w/cpp/utility/program/setjmp .
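For readers unfamiliar with the setjmp rule referenced in comment #19, here is a minimal sketch of the existing situation where the C standard already asks for volatile: a local variable modified between setjmp and longjmp only has a determinate value afterwards if it is volatile-qualified. The names env, jump_back and count_after_longjmp are made up for illustration, and this shows the setjmp precedent only, not the proposed '__thread volatile' behavior, which GCC does not implement.

```c
#include <setjmp.h>

static jmp_buf env;

/* Unwinds back to the setjmp call in count_after_longjmp. */
static void jump_back(void)
{
    longjmp(env, 1);
}

/* Returns the value of a local that was modified between setjmp and longjmp. */
int count_after_longjmp(void)
{
    /* volatile: without it, the value read after longjmp would be indeterminate. */
    volatile int count = 0;

    if (setjmp(env) == 0) {
        count = 1;      /* modified after setjmp ... */
        jump_back();    /* ... and before longjmp returns control to setjmp */
    }
    return count;
}
```

With the volatile qualifier, count_after_longjmp() returns 1; without it, the compiler would be free to keep count in a register that longjmp rolls back.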
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461 --- Comment #18 from Stephan Tobies --- +1 - I have a use case where a QuickThread is migrated from one pthread to another. TLS would need to be re-fetched after this migration, but is not due to CSE optimizations. Having a way to disable this optimization, either locally or on a per-file basis would be very useful!
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

Tor Myklebust changed:

What|Removed |Added
CC| |tmyklebu at gmail dot com

--- Comment #17 from Tor Myklebust ---

(In reply to Jakub Jelinek from comment #14)
> Even if we have an option that avoids CSE of TLS addresses across function
> calls (or attribute for specific function), what would you expect to happen
> when user takes address of TLS variables himself:
> __thread int a;
> void
> foo ()
> {
>   int *p = &a;
>   *p = 10;
>   bar (); // Changes threads
>   *p += 10;
> }
> ? The address can be stored anywhere, so the compiler can't do anything
> with it. And of course such an option would cause major slowdown of
> anything using TLS, not only it would need to stop CSEing TLS addresses
> late, but stop treating TLS addresses as constant in all early optimizations
> as well.

When you take &a, gcc docs specify that you get the address of the running thread's instance of a, which is a reasonable pointer for any thread to use as long as the running thread is alive. So everyone already expects that code like this:

    #include <pthread.h>
    #include <stdio.h>

    __thread int a;

    void *bar(void *p) {
      printf("%i %i\n", *(int *)p, a);
      return 0;
    }

    int main() {
      a = 42;
      pthread_t pth;
      pthread_create(&pth, 0, bar, &a);
      pthread_join(pth, 0);
    }

should print "42 0", as p should point to the main thread's instance of a while the reference to a in the third argument to printf in bar should reference the child thread's instance of a, which is zero because TLS is initialised to zero.

It seems that your example:

    __thread int a;

    void foo() {
      int *p = &a;
      *p = 10;
      bar (); // Changes threads
      *p += 10;
    }

must twice modify the instance of a in the thread that started running foo, which is different behaviour from:

    __thread int a;

    void baz() {
      int *p = &a;
      *p = 10;
      bar (); // Changes threads
      p = &a;
      *p += 10;
    }

which must modify the instance of a in the thread that started running baz() once and the instance of a in the thread that finishes running baz() once, since bar may change the value at %fs:0 by changing threads.
Perhaps there is a more serious problem with this whole idea if signal handlers are permitted to twiddle the running thread.
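The "42 0" expectation in comment #17 can be checked without relying on printed output. The sketch below is a variation on that example, not the commenter's code: struct probe, child and run_probe are names invented here, and the child records what it sees instead of printing it.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int a;

/* What the child thread observes. */
struct probe {
    int *parent_a;    /* address of the creating thread's instance of a */
    int via_pointer;  /* value read through parent_a */
    int via_name;     /* value read by naming a directly in the child */
};

static void *child(void *arg)
{
    struct probe *pr = arg;
    pr->via_pointer = *pr->parent_a;  /* parent's instance, reached by pointer */
    pr->via_name = a;                 /* child's own zero-initialised instance */
    return NULL;
}

/* Returns 0 on success, -1 if the thread could not be created. */
int run_probe(struct probe *pr)
{
    pthread_t th;

    a = 42;
    pr->parent_a = &a;
    if (pthread_create(&th, NULL, child, pr) != 0)
        return -1;
    pthread_join(th, NULL);
    return 0;
}
```

After run_probe succeeds, via_pointer holds 42 and via_name holds 0, which matches the behaviour described above: an explicitly taken address keeps referring to the instance of the thread that took it.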
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

--- Comment #16 from Richard Biener ---

Implementing a switch that assumes that function calls (what about async events??) can switch threads, to the effect that the location of TLS variables changes, would be a challenge. Basically you have to implement something that prevents assumptions about a variable's location inside a function, not only its value. That's currently done nowhere and I don't know how to model such a kind of dependency. So I don't think it's easy to model.

It might be possible to put in place more strict constraints to avoid the issue, like instructing the compiler that it can't ever take the address of a __tls object. But then array accesses are modeled as address + index in the language, so I can't see how this would work reliably.

You'd have to expose __tls'ness by lowering that very early, not only during RTL expansion. That is, place TLS base reg loads and do accesses via them. Then make sure calls clobber that base reg load. So put all TLS data into something like a static frame where you'd have a global variable pointing to that. I expect this to be not-fun(TM) for performance.
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

--- Comment #15 from torvald at gcc dot gnu.org ---

From an ISO C/C++ conformance perspective, this is not a bug:

* Pre-C11/C++11, threading was basically undefined.

* Starting with C11 and C++11, "threads of execution" (as they're called in C++11 and more recent, often abbreviated in the standard as "threads") are the lowest level of how execution is modelled. They are defined as single flows of control. Nothing is promised about any resources that may be used to implement threads of execution (e.g., similar to the "execution context" notion mentioned in comment #10).

* Thread-specific state is bound to a particular thread of execution (e.g., regarding lifetime). A thread of execution accessing a __thread variable accesses the thread-specific state of this thread of execution in the abstract machine. (Of course, one can still access other threads' thread-specific state through pointers initially provided by those threads.)

* Only the standards' mechanisms can create threads of execution. There is nothing in these standards that would break up the concept of a single flow of control (i.e., that what "looks" like a single flow of control in a program is actually not always the same thread of execution when executed). (Also note that fork/join parallelism is not a counter-example to this.)

That said, I can see that this doesn't match nicely with the fact that we have things like swapcontext elsewhere.

Do we have any data on what the performance impact would be if the compiler assumed that calls to functions it cannot analyze could switch the thread of execution? Data for several architectures and different TLS models would be helpful.

Coming back to C++, currently I think there is only one Technical Specification (TS) that allows breaking up a single flow of control: .then() in the Concurrency TS (whose specification is certainly not ready for the standard). Maybe the Networking TS has something similar, but I can't remember right now. There are a few proposals that either allow it (Task Blocks, targeting Parallelism TS version 2) or require it for good performance ("stackful" coroutines). The "stackless" coroutines in the upcoming Coroutines TS are not really affected by thread-specific state; it's not the coroutines code that would potentially switch threads, but any runtime that would supply a particular Awaitable implementation that then may switch threads (e.g., if using .then()). The Coroutines TS does not specify any actual runtime.

However, I don't think the existence of some C++ proposals that may switch threads necessarily means that the compiler would have to take this into account when those proposals become part of the standard. The other way this could play out is that the standard simply makes using thread-specific state undefined for those threads of execution that can use these proposals.

Overall, I think it may be useful to experiment with a command line switch or something like that, primarily to assess how big the performance degradation caused by having the compiler assume that threads can switch on function calls would be.

(In reply to Jakub Jelinek from comment #14)
> Even if we have an option that avoids CSE of TLS addresses across function
> calls (or attribute for specific function), what would you expect to happen
> when user takes address of TLS variables himself:
> __thread int a;
> void
> foo ()
> {
>   int *p = &a;
>   *p = 10;
>   bar (); // Changes threads
>   *p += 10;
> }

I think that p would point to the initial thread's TLS, even after bar(). The user would be wrong to assume that it still is the current thread's object "a" after having called bar().
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

Jakub Jelinek changed:

What|Removed |Added
CC| |jakub at gcc dot gnu.org, torvald at gcc dot gnu.org

--- Comment #14 from Jakub Jelinek ---

Even if we have an option that avoids CSE of TLS addresses across function calls (or attribute for specific function), what would you expect to happen when user takes address of TLS variables himself:

    __thread int a;
    void
    foo ()
    {
      int *p = &a;
      *p = 10;
      bar (); // Changes threads
      *p += 10;
    }

? The address can be stored anywhere, so the compiler can't do anything with it. And of course such an option would cause major slowdown of anything using TLS: not only would it need to stop CSEing TLS addresses late, but it would also need to stop treating TLS addresses as constant in all early optimizations.
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461 John Ousterhout changed: What|Removed |Added CC||ouster at cs dot stanford.edu --- Comment #13 from John Ousterhout --- Kernel threads are great, and it may seem like there's no need for user-level threads now that kernel threads are universally available. But layering user-level threads on top of kernel threads can offer a speedup of at least 10x for common operations. The fact that so many different people have responded on this issue and issue 26461 is pretty good evidence that it can be useful to do "context switching" on top of kernel threads. My research group has recently run up against this same problem. Thread-local variables (i.e. kernel-thread-locals) are still useful in this environment (for example, we use one to keep track of the user thread that's loaded on the current kernel thread). One of the great things about gcc is that it has supported a huge variety of applications and design styles; it would be a shame for gcc to preclude this particular class of applications. Is it unreasonably difficult to add a mechanism to force gcc to flush cached thread-local addresses? Those of us using the mechanism would be happy to pay a small performance penalty for it, but presumably that won't affect applications that don't use the mechanism. Please reconsider?
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461 Andy Robbins changed: What|Removed |Added CC||andy at miniciv dot com --- Comment #12 from Andy Robbins --- Cross posting to help others who need this feature. From a similar ticket on LLVM, about adding the /GT flag (which fixes the OP's problem, while being optional, and MSVC supports this): [...] The option is /GT as specified in the title, and it is not enabled by default. There's one particular use case where this kind of option is really important: a fiber-based job system, something that has been used in video game development for multi-core machines. In a system like this, it's common for one job (occupying a fiber) to be paused (ie: swapped for another fiber in the thread it is running) while it waits for some other work to finish, and then be resumed (ie: swapped to) from the next available worker thread, which will be essentially a random worker thread. The whole point here is to distribute jobs to all available CPU cores evenly and automatically, so this TLS situation is inevitable and by design. Yes, TLS is slower in this use case, but it is the correct behaviour. Not having the /GT flag means having to manually inspect all code and roll a custom replacement TLS, which is a considerable effort. Please reconsider having this option. Reference: https://llvm.org/bugs/show_bug.cgi?id=19177
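To give a rough idea of what "roll a custom replacement TLS" in comment #12 can look like on POSIX systems, here is a minimal sketch built on pthread keys. It is not taken from any job system mentioned in this thread: struct job_state, current_job_state, run_job_step and no_yield are hypothetical names. The idea is that pthread_getspecific is an ordinary library call made again after every possible switch, so the lookup is repeated on whichever kernel thread the code resumes on.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-kernel-thread state for a job system. */
struct job_state {
    int jobs_run;
};

static pthread_key_t job_key;
static pthread_once_t job_key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() releases a thread's state when that thread exits. */
    pthread_key_create(&job_key, free);
}

/* Looks up the calling thread's state, creating it on first use.
   Returns NULL if allocation or registration fails. */
struct job_state *current_job_state(void)
{
    pthread_once(&job_key_once, make_key);

    struct job_state *s = pthread_getspecific(job_key);
    if (s == NULL) {
        s = calloc(1, sizeof *s);
        if (s != NULL && pthread_setspecific(job_key, s) != 0) {
            free(s);
            s = NULL;
        }
    }
    return s;
}

/* Stand-in for a yield that never migrates; a real fiber switch would go here. */
void no_yield(void)
{
}

/* yield_fiber is hypothetical: any call that may resume on another worker thread. */
void run_job_step(void (*yield_fiber)(void))
{
    struct job_state *s = current_job_state();
    if (s != NULL)
        s->jobs_run++;

    yield_fiber();

    s = current_job_state();  /* look up again: this may be a different thread now */
    if (s != NULL)
        s->jobs_run++;
}
```

The cost is a function call per access instead of a segment-relative load, which is the kind of manual effort the comment is describing.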
[Bug middle-end/26461] liveness of thread local references across function calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

--- Comment #11 from Giovanni Deretta ---

In the last few years it has become clear that threads are not enough, and coroutines have seen a resurgence in many languages. Go, which is directly supported by GCC, makes them a first class construct; boost has had them for a while (and it is affected by this issue); the HPX library uses them to great effect for HPC work; it even seems that the C++ standard will eventually include them officially [1].

Any chance this resolution will be reviewed? Note that an opt-in flag would be enough, and it should have very little effect on x86, where the compiler doesn't bother to cache TLS address computations, as it has fast TLS access, unless the address is explicitly taken.

[1] If MS gets its way, stackless coroutines shouldn't be affected by this issue, as the switch point can be statically identified by the compiler. But there is still a chance that we will get proper non-crippled stackful coroutines.
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #8 from dwood at sybase dot com 2010-08-23 21:13 ---

I believe this is a bug or a serious oversight in understanding the need for support of USER thread local storage.

There are two kinds of software threads: a) kernel threads (AKA LWPs on Solaris) scheduled by the OS; and b) user threads scheduled by the user and/or threading library.

Databases such as Informix and Sybase both manage their own user threads (1 per client connection). These run on a small pool of engines which are either kernel threads or processes (no more than one per CPU core). The user threads do cooperative scheduling by calling their own yield implementation and they never yield in a critical section. These products and perhaps others are not little weird cases.

These user threads can migrate between kernel threads as load balancing occurs. Of course, this requires that the implementation of the user level thread context switch must save/restore USER TLS variables from/to the TLS areas of the kernel threads involved. If the model where M user threads are handled across N kernel threads is valid, then the address of USER TLS variables can change across a context switch.

We, Sybase, have run into the same problem on Solaris-Sparc and are evaluating whether gcc __thread on x86_64 will have the same problem.

Of course, the database folks often hear from the OS folks the common reply: Why are you doing this? Just trust the OS and leave it all to us. But for us, we have to trust AIX, Linux, Solaris, HP-UX, Windows, etc. This is the same whether we are talking about using USER threads or the dreaded Linux O_DIRECT debate.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #9 from pinskia at gcc dot gnu dot org 2010-08-23 21:19 ---

I think you should read:
http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/C99-Thread_002dLocal-Edits.html#C99-Thread_002dLocal-Edits

--- CUT ---

In GCC's case, threads can only be created via pthread. Note C++0x defines a threading interface and such. I know of no implementation that will really allow the use of user threads.

Really, I think it is wrong to even think about using user threads any more. The main reason why they existed was to support OSes which don't have threads, but those don't really exist any more. Not to mention, support for things like TLS is only for OS-provided threads.

-- pinskia at gcc dot gnu dot org changed:

What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution| |WONTFIX

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #10 from dwood at sybase dot com 2010-08-24 03:31 ---

(In reply to comment #9)

I don't disagree with the thread local writeup; however, it is lacking in clarity. A flow of control must be associated with an execution context. The existence of getcontext/setcontext allows both: 1) one flow of control to switch to another flow of control within a single execution context; and 2) a flow of control to move from one execution context to another.

In fact, __thread variables are not actually bound to a flow of control but to a specific execution context, part of which is usually some kind of thread_t structure associated with a kernel thread. If they were bound to a logical flow of control then we wouldn't even need this discussion.

Nowhere above did I refer to user threads. The above should be consistent with your comment that TLS is supported for OS threads only. But the problem isn't the OS but the compiler.

When the address-of operator is applied to a thread-local variable, it is evaluated at run-time and returns the address of the current thread's instance of that variable.

The question is which OS thread is the current OS thread when the address is obtained to actually fetch the current thread local value. setcontext() avoids changing %fs on Intel and %g7 on Sparc, as it is really restoring a suspended flow of control context onto a kernel thread's context as referenced by %fs/%g7. Caching these across function calls in an MT program is faulty.

Just as users of compilers shouldn't depend on implementation-defined behavior, neither should compiler writers assume OS implementation-defined behavior. OS support for threads/setcontext existed prior to C's decision to provide basic efficient access to TLS variables. If some OS implementations of setcontext allow the context to be pushed onto a different kernel thread than it was gotten from, then caching the current kernel thread handle or TLS address across function calls is wrong.
-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
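Case 1) from comment #10, one flow of control switching to another within a single execution context, can be sketched with the ucontext functions as below. This assumes a glibc-style implementation of getcontext/makecontext/swapcontext; counter, co_body and run_once are names chosen for illustration. Because both flows run on the same kernel thread, they see the same instance of the __thread variable. The problematic case 2), resuming a saved context on a different kernel thread, is deliberately not shown.

```c
#include <ucontext.h>

static __thread int counter;

static ucontext_t main_ctx;
static ucontext_t co_ctx;
static char co_stack[64 * 1024];  /* separate stack for the second flow of control */

/* Second flow of control: runs on co_stack but on the same kernel thread. */
static void co_body(void)
{
    counter += 10;
    swapcontext(&co_ctx, &main_ctx);  /* switch back to the first flow */
}

/* Returns the final value of counter, or -1 if a context call fails. */
int run_once(void)
{
    counter = 1;

    if (getcontext(&co_ctx) != 0)
        return -1;
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;  /* where to continue if co_body ever returns */
    makecontext(&co_ctx, co_body, 0);

    if (swapcontext(&main_ctx, &co_ctx) != 0)  /* run co_body until it swaps back */
        return -1;

    return counter;
}
```

Here run_once() returns 11: the increment made in co_body lands in the same thread-local instance that run_once reads afterwards.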
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #7 from gpderetta at gmail dot com 2009-03-19 12:14 ---

Hi, I'm the author of Boost.Coroutine (not yet part of boost, but one day...). I have the exact same problem: gcc caches the address of TLS variables across function calls, which breaks when coroutines move from one thread to another.

Note that in my case I'm definitely *not* reinventing threads in user space. Coroutines are for different use cases than threads (i.e. when you do not need preemption but simply a way to organize event driven code). One use of boost.coroutine is on top of boost.asio.

Posix has both threads and swapcontext, and nowhere does it say that swapcontext can't be used in threaded applications. In fact it simply states that the saved context is restored after a call to setcontext, and IMHO any posix compatible compiler should support this.

FWIW the microsoft c++ compiler has the /GT (fiber safe TLS) flag to prevent exactly this kind of optimization. Probably GCC should support something like that too. See:
http://www.crystalclearsoftware.com/soc/coroutine/coroutinecoroutine_thread.html
for details.

Finally, I see the problem even with plain pointers and references, not only arrays, at least with gcc4.3:

    #include <ucontext.h>

    void bar(int&);

    __thread int x = 0;

    void foo(ucontext_t& oucp, ucontext_t& ucp) {
      bar(x);
      swapcontext(&oucp, &ucp);
      bar(x);
    }

Compiles down to this (with -O3, on x86_64):

    _Z3fooR8ucontextS0_:
            movq    %fs:0, %rax
            movq    %rbp, -16(%rsp)
            movq    %rbx, -24(%rsp)
            movq    %r12, -8(%rsp)
            movq    %rsi, %rbx
            subq    $24, %rsp
            movq    %rdi, %r12
            leaq    x@tpoff(%rax), %rbp
            movq    %rbp, %rdi
            call    _Z3barRi
            movq    %r12, %rdi
            movq    %rbx, %rsi
            call    swapcontext
            movq    %rbp, %rdi
            movq    (%rsp), %rbx
            movq    8(%rsp), %rbp
            movq    16(%rsp), %r12
            addq    $24, %rsp
            jmp     _Z3barRi

Note that the address of x is computed once into %rbp before the first call to bar and reused, unchanged, for the second call after swapcontext.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
-- pinskia at gcc dot gnu dot org changed: What|Removed |Added Status|UNCONFIRMED |WAITING Component|c |middle-end http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #2 from yichen dot xie at gmail dot com 2006-02-24 22:12 ---

(In reply to comment #1)
> It seems like you are trying to deal with your own threading system instead
> of allowing the OS do its work.

This is indeed what I am trying to do, and C seems to be the perfect language for doing this. I agree it's not common to be switching thread contexts across function calls, but I don't think it should be prohibited by GCC.

In my case, h simply saves its context, puts it on the ready queue, and waits for another thread to pick it up and resume execution with the new thread local copy of array.

So the question is: is there a way to force recomputation of array[0] after h? Is it reasonable to request a mechanism to force recomputation of array[0]?

BTW, the solution IMO is simple: either make sure all thread local values and addresses (the problem seems to exist only with arrays; the compiler is more conservative dealing with pointers, etc.) are dead after a function call, or add a mechanism (__attribute__((thread_switch))?) to force it.

-- yichen dot xie at gmail dot com changed:

What|Removed |Added
CC| |yichen dot xie at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
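One workaround that is sometimes used for the question in comment #2, offered here only as a sketch and not as anything GCC documents or guarantees, is to route every access through a non-inlined accessor so that the address of array[0] is recomputed on each call. The element type int, and the names array_slot, no_switch and bump_around, are assumptions made for illustration; h stands for the context-switching function from the comment. The empty volatile asm is there so that, as far as I know, GCC will not classify the accessor as const or pure and merge two calls into one.

```c
/* Assumed element type: the thread only gives the name `array`. */
static __thread int array[1];

/* Recomputes the address of the calling thread's array[0] on every call.
   noinline keeps the TLS address computation out of the caller; the volatile
   asm hides the pointer's origin and marks the function as having side
   effects, so two calls are not combined. Relies on observed GCC behaviour. */
__attribute__((noinline)) int *array_slot(void)
{
    int *p = &array[0];
    __asm__ volatile("" : "+r"(p));
    return p;
}

/* Stand-in for h that does not switch threads at all. */
void no_switch(void)
{
}

/* h is hypothetical: a call after which we may be running on another thread. */
void bump_around(void (*h)(void))
{
    *array_slot() += 1;  /* address computed on the thread running now */
    h();
    *array_slot() += 1;  /* address computed again after the possible switch */
}
```

This costs a call per access and must be applied by hand at every use, which is why several commenters ask for a compiler switch instead.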
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #3 from pinskia at gcc dot gnu dot org 2006-02-24 22:38 ---

(In reply to comment #2)
> (In reply to comment #1)
> > It seems like you are trying to deal with your own threading system instead
> > of allowing the OS do its work.
> This is indeed what I am trying to do, and C seems to be the perfect language
> for doing this. I agree it's not common to be switching thread contexts across
> function calls, but I don't think it should be prohibited by GCC.

Why not let the OS do its job? I still don't understand that idea.

Actually no, it is not reasonable in general, since GCC assumes the address is invariant, which is correct except for your little weird case. What function are you using to save/restore the context? There is no standard C function which allows for that. Even get/setcontext are POSIX, but I doubt they support across threads correctly anyway. I know setjmp/longjmp don't for sure.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #4 from yichen dot xie at gmail dot com 2006-02-24 23:06 ---

> Why not let the OS do its job? I still don't understand that idea.

It's a thread library that builds on top of pthreads, so yes, OS is doing its job, and we're doing more on top of that. C is a natural choice for us. Does it help if we rename h to reschedule?

    __thread array[1];
    for (;;) {
      // do something with array
      reschedule();
    }

> Actually no it is not responable in general since GCC assumes the address is
> invariant which is correct except for your little weird case. What function

Well, it may be a bit weird for any other language, but not C (IMO). It's definitely not weird if you compare it to the kernel, which is largely written in C. Thread local objects are invariant within a thread, not across threads. I think it could be dangerous for gcc to assume that function calls preserve thread context, esp. when the function is written in assembly. At least there should be a way to tell the compiler not to assume that, given C is a low-level language where everything should be possible.

> are you using to save/restore the context? There are no standard C function

Very simple assembly code that stores/restores a few registers, including %esp. C is a low level language, and it should interoperate well not only with standard C functions, but also with assembly or any other weird functions. That's what C is good for, isn't it?

> which allows for that. Even get/setcontext are POSIX but I doubt they support
> across threads correctly anyways. I know setjmp/longjmp don't for sure.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #5 from pinskia at gcc dot gnu dot org 2006-02-25 00:02 --- ISO C is not your normal low level language any more. It actually tries to be a high level language. So this is not a bug. -- pinskia at gcc dot gnu dot org changed: What|Removed |Added Status|WAITING |RESOLVED Resolution||INVALID http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
[Bug middle-end/26461] liveness of thread local references across function calls
--- Comment #6 from yichen dot xie at gmail dot com 2006-02-25 01:55 ---

(In reply to comment #5)
> ISO C is not your normal low level language any more. It actually tries to be
> a high level language. So this is not a bug.

I still don't think it's a good idea to treat thread local array addresses as invariant. If you look at the implementation of getcontext/swapcontext, they intentionally left the gs segment register out of the context, leaving open the possibility that a context saved by one thread be resumed by another. What will gcc do in this case?

If you don't mind, could you point me to the section of ISO C where it specifies that function calls must preserve thread contexts? If not, by all means it's a bug in the optimizer.

Does anyone else have an opinion?

-- yichen dot xie at gmail dot com changed:

What|Removed |Added
Status|RESOLVED |UNCONFIRMED
Resolution|INVALID |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461