[Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #10 from Thiago Macieira ---
You're right that emitting more instructions penalises those who have done their job and written proper code. The problem we're seeing is that such code appears to be in the minority. Or, put differently, the bad code is showing up a lot in our benchmarks, especially on very big multi-core and multi-socket systems.

"Fixing" the compiler would apply a broad update to the industry -- so long as the code is recompiled with new compilers. Fixing the actual code would make it better even when used with old ones.

Does anyone have a suggestion on how to get the best "bang for buck" (biggest benefit for smallest effort)? This is a sincere question; I'm not trying to be ironic or sarcastic. How can we help the most, the quickest, with the limited amount of resources we can marshal?

Also, and I've been hitting this key for a few years: how can we do better at teaching people to use the tools at their disposal in the proper way? A very good counter-example is C++11's std::atomic_flag: you MUST NEVER use it (at least until C++20, where it got a test() member).
--- Comment #9 from Jonathan Wakely ---
I don't know what the assembly code does, sorry, but storing to the 'expected' value is only permitted when the atomic variable does not compare equal to it. It looks like this would update the 'expected' value unconditionally?

I agree with Jakub that we don't want the additional instructions, because they will be redundant in some cases where the user has already done a load.
--- Comment #8 from Jakub Jelinek ---
(In reply to H.J. Lu from comment #7)
> Instead of generating:
>
>         movl    f(%rip), %eax
> .L2:
>         movd    %eax, %xmm0
>         addss   .LC0(%rip), %xmm0
>         movd    %xmm0, %edx
>         lock cmpxchgl   %edx, f(%rip)
>         jne     .L2
>         ret
>
> we want
>
>         movl    f(%rip), %eax
> .L2:
>         movd    %eax, %xmm0
>         addss   .LC0(%rip), %xmm0
>         movd    %xmm0, %edx
>         cmpl    f(%rip), %eax
>         jne     .L2
>         lock cmpxchgl   %edx, f(%rip)
>         jne     .L2
>         ret

No, certainly not. The initial mov, or the value remembered from the previous lock cmpxchgl, already has the right value unless the atomic memory is extremely contended, so you don't want to add the non-atomic comparison in between. Not to mention that the way you've written it is totally broken: if the memory is not equal to the expected value, you should get the current value back. With the above code, if f is modified by another thread between the initial movl f(%rip), %eax and the cmpl f(%rip), %eax and never after that, it will loop forever.

I believe what the above paper is talking about should be addressed by users of these intrinsics, if they care and if it is beneficial (e.g. depending on extra information on how contended the lock etc. is; in OpenMP one has the omp_sync_hint_* constants one can use in the hint clause to say whether the lock is contended, uncontended, speculative, non-speculative, or unknown).
--- Comment #7 from H.J. Lu ---
Instead of generating:

        movl    f(%rip), %eax
.L2:
        movd    %eax, %xmm0
        addss   .LC0(%rip), %xmm0
        movd    %xmm0, %edx
        lock cmpxchgl   %edx, f(%rip)
        jne     .L2
        ret

we want

        movl    f(%rip), %eax
.L2:
        movd    %eax, %xmm0
        addss   .LC0(%rip), %xmm0
        movd    %xmm0, %edx
        cmpl    f(%rip), %eax
        jne     .L2
        lock cmpxchgl   %edx, f(%rip)
        jne     .L2
        ret
--- Comment #6 from Jakub Jelinek ---
E.g. the builtin is often used in a loop where the user does his own atomic load first and decides what to do based on that.
Say for

float f;

void foo ()
{
  #pragma omp atomic
  f += 3.0f;
}

with -O2 -fopenmp we emit:

  D.2113 = &f;
  D.2115 = __atomic_load_4 (D.2113, 0);
  D.2114 = D.2115;
  <D.2120>:
  D.2112 = VIEW_CONVERT_EXPR<float>(D.2114);
  _1 = D.2112 + 3.0e+0;
  D.2116 = VIEW_CONVERT_EXPR<int>(_1);
  D.2117 = .ATOMIC_COMPARE_EXCHANGE (D.2113, D.2114, D.2116, 4, 0, 0);
  D.2118 = REALPART_EXPR <D.2117>;
  D.2119 = D.2114;
  D.2114 = D.2118;
  if (D.2118 != D.2119) goto <D.2120>; [0.00%] else goto <D.2121>; [100.00%]
  <D.2121>:
  return;

which is essentially

void foo ()
{
  int x = __atomic_load_4 ((int *) &f, __ATOMIC_RELAXED), y;
  float g;
  do
    {
      __builtin_memcpy (&g, &x, 4);
      g += 3.0f;
      __builtin_memcpy (&y, &g, 4);
    }
  while (!__atomic_compare_exchange_n ((int *) &f, &x, y, false,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}

Can you explain how your proposed change would improve this? It would just slow it down and make it larger.
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |redi at gcc dot gnu.org,
                   |        |rodgertq at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek ---
I don't think so, but we'd need somebody more familiar with the C/C++ memory model to say for sure. In any case, even if it is, I'm not convinced it is a good idea: users should be able to choose whether they do the separate atomic load first themselves, rather than the compiler forcing them to do it unconditionally.
--- Comment #4 from H.J. Lu ---
(In reply to Jakub Jelinek from comment #3)
> If by fail you mean that it doesn't update the memory if the memory isn't
> equal to expected, sure, but do you mean it can fail spuriously, not update
> the memory even if the memory is equal to expected?
> Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n
> with weak set to false can fail spuriously, __atomic_compare_exchange_n with
> weak set to true can.

If we generate

        movl    m(%rip), %eax
        cmpl    %edi, %eax
        jne     .L1
        movl    %edi, %eax
        lock cmpxchgl   %esi, m(%rip)
.L1:
        ret

is it a valid implementation of atomic_compare_exchange_strong?
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek ---
If by fail you mean that it doesn't update the memory if the memory isn't equal to expected, sure. But do you mean it can fail spuriously, i.e. not update the memory even if the memory is equal to expected? Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n with weak set to false can fail spuriously; __atomic_compare_exchange_n with weak set to true can.
H.J. Lu changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
     Ever confirmed|0           |1
   Last reconfirmed|            |2021-11-05

--- Comment #2 from H.J. Lu ---
(In reply to Hongyu Wang from comment #1)
> __sync_val_compare_and_swap will be expanded to
> atomic_compare_exchange_strong by default; should we restrict the check and
> return under atomic_compare_exchange_weak, which is allowed to fail
> spuriously?

On x86, "lock cmpxchgl" can fail.
--- Comment #1 from Hongyu Wang ---
__sync_val_compare_and_swap will be expanded to atomic_compare_exchange_strong by default; should we restrict the check and return under atomic_compare_exchange_weak, which is allowed to fail spuriously?