[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #7 from Paul Eggert ---
(In reply to Alexander Monakov from comment #6)
> Are you binding the benchmark to some core in particular?

I did the benchmark on performance cores, which was my original use case. On
efficiency cores, adding the (unnecessary) 'mov eax, 1' doesn't change the
timing much (a 0.9% speedup on one test).

> it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this loop on
> any CPU

Somewhat counterintuitively, that doesn't seem to be the case for the
efficiency cores on this platform: the "38% faster" code is 7% slower on
E-cores. However, the use cases I'm concerned about are typically run on
performance cores.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #6 from Alexander Monakov ---
Thanks. The i5-1335U has two "performance cores" (with HT, four logical CPUs)
and eight "efficiency cores"; they have different micro-architectures. Are you
binding the benchmark to some core in particular?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency). This optimization appeared in the Alder Lake generation with
the "Golden Cove" uarch and was found by Andreas Abel. There are limitations
(e.g. it works for 64-bit additions but not 32-bit, and the addend must be an
immediate less than 1024).

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in
this loop on any CPU ('mov eax, 1' competes for ALU ports with other
instructions, so when it is delayed due to contention, the dependent
'add rbx, rax; movsx rax, [rbx]' pair is delayed too), but ascribing the
difference to compiler scheduling on a CPU that does out-of-order dynamic
scheduling is strange.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #5 from Paul Eggert ---
(In reply to Alexander Monakov from comment #4)
> To evaluate the scheduling aspect, keep 'mov eax, 1' while changing
> 'add rbx, rax' to 'add rbx, 1'.

Adding the (unnecessary) 'mov eax, 1' doesn't affect the timing much, which
is what I would expect on a newer processor. When I reran the benchmark on
the same laptop (Intel i5-1335U), I got 3.289 s for the GCC-generated code,
2.256 s for the "38% faster" code (now it's 46% faster; I don't know why),
and 2.260 s for the faster code with the unnecessary 'mov eax, 1' inserted.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's
> Intel i5-1335U.

This is a mischaracterization. The modified loop has one uop less, because
you are replacing 'mov eax, 1; add rbx, rax' with 'add rbx, 1'. To evaluate
the scheduling aspect, keep 'mov eax, 1' while changing 'add rbx, rax' to
'add rbx, 1'.

There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #3 from Andrew Pinski ---
  _22 = *iter_57;
  if (_22 >= 0)
    goto ; [90.00%]
  else
    goto ; [10.00%]

  [local count: 860067200]:
  _76 = (long long unsigned int) _22;
  _15 = sum_31 + _76;
  goto ; [100.00%]

...

  [local count: 955630226]:
  # prephitmp_42 = PHI <1(4), 1(5), len_29(6)>
  # prephitmp_35 = PHI <_15(4), sum_31(5), _34(6)>
  mbs ={v} {CLOBBER(eol)};
  ch ={v} {CLOBBER(eol)};
  iter_21 = iter_57 + prephitmp_42;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #2 from Paul Eggert ---
Created attachment 55790
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55790&action=edit
asm code that's 38% faster on my platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #1 from Paul Eggert ---
Created attachment 55789
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55789&action=edit
asm code generated by gcc -O2 -S