[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-26 Thread eggert at cs dot ucla.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #7 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #6)

> Are you binding the benchmark to some core in particular?

I did the benchmark on performance cores, which was my original use case. On
efficiency cores, adding the (unnecessary) 'mov eax, 1' doesn't change the
timing much (a 0.9% speedup on one test).
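On Linux one can keep P-core and E-core runs apart by pinning the process to a
single logical CPU, either with 'taskset -c N ./bench' or programmatically; a
minimal sketch follows (the CPU number is only an example, and the mapping of
CPU numbers to core types is machine-specific, e.g. readable from
/sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus on hybrid parts):

/* Minimal sketch: pin this process to one logical CPU so P-core and
   E-core timings are not mixed.  CPU 0 is only an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
  int cpu = argc > 1 ? atoi (argv[1]) : 0;
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  if (sched_setaffinity (0, sizeof set, &set) != 0)
    {
      perror ("sched_setaffinity");
      return EXIT_FAILURE;
    }
  /* ... run the benchmark loop here ... */
  return 0;
}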

> it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this loop on 
> any CPU

Somewhat counterintuitively, that doesn't seem to be the case for the
efficiency cores on this platform, as the "38% faster" code is 7% slower on
E-cores. However, the use cases I'm concerned about are typically run on
performance cores.

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thanks.

The i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores"; the two core types have different micro-architectures. Are
you binding the benchmark to a particular core?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency); this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations: e.g.,
it works for 64-bit additions but not 32-bit ones, and the addend must be an
immediate less than 1024.
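Whether the elimination fires on a given core can be made visible with a chain
of dependent immediate adds; here is a hedged sketch of such a probe
(illustrative only: __rdtsc counts reference cycles rather than core clock
cycles, so absolute numbers shift with frequency and loop overhead adds noise;
the interesting signal is the 64-bit vs. 32-bit ratio on the same core):

/* Hedged sketch: serial chains of dependent immediate adds.  If the
   elimination fires for the 64-bit form but not the 32-bit one, the
   64-bit chain should run markedly faster per add. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 100000000L

int main (void)
{
  uint64_t x64 = 0;
  uint32_t x32 = 0;

  uint64_t t0 = __rdtsc ();
  for (long i = 0; i < ITERS; i++)
    /* Four serial 64-bit immediate adds per iteration. */
    __asm__ ("addq $1, %0\n\taddq $1, %0\n\taddq $1, %0\n\taddq $1, %0"
             : "+r" (x64));
  uint64_t t1 = __rdtsc ();

  for (long i = 0; i < ITERS; i++)
    /* The same chain with 32-bit adds, reportedly not eliminated. */
    __asm__ ("addl $1, %0\n\taddl $1, %0\n\taddl $1, %0\n\taddl $1, %0"
             : "+r" (x32));
  uint64_t t2 = __rdtsc ();

  printf ("64-bit: %.2f ref cycles/add, 32-bit: %.2f ref cycles/add"
          " (sums: %llu %u)\n",
          (t1 - t0) / (4.0 * ITERS), (t2 - t1) / (4.0 * ITERS),
          (unsigned long long) x64, x32);
  return 0;
}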

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this
loop on any CPU ('mov eax, 1' competes for ALU ports with other instructions,
so when it is delayed by contention the dependent 'add rbx, rax; movsx rax,
[rbx]' pair is delayed too), but ascribing the difference to compiler
scheduling on a CPU that does out-of-order dynamic scheduling is strange.
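To make the two sequences concrete, here is a hedged reconstruction of the
inner-loop kernels being compared, written as GCC extended asm. Only the ASCII
fast path is modeled (every byte is summed and the pointer advances by 1), the
loops assume p < lim on entry, and the real code is in the attachments:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Variant A, GCC's sequence: the pointer advance waits on 'movl $1, %eax'. */
static uint64_t
sum_variant_a (const signed char *p, const signed char *lim)
{
  uint64_t sum = 0;
  __asm__ volatile (
      "1:\n\t"
      "movsbq (%[p]), %%rax\n\t"   /* load next byte, sign-extended */
      "addq %%rax, %[sum]\n\t"     /* chain 1: running total (r12's role) */
      "movl $1, %%eax\n\t"         /* takes an ALU port every iteration */
      "addq %%rax, %[p]\n\t"       /* chain 2: pointer advance via rax */
      "cmpq %[lim], %[p]\n\t"
      "jb 1b"
      : [p] "+r" (p), [sum] "+r" (sum)
      : [lim] "r" (lim)
      : "rax", "cc", "memory");
  return sum;
}

/* Variant B, the hand-edited sequence: one uop fewer, immediate add. */
static uint64_t
sum_variant_b (const signed char *p, const signed char *lim)
{
  uint64_t sum = 0;
  __asm__ volatile (
      "1:\n\t"
      "movsbq (%[p]), %%rax\n\t"
      "addq %%rax, %[sum]\n\t"
      "addq $1, %[p]\n\t"          /* eliminable on Golden Cove P-cores */
      "cmpq %[lim], %[p]\n\t"
      "jb 1b"
      : [p] "+r" (p), [sum] "+r" (sum)
      : [lim] "r" (lim)
      : "rax", "cc", "memory");
  return sum;
}

int main (void)
{
  static signed char buf[1 << 24];
  memset (buf, 'a', sizeof buf);   /* all-ASCII input, fast path only */
  printf ("%llu %llu\n",
          (unsigned long long) sum_variant_a (buf, buf + sizeof buf),
          (unsigned long long) sum_variant_b (buf, buf + sizeof buf));
  return 0;
}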

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-25 Thread eggert at cs dot ucla.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #5 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #4)

> To evaluate scheduling aspect, keep 'mov eax, 1' while changing 'add rbx,
> rax' to 'add rbx, 1'.

Adding the (unnecessary) 'mov eax, 1' doesn't affect the timing much, which is
what I would expect on a newer processor.

When I reran the benchmark on the same laptop (Intel i5-1335U), I got 3.289 s
for the GCC-generated code, 2.256 s for the "38% faster" code (now it's 46%
faster; I don't know why), and 2.260 s for the faster code with the unnecessary
'mov eax, 1' inserted.

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's Intel
> i5-1335U.

This is a mischaracterization. The modified loop has one uop fewer, because you
replaced 'mov eax, 1; add rbx, rax' with 'add rbx, 1'.

To evaluate the scheduling aspect alone, keep 'mov eax, 1' while changing 'add
rbx, rax' to 'add rbx, 1'.
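A hedged sketch of this control experiment, as a drop-in third kernel for the
harness sketched under comment #6 above: the dead 'movl $1, %eax' keeps the uop
count equal to variant A while the immediate add removes the dependence through
rax.

static uint64_t
sum_variant_c (const signed char *p, const signed char *lim)
{
  uint64_t sum = 0;
  __asm__ volatile (
      "1:\n\t"
      "movsbq (%[p]), %%rax\n\t"
      "addq %%rax, %[sum]\n\t"
      "movl $1, %%eax\n\t"         /* dead: the value is never consumed */
      "addq $1, %[p]\n\t"          /* pointer no longer waits on eax */
      "cmpq %[lim], %[p]\n\t"
      "jb 1b"
      : [p] "+r" (p), [sum] "+r" (sum)
      : [lim] "r" (lim)
      : "rax", "cc", "memory");
  return sum;
}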

There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).
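In C terms the two carried chains are the running sum and the pointer; a
minimal sketch (identifiers hypothetical):

#include <stdint.h>

/* Each iteration's sum depends on the previous sum (r12's chain) and
   each iteration's pointer on the previous pointer (rbx's chain); the
   load and the compare hang off these chains without lengthening them. */
static uint64_t
byte_sum (const signed char *buf, const signed char *lim)
{
  uint64_t sum = 0;                               /* chain over r12 */
  for (const signed char *p = buf; p < lim; p++)  /* chain over rbx */
    sum += (uint64_t) *p;
  return sum;
}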

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
  _22 = *iter_57;
  if (_22 >= 0)
    goto <bb 4>; [90.00%]
  else
    goto <bb 5>; [10.00%]

  <bb 4> [local count: 860067200]:
  _76 = (long long unsigned int) _22;
  _15 = sum_31 + _76;
  goto <bb 7>; [100.00%]
...
  <bb 7> [local count: 955630226]:
  # prephitmp_42 = PHI <1(4), 1(5), len_29(6)>
  # prephitmp_35 = PHI <_15(4), sum_31(5), _34(6)>
  mbs ={v} {CLOBBER(eol)};
  ch ={v} {CLOBBER(eol)};
  iter_21 = iter_57 + prephitmp_42;
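Read back into C, the dump above has roughly the following shape (a hedged
sketch; names are hypothetical and the multibyte slow path that produces
len_29/_34 is elided):

#include <stddef.h>

static unsigned long long
sum_ascii (const signed char *buf, const signed char *lim)
{
  unsigned long long sum = 0;            /* sum_31 / prephitmp_35 */
  for (const signed char *iter = buf; iter < lim; )
    {
      signed char c = *iter;             /* _22 = *iter_57 */
      size_t step;                       /* prephitmp_42 */
      if (c >= 0)                        /* 90%: single-byte ASCII */
        {
          sum += (unsigned long long) c; /* _15 = sum_31 + _76 */
          step = 1;                      /* PHI argument 1(4) */
        }
      else                               /* 10%: multibyte character */
        {
          step = 1;  /* placeholder: real code yields 1 or len_29 here */
        }
      iter += step;                      /* iter_21 = iter_57 + prephitmp_42 */
    }
  return sum;
}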

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-24 Thread eggert at cs dot ucla.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #2 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55790
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55790&action=edit
asm code that's 38% faster on my platform

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-24 Thread eggert at cs dot ucla.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #1 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55789
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55789&action=edit
asm code generated by gcc -O2 -S