[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-16 Thread hp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

Hans-Peter Nilsson  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hp at gcc dot gnu.org

--- Comment #7 from Hans-Peter Nilsson  ---
(In reply to Richard Biener from comment #6)
> The auto-inc pass is well
> structured, so it should be possible to extend it.
Or just replace it, as it doesn't look far enough to be able to handle all
inc/dec opportunities.

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-02-06
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener  ---
It's already visible with a simple

void f(const long* src, long* dst)
{
  *dst++ = *src++;
  *dst = *src;
}

where we expand to RTL from

  _1 = *src_3(D);
  *dst_4(D) = _1;
  _2 = MEM[(const long int *)src_3(D) + 4B];
  MEM[(long int *)dst_4(D) + 4B] = _2;

there's nothing on GIMPLE that would split the add, and RTL's auto-inc-dec
pass doesn't do anything either.  We'd need a form of "strength reduction",
or maybe targets preferring auto-inc/dec should not legitimize constant
offsets before reload ...

Note with one more copy you then see

  _1 = *src_4(D);
  *dst_5(D) = _1;
  _2 = MEM[(const long int *)src_4(D) + 4B];
  MEM[(long int *)dst_5(D) + 4B] = _2;
  _3 = MEM[(const long int *)src_4(D) + 8B];
  MEM[(long int *)dst_5(D) + 8B] = _3;

and naively splitting gives you

  src_6 = src_4(D) + 4;
  src_7 = src_4(D) + 8;

that said, it's really something for RTL, since which form is more efficient
is going to be highly target dependent.  The auto-inc pass is well
structured, so it should be possible to extend it.
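
For reference, the source form behind the three-copy GIMPLE above is presumably
just the testcase with one more copy added; a sketch (the name f3 is made up
here, it is not from the PR):

void f3(const long* src, long* dst)
{
  /* With 4-byte long on m68k this gives accesses at offsets 0, 4 and 8.  */
  *dst++ = *src++;
  *dst++ = *src++;
  *dst = *src;
}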

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-06 Thread miro.kropacek at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #5 from Miro Kropacek  ---
I have been told that one of the reasons why post-increment modes are not
supported / preferred these days is that they stall the CPU pipeline (of course,
totally not applicable on m68k).  So with the offsets you can execute the moves
in parallel, while when post-incrementing a1 you always have to wait for the
previous instruction to finish.

So I can understand why this has been changed, but it definitely shouldn't be a
change affecting all possible CPUs.

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-06 Thread mikpelinux at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #4 from Mikael Pettersson  ---
I'm not sure this is an m68k bug. I tried several targets that have
auto-increment addressing modes (m68k, pdp11, msp430, vax, aarch64) and none of
them would use auto-increment for this test case.

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-06 Thread miro.kropacek at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #3 from Miro Kropacek  ---
> I wonder if the code we emit is measurably slower though?  It's possibly
> a little bit larger due to the two IV increments.

It's definitely slower, as both offsets next to the An registers generate a
separate instruction word.  So instead of the 2-byte instruction "move.l
(a0)+,(a1)+" we get a 6-byte instruction "move.l off(a0),off(a1)", and that
hurts a lot even on the 68060, not to mention the poor 68000.
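
To make the size difference concrete, here is a hand-written sketch for the
two-copy testcase from comment #6 (not actual compiler output; the instruction
forms and sizes are the ones quoted above):

/* Sketch only.  The post-increment form is a single opcode word per move,
   while each d16(An) operand adds an extension word:

     post-increment form:             displacement form:
       move.l (a0)+,(a1)+  ; 2 bytes    move.l (a0),(a1)     ; 2 bytes
       move.l (a0)+,(a1)+  ; 2 bytes    move.l 4(a0),4(a1)   ; 6 bytes
*/
void f(const long* src, long* dst)
{
  *dst++ = *src++;
  *dst = *src;
}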

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #2 from Richard Biener  ---
I don't think IVOPTs would use postinc for the intermediate increments.  It's
constant propagation/forwarding that accumulates the increments into a constant
offset, which removes the dependences between the instructions and thus would
allow the loads/stores to be executed in parallel (well, not that m68k uarchs
can likely do any of that ...).

I wonder if the code we emit is measurably slower though?  It's possibly
a little bit larger due to the two IV increments.
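
Roughly, the chained form that this propagation collapses can be written out at
the source level like this (a sketch for illustration only; the names f_chained,
s and d are made up):

void f_chained(const long* src, long* dst)
{
  const long* s = src;
  long* d = dst;
  *d = *s;
  s = s + 1;   /* increments chained off the previous pointers ...        */
  d = d + 1;
  *d = *s;     /* ... but forwarding turns these accesses into the
                  base + constant-offset MEMs seen in comment #6, dropping
                  the dependence on the incremented pointers.  */
}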

[Bug target/113779] Very inefficient m68k code generated for simple copy loop

2024-02-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #1 from Andrew Pinski  ---
> So what's the catch here? Why gcc hates move.l (ax)+,(ay)+ so much?

At one point in time (before GCC 8 or 9, I think), GCC's IV-OPTs optimization
did not take post/pre increment into account, but now it does.
BUT if the target cost model does not take those into account, then IV-OPTs
could decide not to use them.
Now m68k is a target that not many GCC developers look at fixing, so it is up
to someone to look into why the post-increment is no longer being used.
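
For context, the kind of copy loop the bug summary refers to looks roughly like
this (a representative sketch, not necessarily the exact testcase from the
original report):

void copy(const long* src, long* dst, unsigned long n)
{
  /* On m68k one would hope for a post-increment loop body,
     i.e. move.l (a0)+,(a1)+, instead of offset addressing plus
     separate pointer increments.  */
  while (n--)
    *dst++ = *src++;
}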