[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-05-13
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina  ---
(In reply to Richard Biener from comment #7)
> Likely
> 
>   Base: (integer(kind=4) *)  + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
> 
> vs.
> 
>   Base: (integer(kind=4) *)  + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> 
> where we fail to optimize the outer multiply.  It's
> 
>  ((unsigned)((signed)x * 81) + 9) * 4
> 
> and likely done by extract_muldiv for the case of (unsigned)x.  The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid.  But best not by enhancing extract_muldiv ...

Ah, thanks!

Mine then.

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #7 from Richard Biener  ---
Likely

  Base: (integer(kind=4) *)  + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)

vs.

  Base: (integer(kind=4) *)  + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

where we fail to optimize the outer multiply.  It's

 ((unsigned)((signed)x * 81) + 9) * 4

and likely done by extract_muldiv for the case of (unsigned)x.  The trick
would be to promote the inner multiply to unsigned to make the otherwise
profitable transform valid.  But best not by enhancing extract_muldiv ...
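As an illustration (not from the bug report): a minimal C sketch of the arithmetic, with made-up function names and 32-bit types standing in for the 64-bit sizetype arithmetic in the dumps. It shows why the outer multiply only distributes once the inner multiply is done in unsigned, modular arithmetic:

---
#include <stdint.h>
#include <stdio.h>

/* The shape GCC is left with: the inner multiply is signed, so "x * 81"
   may overflow (undefined behaviour), and the folder cannot blindly
   rewrite the whole expression as x * 324 + 36.  */
static uint32_t offset_unfolded (int32_t x)
{
  return ((uint32_t) (x * 81) + 9) * 4;
}

/* The suggested trick: do the inner multiply in unsigned arithmetic.
   Unsigned arithmetic is modular, so distributing the outer *4 is then
   unconditionally valid: 81 * 4 = 324 and 9 * 4 = 36, matching the
   "good" Base expression.  */
static uint32_t offset_folded (int32_t x)
{
  return (uint32_t) x * 324 + 36;
}

int main (void)
{
  /* Wherever the signed multiply does not overflow, both forms agree.  */
  for (int32_t x = -10000; x <= 10000; x++)
    if (offset_unfolded (x) != offset_folded (x))
      printf ("mismatch at %d\n", x);
  return 0;
}
---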

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #6 from Tamar Christina  ---
Created attachment 58096
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #5 from Tamar Christina  ---
Created attachment 58095
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #4 from Tamar Christina  ---
reduced more:

---
  module brute_force
    integer, parameter :: r = 9
    integer block(r, r, 0)
  contains
    subroutine brute
      do
        do
          do
            do
              do
                do
                  do i7 = l0, 1
                    select case (1)
                    case (1)
                      block(:2, 7:, 1) = block(:2, 7:, i7) - 1
                    end select
                    do i8 = 1, 1
                      do i9 = 1, 1
                        if (1 == 1) then
                          call digits_20
                        end if
                      end do
                    end do
                  end do
                end do
              end do
            end do
          end do
        end do
      end do
    end
  end
---

I'll have to stop now till I'm back, but the main difference seems to be in:

good:

:
IV struct:
  SSA_NAME: _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: Overflow
IV struct:
  SSA_NAME: _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *)  + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
  Step: 324
  Object:   (void *) 
  Biv:  N
  Overflowness wrto loop niter: No-overflow

bad:

:
IV struct:
  SSA_NAME: _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D) * 81
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *)  + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
  Step: 324
  Object:   (void *) 
  Biv:  N
  Overflowness wrto loop niter: No-overflow

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #3 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #2)
> > which is harder for prefetchers to follow.
> 
> This seems like a limitation in the HW prefetcher rather than anything else.
> Maybe the cost model for addressing mode should punish base+index if so.
> Many HW prefetchers I know of are based on the final VA (or even PA) rather
> than looking at the instruction to see if it increments or not ...

That was the first thing we tried, and even increasing the cost of
register_offset to something ridiculously high doesn't change a thing.

IVopts thinks it needs to use it and generates:

  _1150 = (voidD.26 *) _1148;
  _1152 = (sizetype) l0_78(D);
  _1154 = _1152 * 324;
  _1156 = _1154 + 216;
  # VUSE <.MEM_421>
  vect__349.614_1418 = MEM  [(integer(kind=4)D.9 *)_1150 + _1156 * 1 clique 2 base 0];

Hence the bug report to see what's going on.
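As a rough illustration (hypothetical names, not compiler output) of the addressing shapes under discussion: the good form walks a single pointer with a constant step, while the bad form keeps the offset live in a register, so every access is base + index-register, as in the MEM reference above.

---
#include <stddef.h>

/* "good" shape: the offset folds into the initial pointer, and the loop
   advances that one pointer by a constant step (friendly to stride
   prefetchers and to base+immediate / post-increment addressing).  */
int sum_good (int *base, long l0, int n)
{
  int *p = (int *) ((char *) base + (size_t) (unsigned long) l0 * 324 + 216);
  int s = 0;
  for (int i = 0; i < n; i++, p = (int *) ((char *) p + 324))
    s += *p;
  return s;
}

/* "bad" shape: the offset stays live in a register, so every access is
   base + index-register, mirroring the MEM [..._1150 + _1156 * 1...]
   access above.  */
int sum_bad (int *base, long l0, int n)
{
  size_t off = (size_t) l0 * 324 + 216;
  int s = 0;
  for (int i = 0; i < n; i++, off += 324)
    s += *(int *) ((char *) base + off);
  return s;
}
---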

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Andrew Pinski  changed:

   What|Removed |Added

 CC||pinskia at gcc dot gnu.org

--- Comment #2 from Andrew Pinski  ---
> which is harder for prefetchers to follow.

This seems like a limitation in the HW prefetcher rather than anything else.
Maybe the cost model for addressing mode should punish base+index if so. Many
HW prefetchers I know of are based on the final VA (or even PA) rather than looking
at the instruction to see if it increments or not ...

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #1 from Richard Biener  ---
The change likely made SCEV/IVOPTs "stop" at more convenient places, but we can
only know once there's a more detailed analysis.