[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tamar Christina changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
     Ever confirmed|0                            |1
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2024-05-13
           Assignee|unassigned at gcc dot gnu.org|tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina ---
(In reply to Richard Biener from comment #7)
> Likely
>
>   Base: (integer(kind=4) *) + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
>
> vs.
>
>   Base: (integer(kind=4) *) + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
>
> where we fail to optimize the outer multiply.  It's
>
>   ((unsigned)((signed)x * 81) + 9) * 4
>
> and likely done by extract_muldiv for the case of (unsigned)x.  The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid.  But best not by enhancing extract_muldiv ...

Ah, thanks! Mine then.
--- Comment #7 from Richard Biener ---
Likely

  Base: (integer(kind=4) *) + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)

vs.

  Base: (integer(kind=4) *) + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

where we fail to optimize the outer multiply.  It's

  ((unsigned)((signed)x * 81) + 9) * 4

and likely done by extract_muldiv for the case of (unsigned)x.  The trick
would be to promote the inner multiply to unsigned to make the otherwise
profitable transform valid.  But best not by enhancing extract_muldiv ...
--- Comment #6 from Tamar Christina ---
Created attachment 58096
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts
--- Comment #5 from Tamar Christina ---
Created attachment 58095
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts
--- Comment #4 from Tamar Christina ---
reduced more:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, 0)
contains
subroutine brute
  do
   do
    do
     do
      do
       do
        do i7 = l0, 1
           select case(1)
           case(1)
              block(:2, 7:, 1) = block(:2, 7:, i7) - 1
           end select
           do i8 = 1, 1
              do i9 = 1, 1
                 if(1 == 1) then
                    call digits_20
                 end if
              end do
           end do
        end do
       end do
      end do
     end do
    end do
   end do
  end do
end
end
---

I'll have to stop now till I'm back, but the main difference seems to be in:

good:

IV struct:
  SSA_NAME:     _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
  Step: 81
  Biv:  N
  Overflowness wrto loop niter:  Overflow

IV struct:
  SSA_NAME:     _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter:  No-overflow

IV struct:
  SSA_NAME:     i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter:  No-overflow

IV struct:
  SSA_NAME:     vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
  Step: 324
  Object:       (void *)
  Biv:  N
  Overflowness wrto loop niter:  No-overflow

bad:

IV struct:
  SSA_NAME:     _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D) * 81
  Step: 81
  Biv:  N
  Overflowness wrto loop niter:  No-overflow

IV struct:
  SSA_NAME:     _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter:  No-overflow

IV struct:
  SSA_NAME:     i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter:  No-overflow

IV struct:
  SSA_NAME:     vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
  Step: 324
  Object:       (void *)
  Biv:  N
  Overflowness wrto loop niter:  No-overflow
--- Comment #3 from Tamar Christina ---
(In reply to Andrew Pinski from comment #2)
> > which is harder for prefetchers to follow.
>
> This seems like a limitation in the HW prefetcher rather than anything else.
> Maybe the cost model for addressing mode should punish base+index if so.
> Many HW prefetchers I know of are based on the final VA (or even PA) rather
> than looking at the instruction to see if it increments or not ...

That was the first thing we tried, and even increasing the cost of
register_offset to something ridiculously high doesn't change a thing.
IVopts thinks it needs to use it and generates:

  _1150 = (voidD.26 *) _1148;
  _1152 = (sizetype) l0_78(D);
  _1154 = _1152 * 324;
  _1156 = _1154 + 216;
  # VUSE <.MEM_421>
  vect__349.614_1418 = MEM [(integer(kind=4)D.9 *)_1150 + _1156 * 1 clique 2 base 0];

Hence the bug report to see what's going on.
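For readers following along, the two addressing shapes being argued over can be sketched in C (a hedged illustration with invented names; actual codegen depends on the target's cost model):

```c
/* Pointer-bump form: the address itself is the induction variable.
   Each access is a plain [reg] load/store and the stride is implicit
   in a simple pointer increment, which sequential prefetchers and
   post-increment addressing handle easily. */
void stride_by_pointer(int *base, long n)
{
    int *p = base;
    for (long i = 0; i < n; i++) {
        *p += 1;
        p += 81;            /* base register simply advances by the stride */
    }
}

/* Base+index form: a fixed base plus a scaled index IV, i.e. the
   reg+reg*scale (register_offset) addressing IVOPTs picked here. */
void stride_by_index(int *base, long n)
{
    for (long i = 0; i < n; i++)
        base[i * 81] += 1;  /* address = base + i*81*4 each iteration */
}
```

Both functions touch exactly the same addresses; the complaint above is that IVOPTs insists on the second shape even when register_offset is priced prohibitively high.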
Andrew Pinski changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |pinskia at gcc dot gnu.org

--- Comment #2 from Andrew Pinski ---
> which is harder for prefetchers to follow.

This seems like a limitation in the HW prefetcher rather than anything else.
Maybe the cost model for addressing mode should punish base+index if so.
Many HW prefetchers I know of are based on the final VA (or even PA) rather
than looking at the instruction to see if it increments or not ...
--- Comment #1 from Richard Biener ---
The change likely made SCEV/IVOPTs "stop" at more convenient places, but we
can only know once there's a more detailed analysis.