https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
One difference that is clearly visible is missed vectorization:

 module_small_step_em.fppized.f90:1399:14: note: LOOP VECTORIZED
 module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED
-module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED
 module_small_step_em.fppized.f90:1354:14: note: LOOP VECTORIZED

This is the following inner loop, which we vectorize before the change but
not after.  Before the change we create a runtime check for the data
references MEM[(float[0:D.14326] *)_891][_13070] and
MEM[(float[0:D.14326] *)_891][_4424]:

module_small_step_em.fppized.f90:1376:14: note: created 1 versioning for alias checks.
module_small_step_em.fppized.f90:1376:14: optimized: loop versioned for vectorization because of possible aliasing

whereas afterwards we get

module_small_step_em.fppized.f90:1376:14: missed: number of versioning for alias run-time tests exceeds 10 (--param vect-max-version-for-alias-checks)

SUBROUTINE advance_w( w, rw_tend, ww, u, v,                  &
...
! Jammed 3 doubly nested loops over k/i into 1 for slight improvement
! in efficiency.  No change in results (bit-for-bit).  JM 20040514
! (left a blank line where the other two k/i-loops were)
!
      DO k=2,k_end
      DO i=i_start, i_end
        w(i,k,j)=w(i,k,j)+dts*rw_tend(i,k,j)                   &
             + msft_inv(i)*cqw(i,k,j)*(                        &
                 +.5*dts*g*mut_inv(i)*rdn(k)*                  &
                  (c2a(i,k  ,j)*rdnw(k  )                      &
             *((1.+epssm)*(rhs(i,k+1  )-rhs(i,k    ))          &
              +(1.-epssm)*(ph(i,k+1,j)-ph(i,k  ,j)))           &
                  -c2a(i,k-1,j)*rdnw(k-1)                      &
             *((1.+epssm)*(rhs(i,k    )-rhs(i,k-1  ))          &
              +(1.-epssm)*(ph(i,k  ,j)-ph(i,k-1,j)))))         &
                  +dts*g*msft_inv(i)*(rdn(k)*                  &
                  (c2a(i,k  ,j)*alt(i,k  ,j)*t_2ave(i,k  ,j)   &
                  -c2a(i,k-1,j)*alt(i,k-1,j)*t_2ave(i,k-1,j))  &
                 +(rdn(k)*(c2a(i,k  ,j)*alb(i,k  ,j)           &
                          -c2a(i,k-1,j)*alb(i,k-1,j))*mut_inv(i)-1.) &
                                                  *muave(i,j))
      ENDDO
      ENDDO

There is almost no difference in the IL after if-conversion, but
unroll-and-jam (which unrolls the outer loop once) introduces changes like

-  _14697 = MEM[(float[0:D.14440] *)_959 clique 40 base 47][_15604];
+  _14697 = MEM[(float[0:D.14440] *)_959 clique 71 base 47][_15604];
   _9355 = _14697 * 4.905000209808349609375e+0;
-  _9371 = MEM[(float[0:D.14436] *)_957 clique 40 base 28][_15604];
+  _9371 = MEM[(float[0:D.14436] *)_957 clique 71 base 28][_15604];

The restrict clique remapping is, strictly speaking, required here.  There
are of course cases where the remapping can be elided, but they are
difficult to reconstruct.  Basically, each clique > 1 belongs to a specific
inline block, and whether we need to remap it when copying a stmt depends
on whether we duplicate "within" that block or duplicate the block itself.

A good "hint" would be (for unroll-and-jam, for example) whether cliques
mentioned in the to-be-copied loop are also mentioned in stmts belonging
exclusively to the outer loop that is unrolled (but then those could have
been moved there by invariant motion, for example).  I'm not sure whether
we want to look at the BLOCK tree, or whether we want to build a similar
thing into the loop structure, noting inline contexts (the single clique
that needs no remapping when copying the loop).

Note I haven't been able to verify whether the above loop is executed at
all, but -fno-loop-unroll-and-jam makes performance equal (and faster than
ever...).