https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > Is it really r256643 and not r256644 that is causing this though?
>
> Yes, I can verify that it's r256644 that's causing the regression.

This means that those newly vectorized loops make capacita slower.

551 is

    do m=1,n/4-1
      h = A(j0+m) + A(j2+m)*E(m*inc)
      A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
      A(j0+m) = h
      eh = conjg(E(ntot/4-m*inc))
      h = A(j1+m) - A(j3+m)*eh
      A(j3+m) = A(j1+m) + A(j3+m)*eh
      A(j1+m) = h
    end do

but it's actually the loops from the array expressions I guess (receiving
"interesting" locations).

105 is

    do i=1,Ng1                  ! .. and multiply charge with x ->
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do

The variable strides are because those are arrays accessed via array
descriptors. This means that the stride will very likely be one, so any
"strided XY" vectorization will have quite a big overhead.

Of course in the end it looks like a cost model issue (which might be just
not enough factoring in of the alias runtime check). Like in the 105 case
I expect the non-vectorized loop to be a quite "nice" optimized nest with
IVO doing a good job, etc. If the inner loop is vectorized conditionally
this can wreck code generation and runtime quite a bit.

Ng1/Ng2 are 1024 (at runtime, read from capacita.in).

For 105 we have

capacita.f90:105:0: note: need run-time check that (ssizetype) ((sizetype) prephitmp_341 * 4) is nonzero
capacita.f90:105:0: note: versioning for alias required: can't determine dependence between d1 and *_150[_65]

so two checks are needed.

capacita.f90:105:0: note: Cost model analysis:
  Vector inside of loop cost: 172
  Vector prologue cost: 60
  Vector epilogue cost: 136
  Scalar iteration cost: 60
  Scalar outside cost: 8
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

I think the bug is that we're somehow thinking the vectorized arithmetic
(two multiplications) offset the use of strided loads and stores...

Testcase for this loop:

module solv_cap
  implicit none
  public :: solveP
  integer, parameter, public :: dp = selected_real_kind(5)
  real(kind=dp), private :: Pi, eps0
  real(kind=dp), private :: D1, D2
  integer, private, save :: Ng1=0, Ng2=0
  integer, private, pointer, dimension(:,:) :: Grid
contains
  subroutine solveP(P)
    real(kind=dp), intent(out) :: P
    real(kind=dp), allocatable, dimension(:,:) :: Y0, X
    integer :: i,j
    allocate( Y0(Ng1,Ng2), X(Ng1,Ng2) )
    do i=1,Ng1
      do j=1,Ng2
        Y0(i,j) = D1 * (i-(Ng1+1)/2.0_dp) * Grid(i,j)
      end do
    end do      ! RHS for in-field E_x=1. V = -V_in = x, where metal on grid
    call solve( X, Y0 )
    X = X - sum(X)/size(X)      ! get rid of monopole term ..
    do i=1,Ng1                  ! .. and multiply charge with x
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do
    P = sum(X)*D1*D2 * 4*Pi*eps0      ! E-dipole moment in 1 V/m field
    deallocate( X, Y0 )
    return
  end subroutine solveP
end module solv_cap
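
For reference, the "Calculated minimum iters for profitability: 7" figure in the cost model dump can be reproduced with a simple break-even calculation. This is only a rough sketch, not GCC's implementation. It assumes a vectorization factor of 4 (not stated in the dump) and a linear model in which the vector loop pays its outside cost once and its inside cost once per VF scalar iterations that are not peeled into the prologue or epilogue. The function and parameter names are made up for illustration.

```python
def min_profitable_iters(scalar_iter_cost, scalar_outside_cost,
                         vec_inside_cost, vec_outside_cost,
                         prologue_iters, epilogue_iters, vf):
    """Smallest iteration count n at which the vector version is modelled as cheaper.

    Assumed model (a simplification, not taken from the bug comment):
      scalar cost = scalar_iter_cost * n + scalar_outside_cost
      vector cost = vec_inside_cost * (n - prologue - epilogue) / vf
                    + vec_outside_cost
    """
    if scalar_iter_cost * vf <= vec_inside_cost:
        # One vector iteration costs at least as much as vf scalar ones,
        # so under this model vectorization never pays off.
        return None
    n = 1
    while True:
        # Both sides are multiplied by vf to stay in integer arithmetic.
        scalar = (scalar_iter_cost * n + scalar_outside_cost) * vf
        vector = (vec_inside_cost * (n - prologue_iters - epilogue_iters)
                  + vec_outside_cost * vf)
        if scalar > vector:
            return n
        n += 1


# Costs copied from the dump for capacita.f90:105; vf=4 is an assumption.
threshold = min_profitable_iters(
    scalar_iter_cost=60, scalar_outside_cost=8,
    vec_inside_cost=172, vec_outside_cost=196,
    prologue_iters=0, epilogue_iters=2, vf=4)
print(threshold)  # 7
```

With Ng2 = 1024 the trip count is far above this threshold, so under these costs the vector version is always chosen. If the inside cost of 172 underestimates the strided loads and stores, the threshold says nothing about the real slowdown.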