[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

rguenth at gcc dot gnu.org Mon, 29 Jan 2018 02:55:25 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037


--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > Is it really r256643 and not r256644 that is causing this though?
> 
> Yes, I can verify that it's r256644 that's causing the regression.

This means that those newly vectorized loops make capacita slower.
The loop at line 551 is

          do m=1,n/4-1
            h = A(j0+m) + A(j2+m)*E(m*inc)
            A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
            A(j0+m) = h
            eh = conjg(E(ntot/4-m*inc))
            h = A(j1+m) - A(j3+m)*eh
            A(j3+m) = A(j1+m) + A(j3+m)*eh
            A(j1+m) = h
          end do

but I guess it's actually the loops generated from the array expressions
(which receive "interesting" locations).  The loop at line 105 is

    do i=1,Ng1                     !   .. and multiply charge with x
->    do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do

The variable strides arise because those arrays are accessed via array
descriptors.  This means that the stride will very likely be one, so
any "strided XY" vectorization will have quite a big overhead.

Of course in the end it looks like a cost model issue (which might just be
the alias runtime check not being factored in enough).  In the 105 case,
for example, I expect the non-vectorized loop to be a quite "nice"
optimized nest with IVO doing a good job, etc.  If the inner loop
is vectorized conditionally this can wreck code generation and
runtime quite a bit.  Ng1/Ng2 are 1024 (at runtime, read from capacita.in).

For 105 we have

capacita.f90:105:0: note: need run-time check that (ssizetype) ((sizetype)
prephitmp_341 * 4) is nonzero
capacita.f90:105:0: note: versioning for alias required: can't determine
dependence between d1 and *_150[_65]

so two run-time checks are needed.
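
Roughly, the vectorized copy of the inner loop is only entered if both
guards pass; otherwise the scalar copy runs.  A minimal sketch of that
logic in Python, purely as an illustration and not what GCC actually
emits: the function names and the addresses in the usage lines are made
up, the element size of 4 is taken from the "* 4" in the first note, and
the overlap test is just the generic "two byte ranges are disjoint" check.

```python
def stride_is_nonzero(stride_elems, elem_size=4):
    """First guard: the byte stride of the strided access must be nonzero."""
    return stride_elems * elem_size != 0


def segments_disjoint(addr_a, len_a, addr_b, len_b):
    """Second guard: byte ranges [addr, addr + len) must not overlap."""
    return addr_a + len_a <= addr_b or addr_b + len_b <= addr_a


def use_vector_version(stride_elems, x_addr, x_bytes, d1_addr, d1_bytes=4):
    """Pick the vectorized loop copy only if both guards hold."""
    return (stride_is_nonzero(stride_elems)
            and segments_disjoint(x_addr, x_bytes, d1_addr, d1_bytes))


if __name__ == "__main__":
    # Hypothetical addresses, chosen only to exercise each outcome.
    print(use_vector_version(1024, 0x10000, 4096, 0x1000))   # both guards pass
    print(use_vector_version(0, 0x10000, 4096, 0x1000))      # zero stride
    print(use_vector_version(1024, 0x10000, 4096, 0x10010))  # d1 inside X
```

Both guards are evaluated at run time before the loop, so their cost is
paid on every entry to the inner loop regardless of which copy is taken.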

capacita.f90:105:0: note: Cost model analysis:
  Vector inside of loop cost: 172
  Vector prologue cost: 60
  Vector epilogue cost: 136
  Scalar iteration cost: 60
  Scalar outside cost: 8
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

I think the bug is that we're somehow thinking the vectorized arithmetic
(two multiplications) offsets the cost of the strided loads and stores...
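
To put numbers on that, here is a small Python calculation over the dump
above.  Two things in it are assumptions and not from the dump: the
vectorization factor (VF = 4, i.e. 4-byte reals in a 16-byte vector) and
the break-even formula, which is only my reading of how the threshold is
derived.  With VF = 4 it does reproduce the reported 7.

```python
# Values copied from the "Cost model analysis" dump for capacita.f90:105.
VEC_INSIDE = 172      # Vector inside of loop cost (per vector iteration)
VEC_OUTSIDE = 196     # Vector outside cost (prologue 60 + epilogue 136)
SCALAR_ITER = 60      # Scalar iteration cost
SCALAR_OUTSIDE = 8    # Scalar outside cost
PROLOGUE_ITERS = 0    # prologue iterations
EPILOGUE_ITERS = 2    # epilogue iterations

VF = 4  # ASSUMPTION: vectorization factor, not shown in the dump


def per_element_costs(vf=VF):
    """Return (scalar, vector) modelled body cost per scalar iteration."""
    return SCALAR_ITER, VEC_INSIDE / vf


def break_even_iters(vf=VF):
    """ASSUMED break-even formula: first trip count where vector wins."""
    saved_per_vec_iter = SCALAR_ITER * vf - VEC_INSIDE
    if saved_per_vec_iter <= 0:
        return None  # the vector body never pays off under this model
    outside = (VEC_OUTSIDE - SCALAR_OUTSIDE) * vf
    peeled = VEC_INSIDE * (PROLOGUE_ITERS + EPILOGUE_ITERS)
    n = (outside - peeled) // saved_per_vec_iter
    # Bump by one if the vector version is not yet strictly cheaper.
    if SCALAR_ITER * vf * n <= VEC_INSIDE * n + outside:
        n += 1
    return n


if __name__ == "__main__":
    scalar, vector = per_element_costs()
    print(scalar, vector)        # 60 43.0
    print(break_even_iters())    # 7
```

Under that assumption the model prices the vector body at 43 per element
against 60 for the scalar loop, so it sees a gain of a bit under 30%
even though every load and store is strided.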


Testcase for this loop:

module solv_cap
  implicit none

  public  :: solveP

  integer, parameter, public :: dp = selected_real_kind(5)

  real(kind=dp), private :: Pi, eps0
  real(kind=dp), private :: D1, D2
  integer,       private, save :: Ng1=0, Ng2=0
  integer,       private, pointer,     dimension(:,:)  :: Grid

contains

  subroutine solveP(P)
    real(kind=dp), intent(out) :: P

    real(kind=dp), allocatable, dimension(:,:)  :: Y0, X
    integer :: i,j

    allocate( Y0(Ng1,Ng2), X(Ng1,Ng2) )

    do i=1,Ng1
      do j=1,Ng2
        Y0(i,j) = D1 * (i-(Ng1+1)/2.0_dp) * Grid(i,j)
      end do
    end do     ! RHS for in-field E_x=1.  V = -V_in = x, where metal on grid

    call solve( X, Y0 )

    X = X - sum(X)/size(X)         ! get rid of monopole term ..
    do i=1,Ng1                     !   .. and multiply charge with x
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do
    P = sum(X)*D1*D2 * 4*Pi*eps0   ! E-dipole moment in 1 V/m field

    deallocate( X, Y0 )
    return
  end subroutine solveP

end module solv_cap
