https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89578
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
One difference that is clearly visible is missed vectorization:

 module_small_step_em.fppized.f90:1399:14: note: LOOP VECTORIZED
 module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED
-module_small_step_em.fppized.f90:1376:14: note: LOOP VECTORIZED
 module_small_step_em.fppized.f90:1354:14: note: LOOP VECTORIZED

This is the following inner loop, which we vectorize before the change but
not after.  Before the change we create a runtime check for the data
references MEM[(float[0:D.14326] *)_891][_13070] and
MEM[(float[0:D.14326] *)_891][_4424]:

module_small_step_em.fppized.f90:1376:14: note: created 1 versioning for alias checks.
module_small_step_em.fppized.f90:1376:14: optimized: loop versioned for vectorization because of possible aliasing

whereas afterwards we get

module_small_step_em.fppized.f90:1376:14: missed: number of versioning for alias run-time tests exceeds 10 (--param vect-max-version-for-alias-checks)

SUBROUTINE advance_w( w, rw_tend, ww, u, v,                  &
...
! Jammed 3 doubly nested loops over k/i into 1 for slight improvement
! in efficiency.  No change in results (bit-for-bit).  JM 20040514
! (left a blank line where the other two k/i-loops were)
!
      DO k=2,k_end
      DO i=i_start, i_end
        w(i,k,j)=w(i,k,j)+dts*rw_tend(i,k,j)                   &
             + msft_inv(i)*cqw(i,k,j)*(                        &
                 +.5*dts*g*mut_inv(i)*rdn(k)*                  &
                  (c2a(i,k  ,j)*rdnw(k  )                      &
             *((1.+epssm)*(rhs(i,k+1  )-rhs(i,k    ))          &
              +(1.-epssm)*(ph(i,k+1,j)-ph(i,k  ,j)))           &
                  -c2a(i,k-1,j)*rdnw(k-1)                      &
             *((1.+epssm)*(rhs(i,k    )-rhs(i,k-1  ))          &
              +(1.-epssm)*(ph(i,k  ,j)-ph(i,k-1,j)))))         &
                  +dts*g*msft_inv(i)*(rdn(k)*                  &
                  (c2a(i,k  ,j)*alt(i,k  ,j)*t_2ave(i,k  ,j)   &
                  -c2a(i,k-1,j)*alt(i,k-1,j)*t_2ave(i,k-1,j))  &
                 +(rdn(k)*(c2a(i,k  ,j)*alb(i,k  ,j)           &
                          -c2a(i,k-1,j)*alb(i,k-1,j))*mut_inv(i)-1.) &
                                                  *muave(i,j))
      ENDDO
      ENDDO

There is almost no difference in the IL after if-conversion, but
unroll-and-jam (which unrolls the outer loop once) introduces changes like

-  _14697 = MEM[(float[0:D.14440] *)_959 clique 40 base 47][_15604];
+  _14697 = MEM[(float[0:D.14440] *)_959 clique 71 base 47][_15604];
   _9355 = _14697 * 4.905000209808349609375e+0;
-  _9371 = MEM[(float[0:D.14436] *)_957 clique 40 base 28][_15604];
+  _9371 = MEM[(float[0:D.14436] *)_957 clique 71 base 28][_15604];

The restrict clique remapping is, strictly speaking, required here.  There
are of course cases where the remapping can be elided, but they are
difficult to reconstruct.  Basically, each clique > 1 belongs to a specific
inline block, and whether we need to remap it when copying a stmt depends
on whether we duplicate "within" that block or duplicate the block itself.

A good "hint" would be (for unroll-and-jam, for example) whether cliques
mentioned in the to-be-copied loop are also mentioned in stmts belonging
exclusively to the outer loop that is unrolled (but then those could have
been moved there by invariant motion, for example).  I'm not sure whether
we want to look at the BLOCK tree, or whether we want to build a similar
thing into the loop structure, noting inline contexts (the single clique
that needs no remapping when copying the loop).

Note I haven't been able to verify whether the above loop is executed at
all, but -fno-loop-unroll-and-jam makes performance equal (and faster than
ever...).