https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125931

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase:

      SUBROUTINE TWOTFF(XPQKL,CO,FC,DC,NUM2,NUM,NOC,XX,IX,NXX)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      PARAMETER (MXAO=2047)
      DIMENSION XPQKL(NUM2,*),CO(NUM,*),FC(*),DC(*),XX(*),IX(*)
      COMMON /IJPAIR/ IA(MXAO)
      NINT = ABS(NXX)
         DO 40 M=1,NINT
            VAL1 = XX(M)
            VAL3 = VAL1
            VAL4 = (VAL1+VAL1)+(VAL1+VAL1)
            LABEL = IX(M)
            MP = ISHFT( LABEL, -24 )
            MQ = IAND( ISHFT( LABEL, -16 ), 255 )
            MR = IAND( ISHFT( LABEL,  -8 ), 255 )
            MS = IAND( LABEL, 255 )
            MPQ= IA(MP)+MQ
            MRS= IA(MR)+MS
            MPR= IA(MP)+MR
            MPS= IA(MP)+MS
            MQR= IA(MAX(MQ,MR))+MIN(MQ,MR)
            MQS= IA(MAX(MQ,MS))+MIN(MQ,MS)
            FC(MPQ) = FC(MPQ)+VAL4*DC(MRS)
            FC(MRS) = FC(MRS)+VAL4*DC(MPQ)
            FC(MPR) = FC(MPR)-VAL1*DC(MQS)
            FC(MPS) = FC(MPS)-VAL1*DC(MQR)
            FC(MQR) = FC(MQR)-VAL1*DC(MPS)
            FC(MQS) = FC(MQS)-VAL1*DC(MPR)
            IF(MP.EQ.MQ) VAL1 = VAL1+VAL1
            IF(MR.EQ.MS) VAL3 = VAL3+VAL3
            MKL=0
            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE
   40     CONTINUE
      RETURN
      END


A key observation is that MK is {1, +, 1}_2 and

Analyzing # of iterations of loop 3
  exit condition [2, + , 1](no_overflow) <= mk_172
  bounds on difference of bases: -1 ... 2147483644
  result:
    # of iterations (unsigned int) mk_172 + 4294967295, bounded by 2147483645

but as NOC isn't known we cannot really give a good estimate of how
badly the outer loop cost hits us for the inner loop.

Another observation is that we have a collapsed IV with MKL so actually
collapsing the nest might make sense.

Looking at the vector costing side, the code we generate for the
inner loop is (VF == 2)

.L7:
        vmovsd  (%r11,%rcx), %xmm1
        vmovhpd (%r11,%rsi), %xmm1, %xmm1
        vmovsd  (%rdx,%r13), %xmm0
        incl    %edi
        vmovhpd (%rdx), %xmm0, %xmm5
        vmovsd  (%r8,%rcx), %xmm0
        vmovhpd (%r8,%rsi), %xmm0, %xmm0
        vmulpd  %xmm11, %xmm1, %xmm1
        vfmadd132pd     %xmm12, %xmm1, %xmm0
        vmovsd  (%r15,%rcx), %xmm1
        vmovhpd (%r15,%rsi), %xmm1, %xmm1
        vfmadd132pd     %xmm10, %xmm5, %xmm0
        vmulpd  %xmm8, %xmm1, %xmm1
        vmovlpd %xmm0, (%rdx,%r13)
        vmovhpd %xmm0, (%rdx)
        vmovsd  (%rax,%r13), %xmm0
        vmovhpd (%rax), %xmm0, %xmm5
        vmovsd  (%rbx,%rcx), %xmm0
        vmovhpd (%rbx,%rsi), %xmm0, %xmm0
        addq    %rbp, %rcx
        addq    %r14, %rdx
        addq    %rbp, %rsi
        vfmadd132pd     %xmm9, %xmm1, %xmm0
        vfmadd132pd     %xmm7, %xmm5, %xmm0
        vmovlpd %xmm0, (%rax,%r13)
        vmovhpd %xmm0, (%rax)
        addq    %r14, %rax
        cmpl    %r12d, %edi
        jne     .L7

compared to the scalar code

.L9:    
        vmulsd  (%rax,%rbp,8), %xmm5, %xmm0 
        incl    %ecx    
        vfmadd231sd     (%rax), %xmm14, %xmm0
        vfmadd213sd     (%rdx), %xmm17, %xmm0
        vmovsd  %xmm0, (%rdx)
        vmulsd  (%rax,%r8,8), %xmm6, %xmm0
        vfmadd231sd     (%rax,%rsi,8), %xmm15, %xmm0
        addq    %r12, %rax
        vfmadd213sd     (%rdx,%r9,8), %xmm13, %xmm0
        vmovsd  %xmm0, (%rdx,%r9,8)
        addq    %rdi, %rdx
        cmpl    %r10d, %ecx
        jne     .L9

we cost the load + vector construction too much, looking at the number
of ops.

(*xpqkl_156(D))[_56] 1 times scalar_load costs 12 in body
(*xpqkl_156(D))[_56] 1 times scalar_load costs 12 in body
(*xpqkl_156(D))[_56] 1 times vec_construct costs 12 in body

but of course there's extra latency due to it and since we're nowhere
compute bound this is likely what makes the vectorization bad as we
are not reducing the number of load or store ops (12 loads and 4 stores
per two scalar iterations in both versions).

This would be something for overall accounting.


The effect of the patch on the pessimization we got for vec_to_scalar
is this: while vec_to_scalar is cost 4 we costed that twice, and
vec_deconstruct for V2DF is also cost 4.  So the former cost was too
big, and the overzealous scaling by 3 from
      stmt_cost *= (GET_MODE_BITSIZE (TYPE_MODE (ls_type))
                    / GET_MODE_BITSIZE (TYPE_MODE (ls_eltype)) + 1);

gets us to 12 vs. the former 24.


But as said the issue is that 'NOC' is small and the branch/layout
overhead hurts.
