https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125931
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase:
      SUBROUTINE TWOTFF(XPQKL,CO,FC,DC,NUM2,NUM,NOC,XX,IX,NXX)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      PARAMETER (MXAO=2047)
      DIMENSION XPQKL(NUM2,*),CO(NUM,*),FC(*),DC(*),XX(*),IX(*)
      COMMON /IJPAIR/ IA(MXAO)
      NINT = ABS(NXX)
      DO 40 M=1,NINT
      VAL1 = XX(M)
      VAL3 = VAL1
      VAL4 = (VAL1+VAL1)+(VAL1+VAL1)
      LABEL = IX(M)
      MP = ISHFT( LABEL, -24 )
      MQ = IAND( ISHFT( LABEL, -16 ), 255 )
      MR = IAND( ISHFT( LABEL, -8 ), 255 )
      MS = IAND( LABEL, 255 )
      MPQ= IA(MP)+MQ
      MRS= IA(MR)+MS
      MPR= IA(MP)+MR
      MPS= IA(MP)+MS
      MQR= IA(MAX(MQ,MR))+MIN(MQ,MR)
      MQS= IA(MAX(MQ,MS))+MIN(MQ,MS)
      FC(MPQ) = FC(MPQ)+VAL4*DC(MRS)
      FC(MRS) = FC(MRS)+VAL4*DC(MPQ)
      FC(MPR) = FC(MPR)-VAL1*DC(MQS)
      FC(MPS) = FC(MPS)-VAL1*DC(MQR)
      FC(MQR) = FC(MQR)-VAL1*DC(MPS)
      FC(MQS) = FC(MQS)-VAL1*DC(MPR)
      IF(MP.EQ.MQ) VAL1 = VAL1+VAL1
      IF(MR.EQ.MS) VAL3 = VAL3+VAL3
      MKL=0
      DO 30 MK=1,NOC
      DO 30 ML=1,MK
      MKL = MKL+1
      XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     * VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
      XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     * VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE
   40 CONTINUE
      RETURN
      END
A key observation is that MK is {1, +, 1}_2 and
Analyzing # of iterations of loop 3
exit condition [2, + , 1](no_overflow) <= mk_172
bounds on difference of bases: -1 ... 2147483644
result:
# of iterations (unsigned int) mk_172 + 4294967295, bounded by 2147483645
but as NOC isn't known we cannot really give a good estimate of how
badly the outer loop overhead is going to hit us for the inner loop.
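
To make the niter expression above concrete: adding 4294967295 in
unsigned 32-bit arithmetic is the same as subtracting 1, so the dump
says the latch runs MK - 1 times. A minimal C sketch that checks this
against a brute-force count follows; the function names are made up
for illustration and are not GCC internals.

```c
#include <assert.h>
#include <stdint.h>

/* The expression printed in the dump: (unsigned int) mk + 4294967295.
   Unsigned arithmetic wraps modulo 2^32, so this equals mk - 1. */
static uint32_t niter_from_dump(int32_t mk)
{
  return (uint32_t) mk + 4294967295u;
}

/* Count how often the exit test "iv <= mk" succeeds for the
   IV [2, +, 1] from the dump. */
static uint32_t niter_by_counting(int32_t mk)
{
  uint32_t n = 0;
  for (int32_t iv = 2; iv <= mk; ++iv)
    ++n;
  return n;
}

/* Compare both for every mk in 1 .. max_mk (mk >= 1 holds in the
   testcase because MK starts at 1). */
static void check_range(int32_t max_mk)
{
  for (int32_t mk = 1; mk <= max_mk; ++mk)
    assert(niter_from_dump(mk) == niter_by_counting(mk));
}
```

The point is only that the count depends directly on MK, which in turn
is bounded by the unknown NOC.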
Another observation is that we have a collapsed IV with MKL so actually
collapsing the nest might make sense.
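
As a rough illustration of what collapsing could mean here, the sketch
below compares a triangular nest that carries a running index (like
MKL) with a single loop over that index. This is C rather than Fortran,
the loop body is a simplified stand-in for the XPQKL updates, and the
function names are hypothetical; it is one possible reading of the
idea, not what GCC implements.

```c
#include <assert.h>

/* Original shape: triangular nest with a running linear index mkl. */
static void nest(double *x, const double *c, int noc)
{
  int mkl = 0;
  for (int mk = 1; mk <= noc; ++mk)
    for (int ml = 1; ml <= mk; ++ml)
      {
        ++mkl;
        x[mkl - 1] += c[mk - 1] * c[ml - 1];
      }
}

/* Collapsed shape: one loop over mkl, with (mk, ml) advanced by hand. */
static void collapsed(double *x, const double *c, int noc)
{
  int total = noc * (noc + 1) / 2;   /* size of the triangle */
  int mk = 1, ml = 1;
  for (int mkl = 1; mkl <= total; ++mkl)
    {
      x[mkl - 1] += c[mk - 1] * c[ml - 1];
      if (ml == mk)
        {
          ++mk;      /* end of a row: move to the next mk */
          ml = 1;
        }
      else
        ++ml;
    }
  /* The hand-advanced pair must end just past the last row. */
  assert(mk == noc + 1 && ml == 1);
}
```

With a single loop the trip count is NOC*(NOC+1)/2 instead of MK per
inner-loop entry, so short inner loops no longer pay the per-entry
overhead separately.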
Looking at the vector costing side, the code we generate for the
inner loop is (VF == 2)
.L7:
vmovsd (%r11,%rcx), %xmm1
vmovhpd (%r11,%rsi), %xmm1, %xmm1
vmovsd (%rdx,%r13), %xmm0
incl %edi
vmovhpd (%rdx), %xmm0, %xmm5
vmovsd (%r8,%rcx), %xmm0
vmovhpd (%r8,%rsi), %xmm0, %xmm0
vmulpd %xmm11, %xmm1, %xmm1
vfmadd132pd %xmm12, %xmm1, %xmm0
vmovsd (%r15,%rcx), %xmm1
vmovhpd (%r15,%rsi), %xmm1, %xmm1
vfmadd132pd %xmm10, %xmm5, %xmm0
vmulpd %xmm8, %xmm1, %xmm1
vmovlpd %xmm0, (%rdx,%r13)
vmovhpd %xmm0, (%rdx)
vmovsd (%rax,%r13), %xmm0
vmovhpd (%rax), %xmm0, %xmm5
vmovsd (%rbx,%rcx), %xmm0
vmovhpd (%rbx,%rsi), %xmm0, %xmm0
addq %rbp, %rcx
addq %r14, %rdx
addq %rbp, %rsi
vfmadd132pd %xmm9, %xmm1, %xmm0
vfmadd132pd %xmm7, %xmm5, %xmm0
vmovlpd %xmm0, (%rax,%r13)
vmovhpd %xmm0, (%rax)
addq %r14, %rax
cmpl %r12d, %edi
jne .L7
compared to the scalar code
.L9:
vmulsd (%rax,%rbp,8), %xmm5, %xmm0
incl %ecx
vfmadd231sd (%rax), %xmm14, %xmm0
vfmadd213sd (%rdx), %xmm17, %xmm0
vmovsd %xmm0, (%rdx)
vmulsd (%rax,%r8,8), %xmm6, %xmm0
vfmadd231sd (%rax,%rsi,8), %xmm15, %xmm0
addq %r12, %rax
vfmadd213sd (%rdx,%r9,8), %xmm13, %xmm0
vmovsd %xmm0, (%rdx,%r9,8)
addq %rdi, %rdx
cmpl %r10d, %ecx
jne .L9
Judging by the number of ops, we cost the load + vector construction
too high:
(*xpqkl_156(D))[_56] 1 times scalar_load costs 12 in body
(*xpqkl_156(D))[_56] 1 times scalar_load costs 12 in body
(*xpqkl_156(D))[_56] 1 times vec_construct costs 12 in body
but of course there's extra latency due to it, and since we're nowhere
near compute bound this is likely what makes the vectorization bad, as
we are not reducing the number of load or store ops.
This would be something for overall accounting.
The effect of the patch is on the pessimization we had for
vec_to_scalar: vec_to_scalar is cost 4 and we costed that twice,
while vec_deconstruct for V2DF is also cost 4. So the former cost
was too big, and the overzealous scaling by 3 from
stmt_cost *= (GET_MODE_BITSIZE (TYPE_MODE (ls_type))
/ GET_MODE_BITSIZE (TYPE_MODE (ls_eltype)) + 1);
gets us to 12 vs former 24.
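
The arithmetic behind those two numbers can be spelled out as below.
This is only a sketch: it assumes ls_type is V2DF (128 bits) and
ls_eltype is DF (64 bits), and that the new costing counts
vec_deconstruct once where the old one counted vec_to_scalar twice.
The helper names are hypothetical, not GCC functions.

```c
#include <assert.h>

/* Scaling factor from the quoted expression:
   vector mode bits / element mode bits + 1. */
static int ls_scale(int vec_bits, int elt_bits)
{
  return vec_bits / elt_bits + 1;
}

/* Cost of 'count' statements of base cost 'stmt_cost' after scaling. */
static int scaled_cost(int count, int stmt_cost, int vec_bits, int elt_bits)
{
  return count * stmt_cost * ls_scale(vec_bits, elt_bits);
}

/* Reproduce the numbers quoted in the comment. */
static void check_quoted_numbers(void)
{
  assert(ls_scale(128, 64) == 3);            /* "scaling by 3" */
  assert(scaled_cost(2, 4, 128, 64) == 24);  /* former: 2 x vec_to_scalar */
  assert(scaled_cost(1, 4, 128, 64) == 12);  /* now: 1 x vec_deconstruct */
}
```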
But as said, the issue is that 'NOC' is small and the branch/layout
overhead hurts.