https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #18 from Chris Elrod <elrodc at gmail dot com> ---
I can confirm that the inlined packing does allow gfortran to vectorize the loop, so allowing packing to inline does seem (to me) like an optimization well worth making. However, performance is about the same as before: still close to 2x slower than Flang. There is definitely something interesting going on in Flang's SLP vectorization, though.

I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

subroutine vpdbacksolve(Uix, x, S)
    real, dimension(VECTORWIDTH,3)             :: Uix
    real, dimension(VECTORWIDTH,3), intent(in) :: x
    real, dimension(VECTORWIDTH,6), intent(in) :: S

    real, dimension(VECTORWIDTH) :: U11, U12, U22, U13, U23, U33, &
                                    Ui11, Ui12, Ui22, Ui33

    U33 = sqrt(S(:,6))
    Ui33 = 1 / U33
    U13 = S(:,4) * Ui33
    U23 = S(:,5) * Ui33
    U22 = sqrt(S(:,3) - U23**2)
    Ui22 = 1 / U22
    U12 = (S(:,2) - U13*U23) * Ui22
    U11 = sqrt(S(:,1) - U12**2 - U13**2)
    Ui11 = 1 / U11                 ! u11
    Ui12 = - U12 * Ui11 * Ui22     ! u12

    Uix(:,3) = Ui33*x(:,3)
    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
    Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

end subroutine vpdbacksolve

in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling. I wanted to modify the Fortran file to benchmark these, but I'm pretty sure Flang cheated in the benchmarks.
So, compiling into a shared library and benchmarking from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.104 ns (0.00% GC)
  median time:      15.563 ns (0.00% GC)
  mean time:        16.017 ns (0.00% GC)
  maximum time:     49.524 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.394 ns (0.00% GC)
  median time:      24.562 ns (0.00% GC)
  mean time:        25.600 ns (0.00% GC)
  maximum time:     58.652 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

That is over 60% faster for Flang, which would account for much, but not all, of the runtime difference in the actual for loops.

For comparison, the vectorized loop in processbpp covers 16 samples per iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations. For the three gfortran benchmarks (which averaged 100,000 runs of the loop), each loop iteration therefore averaged about

1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) = 21.230246197916664

For Flang, that was:

1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) = 9.991520833333334

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang, respectively.

Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

here are a few things I notice.

1. gfortran always uses masked reciprocal square root operations, to make sure it only takes the reciprocal square root of nonzero inputs (the vcmpps predicate $4 is NEQ):

vxorps    %xmm5, %xmm5, %xmm5
...
vmovups   (%rsi,%rax), %zmm0
vmovups   0(%r13,%rax), %zmm9
vcmpps    $4, %zmm0, %zmm5, %k1
vrsqrt14ps %zmm0, %zmm1{%k1}{z}

This might be avx512f specific?
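The per-iteration averages above can be double-checked mechanically (a trivial sketch; the constants are the three quoted timings for each compiler):

```c
#include <assert.h>
#include <math.h>

/* Average ns per loop iteration: three timings averaged over
   3 runs * 64 iterations, scaled by 1000 as in the text. */
static double per_iter_ns(double t1, double t2, double t3)
{
    return 1000.0 * (t1 + t2 + t3) / (3 * 64);
}
```

This reproduces the ~21 ns (gfortran) vs ~10 ns (Flang) figures.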
Either way, Flang does not use masks:

vmovups    (%rcx,%r14), %zmm4
vrsqrt14ps %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact of this may be. Agner Fog's instruction tables, for example, don't mention mask arguments for vrsqrt14ps.

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps. There are 8 total, plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected memory accesses per loop iteration (fpdbacksolve's arguments are a vector of length 3 and another of length 6; it returns a vector of length 3).

gfortran's loop body has 3 unnecessary vmovaps, copying register contents.
gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register contents.
Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple of unnecessary memory accesses. Ouch!

It also moved data on and off static storage (.BSS4):

vmovaps %zmm2, .BSS4+192(%rip)
...
vmovaps %zmm5, .BSS4+320(%rip)
...
vmovaps .BSS4+192(%rip), %zmm5
...
# zmm5 is overwritten in here; I just mean to show the sort of stuff that goes on
vmulps  .BSS4+320(%rip), %zmm5, %zmm0

Some of those stores also never get used again, and some other things are just plain weird:

vxorps       %xmm3, %xmm3, %xmm3
vfnmsub231ps %zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3
vmovaps      %zmm3, .BSS4+576(%rip)

Like, why zero out zmm3 (writing xmm3 zeroes the upper bits of zmm3 too) just to subtract the zero? I verified that the answers are still correct.

I don't know that much about how compilers and loop vectorizers work, but I'm guessing that in the loop, Flang managed to verify lots of things that helped out the register allocator, and that without the loop context, it struggled.

gfortran's vpdbacksolve also did some stuff I don't understand:

vmulps %zmm1, %zmm2, %zmm2
vxorps .LC3(%rip), %zmm2, %zmm2
vmulps %zmm6, %zmm2, %zmm4

This happens in gfortran's loop too, except the load from .LC3(%rip) was hoisted out of the loop. gfortran definitely handled register allocation much better in just the function, although not as well in the loop.
Given that Flang's vpdbacksolve did the worst here, but was still >60% faster than gfortran's, I don't think we can attribute gfortran's worse performance to the extra moves.

3. Arithmetic instructions:

vaddps
  flang-loop-body:       0
  flang-vpdbacksolve:    0
  gfortran-loop-body:    6
  gfortran-vpdbacksolve: 6

vsubps
  flang-loop-body:       1
  flang-vpdbacksolve:    1
  gfortran-loop-body:    3
  gfortran-vpdbacksolve: 4

vmulps
  flang-loop-body:       20
  flang-vpdbacksolve:    18
  gfortran-loop-body:    27
  gfortran-vpdbacksolve: 29

Total unfused operations:
  flang-loop-body:       21
  flang-vpdbacksolve:    19
  gfortran-loop-body:    30
  gfortran-vpdbacksolve: 33

vfmadd
  flang-loop-body:       5
  flang-vpdbacksolve:    6
  gfortran-loop-body:    2
  gfortran-vpdbacksolve: 3

vfnmadd
  flang-loop-body:       2
  flang-vpdbacksolve:    4
  gfortran-loop-body:    6
  gfortran-vpdbacksolve: 2

vfmsub
  flang-loop-body:       3
  flang-vpdbacksolve:    0
  gfortran-loop-body:    0
  gfortran-vpdbacksolve: 2

vfnmsub
  flang-loop-body:       0
  flang-vpdbacksolve:    1
  gfortran-loop-body:    0
  gfortran-vpdbacksolve: 0

Total fused operations:
  flang-loop-body:       10
  flang-vpdbacksolve:    11
  gfortran-loop-body:    8
  gfortran-vpdbacksolve: 7

Total arithmetic operations:
  flang-loop-body:       31
  flang-vpdbacksolve:    30
  gfortran-loop-body:    38
  gfortran-vpdbacksolve: 40

So gfortran's versions had more arithmetic instructions overall (but fewer fused operations), though definitely not by a factor approaching the degree to which they were slower.