[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #10 from Chris Elrod ---

(In reply to Thomas Koenig from comment #9)
> Hm.
>
> It would help if your benchmark was complete, so I could run it.

I don't suppose you happen to have Julia and are familiar with it? If you are (or someone else here is), I'll attach the code to generate the fake data (the most important point is that columns 5:10 of BPP are the upper triangle of a 3x3 symmetric positive definite matrix). I have also already written a manually unrolled version that gfortran likes. But I could write Fortran code to create an executable and run benchmarks. What are best practices? system_clock?

(In reply to Thomas Koenig from comment #9)
> However, what happens if you put in
>
> real, dimension(:) :: Uix
> real, dimension(:), intent(in) :: x
> real, dimension(:), intent(in) :: S
>
> ?
>
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it wouldn't let me use a deferred shape for Uix -- probably because it needs to compile a version of fpdbacksolve that can be called from the shared library? Interestingly, with that change, Flang failed to vectorize the code, but gfortran did. Compilers are finicky.

Flang, original:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     655.827 ns (0.00% GC)
  median time:      665.698 ns (0.00% GC)
  mean time:        689.967 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

Flang, not specifying shape (assembly shows it is using xmm):

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.086 μs (0.00% GC)
  median time:      8.315 μs (0.00% GC)
  mean time:        8.591 μs (0.00% GC)
  maximum time:     20.299 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

gfortran, transposed version (not vectorizable):

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.643 μs (0.00% GC)
  median time:      20.901 μs (0.00% GC)
  mean time:        21.441 μs (0.00% GC)
  maximum time:     54.103 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

gfortran, not specifying shape:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.290 μs (0.00% GC)
  median time:      1.316 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     4.562 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Assembly confirms it is using zmm registers (but this time is much too fast not to be vectorized, anyway).
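Regarding the system_clock question above: a minimal pure-Fortran timing harness might look like the following sketch (illustration only, not code from this report; `work` is a hypothetical stand-in for the routine under test, and this reports a wall-clock mean rather than BenchmarkTools-style minima):

    program bench
        implicit none
        integer, parameter :: N = 1024, reps = 10000
        real :: BPP(10,N), X(3,N)
        integer(kind=8) :: t0, t1, rate
        integer :: r

        call random_number(BPP)
        call system_clock(t0, rate)     ! integer(8) args give fine resolution
        do r = 1, reps
            call work(X, BPP, N)
        end do
        call system_clock(t1)
        print '(a,f10.1,a)', 'mean time per call: ', &
            1.0e9 * real(t1 - t0) / (real(rate) * reps), ' ns'
        print *, X(1,1)   ! use the result so the loop is not optimized away
    contains
        subroutine work(X, BPP, N)      ! hypothetical stand-in for processBPP
            integer, intent(in) :: N
            real, intent(out) :: X(3,N)
            real, intent(in)  :: BPP(10,N)
            X = BPP(1:3,:)              ! placeholder body
        end subroutine work
    end program bench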
For why gfortran is still slower than the Flang version, here is the loop body:

.L16:
        vmovups (%r10,%rax), %zmm0
        vcmpps  $4, %zmm0, %zmm4, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm5, %zmm2, %zmm2
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm8
        vmulps  %zmm0, %zmm8, %zmm0
        vmulps  %zmm0, %zmm8, %zmm0
        vaddps  %zmm8, %zmm8, %zmm8
        vsubps  %zmm0, %zmm8, %zmm8
        vmulps  (%r8,%rax), %zmm8, %zmm9
        vmulps  (%r9,%rax), %zmm8, %zmm10
        vmulps  (%r12,%rax), %zmm8, %zmm8
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    0(%r13,%rax), %zmm9, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm2{%k1}{z}
        vmulps  %zmm3, %zmm2, %zmm3
        vmulps  %zmm2, %zmm3, %zmm1
        vmulps  %zmm5, %zmm3, %zmm3
        vaddps  %zmm6, %zmm1, %zmm1
        vmulps  %zmm3, %zmm1, %zmm1
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    (%rdx,%rax), %zmm10, %zmm3
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm11
        vmulps  %zmm11, %zmm3, %zmm12
        vmovaps %zmm10, %zmm3
        vfnmadd213ps    (%r14,%rax), %zmm10, %zmm3
        vfnmadd231ps    %zmm12, %zmm12, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm1{%k1}{z}
        vmulps  %zmm3, %zmm1, %zmm3
        vmulps  %zmm1, %zmm3, %zmm0
        vmulps  %zmm5, %zmm3, %zmm3
        vmovups (%rcx,%rax), %zmm1
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm3, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm2
        vmulps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm2, %zmm0
        vaddps  %zmm2, %zmm2, %zmm2
        vsubps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm11, %zmm3
        vmulps  %zmm12, %zmm3, %zmm3
        vxorps  %zmm7, %zmm3, %zmm3
        vmulps  %zmm1, %zmm3, %zmm2
        vmulps  %zmm3, %zmm9, %zmm3
        vfnmadd231ps    %zmm8, %zmm9, %zmm1
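One possible reading of those instruction sequences (my interpretation, not part of the original report): each vrsqrt14ps is followed by a Newton-Raphson step refining the approximate reciprocal square root, and each vrcp14ps by a Newton step refining the approximate reciprocal. In scalar Fortran the two refinements look like this sketch, assuming zmm5/zmm6 hold the usual -0.5 and -3.0 constants:

    program newton_refine
        implicit none
        real :: x, y, d, r
        x = 2.0
        y = 1.0/sqrt(x) * 1.001       ! stand-in for the vrsqrt14ps estimate
        y = 0.5 * y * (3.0 - x*y*y)   ! one Newton step: y ~ 1/sqrt(x)
        d = 3.0
        r = (1.0/d) * 1.001           ! stand-in for the vrcp14ps estimate
        r = r * (2.0 - d*r)           ! one Newton step: r ~ 1/d
        print *, y, 1.0/sqrt(x), r, 1.0/d
    end program newton_refine

If that reading is right, the gfortran loop forms sqrt(x) via rsqrt-plus-refinement and then inverts it with a separate rcp-plus-refinement, whereas Flang's loop (comment #7) keeps the refined reciprocal square root and multiplies by it directly; the extra reciprocal sequences per iteration would account for part of the remaining gap.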
--- Comment #9 from Thomas Koenig ---

Hm.

It would help if your benchmark was complete, so I could run it.

However, what happens if you put in

    real, dimension(:) :: Uix
    real, dimension(:), intent(in) :: x
    real, dimension(:), intent(in) :: S

?

gfortran should not pack then.
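To make the difference concrete (a minimal sketch of my own, not code from this report): with an explicit-shape dummy such as dimension(3), the argument must be contiguous, so passing a strided section forces a copy through _gfortran_internal_pack at the call site; an assumed-shape dummy dimension(:) instead receives a descriptor that carries the stride, so no temporary is needed.

    module pack_demo
        implicit none
    contains
        ! Explicit-shape dummy: the actual argument must be contiguous,
        ! so a strided section is copied in (the pack this PR is about).
        pure function sum3_explicit(v) result(s)
            real, intent(in) :: v(3)
            real :: s
            s = v(1) + v(2) + v(3)
        end function sum3_explicit

        ! Assumed-shape dummy: a descriptor carrying the stride is
        ! passed, so no temporary copy is required.
        pure function sum3_assumed(v) result(s)
            real, intent(in) :: v(:)
            real :: s
            s = v(1) + v(2) + v(3)
        end function sum3_assumed
    end module pack_demo

    program demo
        use pack_demo
        implicit none
        real :: BPP(4,10)
        call random_number(BPP)
        ! BPP(2,1:3) has stride 4: packed for sum3_explicit,
        ! passed by descriptor (no copy) to sum3_assumed.
        print *, sum3_explicit(BPP(2,1:3)), sum3_assumed(BPP(2,1:3))
    end program demo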
--- Comment #8 from Chris Elrod ---

Created attachment 45358
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran-compiled assembly for the transposed version of the original code.

Here is the assembly for the loop body of the transposed version of the code, compiled by gfortran:

.L8:
        vmovss  36(%rsi), %xmm0
        addq    $40, %rsi
        vrsqrtss        %xmm0, %xmm2, %xmm2
        addq    $12, %rdi
        vmulss  %xmm0, %xmm2, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  %xmm7, %xmm2, %xmm2
        vaddss  %xmm8, %xmm0, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  -8(%rsi), %xmm0, %xmm5
        vmulss  -12(%rsi), %xmm0, %xmm4
        vmulss  -32(%rsi), %xmm0, %xmm0
        vmovaps %xmm5, %xmm3
        vfnmadd213ss    -16(%rsi), %xmm5, %xmm3
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -20(%rsi), %xmm5, %xmm2
        vmovss  %xmm0, -4(%rdi)
        vrsqrtss        %xmm3, %xmm1, %xmm1
        vmulss  %xmm3, %xmm1, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm7, %xmm1, %xmm1
        vaddss  %xmm8, %xmm3, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm3, %xmm2, %xmm6
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -24(%rsi), %xmm4, %xmm2
        vfnmadd231ss    %xmm6, %xmm6, %xmm2
        vrsqrtss        %xmm2, %xmm10, %xmm10
        vmulss  %xmm2, %xmm10, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm7, %xmm10, %xmm10
        vaddss  %xmm8, %xmm1, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm1, %xmm3, %xmm2
        vmulss  %xmm6, %xmm2, %xmm2
        vmovss  -36(%rsi), %xmm6
        vxorps  %xmm9, %xmm2, %xmm2
        vmulss  %xmm6, %xmm2, %xmm10
        vmulss  %xmm2, %xmm5, %xmm2
        vfmadd231ss     -40(%rsi), %xmm1, %xmm10
        vfmadd132ss     %xmm4, %xmm2, %xmm1
        vfnmadd132ss    %xmm0, %xmm10, %xmm1
        vmulss  %xmm0, %xmm5, %xmm0
        vmovss  %xmm1, -12(%rdi)
        vsubss  %xmm0, %xmm6, %xmm0
        vmulss  %xmm3, %xmm0, %xmm3
        vmovss  %xmm3, -8(%rdi)
        cmpq    %rsi, %rax
        jne     .L8

While Flang had a second loop of scalar code (to catch the N mod [SIMD vector width] remainder of the vectorized loop), there are no secondary loops in the gfortran code, so these must all be scalar operations (I have a hard time telling apart SSE from scalar code...). It performs much the same operations as Flang's vectorized loop, except on a single number at a time.

The layout matters because, to get efficient vectorization, we need corresponding elements to be contiguous (i.e., all the input1s together, all the input2s together). We get no benefit from having the different elements of the same index contiguous (the first input1 next to the first input2, next to the first input3...). The memory layout I used is performance-optimal, but it is something gfortran unfortunately often cannot handle automatically (without manual unrolling). This is why I filed this report.
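For readers unfamiliar with the "second loop" structure mentioned above: a vectorizer typically splits a loop of length N into a full-width main loop plus a scalar tail for the N mod W leftovers. A hand-written equivalent of that shape (purely illustrative, my own sketch):

    program remainder_demo
        implicit none
        integer, parameter :: N = 1000, W = 16   ! W = floats per zmm register
        real :: a(N), b(N)
        integer :: i, nvec
        call random_number(a)
        nvec = N - mod(N, W)    ! largest multiple of W not exceeding N
        do i = 1, nvec          ! main loop: trip count divisible by W,
            b(i) = 2.0*a(i)     ! eligible for full-width SIMD code
        end do
        do i = nvec + 1, N      ! remainder loop: at most W-1 iterations,
            b(i) = 2.0*a(i)     ! emitted as scalar code
        end do
        print *, b(1), b(N)
    end program remainder_demo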
--- Comment #7 from Chris Elrod ---

Created attachment 45357
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit
Assembly generated by the Flang compiler on the original version of the code.

This is the main loop body in the Flang-compiled version of the original code (starting at line 132):

.LBB1_8:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        leaq    (%rsi,%rbx,4), %r12
        vmovups (%rcx,%r12), %zmm2
        addq    %rcx, %r12
        leaq    (%r12,%rcx), %rbp
        vmovups (%r11,%rbp), %zmm3
        addq    %r11, %rbp
        leaq    (%rcx,%rbp), %r13
        leaq    (%rcx,%r13), %r8
        leaq    (%r8,%rcx), %r10
        leaq    (%r10,%rcx), %r14
        vmovups (%rcx,%r14), %zmm4
        vrsqrt14ps      %zmm4, %zmm5
        vmulps  %zmm5, %zmm4, %zmm4
        vfmadd213ps     %zmm0, %zmm5, %zmm4     # zmm4 = (zmm5 * zmm4) + zmm0
        vmulps  %zmm1, %zmm5, %zmm5
        vmulps  %zmm4, %zmm5, %zmm4
.Ltmp1:
        .loc    1 31 1 is_stmt 1        # vectorization_test.f90:31:1
        vmulps  (%rcx,%r8), %zmm4, %zmm5
        .loc    1 32 1                  # vectorization_test.f90:32:1
        vmulps  (%rcx,%r10), %zmm4, %zmm6
        vmovups (%rcx,%r13), %zmm7
        .loc    1 33 1                  # vectorization_test.f90:33:1
        vfnmadd231ps    %zmm6, %zmm6, %zmm7     # zmm7 = -(zmm6 * zmm6) + zmm7
        vrsqrt14ps      %zmm7, %zmm8
        vmulps  %zmm8, %zmm7, %zmm7
        vfmadd213ps     %zmm0, %zmm8, %zmm7     # zmm7 = (zmm8 * zmm7) + zmm0
        vmulps  %zmm1, %zmm8, %zmm8
        vmulps  %zmm7, %zmm8, %zmm7
        vmovups (%rcx,%rbp), %zmm8
        .loc    1 35 1                  # vectorization_test.f90:35:1
        vfnmadd231ps    %zmm5, %zmm6, %zmm8     # zmm8 = -(zmm6 * zmm5) + zmm8
        vmulps  %zmm8, %zmm7, %zmm8
        vmulps  %zmm5, %zmm5, %zmm9
        vfmadd231ps     %zmm8, %zmm8, %zmm9     # zmm9 = (zmm8 * zmm8) + zmm9
        vsubps  %zmm9, %zmm3, %zmm3
        vrsqrt14ps      %zmm3, %zmm9
        vmulps  %zmm9, %zmm3, %zmm3
        vfmadd213ps     %zmm0, %zmm9, %zmm3     # zmm3 = (zmm9 * zmm3) + zmm0
        vmulps  %zmm1, %zmm9, %zmm9
        vmulps  %zmm3, %zmm9, %zmm3
        .loc    1 39 1                  # vectorization_test.f90:39:1
        vmulps  %zmm8, %zmm7, %zmm8
        .loc    1 40 1                  # vectorization_test.f90:40:1
        vmulps  (%rcx,%r12), %zmm4, %zmm4
        .loc    1 39 1                  # vectorization_test.f90:39:1
        vmulps  %zmm3, %zmm8, %zmm8
        .loc    1 41 1                  # vectorization_test.f90:41:1
        vmulps  %zmm8, %zmm2, %zmm9
        vfmsub231ps     (%rsi,%rbx,4), %zmm3, %zmm9     # zmm9 = (zmm3 * mem) - zmm9
        vmulps  %zmm5, %zmm3, %zmm3
        vfmsub231ps     %zmm8, %zmm6, %zmm3     # zmm3 = (zmm6 * zmm8) - zmm3
        vfmadd213ps     %zmm9, %zmm4, %zmm3     # zmm3 = (zmm4 * zmm3) + zmm9
        .loc    1 42 1                  # vectorization_test.f90:42:1
        vmulps  %zmm4, %zmm6, %zmm5
        vmulps  %zmm5, %zmm7, %zmm5
        vfmsub231ps     %zmm7, %zmm2, %zmm5     # zmm5 = (zmm2 * zmm7) - zmm5
.Ltmp2:
        .loc    1 15 1                  # vectorization_test.f90:15:1
        vmovups %zmm3, (%rdi,%rbx,4)
        movq    -16(%rsp), %rbp         # 8-byte Reload
        vmovups %zmm5, (%rbp,%rbx,4)
        vmovups %zmm4, (%rax,%rbx,4)
        addq    $16, %rbx
        cmpq    %rbx, %rdx
        jne     .LBB1_8

zmm registers are 64 bytes wide. The loop vmovups data from memory into registers, performs a series of arithmetic and inverse square root operations on them, and then vmovups three of those 64-byte registers back into memory. That is the most efficient memory access pattern (as demonstrated empirically via benchmarks).
--- Comment #6 from Chris Elrod ---

Created attachment 45356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.647 μs (0.00% GC)
  median time:      20.860 μs (0.00% GC)
  mean time:        21.751 μs (0.00% GC)
  maximum time:     47.760 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Here is a benchmark (compiling with Flang) of my code, exactly as written (suboptimal) in the attachments:

@benchmark flangtest($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     658.329 ns (0.00% GC)
  median time:      668.012 ns (0.00% GC)
  mean time:        692.384 ns (0.00% GC)
  maximum time:     1.192 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     161

That is 20 microseconds versus 670 nanoseconds. N was 1024, and the exact same data were used in both cases (but pre-transposed, so I do not benchmark the transposing itself). Benchmarking was done by compiling shared libraries and using `ccall` and BenchmarkTools from Julia. As indicated by the reports, the benchmark was run 10,000 times for gfortran and 1.61 million times for Flang, to get accurate timings.

I compiled with (march=native is equivalent to march=skylake-avx512 here):

gfortran -Ofast -march=native -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o libgfortvectorization_test.so

flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.

The "suboptimal" code was close to 32 times faster than the "optimal" code. I was expecting it to be closer to 16 times faster, given the vector width.

To go into more detail:

"Fortran lays out the memory for that array as BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2) so you are accessing your memory with a stride of n in the expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient anyway, vectorization would not really help in this case."

Yes, each call to fpdbacksolve accesses memory across strides. But fpdbacksolve itself cannot be vectorized well at all. What does work, however, is vectorizing across loop iterations. For example, imagine calling fpdbacksolve on

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every single scalar operation defined in fpdbacksolve on an entire SIMD vector of floats (that is, on 16 floats) at a time. That of course requires inlining fpdbacksolve (which was achieved with -fno-semantic-interposition, as the assembly shows) and recompiling it.

Perhaps another way to picture it: fpdbacksolve takes in 9 numbers (BPP(:,4) was unused) and returns 3 numbers. Because operations within it aren't vectorizable, we want to vectorize it ACROSS loop iterations, not within them. So, to facilitate that, we have 9 vectors of contiguous inputs and 3 vectors of contiguous outputs. Now all the input1s are stored contiguously, as are all the input2s, etc., allowing the inputs to be loaded efficiently into SIMD registers, and each loop iteration to calculate [SIMD vector width] outputs at a time.

Of course, it is inconvenient to handle a dozen separate vectors.
So if they all have the same length, we can just concatenate them together.

I'll attach the assembly of both code examples as well. The assembly makes it clear that the "suboptimal" version was vectorized and the "optimal" version was not. The benchmarks make it resoundingly clear that the vectorized ("suboptimal") version was dramatically faster.

As is, this is a missed optimization, and gfortran falls severely behind the LLVM-based Flang on the highest-performance version of this code.
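A sketch of the layout being described (my own illustration with placeholder arithmetic, not the real fpdbacksolve): problem i occupies row i, so for a fixed column k the values BPP(i,k) are unit stride across iterations, and the compiler can evaluate the scalar body for 16 values of i per zmm instruction.

    module soa_demo
        implicit none
    contains
        subroutine process(X, BPP, N)
            integer, intent(in) :: N
            real, intent(out) :: X(N,3)
            real, intent(in)  :: BPP(N,10)
            integer :: i
            do concurrent (i = 1:N)
                ! placeholder body, NOT the real solve: the point is that
                ! every load and store below is unit stride in i
                X(i,1) = BPP(i,1) / sqrt(BPP(i,10))
                X(i,2) = BPP(i,2) / sqrt(BPP(i,10))
                X(i,3) = BPP(i,3) / sqrt(BPP(i,10))
            end do
        end subroutine process
    end module soa_demo

    program run
        use soa_demo
        implicit none
        integer, parameter :: N = 1024
        real :: X(N,3), BPP(N,10)
        call random_number(BPP)
        BPP(:,10) = BPP(:,10) + 1.0   ! keep the sqrt argument positive
        call process(X, BPP, N)
        print *, X(1,1)
    end program run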
Thomas Koenig changed:

           What      |Removed     |Added
--------------------------------------------------
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |WONTFIX

--- Comment #5 from Thomas Koenig ---

An additional point. The called routine has

    pure function fpdbacksolve(x, S) result(Uix)
        real, dimension(3) :: Uix

and so on. This _requires_ Uix to be contiguous in memory, so we need to call the internal pack routine. This is something we cannot change easily.

So, the workaround: use the appropriate memory layout in your Fortran programs.
Thomas Koenig changed:

           What      |Removed |Added
--------------------------------------------------
           CC        |        |tkoenig at gcc dot gnu.org
           Blocks    |        |36854
           Severity  |normal  |enhancement

--- Comment #4 from Thomas Koenig ---

First, a few remarks on the code: it is written in a suboptimal way regarding how Fortran lays out its memory. The code does

    real, dimension(N,3), intent(out) :: X
    real, dimension(N,10), intent(in) :: BPP

and then

    do concurrent (i = 1:N)
        X(i,:) = fpdbacksolve(BPP(i,1:3), BPP(i,5:10))
    end do

The problem is that BPP(i,1:3) is not contiguous in memory. Fortran lays out the memory for that array as

    BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)

so you are accessing your memory with a stride of N in the expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient anyway; vectorization would not really help in this case.

So, if you change your code to

    real, dimension(3,N), intent(out) :: X
    real, dimension(10,N), intent(in) :: BPP

    do concurrent (i = 1:N)
        X(:,i) = fpdbacksolve(BPP(1:3,i), BPP(5:10,i))
    end do

then processBPP will be inlined completely, and memory accesses will be contiguous.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36854
[Bug 36854] [meta-bug] fortran front-end optimization
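A small self-contained demonstration of the column-major storage order discussed in comment #4 (my example, not from the report): the values encode (i,j), so the flattened printout shows the storage sequence, and the row section BPP(2,1:3) visibly skips N elements between its members.

    program layout
        implicit none
        integer, parameter :: N = 4
        real :: BPP(N,3)
        integer :: i, j
        do j = 1, 3
            do i = 1, N
                BPP(i,j) = real(10*i + j)   ! encode (i,j) in the value
            end do
        end do
        ! storage order is column-major: 11 21 31 41 12 22 32 42 13 23 33 43
        print '(12f5.0)', reshape(BPP, [12])
        ! the row section BPP(2,1:3) is strided (stride N in memory):
        print '(3f5.0)', BPP(2,1:3)
    end program layout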
--- Comment #3 from Chris Elrod ---

Created attachment 45353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit
g++ assembly output
--- Comment #2 from Chris Elrod ---

Created attachment 45352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit
gfortran assembly output
--- Comment #1 from Chris Elrod ---

Created attachment 45351
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit
C++ version of the vectorization test case.