https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #6 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.647 μs (0.00% GC)
  median time:      20.860 μs (0.00% GC)
  mean time:        21.751 μs (0.00% GC)
  maximum time:     47.760 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Here is a benchmark (compiled with Flang) of my code exactly as written ("suboptimal") in the attachments:

@benchmark flangtest($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     658.329 ns (0.00% GC)
  median time:      668.012 ns (0.00% GC)
  mean time:        692.384 ns (0.00% GC)
  maximum time:     1.192 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     161

That is roughly 20 microseconds versus 670 nanoseconds. N was 1024, and the exact same data was used in both cases (but pre-transposed, so the transpose itself is not part of the benchmark). Benchmarking was done by compiling shared libraries and calling them via `ccall` with BenchmarkTools from Julia. As the reports indicate, the benchmark was run 10,000 times for gfortran and 1.61 million times for Flang to get accurate timings.

I compiled with (march=native is equivalent to march=skylake-avx512):

gfortran -Ofast -march=native -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o libgfortvectorization_test.so
flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.

The "suboptimal" code was close to 32 times faster than the "optimal" code.
I was expecting it to be closer to 16 times faster, given the vector width.

To go into more detail:

"Fortran lays out the memory for that array as
BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)
so you are accessing your memory with a stride of n in the expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient anyway, vectorization would not really help in this case."

Yes, each call to fpdbacksolve accesses memory across strides. But fpdbacksolve itself cannot be vectorized well at all. What does work, however, is vectorizing across loop iterations. For example, imagine calling fpdbacksolve on

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every scalar operation defined in fpdbacksolve on an entire SIMD vector of floats (that is, on 16 floats) at a time. That of course requires inlining fpdbacksolve (which was achieved with -fno-semantic-interposition, as the assembly shows) and recompiling it.

Perhaps another way to picture it: fpdbacksolve takes in 9 numbers (BPP(:,4) was unused) and returns 3 numbers. Because the operations within it aren't vectorizable, we want to vectorize it ACROSS loop iterations, not within them. To facilitate that, we have 9 vectors of contiguous inputs and 3 vectors of contiguous outputs. Now all of the first inputs are stored contiguously, as are all of the second inputs, and so on, allowing the inputs to be loaded efficiently into SIMD registers, and letting each loop iteration calculate [SIMD vector width] outputs at a time. Of course, it is inconvenient to handle a dozen separate vectors, so since they all have the same length we can simply concatenate them into one matrix.

I'll attach the assembly of both code examples as well. The assembly makes it clear that the "suboptimal" version was vectorized and the "optimal" version was not. The benchmarks make it resoundingly clear that the vectorized ("suboptimal") version was dramatically faster.
As it stands, this is a missed optimization, and gfortran falls severely behind the LLVM-based Flang on the highest-performance version of the code.