https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #6 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely
why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.647 μs (0.00% GC)
  median time:      20.860 μs (0.00% GC)
  mean time:        21.751 μs (0.00% GC)
  maximum time:     47.760 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1


Here is a benchmark (compiling with Flang) of my code, exactly as written
(suboptimal) in the attachments:

@benchmark flangtest($X32,  $BPP32,  $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     658.329 ns (0.00% GC)
  median time:      668.012 ns (0.00% GC)
  mean time:        692.384 ns (0.00% GC)
  maximum time:     1.192 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     161


That is 20 microseconds, vs 670 nanoseconds.

N was 1024, and the exact same data was used in both cases (pretransposed for
the gfortran version, so that the transpose itself is not part of the
benchmark).
Benchmarking was done by compiling each version into a shared library and
calling it from Julia with `ccall`, timed with BenchmarkTools. As the reports
indicate, the benchmark was run 10,000 times for gfortran and 1.61 million
times (10,000 samples x 161 evaluations each) for Flang, to get accurate
timings.

I compiled with (march=native is equivalent to march=skylake-avx512):
gfortran -Ofast -march=native -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o libgfortvectorization_test.so

flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.


The "suboptimal" code was close to 32 times faster than the "optimal" code.
I was expecting it to be closer to 16 times faster, given the vector width.


To go into more detail:

"
Fortran lays out the memory for that array as

BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)

so you are accessing your memory with a stride of n in the
expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient
anyway, vectorization would not really help in this case.
"

Yes, each call to fpdbacksolve accesses memory with a stride of N.
But fpdbacksolve itself cannot be vectorized well at all.

What does work, however, is vectorizing across loop iterations.

For example, imagine calling fpdbacksolve on this:

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every single scalar operation defined in fpdbacksolve on an
entire SIMD vector of floats (that is, on 16 floats) at a time.
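
To make that concrete, here is a rough sketch of one such 16-wide step. The
arithmetic below is a stand-in, not the actual body of fpdbacksolve (which is
in the attachments), and X simply stands for wherever the three results go;
only the access pattern matters:

! Hedged sketch: stand-in arithmetic, real memory-access pattern.
! Every array section below is 16 contiguous reals, i.e. a single
! 512-bit SIMD load or store with -mprefer-vector-width=512.
subroutine step16(X, BPP, i)
  implicit none
  real,    intent(inout) :: X(:,:)    ! outputs, one quantity per column
  real,    intent(in)    :: BPP(:,:)  ! inputs,  one quantity per column
  integer, intent(in)    :: i
  X(i:i+15,3) =  BPP(i:i+15,10) / BPP(i:i+15,1)
  X(i:i+15,2) = (BPP(i:i+15,9) - BPP(i:i+15,2)*X(i:i+15,3)) / BPP(i:i+15,1)
  X(i:i+15,1) = (BPP(i:i+15,8) - BPP(i:i+15,3)*X(i:i+15,3) &
                               - BPP(i:i+15,5)*X(i:i+15,2)) / BPP(i:i+15,1)
end subroutine step16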

That would of course require inlining fpdbacksolve (which
-fno-semantic-interposition allowed, as the assembly shows), so that its body
can be recompiled in vectorized form.

Perhaps another way to picture it: fpdbacksolve takes in 9 numbers (BPP(:,4)
was unused) and returns 3 numbers.
Because the operations within it aren't vectorizable, we want to vectorize it
ACROSS loop iterations, not within them.
To facilitate that, we have 9 contiguous input vectors and 3 contiguous
output vectors. All of the first inputs are stored contiguously, as are all
of the second inputs, and so on, which lets the inputs be loaded efficiently
into SIMD registers, so that each loop iteration computes [SIMD vector width]
outputs at a time.

Of course, it is inconvenient to handle a dozen separate vectors. Since they
all have the same length, we can simply concatenate them into the columns of
a single matrix, which is the memory layout I chose.
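
In case it helps, here is a minimal sketch of that loop structure. The kernel
body and the exact argument list are stand-ins; the real fpdbacksolve is in
the attached vectorization_test.f90:

! Hedged sketch of the loop structure only; the kernel arithmetic is a
! stand-in for the real fpdbacksolve.
module sketch
  implicit none
contains
  pure function fpdbacksolve_standin(b) result(x)
    real, intent(in) :: b(9)
    real :: x(3)
    ! a chain of dependent scalar operations: nothing to vectorize
    ! *within* a single call
    x(3) =  b(9) / b(1)
    x(2) = (b(8) - b(2)*x(3)) / b(1)
    x(1) = (b(7) - b(3)*x(3) - b(4)*x(2)) / b(1)
  end function fpdbacksolve_standin

  subroutine process(X, BPP, N)
    integer, intent(in)  :: N
    real,    intent(in)  :: BPP(N,10)  ! 9 used input vectors as columns (column 4 unused)
    real,    intent(out) :: X(N,3)     ! 3 output vectors as columns
    integer :: i
    do i = 1, N
      ! Consecutive iterations read and write consecutive elements of each
      ! column, so once the call is inlined the loop can run 16 iterations
      ! per 512-bit SIMD operation, as in the step16 sketch above.
      X(i,:) = fpdbacksolve_standin([BPP(i,1:3), BPP(i,5:10)])
    end do
  end subroutine process
end module sketch

That is the vectorization the "suboptimal" layout is designed to enable.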


I'll attach the assembly of both code examples as well.
The assembly makes it clear that the "suboptimal" way was vectorized, and the
"optimal" way was not.

The benchmarks make it resoundingly clear that the vectorized ("suboptimal")
version was dramatically faster.

As it stands, this is a missed optimization: on the highest-performance
version of this code, gfortran falls severely behind the LLVM-based Flang.
