[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #10 from Chris Elrod  ---
(In reply to Thomas Koenig from comment #9)
> Hm.
> 
> It would help if your benchmark was complete, so I could run it.
> 

I don't suppose you happen to have Julia installed and are familiar with it? If
you are (or someone else here is), I'll attach the code to generate the fake
data (the most important point is that columns 5:10 of BPP hold the upper
triangle of a 3x3 symmetric positive definite matrix).

I have also already written a manually unrolled version that gfortran likes.

But I could write Fortran code to create an executable and run benchmarks.
What are best practices? system_clock?

(In reply to Thomas Koenig from comment #9)
> 
> However, what happens if you put in
> 
> real, dimension(:) ::  Uix
> real, dimension(:), intent(in)  ::  x
> real, dimension(:), intent(in)  ::  S
> 
> ?
> 
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it wouldn't let me
defer the shape of Uix, probably because it needs to compile a version of
fpdbacksolve that can be called from the shared library?

Interestingly, with that change, Flang failed to vectorize the code, but
gfortran did. Compilers are finicky.

Flang, original:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     655.827 ns (0.00% GC)
  median time:      665.698 ns (0.00% GC)
  mean time:        689.967 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

Flang, not specifying shape: # assembly shows it is using xmm

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.086 μs (0.00% GC)
  median time:      8.315 μs (0.00% GC)
  mean time:        8.591 μs (0.00% GC)
  maximum time:     20.299 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

gfortran, transposed version (not vectorizable): 

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.643 μs (0.00% GC)
  median time:      20.901 μs (0.00% GC)
  mean time:        21.441 μs (0.00% GC)
  maximum time:     54.103 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

gfortran, not specifying shape:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.290 μs (0.00% GC)
  median time:      1.316 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     4.562 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10


Assembly confirms it is using zmm registers (but this time is much too fast not
to be vectorized, anyway).


As for why gfortran is still slower than the Flang version, here is the loop body:

.L16:
vmovups (%r10,%rax), %zmm0
vcmpps  $4, %zmm0, %zmm4, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm5, %zmm2, %zmm2
vaddps  %zmm6, %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm8
vmulps  %zmm0, %zmm8, %zmm0
vmulps  %zmm0, %zmm8, %zmm0
vaddps  %zmm8, %zmm8, %zmm8
vsubps  %zmm0, %zmm8, %zmm8
vmulps  (%r8,%rax), %zmm8, %zmm9
vmulps  (%r9,%rax), %zmm8, %zmm10
vmulps  (%r12,%rax), %zmm8, %zmm8
vmovaps %zmm9, %zmm3
vfnmadd213ps    0(%r13,%rax), %zmm9, %zmm3
vcmpps  $4, %zmm3, %zmm4, %k1
vrsqrt14ps  %zmm3, %zmm2{%k1}{z}
vmulps  %zmm3, %zmm2, %zmm3
vmulps  %zmm2, %zmm3, %zmm1
vmulps  %zmm5, %zmm3, %zmm3
vaddps  %zmm6, %zmm1, %zmm1
vmulps  %zmm3, %zmm1, %zmm1
vmovaps %zmm9, %zmm3
vfnmadd213ps    (%rdx,%rax), %zmm10, %zmm3
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm11
vmulps  %zmm11, %zmm3, %zmm12
vmovaps %zmm10, %zmm3
vfnmadd213ps    (%r14,%rax), %zmm10, %zmm3
vfnmadd231ps    %zmm12, %zmm12, %zmm3
vcmpps  $4, %zmm3, %zmm4, %k1
vrsqrt14ps  %zmm3, %zmm1{%k1}{z}
vmulps  %zmm3, %zmm1, %zmm3
vmulps  %zmm1, %zmm3, %zmm0
vmulps  %zmm5, %zmm3, %zmm3
vmovups (%rcx,%rax), %zmm1
vaddps  %zmm6, %zmm0, %zmm0
vmulps  %zmm3, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm2
vmulps  %zmm0, %zmm2, %zmm0
vmulps  %zmm0, %zmm2, %zmm0
vaddps  %zmm2, %zmm2, %zmm2
vsubps  %zmm0, %zmm2, %zmm0
vmulps  %zmm0, %zmm11, %zmm3
vmulps  %zmm12, %zmm3, %zmm3
vxorps  %zmm7, %zmm3, %zmm3
vmulps  %zmm1, %zmm3, %zmm2
vmulps  %zmm3, %zmm9, %zmm3
vfnmadd231ps    %zmm8, %zmm9, %zmm1

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #9 from Thomas Koenig  ---
Hm.

It would help if your benchmark was complete, so I could run it.

However, what happens if you put in

real, dimension(:) ::  Uix
real, dimension(:), intent(in)  ::  x
real, dimension(:), intent(in)  ::  S

?

gfortran should not pack then.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #8 from Chris Elrod  ---
Created attachment 45358
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran-compiled assembly for the transposed version of the original code.

Here is the assembly for the loop body of the transposed version of the code,
compiled by gfortran:


.L8:
vmovss  36(%rsi), %xmm0
addq    $40, %rsi
vrsqrtss    %xmm0, %xmm2, %xmm2
addq    $12, %rdi
vmulss  %xmm0, %xmm2, %xmm0
vmulss  %xmm2, %xmm0, %xmm0
vmulss  %xmm7, %xmm2, %xmm2
vaddss  %xmm8, %xmm0, %xmm0
vmulss  %xmm2, %xmm0, %xmm0
vmulss  -8(%rsi), %xmm0, %xmm5
vmulss  -12(%rsi), %xmm0, %xmm4
vmulss  -32(%rsi), %xmm0, %xmm0
vmovaps %xmm5, %xmm3
vfnmadd213ss    -16(%rsi), %xmm5, %xmm3
vmovaps %xmm4, %xmm2
vfnmadd213ss    -20(%rsi), %xmm5, %xmm2
vmovss  %xmm0, -4(%rdi)
vrsqrtss    %xmm3, %xmm1, %xmm1
vmulss  %xmm3, %xmm1, %xmm3
vmulss  %xmm1, %xmm3, %xmm3
vmulss  %xmm7, %xmm1, %xmm1
vaddss  %xmm8, %xmm3, %xmm3
vmulss  %xmm1, %xmm3, %xmm3
vmulss  %xmm3, %xmm2, %xmm6
vmovaps %xmm4, %xmm2
vfnmadd213ss    -24(%rsi), %xmm4, %xmm2
vfnmadd231ss    %xmm6, %xmm6, %xmm2
vrsqrtss    %xmm2, %xmm10, %xmm10
vmulss  %xmm2, %xmm10, %xmm1
vmulss  %xmm10, %xmm1, %xmm1
vmulss  %xmm7, %xmm10, %xmm10
vaddss  %xmm8, %xmm1, %xmm1
vmulss  %xmm10, %xmm1, %xmm1
vmulss  %xmm1, %xmm3, %xmm2
vmulss  %xmm6, %xmm2, %xmm2
vmovss  -36(%rsi), %xmm6
vxorps  %xmm9, %xmm2, %xmm2
vmulss  %xmm6, %xmm2, %xmm10
vmulss  %xmm2, %xmm5, %xmm2
vfmadd231ss -40(%rsi), %xmm1, %xmm10
vfmadd132ss %xmm4, %xmm2, %xmm1
vfnmadd132ss    %xmm0, %xmm10, %xmm1
vmulss  %xmm0, %xmm5, %xmm0
vmovss  %xmm1, -12(%rdi)
vsubss  %xmm0, %xmm6, %xmm0
vmulss  %xmm3, %xmm0, %xmm3
vmovss  %xmm3, -8(%rdi)
cmpq    %rsi, %rax
jne .L8


While Flang had a second loop of scalar code (to catch the N mod [SIMD vector
width] remainder of the vectorized loop), there are no secondary loops in the
gfortran code, meaning these must all be scalar operations (I have a hard time
telling SSE apart from scalar code...).

It looks similar, in the operations it performs, to Flang's vectorized loop,
except that it operates on a single number at a time.
To get efficient vectorization, we need corresponding elements to be
contiguous (i.e., all the input1s together, all the input2s together).
We gain nothing from having the different elements with the same index (the
first input1 next to the first input2, next to the first input3, ...) be
contiguous.
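This contrast can be sketched in C (the field names a, b, c are hypothetical stand-ins for the 9 inputs): with an array-of-structs layout each field is strided, while a struct-of-arrays layout gives the unit-stride loads the vectorizer wants.

```c
#include <stddef.h>

#define N 1024

/* AoS: the layout the loop effectively sees with BPP(i,1:3) --
   consecutive iterations touch memory with a stride. */
struct point_aos { float a, b, c; };

void scale_aos(const struct point_aos *p, float s, float *out) {
    for (size_t i = 0; i < N; i++)
        out[i] = p[i].a * s;  /* loads 12 bytes apart: hard to vectorize well */
}

/* SoA: all a's contiguous, all b's contiguous, ... -- unit-stride loads,
   so the loop maps directly onto SIMD registers. */
struct points_soa { float a[N], b[N], c[N]; };

void scale_soa(const struct points_soa *p, float s, float *out) {
    for (size_t i = 0; i < N; i++)
        out[i] = p->a[i] * s; /* consecutive 4-byte loads: trivially vectorizable */
}
```

Both functions compute the same result; only the memory access pattern differs.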


The memory layout I used is performance-optimal, but is something that gfortran
unfortunately often cannot handle automatically (without manual unrolling).
This is why I filed a report on bugzilla.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #7 from Chris Elrod  ---
Created attachment 45357
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit
Assembly generated by Flang compiler on the original version of the code.

This is the main loop body in the Flang-compiled version of the original code
(starting at line 132):

.LBB1_8:                        # %vector.body
                                # =>This Inner Loop Header: Depth=1
leaq    (%rsi,%rbx,4), %r12
vmovups (%rcx,%r12), %zmm2
addq    %rcx, %r12
leaq    (%r12,%rcx), %rbp
vmovups (%r11,%rbp), %zmm3
addq    %r11, %rbp
leaq    (%rcx,%rbp), %r13
leaq    (%rcx,%r13), %r8
leaq    (%r8,%rcx), %r10
leaq    (%r10,%rcx), %r14
vmovups (%rcx,%r14), %zmm4
vrsqrt14ps  %zmm4, %zmm5
vmulps  %zmm5, %zmm4, %zmm4
vfmadd213ps %zmm0, %zmm5, %zmm4 # zmm4 = (zmm5 * zmm4) + zmm0
vmulps  %zmm1, %zmm5, %zmm5
vmulps  %zmm4, %zmm5, %zmm4
.Ltmp1:
.loc    1 31 1 is_stmt 1        # vectorization_test.f90:31:1
vmulps  (%rcx,%r8), %zmm4, %zmm5
.loc    1 32 1                  # vectorization_test.f90:32:1
vmulps  (%rcx,%r10), %zmm4, %zmm6
vmovups (%rcx,%r13), %zmm7
.loc    1 33 1                  # vectorization_test.f90:33:1
vfnmadd231ps    %zmm6, %zmm6, %zmm7 # zmm7 = -(zmm6 * zmm6) + zmm7
vrsqrt14ps  %zmm7, %zmm8
vmulps  %zmm8, %zmm7, %zmm7
vfmadd213ps %zmm0, %zmm8, %zmm7 # zmm7 = (zmm8 * zmm7) + zmm0
vmulps  %zmm1, %zmm8, %zmm8
vmulps  %zmm7, %zmm8, %zmm7
vmovups (%rcx,%rbp), %zmm8
.loc    1 35 1                  # vectorization_test.f90:35:1
vfnmadd231ps    %zmm5, %zmm6, %zmm8 # zmm8 = -(zmm6 * zmm5) + zmm8
vmulps  %zmm8, %zmm7, %zmm8
vmulps  %zmm5, %zmm5, %zmm9
vfmadd231ps %zmm8, %zmm8, %zmm9 # zmm9 = (zmm8 * zmm8) + zmm9
vsubps  %zmm9, %zmm3, %zmm3
vrsqrt14ps  %zmm3, %zmm9
vmulps  %zmm9, %zmm3, %zmm3
vfmadd213ps %zmm0, %zmm9, %zmm3 # zmm3 = (zmm9 * zmm3) + zmm0
vmulps  %zmm1, %zmm9, %zmm9
vmulps  %zmm3, %zmm9, %zmm3
.loc    1 39 1                  # vectorization_test.f90:39:1
vmulps  %zmm8, %zmm7, %zmm8
.loc    1 40 1                  # vectorization_test.f90:40:1
vmulps  (%rcx,%r12), %zmm4, %zmm4
.loc    1 39 1                  # vectorization_test.f90:39:1
vmulps  %zmm3, %zmm8, %zmm8
.loc    1 41 1                  # vectorization_test.f90:41:1
vmulps  %zmm8, %zmm2, %zmm9
vfmsub231ps (%rsi,%rbx,4), %zmm3, %zmm9 # zmm9 = (zmm3 * mem) - zmm9
vmulps  %zmm5, %zmm3, %zmm3
vfmsub231ps %zmm8, %zmm6, %zmm3 # zmm3 = (zmm6 * zmm8) - zmm3
vfmadd213ps %zmm9, %zmm4, %zmm3 # zmm3 = (zmm4 * zmm3) + zmm9
.loc    1 42 1                  # vectorization_test.f90:42:1
vmulps  %zmm4, %zmm6, %zmm5
vmulps  %zmm5, %zmm7, %zmm5
vfmsub231ps %zmm7, %zmm2, %zmm5 # zmm5 = (zmm2 * zmm7) - zmm5
.Ltmp2:
.loc    1 15 1                  # vectorization_test.f90:15:1
vmovups %zmm3, (%rdi,%rbx,4)
movq    -16(%rsp), %rbp         # 8-byte Reload
vmovups %zmm5, (%rbp,%rbx,4)
vmovups %zmm4, (%rax,%rbx,4)
addq    $16, %rbx
cmpq    %rbx, %rdx
jne .LBB1_8



zmm registers are 64 bytes wide. The code vmovups data from memory into
registers, performs a series of arithmetic operations and inverse square roots
on it, and then vmovups three of these 64-byte registers back into memory.

That is the most efficient memory access pattern (as demonstrated empirically
via benchmarks).

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #6 from Chris Elrod  ---
Created attachment 45356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely
why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.647 μs (0.00% GC)
  median time:      20.860 μs (0.00% GC)
  mean time:        21.751 μs (0.00% GC)
  maximum time:     47.760 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1


Here is a benchmark (compiling with Flang) of my code, exactly as written
(suboptimal) in the attachments:

@benchmark flangtest($X32,  $BPP32,  $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     658.329 ns (0.00% GC)
  median time:      668.012 ns (0.00% GC)
  mean time:        692.384 ns (0.00% GC)
  maximum time:     1.192 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     161


That is 20 microseconds, vs 670 nanoseconds.

N was 1024, and the exact same data was used in both cases (but pretransposed,
so I do not benchmark the transposing).
Benchmarking was done by compiling shared libraries and calling them with
`ccall` and BenchmarkTools from Julia. As indicated by the reports, the
benchmark was run 10,000 times for gfortran and 1.61 million times for Flang,
to get accurate timings.

I compiled with (march=native is equivalent to march=skylake-avx512):

gfortran -Ofast -march=native -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o libgfortvectorization_test.so

flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.


The "suboptimal" code was close to 32 times faster than the "optimal" code.
I was expecting it to be closer to 16 times faster, given the vector width.


To go into more detail:

"
Fortran lays out the memory for that array as

BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)

so you are accessing your memory with a stride of n in the
expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient
anyway, vectorization would not really help in this case.
"

Yes, each call to fpdbacksolve is accessing memory across strides.
But fpdbacksolve itself cannot be vectorized well at all.

What does work, however, is vectorizing across loop iterations.

For example, imagine calling fpdbacksolve on this:

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every single scalar operation defined in fpdbacksolve on an
entire SIMD vector of floats (that is, on 16 floats) at a time.

That would of course require inlining fpdbacksolve (which was achieved with
-fno-semantic-interposition, as the assembly shows), and recompiling it.

Perhaps another way you can imagine it is that fpdbacksolve takes in 9 numbers
(BPP(:,4) was unused) and returns 3 numbers.
Because operations within it aren't vectorizable, we want to vectorize it
ACROSS loop iterations, not within them.
To facilitate that, we have 9 vectors of contiguous inputs and 3 vectors of
contiguous outputs. All the input1s are stored contiguously, as are all the
input2s, etc., allowing the inputs to be loaded efficiently into SIMD
registers, and each loop iteration to compute [SIMD vector width] outputs at a
time.

Of course, it is inconvenient to handle a dozen vectors. So if they all have
the same length, we can just concatenate them together.


I'll attach the assembly of both code examples as well.
The assembly makes it clear that the "suboptimal" way was vectorized, and the
"optimal" way was not.

The benchmarks make it resoundingly clear that the vectorized ("suboptimal")
version was dramatically faster.

As is, this is a missed optimization, and gfortran falls severely behind the
LLVM-based Flang on the highest-performance version of the code.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig  changed:

           What    |Removed     |Added
----------------------------------------
         Status    |UNCONFIRMED |RESOLVED
     Resolution    |---         |WONTFIX

--- Comment #5 from Thomas Koenig  ---
An additional point. The called routine has

pure function fpdbacksolve(x, S) result(Uix)

real, dimension(3) ::  Uix

... and so on. This _requires_ Uix to be contiguous in memory,
so we need to call the internal pack routine.
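Roughly, what the internal pack routine has to do can be sketched in C (a simplification; the real libgfortran routine is generic over type and rank): gather a strided section into a contiguous temporary before the call, and scatter it back afterwards.

```c
#include <stdlib.h>

/* Simplified sketch of internal_pack for a rank-1 float section: if the
   section has stride > 1, allocate a contiguous temporary and gather
   the elements into it. */
float *pack_section(float *base, size_t n, size_t stride, int *packed) {
    if (stride == 1) { *packed = 0; return base; } /* already contiguous */
    float *tmp = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++)
        tmp[i] = base[i * stride];                 /* gather */
    *packed = 1;
    return tmp;
}

/* Matching "unpack": scatter results back and free the temporary. */
void unpack_section(float *base, float *tmp, size_t n, size_t stride, int packed) {
    if (!packed) return;
    for (size_t i = 0; i < n; i++)
        base[i * stride] = tmp[i];                 /* scatter */
    free(tmp);
}
```

It is this per-call copy that shows up as _gfortran_internal_pack@PLT inside the hot loop.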

This is something we cannot change easily. So, the workaround:

Use the appropriate memory layout in your Fortran programs.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig  changed:

           What    |Removed |Added
------------------------------------------------
             CC    |        |tkoenig at gcc dot gnu.org
         Blocks    |        |36854
       Severity    |normal  |enhancement

--- Comment #4 from Thomas Koenig  ---
First, a few remarks on the code: it is written in a suboptimal way
regarding how Fortran lays out its memory.

The code does

real,   dimension(N,3), intent(out) ::  X
real,   dimension(N,10),intent(in)  ::  BPP

and then

do concurrent (i = 1:N)
X(i,:) = fpdbacksolve(BPP(i,1:3), BPP(i,5:10))
end do

The problem is that BPP(i,1:3) is not contiguous in memory.

Fortran lays out the memory for that array as

BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)

so you are accessing your memory with a stride of n in the
expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient
anyway, vectorization would not really help in this case.

So, if you change your code to

real,   dimension(3,N), intent(out) ::  X
real,   dimension(10,N),intent(in)  ::  BPP

do concurrent (i = 1:N)
X(:,i) = fpdbacksolve(BPP(1:3,i), BPP(5:10,i))
end do

then processBPP will be inlined completely, and memory accesses will
be contiguous.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36854
[Bug 36854] [meta-bug] fortran front-end optimization

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #3 from Chris Elrod  ---
Created attachment 45353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit
g++ assembly output

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #2 from Chris Elrod  ---
Created attachment 45352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit
gfortran assembly output

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #1 from Chris Elrod  ---
Created attachment 45351
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit
C++ version of the vectorization test case.