[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-05-14 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #7 from mjr19 at cam dot ac.uk ---
Another manifestation of this issue in GCC 13.1 and 14.1 is that the loop

  do i=1,n
     c(i)=a(i)*c(i)*(0d0,1d0)
  enddo

takes about twice as long to run as

  do i=1,n
     c(i)=a(i)*(0d0,1d0)*c(i)
  enddo

when compiled with -Ofast -mavx2. In the second case the compiler manages to
merge its unnecessary desire to form separate vectors of real and imaginary
components, in order to perform the sign flips when multiplying by i, with its
much more reasonable desire to form such vectors for the general
complex-complex multiplication.

One might also argue that, as the above expressions are mathematically
identical, at -Ofast the compiler ought to choose the faster form anyway.

[Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14

2024-05-01 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #5 from mjr19 at cam dot ac.uk ---
Note that bug 114767 also turns out to be a case in which the inability to
alternate neg and nop along a vector leads to poor performance with some
operations on the complex type. That optimisation improvement request also
notes that the ability to alternate add and nop could be beneficial.

Ifort can alternate neg and nop, at least in the simple case of

  complex(kind(1d0)) :: c(*)
  do i=1,n
     c(i)=conjg(c(i))
  enddo

Helped by aggressive default unrolling, ifort ends up almost four times faster
than gfortran-14 on the machine I tested. When gfortran-14 is asked to unroll,
the difference drops to about a factor of two.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-19 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #6 from mjr19 at cam dot ac.uk ---
I was starting to wonder whether this issue might be related to that in bug
114324, which is a slightly more complicated example in which multiplication by
a purely imaginary number destroys vectorisation.

In 114324 the problem seems to arise from a refusal to alternate no-ops and
negs along a vector, which is pretty much the issue here too.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #4 from mjr19 at cam dot ac.uk ---
An issue which I suspect is related is shown by

subroutine zradd(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
     c(i)=c(i)+1d0
  enddo
end subroutine

If compiled with gfortran-14 and -O3 -mavx2 it all looks very sensible.

If one adds -ffast-math, it looks a lot less sensible, and takes over 70%
longer to run. I think it has changed from promoting 1d0 to (1d0,0d0) and then
adding that (which one might argue a strict interpretation of the Fortran
standard requires, though I am not certain that it does), to collecting all the
real parts in a vector, adding 1d0 to them, and avoiding adding 0d0 to the
imaginary parts. Unsurprisingly, the gain from halving the number of additions
is more than offset by the vperms and vshufs this requires.

Ideally -ffast-math would have noticed that adding 0d0 to the imaginary part is
not necessary, but then concluded that doing so is faster than any alternative
method, and so done it anyway.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #2 from mjr19 at cam dot ac.uk ---
Ah, I see. An inability to alternate negation with noop also means that
conjugation is treated suboptimally.

  do i=1,n
     c(i)=conjg(c(i))
  enddo

Here gfortran-13 and -14 are suboptimal in different ways, and again it appears
to be because they will not alternate noops and negations along a single vector.

In this case -14 fails to vectorise: it loads just the imaginary values,
negates them, and stores them back with a stride of two. Not a bad answer, but
one which would not generalise well should the loop contain other operations on
c(i).

In practice almost every vector operation can be trivially alternated with a
noop, as +, -, *, xor, or and and all have identity values which reduce the
operation to a noop and which could be used to pad things. Is there no scope
for making the SLP build more flexible here, given that the shuffling and
permuting otherwise cost both time and registers?

I also note that this means -ffast-math (or simply -fno-signed-zeros) slows
down these examples. That is unfortunate: there are cases in which
-fno-signed-zeros gives a useful performance increase, but the benefit can be
reduced or reversed when other loops, perhaps in the same source file, are
slowed by it.

[Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

           Bug ID: 114767
          Summary: gfortran AVX2 complex multiplication by (0d0,1d0)
                   suboptimal
          Product: gcc
          Version: 14.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: fortran
         Assignee: unassigned at gcc dot gnu.org
         Reporter: mjr19 at cam dot ac.uk
 Target Milestone: ---

Gfortran 14 shows considerable improvement over 13.1 on x86_64 AVX2 on the test
case

subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
     c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i

Both vectorise well, and use an xor for the multiplication by -1 -- good.

But both progress by forming one vector containing the real parts of four
consecutive complex elements, and one of the imaginary parts. The imaginary
parts are then all xor'ed to swap their signs, and further permuting and
shuffling occurs to reassemble things into the correct interleaved order.
Gfortran-14 has reduced the amount of permuting and shuffling to achieve the
same result.

I think that it should be possible to do this with the vector registers holding
the complex data in their natural order. A single xor could switch the signs of
alternate elements, leaving the real parts untouched, and a single vpermilpd
(or the more general vpermpd) could then swap pairs of elements. This should
not only be faster, but use fewer registers too.

(This could be generalised to the case where the constant is a multiple of
(0d0,1d0), say (0d0,q), either by a final multiplication with a vector
containing q repeated, or by replacing the xor by a multiplication by a vector
containing q repeated but with alternating signs.)

Compilation tests with -mavx2 -O3 -ffast-math.

[Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14

2024-03-15 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #4 from mjr19 at cam dot ac.uk ---
Created attachment 57713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57713&action=edit
Second testcase, very similar to first

Thank you for looking into this. The real code in question has more than one
loop which suffers a slow-down with gfortran 13/14 when compared to 12, and I
suspect it is the same underlying issue in all cases.

I attach another test case, which seems very similar. The odd logic surrounding
the initialisation of ci is to replicate the fact that in the real code the
sign of ci depends on an argument which I have dropped, and so the compiler
cannot optimise it away completely.

For this case, gfortran 12 and ifort produce very similar performance, gfortran
13 is over 20% slower, and ifx slower still.

[Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14

2024-03-13 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

           Bug ID: 114324
          Summary: AVX2 vectorisation performance regression with
                   gfortran 13/14
          Product: gcc
          Version: 13.1.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: fortran
         Assignee: unassigned at gcc dot gnu.org
         Reporter: mjr19 at cam dot ac.uk
 Target Milestone: ---

Created attachment 57685
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57685&action=edit
Test case of loop showing performance regression

The attached loop, when compiled with "-Ofast -mavx2", runs over 20% slower
with gfortran 13 or (pre-release) 14 than with 12.x. Precise versions tested:
12.3.0, 13.1.0, and a GCC 14 snapshot downloaded on 11th March.

The precise slowdown depends on the CPU; tested on Haswell and Kaby Lake desktops.

Adding "-fopenmp" changes the code produced, but 12.3 still beats later
compilers. The analysis below is without -fopenmp.

It appears (to me) that 12.x is using the full width of the ymm registers, and
has a loop of 17 vector instructions, and some scalar loop control, which
performs two iterations of the original Fortran loop.

13.x manages more aggressive unrolling, performing four iterations per pass,
but uses about 54 vector instructions, rather than the 34 one might naively
expect. More instructions does not necessarily mean slower, but here it does.

I attach the test case to which I refer. I would be happy to add the trivial
timing program to show how I have been timing it. The full code is an FFT, but
the test case has been reduced to functional nonsense.

(I note that in other areas there are pleasing performance gains in gfortran
13.x. It is a pity that this partially cancels them.)