On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
a:
Because dmd currently doesn't have an intrinsic for the SHUFPS
instruction I've included a version block with some GDC
specific code (this gave me a speedup of up to 80%).
It seems an instruction worth having in dmd too.
Chart: http://cloud.github.com/downloads/jerro/pfft/image.png
I know your code is relatively simple, so it's not meant to be
the fastest on the ground, but in your nice graph _as reference
point_ I'd like to see a line for FFTW too. Such a line would
show us how close or how far all this is from
industry-standard performance.
(And if possible I'd like to see two lines for the LDC2
compiler too.)
Bye,
bearophile
"bench" program in the fftw test directory gives this when run in
a loop:
2 Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3 Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4 Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5 Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6 Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7 Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8 Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9 Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10 Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690
11 Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693
12 Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622
13 Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784
14 Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913
15 Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659
16 Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409
17 Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200
18 Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338
19 Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4
20 Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8
21 Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2
22 Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8
This was with fftw compiled for single precision and with SSE,
but without AVX support. When I compiled fftw with AVX support,
the peak was at about 30 GFLOPS, IIRC. It is possible that it
would be even faster if I configured it in a different way. The
C++ version of my FFT also supports AVX and gets to about 24
GFLOPS when using it. If AVX types are added to D, I will
port that part too.
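
For anyone who wants to compare their own timings with the "mflops" column above, here is a minimal sketch in Python. It assumes the benchmark uses the usual FFTW convention for complex transforms, i.e. a nominal 5 * N * log2(N) floating point operations divided by the time in microseconds. The function name fft_mflops is made up for this illustration and is not part of FFTW or pfft. The sample sizes and times are copied from the output above.

```python
import math

def fft_mflops(n, seconds):
    """Nominal MFLOPS for one complex FFT of size n.

    Assumes the conventional operation count of 5 * n * log2(n);
    this is a scaled inverse of the run time, not a measured flop count.
    """
    nominal_flops = 5.0 * n * math.log2(n)
    microseconds = seconds * 1e6
    return nominal_flops / microseconds

# (size, time in seconds) pairs taken from the bench output above
samples = [(4, 11.16e-9), (1024, 2.47e-6), (4194304, 88.56e-3)]

for n, t in samples:
    print(f"N = {n:8d}  mflops ~ {fft_mflops(n, t):.1f}")
```

The results should land close to the reported 3583.7, 20690 and 5209.8. Small differences are expected, because the times printed by the benchmark are already rounded.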