On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
a:

Because dmd currently doesn't have an intrinsic for the SHUFPS instruction, I've included a version block with some GDC-specific code (this gave me a speedup of up to 80%).

It seems an instruction worth having in dmd too.


Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

I know your code is relatively simple, so it's not meant to be the fastest on the ground, but in your nice graph I'd like to see, _as a reference point_, a line for FFTW too. Such a line would show us how close to or how far from industry-standard performance all this is. (And if possible I'd like to see two lines for the LDC2 compiler too.)

Bye,
bearophile

The "bench" program in the FFTW test directory gives this when run in a loop:


2       Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3       Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4       Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5       Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6       Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7       Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8       Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9       Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10      Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690
11      Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693
12      Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622
13      Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784
14      Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913
15      Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659
16      Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409
17      Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200
18      Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338
19      Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4
20      Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8
21      Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2
22      Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8
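
For anyone reading the ``mflops'' column: as far as I know it is not a count of instructions actually executed but FFTW's nominal figure, 5*N*log2(N) divided by the time in microseconds. A minimal sketch in Python that recomputes it from the size and time columns, assuming the runs above are complex transforms (the function name fftw_mflops is just something made up for this example):

```python
import math

def fftw_mflops(n, seconds):
    """Nominal MFLOPS for a complex FFT of size n:
    5 * n * log2(n) flops divided by the time in microseconds."""
    return 5.0 * n * math.log2(n) / (seconds * 1e6)

# A few (size, time in seconds) pairs copied from the bench output above.
samples = [(4, 11.16e-9), (1024, 2.47e-6), (4194304, 88.56e-3)]

for n, t in samples:
    print(n, round(fftw_mflops(n, t), 1))
```

The recomputed values agree with the printed column to well under one percent; the small differences come from the times being rounded in the output.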

This was with FFTW compiled for single precision and with SSE, but without AVX support. When I compiled FFTW with AVX support, the peak was at about 30 GFLOPS, IIRC. It is possible that it would be even faster if I configured it differently. The C++ version of my FFT also supports AVX and gets to about 24 GFLOPS when using it. If AVX types are added to D, I will port that part too.
