On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
Sean Cavanaugh:
In many ways the biggest thing I use regularly in game
development that I would lose by moving to D would be good
built-in SIMD support.
Don has given a nice answer about how D2 plans to face this.
To focus more what Don was saying I think a small exaple will
help. This is a C implementation of one Computer Shootout
benchmarks, that generates a binary PPM image of the Mandelbrot
set:
http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4
This is an important part of that C version:
typedef double v2df __attribute__ ((vector_size(16))); /*
vector of two doubles */
const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };
// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;
// Program argument: height and width of the image
int N;
// Lookup table for initial real-axis value
v2df *Crvs;
// Mandelbrot bitmap
uint8_t *bitmap;
static void calc_row(int y) {
uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
int x;
const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };
for (x = 0; x < N; x += 2) {
v2df Crv = Crvs[x >> 1];
v2df Civ = Civ_init;
v2df Zrv = zero;
v2df Ziv = zero;
v2df Trv = zero;
v2df Tiv = zero;
int i = 50;
int two_pixels;
v2df is_still_bounded;
do {
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
// All bits will be set to 1 if 'Trv + Tiv' is less than 4
// and all bits will be set to 0 otherwise. Two elements
// are calculated in parallel here.
is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv,
four);
// Move the sign-bit of the low element to bit 0, move the
// sign-bit of the high element to bit 1. The result is
// that the pixel will be set if the calculation was
// bounded.
two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);
// The pixel bits must be in the most and second most
// significant position
two_pixels <<= 6;
// Add the two pixels to the bitmap, all bits are
// initially zero since the area was allocated with calloc()
row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
}
}
GCC 4.6 compiles the inner do-while loop of calc_row() to just
this very clean assembly, that in my opinion is quite
_beautiful_, it shows one of the most important final purposes
of a good compiler:
L9:
subl $1, %ecx
addpd %xmm0, %xmm0
mulpd %xmm0, %xmm1
movapd %xmm4, %xmm0
addpd %xmm6, %xmm1
addpd %xmm5, %xmm0
subpd %xmm3, %xmm0
movapd %xmm1, %xmm3
movapd %xmm0, %xmm4
mulpd %xmm1, %xmm3
mulpd %xmm0, %xmm4
movapd %xmm3, %xmm2
addpd %xmm4, %xmm2
cmplepd %xmm7, %xmm2
movmskpd %xmm2, %ebx
je L18
testl %ebx, %ebx
jne L9
Those addpd, subpd, mulpd, movapd, etc, instructions work on
pairs of doubles (those v2df). And the code uses the cmplepd
and movmskpd instructions too, in a very clean way, that I
think not even GCC 4.6 is normally able to use by itself. A
good language + compiler have many purposes, but producing ASM
code like that is one of the most important purposes,
expecially if you write numerical code.
A numerical programmer really wants to write code that somehow
produces equally clean and powerful code (or better, using AVX
256-bit registers and 3-way instructions) in numerical
processing kernels (often such kernels are small, often just
bodies of inner loops).
D2 allows to write code almost as clean as this C one (but I
think currently no D compiler is able to turn this into clean
inlined addpd, subpd, mulpd, movapd instructions. This is a
compiler issue, not a language one):
v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
In D it becomes:
double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];
But then how do you write this in a clean way in D2/D3?
do {
...
is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);
Using those __builtin_ia32_cmplepd() and
__builtin_ia32_movmskpd() is not easy, so there is a tradeoff
between allowing easy to write code, and giving power. So it's
acceptable for a language to give a bit less power if the code
is simpler to write. Yet, in a system language if you don't
give people a way to produce ASM code as clean as the one I've
shown in the inner loops of numerical processing code, some D2
programmers will be forced to write down inline asm, and that's
sometimes worse than using intrinsics like
__builtin_ia32_cmplepd().
Writing efficient inner loops is very important for numerical
processing code, and I think numerical processing code is
important for D2.
Time ago I have suggested to extend the D2 vector operations to
code like this, but I think this is not enough still:
float[4] a, b, c, d;
c = a[] == b[];
d = a[] >= b[];
Bye,
bearophile
Just found this old post, since I'm tuning mandelbrot.d right now
[0].
The good news: LDC produces code, which is quite close to the C
version.
mulsd %xmm6,%xmm4
subsd %xmm1,%xmm7
addsd %xmm4,%xmm4
addsd %xmm5,%xmm7
addsd %xmm0,%xmm4
movaps %xmm7,%xmm6
mulsd %xmm6,%xmm6
movaps %xmm4,%xmm2
mulsd %xmm2,%xmm2
movaps %xmm2,%xmm1
addsd %xmm6,%xmm1
ucomisd %xmm1,%xmm3
jb 4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
jl 402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>
Even better, the code is produce from the following (inlined!)
source,
which is pretty much the mathematical definition.
for(auto i = 0; i < iter && norm(Z) <= lim; i++)
Z = Z*Z + C;
The bad news: cmplepd and movmskpd are not used. Is that possible
somehow four years later?
The gcc code is roughly twice as fast at the moment, but I don't
know if cmplepd and movmskpd is the only thing missing.
[0] https://github.com/qznc/d-shootout