Re: OOP, faster data layouts, compilers

qznc via Digitalmars-d Wed, 02 Sep 2015 12:06:41 -0700

On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:

Sean Cavanaugh:
In many ways the biggest thing I use regularly in game development that I would lose by moving to D would be good built-in SIMD support.
Don has given a nice answer about how D2 plans to face this.
To focus more on what Don was saying, I think a small example will help. This is a C implementation of one of the Computer Shootout benchmarks, which generates a binary PBM image of the Mandelbrot set:
http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:
typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */
const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;

// Lookup table for initial real-axis value
v2df *Crvs;

// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
  uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
  int x;
  const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

  for (x = 0; x < N; x += 2) {
    v2df Crv = Crvs[x >> 1];
    v2df Civ = Civ_init;
    v2df Zrv = zero;
    v2df Ziv = zero;
    v2df Trv = zero;
    v2df Tiv = zero;
    int i = 50;
    int two_pixels;
    v2df is_still_bounded;

    do {
      Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
      Zrv = Trv - Tiv + Crv;
      Trv = Zrv * Zrv;
      Tiv = Ziv * Ziv;

      // All bits will be set to 1 if 'Trv + Tiv' is less than or
      // equal to 4 and all bits will be set to 0 otherwise. Two
      // elements are calculated in parallel here.
      is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
      // Move the sign-bit of the low element to bit 0, move the
      // sign-bit of the high element to bit 1. The result is
      // that the pixel will be set if the calculation was
      // bounded.
      two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
    } while (--i > 0 && two_pixels);

    // The pixel bits must be in the most and second most
    // significant position
    two_pixels <<= 6;

    // Add the two pixels to the bitmap, all bits are
    // initially zero since the area was allocated with calloc()
    row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
  }
}
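The last two statements of calc_row() pack each 2-bit result into the row bitmap, eight pixels per byte. A minimal sketch of just that step follows. The helper name `store_pair` is hypothetical and not part of the benchmark. Which vector element maps to which pixel depends on how Crvs was filled, which is not shown here.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the last two statements of calc_row().
 * two_pixels is the 2-bit mask from movmskpd (values 0..3); x is the
 * even column index of the pixel pair. */
static void store_pair(uint8_t *row_bitmap, int x, int two_pixels) {
    /* Move the two mask bits to bit positions 7 and 6 of a byte. */
    two_pixels <<= 6;
    /* x >> 3 selects the byte, x & 7 shifts the pair to its slot
     * (0, 2, 4 or 6 bits from the most significant end). */
    row_bitmap[x >> 3] |= (uint8_t)(two_pixels >> (x & 7));
}
```

Because the bits are OR-ed in, the row must start out zeroed, which is why the original comment points out that the buffer comes from calloc().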
GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, which in my opinion is quite _beautiful_; it shows one of the most important final purposes of a good compiler:
L9:
    subl    $1, %ecx
    addpd   %xmm0, %xmm0
    mulpd   %xmm0, %xmm1
    movapd  %xmm4, %xmm0
    addpd   %xmm6, %xmm1
    addpd   %xmm5, %xmm0
    subpd   %xmm3, %xmm0
    movapd  %xmm1, %xmm3
    movapd  %xmm0, %xmm4
    mulpd   %xmm1, %xmm3
    mulpd   %xmm0, %xmm4
    movapd  %xmm3, %xmm2
    addpd   %xmm4, %xmm2
    cmplepd %xmm7, %xmm2
    movmskpd    %xmm2, %ebx
    je  L18
    testl   %ebx, %ebx
    jne L9
Those addpd, subpd, mulpd, movapd, etc. instructions work on pairs of doubles (those v2df). The code also uses the cmplepd and movmskpd instructions in a very clean way, which I think not even GCC 4.6 is normally able to use by itself. A good language + compiler have many purposes, but producing ASM code like that is one of the most important, especially if you write numerical code.
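For readers who do not have the two instructions in their head, here is a portable scalar model of what the cmplepd/movmskpd pair computes in the loop condition. It is only a sketch of the semantics, not how the hardware or GCC implements them, and the function names `cmple_lane`, `movmsk2` and `still_bounded_mask` are made up for illustration.

```c
#include <stdint.h>

/* Model of cmplepd on one 64-bit lane: every bit is set if a <= b,
 * every bit is clear otherwise (including when either input is NaN). */
static uint64_t cmple_lane(double a, double b) {
    return (a <= b) ? UINT64_MAX : 0;
}

/* Model of movmskpd on two lanes: gather the top (sign) bit of each
 * lane. The low lane lands in bit 0, the high lane in bit 1. */
static int movmsk2(uint64_t lo, uint64_t hi) {
    return (int)(lo >> 63) | ((int)(hi >> 63) << 1);
}

/* Hypothetical helper: the 2-bit "still bounded" mask for a pair of
 * values of Trv + Tiv compared against a limit such as 4.0. */
static int still_bounded_mask(const double sum[2], double limit) {
    return movmsk2(cmple_lane(sum[0], limit), cmple_lane(sum[1], limit));
}
```

The loop in calc_row() keeps running while this mask is nonzero, that is, while at least one of the two pixels has not escaped yet.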
A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (such kernels are often small, often just the bodies of inner loops).
D2 allows you to write code almost as clean as this C code (but I think currently no D compiler is able to turn it into clean inlined addpd, subpd, mulpd, movapd instructions; this is a compiler issue, not a language one):
v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;


In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];


But then how do you write this in a clean way in D2/D3?

do {
    ...
    is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
    two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);
Using __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy-to-write code and giving power. It's acceptable for a language to give a bit less power if the code is simpler to write. Yet in a systems language, if you don't give people a way to produce ASM code as clean as what I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd().
Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2.
Some time ago I suggested extending the D2 vector operations to code like this, but I think this is still not enough:
float[4] a, b, c, d;
c = a[] == b[];
d = a[] >= b[];
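For comparison, the C vector extension behind the v2df typedef above also accepts comparison operators on vector types in reasonably recent GCC and Clang, as far as I know. The comparison is done per lane and yields -1 (all bits set) or 0 in a signed integer vector. The sketch below uses that to build the loop-condition mask without x86-specific builtins. The names `v2di` and `bounded_mask` are assumptions for illustration, and whether a compiler turns this into cmplepd plus movmskpd is not guaranteed.

```c
typedef double    v2df __attribute__ ((vector_size(16)));
typedef long long v2di __attribute__ ((vector_size(16)));

/* Hypothetical helper: 2-bit mask, bit 0 for lane 0 and bit 1 for
 * lane 1, set where Trv + Tiv <= limit. */
static int bounded_mask(v2df Trv, v2df Tiv, v2df limit) {
    /* Lane-wise comparison: each lane becomes -1 if true, 0 if false.
     * The explicit cast only reinterprets the same 16 bytes. */
    v2di m = (v2di)((Trv + Tiv) <= limit);
    return (int)(m[0] & 1) | (int)((m[1] & 1) << 1);
}
```

This is close in spirit to the `a[] == b[]` idea: the comparison reads like scalar code, and the result is a per-element mask rather than a single boolean.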

Bye,
bearophile

Just found this old post, since I'm tuning mandelbrot.d right now[0].

The good news: LDC produces code that is quite close to the C version.


mulsd  %xmm6,%xmm4
subsd  %xmm1,%xmm7
addsd  %xmm4,%xmm4
addsd  %xmm5,%xmm7
addsd  %xmm0,%xmm4
movaps %xmm7,%xmm6
mulsd  %xmm6,%xmm6
movaps %xmm4,%xmm2
mulsd  %xmm2,%xmm2
movaps %xmm2,%xmm1
addsd  %xmm6,%xmm1
ucomisd %xmm1,%xmm3
jb     4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
jl     402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>

Even better, the code is produced from the following (inlined!) source, which is pretty much the mathematical definition.

for(auto i = 0; i < iter && norm(Z) <= lim; i++)
        Z = Z*Z + C;
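To make the mapping to the listing above easier to follow, here is the same loop spelled out with plain doubles in C. It is a sketch under assumptions: `escape_iterations` is a hypothetical name, and norm(Z) is taken to be the squared magnitude, which the absence of a square root in the listing suggests. Note that every arithmetic instruction in the LDC output has the scalar `sd` suffix (one double per instruction), while the GCC output uses packed `pd` instructions (two doubles per instruction).

```c
/* Hypothetical scalar version of: for (i = 0; i < iter && norm(Z) <= lim; i++) Z = Z*Z + C;
 * Returns the number of iterations executed before Z escaped
 * (or iter if it never did). */
static int escape_iterations(double cr, double ci, int iter, double lim) {
    double zr = 0.0, zi = 0.0;
    int i;
    /* zr*zr + zi*zi is the squared magnitude of Z (assumed norm). */
    for (i = 0; i < iter && zr * zr + zi * zi <= lim; i++) {
        /* Z*Z + C for Z = zr + i*zi and C = cr + i*ci. */
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}
```

For example, with lim = 4.0 the point C = 0 never escapes and the function returns iter, while C = 2 escapes after two iterations.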

The bad news: cmplepd and movmskpd are not used. Is that somehow possible four years later?

The gcc code is roughly twice as fast at the moment, but I don't know if cmplepd and movmskpd are the only things missing.


[0] https://github.com/qznc/d-shootout
