Re: SIMD support...

Martin Nowak Fri, 06 Jan 2012 04:57:26 -0800

On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright<[email protected]> wrote:

On 1/5/2012 5:42 PM, Manu wrote:
So I've been hassling about this for a while now, and Walter asked me to pitch
an email detailing a minimal implementation with some initial thoughts.
Takeaways:

1. SIMD behavior is going to be very machine specific.
2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe 100 or more new operators.
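
To make point 2 concrete, here is a minimal scalar sketch of the two behaviours for a single unsigned byte lane. The function names addWrap and addSat are hypothetical and used only for illustration; on x86 the packed equivalents are PADDB (wrapping) and PADDUSB (unsigned saturating).

```d
// Wrapping (unsaturated) add of one unsigned byte lane, as PADDB does per lane.
ubyte addWrap(ubyte a, ubyte b)
{
    // a + b is promoted to int; the cast keeps only the low 8 bits.
    return cast(ubyte)(a + b);
}

// Unsigned saturating add of one byte lane, as PADDUSB does per lane.
ubyte addSat(ubyte a, ubyte b)
{
    int s = a + b;                         // exact sum, range 0 .. 510
    return cast(ubyte)(s > 255 ? 255 : s); // clamp instead of wrapping
}

unittest
{
    assert(addWrap(200, 100) == 44);  // 300 mod 256
    assert(addSat(200, 100) == 255);  // clamped to the maximum
}
```

The same pair of inputs gives two different answers, which is why a plain + on a vector type cannot pick one meaning without surprising somebody.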
So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
Declare one new basic type:

     __v128
which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
Then, have:

    import core.simd;

which provides two functions:

    __v128 simdop(operator, __v128 op1);
    __v128 simdop(operator, __v128 op1, __v128 op2);
This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)
The operators would be an enum listing of the SIMD opcodes,

     PFACC, PFADD, PFCMPEQ, etc.

For:

     z = simdop(PFADD, x, y);

the compiler would generate:

     MOV z,x
     PFADD z,y
The code generator knows enough about these instructions to do register assignments reasonably optimally.
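
For targets without the built-in, the simulated simdop mentioned earlier could look roughly like the sketch below. Everything here is an assumption made for illustration: V128 is an ordinary struct standing in for the proposed __v128, the SimdOp enum lists only two SSE opcodes (ADDPS, MULPS), and the operation is carried out lane by lane in plain D rather than by the code generator.

```d
// Hypothetical subset of the opcode enum; a real list would be much longer.
enum SimdOp { ADDPS, MULPS }

// Stand-in for the proposed __v128, viewed here as 4 packed floats.
align(16) struct V128
{
    float[4] f;
}

// Software fallback: applies the selected operation to each lane in turn.
V128 simdop(SimdOp op, V128 a, V128 b)
{
    V128 r;
    foreach (i; 0 .. 4)
    {
        final switch (op) // final switch: every enum member must be handled
        {
            case SimdOp.ADDPS: r.f[i] = a.f[i] + b.f[i]; break;
            case SimdOp.MULPS: r.f[i] = a.f[i] * b.f[i]; break;
        }
    }
    return r;
}

unittest
{
    auto x = V128([1f, 2f, 3f, 4f]);
    auto y = V128([10f, 20f, 30f, 40f]);
    assert(simdop(SimdOp.ADDPS, x, y).f == [11f, 22f, 33f, 44f]);
}
```

A fallback of this kind keeps code compiling on other architectures, but it gives up the performance that motivates using SIMD in the first place.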
What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make it possible to write tight & fast SIMD code without having to do it all in assembler.
One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
     __vdouble2
     __vfloat4
     __vlong2
     __vulong2
     __vint4
     __vuint4
     __vshort8
     __vushort8
     __vbyte16
     __vubyte16


Those could be typedefs, i.e. alias this wrappers.
Still, simdop would not be typesafe.
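
A minimal sketch of what is meant, under stated assumptions: V128 is an ordinary struct standing in for the proposed __v128, VFloat4 and VInt4 are hypothetical wrapper names, and simdopAdd is a placeholder for an untyped simdop call (here a byte-wise add, just so there is something to run).

```d
// Hypothetical stand-in for the proposed untyped __v128.
align(16) struct V128
{
    ubyte[16] bytes;
}

// Typed views built as "alias this" wrappers around the raw vector.
struct VFloat4
{
    V128 raw;
    alias raw this; // implicit conversion VFloat4 -> V128
}

struct VInt4
{
    V128 raw;
    alias raw this; // implicit conversion VInt4 -> V128
}

// Placeholder for an untyped simdop: it only ever sees V128.
V128 simdopAdd(V128 a, V128 b)
{
    V128 r;
    foreach (i; 0 .. 16)
        r.bytes[i] = cast(ubyte)(a.bytes[i] + b.bytes[i]);
    return r;
}

// Library functions overloaded on the wrappers do see the element type.
string kind(VFloat4) { return "float4"; }
string kind(VInt4)   { return "int4"; }

unittest
{
    VFloat4 f;
    VInt4 i;
    assert(kind(f) == "float4" && kind(i) == "int4");
    // Compiles without complaint: both wrappers decay to V128,
    // so mixing float and int vectors is not caught.
    V128 r = simdopAdd(f, i);
}
```

The wrappers give a library enough type information to overload on, but any call that goes through the raw vector type accepts mismatched operands, which is the type-safety gap noted above.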

As much as this proposal presents a viable solution,
why not spend the time to extend inline asm instead?

void foo()
{
    __v128 a = loadss(1.0f);
    __v128 b = loadss(1.0f);
    a = addss(a, b);
}

__v128 loadss(float v) // name matches the calls in foo()
{
    __v128 res; // allocates register
    asm
    {
        movss res, v[RBP];
    }
    return res; // return in XMM1 but inlineable return assignment
}

__v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
{
    __v128 res = a;
    // asm prolog, allocates registers for every __v128 used within the asm
    asm
    {
        addss res, b;
    }
    // asm epilog, possibly restore spilled registers
    return res;
}

What would be needed?
 - Implement the asm allocation logic.
 - Functions containing asm statements should participate in inlining.
 - Determining inline cost of asm statements.

When used with typedefs for __vubyte16 et al., this would
allow a really clean and simple library implementation of intrinsics.
