Re: core.simd woes

F i L Mon, 06 Aug 2012 18:25:27 -0700

On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:

I think core.simd is only designed for the lowest level of access to the SIMD hardware. I started writing std.simd some time back; it is mostly finished in a fork, but there are some bugs/missing features in D's SIMD support preventing me from finishing/releasing it. (incomplete dmd implementation, missing intrinsics, no SIMD literals, can't do unit testing, etc)

Yes, I found, and have been referring to, your std.simd library for a while now. Even with your library having GDC-only support ATM, it's been a help. Thank you.

The intention was that std.simd would be a flat C-style API, which would be the lowest level required for practical and portable use.
It's almost done, and it should make it a lot easier for people to build their own SIMD libraries on top. It supplies most useful linear algebraic operations, and implements them as efficiently as possible for other architectures than just SSE.
Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD right now, and I haven't built GDC yet, so I can't do performance comparisons between the two. I really need to get around to setting up GDC, because I've always planned on using that as a "release compiler" for my code.

The problem is, as I mentioned above, that SIMD performance gets completely shot when wrapping a float4 in a struct, rather than using float4 directly. There are some places (like matrices) where struct types do make a big impact, but I'm trying to find the best solution for general code. For instance, my current math library looks like:


    struct Vector4 { float x, y, z, w; ... }
    struct Matrix4 { Vector4 x, y, z, w; ... }

but I was planning on changing over to (something like):

    alias float4 Vector4;
    alias float4[4] Matrix4;

So I could use the types directly and reap the performance gains. I'm currently doing this to both my D code (still in an early state) and our C# code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector types, but Mono's version gives me access to component channels and simple constructors I can use, so for user code (and types like the Matrix above, with internal vectors) it's very convenient and natural. D's simply isn't, and I'm not sure there's any way around it since again, at least with DMD, performance is shot when I put it in a struct.

On a side note, your example where you're performing a scalar add within a vector; this is bad, don't ever do this.
SSE (ie, x86) is the most tolerant architecture in this regard, but it's VERY bad SIMD design. You should never perform any component-wise arithmetic when working with SIMD; it's absolutely not portable.
Basically, a good rule of thumb is, if the keyword 'float' appears anywhere that interacts with your SIMD code, you are likely to see worse performance than just using float[4] on most architectures.
Better to factor your code to eliminate any scalar work, and make sure 'scalars' are broadcast across all 4 components and continue doing 4d operations.
Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
Better to use: @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
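
The "broadcast the scalar and keep doing 4d operations" advice can be sketched roughly as follows. This is only an illustration, assuming DMD on x86_64 with core.simd; `splat` is a hypothetical helper (it is not part of core.simd or std.simd), and it fills lanes through `.ptr` for clarity rather than with a real shuffle/swizzle instruction.

```d
import core.simd;

// Hypothetical helper (not in core.simd): copy one scalar into all four lanes.
// A real library would do this with a single broadcast/shuffle instruction.
float4 splat(float s)
{
    float4 r;
    r.ptr[0] = s;
    r.ptr[1] = s;
    r.ptr[2] = s;
    r.ptr[3] = s;
    return r;
}

void main()
{
    float4 v = splat(2.0f);      // (2, 2, 2, 2)
    v.ptr[1] = 3.0f;             // (2, 3, 2, 2)

    // Broadcast the "scalars" once, then stay in vector registers.
    float4 scale  = splat(0.5f);
    float4 offset = splat(1.0f);

    v = v * scale + offset;      // all four lanes are processed together

    assert(v.array == [2.0f, 2.5f, 2.0f, 2.0f]);
}
```

The point of the sketch is that `scale` and `offset` are built once, outside any hot loop, so the arithmetic itself never touches a lone float.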

Thanks a lot for telling me this, I don't know much about SIMD stuff. You're actually the exact person I wanted to talk to, because you do know a lot about this and I've always respected your opinions.


I'm not opposed to doing something like:

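    void addX(ref float4 v, float val)
    {
        float4 f = 0;    // zero every lane so only x is affected
        f.ptr[0] = val;  // core.simd has no .x, so write lane 0 through .ptr
        v += f;
    }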

to do single component scalars, but it's very inconvenient for users to remember to use:


    vec.addX(scalar);

instead of:

    vec.x += scalar;

But that wouldn't be an issue if I could write custom operators for the components that basically did that. But I can't without wrapping float4, which is why I am requesting these magic types get some basic features like that.
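
To make the limitation concrete, here is a rough sketch of how far free functions get you without a wrapper struct. It assumes DMD on x86_64; the two `x` accessors are hypothetical UFCS property functions, not something core.simd provides. Note that it does exactly the per-component scalar access warned against above, so it shows the syntax problem only, not a recommended fast path.

```d
import core.simd;

// Hypothetical UFCS accessors for lane 0 (not part of core.simd).
@property float x(float4 v) { return v.array[0]; }
@property void x(ref float4 v, float val) { v.ptr[0] = val; }

void main()
{
    float4 vec;
    foreach (i; 0 .. 4)
        vec.ptr[i] = 0.0f;      // start from a known state

    vec.x = 2.0f;               // rewritten to x(vec, 2.0f)
    vec.x = vec.x + 1.5f;       // read-modify-write has to be spelled out

    // vec.x += 1.5f;           // rejected: the getter returns an rvalue,
    //                          // and there is no op-assign rewrite for properties

    assert(vec.array == [3.5f, 0.0f, 0.0f, 0.0f]);
}
```

So plain assignment works through UFCS, but `vec.x += scalar` is the part that would need either a wrapper type or compiler support on the built-in vector types.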

I'm wondering if I should be looking at just using inlined ASM and use the ASM SIMD instructions directly. I know basic ASM, but I don't know what the potential pitfalls of doing that are, especially with portability. Is there a reason not to do this (short of complexity)? I'm also wondering why wrapping a core.simd type into a struct completely negates performance. I'm guessing it's because when I return the struct type, the compiler has to think about it as a struct instead of its "magic" type, and all struct types have a bit more overhead.

On a side note, DMD without SIMD is much faster than C# without SIMD, by a factor of 8x usually on simple vector types (micro-benchmarks), and that's not counting the runtime's startup times either. However, when I use Mono.Simd, both DMD (with core.simd) and C# have similar performance (see below). Math code with Mono C# (with SIMD) actually runs faster on Linux (even without the SGen GC or LLVM codegen) than it does on Windows 8 with MS .NET, which I find to be pretty impressive and encouraging for our future games with Mono on Android (which has been our biggest performance PITA platform so far).

I've noticed some really odd things with core.simd as well, which is another reason I'm thinking of trying inlined ASM. I'm not sure what's causing certain compiler optimizations. For instance, given the basic test program, when I do:


    float rand = ...; // user input value

    float4 a, b = [1, 4, -12, 5];

    a.ptr[0] = rand;
    a.ptr[1] = rand + 1;
    a.ptr[2] = rand + 2;
    a.ptr[3] = rand + 3;

    ulong mil;
    StopWatch sw;

    foreach (t; 0 .. testCount)
    {
        sw.start();
        foreach (i; 0 .. 1_000_000)
        {
            a += b;
            b -= a;
        }
        sw.stop();
        mil += sw.peek().msecs;
        sw.reset();
    }

    writeln(a.array, ", ", b.array);
    writeln(cast(double) mil / testCount);

When I run this on my Phenom II X4 920, it completes in ~9ms. For comparison, C# Mono.Simd gets almost identical performance with identical code. However, if I add:


    auto vec4(float x, float y, float z, float w)
    {
        float4 result;

        result.ptr[0] = x;
        result.ptr[1] = y;
        result.ptr[2] = z;
        result.ptr[3] = w;

        return result;
    }

then replace the vector initialization lines:

    float4 a, b = [ ... ];
    a.ptr[0] = rand;
    ...

with ones using my factory function:

    auto a = vec4(rand, rand+1, rand+2, rand+3);
    auto b = vec4(1, 4, -12, 5);

Then the program consistently completes in 2.15ms...

wtf right? The printed vector output is identical, and there are no changes to the loop code (a += b, etc); I just change the construction code of the vectors and it runs 4.5x faster. Beats me, but I'll take it. Btw, for comparison, if I use a struct with an internal float4 it runs in ~19ms, and a struct with four floats runs in ~22ms. So you can see my concerns with using core.simd types directly, especially when my Intel Mac gets even better improvements with SIMD code. I haven't done extensive tests on the Intel, but in my original test (the one above, only in C# using Mono.Simd) the results were ~55ms using a struct with an internal float4, and ~5ms using float4 directly.
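
One way to sanity-check a result like this is to push both construction paths through the same loop in one program and confirm the outputs match before trusting the timings. The following is a rough sketch, assuming a recent DMD where StopWatch lives in std.datetime.stopwatch (the code above uses the older `sw.peek().msecs` API); `runLoop` and the fixed `seed` value are hypothetical stand-ins, and the timings it prints will vary by machine and compiler flags.

```d
import core.simd;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// Factory from the post: fill each lane through .ptr and return by value.
float4 vec4(float x, float y, float z, float w)
{
    float4 result;
    result.ptr[0] = x;
    result.ptr[1] = y;
    result.ptr[2] = z;
    result.ptr[3] = w;
    return result;
}

// Hypothetical helper: run the add/sub loop, return elapsed microseconds.
long runLoop(ref float4 a, ref float4 b, size_t iterations)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. iterations)
    {
        a += b;
        b -= a;
    }
    sw.stop();
    return sw.peek().total!"usecs";
}

void main()
{
    float seed = 3.0f; // stand-in for the user input value

    // Variant 1: declare the vectors, then write the lanes in place.
    float4 a1, b1;
    foreach (i; 0 .. 4)
        a1.ptr[i] = seed + i;
    b1.ptr[0] = 1; b1.ptr[1] = 4; b1.ptr[2] = -12; b1.ptr[3] = 5;

    // Variant 2: build the same values through the factory function.
    auto a2 = vec4(seed, seed + 1, seed + 2, seed + 3);
    auto b2 = vec4(1, 4, -12, 5);

    immutable t1 = runLoop(a1, b1, 1_000_000);
    immutable t2 = runLoop(a2, b2, 1_000_000);

    // Same inputs and same loop, so the results must agree lane for lane.
    assert(a1.array == a2.array && b1.array == b2.array);

    writeln("in-place init: ", t1, " us, factory init: ", t2, " us");
}
```

If the two variants still differ by a large factor here, comparing the generated assembly for the loop in each case would be the next thing to look at.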


anyways, thanks for the feedback.
