On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:
I think core.simd is only designed for the lowest level of access to the SIMD hardware. I started writing std.simd some time back; it is mostly finished in a fork, but there are some bugs/missing features in D's SIMD support preventing me from finishing/releasing it (incomplete dmd implementation, missing intrinsics, no SIMD literals, can't do unit testing, etc.).

Yes, I found your std.simd library and have been referring to it for a while now. Even though it only supports GDC at the moment, it's been a help. Thank you.

The intention was that std.simd would be a flat C-style API, which would be the lowest level required for practical and portable use.
It's almost done, and it should make it a lot easier for people to build their own SIMD libraries on top. It supplies most useful linear algebraic operations, and implements them as efficiently as possible for other architectures than just SSE.
Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD right now, and I haven't built GDC yet, so I can't do performance comparisons between the two. I really need to get around to setting up GDC, because I've always planned on using that as a "release compiler" for my code.

The problem is, as I mentioned above, that SIMD performance gets completely shot when wrapping a float4 in a struct rather than using float4 directly. There are some places (like matrices) where they do make a big impact, but I'm trying to find the best solution for general code. For instance, my current math library looks like:

    struct Vector4 { float x, y, z, w; ... }
    struct Matrix4 { Vector4 x, y, z, w; ... }

but I was planning on changing over to (something like):

    alias float4 Vector4;
    alias float4[4] Matrix4;

So I could use the types directly and reap the performance gains. I'm currently doing this to both my D code (still in an early state) and our C# code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector types, but Mono's version gives me access to component channels and simple constructors I can use, so for user code (and types like the Matrix above, with internal vectors) it's very convenient and natural. D's simply isn't, and I'm not sure there's any way around it since, again, at least with DMD, performance is shot when I put it in a struct.


On a side note, your example where you're performing a scalar add within a vector; this is bad, don't ever do this.
SSE (i.e., x86) is the most tolerant architecture in this regard, but it's VERY bad SIMD design. You should never perform any component-wise arithmetic when working with SIMD; it's absolutely not portable.
Basically, a good rule of thumb is, if the keyword 'float' appears anywhere that interacts with your SIMD code, you are likely to see worse performance than just using float[4] on most architectures.
Better to factor your code to eliminate any scalar work, and make sure 'scalars' are broadcast across all 4 components and continue doing 4d operations.

Instead of:

    @property pure nothrow float x(float4 v) { return v.ptr[0]; }

Better to use:

    @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
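
The "broadcast the scalar, stay 4-wide" rule can be sketched with plain core.simd, without std.simd's swizzle. This is only a minimal illustration: `scale` is a hypothetical helper (not part of core.simd or std.simd), and it assumes a compiler and target where `float4` is available (for example DMD on x86_64) and where a vector can be initialized from a scalar and from an array literal.

```d
import core.simd;
import std.stdio : writeln;

// Hypothetical helper, for illustration only.
// The scalar is broadcast into all four lanes once, and the
// arithmetic itself is a single 4-wide multiply.
float4 scale(float4 v, float s)
{
    float4 factor = s;   // broadcast: every lane holds s
    return v * factor;   // element-wise vector multiply
}

void main()
{
    float4 v = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 r = scale(v, 2.0f);
    writeln(r.array);    // print the four lanes as a float[4]
}
```

The one scalar that does appear is consumed by the broadcast; everything after that stays in vector registers, which is the property the rule of thumb is after.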

Thanks a lot for telling me this, I don't know much about SIMD stuff. You're actually the exact person I wanted to talk to, because you do know a lot about this and I've always respected your opinions.

I'm not opposed to doing something like:

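    void addX(ref float4 v, float val)
    {
        float4 f = 0.0f;   // zero all lanes first (float lanes default to NaN)
        f.ptr[0] = val;    // only the x lane carries the scalar
        v += f;            // v is updated in place through the ref
    }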

to do single component scalars, but it's very inconvenient for users to remember to use:

    vec.addX(scalar);

instead of:

    vec.x += scalar;

But that wouldn't be an issue if I could write custom operators for the components that basically did that. But I can't without wrapping float4, which is why I am requesting that these magic types get some basic features like that.

I'm wondering if I should just use inline ASM and the SIMD instructions directly. I know basic ASM, but I don't know what the potential pitfalls of doing that are, especially with portability. Is there a reason not to do this (short of complexity)? I'm also wondering why wrapping a core.simd type in a struct completely negates performance. I'm guessing it's because when I return the struct type, the compiler has to treat it as a struct instead of its "magic" type, and all struct types have a bit more overhead.


On a side note, DMD without SIMD is much faster than C# without SIMD, usually by a factor of 8x on simple vector types (micro-benchmarks), and that's not counting the runtime's startup time either. However, when I use Mono.Simd, DMD (with core.simd) and C# have similar performance (see below). Math code with Mono C# (with SIMD) actually runs faster on Linux (even without the SGen GC or LLVM codegen) than it does on Windows 8 with MS .NET, which I find pretty impressive and encouraging for our future games with Mono on Android (which has been our biggest performance PITA platform so far).

I've noticed some really odd things with core.simd as well, which is another reason I'm thinking of trying inline ASM. I'm not sure what's causing certain compiler optimizations. For instance, given this basic test program:

    float rand = ...; // user input value

    float4 a, b = [1, 4, -12, 5];

    a.ptr[0] = rand;
    a.ptr[1] = rand + 1;
    a.ptr[2] = rand + 2;
    a.ptr[3] = rand + 3;

    ulong mil;
    StopWatch sw;

    foreach (t; 0 .. testCount)
    {
        sw.start();
        foreach (i; 0 .. 1_000_000)
        {
            a += b;
            b -= a;
        }
        sw.stop();
        mil += sw.peek().msecs;
        sw.reset();
    }

    writeln(a.array, ", ", b.array);
    writeln(cast(double) mil / testCount);

When I run this on my Phenom II X4 920, it completes in ~9ms. For comparison, C# Mono.Simd gets almost identical performance with identical code. However, if I add:

    auto vec4(float x, float y, float z, float w)
    {
        float4 result;

        result.ptr[0] = x;
        result.ptr[1] = y;
        result.ptr[2] = z;
        result.ptr[3] = w;

        return result;
    }

then replace the vector initialization lines:

    float4 a, b = [ ... ];
    a.ptr[0] = rand;
    ...

with ones using my factory function:

    auto a = vec4(rand, rand+1, rand+2, rand+3);
    auto b = vec4(1, 4, -12, 5);

Then the program consistently completes in 2.15ms...

wtf, right? The printed vector output is identical, and there are no changes to the loop code (a += b, etc.). I just change the construction code of the vectors and it runs 4.5x faster. Beats me, but I'll take it. Btw, for comparison, if I use a struct with an internal float4 it runs in ~19ms, and a struct with four floats runs in ~22ms. So you can see my concerns with using core.simd types directly, especially when my Intel Mac gets even better improvements with SIMD code. I haven't done extensive tests on the Intel, but for my original test (the one above, only in C# using Mono.Simd) the results were ~55ms using a struct with an internal float4, and ~5ms using float4 directly.

Anyways, thanks for the feedback.
