Bill Baxter wrote:
On Mon, Feb 23, 2009 at 5:18 PM, Don <[email protected]> wrote:
Mattias Holm wrote:
On 2009-02-21 17:03:06 +0100, Don <[email protected]> said:
I don't think that's messy at all. I can't see much difference between
special support for float[4] versus float4. It's better if the code can take
advantage of hardware without specific support. Bear in mind that SSE/SSE2
is a temporary situation. AVX provides for much longer arrays of vectors;
and it's extensible. You'd end up needing to keep adding on special types
whenever a new CPU comes out.

Note that the fundamental concept which is missing from the C virtual
machine is that all modern machines can efficiently perform operations on
arrays of built-in types of length 2^n, for some small value of n.
We need to get this into the language abstraction. Not follow C++ in
hacking a few extra special types onto the old, deficient C model. And I
think D is actually in a position to do this.

float[4] would be a greatly superior option if it could be done.
The key requirements are:
(1) need to specify that static arrays are passed by value.
(2) need to keep stack aligned to 16.
The good news is that both of these appear to be done on DMD2-Mac!
Yes, float[4] would be ok, if some CPU independent permutation support can
be added. Would this be with some intrinsic then or what? I very much like
the OpenCL syntax for permutation, but I suppose that an intrinsic such as
"float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int
newPos2, int newPos3)" would work as well. Note that this should also work
with double[2], byte[16], short[8] and int[4].
Note that if you had static arrays with value semantics, with proper
alignment, then you could simply create

module std.swizzle;
float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int newPos3);  /* intrinsic */

float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }
// etc

---
and your code would be:

import std.swizzle;

void main()
{
  float[4] t;
  auto u = t.wzyx;
}

I don't think this is terribly difficult once the value semantics are in
place.
(Note that once you get beyond 4 members, the .xyzw syntax gives an
explosion of functions; but I think it's workable at 4; 4! is only 24.
Beyond that point, you'd probably require direct permute calls).
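
For illustration only, here is a minimal sketch of how the std.swizzle idea could be tried in plain D, assuming static arrays are passed and returned by value. The body of permute below is a hypothetical scalar stand-in for the proposed intrinsic, not an existing compiler feature, and it keeps the 1-based positions used in the example above.

```d
// Hypothetical scalar stand-in for the proposed permute intrinsic.
// Positions are 1-based, matching the wzyx/xywz wrappers in the thread.
float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int newPos3)
{
    float[4] result = [vec[newPos0 - 1], vec[newPos1 - 1],
                       vec[newPos2 - 1], vec[newPos3 - 1]];
    return result; // returned by value
}

float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }

void main()
{
    float[4] t = [1.0f, 2.0f, 3.0f, 4.0f];
    auto u = t.wzyx; // free function called with member syntax

    assert(u == [4.0f, 3.0f, 2.0f, 1.0f]);
    assert(t.xywz == [1.0f, 2.0f, 4.0f, 3.0f]);
    assert(t == [1.0f, 2.0f, 3.0f, 4.0f]); // t is unchanged: value semantics
}
```

A real implementation would presumably map permute onto a single shuffle instruction when the positions are compile-time constants; the sketch only shows the intended semantics.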

Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow
repeats like .xxyy.

Yes. Is the syntax sugar actually needed for all the permutations?
Even so, it's still only 256, which is probably still OK. I don't think a language change is required.
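
As a rough sketch of why 256 functions need not be written by hand, the wrappers could be generated at compile time with a string mixin. Everything here is an assumption for illustration: generateSwizzles is a hypothetical helper name, and permute is again a scalar stand-in for the proposed intrinsic with 1-based positions.

```d
// Hypothetical scalar stand-in for the proposed permute intrinsic (1-based positions).
float[4] permute(float[4] vec, int p0, int p1, int p2, int p3)
{
    float[4] r = [vec[p0 - 1], vec[p1 - 1], vec[p2 - 1], vec[p3 - 1]];
    return r;
}

// Builds the source text of all 4^4 = 256 wrappers (xxxx, xxxy, ..., wwww).
// Evaluated at compile time by the mixin below.
string generateSwizzles()
{
    enum string names = "xyzw";
    enum string digits = "1234";
    string code;
    foreach (a; 0 .. 4)
        foreach (b; 0 .. 4)
            foreach (c; 0 .. 4)
                foreach (d; 0 .. 4)
                {
                    string name;
                    name ~= names[a];
                    name ~= names[b];
                    name ~= names[c];
                    name ~= names[d];
                    code ~= "float[4] " ~ name ~ "(float[4] q) { return permute(q, "
                          ~ digits[a] ~ ", " ~ digits[b] ~ ", "
                          ~ digits[c] ~ ", " ~ digits[d] ~ "); }\n";
                }
    return code;
}

mixin(generateSwizzles());

void main()
{
    float[4] t = [1.0f, 2.0f, 3.0f, 4.0f];
    assert(t.wzyx == [4.0f, 3.0f, 2.0f, 1.0f]);
    assert(t.xxyy == [1.0f, 1.0f, 2.0f, 2.0f]); // repeats are allowed
}
```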

This scheme doesn't cover:
* shufps/shufpd where the two sources are different
* haddpd, haddps [SSE3] { double[2] a, b;  a[0]=a[0]+a[1]; a[1]=b[0]+b[1]; }
* non-temporal stores (although I think these are covered adequately by array operations)

and the byte/word operations:

* pack with saturation
* movmsk
* avg
* multiply and add.
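
To make the first two gaps concrete, here is a scalar reference sketch of what they compute. The names hadd and shuffle2 are hypothetical, not existing library functions. hadd follows the haddpd description in the list above; shuffle2 assumes the shufps layout (low two lanes chosen from the first source, high two from the second) and keeps the 1-based positions used earlier in the thread.

```d
// Horizontal add, as described for haddpd above:
// result[0] = a[0] + a[1]; result[1] = b[0] + b[1].
double[2] hadd(double[2] a, double[2] b)
{
    double[2] r = [a[0] + a[1], b[0] + b[1]];
    return r;
}

// Two-source shuffle in the style of shufps (assumed layout):
// the low two lanes are picked from a, the high two from b.
// Positions are 1-based to match the permute examples.
float[4] shuffle2(float[4] a, float[4] b, int a0, int a1, int b0, int b1)
{
    float[4] r = [a[a0 - 1], a[a1 - 1], b[b0 - 1], b[b1 - 1]];
    return r;
}

void main()
{
    double[2] p = [1.0, 2.0];
    double[2] q = [10.0, 20.0];
    assert(hadd(p, q) == [3.0, 30.0]);

    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [5.0f, 6.0f, 7.0f, 8.0f];
    assert(shuffle2(a, b, 1, 2, 3, 4) == [1.0f, 2.0f, 7.0f, 8.0f]);
}
```

A single-source permute cannot express shuffle2 because its result mixes lanes from two different vectors.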

So it looks to me as though, with minimal language changes, we could get almost complete SIMD support, with excellent syntax.
