Re: __restrict, architecture intrinsics vs asm, consoles, and other

Marco Leise Thu, 22 Sep 2011 11:25:45 -0700

On 22.09.2011 at 19:26, Peter Alexander <[email protected]> wrote:

On 22/09/11 7:39 AM, Don wrote:
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is
C++ using GCC vector extensions):
[snip]
At present, you can't do it without ultimately resorting to inline asm.
But, what we've done is to move SIMD into the machine model: the D
machine model assumes that float[4] + float[4] is a more efficient
operation than a loop.
Currently, only arithmetic operations are implemented, and on DMD at
least, they're still not proper intrinsics. So in the long term it'll be
possible to do it directly, but not yet.

At various times, several of us have implemented 'swizzle' using CTFE,
giving you a syntax like:

float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

which compiles to a single shufps instruction.
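
As a rough illustration of the idea (in C++, since the original question used it, and not the CTFE-based D implementations mentioned above), the sketch below turns a pattern such as "cdcd" into the 8-bit immediate that shufps takes, entirely at compile time. The names lane, shufps_imm and swizzle_ref are hypothetical, and swizzle_ref is only a scalar reference for what the shuffle computes when both source operands are the same vector.

```cpp
#include <array>

// Hypothetical helper: map a swizzle letter to a lane index ('a' -> 0 ... 'd' -> 3).
constexpr unsigned lane(char c) { return static_cast<unsigned>(c - 'a'); }

// Hypothetical helper: pack a four-letter pattern into a shufps-style immediate.
// Bits [1:0] select result lane 0, [3:2] lane 1, [5:4] lane 2, [7:6] lane 3.
constexpr unsigned shufps_imm(const char (&p)[5]) {
    return lane(p[0]) | (lane(p[1]) << 2) | (lane(p[2]) << 4) | (lane(p[3]) << 6);
}

// The pattern is resolved by the compiler, not at run time.
static_assert(shufps_imm("cdcd") == 0xEE, "cdcd selects lanes 2,3,2,3");

// Scalar reference: the result a shufps with this immediate would produce
// if both of its source operands were the same vector y.
template <unsigned Imm>
std::array<float, 4> swizzle_ref(const std::array<float, 4>& y) {
    return {{ y[Imm & 3], y[(Imm >> 2) & 3], y[(Imm >> 4) & 3], y[(Imm >> 6) & 3] }};
}
```

Calling swizzle_ref<shufps_imm("cdcd")>(y) gives the same element mapping as the comment in the D example: x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3].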
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed), then do the shufps, then load back out again.
This is too slow for performance critical code.
Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
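
The difference can be sketched in C++ with the SSE intrinsics from <xmmintrin.h>. This is only an illustration under assumptions: the function names are hypothetical, the SKETCH_HAVE_SSE macro is made up for this example, and the instructions a compiler actually emits depend on the compiler and its optimisation settings. The memory form has to load and store around the shuffle, while the register form takes and returns an __m128 and contains just the shuffle.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1  // made-up macro, only used inside this sketch
#endif

#ifdef SKETCH_HAVE_SSE
// Register form: the value arrives and leaves as an __m128, so the body is
// just the shuffle. _MM_SHUFFLE(3, 2, 3, 2) selects lanes 2,3,2,3 ("cdcd").
static inline __m128 swizzle_cdcd_reg(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));
}
#endif

// Memory form: what an operation on float[4] in memory has to do.
// No 16-byte alignment is assumed, hence the unaligned load and store.
void swizzle_cdcd_mem(const float* y, float* x) {
    assert(x != nullptr && y != nullptr);
#ifdef SKETCH_HAVE_SSE
    __m128 v = _mm_loadu_ps(y);   // unaligned load (typically movups)
    v = swizzle_cdcd_reg(v);      // the shuffle itself
    _mm_storeu_ps(x, v);          // unaligned store (typically movups)
#else
    // Portable fallback so the sketch still builds without SSE.
    const float t[4] = { y[2], y[3], y[2], y[3] };
    for (int i = 0; i < 4; ++i) x[i] = t[i];
#endif
}
```

A chain of operations written against the register form can keep its intermediates in XMM registers, whereas each call through the memory form pays for a load and a store unless the compiler manages to inline and eliminate them.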

I thought about this. Either write long functions, so you don't have to load and unload often, or just make the functions assume that the parameters are in registers without explicit declaration.
