On 22.09.2011 at 19:26, Peter Alexander
<[email protected]> wrote:
On 22/09/11 7:39 AM, Don wrote:
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is
C++ using GCC vector extensions):
[snip]
At present, you can't do it without ultimately resorting to inline asm.
But, what we've done is to move SIMD into the machine model: the D
machine model assumes that float[4] + float[4] is a more efficient
operation than a loop.
Currently, only arithmetic operations are implemented, and on DMD at
least, they're still not proper intrinsics. So in the long term it'll be
possible to do it directly, but not yet.
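For illustration, a minimal sketch of the array-operation syntax described above. The function name addLanes is hypothetical, and whether the compiler turns the operation into a single SIMD add depends on the compiler and version; the code only shows the element-wise semantics.

```d
import std.stdio;

// Hypothetical helper: element-wise sum of two 4-float arrays,
// written as a D array operation instead of an explicit loop.
float[4] addLanes(const float[4] a, const float[4] b)
{
    float[4] r;
    r[] = a[] + b[];   // array operation over all four elements
    return r;
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [10.0f, 20.0f, 30.0f, 40.0f];
    writeln(addLanes(a, b));   // prints the element-wise sums
}
```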
At various times, several of us have implemented 'swizzle' using CTFE,
giving you a syntax like:
float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
which compiles to a single shufps instruction.
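For readers who have not seen one, a swizzle of this kind might be sketched roughly as follows. This is not any of the implementations mentioned above: the names swizzle and validPattern are assumptions, the pattern is only checked at compile time via CTFE, and the body is plain element copying, so nothing here guarantees a shufps.

```d
import std.stdio;

// Hypothetical CTFE check: the pattern must be four letters in 'a'..'d'.
bool validPattern(string p)
{
    if (p.length != 4)
        return false;
    foreach (c; p)
        if (c < 'a' || c > 'd')
            return false;
    return true;
}

// Hypothetical swizzle: letter k of the pattern names the source lane
// for result element k (a = lane 0, b = 1, c = 2, d = 3).
float[4] swizzle(string pattern)(const(float)[] v)
    if (validPattern(pattern))   // evaluated at compile time
{
    assert(v.length == 4, "swizzle expects a 4-element slice");
    float[4] r;
    foreach (i; 0 .. 4)
        r[i] = v[pattern[i] - 'a'];   // lane indices are compile-time constants
    return r;
}

void main()
{
    float[4] y = [10.0f, 11.0f, 12.0f, 13.0f];
    float[4] x = y[].swizzle!"cdcd"();
    // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
    writeln(x);
}
```

An invalid pattern such as "abce" fails the template constraint, so the mistake is reported at compile time rather than at run time.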
How can it compile into a single shufps? x and y would need to already
be in vector registers, and unless I've missed something, they won't be.
You'll need instructions for loading into registers (using the slow
movups because 16-byte alignment isn't guaranteed), then the shufps,
then a store back out to memory.
This is too slow for performance critical code.
Being stored in XMM registers from creation, passed and returned in XMM
registers to/from functions is a key requirement for this sort of code.
If you have to keep loading in and out of memory then you lose all
performance.
I thought about this. Either write long functions, so you rarely have to
load and store, or have the functions assume that their parameters are
already in registers, without any explicit declaration.