On 01/07/12 04:27, Martin Nowak wrote:
> __v128 add(__v128 a, __v128 b) pure
> {
>     __v128 res = a;
>     asm (res, b)
>     {
>         ADD res, b;
>     }
>     return res;
> }


> This effectively achieves the same as writing this with intrinsics.
> It also greatly improves the composition of inline asm.

It also allows mixing "ordinary" asm with the SIMD instructions. 
People will do that, because it's easier this way (less typing), and then the 
result is practically unportable, because every compiler would then have to fully 
understand and support that one asm variant.

If you do "__v128 __simd_add(__v128 a, __v128)" instead, you don't lose 
anything; in fact it could be internally implemented with your asm(). But now 
the "real" asm code is separate from the more generic (and sometimes even 
portable) simd ops -- the compiler does not need to understand asm() to be able 
to use it. It can still do every optimization as with the raw asm, and possibly 
more as it knows exactly what's going on. The explicit pure annotations are not 
needed. It has more freedom to choose better scheduling, ordering, sometimes 
instruction selection (if there's more than one alternative) and even various 
code transformations. Even CTFE works.
Consider the case where a lot of your above add()-like functions are inlined 
into another one, which will be a common pattern -- you don't want any false 
dependencies. (If you do care about exact instruction scheduling, you're writing 
asm, not D, so for that case asm() is the better choice.)
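
As a rough sketch of the CTFE point: once the operation is an ordinary function rather than an asm block, the compiler can evaluate chains of it at compile time. The names `v128`, `simdAdd` and `sum3` are hypothetical, and a static array stands in for a real SIMD register, so this only illustrates the shape of the interface, not actual code generation.

```d
// Hypothetical 128-bit vector type; a static array stands in for a register.
struct v128 { float[4] lanes; }

// Hypothetical stand-in for the __simd_add operation discussed above,
// written as plain D so the compiler sees exactly what it does.
v128 simdAdd(v128 a, v128 b)
{
    v128 r;
    foreach (i; 0 .. 4)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// Several add-like helpers composed in one caller, the common pattern.
v128 sum3(v128 a, v128 b, v128 c)
{
    return simdAdd(simdAdd(a, b), c);
}

// Forced compile-time evaluation (CTFE) of the whole chain.
enum total = sum3(v128([1.0f, 2.0f, 3.0f, 4.0f]),
                  v128([10.0f, 20.0f, 30.0f, 40.0f]),
                  v128([100.0f, 200.0f, 300.0f, 400.0f]));
static assert(total.lanes == [111.0f, 222.0f, 333.0f, 444.0f]);

void main() {}
```

An asm() body offers no equivalent here, since the compiler cannot interpret it at compile time.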

I wrote "__v128 __simd_add(__v128 a, __v128)" above, but that was just to keep 
things simple. What you actually want is "vfloat4 __simd_add(vfloat4 a, vfloat4 
b)" etc., i.e. strongly typed.
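
To illustrate what strong typing buys, here is a minimal sketch. `vfloat4` comes from the text above; `vint4`, `simdAdd` and the `lanes` field are assumptions made up for the example, with static arrays standing in for real vector registers.

```d
import std.stdio : writeln;

// Hypothetical strongly typed vector formats (static arrays as stand-ins).
struct vfloat4 { float[4] lanes; }
struct vint4   { int[4] lanes; }

// Hypothetical typed counterpart of __simd_add: accepts only vfloat4.
vfloat4 simdAdd(vfloat4 a, vfloat4 b) pure nothrow @safe
{
    vfloat4 r;
    foreach (i; 0 .. 4)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

void main()
{
    auto f = vfloat4([1.0f, 2.0f, 3.0f, 4.0f]);
    auto n = vint4([1, 2, 3, 4]);

    writeln(simdAdd(f, f).lanes);

    // Mixing formats is rejected at compile time instead of silently
    // reinterpreting the bits, as an untyped __v128 version would.
    static assert(!__traits(compiles, simdAdd(f, n)));
}
```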

Whether this needs to go into the compiler itself depends on only one thing: 
whether it can be done efficiently in a library. Efficiently in this case means 
"zero-cost" or "free".

Having different static types (in addition to the untyped __v(64|128|256) ones) 
gives you not only safety (you don't accidentally end up operating on the 
wrong data/format because you forgot about some version() combination etc.), but 
also allows things like overloading. Then you can write more generic code, 
which works with all available formats. And e.g. changing the precision used by 
some app module involves only changing a few declarations plus data entry/exit 
points, not modifying every single SIMD instruction.
Untyped __v128 only really works for memcpy() type functions; other than that 
it is mainly useful for conversions and passing data etc. - the cases where you 
don't care about the content in transit.
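
The overloading argument could look roughly like this. All names (`vdouble2`, `simdAdd`, `sum3`, `Vec`, `accumulate`) are hypothetical, and static arrays again stand in for real vector registers; the point is only that generic code and a single alias carry the format choice.

```d
// Hypothetical typed vector formats (static arrays as stand-ins).
struct vfloat4  { float[4] lanes; }
struct vdouble2 { double[2] lanes; }

// One overload per format; callers never name the format explicitly.
vfloat4 simdAdd(vfloat4 a, vfloat4 b)
{
    vfloat4 r;
    foreach (i; 0 .. 4)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

vdouble2 simdAdd(vdouble2 a, vdouble2 b)
{
    vdouble2 r;
    foreach (i; 0 .. 2)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// Generic code: works with every format that has a simdAdd overload.
V sum3(V)(V a, V b, V c)
{
    return simdAdd(simdAdd(a, b), c);
}

// A module picks its precision in one declaration; switching to vdouble2
// would only touch this alias and the data entry/exit points.
alias Vec = vfloat4;

Vec accumulate(Vec a, Vec b, Vec c) { return sum3(a, b, c); }

void main()
{
    auto x = Vec([1.0f, 2.0f, 3.0f, 4.0f]);   // data entry point
    assert(accumulate(x, x, x).lanes == [3.0f, 6.0f, 9.0f, 12.0f]);

    auto d = vdouble2([0.5, 0.25]);
    assert(sum3(d, d, d).lanes == [1.5, 0.75]);
}
```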

>> What dmd does do with the inline assembler is it keeps track of which 
>> registers are read/written, so that effective register allocation can be 
>> done for the non-asm code.
> 
> Which is why the compiler should be the one to allocate pseudo-registers.

Yep.

artur
