>> IMHO SIMD should be adopted as something like
>>
>> Phase1 ) Reserve register keywords for future use ( including 256 bit
>> ymm)
>>
>> Phase2) Implement SSE non SIMD instructions like mov for Temporal and
>> Normal cache usage , these are not SIMD but normal instructions that
>> just use SIMD registers and are very common in todays libs.
>>
>> Phase3) Fuller SIMD support either around converting to simd
>> idealized forms and still use intrinsics , using SIMD even for ints
>> in a SIMD heavy function eg loop counters ( to save conversion costs)
>> or full language support. The real work is more in the compiler and
>> optimization than the language.
> <snip>
>Part of the problem with
>anything even partial auto-vectorisation is that the language
>expression semantics are often specified in terms of promoting
>subtypes to machine ints/machine uints, doing computations, then
>converting back to the data type.
>
>Regards,
>David Steven Tweed

Yes, this conversion is one of the biggest issues, and intrinsics do very
badly at it since they have little awareness of the surrounding code. While
that is not an issue for high loop count number crunching, it does stop the
intertwining of 128-256 bit register use with ordinary code, say for IPC
small to medium message copying.

This is why I suggest Phase 2: use these registers without tackling the
SIMD problem space, and instead tackle the conversion to and from standard
language forms (and their underlying machine code constructs) for simple
operations on these registers like move. Obviously Phase 2 would have to
consider the eventual use of SIMD on these registers and storage forms. Eg
while a vector form for these registers (and memory pointers) makes sense
for SIMD, is it really appropriate for mov instructions? Is a union a good
solution, and if so how does this union compile to a register? Eg a union
allows loading an XMM with several ints and a long long, which is very
useful for, say, composing IPC messages that bypass the cache, or any small
message. Also, for ymm, loading 8 ints, 16 shorts or 32 bytes gets expensive
if there is only a vector representation.

On these 128/256 bit registers you can do the following on the full 128
(and soon 256) bits; these are not SIMD instructions:
  Loads, sets and stores (including non cache polluting ones)
  Bit scan (bit count, next set bit etc; very nice for GCs and OSes, as
  these bit arrays are often used where performance is critical)
  Bitwise And, Xor, Or, Not
  Shift
  Comparison
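
As a rough sketch of how a few of these whole-register operations look with
today's SSE2 intrinsics in C: the helper names xor128, equal128 and
store128_nt are made up for illustration, and an x86 target with SSE2 is
assumed. Note the comparison is done per byte and then collapsed into one
mask, which is only one possible way to lower a "whole register" compare.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: XOR two 16-byte blocks as one 128-bit value. */
static inline void xor128(const void *a, const void *b, void *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));
}

/* Hypothetical helper: returns 1 if two 16-byte blocks are bit-identical. */
static inline int equal128(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Compare byte by byte, then gather the 16 results into a 16-bit mask. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}

/* Hypothetical helper: non cache polluting (non-temporal) 16-byte store.
   The destination must be 16-byte aligned. */
static inline void store128_nt(void *dst16, const void *src)
{
    assert(((uintptr_t)dst16 & 15) == 0);
    _mm_stream_si128((__m128i *)dst16,
                     _mm_loadu_si128((const __m128i *)src));
    _mm_sfence();  /* make the streamed data visible before continuing */
}
```

None of these touch the lanes arithmetically, which is the point: the
register is treated as one opaque 128 bit value.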

The main things you can't do, which would make it a full GP register, are
add (including inc), subtract, div and multiply, though these are pretty
pointless on 128-256 bits. You could put the loop counter in the fourth
(last) element of the vector and increment it (by adding a 0,0,0,1 vector
held in another reg each time); you can then run your loop via a comparison
just like with normal GP registers. This is efficient where the loop counter
is used in SIMD algorithms. Eg if we limit these regs to 32 bits for non SSE
ops you can do many things that are quite valid and easy for a compiler or
programmer to handle, and we are not even touching SIMD (but note such
constructs improve the way the algorithms look even with intrinsics).
Eg
    for ( xmm i = 0 ; i < loopCount ; i = i + 1)
        RunLoopVariableDependentSIMDAlgorithm(i) ;
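
For comparison, here is roughly what that loop costs to write by hand with
SSE2 intrinsics today. This is only a sketch: sum_indices_xmm is a made-up
stand-in for the loop body, the counter lives in the lowest 32 bit lane, and
the increment is the 0,0,0,1 vector described above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical example: sums 0 .. loop_count-1 while keeping the loop
   counter in an XMM register instead of a GP register. */
static inline int32_t sum_indices_xmm(int32_t loop_count)
{
    assert(loop_count >= 0);
    const __m128i one   = _mm_set_epi32(0, 0, 0, 1);          /* 0,0,0,1 */
    const __m128i limit = _mm_set_epi32(0, 0, 0, loop_count);
    __m128i acc = _mm_setzero_si128();
    __m128i i   = _mm_setzero_si128();                        /* xmm i = 0 */

    /* i < loopCount: only the lowest lane can compare true, so any set bit
       in the mask means "keep looping". */
    while (_mm_movemask_epi8(_mm_cmplt_epi32(i, limit)) != 0) {
        /* Stand-in for the SIMD body: broadcast the counter to all four
           lanes and accumulate it, with no GP to XMM conversion. */
        acc = _mm_add_epi32(acc, _mm_shuffle_epi32(i, 0));
        i = _mm_add_epi32(i, one);                            /* i = i + 1 */
    }
    return _mm_cvtsi128_si32(acc);  /* read back the lowest lane */
}
```

The for ( xmm i ... ) form says the same thing in one line, which is the
readability gain being argued for.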
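Or this
    // pointers/data must be 16 byte aligned
    // size is counted in 32 bit words (assumed) and is a multiple of 4
    int blockMemCopy(void *destination, void *source, int32 size)
    {
        xmm *dest = (xmm*)destination;   // cast the pointer, not its address
        xmm *sour = (xmm*)source;
        int c;
        for (c = 0; c < (size >> 2); c++)   // one xmm holds four 32 bit words
            *dest++ = *sour++;
        return c << 2;   // number of 32 bit words copied
    }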
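
A sketch of the same copy written against the existing SSE2 intrinsics, to
show what the xmm pointer form would have to lower to. The name
block_mem_copy is made up, and size is taken to be a count of 32 bit words,
which is only one reading of the shifts in the pseudocode.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copies `size` 32-bit words in 16-byte blocks.
   Both pointers must be 16-byte aligned and size a multiple of 4.
   Returns the number of 32-bit words copied. */
static inline int32_t block_mem_copy(void *destination, const void *source,
                                     int32_t size)
{
    assert(((uintptr_t)destination & 15) == 0);
    assert(((uintptr_t)source & 15) == 0);
    assert(size >= 0 && (size & 3) == 0);

    __m128i *dest = (__m128i *)destination;
    const __m128i *sour = (const __m128i *)source;
    int32_t c;

    /* One aligned 128-bit load and store per iteration. */
    for (c = 0; c < (size >> 2); c++)
        _mm_store_si128(dest++, _mm_load_si128(sour++));

    return c << 2;
}
```

Swapping _mm_store_si128 for _mm_stream_si128 would give the non cache
polluting variant for message buffers.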
This would be valid
    int a = 12 ;
    xmm xa = a;
This would be invalid
    xmm b = ((long long int) 1 << 32) + 12;
but this would be ok using the union
    xmm b = { 0 , 0 , 1 , 12 };
or
    xmm b = { 0 , ((long long int) 1 << 32) + 12 };
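
One possible mapping of these initializers onto existing SSE2 intrinsics,
as a sketch only. The union xmm_u and the xmm_from_* helpers are names
invented here, and the element order is assumed to be highest first, as in
the { 0 , 0 , 1 , 12 } form.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical union view of one XMM register. */
typedef union {
    __m128i reg;     /* the whole 128-bit register */
    int32_t i32[4];  /* four ints, i32[0] is the lowest lane */
    int64_t i64[2];  /* two long longs, i64[0] is the low half */
} xmm_u;

/* xmm xa = a;  ->  movd: int into the lowest lane, rest zeroed */
static inline __m128i xmm_from_int(int32_t a)
{
    return _mm_cvtsi32_si128(a);
}

/* xmm b = { e3, e2, e1, e0 };  (highest element first) */
static inline __m128i xmm_from_i32x4(int32_t e3, int32_t e2,
                                     int32_t e1, int32_t e0)
{
    return _mm_set_epi32(e3, e2, e1, e0);
}

/* xmm b = { hi, lo };  (two 64-bit halves, highest first) */
static inline __m128i xmm_from_i64x2(int64_t hi, int64_t lo)
{
    return _mm_set_epi64x(hi, lo);
}
```

With this ordering the two union initializers produce the same bits on a
little endian x86: ints 1 and 12 in lanes 1 and 0 read back as the long long
(1 << 32) + 12.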
How hard would it be to map these to the various intrinsics in the compiler?
It may be that over the next 10 years these registers end up being used more
and more, leading to the GP registers being used less (not because 128 bit
is better for these things but due to the conversion costs from GP to XMM).

Phase 3 and full SIMD are not really necessary from a potential performance
point of view (given ideal hand-placed intrinsics), since the biggest issue
for intrinsics is converting to and from the XMM registers. However my gut
feel is that SIMD is getting so complicated that we have a similar situation
(but worse) to 15-20 years ago, when people started to struggle to optimize
code as well as a good compiler (before then it was easily possible). So you
either need an optimizing compiler to generate it from some language
constructs, or absolutely ludicrous amounts of time, and then someone else
points out that you can use some different instructions (or, as you mention,
you don't have the registers or instructions on platform X) and you get a
100% performance gain and look stupid. Still, this is a mighty ambitious
challenge and I would be happy with Phase 2.

Also, compilers are using more and more of this code (esp MMX) without
intrinsics or runtime calls. I caught the Intel compiler using it to zero
some data in a struct (it unrolled the loop, saw it was working on adjacent
regions and then used an SSE2 instruction), and the CLR JIT using it in
some conversion routines.
However, while this (converting code into MMX/SSE) is more of a compiler
problem, there may be a chicken and egg problem here: compilers can't do a
lot more without language support that makes use of it, and languages don't
really want to progress in this area without better compiler support.

Lastly, this is quite interesting: Mono, which is a dog compared to the CLR,
blowing away C++ (not using SIMD) by using a class to build SSE
instructions. C# does the same with Interlocked, and it's quite nice to use
runtime classes which the compiler understands instead of intrinsics. (Note
alignment and packing are set with attributes, also read from the CIL by the
compiler.)
http://tirania.org/blog/archive/2008/Nov-03.html

BTW has anyone read any papers on modern SIMD language constructs, esp
related to x86 (eg not Fortran / multi processing related)?

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev