On Fri, Aug 13, 2010 at 9:40 PM, Ben Kloosterman <[email protected]> wrote:
> Agreed, it should wait until the current version is out, but you
> wrote:
>
>> I'm also fairly confident that we are in no worse a position than anybody
>> else.
>
> In C you can often use asm {} for this; intrinsics are OK when the
> output registers are already in the right format, but they do badly at
> the edges, e.g.
>
> const __m128i Reg0 = _mm_set_epi32(i, i, i, i);
Any reason not to use _mm_set1_epi32 here?
>
> //compiler does this
> //movd xmm0, esi
> //movd xmm1, esi
> //movd xmm2, esi
> //movd xmm3, esi
> //mov DWORD PTR _prt$8074[ebp], eax //load ptr address
> //punpckldq xmm2, xmm0
> //punpckldq xmm3, xmm1
> //punpckldq xmm3, xmm2
>
> Not so good: i is a loop counter (esi).
>
> _mm_stream_si128((__m128i *)prt, Reg0);
>
> //movntdq XMMWORD PTR [eax], xmm3
>
> Good.
>
> This forces asm. I wouldn't be surprised if in C the majority of these
> algorithms do the SIMD parts in asm instead of intrinsics. And asm{} is
> not part of BitC.
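For what it's worth, the quoted fill pattern can be written entirely with intrinsics, no asm{} block required. A sketch (my own illustration, assuming a 16-byte-aligned destination; `stream_fill` is a hypothetical name):

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stdint.h>

/* Broadcast the loop counter and store it with a non-temporal
   (cache-bypassing) movntdq, as in the quoted example.
   dst must be 16-byte aligned; n16 is the number of 16-byte blocks. */
void stream_fill(int32_t *dst, int n16)
{
    for (int i = 0; i < n16; ++i) {
        const __m128i v = _mm_set1_epi32(i);
        _mm_stream_si128((__m128i *)(dst + 4 * i), v);
    }
    _mm_sfence();  /* order non-temporal stores before the data is reused */
}
```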
Being philosophical, one of the worst things about a lot of languages
(ironically, other than C/C++) is that they basically give the
programmer two choices: either work in the language's high-level vector
extension (really a research work-in-progress at detecting when
instructions like abs_diff can be used), or else do absolutely
everything (scheduling, register allocation, etc.) in assembly. The
unique selling point of intrinsics is that they let the programmer say
"this instruction performs the mathematical mapping I want on my data"
without requiring him to know the instruction latencies or compute the
register allocation for the particular machine he's compiling for. (If
you look at, e.g., the VP8 mailing lists you'll see that they're
manually scheduling the inline assembly for Atom chips.) For algorithms
of even mid-range complexity that's a daunting task even for one CPU
architecture, let alone if the code is supposed to support, say, both
x86 and ARM. Certainly, while I'm familiar with both the x86 SSE and
ARM NEON SIMD instruction sets, I'm not remotely familiar with the rest
of either instruction set. (By mid-complexity I mean something like
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8, _mm_abs_epi8, _mm_maddubs_epi16
#include <cstdint>

typedef __m128i s8w;
typedef __m128i u8w;
typedef __m128i s16w;
typedef __m128i s32w;

void
buildForegroundBitmask(uint32_t ** __restrict fgBitmap,
                       uint8_t ** __restrict raw,
                       int rowMax,
                       int colMax,
                       s8w * __restrict q,
                       u8w * __restrict c,
                       u8w * __restrict rearrangeMask,
                       int32_t scalarThreshold)
{
    // copy ellipsoid coefficients into SIMD vectors ---------------------------
    s8w q0_a=q[0];//1 s8w^2
    s8w q1_a=q[1];//2 s8w^2
    s8w q_b=q[2];//3 s8w^2
    u8w c0_a=c[0];//4 u8w^2
    u8w c1_a=c[1];//5 u8w^2
    u8w c_b=c[2];//6 u8w^2
    u8w rearr0=rearrangeMask[0];//7 u8w^2
    u8w rearr1=rearrangeMask[1];//8 u8w^2
    u8w rearr2=rearrangeMask[2];//9 u8w^2
    s32w threshold=_mm_set1_epi32(scalarThreshold);//10 s32w <- int32_t
    int i=0;
    do{
        // pos is 1-D array index for every second row -------------------------
        int j=0;
        do{
            uint32_t mask=0;
            int jP=j;
            int shift=0;
            do{
                // test if pixel-values lie within either ellipsoid ------------
                // we read 16 bytes but only use first 12 ----------------------
                u8w *channelPtr=reinterpret_cast<u8w*>(&raw[i][3*jP]);//11 u8w <- uint8_t
                u8w b=_mm_loadu_si128(channelPtr);//12 u8w^2
                u8w e0_a=_mm_shuffle_epi8(b,rearr0);//13 u8w^3
                u8w e1_a=_mm_shuffle_epi8(b,rearr1);//14 u8w^3
                u8w e_b=_mm_shuffle_epi8(b,rearr2);//15 u8w^3
                // make positions relative to ellipsoid centre -----------------
                s8w t0=_mm_subs_epi8(e0_a,c0_a);//16 s8w <- u8w x u8w
                e0_a=_mm_abs_epi8(t0);//17 u8w <- s8w
                s8w t1=_mm_subs_epi8(e1_a,c1_a);//18 s8w <- u8w x u8w
                e1_a=_mm_abs_epi8(t1);//19 u8w <- s8w
                s8w t2=_mm_subs_epi8(e_b,c_b);//20 s8w <- u8w x u8w
                e_b=_mm_abs_epi8(t2);//21 u8w <- s8w
                // compute quadratic form --------------------------------------
                s16w t3=_mm_maddubs_epi16(e0_a,q0_a);//22 s16w <- u8w x s8w
                s16w t4=_mm_maddubs_epi16(e1_a,q1_a);//23 s16w <- u8w x s8w
                s16w t5=_mm_maddubs_epi16(e_b,q_b);//24 s16w <- u8w x s8w
                s32w t6=static_cast<s32w>(t5);//25 s32w <- s16w
                s32w t7=_mm_srli_epi32(t6,16);//26 s32w <- s32w x int32_t
                s16w t8=static_cast<s16w>(t7);//27 s16w <- s32w
                s32w t9=_mm_slli_epi32(t6,16);//28 s32w <- s32w x int32_t
                s16w t10=static_cast<s16w>(t9);//29 s16w <- s32w
                s32w t11=_mm_madd_epi16(t3,t3);//30 s32w <- s16w x s16w
                s32w t12=_mm_madd_epi16(t8,t8);//31 s32w <- s16w x s16w
                s32w t13=_mm_add_epi32(t12,t11);//32 s32w^3
                s32w t14=_mm_cmplt_epi32(t13,threshold);//33 s32w^3
                s32w t15=_mm_madd_epi16(t4,t4);//34 s32w <- s16w x s16w
                s32w t16=_mm_madd_epi16(t10,t10);//35 s32w <- s16w x s16w
                s32w t17=_mm_add_epi32(t15,t16);//36 s32w^3
                s32w t18=_mm_cmplt_epi32(t17,threshold);//37 s32w^3
                s32w minQuadInAnEllipse=_mm_or_si128(t14,t18);//38 s32w^3
                // convert from SIMD to 4-bit bitmask --------------------------
                uint32_t rawMask=_mm_movemask_epi8(minQuadInAnEllipse);//39 uint32_t <- s32w
                rawMask=rawMask & 0x1111;
                rawMask=(rawMask>>3)|rawMask;
                uint32_t mask4=((rawMask>>6)|rawMask) & 0xF;
                // append mask fragment to foreground mask ---------------------
                mask |= mask4<<shift;
                jP+=4;
                shift+=4;
            }while(shift<32);
            // dilate horizontally to smooth out noise a little bit ------------
            fgBitmap[i][j/32] = mask | (mask>>1) | (mask<<1);
            j+=32;
        }while(j<colMax);
        i+=1;
    }while(i<rowMax);
}
)
> IMHO SIMD should be adopted as something like:
>
> Phase 1) Reserve register keywords for future use (including 256-bit
> ymm).
>
> Phase 2) Implement SSE non-SIMD instructions like mov for non-temporal
> and normal cache usage; these are not SIMD but normal instructions
> that just use SIMD registers, and are very common in today's libs.
>
> Phase 3) Fuller SIMD support: either converting to SIMD idealized
> forms while still using intrinsics, using SIMD even for ints in a
> SIMD-heavy function, e.g. loop counters (to save conversion costs), or
> full language support. The real work is more in the compiler and
> optimization than the language.
Smartbook/netbook Atom CPUs have only 8 SIMD registers, so using one
for the loop counter will have a cost. Part of the problem with even
partial auto-vectorisation is that the language's expression semantics
are often specified in terms of promoting subtypes to machine
ints/machine uints, doing the computation, then converting back to the
data type. So the programmer has to figure out a way to write
expressions such that the compiler can prove that doing the operations
at the native type via SIMD instructions will produce the same result
as the "scalar arithmetic semantics". Arguably this is a deficiency of
the C arithmetic semantics in the modern world, but it's one of the
causes of failure to auto-vectorise in gcc (and sometimes icc, I
gather).
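A concrete (hypothetical) illustration of the promotion problem: averaging two byte arrays is specified in C as int arithmetic, so the compiler may only substitute a packed-byte instruction like pavgb if it can prove the widened intermediate doesn't change the stored result.

```c
#include <stdint.h>
#include <stddef.h>

/* C promotes a[i] and b[i] to int before the add, so the sum is exact
   at 9 bits; the compiler must prove the final (uint8_t) truncation
   makes an 8-bit packed average (pavgb) equivalent before it can
   vectorise.  Phrasing the expression this way keeps that proof easy;
   other formulations of the same average often defeat it. */
void average_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```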
Regards,
David Steven Tweed
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev