Re: [A51] CPU implementation (more code)

Sascha Krissler Tue, 03 Nov 2009 18:13:32 -0800

Forwarded to the list from a private message.


> So you now use m128i registers and do instructions on 128 bit wide 

yes

> values? I assume that what you are doing is what is usually called bit 
> slicing?

yes

> For both Cell and SSE2, do you use interlacing of instructions to fill 
> up the pipeline (and usually improve speed by a factor 2 or 3)? (not 
> sure if that really helps on bit sliced calculations though)

I did not pay special attention to instruction scheduling.
The heavy-load loop is unrolled, so gcc has plenty of opportunity
to reorder the instructions.
Also, the loop body is executed ~650 million times per second on a 2 GHz core,
effectively using 3 clocks per iteration for the 7 instructions (unrolled).
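
As a quick sanity check on those figures, the arithmetic can be written out as below. This is only a sketch: the constant and function names are made up for illustration, and the inputs are the approximate numbers quoted above, not new measurements.

```cpp
// Approximate figures taken from the message above.
constexpr double clock_hz = 2.0e9;             // 2 GHz core
constexpr double iterations_per_s = 650.0e6;   // ~650 million loop bodies per second
constexpr int instructions_per_iteration = 7;  // unrolled loop body

// Clock cycles spent on one loop body.
constexpr double cycles_per_iteration() {
    return clock_hz / iterations_per_s;
}

// Average number of instructions retired per clock cycle.
constexpr double instructions_per_cycle() {
    return instructions_per_iteration / cycles_per_iteration();
}
```

With these inputs the loop body costs a little over 3 cycles, which works out to somewhat more than 2 instructions per cycle on average.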

On the Cell SPU it is half as fast despite the higher clock, so I should try to 
optimize there.
(But the SPU is also not fully superscalar, so half the speed is roughly what one would expect.)

With the SPU being in-order, instruction reordering and scheduling 
may pay off.

This function does 90% of the work, instantiated with:
nshift == 0, length = 19 or 22 or 23, RT = ssevector(uint64, uint64)

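template <int base, int length, int nshift, typename RT>
void lsh_reg(RT * regs, RT clock3) {
  // clock3 marks the lanes that shift; clock2 marks the lanes that hold.
  RT clock2 = ~clock3;
  int i;
  // Ascending order, so regs[i + nshift + 1] is read before it is overwritten.
  for (i = base - length + 1; i < base; ++i) {
    regs[i] = (regs[i + nshift] & clock2) | (regs[i + nshift + 1] & clock3);
  }
}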
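
To make the per-lane behaviour of the function above easier to see, here is a minimal portable sketch of the nshift == 0 case. It is not the list code: `Lanes` and `clocked_shift` are hypothetical names, and a plain 64-bit integer (64 lanes) stands in for the 128-bit SSE vector type.

```cpp
#include <cstdint>

// Hypothetical stand-in for the SSE vector type: 64 lanes instead of 128.
// Bit k of regs[i] is bit position i of the k-th independent register.
using Lanes = std::uint64_t;

// Same shape as lsh_reg with nshift == 0: lanes whose clock bit is set take
// the neighbouring slice, all other lanes keep their current value.
template <int base, int length>
void clocked_shift(Lanes* regs, Lanes clock) {
    const Lanes hold = ~clock;
    for (int i = base - length + 1; i < base; ++i) {
        regs[i] = (regs[i] & hold) | (regs[i + 1] & clock);
    }
}
```

Calling `clocked_shift<3, 3>(regs, 0x6)` on a four-element array, for example, updates only regs[1] and regs[2], and within those only lanes 1 and 2 take the value of the next slice.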



Before loop unrolling:

xmm0: clock2
xmm6: ~clock2
720(...): regs[i + nshift]
736(...): regs[i + nshift + 1]
 
L41:
        .loc 2 110 0
        movdqa  720(%ecx,%eax), %xmm1
        movdqa  736(%ecx,%eax), %xmm2
L18:
        .loc 2 111 0
        pand    %xmm0, %xmm1
        pand    %xmm6, %xmm2
        por     %xmm2, %xmm1
        movdqa  %xmm1, 720(%ecx,%eax)
        addl    $16, %eax
        .loc 2 110 0
        cmpl    $288, %eax
        jne     L41

After loop unrolling this pattern repeats:

xmm2: clock2
xmm6: ~clock2
720(...), 736(...): regs[...]

        movdqa  %xmm2, %xmm0
        movdqa  %xmm6, %xmm1
        pand    720(%ebx,%eax), %xmm0
        pand    736(%ebx,%eax), %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, 720(%ebx,%eax)
        leal    32(%edx), %eax
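
For readers who prefer intrinsics to assembly, one slice update from the listings above might be written roughly as follows. This is a sketch assuming an x86 target with SSE2; `update_slice` and its parameter names are hypothetical and not taken from the actual source, and the ssevector type is replaced by a bare `__m128i`.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: one slice update, mirroring the pand/pand/por/movdqa
// sequence. __m128i objects are 16-byte aligned, which movdqa requires.
inline void update_slice(__m128i* regs, int i, __m128i clock2, __m128i not_clock2) {
    __m128i keep = _mm_and_si128(_mm_load_si128(&regs[i]), clock2);          // pand
    __m128i take = _mm_and_si128(_mm_load_si128(&regs[i + 1]), not_clock2);  // pand
    _mm_store_si128(&regs[i], _mm_or_si128(keep, take));                     // por, movdqa
}
```

Whether the compiler emits the load as a separate movdqa or folds it into pand as a memory operand, as in the unrolled listing, is up to its instruction selection.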

