On Sep 28, 2:39 am, Jason Moxham <[email protected]> wrote:
> On Friday 25 September 2009 18:36:06 Jason Moxham wrote:
>
>
>
> > Hi, Try your skills at some bit twiddling !!
>
> > We have the existing MPIR macro
> > modlimb_invert(i,x) in gmp-impl.h
> > which calculates the Hensel(2-adic) inverse i of x
>
> > ie) x*i == 1 (mod 2^64) or whatever the limb size is
>
> > Now the existing macro starts out with a table lookup to get the first 8
> > bits , which it then refines. Can we do better ?
> > Assuming the table is in the L1 Cache (how likely is this?) , then a cache
> > lookup has a latency of 3cycles(AMD,INTEL) 4cycle(nehalem) and ?(others)
>
> > So currently
> > i=table[(x/2)%128]
> > we need 5 cycles and a 128byte table , or 4 cycles and a 256byte table. If
> > we are in a shared library rather than a static one , then we also need a
> > PIC lookup ( which is at least 2? cycles)
>
> > Using some bit twiddling we can get
>
> > We can get 3bits in 0 cycles
> > i=x
>
> > 4bits in 2 cycles
> > unsigned long c=0xF050309070D0B010;i= c>>((x<<2)%64);// %64 is free for
> > shifts
>
> > 5 bits in 3cycles
> > unsigned long c=0x2D282D282D282D28;i= x^((c>>(x%64))<<3);
> > or
> > unsigned int c=0x2D282D28;i= x^((c>>(x%32))<<3);
>
> > 6 bits in 4 cycles
> > unsigned long c=0x1459945105598540;i= x^((c>>((x*2)%64))<<3);
>
> > 6 bits in 7 cycles
> > i=x*(2-x*x); // ie using i=i*(2-x*i) on our 3 bit inverse
>
> > 8bits in 9 cycles
> > get 4 bits as above , and use i=i*(2-x*i);
>
> > Once we get to 8 bits then we can use the standard i=i*(2-x*i) quadratic
> > iteration to get to 16,32,64 bits . We can save a one cycle by using a
> > quartic iteration.
>
> > Can you do any better?
>
> > I'm fairly sure 7bits can be done 5 or less cycles , so I expect it can be
> > done( go parallel, although for x86_64 variable shifts can't be
> > parallel(require CX reg for the count), mmx is possible but have to then
> > reduce the count mod 64
>
> > Have fun.
>
> unsigned long c=0x1459945105598540;
> unsigned long k=0x080A80028002A020;
> i= x^((c>>((x*2)%64))<<3)^( (k>>(x%64))<<6 );}
>
> which should give us 7bits in 4 cycles
> as in assembler we get
> mov c ,rax
> mov k,rbx
> mov x,rcx
> <----------- latency measured from here (ie when we get x)
> shr cl,rbx
> mov rcx,rdx
> add rcx,rcx
>
> shl 6,rbx
> shr cl,rax
>
> xor rdx,rbx
> shl 3,rax
>
> xor rax,rbx
> -------> result here in rbx
> mov rbx,i
>
>
>
> > Jason
Here is a 8bit inverse
unsigned long inv8(unsigned long x)
{unsigned long r8=0x1459945105598540;
unsigned long r9=0x080A80028002A020;
unsigned long r10=0x44A2C4E698DCBADD;
return x^((r8>>((x*2)%64))<<3)^( (r9>>(x%64))<<6 )^( (r10>>((x/2)%64))
<<7 );}
that should take 5 cycles on AMD , which is the same speed as the
current table look up.
mov x,rcx
<--- measure latency from here
shr cl,r9
mov rcx,rax
add rcx,rcx
shl 6,r9
shr cl,r8
shr 2,rcx
xor r9,rax
shl 3,r8
shr cl,r10
xor r8,rax
shl 7,r10
xor r10,rax
<--- result here in rax
Core2 only has 2 executions units which can do shifts so we lose 1
cycle , I expect we can re-arrange it to overcome this. There is still
a lot of symmetry in the bit patterns so perhaps this can be improved
some more.
Jason
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"mpir-devel" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---