On Friday 25 September 2009 18:36:06 Jason Moxham wrote:
> Hi, Try your skills at some bit twiddling !!
>
> We have the existing MPIR macro
> modlimb_invert(i,x)     in gmp-impl.h
> which calculates the Hensel(2-adic) inverse i of x
>
> ie)   x*i == 1 (mod  2^64) or whatever the limb size is
>
> Now the existing macro starts out with a table lookup to get the first 8
> bits , which it then refines. Can we do better ?
> Assuming the table is in the L1 Cache (how likely is this?) , then a cache
> lookup has a latency of 3cycles(AMD,INTEL) 4cycle(nehalem) and ?(others)
>
> So currently
> i=table[(x/2)%128]
>  we need 5 cycles and a 128byte table , or 4 cycles and a 256byte table. If
> we are in a shared library rather than a static one , then we also need a
> PIC lookup ( which is at least 2? cycles)
>
> Using some bit twiddling we can get
>
> We can get 3bits in 0 cycles
> i=x
>
> 4bits in 2 cycles
> unsigned long c=0xF050309070D0B010;i= c>>((x<<2)%64);// %64 is free for
> shifts
>
> 5 bits in 3cycles
> unsigned long c=0x2D282D282D282D28;i= x^((c>>(x%64))<<3);
> or
> unsigned int c=0x2D282D28;i= x^((c>>(x%32))<<3);
>
> 6 bits in 4 cycles
> unsigned long c=0x1459945105598540;i= x^((c>>((x*2)%64))<<3);
>
> 6 bits in 7 cycles
> i=x*(2-x*x);  // ie using i=i*(2-x*i) on our 3 bit inverse
>
> 8bits in 9 cycles
> get 4 bits as above , and use i=i*(2-x*i);
>
> Once we get to 8 bits then we can use the standard i=i*(2-x*i) quadratic
> iteration to get to 16,32,64 bits . We can save a one cycle by using a
> quartic iteration.
>
> Can you do any better?
>
> I'm fairly sure 7bits can be done 5 or less cycles , so I expect it can be
> done( go parallel, although for x86_64 variable shifts can't be
> parallel(require CX reg for the count), mmx is possible but have to then
> reduce the count mod 64
>
> Have fun.

unsigned long c=0x1459945105598540;
unsigned long  k=0x080A80028002A020;
i= x^((c>>((x*2)%64))<<3)^( (k>>(x%64))<<6 );}

which should give us 7bits in 4 cycles
as in assembler we get
mov c ,rax
mov k,rbx
mov x,rcx
<----------- latency measured from here (ie when we get x)
shr cl,rbx
mov rcx,rdx
add rcx,rcx

shl 6,rbx
shr cl,rax

xor rdx,rbx
shl 3,rax

xor rax,rbx
------->   result here in rbx
mov rbx,i


>
> Jason
>
>
> 

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to