On Friday 25 September 2009 18:36:06 Jason Moxham wrote: > Hi, Try your skills at some bit twiddling !! > > We have the existing MPIR macro > modlimb_invert(i,x) in gmp-impl.h > which calculates the Hensel(2-adic) inverse i of x > > ie) x*i == 1 (mod 2^64) or whatever the limb size is > > Now the existing macro starts out with a table lookup to get the first 8 > bits , which it then refines. Can we do better ? > Assuming the table is in the L1 Cache (how likely is this?) , then a cache > lookup has a latency of 3cycles(AMD,INTEL) 4cycle(nehalem) and ?(others) > > So currently > i=table[(x/2)%128] > we need 5 cycles and a 128byte table , or 4 cycles and a 256byte table. If > we are in a shared library rather than a static one , then we also need a > PIC lookup ( which is at least 2? cycles) > > Using some bit twiddling we can get > > We can get 3bits in 0 cycles > i=x > > 4bits in 2 cycles > unsigned long c=0xF050309070D0B010;i= c>>((x<<2)%64);// %64 is free for > shifts > > 5 bits in 3cycles > unsigned long c=0x2D282D282D282D28;i= x^((c>>(x%64))<<3); > or > unsigned int c=0x2D282D28;i= x^((c>>(x%32))<<3); > > 6 bits in 4 cycles > unsigned long c=0x1459945105598540;i= x^((c>>((x*2)%64))<<3); > > 6 bits in 7 cycles > i=x*(2-x*x); // ie using i=i*(2-x*i) on our 3 bit inverse > > 8bits in 9 cycles > get 4 bits as above , and use i=i*(2-x*i); > > Once we get to 8 bits then we can use the standard i=i*(2-x*i) quadratic > iteration to get to 16,32,64 bits . We can save a one cycle by using a > quartic iteration. > > Can you do any better? > > I'm fairly sure 7bits can be done 5 or less cycles , so I expect it can be > done( go parallel, although for x86_64 variable shifts can't be > parallel(require CX reg for the count), mmx is possible but have to then > reduce the count mod 64 > > Have fun.
unsigned long c=0x1459945105598540; unsigned long k=0x080A80028002A020; i= x^((c>>((x*2)%64))<<3)^( (k>>(x%64))<<6 );} which should give us 7bits in 4 cycles as in assembler we get mov c ,rax mov k,rbx mov x,rcx <----------- latency measured from here (ie when we get x) shr cl,rbx mov rcx,rdx add rcx,rcx shl 6,rbx shr cl,rax xor rdx,rbx shl 3,rax xor rax,rbx -------> result here in rbx mov rbx,i > > Jason > > > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---
