Hi, Try your skills at some bit twiddling !! We have the existing MPIR macro modlimb_invert(i,x) in gmp-impl.h which calculates the Hensel(2-adic) inverse i of x
ie) x*i == 1 (mod 2^64) or whatever the limb size is Now the existing macro starts out with a table lookup to get the first 8 bits , which it then refines. Can we do better ? Assuming the table is in the L1 Cache (how likely is this?) , then a cache lookup has a latency of 3cycles(AMD,INTEL) 4cycle(nehalem) and ?(others) So currently i=table[(x/2)%128] we need 5 cycles and a 128byte table , or 4 cycles and a 256byte table. If we are in a shared library rather than a static one , then we also need a PIC lookup ( which is at least 2? cycles) Using some bit twiddling we can get We can get 3bits in 0 cycles i=x 4bits in 2 cycles unsigned long c=0xF050309070D0B010;i= c>>((x<<2)%64);// %64 is free for shifts 5 bits in 3cycles unsigned long c=0x2D282D282D282D28;i= x^((c>>(x%64))<<3); or unsigned int c=0x2D282D28;i= x^((c>>(x%32))<<3); 6 bits in 4 cycles unsigned long c=0x1459945105598540;i= x^((c>>((x*2)%64))<<3); 6 bits in 7 cycles i=x*(2-x*x); // ie using i=i*(2-x*i) on our 3 bit inverse 8bits in 9 cycles get 4 bits as above , and use i=i*(2-x*i); Once we get to 8 bits then we can use the standard i=i*(2-x*i) quadratic iteration to get to 16,32,64 bits . We can save a one cycle by using a quartic iteration. Can you do any better? I'm fairly sure 7bits can be done 5 or less cycles , so I expect it can be done( go parallel, although for x86_64 variable shifts can't be parallel(require CX reg for the count), mmx is possible but have to then reduce the count mod 64 Have fun. Jason --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---
