It looks like my suggestions for gmp-impl.h are unfortunately cpu-specific. So I can no longer recommend changing it. The following notes may help with cpu specific assembler routines.
In assembler, I suggest the following 1). Replace a divide by 2 and a mask of 0x7F, with load 8-bit register and shift right 2). If the binvert_limb_table is not available, create a 4-bit inverse from the formula for a 5-bit inverse, since the 5-bit inverse is simpler and faster, namely inv5 = (3*n ^ 2) where '^' is the xor function. Discussing the AMD A8-5500 only:- I added options for binvert_limb to 'speed' program on cpu type AMD A8-5500. Results: overhead 5.81 cycles, precision 1410065408 units of 7.14e-10 secs, CPU freq 1400.00 MHz binvert_limb binvert_limb_htype binvert_limb_orig binvert_limb_5bit binvert_limb_4bit 1 45.74 #0.8200 0.9592 0.9724 0.9810 Fastest at 0.82 is u64 __inv; UHWtype __hinv = binvert_limb_table[(__n & 0xFF)/2]; __hinv *= 2 - __hinv * (UHWtype)__n; /* 16 */ __hinv *= 2 - __hinv * (UHWtype)__n; /* 32 */ __inv = __hinv * (2 - __hinv * __n); /* 64 */ Good is u64 __inv; UHWtype __hinv = binvert_limb_table[(__n & 0xFF)/2]; __hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 16 */ __hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 32 */ __inv = 2 * (mp_limb_t)__hinv - (mp_limb_t)__hinv * (mp_limb_t)__hinv * __n; /* 64 */ In the absence of binvert_limb_table, it is quicker to calculate a 5-bit inverse and pretend it is 4-bits UHWtype __hinv = 3 * (UHWtype) __n ^ 2; /* caret is xor function. 5-bits */ __hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 8 of 10 bits */ John Gatrell _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel