ni...@lysator.liu.se (Niels Möller) writes:

  $ ./tune/speed -c -s1 -p100000 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
  overhead 6.02 cycles, precision 100000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
     mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 mpn_hgcd2_4 mpn_hgcd2_5 mpn_hgcd2_binary
  1     #1668.90     1863.72     1670.73     1757.54     1738.50          2044.25

Had a look at the disassembly for the binary algorithm. The
double-precision loop needs 20 instructions for just the conditional
swap logic, 23 for the clz + shift + subtract, and 8 for the shift+add
updates of the u matrix.

Perhaps keep the to-be-swapped variables in two structs, and
conditionally swap pointers to the structs instead?

Some measurements with methods 4 and 5 are now in. Modern Intel CPUs
like method 5, as I had expected.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
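
The struct idea raised above could look roughly like the following C
sketch. It is only an illustration, not code from GMP: the struct
hgcd_row, its field names, and the functions row_less and order_rows
are all hypothetical, and 64-bit limbs are assumed. Instead of swapping
several variables, the two rows live in an array and a one-bit index
says which row is currently the larger one, so the "swap" is a single
xor on the index.

```c
#include <stdint.h>

/* Hypothetical grouping of the variables that get swapped together:
   a double-limb value and the matching row of the u matrix. */
struct hgcd_row
{
  uint64_t hi, lo;   /* double-precision value, high and low limb */
  uint64_t u0, u1;   /* one row of the u matrix */
};

/* Returns 1 if x < y when (hi, lo) is read as a 128-bit value, else 0.
   Written with bitwise operators so no branch is required. */
static unsigned
row_less (const struct hgcd_row *x, const struct hgcd_row *y)
{
  return (unsigned) ((x->hi < y->hi)
                     | ((x->hi == y->hi) & (x->lo < y->lo)));
}

/* r[big] is assumed to be the larger row. Returns the index of the row
   that really is the larger one: the index flips exactly when
   r[big] < r[big ^ 1]. The structs themselves are never copied. */
static unsigned
order_rows (const struct hgcd_row r[2], unsigned big)
{
  return big ^ row_less (&r[big], &r[big ^ 1]);
}
```

The loop body would then use &r[big] and &r[big ^ 1] as the two struct
pointers. Whether this beats the 20 instructions counted above depends
on the compiler and the target, since it trades register moves for
indexed memory accesses; it would need to be measured.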