ni...@lysator.liu.se (Niels Möller) writes:

> The below implementation appears to pass tests, and give a modest
> speedup of 0.2 cycles per input bit, or 2.5%. Benchmark, comparing C
> implementations of gcd_11 and gcd_22:

Beware of "turbo" when counting cycles! (Relative measurements like gcd_11 vs gcd_22 should be fine!)

The speed difference between C gcd_11 and gcd_22 is surprisingly small! Perhaps gcd_11 should be rewritten in the style of gcd_22?

I did not provide a top-level gcd_22 for x86_64, as you might have seen. The one similar to x86_64/gcd_11.asm is probably x86_64/k8/gcd_22.asm. Perhaps it should be moved. But as far as I can tell, that function is slower than your C gcd_22 on some platforms, such as Intel Haswell.

I'm curious if your C code could be made into competitive asm. One can usually beat the compiler by some 10-30%.

Measurements for gcd_11/22 for most of our machines are in. See https://gmplib.org/devel/tm/gmp/date.html and click on any HOSTgentoo64 tuneup link. Scroll down; after the normal *_THRESHOLD stuff come comparative measurements of asm code. (The mpn/generic code is not usually measured; the exception is when it appears in the default column. I plan to fix this some day, and have a few columns "gcc -O", "gcc -Os", "gcc -O2".)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
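
For readers unfamiliar with the routines compared in this thread: gcd_11 is, as I understand it, the single-limb case, taking two odd limbs and reducing them with a binary subtract-and-shift loop. The following is only a minimal sketch of that idea and not the code in mpn/generic or any of the asm files. The name gcd_11_sketch and the use of uint64_t as the limb type are assumptions made for illustration, and the trailing-zero removal is a plain loop rather than the count_trailing_zeros style one would want in tuned code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration, not GMP's mpn_gcd_11.
   Returns gcd(u, v) for two odd 64-bit values. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
    /* Assumed precondition: both inputs are odd (and therefore nonzero). */
    assert((u & 1) && (v & 1));

    while (u != v) {
        uint64_t t;

        if (u > v) {
            t = u - v;      /* t = |u - v|, v is already min(u, v) */
        } else {
            t = v - u;      /* t = |u - v| */
            v = u;          /* v becomes min(u, v) */
        }

        /* The difference of two distinct odd numbers is even and nonzero,
           so stripping factors of two terminates and leaves t odd. */
        while ((t & 1) == 0)
            t >>= 1;

        u = t;
    }
    return u;
}
```

Calling gcd_11_sketch(9, 15) goes through u=3, v=9 and then u=3, v=3, returning 3. The branch on u > v is exactly the kind of thing tuned C or asm would replace with masks or conditional moves, which is where the cycle counts discussed above are won or lost.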