ni...@lysator.liu.se (Niels Möller) writes: I've had a look at the latest gcd_11 asm, and it's really neat, including naturally getting %rdx zero on return.
One question: the bd2 and bd4 versions use L(top): rep;bsf %rdx, %rcx C tzcnt! I've not seen this before, but a quick web search indicates that tzcnt is the same as bsf, except that it has a well defined result also when the input is zero. But in these loops, we should get to this instruction only for non-zero %rdx. So are there any other subtleties? That, and that the flags are set quite differently. But the code cares neiter about the flags nor aboubt what might or might not happen for zero input. The really weird thing is that tzcnt, encoded identically to rep;bsf is much faster than a bare bsf on AMD platforms. One would think (1) an instruction B which works like instruction A except that B has some additional corner case requirements would not be faster than A, and (2) why didn't they just make instruction A faster and also defined it for 0?, and (3) why didn't they make bsf faster too, it should be trivial. You might say "but thanks to rep;bsf we KNOW that it is well-defined for 0". Good point. Except that rep;bsf will run just like bsf on any older x86 CPU out there, i.e. be undefined for 0. Therefore safe use of rep;bsf requires knowldge of whether this instruction is specifically handled as tzcnt (which asks for e.g. running the cpuid instruction). Conclusion: silently making bsf well-defined is really the same as silently making rep;bsf well-defined (except that the latter has a byte longer encoding). The performance difference is huge. E.g., AMD Ryzen can execute 6 times more rep;bsf than plain bsf per cycle, each with 2/3 of the latency. See also: https://gmplib.org/~tege/x86-timing.pdf -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel