ni...@lysator.liu.se (Niels Möller) writes:

  Right, mpn_gcd usually ends with a call to mpn_gcd_1 or gcd_2, and the
  latter also usually ends with a call to gcd_1. But I think it's easiest
  to leave that as is until we have a good gcd_22.

OK, but let's not forget about it!

  > I expect asm gcd_1 to disappear as the C code should be equivalent. Do
  > you agree?

  That would be nice. There will be one more function call, but hopefully
  that's not going to be a significant performance regression.

Perhaps it would be worthwhile to do a tailcall for when the leftshift
is 0? I think most gcd calls, irrespective of operand size, will not
have any factor of two, let alone a common one!

  > I suppose some hardwired stuff for the case u >> v (not bitshift, the
  > mathematical meaning of >>!) might want to be parameterised and also
  > ideally tune/tuneup'ed.

  Should that be done by gcd_1 only? Or do we need some variant of gcd_11
  with an initial division?

I suppose we should have a lowest level without that (conditional or
unconditional) and that that level is gcd_11.

  My workstation (Intel Broadwell) takes 96 cycles for a division (if I
  read https://gmplib.org/~tege/x86-timing.pdf correctly; or is there a
  faster 64/64 div instruction?). And gcd_11 runs at roughly 4 cycles per
  input bit according to speed. So then the threshold should be around 24.

Modern Intel processors have different code for 128b/64b and 64b/64b
division. The former is very slow while the latter apparently uses a
high-radix SRT. AMD Ryzen has fast 128b/64b (SRT-4 I think).

  Below is a patch to add a gcd_11 entrypoint for this arch. Passes make
  check, but would be good to also test with devel/try.

Ah, so you suggest that we keep asm gcd_1 after all. I was in the
editing process of removing them! :-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
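
For illustration only, here is a minimal C sketch of the layering discussed
in this message: a lowest-level gcd_11 that takes two odd operands and never
divides, and a gcd_1-style wrapper that strips factors of two, does one
initial division when u is much larger than v, and returns the gcd_11 call
directly when the common shift is 0. This is not GMP's code. The names
gcd_11_sketch, gcd_1_sketch, bit_length, trailing_zeros and
GCD_1_DIV_THRESHOLD_BITS are hypothetical. The value 24 is just 96 cycles
per division divided by 4 cycles per input bit, the Broadwell estimate
quoted above, read as a break-even difference in operand bit sizes; a real
value would presumably come from tune/tuneup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning constant: 96 cycles per division divided by
   4 cycles per bit for gcd_11.  An assumption, not a tuned value. */
#define GCD_1_DIV_THRESHOLD_BITS 24

/* Number of significant bits in x; returns 0 for x == 0. */
static int bit_length(uint64_t x)
{
  int n = 0;
  while (x != 0) { n++; x >>= 1; }
  return n;
}

/* Number of trailing zero bits in x.  Requires x != 0. */
static int trailing_zeros(uint64_t x)
{
  int n = 0;
  while ((x & 1) == 0) { x >>= 1; n++; }
  return n;
}

/* Lowest level: both operands odd, no initial division.  Plain binary
   gcd: subtract the smaller from the larger, strip the twos, repeat. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
  assert((u & v & 1) == 1);
  while (u != v)
    {
      if (u > v) { u -= v; u >>= trailing_zeros(u); }
      else       { v -= u; v >>= trailing_zeros(v); }
    }
  return u;
}

/* Wrapper in the spirit of gcd_1 for single-word nonzero operands. */
static uint64_t gcd_1_sketch(uint64_t u, uint64_t v)
{
  assert(u != 0 && v != 0);
  int cu = trailing_zeros(u), cv = trailing_zeros(v);
  int shift = cu < cv ? cu : cv;          /* common power of two */
  u >>= cu;
  v >>= cv;
  if (u < v) { uint64_t t = u; u = v; v = t; }

  /* u >> v in the mathematical sense: one division is assumed to be
     cheaper than many subtract-and-shift steps. */
  if (bit_length(u) - bit_length(v) > GCD_1_DIV_THRESHOLD_BITS)
    {
      u %= v;
      if (u == 0)
        return v << shift;
      u >>= trailing_zeros(u);            /* v is odd, gcd unchanged */
    }

  if (shift == 0)
    return gcd_11_sketch(u, v);           /* call in tail position */
  return gcd_11_sketch(u, v) << shift;
}
```

Whether the shift == 0 branch actually becomes a jump depends on the
compiler and optimization level; the sketch only puts the call in tail
position.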