I'm a bit surprised that the METHOD 3 div1 didn't yield more speedup. The hgcd2 code is almost content with a 25 cycle / operation, which is really, really odd to me.
The quotient is < 8 in over 80% of the time. It is < 4 almost 70% of the time. Perhaps we should aim for handling just < 4 to save some time. Then, on the other hand, we could make a METHOD 3 div1 in assembly which should be somewhat quicker. So if out efforts to improve div1 doesn't make wonders, perhaps we should look into a better variant of Euclid's algorithm for developing the quotient chain? To start with, we could compare a mod b with b - a mod b and choose whichever is smaller. Then we could compare (a mod b)/2^s with (b - a mod b)/2^t where s and t are the number of trailing bits in respective expression. Reducing the number of overall hgcd2 iterations should really make a difference. But perhaps this would make the 2x2 matrix computation harder, and require signed numbers, which would in turn complicate things too much? We could then perhaps find some middle ground which doesn't try to find the absolutely minimal remainder, but one which avoids negative numbers which hurts... (I haven't studied hgcd2 well enough to answer these questions myself, obviously.) -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel