ni...@lysator.liu.se (Niels Möller) writes: Maybe I'm confused. And I think tradeoffs are different for the double-limb and the single-limb loops.
Or I am confused. I have mostly thought of div1, less of div2. For the single-limb loop, I think special handling of q == 1 means we avoid a division and three independent single-limb multiplies, at the cost of a mispredicted branch, plus some extra unconditional add/compere operations. I would expect that to be a good tradeoff only on machines with quite poor multiplication throughput. I agree; even 64-bit ARM processors with very slow 64b x 64b and moderate branch misprediction penalty probably are better off with the branch reduced code. But what was the saying, "benchmarking is better that blind faith"? :-D For the double-limb loop, special handling of q == 1 also avoids the operations needed to compute the double-limb remainder, which is a higher cost and with deeper dependency path. I need to look deeper into this! > Or perhaps you reason along the lines of "if a CPU is good at small > quotient division, it ought to be good at multiplication, therefore > _METHOD 1 does not need q = 1 code which principally avoids some limb > multiplications"? I think that's a reasonable argument for the single-limb loop. Only it is false. ARM A53, A57, A72, A73, A75 are counter examples. I think multiply performance (at least its throughput) and small quotient friendly division have zero correlation in current CPUs. All four combinations (slow+slow, slow+fast, fast+slow, and fast+fast) exist. > Perhaps there is some confusion about which method is chosen for each > HGCD2_METHOD value? You designated the original method number 2, your > plain / based one number 1. I added lead texts to > https://gmplib.org/devel/thres/HGCD2_METHOD.html for clarity. And I was hoping the original method (#2) could be retired. At the moment, that web page indicates that it's useful only on ia64 aka itanic (14% margin to method 3). How much do we care about itanic performance? Is it worth writing div_11 in itanic asm just to eliminate METHOD 2? The result comes from an old version of FreeBSD with their intentionally obsolete GNU tools and thus compiler support libraries. I tried replacing the libgcc function __udivdi3 with code from gcc 9, and suddenly HGCD2_METHOD becomes 3. Conclusion: The HGCD2_METHOD 2 code in its current form is not useful on any of our platforms. (Perhaps we should wait a few days more as not all hosts have reported results yet.) > I suggest that we rename HGCD2_METHOD to DIV_11_METHOD. Agree? Makes sense. But I think it should still be tuned based on hgcd2 performance. So for speed speed and tuneup we'll get a matrix of hgcd2_${DIV_11_METHOD}_${DIV_22_METHOD} (for tuneup, DIV_22_METHOD likely depends on DIV_11_METHOD, but not the other way around, so we don't have to try all combinations). OK. What about hgcd2_jacobi? Same, same, but different, I presume. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel