t...@gmplib.org (Torbjörn Granlund) writes: > Should we rename div1 to put it into the GMP name space? How about > mpn_div_11? (We could encode in its name that it is for use for small > q, but I'd suggest that we don't do that.) > > That would allow for some assembly experiments.
If you think assembly will make a big difference (which seems likely), that makes sense. And define it only when it's faster than a a div instruction generated by the compiler? > Taking out powers of two makes things complicated. Taking out a factor > of two from one of the numbers corresponds to a matrix (1,0;0,2) with > determinant 2. So the inverse is not an integer matrix, but an integer > matrix + a shift count. > > Would a shift count hurt? On second thought, I think there's a bigger problem: The way hgcd is used, it's called on the most significant part of some numbers, but applied to larger numbers. And a matrix based on trailing zeros in hgcd won't be applicable to the larger numbers, since they have different lowend bits. > It would be interesting to try some left-to-right binary hgcd. Logic > should be something like > > count_leading_zeros(a_bits, a); > count_leading_zeros(b_bits, b); > > if (a_bits == b_bits) > (a, b) <-- (min(a,b), |a-b|) > else if a_bits > b_bits > a <-- |a - b * 2^{a_bits - b_bits}| > else > b <-- |b - a * 2^{b_bits - a_bits}| > > OK! This is super interesting! I think it may work nicely on machines with slow multiplication as well as small-quotient division. But will be some overhead to make branch-free. I haven't tried it seriously yet. We'll sometimes get determinant == -1, so users of hgcd2 would need to be updated to accept that. Or maybe one can just do a swap to ensure that we have det == +1 in all cases. E.g., a <-- 2b - a corresponds to the matrix (-1, 2; 0,1) with determinant -1. But if we swap a and b, and set (a, b) <-- (b, 2b - a) that's the matrix (0,1; -1, 2) with determinant +1. In the mean time, I've updated the replacement of div2. See attached patch, using div1 except when HGCD2_METHOD == 2. Timing on my old laptop: $ ./tune/speed -c -p1000000 -s1 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 overhead 6.00 cycles, precision 1000000 units of 8.33e-10 secs, CPU freq 1200.00 MHz mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3 1 1492.62 2064.18 #1375.18 So this is 50% speedup over the old method. Regards, /Niels
diff -r 8c0120dc67b0 mpn/generic/hgcd2.c --- a/mpn/generic/hgcd2.c Thu Sep 05 21:25:39 2019 +0200 +++ b/mpn/generic/hgcd2.c Sat Sep 07 23:59:38 2019 +0200 @@ -159,7 +159,9 @@ #error Unknown HGCD2_METHOD #endif -/* Two-limb division optimized for small quotients. */ +#if HGCD2_METHOD == 2 +/* Two-limb division optimized for small quotients, similar to + corresponding div1 above. */ static inline mp_limb_t div2 (mp_ptr rp, mp_limb_t nh, mp_limb_t nl, @@ -262,6 +264,44 @@ } #endif +#else /* HGCD2_METHOD != 2 */ +/* Used only for large quotients */ +static mp_limb_t +div2 (mp_ptr rp, + mp_limb_t n1, mp_limb_t n0, + mp_limb_t d1, mp_limb_t d0) +{ + mp_limb_t n2, q, t1, t0; + int c; + + /* Normalize */ + count_leading_zeros (c, d1); + + n2 = n1 >> (GMP_LIMB_BITS - c); + n1 = (n1 << c) | (n0 >> (GMP_LIMB_BITS - c)); + n0 <<= c; + d1 = (d1 << c) | (d0 >> (GMP_LIMB_BITS - c)); + d0 <<= c; + + udiv_qrnnd (q, n1, n2, n1, d1); + umul_ppmm (t1, t0, q, d0); + if (t1 > n1 || (t1 == n1 && t0 > n0)) + { + ASSERT_ALWAYS (q > 0); + q--; + n1 += d1; + sub_ddmmss (t1, t0, t1, t0, 0, d0); + } + sub_ddmmss (n1, n0, n1, n0, t1, t0); + + /* Undo normalization */ + rp[0] = (n0 >> c) | (n1 << (GMP_LIMB_BITS - c)); + rp[1] = n1 >> c; + + return q; +} +#endif /* HGCD2_METHOD != 2 */ + /* Reduces a,b until |a-b| (almost) fits in one limb + 1 bit. Constructs matrix M. Returns 1 if we make progress, i.e. can perform at least one subtraction. Otherwise returns zero. */ @@ -338,9 +378,34 @@ } else { + mp_limb_t q; +#if HGCD2_METHOD == 2 mp_limb_t r[2]; - mp_limb_t q = div2 (r, ah, al, bh, bl); + q = div2 (r, ah, al, bh, bl); al = r[0]; ah = r[1]; +#else + mp_double_limb_t rq = div1 (ah, bh); + if (UNLIKELY (rq.d1 > bh)) + { + mp_limb_t r[2]; + q = div2 (r, ah, al, bh, bl); + al = r[0]; ah = r[1]; + } + else + { + mp_limb_t t1, t0; + ah = rq.d0; + q = rq.d1; + umul_ppmm (t1, t0, q, bl); + if (UNLIKELY (t1 >= ah) && (t1 > ah || t0 > al)) + { + ah += bh; + q--; + sub_ddmmss (t1, t0, t1, t0, 0, bl); + } + sub_ddmmss (ah, al, ah, al, t1, t0); + } +#endif if (ah < 2) { /* A is too small, but q is correct. */ @@ -380,9 +445,34 @@ } else { + mp_limb_t q; +#if HGCD2_METHOD == 2 mp_limb_t r[2]; - mp_limb_t q = div2 (r, bh, bl, ah, al); + q = div2 (r, bh, bl, ah, al); bl = r[0]; bh = r[1]; +#else + mp_double_limb_t rq = div1 (bh, ah); + if (UNLIKELY (rq.d1 > ah)) + { + mp_limb_t r[2]; + q = div2 (r, bh, bl, ah, al); + bl = r[0]; bh = r[1]; + } + else + { + mp_limb_t t1, t0; + bh = rq.d0; + q = rq.d1; + umul_ppmm (t1, t0, q, al); + if (UNLIKELY (t1 >= bh) && (t1 > bh || t0 > bl)) + { + bh += ah; + q--; + sub_ddmmss (t1, t0, t1, t0, 0, al); + } + sub_ddmmss (bh, bl, bh, bl, t1, t0); + } +#endif if (bh < 2) { /* B is too small, but q is correct. */
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.
_______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel