The patch below enables use of mpn_div_qr_2u_pi1 (written back in 2011). It's used only if an assembly implementation is available, which is currently x86_64 only. The point of it is to avoid a temporary allocation and shift to normalize the numerator up front.
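To make that concrete, here is a rough sketch of the two strategies for a two-limb divisor that needs normalization. This is an illustration only, not the patch itself: it assumes GMP's internal gmp-impl.h machinery (count_leading_zeros, invert_pi1, gmp_pi1_t, TMP_ALLOC_LIMBS), assumes no nails, and uses mpn_div_qr_2n_pi1 in the fallback branch where the current tdiv_qr.c uses mpn_divrem_2; the helper name is made up.

/* Sketch only.  Divide {np,nn} by the unnormalized two-limb divisor
   {dp,2} (dp[1] > 0, high bit clear); quotient to {qp,nn-1},
   remainder to {rp,2}.  Assumes gmp-impl.h internals and no nails.  */
static void
div_qr_2_unnorm_sketch (mp_ptr qp, mp_ptr rp,
                        mp_srcptr np, mp_size_t nn, mp_srcptr dp)
{
  gmp_pi1_t dinv;
  mp_limb_t d1, d0;
  int cnt;

  /* Normalize the divisor and compute the 3/2 inverse; this part is
     common to both variants.  */
  count_leading_zeros (cnt, dp[1]);
  d1 = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
  d0 = dp[0] << cnt;
  invert_pi1 (dinv, d1, d0);

#if HAVE_NATIVE_mpn_div_qr_2u_pi1
  /* New path: the division loop shifts the numerator on the fly
     (shld on x86_64), so no scratch copy of {np,nn} is needed.  */
  qp[nn - 2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn, d1, d0, cnt, dinv.inv32);
#else
  /* Old path: allocate nn+1 scratch limbs, shift the whole numerator
     up front, divide the normalized copy, then shift the two
     remainder limbs back down.  */
  {
    mp_ptr tp;
    mp_limb_t cy, qh;
    TMP_DECL;
    TMP_MARK;
    tp = TMP_ALLOC_LIMBS (nn + 1);
    cy = mpn_lshift (tp, np, nn, cnt);
    tp[nn] = cy;
    qh = mpn_div_qr_2n_pi1 (qp, rp, tp, nn + (cy != 0), d1, d0, dinv.inv32);
    if (cy == 0)
      qp[nn - 2] = qh;		/* always produce nn-1 quotient limbs */
    rp[0] = (rp[0] >> cnt) | (rp[1] << (GMP_NUMB_BITS - cnt));
    rp[1] = rp[1] >> cnt;
    TMP_FREE;
  }
#endif
}

The patch below makes essentially this choice inside the two-limb case of mpn_tdiv_qr.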
I'm a bit concerned about performance: it's not as well tuned as mpn_divrem_2, so it might not be a win in its current form. On my Intel Broadwell machine, mpn_divrem_2 takes 16.5 c/l and mpn_div_qr_2n takes 17 c/l (it's essentially the same loop, so this is just poor tuning). mpn_div_qr_2u seems to be about 19.5 c/l, and it uses the shld instruction, which is likely a poor choice on many x86 chips. That's 2.5 additional cycles for the shifting, which is pretty expensive; for comparison, mpn_lshift takes only 1.5 c/l, and it's implemented using SSE instructions. So we'd need to save half a cycle in mpn_div_qr_2n to make it the same speed as mpn_divrem_2, and then do the on-the-fly shifting in mpn_div_qr_2u at a cost of at most one extra cycle.

It's been a long time since I wrote that code. It's not entirely clear to me whether the loop is limited by instruction issue or by the latency of the critical path, which involves both a multiply and quite a few additional arithmetic instructions.

Regards,
/Niels

diff -r 1ad8cc22b714 configure.ac
--- a/configure.ac	Tue Jul 03 11:16:06 2018 +0200
+++ b/configure.ac	Sat Aug 18 20:57:02 2018 +0200
@@ -3526,6 +3526,8 @@
 #undef HAVE_NATIVE_mpn_copyi
 #undef HAVE_NATIVE_mpn_div_qr_1n_pi1
 #undef HAVE_NATIVE_mpn_div_qr_2
+#undef HAVE_NATIVE_mpn_div_qr_2n_pi1
+#undef HAVE_NATIVE_mpn_div_qr_2u_pi1
 #undef HAVE_NATIVE_mpn_divexact_1
 #undef HAVE_NATIVE_mpn_divexact_by3c
 #undef HAVE_NATIVE_mpn_divrem_1
diff -r 1ad8cc22b714 mpn/generic/tdiv_qr.c
--- a/mpn/generic/tdiv_qr.c	Tue Jul 03 11:16:06 2018 +0200
+++ b/mpn/generic/tdiv_qr.c	Sat Aug 18 20:57:02 2018 +0200
@@ -69,7 +69,7 @@
     case 2:
       {
 	mp_ptr n2p;
-	mp_limb_t qhl, cy;
+	mp_limb_t qhl;
 	TMP_DECL;
 	TMP_MARK;
 	if ((dp[1] & GMP_NUMB_HIGHBIT) == 0)
@@ -80,6 +80,16 @@
 	    cnt -= GMP_NAIL_BITS;
 	    d2p[1] = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
 	    d2p[0] = (dp[0] << cnt) & GMP_NUMB_MASK;
+#if HAVE_NATIVE_mpn_div_qr_2u_pi1
+	    {
+	      gmp_pi1_t dinv;
+	      invert_pi1 (dinv, d2p[1], d2p[0]);
+	      qp[nn-2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn,
+					    d2p[1], d2p[0], cnt, dinv.inv32);
+	    }
+#else
+	    {
+	      mp_limb_t cy;
 	    n2p = TMP_ALLOC_LIMBS (nn + 1);
 	    cy = mpn_lshift (n2p, np, nn, cnt);
 	    n2p[nn] = cy;
@@ -90,6 +100,8 @@
 		    | ((n2p[1] << (GMP_NUMB_BITS - cnt)) & GMP_NUMB_MASK);
 	    rp[1] = (n2p[1] >> cnt);
 	  }
+#endif
+	  }
 	else
 	  {
 	    n2p = TMP_ALLOC_LIMBS (nn);

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.