The patch below enables use of mpn_div_qr_2u_pi1 (written back in 2011). It's used only if an assembly implementation is available, which is currently x86_64 only. The point of it is to avoid a temporary allocation and shift to normalize the numerator up front.
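To make that concrete, here is a rough sketch of the two strategies for a two-limb divisor that needs normalization. This is an illustration only, not the patch itself: it assumes GMP's internal gmp-impl.h machinery (count_leading_zeros, invert_pi1, gmp_pi1_t, TMP_ALLOC_LIMBS), assumes no nails, and uses mpn_div_qr_2n_pi1 in the fallback branch where the current tdiv_qr.c uses mpn_divrem_2; the helper name is made up.

/* Sketch only.  Divide {np,nn} by the unnormalized two-limb divisor
   {dp,2} (dp[1] > 0, high bit clear); quotient to {qp,nn-1},
   remainder to {rp,2}.  Assumes gmp-impl.h internals and no nails.  */
static void
div_qr_2_unnorm_sketch (mp_ptr qp, mp_ptr rp,
                        mp_srcptr np, mp_size_t nn, mp_srcptr dp)
{
  gmp_pi1_t dinv;
  mp_limb_t d1, d0;
  int cnt;

  /* Normalize the divisor and compute the 3/2 inverse; this part is
     common to both variants.  */
  count_leading_zeros (cnt, dp[1]);
  d1 = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
  d0 = dp[0] << cnt;
  invert_pi1 (dinv, d1, d0);

#if HAVE_NATIVE_mpn_div_qr_2u_pi1
  /* New path: the division loop shifts the numerator on the fly
     (shld on x86_64), so no scratch copy of {np,nn} is needed.  */
  qp[nn - 2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn, d1, d0, cnt, dinv.inv32);
#else
  /* Old path: allocate nn+1 scratch limbs, shift the whole numerator
     up front, divide the normalized copy, then shift the two
     remainder limbs back down.  */
  {
    mp_ptr tp;
    mp_limb_t cy, qh;
    TMP_DECL;
    TMP_MARK;
    tp = TMP_ALLOC_LIMBS (nn + 1);
    cy = mpn_lshift (tp, np, nn, cnt);
    tp[nn] = cy;
    qh = mpn_div_qr_2n_pi1 (qp, rp, tp, nn + (cy != 0), d1, d0, dinv.inv32);
    if (cy == 0)
      qp[nn - 2] = qh;		/* always produce nn-1 quotient limbs */
    rp[0] = (rp[0] >> cnt) | (rp[1] << (GMP_NUMB_BITS - cnt));
    rp[1] = rp[1] >> cnt;
    TMP_FREE;
  }
#endif
}

The patch below makes essentially this choice inside the two-limb case of mpn_tdiv_qr.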
I'm a bit concerned about performance: it's not as well tuned as mpn_divrem_2, so it might not be a win in its current form. On my Intel Broadwell machine, mpn_divrem_2 takes 16.5 c/l and mpn_div_qr_2n takes 17 c/l (it's essentially the same loop, so this is just poor tuning). mpn_div_qr_2u seems to be about 19.5 c/l, and it uses the shld instruction, which is likely a poor choice on many x86 chips. That's 2.5 additional cycles for the shifting, which is pretty expensive; for comparison, mpn_lshift takes only 1.5 c/l, and it's implemented using SSE instructions. So we'd need to save half a cycle in mpn_div_qr_2n to make it the same speed as mpn_divrem_2, and then do the on-the-fly shifting in mpn_div_qr_2u at a cost of at most one extra cycle.

It's been a long time since I wrote that code. It's not entirely clear to me whether the loop is limited by instruction issue or by the latency of the critical path, which involves both a multiply and quite a few additional arithmetic instructions.

Regards,
/Niels

diff -r 1ad8cc22b714 configure.ac
--- a/configure.ac	Tue Jul 03 11:16:06 2018 +0200
+++ b/configure.ac	Sat Aug 18 20:57:02 2018 +0200
@@ -3526,6 +3526,8 @@
 #undef HAVE_NATIVE_mpn_copyi
 #undef HAVE_NATIVE_mpn_div_qr_1n_pi1
 #undef HAVE_NATIVE_mpn_div_qr_2
+#undef HAVE_NATIVE_mpn_div_qr_2n_pi1
+#undef HAVE_NATIVE_mpn_div_qr_2u_pi1
 #undef HAVE_NATIVE_mpn_divexact_1
 #undef HAVE_NATIVE_mpn_divexact_by3c
 #undef HAVE_NATIVE_mpn_divrem_1
diff -r 1ad8cc22b714 mpn/generic/tdiv_qr.c
--- a/mpn/generic/tdiv_qr.c	Tue Jul 03 11:16:06 2018 +0200
+++ b/mpn/generic/tdiv_qr.c	Sat Aug 18 20:57:02 2018 +0200
@@ -69,7 +69,7 @@
     case 2:
       {
 	mp_ptr n2p;
-	mp_limb_t qhl, cy;
+	mp_limb_t qhl;
 	TMP_DECL;
 	TMP_MARK;
 	if ((dp[1] & GMP_NUMB_HIGHBIT) == 0)
@@ -80,6 +80,16 @@
 	    cnt -= GMP_NAIL_BITS;
 	    d2p[1] = (dp[1] << cnt) | (dp[0] >> (GMP_NUMB_BITS - cnt));
 	    d2p[0] = (dp[0] << cnt) & GMP_NUMB_MASK;
+#if HAVE_NATIVE_mpn_div_qr_2u_pi1
+	    {
+	      gmp_pi1_t dinv;
+	      invert_pi1 (dinv, d2p[1], d2p[0]);
+	      qp[nn-2] = mpn_div_qr_2u_pi1 (qp, rp, np, nn,
+					    d2p[1], d2p[0], cnt, dinv.inv32);
+	    }
+#else
+	    {
+	      mp_limb_t cy;
 	    n2p = TMP_ALLOC_LIMBS (nn + 1);
 	    cy = mpn_lshift (n2p, np, nn, cnt);
 	    n2p[nn] = cy;
@@ -90,6 +100,8 @@
 		    | ((n2p[1] << (GMP_NUMB_BITS - cnt)) & GMP_NUMB_MASK);
 	    rp[1] = (n2p[1] >> cnt);
 	  }
+#endif
+	  }
 	else
 	  {
 	    n2p = TMP_ALLOC_LIMBS (nn);

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.