t...@gmplib.org (Torbjörn Granlund) writes:

> Should we rename div1 to put it into the GMP name space?  How about
> mpn_div_11?  (We could encode in its name that it is for use for small
> q, but I'd suggest that we don't do that.)
>
> That would allow for some assembly experiments.

If you think assembly will make a big difference (which seems likely),
that makes sense. And define it only when it's faster than a a div
instruction generated by the compiler?

>   Taking out powers of two makes things complicated. Taking out a factor
>   of two from one of the numbers corresponds to a matrix (1,0;0,2) with
>   determinant 2. So the inverse is not an integer matrix, but an integer
>   matrix + a shift count.
>
> Would a shift count hurt?

On second thought, I think there's a bigger problem: The way hgcd is
used, it's called on the most significant part of some numbers, but
applied to larger numbers. And a matrix based on trailing zeros in hgcd
won't be applicable to the larger numbers, since they have different
lowend bits.

>   It would be interesting to try some left-to-right binary hgcd. Logic
>   should be something like
>
>     count_leading_zeros(a_bits, a);
>     count_leading_zeros(b_bits, b);
>
>     if (a_bits == b_bits)
>       (a, b) <-- (min(a,b), |a-b|)
>     else if a_bits > b_bits
>       a <-- |a - b * 2^{a_bits - b_bits}|
>     else 
>       b <-- |b - a * 2^{b_bits - a_bits}|
>
> OK!  This is super interesting!

I think it may work nicely on machines with slow multiplication as well
as small-quotient division. But will be some overhead to make
branch-free. I haven't tried it seriously yet.

We'll sometimes get determinant == -1, so users of hgcd2 would need to
be updated to accept that. Or maybe one can just do a swap to ensure
that we have det == +1 in all cases. E.g.,

  a <-- 2b - a

corresponds to the matrix (-1, 2; 0,1) with determinant -1. But if we
swap a and b, and set

  (a, b) <-- (b, 2b - a)

that's the matrix (0,1; -1, 2) with determinant +1.

In the mean time, I've updated the replacement of div2. See attached
patch, using div1 except when HGCD2_METHOD == 2. Timing on my old laptop:

$ ./tune/speed -c -p1000000 -s1 mpn_hgcd2_1 mpn_hgcd2_2 mpn_hgcd2_3
overhead 6.00 cycles, precision 1000000 units of 8.33e-10 secs, CPU freq 
1200.00 MHz
          mpn_hgcd2_1   mpn_hgcd2_2   mpn_hgcd2_3
1             1492.62       2064.18      #1375.18

So this is 50% speedup over the old method.

Regards,
/Niels


diff -r 8c0120dc67b0 mpn/generic/hgcd2.c
--- a/mpn/generic/hgcd2.c	Thu Sep 05 21:25:39 2019 +0200
+++ b/mpn/generic/hgcd2.c	Sat Sep 07 23:59:38 2019 +0200
@@ -159,7 +159,9 @@
 #error Unknown HGCD2_METHOD
 #endif
 
-/* Two-limb division optimized for small quotients.  */
+#if HGCD2_METHOD == 2
+/* Two-limb division optimized for small quotients, similar to
+   corresponding div1 above. */
 static inline mp_limb_t
 div2 (mp_ptr rp,
       mp_limb_t nh, mp_limb_t nl,
@@ -262,6 +264,44 @@
 }
 #endif
 
+#else /* HGCD2_METHOD != 2 */
+/* Used only for large quotients */
+static mp_limb_t
+div2 (mp_ptr rp,
+	   mp_limb_t n1, mp_limb_t n0,
+	   mp_limb_t d1, mp_limb_t d0)
+{
+  mp_limb_t n2, q, t1, t0;
+  int c;
+
+  /* Normalize */
+  count_leading_zeros (c, d1);
+
+  n2 = n1 >> (GMP_LIMB_BITS - c);
+  n1 = (n1 << c) | (n0 >> (GMP_LIMB_BITS - c));
+  n0 <<= c;
+  d1 = (d1 << c) | (d0 >> (GMP_LIMB_BITS - c));
+  d0 <<= c;
+
+  udiv_qrnnd (q, n1, n2, n1, d1);
+  umul_ppmm (t1, t0, q, d0);
+  if (t1 > n1 || (t1 == n1 && t0 > n0))
+    {
+      ASSERT_ALWAYS (q > 0);
+      q--;
+      n1 += d1;
+      sub_ddmmss (t1, t0, t1, t0, 0, d0);
+    }
+  sub_ddmmss (n1, n0, n1, n0, t1, t0);
+
+  /* Undo normalization */
+  rp[0] = (n0 >> c) | (n1 << (GMP_LIMB_BITS - c));
+  rp[1] = n1 >> c;
+
+  return q;
+}
+#endif /* HGCD2_METHOD != 2 */
+
 /* Reduces a,b until |a-b| (almost) fits in one limb + 1 bit. Constructs
    matrix M. Returns 1 if we make progress, i.e. can perform at least
    one subtraction. Otherwise returns zero. */
@@ -338,9 +378,34 @@
 	}
       else
 	{
+	  mp_limb_t q;
+#if HGCD2_METHOD == 2
 	  mp_limb_t r[2];
-	  mp_limb_t q = div2 (r, ah, al, bh, bl);
+	  q = div2 (r, ah, al, bh, bl);
 	  al = r[0]; ah = r[1];
+#else
+	  mp_double_limb_t rq = div1 (ah, bh);
+	  if (UNLIKELY (rq.d1 > bh))
+	    {
+	      mp_limb_t r[2];
+	      q = div2 (r, ah, al, bh, bl);
+	      al = r[0]; ah = r[1];
+	    }
+	  else
+	    {
+	      mp_limb_t t1, t0;
+	      ah = rq.d0;
+	      q = rq.d1;
+	      umul_ppmm (t1, t0, q, bl);
+	      if (UNLIKELY (t1 >= ah) && (t1 > ah || t0 > al))
+		{
+		  ah += bh;
+		  q--;
+		  sub_ddmmss (t1, t0, t1, t0, 0, bl);
+		}
+	      sub_ddmmss (ah, al, ah, al, t1, t0);
+	    }
+#endif
 	  if (ah < 2)
 	    {
 	      /* A is too small, but q is correct. */
@@ -380,9 +445,34 @@
 	}
       else
 	{
+	  mp_limb_t q;
+#if HGCD2_METHOD == 2
 	  mp_limb_t r[2];
-	  mp_limb_t q = div2 (r, bh, bl, ah, al);
+	  q = div2 (r, bh, bl, ah, al);
 	  bl = r[0]; bh = r[1];
+#else
+	  mp_double_limb_t rq = div1 (bh, ah);
+	  if (UNLIKELY (rq.d1 > ah))
+	    {
+	      mp_limb_t r[2];
+	      q = div2 (r, bh, bl, ah, al);
+	      bl = r[0]; bh = r[1];
+	    }
+	  else
+	    {
+	      mp_limb_t t1, t0;
+	      bh = rq.d0;
+	      q = rq.d1;
+	      umul_ppmm (t1, t0, q, al);
+	      if (UNLIKELY (t1 >= bh) && (t1 > bh || t0 > bl))
+		{
+		  bh += ah;
+		  q--;
+		  sub_ddmmss (t1, t0, t1, t0, 0, al);
+		}
+	      sub_ddmmss (bh, bl, bh, bl, t1, t0);
+	    }
+#endif
 	  if (bh < 2)
 	    {
 	      /* B is too small, but q is correct. */
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to