Some optimisations for numeric division

Dean Rasheed Wed, 23 Feb 2022 05:35:27 -0800

Attached are 3 small patches that improve the performance of numeric
division. Patch 0002 seems to have the biggest impact, but I think
they're all worth including, since they're quite simple changes, with
noticeable performance gains.



Patch 0001 vectorises the inner loop of div_var_fast(). This loop is
almost identical to the inner loop of mul_var(), which was vectorised
by commit 8870917623. The thing preventing the div_var_fast() loop
from being vectorised is the assignment to div[qi + i], and simply
replacing that with div_qi[i] where div_qi = &div[qi], in the same
style as mul_var(), fixes that.

One difference between this and the mul_var() code is that it is also
necessary to add an explicit cast to get the compiler to generate
matching output, and give the best results. This is because in
mul_var() the compiler recognises that var1digit is actually 16-bit,
rather than 32-bit, and so it doesn't need to multiply the high part,
but in div_var_fast() it's not obvious to the compiler that qdigit
also fits in 16 bits, hence the cast.

The actual performance gain is typically quite modest, since
div_var_fast() is always only a part of some more complex numeric
computation, but it can be seen in cases like raising numbers to
negative integer powers, e.g.:

CREATE TEMP TABLE num_test(x numeric);
SELECT setseed(0);
INSERT INTO num_test
  SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, '')||x)::numeric
          FROM generate_series(1,100))
  FROM generate_series(1,100000) g(x);

SELECT sum(x^(-2)) FROM num_test;

Time: 112.949 ms  (HEAD)
Time: 98.537 ms   (with patch)


Patch 0002 simplifies the inner loop of div_var() (the guts of the
public-facing numeric division operator) by more closely combining the
multiplication and subtraction operations and folding the separate
"carry" and "borrow" variables into a single "borrow", as suggested by
the old code comment.

IMO, this makes the code easier to understand, as well as giving more
significant performance gains:

CREATE TEMP TABLE div_test(x numeric);
SELECT setseed(0);
INSERT INTO div_test
  SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, ''))::numeric + x
          FROM generate_series(1,50))
  FROM generate_series(1,1000) g(x);

SELECT sum(x/y) FROM div_test t1(x), div_test t2(y);

Time: 1477.340 ms  (HEAD)
Time: 826.748 ms   (with patch)


Patch 0003 replaces some uses of div_var_fast() with div_var().
Specifically, when the divisor has just one or two base-NBASE digits,
div_var() is faster. This is especially true for 1-digit divisors, due
to the "fast path" code in div_var() to handle that. It's also just
about true for 2-digit divisors, and it occurs to me that that case
could potentially be optimised further with similar fast path code in
div_var(), but I haven't tried that yet.

One-digit divisors are quite common. For example, in the Taylor series
expansions in exp_var() and ln_var(), since the number of Taylor
series terms never exceeds a few hundred in practice. Also,
exp_var()'s argument reduction step divides by 2^ndiv2, which is
roughly 100 times the input, rounded up to a power of two, and valid
inputs are less than 6000, so this will always be one or two digits.

Testing this with a bunch of random exp() and ln() computations I saw
a speed-up of 15-20%, and it reduced the run time of the numeric-big
regression test by around 10%, which seems worth having.

Regards,
Dean

From 41732ad9a44dcd12e52d823fb59cb23cce4fe217 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rash...@gmail.com>
Date: Sun, 20 Feb 2022 12:45:01 +0000
Subject: [PATCH 1/3] Apply auto-vectorization to the inner loop of
 div_var_fast().

This loop is basically the same as the inner loop of mul_var(), which
was auto-vectorized in commit 8870917623, but the compiler will only
consider auto-vectorizing the div_var_fast() loop if the assignment
target div[qi + i] is replaced by div_qi[i], where div_qi = &div[qi].

Additionally, since the compiler doesn't know that qdigit is
guaranteed to fit in a 16-bit NumericDigit, cast it to NumericDigit
before multiplying to make the resulting auto-vectorized code more
efficient.
---
 src/backend/utils/adt/numeric.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 3208789f75..fc43d2a456 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8908,13 +8908,22 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2,
 			 * which would make the new value simply div[qi] mod vardigits[0].
 			 * The lower-order terms in qdigit can change this result by not
 			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 *
+			 * This inner loop is the performance bottleneck for division, so
+			 * code it in the same way as the inner loop of mul_var() so that
+			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
+			 * before multiplying to allow the compiler to generate more
+			 * efficient code (using 16-bit multiplication), which is safe
+			 * since we know that the quotient digit is off by at most one, so
+			 * there is no overflow risk.
 			 */
 			if (qdigit != 0)
 			{
 				int			istop = Min(var2ndigits, div_ndigits - qi + 1);
+				int		   *div_qi = &div[qi];
 
 				for (i = 0; i < istop; i++)
-					div[qi + i] -= qdigit * var2digits[i];
+					div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
 			}
 		}
 
-- 
2.26.2

From f578051209ec952873795beabadd40b54dbf182a Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rash...@gmail.com>
Date: Sun, 20 Feb 2022 18:10:25 +0000
Subject: [PATCH 3/3] Use div_var() instead of div_var_fast() when the divisor
 is small.

When the divisor contains just one base-NBASE digit, div_var() is much
faster than div_var_fast(), since it contains special fast-path code
for that case, and when the divisor has two digits, div_var() is still
somewhat faster, since div_var_fast() has more overhead intended for
larger inputs. Therefore, use div_var() instead of div_var_fast() in
such cases.

Where we know for sure that the divisor is small, such as the
denominators of Taylor series terms, just call div_var() directly.
Otherwise, have div_var_fast() delegate to div_var() when the
divisor is small. This produces a noticeable speedup of exp(), ln()
and the numeric-big regression test.
---
 src/backend/utils/adt/numeric.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 909f4eed74..3c1e310ce2 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8725,6 +8725,16 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2,
 	NumericDigit *var1digits = var1->digits;
 	NumericDigit *var2digits = var2->digits;
 
+	/*
+	 * If the divisor has just one or two base-NBASE digits, it is
+	 * significantly faster to just use div_var().
+	 */
+	if (var2ndigits <= 2)
+	{
+		div_var(var1, var2, result, rscale, round);
+		return;
+	}
+
 	/*
 	 * First of all division by zero check; we must not be handed an
 	 * unnormalized divisor.
@@ -9840,7 +9850,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale)
 		}
 
 		local_rscale = x.dscale + ndiv2;
-		div_var_fast(&x, &tmp, &x, local_rscale, true);
+		div_var(&x, &tmp, &x, local_rscale, true);
 
 		free_var(&tmp);
 	}
@@ -9871,7 +9881,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale)
 
 	mul_var(&x, &x, &elem, local_rscale);
 	set_var_from_var(&const_two, &ni);
-	div_var_fast(&elem, &ni, &elem, local_rscale, true);
+	div_var(&elem, &ni, &elem, local_rscale, true);
 
 	while (elem.ndigits != 0)
 	{
@@ -9879,7 +9889,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale)
 
 		mul_var(&elem, &x, &elem, local_rscale);
 		add_var(&ni, &const_one, &ni);
-		div_var_fast(&elem, &ni, &elem, local_rscale, true);
+		div_var(&elem, &ni, &elem, local_rscale, true);
 	}
 
 	/*
@@ -10079,7 +10089,7 @@ ln_var(const NumericVar *arg, NumericVar *result, int rscale)
 	{
 		add_var(&ni, &const_two, &ni);
 		mul_var(&xx, &x, &xx, local_rscale);
-		div_var_fast(&xx, &ni, &elem, local_rscale, true);
+		div_var(&xx, &ni, &elem, local_rscale, true);
 
 		if (elem.ndigits == 0)
 			break;
-- 
2.26.2

From 481b7c0044afbe72adff1961624a1bb515791b96 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rash...@gmail.com>
Date: Sun, 20 Feb 2022 17:15:49 +0000
Subject: [PATCH 2/3] Simplify the inner loop of numeric division in div_var().

In the standard numeric division algorithm, the inner loop multiplies
the divisor by the next quotient digit and subtracts that from the
working dividend. As suggested by the original code comment, the
separate "carry" and "borrow" variables (from the multiplication and
subtraction steps respectively) can be folded together into a single
variable. Doing so significantly improves performance, as well as
simplifying the code.
---
 src/backend/utils/adt/numeric.c | 36 ++++++++++++++-------------------
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index fc43d2a456..909f4eed74 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8605,31 +8605,25 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 			/* As above, need do nothing more when quotient digit is 0 */
 			if (qhat > 0)
 			{
+				NumericDigit *dividend_j = &dividend[j];
+
 				/*
 				 * Multiply the divisor by qhat, and subtract that from the
-				 * working dividend.  "carry" tracks the multiplication,
-				 * "borrow" the subtraction (could we fold these together?)
+				 * working dividend.  The multiplication and subtraction are
+				 * folded together here, noting that qhat <= NBASE (since it
+				 * might be one too large), and so the intermediate result
+				 * "tmp_result" is in the range [-NBASE^2, NBASE - 1], and
+				 * "borrow" is in the range [0, NBASE].
 				 */
-				carry = 0;
 				borrow = 0;
 				for (i = var2ndigits; i >= 0; i--)
 				{
-					carry += divisor[i] * qhat;
-					borrow -= carry % NBASE;
-					carry = carry / NBASE;
-					borrow += dividend[j + i];
-					if (borrow < 0)
-					{
-						dividend[j + i] = borrow + NBASE;
-						borrow = -1;
-					}
-					else
-					{
-						dividend[j + i] = borrow;
-						borrow = 0;
-					}
+					int			tmp_result;
+
+					tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
+					borrow = (NBASE - 1 - tmp_result) / NBASE;
+					dividend_j[i] = tmp_result + borrow * NBASE;
 				}
-				Assert(carry == 0);
 
 				/*
 				 * If we got a borrow out of the top dividend digit, then
@@ -8645,15 +8639,15 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 					carry = 0;
 					for (i = var2ndigits; i >= 0; i--)
 					{
-						carry += dividend[j + i] + divisor[i];
+						carry += dividend_j[i] + divisor[i];
 						if (carry >= NBASE)
 						{
-							dividend[j + i] = carry - NBASE;
+							dividend_j[i] = carry - NBASE;
 							carry = 1;
 						}
 						else
 						{
-							dividend[j + i] = carry;
+							dividend_j[i] = carry;
 							carry = 0;
 						}
 					}
-- 
2.26.2

Some optimisations for numeric division

Reply via email to