Attached are 3 small patches that improve the performance of numeric division. Patch 0002 seems to have the biggest impact, but I think they're all worth including, since they're quite simple changes, with noticeable performance gains.
Patch 0001 vectorises the inner loop of div_var_fast(). This loop is almost identical to the inner loop of mul_var(), which was vectorised by commit 8870917623. The thing preventing the div_var_fast() loop from being vectorised is the assignment to div[qi + i], and simply replacing that with div_qi[i] where div_qi = &div[qi], in the same style as mul_var(), fixes that. One difference between this and the mul_var() code is that it is also necessary to add an explicit cast to get the compiler to generate matching output, and give the best results. This is because in mul_var() the compiler recognises that var1digit is actually 16-bit, rather than 32-bit, and so it doesn't need to multiply the high part, but in div_var_fast() it's not obvious to the compiler that qdigit also fits in 16 bits, hence the cast. The actual performance gain is typically quite modest, since div_var_fast() is always only a part of some more complex numeric computation, but it can be seen in cases like raising numbers to negative integer powers, e.g.: CREATE TEMP TABLE num_test(x numeric); SELECT setseed(0); INSERT INTO num_test SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, '')||x)::numeric FROM generate_series(1,100)) FROM generate_series(1,100000) g(x); SELECT sum(x^(-2)) FROM num_test; Time: 112.949 ms (HEAD) Time: 98.537 ms (with patch) Patch 0002 simplifies the inner loop of div_var() (the guts of the public-facing numeric division operator) by more closely combining the multiplication and subtraction operations and folding the separate "carry" and "borrow" variables into a single "borrow", as suggested by the old code comment. IMO, this makes the code easier to understand, as well as giving more significant performance gains: CREATE TEMP TABLE div_test(x numeric); SELECT setseed(0); INSERT INTO div_test SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, ''))::numeric + x FROM generate_series(1,50)) FROM generate_series(1,1000) g(x); SELECT sum(x/y) FROM div_test t1(x), div_test t2(y); Time: 1477.340 ms (HEAD) Time: 826.748 ms (with patch) Patch 0003 replaces some uses of div_var_fast() with div_var(). Specifically, when the divisor has just one or two base-NBASE digits, div_var() is faster. This is especially true for 1-digit divisors, due to the "fast path" code in div_var() to handle that. It's also just about true for 2-digit divisors, and it occurs to me that that case could potentially be optimised further with similar fast path code in div_var(), but I haven't tried that yet. One-digit divisors are quite common. For example, in the Taylor series expansions in exp_var() and ln_var(), since the number of Taylor series terms never exceeds a few hundred in practice. Also, exp_var()'s argument reduction step divides by 2^ndiv2, which is roughly 100 times the input, rounded up to a power of two, and valid inputs are less than 6000, so this will always be one or two digits. Testing this with a bunch of random exp() and ln() computations I saw a speed-up of 15-20%, and it reduced the run time of the numeric-big regression test by around 10%, which seems worth having. Regards, Dean
From 41732ad9a44dcd12e52d823fb59cb23cce4fe217 Mon Sep 17 00:00:00 2001 From: Dean Rasheed <dean.a.rash...@gmail.com> Date: Sun, 20 Feb 2022 12:45:01 +0000 Subject: [PATCH 1/3] Apply auto-vectorization to the inner loop of div_var_fast(). This loop is basically the same as the inner loop of mul_var(), which was auto-vectorized in commit 8870917623, but the compiler will only consider auto-vectorizing the div_var_fast() loop if the assignment target div[qi + i] is replaced by div_qi[i], where div_qi = &div[qi]. Additionally, since the compiler doesn't know that qdigit is guaranteed to fit in a 16-bit NumericDigit, cast it to NumericDigit before multiplying to make the resulting auto-vectorized code more efficient. --- src/backend/utils/adt/numeric.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 3208789f75..fc43d2a456 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -8908,13 +8908,22 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2, * which would make the new value simply div[qi] mod vardigits[0]. * The lower-order terms in qdigit can change this result by not * more than about twice INT_MAX/NBASE, so overflow is impossible. + * + * This inner loop is the performance bottleneck for division, so + * code it in the same way as the inner loop of mul_var() so that + * it can be auto-vectorized. We cast qdigit to NumericDigit + * before multiplying to allow the compiler to generate more + * efficient code (using 16-bit multiplication), which is safe + * since we know that the quotient digit is off by at most one, so + * there is no overflow risk. */ if (qdigit != 0) { int istop = Min(var2ndigits, div_ndigits - qi + 1); + int *div_qi = &div[qi]; for (i = 0; i < istop; i++) - div[qi + i] -= qdigit * var2digits[i]; + div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i]; } } -- 2.26.2
From f578051209ec952873795beabadd40b54dbf182a Mon Sep 17 00:00:00 2001 From: Dean Rasheed <dean.a.rash...@gmail.com> Date: Sun, 20 Feb 2022 18:10:25 +0000 Subject: [PATCH 3/3] Use div_var() instead of div_var_fast() when the divisor is small. When the divisor contains just one base-NBASE digit, div_var() is much faster than div_var_fast(), since it contains special fast-path code for that case, and when the divisor has two digits, div_var() is still somewhat faster, since div_var_fast() has more overhead intended for larger inputs. Therefore, use div_var() instead of div_var_fast() in such cases. Where we know for sure that the divisor is small, such as the denominators of Taylor series terms, just call div_var() directly. Otherwise, have div_var_fast() delegate to div_var() when the divisor is small. This produces a noticeable speedup of exp(), ln() and the numeric-big regression test. --- src/backend/utils/adt/numeric.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 909f4eed74..3c1e310ce2 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -8725,6 +8725,16 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2, NumericDigit *var1digits = var1->digits; NumericDigit *var2digits = var2->digits; + /* + * If the divisor has just one or two base-NBASE digits, it is + * significantly faster to just use div_var(). + */ + if (var2ndigits <= 2) + { + div_var(var1, var2, result, rscale, round); + return; + } + /* * First of all division by zero check; we must not be handed an * unnormalized divisor. @@ -9840,7 +9850,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale) } local_rscale = x.dscale + ndiv2; - div_var_fast(&x, &tmp, &x, local_rscale, true); + div_var(&x, &tmp, &x, local_rscale, true); free_var(&tmp); } @@ -9871,7 +9881,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale) mul_var(&x, &x, &elem, local_rscale); set_var_from_var(&const_two, &ni); - div_var_fast(&elem, &ni, &elem, local_rscale, true); + div_var(&elem, &ni, &elem, local_rscale, true); while (elem.ndigits != 0) { @@ -9879,7 +9889,7 @@ exp_var(const NumericVar *arg, NumericVar *result, int rscale) mul_var(&elem, &x, &elem, local_rscale); add_var(&ni, &const_one, &ni); - div_var_fast(&elem, &ni, &elem, local_rscale, true); + div_var(&elem, &ni, &elem, local_rscale, true); } /* @@ -10079,7 +10089,7 @@ ln_var(const NumericVar *arg, NumericVar *result, int rscale) { add_var(&ni, &const_two, &ni); mul_var(&xx, &x, &xx, local_rscale); - div_var_fast(&xx, &ni, &elem, local_rscale, true); + div_var(&xx, &ni, &elem, local_rscale, true); if (elem.ndigits == 0) break; -- 2.26.2
From 481b7c0044afbe72adff1961624a1bb515791b96 Mon Sep 17 00:00:00 2001 From: Dean Rasheed <dean.a.rash...@gmail.com> Date: Sun, 20 Feb 2022 17:15:49 +0000 Subject: [PATCH 2/3] Simplify the inner loop of numeric division in div_var(). In the standard numeric division algorithm, the inner loop multiplies the divisor by the next quotient digit and subtracts that from the working dividend. As suggested by the original code comment, the separate "carry" and "borrow" variables (from the multiplication and subtraction steps respectively) can be folded together into a single variable. Doing so significantly improves performance, as well as simplifying the code. --- src/backend/utils/adt/numeric.c | 36 ++++++++++++++------------------- 1 file changed, 15 insertions(+), 21 deletions(-) diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index fc43d2a456..909f4eed74 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -8605,31 +8605,25 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result, /* As above, need do nothing more when quotient digit is 0 */ if (qhat > 0) { + NumericDigit *dividend_j = ÷nd[j]; + /* * Multiply the divisor by qhat, and subtract that from the - * working dividend. "carry" tracks the multiplication, - * "borrow" the subtraction (could we fold these together?) + * working dividend. The multiplication and subtraction are + * folded together here, noting that qhat <= NBASE (since it + * might be one too large), and so the intermediate result + * "tmp_result" is in the range [-NBASE^2, NBASE - 1], and + * "borrow" is in the range [0, NBASE]. */ - carry = 0; borrow = 0; for (i = var2ndigits; i >= 0; i--) { - carry += divisor[i] * qhat; - borrow -= carry % NBASE; - carry = carry / NBASE; - borrow += dividend[j + i]; - if (borrow < 0) - { - dividend[j + i] = borrow + NBASE; - borrow = -1; - } - else - { - dividend[j + i] = borrow; - borrow = 0; - } + int tmp_result; + + tmp_result = dividend_j[i] - borrow - divisor[i] * qhat; + borrow = (NBASE - 1 - tmp_result) / NBASE; + dividend_j[i] = tmp_result + borrow * NBASE; } - Assert(carry == 0); /* * If we got a borrow out of the top dividend digit, then @@ -8645,15 +8639,15 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result, carry = 0; for (i = var2ndigits; i >= 0; i--) { - carry += dividend[j + i] + divisor[i]; + carry += dividend_j[i] + divisor[i]; if (carry >= NBASE) { - dividend[j + i] = carry - NBASE; + dividend_j[i] = carry - NBASE; carry = 1; } else { - dividend[j + i] = carry; + dividend_j[i] = carry; carry = 0; } } -- 2.26.2