On Tue, 2 Jul 2024 at 08:50, Joel Jacobson <[email protected]> wrote:
>
> On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
>
> > Attachments:
> > * optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
>
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
That doesn't noticeably affect the performance though. Update attached.
> I've benchmarked your patch on my three machines with great results.
> I added a setseed() step, to make the benchmarks reproducible,
> shouldn't matter much since it should statistically average out, but I
> thought why not.
Nice. The results on the Apple M3 Max look particularly impressive.
I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.
Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..f77ad8b
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,76 @@ mul_var(const NumericVar *var1, const Nu
}
/*
+ * Simplified fast-path computation, if var1 has just one or two digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 2)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+
+ for (i = res_ndigits - 4; i > 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
+ /*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
* to avoid normalizing carries immediately.