On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
> div_var() also has an optimisation for 3- and 4-digit operands under
> HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
> mul_var() too?

I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:

```
#ifdef HAVE_INT128
        /*
         * If var1 and var2 are up to four digits, their product will fit in
         * an int128 can be computed directly, which is significantly faster.
         */
        if (var2ndigits <= 4)
        {
                int128          product = 0;

                switch (var1ndigits)
                {
                        case 1:
                                product = var1digits[0];
                                break;
                        case 2:
                                product = var1digits[0] * NBASE + var1digits[1];
                                break;
                        case 3:
                                product = ((int128) var1digits[0] * NBASE + 
var1digits[1])
                                                * NBASE + var1digits[2];
                                break;
                        case 4:
                                product = (((int128) var1digits[0] * NBASE + 
var1digits[1])
                                                * NBASE + var1digits[2]) * 
NBASE + var1digits[3];
                                break;
                }

                switch (var2ndigits)
                {
                        case 1:
                                product *= var2digits[0];
                                break;
                        case 2:
                                product *= var2digits[0] * NBASE + 
var2digits[1];
                                break;
                        case 3:
                                product = ((int128) var2digits[0] * NBASE + 
var2digits[1])
                                                * NBASE + var2digits[2];
                                break;
                        case 4:
                                product = (((int128) var2digits[0] * NBASE + 
var2digits[1])
                                                * NBASE + var2digits[2]) * 
NBASE + var2digits[3];
                                break;
                }

                alloc_var(result, res_ndigits);
                res_digits = result->digits;
                for (i = res_ndigits - 1; i >= 0; i--)
                {
                        res_digits[i] = product % NBASE;
                        product /= NBASE;
                }
                Assert(product == 0);

                /*
                 * Finally, round the result to the requested precision.
                 */
                result->weight = res_weight;
                result->sign = res_sign;

                /* Round to target rscale (and set result->dscale) */
                round_var(result, rscale);

                /* Strip leading and trailing zeroes */
                strip_var(result);

                return;
        }
#endif
```

Benchmark 1, testing 2 ndigits * 2 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '11112222',
      '33334444'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 32.2 ns          | 20.5 ns          | 1.5700112246809388
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 30.2 ns          | 21.4 ns          | 1.4113042510107371
(1 row)

So 57% and 41% faster.

Benchmark 2, testing 4 ndigits * 4 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '1111222233334444',
      '5555666677778888'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b |   performance_ratio
------------------+------------------+------------------------
 41.9 ns          | 51.3 ns          | 0.81733655797170943614
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 40 ns            | 38 ns            | 1.0515610914706320
(1 row)

So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.

/Joel


Reply via email to