On Tue, 16 Nov 2021 13:22:46 GMT, kabutz <[email protected]> wrote:

> BigInteger currently uses three different algorithms for multiply: the 
> simple quadratic "schoolbook" algorithm, the faster Karatsuba algorithm once 
> the operands exceed a bit-count threshold, and Toom-Cook 3 once we get into 
> the several thousands of bits. Since Toom-Cook 3 is a recursive algorithm, 
> it is trivial to parallelize it. I have demonstrated this several times in 
> conference talks. In order to be consistent with other classes such as 
> Arrays and Collection, I have added a parallelMultiply() method. Internally 
> we have added a parameter to the private multiply method to indicate whether 
> the calculation should be done in parallel.
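> 
> As an illustrative sketch only (not the actual patch: the class name, the 
> threshold, and the simpler schoolbook split are made up for brevity), the 
> recursive parallel step has roughly this fork/join shape:
> 
> ```java
> import java.math.BigInteger;
> import java.util.concurrent.ForkJoinPool;
> import java.util.concurrent.RecursiveTask;
> 
> // Sketch only: forks the independent sub-products of a recursive multiply.
> // A real Toom-Cook 3 step splits each operand into three parts and computes
> // five point-wise products, but the fork/join shape is the same.
> // Assumes non-negative operands for brevity.
> final class MultiplyTask extends RecursiveTask<BigInteger> {
>     private static final int THRESHOLD_BITS = 8_000; // made-up cut-off
> 
>     private final BigInteger a, b;
> 
>     MultiplyTask(BigInteger a, BigInteger b) { this.a = a; this.b = b; }
> 
>     @Override
>     protected BigInteger compute() {
>         if (a.bitLength() < THRESHOLD_BITS || b.bitLength() < THRESHOLD_BITS) {
>             return a.multiply(b); // small operands: plain sequential multiply
>         }
>         // Split a = aHi*2^h + aLo and b = bHi*2^h + bLo
>         int h = Math.min(a.bitLength(), b.bitLength()) / 2;
>         BigInteger mask = BigInteger.ONE.shiftLeft(h).subtract(BigInteger.ONE);
>         BigInteger aLo = a.and(mask), aHi = a.shiftRight(h);
>         BigInteger bLo = b.and(mask), bHi = b.shiftRight(h);
> 
>         // The sub-products are independent: fork three, compute one inline.
>         MultiplyTask hihi = new MultiplyTask(aHi, bHi);
>         MultiplyTask hilo = new MultiplyTask(aHi, bLo);
>         MultiplyTask lohi = new MultiplyTask(aLo, bHi);
>         hihi.fork(); hilo.fork(); lohi.fork();
>         BigInteger lolo = new MultiplyTask(aLo, bLo).compute();
> 
>         // a*b = hihi*2^(2h) + (hilo + lohi)*2^h + lolo
>         return hihi.join().shiftLeft(2 * h)
>                 .add(hilo.join().add(lohi.join()).shiftLeft(h))
>                 .add(lolo);
>     }
> 
>     static BigInteger multiply(BigInteger a, BigInteger b) {
>         return ForkJoinPool.commonPool().invoke(new MultiplyTask(a, b));
>     }
> }
> ```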
> 
> The performance improvements are as should be expected. Fibonacci of 100 
> million (using a single-threaded Dijkstra's sum-of-squares version) 
> completes in 9.2 seconds with parallelMultiply() vs 25.3 seconds with the 
> sequential multiply() method. This is on my 1-8-2 (1 socket, 8 cores, 2 
> threads per core) laptop. The final multiplications are with very large 
> numbers, which then benefit from the parallelization of Toom-Cook 3. 
> Fibonacci of 100 million is a 69424191-bit number.
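> 
> The driver is along these lines; this is a sketch of the fast-doubling 
> scheme rather than the exact code used for the measurements above, and it 
> uses the parallelMultiply() added by this PR (swap in multiply() for the 
> sequential baseline):
> 
> ```java
> import java.math.BigInteger;
> 
> // Sketch of a Dijkstra-style fast-doubling Fibonacci. Each doubling step
> // performs a few huge multiplications, which is where parallelMultiply()
> // can pay off.
> class Fib {
>     static BigInteger fib(long n) {
>         BigInteger a = BigInteger.ZERO; // F(k)
>         BigInteger b = BigInteger.ONE;  // F(k+1)
>         for (int bit = 63 - Long.numberOfLeadingZeros(n); bit >= 0; bit--) {
>             // F(2k)   = F(k) * (2*F(k+1) - F(k))
>             // F(2k+1) = F(k)^2 + F(k+1)^2
>             BigInteger d = a.parallelMultiply(b.shiftLeft(1).subtract(a));
>             BigInteger e = a.parallelMultiply(a).add(b.parallelMultiply(b));
>             if (((n >>> bit) & 1) == 0) {
>                 a = d; b = e;        // now (F(2k), F(2k+1))
>             } else {
>                 a = e; b = d.add(e); // now (F(2k+1), F(2k+2))
>             }
>         }
>         return a;
>     }
> }
> ```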
> 
> We have also parallelized the private square() method. Internally, the 
> square() method defaults to sequential execution.
> 
> Some benchmark results, run on my 1-6-2 server:
> 
> 
> Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
> 
> 
> We can see that for large calculations (fib 100m), the execution is 2.7x 
> faster in parallel. For medium-sized calculations (fib 10m) it is 1.873x 
> faster, and for small ones (fib 1m) it is roughly the same. Considering 
> that the Fibonacci algorithm we used is itself sequential, and that the 
> last three multiplications dominate the run time, 2.7x faster should 
> probably be considered quite good on a 1-6-2 machine.

I would be wary of making any API use multiple threads behind the scenes 
without the user explicitly asking for it. While the latency of the given 
operation might improve in isolation, parallelization always incurs some 
(often significant) additional computation cost. This can reduce power 
efficiency, reduce CPU availability for other tasks, and hurt scalability.

Cost: On my system `multiply` burns about half as many cycles as 
`parallelMultiply`, even though it takes 2.4x longer in wall-clock time 
(measured with `-prof perfnorm`).

Scalability: To see how well the solution scales when the system is loaded, 
you could try running the `multiply` and `parallelMultiply` micros with 
increasingly large `-t` values (`-t` controls the number of threads running 
the benchmarks in parallel; default 1). On my system the advantage of 
`parallelMultiply` diminishes markedly as I saturate the system. On a test 
where I see a 2.4x speed-up at `-t 1` (`n = 50000000`), the advantage drops 
to 1.5x at `-t 4` and to only 1.1x at `-t 8`.
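
A saturation micro along these lines is what I mean; this is only a sketch 
(the class name, parameter values, and random-operand setup are made up, and 
the PR's actual micro computes Fibonacci numbers instead):

```java
import java.math.BigInteger;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical micro, not the one in the PR. Run with e.g.
//   java -jar benchmarks.jar MultiplySaturation -t 8 -prof perfnorm
// and watch how cycles-per-op and the speed-up change as -t grows.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class MultiplySaturation {
    @Param({"10000000", "50000000"})
    int bits; // operand size in bits (illustrative values)

    BigInteger a, b;

    @Setup
    public void setup() {
        Random r = new Random(42);
        a = new BigInteger(bits, r); // random, roughly bits-long operands
        b = new BigInteger(bits, r);
    }

    @Benchmark
    public BigInteger multiply() {
        return a.multiply(b);
    }

    @Benchmark
    public BigInteger parallelMultiply() {
        return a.parallelMultiply(b);
    }
}
```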

I'd favor a public `parallelMultiply()` that users must call explicitly, or 
alternatively a flag to opt in to parallelization.

(Nit: avoid appending JVM flags to microbenchmarks unless they are strictly 
necessary for the test, or can reasonably be expected to be within bounds on 
any test system. I didn't have 16 GB of free RAM.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409
