On Wed, 17 Nov 2021 17:51:03 GMT, Erik Österlund <eosterl...@openjdk.org> wrote:
> I would be wary to make any API use multiple threads behind the scenes
> without the user explicitly asking for it. While latency of the given
> operation might improve in isolation, parallelization always incurs some
> (often significant) additional cost of computation. This might reduce power
> efficiency, reduce CPU availability for other tasks, and play tricks with
> scalability.
>
> Cost: On my system `multiply` burns about half as many cycles as
> `parallelMultiply`, even though it takes 2.4x longer (measured with
> `-prof perfnorm`).
>
> Scalability: To simulate how well the solution scales, you could try running
> the `multiply` and `parallelMultiply` micros with increasingly large `-t`
> values. (`-t` controls the number of threads running the benchmarks in
> parallel, default 1.) On my system the advantage of `parallelMultiply`
> diminishes markedly as I saturate the system. On a test where I see a 2.4x
> speed-up at `-t 1` (`n = 50000000`), the advantage drops to 1.5x at `-t 4`
> and only gives me a 1.1x speed-up at `-t 8`.

Furthermore, since it uses the common fork/join pool by default, any simultaneous parallel execution (parallel sort, parallel streams, etc.) would decrease the performance.

> I'd favor a public `parallelMultiply()`. Alternatively a flag to opt in to
> parallelization.

The reason I would prefer a public method to a flag is that it might be useful to enable parallel multiply on a case-by-case basis; with a flag, it is an all-or-nothing approach. If it has to be a flag, then I'd agree that it should be opt-in. (See the second sketch at the end of this message for how a caller could decide per call site.)

> (Nit: avoid appending flags to microbenchmarks that aren't strictly
> necessary for the tests, or such that can be reasonably expected to be
> within bounds on any test system. I didn't have 16 GB of free RAM.)

Thanks, will change that. Fun fact: the book Optimizing Java has a graph in the introduction that refers to some tests I did on Fibonacci a long time ago. The GC was dominating in those tests because the heap spaces were too small, but there I was calculating Fibonacci of 1 billion. For "just" 100 million we don't need as much memory: the resident set size for `multiply()` is about 125 MB and for `parallelMultiply()` about 190 MB. Here is the total amount of memory that each of the Fibonacci calculations allocates (a sketch of one way such a Fibonacci micro might look follows the table):

```
BigIntegerParallelMultiply.multiply:           1_000_000    177 MB
BigIntegerParallelMultiply.multiply:          10_000_000   3411 MB
BigIntegerParallelMultiply.multiply:         100_000_000  87689 MB
BigIntegerParallelMultiply.parallelMultiply:   1_000_000    177 MB
BigIntegerParallelMultiply.parallelMultiply:  10_000_000   3411 MB
BigIntegerParallelMultiply.parallelMultiply: 100_000_000  87691 MB
```
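For reference, here is a minimal sketch of the kind of Fibonacci calculation that stresses huge `BigInteger` multiplications, using the standard fast-doubling identities. This is illustrative only, not the actual benchmark code from the PR; the class and method names are made up:

```java
import java.math.BigInteger;

public final class FibonacciSketch {

    /**
     * Fast doubling: given (a, b) = (F(k), F(k+1)), uses
     *   F(2k)   = a * (2b - a)
     *   F(2k+1) = a^2 + b^2
     * and walks the bits of n from the top down.
     */
    static BigInteger fibonacci(long n) {
        BigInteger a = BigInteger.ZERO; // F(0)
        BigInteger b = BigInteger.ONE;  // F(1)
        for (int i = 63 - Long.numberOfLeadingZeros(n); i >= 0; i--) {
            BigInteger c = a.multiply(b.shiftLeft(1).subtract(a)); // F(2k)
            BigInteger d = a.multiply(a).add(b.multiply(b));       // F(2k+1)
            if (((n >>> i) & 1) == 0) {
                a = c;             // advance to (F(2k), F(2k+1))
                b = d;
            } else {
                a = d;             // advance to (F(2k+1), F(2k+2))
                b = c.add(d);
            }
        }
        return a;
    }

    public static void main(String[] args) {
        // Around n = 100_000_000 the intermediate values reach tens of
        // millions of bits, which is where multiply vs parallelMultiply
        // starts to matter.
        System.out.println(fibonacci(1_000).bitLength() + " bits");
    }
}
```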
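And here is a minimal sketch of the case-by-case opt-in I have in mind with a public method. `parallelMultiply` is the method this PR proposes; the threshold constant and the helper are hypothetical and would need tuning on real hardware:

```java
import java.math.BigInteger;

final class MultiplyChooser {
    // Hypothetical cut-off: below this many bits, the fork/join overhead and
    // extra burned cycles likely outweigh any latency win (needs measuring).
    private static final int PARALLEL_THRESHOLD_BITS = 1 << 20;

    static BigInteger multiplyAdaptively(BigInteger x, BigInteger y) {
        if (Math.min(x.bitLength(), y.bitLength()) < PARALLEL_THRESHOLD_BITS) {
            return x.multiply(y);       // small operands: stay on the caller's thread
        }
        return x.parallelMultiply(y);   // huge operands: trade cycles for latency
    }
}
```

With a flag the whole JVM flips at once; with a public method the decision sits next to the operands, where the caller knows their sizes and whether the common pool is already busy.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409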