Many thanks Peter,
my preference would be for adding the benchmark now, and come back to
fix this post integration. The MemoryScope code is finicky and getting
confidence that the code is race-free takes time and I'd prefer not to
change that during the course of this RFR.
Your approach seems promising, so let's keep working and thinking on it
on the side (and perhaps let's maybe go for it in the panama repo first,
to make sure we don't add regressions).
Sounds like a plan?
Cheers
Maurizio
On 29/04/2020 20:19, Peter Levart wrote:
Hi Maurizio,
On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
The current implementation has performances that are on par with the
previous acquire-based implementation, and also on par with what can
be achieved with Unsafe. We do have a micro benchmark in the patch
(see ParallelSum (**)) which tests this, and I get _identical_
numbers even if I _comment_ the body of acquire/release - so that no
contention can happen; so, I'm a bit skeptical overall that
contention on acquire/release is the main factor at play here - but
perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and
added some benchmarks that exhibit Stream.findAny() and
Stream.findFirst(). As I anticipated, the results for parallel stream
variants were slowing the benchmark down, so I had to reduce the
number of elements by a factor of 16 to get results in reasonable time:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum.java
I then swapped-in the alternative implementation of MemoryScope (note
that this is not a whole implementation - the dup() method is missing):
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope.java
...and I got these results:
i7 2600K (4 cores / 8 threads)
with proposed MemoryScope:
Benchmark Mode Cnt Score Error Units
ParallelSum.find_any_stream_parallel avgt 10 1332.687 ± 733.535
ms/op
ParallelSum.find_any_stream_serial avgt 10 440.260 ± 3.110
ms/op
ParallelSum.find_first_loop_serial avgt 10 5.809 ± 0.044
ms/op
ParallelSum.find_first_stream_parallel avgt 10 2070.318 ± 41.072
ms/op
ParallelSum.find_first_stream_serial avgt 10 440.034 ± 4.672
ms/op
ParallelSum.sum_loop_serial avgt 10 5.647 ± 0.055
ms/op
ParallelSum.sum_stream_parallel avgt 10 5.314 ± 0.294
ms/op
ParallelSum.sum_stream_serial avgt 10 19.179 ± 0.136
ms/op
with alternative MemoryScope:
Benchmark Mode Cnt Score Error Units
ParallelSum.find_any_stream_parallel avgt 10 80.280 ± 13.183
ms/op
ParallelSum.find_any_stream_serial avgt 10 317.388 ± 2.787 ms/op
ParallelSum.find_first_loop_serial avgt 10 5.790 ± 0.038 ms/op
ParallelSum.find_first_stream_parallel avgt 10 117.925 ± 1.747 ms/op
ParallelSum.find_first_stream_serial avgt 10 315.076 ± 5.725 ms/op
ParallelSum.sum_loop_serial avgt 10 5.652 ± 0.042 ms/op
ParallelSum.sum_stream_parallel avgt 10 4.881 ± 0.053 ms/op
ParallelSum.sum_stream_serial avgt 10 19.143 ± 0.035 ms/op
So here it is. The proof that contention does occur.
Regards, Peter