On Mon, 15 Jul 2024 14:02:27 GMT, Maurizio Cimadamore <[email protected]>
wrote:
> So, that means that `arrayElementVarHandle` is ~4x faster than memory
> segment? Isn't that a bit odd?
I did some more analyis of the benchmark. I first eliminated the closing
thread, and started with two simple benchmarks:
@Benchmark
public int memorySegmentAccess() {
int sum = 0;
for (int i = 0; i < segment.byteSize(); i++) {
sum += segment.get(JAVA_BYTE, i);
}
return sum;
}
and
@Benchmark
public int otherAccess() {
int sum = 0;
for (int i = 0; i < array.length; i++) {
sum += (byte)BYTE_HANDLE.get(array, i);
}
return sum;
}
where the setup code is as follows:
static final int SIZE = 10_000;
MemorySegment segment;
byte[] array;
static final VarHandle BYTE_HANDLE =
MethodHandles.arrayElementVarHandle(byte[].class);
@Setup
public void setup() {
array = new byte[SIZE];
segment = MemorySegment.ofArray(array);
}
With this, I obtained the following results:
Benchmark Mode Cnt Score Error Units
ConcurrentClose.memorySegmentAccess avgt 10 13.879 ± 0.478 us/op
ConcurrentClose.otherAccess avgt 10 2.256 ± 0.017 us/op
Ugh. It seems like C2 "blows up" at the third iteration:
# Run progress: 0.00% complete, ETA 00:05:00
# Fork: 1 of 1
# Warmup Iteration 1: 6.712 us/op
# Warmup Iteration 2: 5.756 us/op
# Warmup Iteration 3: 13.267 us/op
# Warmup Iteration 4: 13.267 us/op
# Warmup Iteration 5: 13.274 us/op
This might be a bug/regression. But, let's move on. I then tweaked the
induction variable of the memory segment loop to be `long`, not `int` and I got:
Benchmark Mode Cnt Score Error Units
ConcurrentClose.memorySegmentAccess avgt 10 2.764 ± 0.016 us/op
ConcurrentClose.otherAccess avgt 10 2.240 ± 0.016 us/op
Far more respectable! And now we have a good baseline, since both workloads
take amount the same time, so we can use them to draw interesting comparisons.
So, let's add back a thread that does a shared arena close:
Benchmark Mode Cnt Score Error
Units
ConcurrentClose.sharedClose avgt 10 12.001 ± 0.061
us/op
ConcurrentClose.sharedClose:closing avgt 10 19.281 ± 0.323
us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 9.802 ± 0.314
us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 6.921 ± 0.151
us/op
This is with vanilla JDK. If I apply the changes in this PR, I get this:
Benchmark Mode Cnt Score Error
Units
ConcurrentClose.sharedClose avgt 10 10.837 ± 0.241
us/op
ConcurrentClose.sharedClose:closing avgt 10 20.337 ± 1.674
us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 8.672 ± 0.993
us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 3.501 ± 0.162
us/op
This is good. Note how `otherAccess` improved almost 2x, as the code is no
longer redundantly de-optimized. Now, we know that, even for memory segment
access, we can avoid redundant deopt once JDK-8290892 is fixed. To simulate
that, I've dropped the lines which apply the conservative deoptimization in
`scopedMemoryAccess.cpp` and ran the bench again:
Benchmark Mode Cnt Score Error
Units
ConcurrentClose.sharedClose avgt 10 8.957 ± 0.089
us/op
ConcurrentClose.sharedClose:closing avgt 10 18.898 ± 0.338
us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 4.403 ± 0.054
us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 3.571 ± 0.042
us/op
Ok, now both accessor threads seem faster.
If I swap the shared arena close with a confined arena close I get this:
Benchmark Mode Cnt Score Error
Units
ConcurrentClose.sharedClose avgt 10 1.760 ± 0.008
us/op
ConcurrentClose.sharedClose:closing avgt 10 ≈ 10⁻³
us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 2.912 ± 0.016
us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 2.367 ± 0.009
us/op
Summing up:
* there is some issue involving segment access with `int` induction variable
which we should investigate separately
* this PR significantly improves performance of threads that are not touching
memory segments, even under heavy shared arena close loads
* performance of unrelated memory segment access is still affected by
concurrent shared arena close. This is due to conservative deoptimization which
will be removed once JDK-8290892 is fixed
* when all fixes will be applied, the performance of the accessing threads gets
quite close to ideal, but not 100% there. The loss seems in the acceptable
range - given that this benchmark is closing shared arenas in a loop, arguably
the worst possible case.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228916752