Hi,
I've also taken a look at your microbenchmark and seen a few regressions
from JDK 9 through 12, some of which I've identified, and some of which
might be (partially) actionable. They are mostly related to the recent
addition of low-overhead heap sampling and to allocator/GC changes. All
of the blame is in hotspot, so let's leave the core-libs-devs alone for
now. :-)
Some additional comments below...
On 2019-01-04 08:25, Сергей Цыпанов wrote:
Hi Claes,
thanks for the explanation, I suspected something like that.
I've run into this performance effect while investigating the creation of
Spring's ConcurrentReferenceHashMap. It turned out that it used
Array::newInstance to create the array of References stored in a map's
Segment:
@SuppressWarnings("unchecked")
private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}
The code above was rewritten into a plain array constructor call, gaining
some performance improvement:

@SuppressWarnings("unchecked")
private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) new Reference<?, ?>[size];
}
While a point fix, avoiding reflection seems like the right thing to do
when the array type is known statically, anyhow.
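For instance (a minimal sketch, not Spring's actual code), an
array-constructor reference keeps the allocation reflection-free while
staying generic for callers that know the concrete type:

import java.util.function.IntFunction;

class ArrayFactory {
    // The factory (e.g. String[]::new) compiles down to a plain
    // anewarray bytecode, so no JNI round-trip is involved.
    static <T> T[] create(IntFunction<T[]> factory, int size) {
        return factory.apply(size);
    }
}

// usage: String[] a = ArrayFactory.create(String[]::new, 10);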
This was the reason to go deeper and look at how both methods behave.
The actual behaviour is the same on both JDK 8 and JDK 11.
Creation of ConcurrentReferenceHashMap is important on some workloads;
in my case it's database access via Spring Data, where creation of
ConcurrentReferenceHashMap takes approximately one fifth of the
execution profile.
Talking about Spring Boot, it's possible to run an SB application in
IntelliJ IDEA in a certain mode by adding the -XX:TieredStopAtLevel=1
and -noverify VM options.
With full compilation the simplest application takes this long to start up:
Benchmark  Mode  Cnt     Score     Error  Units
start-up     ss  100  2885,493 ± 167,660  ms/op
and with `-XX:TieredStopAtLevel=1 -noverify`
Benchmark  Mode  Cnt     Score    Error  Units
start-up     ss  100  1707,342 ± 75,166  ms/op
Thanks! Which JDK version are you using?
-noverify can be used without -XX:TieredStopAtLevel=1 (but don't use
this in production!). You might gain some by enabling CDS (run java
-Xshare:dump once, then add -Xshare:auto to your command lines). There
are a few other tricks to pull that might help startup without
sacrificing peak performance.
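For example (a hypothetical invocation; substitute your own jar name):

java -Xshare:dump
java -Xshare:auto -noverify -jar app.jar

The first command creates the default CDS archive once; subsequent runs
then map the shared archive instead of loading the base classes from
scratch.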
/Claes
Hi,
what you're seeing specifically here is likely the native overhead:
Array::newInstance calls into the native method Array::newArray, and C1
(-XX:TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2
does. C1 and the interpreter will instead call into
Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
which adds a rather expensive constant overhead.
TieredStopAtLevel=1/C1 performance is expected to be relatively slower
than C2 in general, and often much worse in cases like this, where
optimized intrinsics are at play.
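For reference, the relevant path in java.lang.reflect.Array looks roughly
like this (paraphrased from the JDK 11 sources):

public static Object newInstance(Class<?> componentType, int length)
        throws NegativeArraySizeException {
    return newArray(componentType, length);
}

// C2 can replace calls to this method with an intrinsic; C1 and the
// interpreter make an actual JNI call into libjava's Array.c instead.
@HotSpotIntrinsicCandidate
private static native Object newArray(Class<?> componentType, int length)
        throws NegativeArraySizeException;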
Have you seen a regression here compared to some older JDK release?
It would also be very helpful if you could shed more light on the use
case and point out what particular startup issues you're seeing that
prevent you from using full tiered compilation with Spring Boot.
/Claes
On 2019-01-02 22:56, Сергей Цыпанов wrote:
Hello,
The -XX:TieredStopAtLevel=1 flag is often used in some applications (e.g.
Spring Boot based ones) to reduce start-up time.
With this flag I've spotted a huge performance degradation of
Array::newInstance compared to a plain constructor call.
I've used this benchmark:

import java.lang.reflect.Array;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayInstantiationBenchmark {

    @Param({"10", "100", "1000"})
    private int length;

    @Benchmark
    public Object newInstance() {
        return Array.newInstance(Object.class, length);
    }

    @Benchmark
    public Object constructor() {
        return new Object[length];
    }
}
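(For completeness, a minimal JMH runner along these lines, or the
equivalent -jvmArgs on the JMH command line, can be used to reproduce the
C1-only run; the class name BenchmarkRunner is my own:)

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ArrayInstantiationBenchmark.class.getSimpleName())
                .jvmArgsAppend("-XX:TieredStopAtLevel=1") // pin compilation to C1
                .build();
        new Runner(opt).run();
    }
}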
On C2 (JDK 11) both methods perform the same:
Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op
ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op
On C1, however, there's a huge difference (approximately 8 times!) for
length = 10:
Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op
ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op
Note that performance for length = {100, 1000} is almost the same.
I suppose it's a bug somewhere in the VM, because both methods just
allocate memory (and can benefit from zeroing elimination), so there
shouldn't be such a huge difference between them.
Sergey Tsypanov