Hi Claes, thanks for the explanation, I suspected something like that.
I've run into this performance effect while investigating creation of Spring's
ConcurrentReferenceHashMap: it turned out that it used Array::newInstance to
create the array of References stored in a map's Segment:

private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}

The code above was rewritten into a plain array constructor call, gaining some
performance improvement:

private Reference<K, V>[] createReferenceArray(int size) {
    return new Reference[size];
}

This was the reason to go deeper and look at how both methods behave. The
actual behaviour is the same on both JDK 8 and JDK 11. Creation of
ConcurrentReferenceHashMap is important on some workloads; in my case it's
database access via Spring Data, where creation of ConcurrentReferenceHashMap
takes approximately 1/5 of the execution profile.

Talking about Spring Boot, it's possible to run an SB application in IntelliJ
IDEA in a certain mode, adding the -XX:TieredStopAtLevel=1 and -noverify VM
options. With full compilation the simplest application takes this long to
start up:

Benchmark  Mode  Cnt     Score     Error  Units
start-up     ss  100  2885,493 ± 167,660  ms/op

and with -XX:TieredStopAtLevel=1 -noverify:

Benchmark  Mode  Cnt     Score    Error  Units
start-up     ss  100  1707,342 ± 75,166  ms/op

> Hi,
>
> what you're seeing specifically here is likely the native overhead:
> Array::newInstance calls into the native method Array::newArray, and C1
> (TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.
>
> C1 and the interpreter will instead call into
> Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
> which will add a rather expensive constant overhead.
>
> TieredStopAtLevel=1/C1 performance is expected to be relatively slower
> than C2 in general, and often much worse in cases like this where there
> are optimized intrinsics at play.
>
> Have you seen a regression here compared to some older JDK release?
> It would also be very helpful if you could shed more light on the use
> case and point out what particular startup issues you're seeing that
> prevent you from using full tiered compilation and Spring Boot.
>
> /Claes
>
> On 2019-01-02 22:56, Сергей Цыпанов wrote:
>
>> Hello,
>>
>> The -XX:TieredStopAtLevel=1 flag is often used in some applications
>> (e.g. Spring Boot based) to reduce start-up time.
>>
>> With this flag I've spotted a huge performance degradation of
>> Array::newInstance compared to a plain constructor call.
>>
>> I've used this benchmark:
>>
>> @State(Scope.Thread)
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class ArrayInstantiationBenchmark {
>>
>>     @Param({"10", "100", "1000"})
>>     private int length;
>>
>>     @Benchmark
>>     public Object newInstance() {
>>         return Array.newInstance(Object.class, length);
>>     }
>>
>>     @Benchmark
>>     public Object constructor() {
>>         return new Object[length];
>>     }
>> }
>>
>> On C2 (JDK 11) both methods perform the same:
>>
>> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
>> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
>> ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
>> ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op
>>
>> ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
>> ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
>> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op
>>
>> On C1, however, there's a huge difference (approximately 8 times!) for
>> length = 10:
>>
>> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
>> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
>> ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
>> ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op
>>
>> ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
>> ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
>> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op
>>
>> Pay attention that performance for length = {100, 1000} is almost the same.
>>
>> I suppose it's a bug somewhere in the VM, because both methods just
>> allocate memory and do zeroing elimination, and subsequently there
>> shouldn't be such a huge difference between them.
>>
>> Sergey Tsypanov
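P.S. For anyone who wants to poke at this without setting up JMH, below is a crude, self-contained sketch (my own illustration; the class and method names are made up and it is not the benchmark quoted above). It checks that both allocation paths are functionally equivalent and prints rough per-op timings. Run it with and without -XX:TieredStopAtLevel=1 to see the C1 effect; treat the numbers as indicative only and prefer the JMH benchmark for real measurements.

```java
import java.lang.reflect.Array;

public class ArrayAllocDemo {

    // Reflective allocation: under C1 (-XX:TieredStopAtLevel=1) this goes
    // through the JNI entry Java_java_lang_reflect_Array_newArray, since C1
    // has no intrinsic for Array::newInstance, while C2 does.
    static Object[] viaNewInstance(int length) {
        return (Object[]) Array.newInstance(Object.class, length);
    }

    // Plain constructor call: compiled to a simple allocation by both C1 and C2.
    static Object[] viaConstructor(int length) {
        return new Object[length];
    }

    static long nanosFor(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int length = 10;
        int iterations = 1_000_000;

        // Both paths yield a zeroed Object[] of the requested length.
        Object[] a = viaNewInstance(length);
        Object[] b = viaConstructor(length);
        System.out.println(a.length == b.length && a[0] == null && b[0] == null);

        // Warm-up pass, then a rough (non-JMH!) measurement.
        nanosFor(() -> viaNewInstance(length), iterations);
        nanosFor(() -> viaConstructor(length), iterations);
        System.out.printf("newInstance: %d ns/op, constructor: %d ns/op%n",
                nanosFor(() -> viaNewInstance(length), iterations) / iterations,
                nanosFor(() -> viaConstructor(length), iterations) / iterations);
    }
}
```

With full tiered compilation the two figures should converge after warm-up; with -XX:TieredStopAtLevel=1 the newInstance path should stay noticeably slower for small lengths, matching the JMH numbers above.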