Hi,

I've also taken a look at your microbenchmark and seen a few regressions
from 9 through 12, some of which I've identified - and some of which might
be (partially) actionable. They are mostly related to the recent addition
of low-overhead heap sampling and to allocator/GC changes. All of the
blame is in hotspot, so let's leave the core-libs devs alone for now. :-)

Some additional comments below...

On 2019-01-04 08:25, Сергей Цыпанов wrote:
Hi Claes,

thanks for the explanation, I suspected something like that.

I've run into this performance effect while investigating the creation of
Spring's ConcurrentReferenceHashMap. It turned out that it used
Array::newInstance to create the array of References stored in a map's
Segment:

@SuppressWarnings("unchecked")
private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}

The code above was rewritten into a plain array constructor call, gaining
some performance improvement:

@SuppressWarnings("unchecked")
private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) new Reference[size];
}

While a point fix, avoiding reflection seems like the right thing to do
when the array type is known statically, anyhow.


This was the reason to go deeper and look at how both methods behave.
The behaviour is the same on both JDK 8 and JDK 11.

Creation of ConcurrentReferenceHashMap is important on some workloads; in
my case it's database access via Spring Data, where creation of
ConcurrentReferenceHashMap takes approximately 1/5 of the execution
profile.

Speaking of Spring Boot, it's possible to run a Spring Boot application
in IntelliJ IDEA in a mode that adds the -XX:TieredStopAtLevel=1 and
-noverify VM options.
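
For reference, outside the IDE this corresponds to launching with
something like the following (the jar name is just an example):

java -XX:TieredStopAtLevel=1 -noverify -jar application.jar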

With full tiered compilation the simplest application takes this long to
start up:

Benchmark  Mode  Cnt     Score     Error  Units
start-up     ss  100  2885,493 ± 167,660  ms/op

and with `-XX:TieredStopAtLevel=1 -noverify`:

Benchmark  Mode  Cnt     Score    Error  Units
start-up     ss  100  1707,342 ± 75,166  ms/op

Thanks! Which JDK version are you using?

-noverify can be used without -XX:TieredStopAtLevel=1 (but don't use
this in production!). You might gain some startup time by enabling CDS
(run java -Xshare:dump once, then add -Xshare:auto to your command
lines). There are a few other tricks to pull that might help startup
without sacrificing peak performance.
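
Concretely, the CDS setup would look something like this (the jar name is
illustrative):

java -Xshare:dump
java -Xshare:auto -jar application.jar

-Xshare:auto silently falls back to running without the shared archive if
it can't be used, so it's safe to leave on.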

/Claes


Hi,

What you're seeing specifically here is likely the native overhead:
Array::newInstance calls into the native method Array::newArray, and C1
(TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.

C1 and the interpreter will instead call into
Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
which adds a rather expensive constant overhead per call.
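
If code like this has to run well under C1, one way to sidestep the
per-call JNI cost is to special-case component types known to be common
and only fall back to reflection for truly dynamic types. A minimal
sketch of that idea - the helper below is hypothetical, not something
from Spring or the JDK:

import java.lang.reflect.Array;

public final class ArrayFactory {

    private ArrayFactory() {
    }

    // Hypothetical helper: the statically known, common component type
    // takes a plain bytecode allocation (cheap on C1); only genuinely
    // dynamic component types take the reflective (JNI) path.
    @SuppressWarnings("unchecked")
    public static <T> T[] newArray(Class<T> componentType, int size) {
        if (componentType == Object.class) {
            return (T[]) new Object[size];   // anewarray, no JNI call
        }
        return (T[]) Array.newInstance(componentType, size);
    }
}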

TieredStopAtLevel=1/C1 performance is expected to be relatively slower
than C2 in general, and often much worse in cases like this, where
optimized intrinsics are at play.

Have you seen a regression here compared to some older JDK release?

It would also be very helpful if you could shed more light on the use
case and point out what particular startup issues you're seeing that
prevent you from using full tiered compilation with Spring Boot.

/Claes

On 2019-01-02 22:56, Сергей Цыпанов wrote:

Hello,

The -XX:TieredStopAtLevel=1 flag is often used in some applications
(e.g. Spring Boot based ones) to reduce start-up time.

With this flag I've spotted a huge performance degradation of
Array::newInstance compared to a plain array constructor call.

I've used this benchmark:

import java.lang.reflect.Array;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayInstantiationBenchmark {

    @Param({"10", "100", "1000"})
    private int length;

    @Benchmark
    public Object newInstance() {
        return Array.newInstance(Object.class, length);
    }

    @Benchmark
    public Object constructor() {
        return new Object[length];
    }
}
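
For completeness, a minimal JMH runner for the benchmark above could look
like this - the runner class is illustrative and assumes the standard JMH
runner API:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ArrayInstantiationRunner {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(ArrayInstantiationBenchmark.class.getSimpleName())
                // add "-XX:TieredStopAtLevel=1" for the C1-only runs:
                // .jvmArgsAppend("-XX:TieredStopAtLevel=1")
                .build();
        new Runner(opt).run();
    }
}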

On C2 (JDK 11) both methods perform the same:

Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op

ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op

On C1, however, there's a huge difference (approximately 8 times!) for
length = 10:

Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op

ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op

Note that for length = {100, 1000} both methods perform almost the same,
so the overhead mostly affects small arrays.

I suppose it's a bug somewhere in the VM, because both methods just
allocate memory (with the same opportunities for zeroing elimination),
so there shouldn't be such a huge difference between them.

Sergey Tsypanov
