Hi,

what you're seeing specifically here is likely the native overhead:
Array::newInstance calls into the native method Array::newArray, and C1
(TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.

C1 and the interpreter will instead call into
Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
which adds a rather expensive constant overhead.

TieredStopAtLevel=1/C1 performance is expected to be relatively slower
than C2 in general, and often much worse in cases like this, where
optimized intrinsics are at play.
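If the constant JNI cost matters in a hot path on TieredStopAtLevel=1, one possible workaround (a sketch of my own, not something from this thread; ArrayFactory and newArray are hypothetical names) is to bypass the reflective call for component types known in advance:

```java
import java.lang.reflect.Array;

public class ArrayFactory {
    // Hypothetical helper: special-case the common Object[] type so the
    // allocation stays a plain bytecode (anewarray), which C1 compiles
    // directly; only unknown component types pay the JNI-backed
    // Array.newInstance cost.
    static Object newArray(Class<?> componentType, int length) {
        if (componentType == Object.class) {
            return new Object[length];
        }
        return Array.newInstance(componentType, length);
    }

    public static void main(String[] args) {
        Object a = newArray(Object.class, 10);
        Object b = newArray(String.class, 5);
        System.out.println(a.getClass().getSimpleName());
        System.out.println(b.getClass().getSimpleName());
    }
}
```

This only helps callers that mostly allocate one known type, of course; it does nothing for the general reflective case.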

Have you seen a regression here compared to some older JDK release?

It would also be very helpful if you could shed more light on the use
case and point out what particular startup issues you're seeing that
prevent you from using full tiered compilation with Spring Boot.

/Claes

On 2019-01-02 22:56, Сергей Цыпанов wrote:
Hello,

The -XX:TieredStopAtLevel=1 flag is often used in some applications
(e.g. Spring Boot based) to reduce start-up time.

With this flag I've spotted a huge performance degradation of
Array::newInstance compared to a plain constructor call.

I've used this benchmark:

import java.lang.reflect.Array;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayInstantiationBenchmark {

    @Param({"10", "100", "1000"})
    private int length;

    @Benchmark
    public Object newInstance() {
        return Array.newInstance(Object.class, length);
    }

    @Benchmark
    public Object constructor() {
        return new Object[length];
    }
}

On C2 (JDK 11) both methods perform the same:

Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op

ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op

On C1, however, there's a huge difference (approximately 8x!) for length = 10:

Benchmark                                (length)  Mode  Cnt    Score    Error  Units
ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op

ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op

Note that performance for length = {100, 1000} is almost the same.

I suppose it's a bug somewhere in the VM, because both methods just
allocate memory (with zeroing elimination where applicable), so there
shouldn't be such a huge difference between them.
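[Editorial note: for reference, the two allocation paths are observably equivalent; this quick check is an editor's sketch, not part of the original benchmark:]

```java
import java.lang.reflect.Array;
import java.util.Arrays;

public class EquivalenceCheck {
    public static void main(String[] args) {
        // Both paths produce a zero-filled (null-filled) Object[] of the
        // same runtime class and length; only the allocation mechanism
        // differs (reflective JNI call vs. anewarray bytecode).
        Object[] viaReflection = (Object[]) Array.newInstance(Object.class, 10);
        Object[] viaConstructor = new Object[10];

        System.out.println(viaReflection.getClass() == viaConstructor.getClass());
        System.out.println(Arrays.equals(viaReflection, viaConstructor));
    }
}
```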

Sergey Tsypanov
