Re: RFR: 8373696: AArch64: Refine post-call NOPs

Ruben Thu, 18 Dec 2025 05:32:17 -0800

On Wed, 17 Dec 2025 09:32:43 GMT, Andrew Haley <[email protected]> wrote:


>> Extend MOVK-based scheme to MOVK/MOVZ allowing to store 19 bits of metadata.
>> 
>> Choose number of metadata slots in post-call NOP sequence between 1 and 2 
>> depending on the offset from the CodeBlob header.
>> 
>> Additionally, implement ADR/ADRP-based metadata storage - that provides 22 
>> bits instead of 16 bits to store metadata. This can be enabled via 
>> UsePostCallSequenceWithADRP option.
>> 
>> 
>>  Renaissance 0.15.0 benchmark results (MOVK-based scheme)
>>  Neoverse V1.
>>  The runs were limited to 16 cores.
>> 
>>  Number of runs:
>>    6 for baseline, 6 for the changes - interleaved pairs.
>> 
>>  Command line:
>>   java -jar renaissance-jmh-0.15.0.jar \
>>     -bm avgt -gc true -v extra \
>>     -jvmArgsAppend '-Xbatch -XX:-UseDynamicNumberOfCompilerThreads \
>>       -XX:-CICompilerCountPerCPU -XX:ActiveProcessorCount=16 \
>>       -XX:CICompilerCount=2 -Xms8g -Xmx8g -XX:+AlwaysPreTouch \
>>       -XX:+UseG1GC'
>> 
>>  The change is the geometric mean of ratios across the 6 pairs of runs.
>> 
>>   |  Benchmark                                            |  Change  | 90% CI for the change |
>>   | ----------------------------------------------------- | -------- | --------------------- |
>>   | org.renaissance.actors.JmhAkkaUct.run                 |  -0.215% |    -2.652% to  1.357% |
>>   | org.renaissance.actors.JmhReactors.run                |  -0.166% |    -1.974% to  1.775% |
>>   | org.renaissance.jdk.concurrent.JmhFjKmeans.run        |   0.222% |    -0.492% to  0.933% |
>>   | org.renaissance.jdk.concurrent.JmhFutureGenetic.run   |  -1.880% |    -2.438% to -1.343% |
>>   | org.renaissance.jdk.streams.JmhMnemonics.run          |  -0.500% |    -1.032% to  0.089% |
>>   | org.renaissance.jdk.streams.JmhParMnemonics.run       |  -0.740% |    -2.092% to  0.639% |
>>   | org.renaissance.jdk.streams.JmhScrabble.run           |  -0.031% |    -0.353% to  0.310% |
>>   | org.renaissance.neo4j.JmhNeo4jAnalytics.run           |  -0.873% |    -2.323% to  0.427% |
>>   | org.renaissance.rx.JmhRxScrabble.run                  |  -0.512% |    -1.121% to  0.049% |
>>   | org.renaissance.scala.dotty.JmhDotty.run              |  -0.219% |    -1.108% to  0.708% |
>>   | org.renaissance.scala.sat.JmhScalaDoku.run            |  -2.750% |    -6.426% to -0.827% |
>>   | org.renaissance.scala.stdlib.JmhScalaKmeans.run       |   0.046% |    -0.383% to  0.408% |
>>   | org.renaissance.scala.stm.JmhPhilosophers.run         |   1.497% |    -0.955% to  3.923% |
>>   | org.renaissance.scala.stm.JmhScalaStmBench7.run ...
>
> While the basic idea is a good one, I think there are better ways to do it.
> 
> Consider fixing the number of instructions to two rather than three. Then a 
> post-call NOP can be either `nop; movk/movz` or `b.nv; movk/adrp`, or 
> something similar. `b.nv` is more expensive, but should be rare.

Thank you for the feedback, @theRealAph,

I will be out of office until early January - I'm planning to respond in more 
detail and work on updating the PR when I am back.

I agree a fixed-length post-call NOP sequence would be preferable and would 
make the implementation simpler.

`B.NV` might not be suitable in this case: I believe the NV condition behaves 
like AL on AArch64, so the branch would be "always taken".
However, I had considered using `CBNZ XZR, <#imm>`. So far, I have avoided 
implementing it because it is unclear what performance effects it might have.
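
To make the difference between the two candidates concrete, here is a minimal 
sketch of their encodings. It is illustrative only and is not code from the PR: 
the helper names are hypothetical, the field layouts are as I recall them from 
the Arm Architecture Reference Manual, and placing the metadata in the imm19 
field is an assumption about how `CBNZ XZR, <#imm>` would be used.

```python
# Illustrative sketch only; helper names are hypothetical, not HotSpot code.
XZR = 31          # register number 31 reads as zero in CBZ/CBNZ
COND_NV = 0b1111  # spelled "never", but behaves like AL (always) on AArch64


def encode_b_cond(cond: int, imm19: int) -> int:
    """B.<cond> label: 0101010 0 | imm19 | 0 | cond."""
    assert 0 <= cond < 16 and 0 <= imm19 < (1 << 19)
    return 0x54000000 | (imm19 << 5) | cond


def encode_cbnz_x(rt: int, imm19: int) -> int:
    """CBNZ Xt, label (64-bit form): 1 011010 1 | imm19 | Rt."""
    assert 0 <= rt < 32 and 0 <= imm19 < (1 << 19)
    return 0xB5000000 | (imm19 << 5) | rt


def is_cbnz_xzr(insn: int) -> bool:
    """True if insn is CBNZ XZR, <anything>; only imm19 is left unmasked."""
    return (insn & 0xFF00001F) == (0xB5000000 | XZR)


def payload_of(insn: int) -> int:
    """Extract the 19-bit immediate field (bits 23:5)."""
    return (insn >> 5) & 0x7FFFF


# CBNZ XZR never branches because XZR always reads as zero, so under the
# assumption above imm19 is free to hold metadata rather than a real offset.
insn = encode_cbnz_x(XZR, 0x12345)
print(hex(insn), is_cbnz_xzr(insn), hex(payload_of(insn)))
```

With `B.NV` the branch is taken, so its imm19 would have to be a real offset 
(for example, to the next instruction) and could not carry metadata itself.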

While I agree that the case will be rare in terms of static occurrences, it 
might still impact performance if a particular instance is on a frequent 
execution path.

I believe the change in C1 floating-point constants handling is a dependency 
for the current implementation - it allows keeping the constants section empty 
and ensures the compiler can determine the offset to the code blob header 
before the code buffer is finalized. However, if the approach is changed to a 
fixed post-call NOP sequence length, there will be no dependency between these 
two optimizations.

I will revisit both the post-call NOP sequence structure and the dependency on 
floating-point constants handling when I return in January.
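
For reference while the structure is being revisited, the sketch below shows one 
way a single MOVK/MOVZ slot writing to XZR could hold 19 bits. The split used 
here (16 bits in imm16, 2 bits in hw, 1 bit from the MOVK-versus-MOVZ choice) is 
an assumption made for illustration and may not match the bit assignment in the 
PR; `pack19` and `unpack19` are hypothetical names.

```python
# Illustrative sketch only; the bit assignment is an assumption, not the PR's.
XZR = 31
MOVZ_X = 0xD2800000  # MOVZ Xd, #imm16, LSL #(16 * hw), 64-bit form
MOVK_X = 0xF2800000  # MOVK Xd, #imm16, LSL #(16 * hw), 64-bit form


def pack19(meta: int) -> int:
    """Pack 19 bits into a move-wide instruction whose destination is XZR."""
    assert 0 <= meta < (1 << 19)
    imm16 = meta & 0xFFFF           # bits 15:0  -> imm16 field (insn bits 20:5)
    hw = (meta >> 16) & 0x3         # bits 17:16 -> hw field (insn bits 22:21)
    use_movk = (meta >> 18) & 0x1   # bit 18     -> MOVK (1) or MOVZ (0)
    base = MOVK_X if use_movk else MOVZ_X
    return base | (hw << 21) | (imm16 << 5) | XZR


def unpack19(insn: int):
    """Return the 19-bit value, or None if insn is not MOVK/MOVZ to XZR."""
    # Bit 29 distinguishes MOVK from MOVZ, so it is left out of the mask.
    if (insn & 0xDF80001F) != (MOVZ_X | XZR):
        return None
    imm16 = (insn >> 5) & 0xFFFF
    hw = (insn >> 21) & 0x3
    is_movk = (insn >> 29) & 0x1
    return (is_movk << 18) | (hw << 16) | imm16


word = pack19(0x5ABCD)
print(hex(word), hex(unpack19(word)))
```

Because the destination is XZR, the write is discarded in either form, which is 
what lets the instruction serve as a metadata carrier.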

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28855#issuecomment-3670298853
