On Thu, 18 Dec 2025 13:52:44 GMT, Andrew Haley <[email protected]> wrote:
>> Extend MOVK-based scheme to MOVK/MOVZ allowing to store 19 bits of metadata.
>>
>> Choose number of metadata slots in post-call NOP sequence between 1 and 2
>> depending on the offset from the CodeBlob header.
>>
>> Additionally, implement ADR/ADRP-based metadata storage - that provides 22
>> bits instead of 16 bits to store metadata. This can be enabled via
>> UsePostCallSequenceWithADRP option.
>>
>>
>> Renaissance 0.15.0 benchmark results (MOVK-based scheme)
>> Neoverse V1.
>> The runs were limited to 16 cores.
>>
>> Number of runs:
>> 6 for baseline, 6 for the changes - interleaved pairs.
>>
>> Command line:
>> java -jar renaissance-jmh-0.15.0.jar \
>>   -bm avgt -gc true -v extra \
>>   -jvmArgsAppend '-Xbatch -XX:-UseDynamicNumberOfCompilerThreads \
>>   -XX:-CICompilerCountPerCPU -XX:ActiveProcessorCount=16 \
>>   -XX:CICompilerCount=2 -Xms8g -Xmx8g -XX:+AlwaysPreTouch \
>>   -XX:+UseG1GC'
>>
>> The change is the geometric mean of ratios across the 6 pairs of runs.
>>
>> | Benchmark                                            | Change  | 90% CI for the change |
>> | ---------------------------------------------------- | ------- | --------------------- |
>> | org.renaissance.actors.JmhAkkaUct.run                | -0.215% | -2.652% to 1.357%     |
>> | org.renaissance.actors.JmhReactors.run               | -0.166% | -1.974% to 1.775%     |
>> | org.renaissance.jdk.concurrent.JmhFjKmeans.run       | 0.222%  | -0.492% to 0.933%     |
>> | org.renaissance.jdk.concurrent.JmhFutureGenetic.run  | -1.880% | -2.438% to -1.343%    |
>> | org.renaissance.jdk.streams.JmhMnemonics.run         | -0.500% | -1.032% to 0.089%     |
>> | org.renaissance.jdk.streams.JmhParMnemonics.run      | -0.740% | -2.092% to 0.639%     |
>> | org.renaissance.jdk.streams.JmhScrabble.run          | -0.031% | -0.353% to 0.310%     |
>> | org.renaissance.neo4j.JmhNeo4jAnalytics.run          | -0.873% | -2.323% to 0.427%     |
>> | org.renaissance.rx.JmhRxScrabble.run                 | -0.512% | -1.121% to 0.049%     |
>> | org.renaissance.scala.dotty.JmhDotty.run             | -0.219% | -1.108% to 0.708%     |
>> | org.renaissance.scala.sat.JmhScalaDoku.run           | -2.750% | -6.426% to -0.827%    |
>> | org.renaissance.scala.stdlib.JmhScalaKmeans.run      | 0.046%  | -0.383% to 0.408%     |
>> | org.renaissance.scala.stm.JmhPhilosophers.run        | 1.497%  | -0.955% to 3.923%     |
>> | org.renaissance.scala.stm.JmhScalaStmBench7.run ...
>
> On 18/12/2025 13:28, Ruben wrote:
>> The |B.nv| might not be suitable in this case - I believe the branch
>> will be "always taken".
>
> Ah, so it is. That's a shame.
>
>> However, I had considered using |CBNZ XZR, <#imm>|. So far, I avoided
>> implementing it because it is unclear what performance effects this
>> might have.
>
> I've tried it, and it's definitely a lot slower than a NOP or a MOVZ.
> It would be nice if we could persuade the architecture people to give us
> a NOP with payload, perhaps because "Intel has it."
>
> But if we choose the split between CB offset and oopmap index wisely, we
> can avoid using a second instruction most of the time.

Hi @theRealAph,

> But if we choose the split between CB offset and oopmap index wisely, we can
> avoid using a second instruction most of the time.
> Something else to bear in mind is that we don't always have to succeed placing a post-call NOP. They are not used as frequently as they were in the past, in particular because we found a way to save and restore virtual thread stacks without usually having to trace them. In practice this means that we only have to copy the stack memory, so we don't need the post-call NOPs so often.
I believe we can avoid the second instruction with the following approach:

- where possible, fully store cb_offset and oopmap_index and use MOVZ/MOVK
- where not possible, use MOVN and store partial cb_offset and oopmap_index

The partial cb_offset and oopmap_index values will make the search for the respective metadata faster than searching based on the PC value alone.

From an initial experiment, it appears oopmap_index increments about every 8 instructions on average - though this observation might be specific to a certain workload. However, if that's the case, the current implementation allows us to encode post-call NOP metadata for offsets up to 8K. The above approach would be an improvement in this case: the MOVK/MOVZ-based chunks would allow us to encode the metadata for offsets up to 16K, while the MOVN-based chunks would speed up the search for the remaining cases.

What do you think about this option?
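To make the idea more concrete, here is a rough sketch of how the packing and the MOVZ/MOVN selection could look. This is not code from the patch: the 5/11-bit split, the word-sized cb_offset, and the `Payload`/`pack`/`matches_hint` names are assumptions for illustration only; the real field widths would have to be chosen from data such as the oopmap density mentioned above.

```c++
// Sketch only: names, field widths and the MOVZ/MOVN selection below are
// assumptions for discussion, not the actual HotSpot implementation.
#include <cstdint>

// Assumed split of the single 16-bit immediate: the low bits carry
// oopmap_index, the remaining bits carry cb_offset (assumed here to be
// expressed in 4-byte words).
constexpr unsigned kOopmapBits   = 5;
constexpr unsigned kCbOffsetBits = 16 - kOopmapBits;
constexpr uint32_t kOopmapMask   = (1u << kOopmapBits) - 1;
constexpr uint32_t kCbOffsetMask = (1u << kCbOffsetBits) - 1;

struct Payload {
  uint16_t imm16;  // value for the immediate field of the post-call instruction
  bool     exact;  // true  -> emit MOVZ, imm16 holds the full values
                   // false -> emit MOVN, imm16 holds truncated values (search hint)
};

// Pack cb_offset/oopmap_index into one 16-bit payload; fall back to a
// truncated hint when either value does not fit in its field.
inline Payload pack(uint32_t cb_offset_words, uint32_t oopmap_index) {
  bool fits = cb_offset_words <= kCbOffsetMask && oopmap_index <= kOopmapMask;
  uint16_t imm = static_cast<uint16_t>(((cb_offset_words & kCbOffsetMask) << kOopmapBits) |
                                       (oopmap_index & kOopmapMask));
  return {imm, fits};
}

// Decode side: with MOVZ the payload is authoritative; with MOVN it only
// filters the candidate (cb_offset, oopmap_index) pairs that a PC-based
// search would otherwise have to consider one by one.
struct Candidate {
  uint32_t cb_offset_words;
  uint32_t oopmap_index;
};

inline bool matches_hint(const Candidate& c, uint16_t imm16) {
  return (c.cb_offset_words & kCbOffsetMask) == static_cast<uint32_t>(imm16 >> kOopmapBits) &&
         (c.oopmap_index    & kOopmapMask)   == static_cast<uint32_t>(imm16 & kOopmapMask);
}
```

With a split like this, a single MOVZ fully describes call sites whose cb_offset fits in kCbOffsetBits, and everything beyond that still gets a MOVN whose low bits narrow the search instead of falling back to a purely PC-based lookup.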
-------------

PR Comment: https://git.openjdk.org/jdk/pull/28855#issuecomment-3743848408