On Thu, 18 Dec 2025 13:52:44 GMT, Andrew Haley <[email protected]> wrote:
>> Extend MOVK-based scheme to MOVK/MOVZ allowing to store 19 bits of metadata.
>>
>> Choose number of metadata slots in post-call NOP sequence between 1 and 2
>> depending on the offset from the CodeBlob header.
>>
>> Additionally, implement ADR/ADRP-based metadata storage - that provides 22
>> bits instead of 16 bits to store metadata. This can be enabled via
>> UsePostCallSequenceWithADRP option.
>>
>>
>> Renaissance 0.15.0 benchmark results (MOVK-based scheme)
>> Neoverse V1.
>> The runs were limited to 16 cores.
>>
>> Number of runs:
>> 6 for baseline, 6 for the changes - interleaved pairs.
>>
>> Command line:
>> java -jar renaissance-jmh-0.15.0.jar \
>>   -bm avgt -gc true -v extra \
>>   -jvmArgsAppend '-Xbatch -XX:-UseDynamicNumberOfCompilerThreads \
>>   -XX:-CICompilerCountPerCPU -XX:ActiveProcessorCount=16 \
>>   -XX:CICompilerCount=2 -Xms8g -Xmx8g -XX:+AlwaysPreTouch \
>>   -XX:+UseG1GC'
>>
>> The change is the geometric mean of ratios across the 6 pairs of runs.
>>
>> | Benchmark                                            | Change  | 90% CI for the change |
>> | ---------------------------------------------------- | ------- | --------------------- |
>> | org.renaissance.actors.JmhAkkaUct.run                | -0.215% | -2.652% to 1.357%     |
>> | org.renaissance.actors.JmhReactors.run               | -0.166% | -1.974% to 1.775%     |
>> | org.renaissance.jdk.concurrent.JmhFjKmeans.run       | 0.222%  | -0.492% to 0.933%     |
>> | org.renaissance.jdk.concurrent.JmhFutureGenetic.run  | -1.880% | -2.438% to -1.343%    |
>> | org.renaissance.jdk.streams.JmhMnemonics.run         | -0.500% | -1.032% to 0.089%     |
>> | org.renaissance.jdk.streams.JmhParMnemonics.run      | -0.740% | -2.092% to 0.639%     |
>> | org.renaissance.jdk.streams.JmhScrabble.run          | -0.031% | -0.353% to 0.310%     |
>> | org.renaissance.neo4j.JmhNeo4jAnalytics.run          | -0.873% | -2.323% to 0.427%     |
>> | org.renaissance.rx.JmhRxScrabble.run                 | -0.512% | -1.121% to 0.049%     |
>> | org.renaissance.scala.dotty.JmhDotty.run             | -0.219% | -1.108% to 0.708%     |
>> | org.renaissance.scala.sat.JmhScalaDoku.run           | -2.750% | -6.426% to -0.827%    |
>> | org.renaissance.scala.stdlib.JmhScalaKmeans.run      | 0.046%  | -0.383% to 0.408%     |
>> | org.renaissance.scala.stm.JmhPhilosophers.run        | 1.497%  | -0.955% to 3.923%     |
>> | org.renaissance.scala.stm.JmhScalaStmBench7.run ...
>
> On 18/12/2025 13:28, Ruben wrote:
>> The |B.nv| might not be suitable in this case - I believe the branch
>> will be "always taken".
>
> Ah, so it is. That's a shame.
>
>> However, I had considered using |CBNZ XZR, <#imm>|. So far, I avoided
>> implementing it because it is unclear what performance effects this
>> might have.
>
> I've tried it, and it's definitely a lot slower than a NOP or a MOVZ.
> It would be nice if we could persuade the architecture people to give us
> a NOP with payload, perhaps because "Intel has it."
>
> But if we choose the split between CB offset and oopmap index wisely, we
> can avoid using a second instruction most of the time.

Hi @theRealAph,

> But if we choose the split between CB offset and oopmap index wisely, we can
> avoid using a second instruction most of the time.
> Something else to bear in mind is that we don't always have to succeed placing a post-call NOP. They are not used as frequently as they were in the past, in particular because we found a way to save and restore virtual thread stacks without usually having to trace them. In practice this means that we only have to copy the stack memory, so we don't need the post-call NOPs so often.
I believe we can avoid the second instruction with the following approach:

- where possible, fully store cb_offset and oopmap_index and use MOVZ/MOVK
- where not possible, use MOVN and store partial cb_offset and oopmap_index

The partial cb_offset and oopmap_index values will make the search for the respective metadata faster than searching based on the PC value alone.

From an initial experiment, it appears oopmap_index increments about every 8 instructions on average - though this observation might be specific to a certain workload. However, if that's the case, the current implementation allows us to encode post-call NOP metadata for offsets up to 8K. The above approach would be an improvement in this case: the MOVK/MOVZ-based chunks would allow us to encode the metadata for offsets up to 16K, while the MOVN-based chunks would speed up the search for the remaining cases.

What do you think about this option?
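To make the idea more concrete, here is a rough sketch of how the packing and the MOVZ/MOVN selection could look. This is not code from the patch: the 5/11-bit split, the word-sized cb_offset, and the `Payload`/`pack`/`matches_hint` names are assumptions for illustration only; the real field widths would have to be chosen from data such as the oopmap density mentioned above.

```c++
// Sketch only: names, field widths and the MOVZ/MOVN selection below are
// assumptions for discussion, not the actual HotSpot implementation.
#include <cstdint>

// Assumed split of the single 16-bit immediate: the low bits carry
// oopmap_index, the remaining bits carry cb_offset (assumed here to be
// expressed in 4-byte words).
constexpr unsigned kOopmapBits   = 5;
constexpr unsigned kCbOffsetBits = 16 - kOopmapBits;
constexpr uint32_t kOopmapMask   = (1u << kOopmapBits) - 1;
constexpr uint32_t kCbOffsetMask = (1u << kCbOffsetBits) - 1;

struct Payload {
  uint16_t imm16;  // value for the immediate field of the post-call instruction
  bool     exact;  // true  -> emit MOVZ, imm16 holds the full values
                   // false -> emit MOVN, imm16 holds truncated values (search hint)
};

// Pack cb_offset/oopmap_index into one 16-bit payload; fall back to a
// truncated hint when either value does not fit in its field.
inline Payload pack(uint32_t cb_offset_words, uint32_t oopmap_index) {
  bool fits = cb_offset_words <= kCbOffsetMask && oopmap_index <= kOopmapMask;
  uint16_t imm = static_cast<uint16_t>(((cb_offset_words & kCbOffsetMask) << kOopmapBits) |
                                       (oopmap_index & kOopmapMask));
  return {imm, fits};
}

// Decode side: with MOVZ the payload is authoritative; with MOVN it only
// filters the candidate (cb_offset, oopmap_index) pairs that a PC-based
// search would otherwise have to consider one by one.
struct Candidate {
  uint32_t cb_offset_words;
  uint32_t oopmap_index;
};

inline bool matches_hint(const Candidate& c, uint16_t imm16) {
  return (c.cb_offset_words & kCbOffsetMask) == static_cast<uint32_t>(imm16 >> kOopmapBits) &&
         (c.oopmap_index    & kOopmapMask)   == static_cast<uint32_t>(imm16 & kOopmapMask);
}
```

With a split like this, a single MOVZ fully describes call sites whose cb_offset fits in kCbOffsetBits, and everything beyond that still gets a MOVN whose low bits narrow the search instead of falling back to a purely PC-based lookup.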
-------------

PR Comment: https://git.openjdk.org/jdk/pull/28855#issuecomment-3743848408