On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:

>> Please use [this
>> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the
>> files changed.
>>
>> Profile counters scale very badly.
>>
>> The overhead for profiled code isn't too bad with one thread, but as the
>> thread count increases, things go wrong very quickly.
>>
>> For example, here's a benchmark from the OpenJDK test suite, run at
>> TieredLevel 3 with one thread, then three threads:
>>
>>
>> Benchmark                        (randomized)  Mode  Cnt    Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4   27.468 ± 2.631  ns/op
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  240.010 ± 6.329  ns/op
>>
>>
>> This slowdown is caused by high memory contention on the profile counters.
>> Not only is this slow, but it can also lose profile counts.
>>
>> This patch is for C1 only. It'd be easy to randomize C1 counters as well in
>> another PR, if anyone thinks it's worth doing.
>>
>> One other thing to note is that randomized profile counters degrade very
>> badly with small decimation ratios. For example, using a ratio of 2 with
>> `-XX:ProfileCaptureRatio=2` with a single thread results in
>>
>>
>> Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  ns/op
>>
>>
>> The problem is that the branch prediction rate drops away very badly,
>> leading to many mispredictions. It only really makes sense to use higher
>> decimation ratios, e.g. 64.
>
> Andrew Haley has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Fix up any out-of-range offsets

I tried using a vector register instead of a core register to hold Random state, but I don't think there's any way to get it to perform well.

This PR generates guards that look like this:

    0x00007f230caca146:   cmp    r14d,0x4000000
    0x00007f230caca14d:   jb     0x00007f230caca27a
    0x00007f230caca153:   mov    edx,0x10dcd
    0x00007f230caca158:   imul   r14d,edx

Using a vector register looks like this:

    0.06%    0x00007f8cd071430b:   vmovd  r8d,xmm15
    0.14%    0x00007f8cd0714310:   vpmulld xmm15,xmm15,xmm14
    2.29%    0x00007f8cd0714315:   cmp    r8d,0x4000000
    0.83%    0x00007f8cd071431c:   jb     0x00007f8cd0714723

This takes 50% longer on the `InterfaceCalls.test1stInt2Types` benchmark. I think it's the latency of the move from vector to core (which we have to do for the test and branch) that's killing the performance.

So I'm leaving this idea here for the record. As discussed, keeping the random state in memory would be an alternative, but the performance would be at least as bad and probably worse.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28541#issuecomment-4178666907
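
For anyone reading the core-register guard without the patch to hand, here is a rough Java model of what the four instructions appear to do. It is only a sketch, not the code in the PR. The multiplier `0x10dcd` and threshold `0x4000000` come from the disassembly; the class name `DecimatedCounter`, the odd seed, the choice to add the ratio on each sampled event, and the assumption that the state is stepped on both paths are all mine. The threshold is 2^26, so an unsigned 32-bit state falls below it about once in 2^32 / 2^26 = 64 events, which would match a decimation ratio of 64.

```java
/** Hypothetical model of a randomly decimated profile counter (not HotSpot code). */
final class DecimatedCounter {
    // Both constants are taken from the disassembly of the guard.
    static final int MULTIPLIER = 0x10dcd;   // 69069
    static final int THRESHOLD = 0x4000000;  // 2^26

    // 2^32 / 2^26 = 64: roughly one event in 64 passes the guard.
    static final long RATIO = (1L << 32) / THRESHOLD;

    private int state;      // plays the role of r14d
    private long estimate;  // stands in for the shared profile counter

    DecimatedCounter(int seed) {
        // Assumption: keep the state odd so repeated multiplication
        // modulo 2^32 can never collapse to zero.
        this.state = seed | 1;
    }

    /** Record one profiled event. */
    void profile() {
        // cmp r14d,0x4000000 ; jb slow_path  (unsigned compare)
        if (Integer.compareUnsigned(state, THRESHOLD) < 0) {
            // Assumption: the slow path adds RATIO so that the counter
            // still approximates the true number of events.
            estimate += RATIO;
        }
        // mov edx,0x10dcd ; imul r14d,edx  (wraps modulo 2^32)
        // Assumption: the state is advanced whether or not the slow path ran.
        state *= MULTIPLIER;
    }

    long estimate() {
        return estimate;
    }

    public static void main(String[] args) {
        DecimatedCounter counter = new DecimatedCounter(12345);
        int events = 1_000_000;
        for (int i = 0; i < events; i++) {
            counter.profile();
        }
        System.out.println("actual=" + events + " estimated=" + counter.estimate());
    }
}
```

In this model the common path touches only the register-resident state, so the shared counter is written on about 1 event in 64. The vector-register variant has to move the state into a core register before it can do the same compare and branch.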
