On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:

>> Please use [this 
>> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the 
>> files changed.
>> 
>> Profile counters scale very badly.
>> 
>> The overhead for profiled code isn't too bad with one thread, but as the 
>> thread count increases, things go wrong very quickly.
>> 
>> For example, here's a benchmark from the OpenJDK test suite, run at 
>> TieredLevel 3 with one thread, then three threads:
>> 
>> 
>> Benchmark                        (randomized)  Mode  Cnt    Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4   27.468 ± 2.631  ns/op
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  240.010 ± 6.329  ns/op
>> 
>> 
>> This slowdown is caused by high memory contention on the profile counters. 
>> Not only is this slow, but it can also lose profile counts.
>> 
>> This patch is for C1 only. It'd be easy to randomize C2 counters as well in 
>> another PR, if anyone thinks it's worth doing.
>> 
>> One other thing to note is that randomized profile counters degrade very 
>> badly with small decimation ratios. For example, running with 
>> `-XX:ProfileCaptureRatio=2` (a ratio of 2) on a single thread results in
>> 
>> 
>> Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  ns/op
>> 
>> 
>> The problem is that the guard branch becomes much harder to predict, 
>> leading to many mispredictions. It only really makes sense to use higher 
>> decimation ratios, e.g. 64.
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix up any out-of-range offsets

I tried using a vector register instead of a core register to hold the random 
state, but I don't think there's any way to get it to perform well.

This PR generates guards that look like this:

            0x00007f230caca146:   cmp    r14d,0x4000000
            0x00007f230caca14d:   jb     0x00007f230caca27a
            0x00007f230caca153:   mov    edx,0x10dcd
            0x00007f230caca158:   imul   r14d,edx


Using a vector register looks like this:


   0.06%    0x00007f8cd071430b:   vmovd  r8d,xmm15
   0.14%    0x00007f8cd0714310:   vpmulld xmm15,xmm15,xmm14
   2.29%    0x00007f8cd0714315:   cmp    r8d,0x4000000
   0.83%    0x00007f8cd071431c:   jb     0x00007f8cd0714723


This takes 50% longer on the `InterfaceCalls.test1stInt2Types` benchmark.

I think it's the latency of the move from the vector register to a core 
register (needed for the test and branch) that's killing the performance. So 
I'm leaving this idea here for the record.

As discussed, keeping the random state in memory would be an alternative, but 
the performance would be at least as bad and probably worse.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28541#issuecomment-4178666907
