On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:

>> Please use [this 
>> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the 
>> files changed.
>> 
>> Profile counters scale very badly.
>> 
>> The overhead for profiled code isn't too bad with one thread, but as the 
>> thread count increases, things go wrong very quickly.
>> 
>> For example, here's a benchmark from the OpenJDK test suite, run at 
>> TieredLevel 3 with one thread, then three threads:
>> 
>> 
>> ```
>> Benchmark                        (randomized)  Mode  Cnt    Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4   27.468 ± 2.631  ns/op  (1 thread)
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  240.010 ± 6.329  ns/op  (3 threads)
>> ```
>> 
>> 
>> This slowdown is caused by high memory contention on the profile counters. 
>> Not only is this slow, but it can also lose profile counts.
>> 
>> This patch is for C1 only. It'd be easy to randomize the interpreter's 
>> counters as well in another PR, if anyone thinks it's worth doing.
>> 
>> One other thing to note is that randomized profile counters degrade very 
>> badly with small decimation ratios. For example, using a ratio of 2 
>> (`-XX:ProfileCaptureRatio=2`) with a single thread results in
>> 
>> 
>> ```
>> Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  ns/op
>> ```
>> 
>> 
>> The problem is that the branch prediction rate drops away very badly, 
>> leading to many mispredictions. It only really makes sense to use higher 
>> decimation ratios, e.g. 64.
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix up any out-of-range offsets

**Summary**

This EXPERIMENTAL PR is now ready for review.

It dramatically reduces memory traffic during the C1-compiled profile capture 
phase of HotSpot compilation, at a modest cost in increased C1-generated code 
size. The resulting C2-generated code might be slightly worse in quality, but 
it would take deep analysis to be sure.

**Performance**

I've been testing with baseline, this branch using 
`-XX:ProfileCaptureRatio=1`, and this branch using `-XX:ProfileCaptureRatio=64`.

For high-level tests I've used DaCapo:

`dacapo-23.11-MR2-chopin.jar --iterations 15 jython -s large`

Platform: Apple Mac Mini M1 running Fedora

- Mainline: 100% (baseline)
- With this PR, `ProfileCaptureRatio=1`: 2%-3% faster
- With this PR, `ProfileCaptureRatio=64`: 0.7% slower

I do not understand why this PR improves performance on the jython benchmark 
with `ProfileCaptureRatio=1`. The profile-capture code is almost identical to 
that generated by mainline, and in any case after warmup I believe that almost 
all of the code being executed is C2-compiled.

A slight slowdown with `ProfileCaptureRatio=64` is possible because the 
generated profile counts are more noisy, so C2 has somewhat less reliable data. 
Given the usual noisiness of profile counts I wouldn't expect a huge 
difference, and 0.7% is within the margin of error.

**Code Size**

C1-compiled code is slightly larger because a random number is generated at 
every sample point. This is somewhat mitigated by slightly better code quality 
because sampling is hand-coded assembly rather than generated via C1 LIR, but 
that is a really small difference.

**Random number generation**

Most commonly-used random number generators are too slow for this application. 
Even XorShift, the usual go-to for super-lightweight generators, is too much. 
I've ended up with a simple linear feedback shift register (using a CRC 
instruction) and the smallest decent linear congruential generator, _69069x + 
1_. Unfortunately, it's really hard to know which of these to choose. CRC 
probably wins on high-end modern processors, but on an AMD Threadripper 
launched in 2018 (12nm Zen+) it seems to be faster to use the LCG (i.e. a 
multiply) than CRC. I don't know of any way to find out which will be best on 
any particular
processor without trying it. It's not impossible to run both for a millisecond 
at startup to find out!
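As a sketch only (not the PR's actual code): the two candidate state updates look roughly like this in C. The CRC variant below uses a software bitwise CRC-32C step purely as a stand-in for the hardware instruction (`crc32` on x86, `CRC32CW` on Arm); the function names are hypothetical.

```c
#include <stdint.h>

/* LCG step: the smallest decent multiplier, 69069x + 1 (mod 2^32). */
static inline uint32_t lcg_next(uint32_t x) {
    return 69069u * x + 1u;
}

/* LFSR-style step via one byte of CRC-32C. Real code would use the
 * hardware CRC instruction; this bitwise loop just illustrates the
 * state transition (0x82F63B78 is the reflected CRC-32C polynomial). */
static inline uint32_t crc32c_next(uint32_t x) {
    x ^= 1u;                                       /* fold in a constant data byte */
    for (int i = 0; i < 8; i++)
        x = (x >> 1) ^ (0x82F63B78u & (0u - (x & 1u)));
    return x;
}
```

Either function advances a per-thread seed in a register or two; as noted above, which one is faster depends on the microarchitecture.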

**ProfileCaptureRatio selection**

Low profile capture ratios, where counter updates are frequent, destroy 
performance because of constant branch mispredictions. 64 seems to be a 
reasonable compromise between profile data accuracy and profiling overhead.
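To illustrate how the decimated counting behaves (my own sketch, not the generated code): each sample point advances a thread-local PRNG and only touches the shared counter when a ratio-sized window of the state is zero, bumping it by the ratio so the expected count stays unbiased. That rarely-taken branch is exactly what mispredicts when the ratio is small.

```c
#include <stdint.h>

#define CAPTURE_RATIO 64  /* as with -XX:ProfileCaptureRatio=64 */

static uint32_t seed = 12345u;  /* per-thread in real code */

/* Randomized decimated increment: on average, update the shared
 * counter once per CAPTURE_RATIO calls, adding CAPTURE_RATIO so
 * the expected value matches an exact count. */
static void profile_hit(uint32_t *counter) {
    seed = 69069u * seed + 1u;                 /* LCG step */
    /* Test high-order bits: the low bits of an LCG are too regular. */
    if (((seed >> 25) & (CAPTURE_RATIO - 1)) == 0)
        *counter += CAPTURE_RATIO;             /* rarely-taken branch */
}
```

With a ratio of 2 the branch is taken half the time and is essentially unpredictable, which matches the 80 ns/op result above; at 64 it is almost always not-taken and predicts nearly perfectly.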

**Supported Architectures**

I've done 32- and 64-bit Arm and x86.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28541#issuecomment-4120264165
