On Thu, 4 Jun 2026 16:46:22 GMT, Andrew Haley <[email protected]> wrote:
>> Please use [this >> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the >> files changed. >> >> Profile counters scale very badly. >> >> The overhead for profiled code isn't too bad with one thread, but as the >> thread count increases, things go wrong very quickly. >> >> For example, here's a benchmark from the OpenJDK test suite, run at >> TieredLevel 3 with one thread, then three threads: >> >> >> Benchmark (randomized) Mode Cnt Score Error Units >> InterfaceCalls.test2ndInt5Types false avgt 4 27.468 ± 2.631 ns/op >> InterfaceCalls.test2ndInt5Types false avgt 4 240.010 ± 6.329 ns/op >> >> >> This slowdown is caused by high memory contention on the profile counters. >> Not only is this slow, but it can also lose profile counts. >> >> This patch is for C1 only. It'd be easy to randomize C1 counters as well in >> another PR, if anyone thinks it's worth doing. >> >> One other thing to note is that randomized profile counters degrade very >> badly with small decimation ratios. For example, using a ratio of 2 with >> `-XX:ProfileCaptureRatio=2` with a single thread results in >> >> >> Benchmark (randomized) Mode Cnt Score Error >> Units >> InterfaceCalls.test2ndInt5Types false avgt 4 80.147 ± 9.991 >> ns/op >> >> >> The problem is that the branch prediction rate drops away very badly, >> leading to many mispredictions. It only really makes sense to use higher >> decimation ratios, e.g. 64. >> >> --------- >> - [x] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Andrew Haley has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 128 commits: > > - Merge https://github.com/openjdk/jdk into JDK-8134940 > - Review comments > - Review comments > - Merge branch 'JDK-8134940' of https://github.com/theRealAph/jdk into > JDK-8134940 > - Merge remote-tracking branch 'refs/remotes/origin/JDK-8134940' into > JDK-8134940 > - Review comments > - Review comments > - Remove misleading comment > - Merge remote-tracking branch 'refs/remotes/origin/JDK-8134940' into > JDK-8134940 > - Add macros > - ... and 118 more: https://git.openjdk.org/jdk/compare/ca52afa3...f9440fbb > Aleksey, you've been brilliant. I can't thank you enough. Thank you, but to be clear, I am able to do this because I am backed by the substantial amount of compute we can throw to automatic analyzers :) 90% of my work here is working out what those tools are trying to tell me. Another round of analysis/comments: src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 463: > 461: Address(address target, relocInfo::relocType rtype = > relocInfo::external_word_type); > 462: > 463: Address(Register base, RegisterOrConstant index, extend ext = lsl(0)) { OK, but do you need to do the same at the sibling constructor? Address(Register r, Register r1, extend ext = lsl()) : Maybe do this in separate PR, as it might affect (or fix bugs for) others? src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 2253: > 2251: // Finally, update the counter > 2252: bind(L_count_update); > 2253: inc(this, Address(mdp, offset), DataLayout::counter_increment); You are going to love this. Here is an interesting sequence of bad things. 1. `offset` is `rscratch2` in this method. 2. `inc` is `increment_mdo`, we call there; it does `increment(dst, src << ratio_shift);` 3. `increment` is now `MacroAssembler::increment(Address dst, int value)`. It prepares for `MA::increment(Register, int)`: void MacroAssembler::increment(Address dst, int value) { ... ldr(rscratch1, dst); increment(rscratch1, value); // <--- calling here str(rscratch1, dst); } 4. When `value` is large, and AFAICS in can be large with large `ProfileCaptureRation`, that increment has a branch that clobbers `rscratch2`: void MacroAssembler::increment(Register reg, int value) { if (value < 0) { decrement(reg, -value); return; } if (value == 0) { return; } if (value < (1 << 12)) { add(reg, reg, value); return; } /* else */ { assert(reg != rscratch2, "invalid dst for register increment"); movw(rscratch2, (unsigned)value); // <--- clobbers rscratch2, I am sure it is fine... add(reg, reg, rscratch2); } } 5. We return to `MacroAssembler::increment(Address dst, int value)`: void MacroAssembler::increment(Address dst, int value) { ... ldr(rscratch1, dst); increment(rscratch1, value); // <--- returned from here str(rscratch1, dst); // dst is Address(mdo, rscratch2), but rscratch2 IS CLOBBERED // by increment() above, we do the store to the WRONG ADDRESS, // AWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW } src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2507: > 2505: auto offset = md_offset_opr->as_constant_ptr()->as_jint(); > 2506: bool is_memoryI = offset < 4096 && offset > -4096; > 2507: // bool is_memoryI = offset < 40 && offset > -40; Stale? src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2908: > 2906: __ cmpl(step_opr->as_register(), 0); > 2907: __ movl(step_opr->as_register(), > InvocationCounter::count_increment * ProfileCaptureRatio); > 2908: __ cmovl(Assembler::notEqual, dest, step_opr->as_register()); Hold on. The intent is to max out `dest` when `step_opr == 0`, right? But as it is written now, this thing sets `dest = MAX` when `step_opr != 0`? Shouldn't it be `Assembler::equal`? Might be an issue on other platforms as well? src/hotspot/share/c1/c1_CodeStubs.hpp line 65: > 63: #endif > 64: Address as_Address(LIR_Assembler* ce, LIR_Address* addr, Register tmp); > 65: Address as_Address(LIR_Assembler* ce, LIR_Address* addr); Looks dead, unless I am missing something. src/hotspot/share/c1/c1_LIR.cpp line 1085: > 1083: masm->append_code_stub(overflow_stub()); > 1084: } > 1085: #endif Are we sure we are not emitting `LIR_OpIncrementCounter` for platforms where `RANDOMIZED_PROFILE_CAPTURE` is `false`? Can we at least assert that on `#else` branch here? src/hotspot/share/c1/c1_LIRGenerator.cpp line 969: > 967: LIR_Opr tmp = new_register(T_INT); > 968: LIR_Opr step = LIR_OprFact::intConst(DataLayout::counter_increment); > 969: __ increment_counter(step, tmp, md_reg, md->constant_encoding(), > data_offset_reg); OK, so this covers C1 profiling path. The interpreter profiling path (`InterpreterMacroAssembler::profile_taken_branch`) is still not covered? So if we are mixing the subsampled counters from C1 and raw counters from intepreter, does that skew the profiling results? In the long run this is probably not a problem? But I am not sure we are actually profiling any particular bci for long enough to mitigate this. What I am concerned about is the profile inversion. If there is a hot branch, it would probably progress to C1 profiling, where it would be subsampled, and would add up to profile only after some time. But the _cold_ branch that is running in intepreter would show up in profile right away. So there is time window where cold branch is over-represented over hot branch in profile. With large `ProfileCaptureRatio` this window can be uncomfortably large and fairly close to triggering the compilation with skewed profile? It is not much of the problem with receiver type profiling, where interpreter and C1 are on the same subsampling footing. src/hotspot/share/c1/c1_LIRGenerator.cpp line 3213: > 3211: } > 3212: > 3213: LIR_Opr result = notify ? new_register(T_INT) : > LIR_OprFact::intConst(0); Looks like `increment_profile_ctr` expects `result` to be the register: Register dest = as_reg(dest_opr); ...so passing `intConst(0)` would fail. src/hotspot/share/runtime/javaThread.cpp line 414: > 412: _om_cache(this), > 413: > 414: _profile_rng(-1) { `profile_rng` is `uint32_t`; so maybe storing signed value there is not great? ------------- PR Review: https://git.openjdk.org/jdk/pull/28541#pullrequestreview-4434131960 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361269439 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361071628 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361272853 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361148114 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361240088 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361165689 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361236733 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361257617 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3361281553
