On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:
>> Please use [this >> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the >> files changed. >> >> Profile counters scale very badly. >> >> The overhead for profiled code isn't too bad with one thread, but as the >> thread count increases, things go wrong very quickly. >> >> For example, here's a benchmark from the OpenJDK test suite, run at >> TieredLevel 3 with one thread, then three threads: >> >> >> Benchmark (randomized) Mode Cnt Score Error Units >> InterfaceCalls.test2ndInt5Types false avgt 4 27.468 ± 2.631 ns/op >> InterfaceCalls.test2ndInt5Types false avgt 4 240.010 ± 6.329 ns/op >> >> >> This slowdown is caused by high memory contention on the profile counters. >> Not only is this slow, but it can also lose profile counts. >> >> This patch is for C1 only. It'd be easy to randomize C1 counters as well in >> another PR, if anyone thinks it's worth doing. >> >> One other thing to note is that randomized profile counters degrade very >> badly with small decimation ratios. For example, using a ratio of 2 with >> `-XX:ProfileCaptureRatio=2` with a single thread results in >> >> >> Benchmark (randomized) Mode Cnt Score Error >> Units >> InterfaceCalls.test2ndInt5Types false avgt 4 80.147 ± 9.991 >> ns/op >> >> >> The problem is that the branch prediction rate drops away very badly, >> leading to many mispredictions. It only really makes sense to use higher >> decimation ratios, e.g. 64. > > Andrew Haley has updated the pull request incrementally with one additional > commit since the last revision: > > Fix up any out-of-range offsets Had some cycles to look at it, while the contention with other work is not high on Monday morning. (Clunky pun intended.) Impressive work! make/Hsdis.gmk line 47: > 45: CAPSTONE_ARCH := CS_ARCH_$(CAPSTONE_ARCH_AARCH64_NAME) > 46: CAPSTONE_MODE := CS_MODE_ARM > 47: else ifeq ($(call isTargetCpuArch, arm), true) Looks not relevant for this PR, upstream it separately? src/hotspot/cpu/aarch64/c1_FrameMap_aarch64.cpp line 205: > 203: if (ProfileCaptureRatio > 1) { > 204: // Use the highest remaining register for r_profile_rng. > 205: r_profile_rng = *remaining.rbegin(); OK, so we reserve `r26` or `r27` for RNG counter, right? Is this a good trade for level=1 C1 code that users run with `-XX:TieredStopAtLevel=1` for that sake of startup performance? Why can't / shouldn't we use the PRNG state straight in `JavaThread`? That would also obviate the need to save/restore this register, thus simplifying the machinery and avoiding subtle bugs. It is even worse on x86 that does not have too many registers to begin with. I wonder if there is a way to sense on this level if we are compiling for tier=2,3 or tier=1, and only reserve on tier=2,3? If we going to reserve more registers, maybe start writing up release note with possible caveats. src/hotspot/cpu/aarch64/c1_FrameMap_aarch64.hpp line 150: > 148: } > 149: > 150: // Use r26 for randomized profile captures. ...or `r27`? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1235: > 1233: assert(masm()->is_C1_MacroAssembler(), "must be"); > 1234: > 1235: Label nope; Style: `L_nope`. src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1271: > 1269: int profile_capture_ratio = ProfileCaptureRatio; > 1270: int ratio_shift = exact_log2(profile_capture_ratio); > 1271: auto threshold = (1ull << 32) >> ratio_shift; Is this just `auto threshold = 1 << (32 - ratio_shift)`? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1316: > 1314: = __ form_address(rscratch2, mdo, > 1315: md->byte_offset_of_slot(data, > DataLayout::flags_offset()), > 1316: LogBytesPerWord); Oh, OK, cute. So this encodes the address more efficiently, knowing the address is in the units of heap words? Can be upstreamed separately then? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1987: > 1985: > 1986: void LIR_Assembler::align_call(LIR_Code code) { > 1987: __ save_profile_rng(); Looks misplaced. Do you want to do it in `LIR_Assembler::call`, or something gets in the way? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 2563: > 2561: __ adjust_mdo_address(&counter_address, dest_opr->type()); > 2562: } > 2563: if (step->is_register()) { I remember looking at this before. Do we even have the cases where step is not a constant? Might simplify some code... src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp line 287: > 285: void C1_MacroAssembler::step_random(Register state, Register temp, > Register data) { > 286: if (VM_Version::supports_crc32()) { > 287: /* CRC used as a psuedo-random-number generator */ "pseudo" src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 2828: > 2826: if (value < 0) { incrementw(reg, -value); return; } > 2827: if (value == 0) { return; } > 2828: if (value < (1 << 24)) { subw(reg, reg, value); return; } Oh, another nice micro-optimization, I see? More stuff to upstream separately? src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2527: > 2525: break; > 2526: } > 2527: // On 32-bit platforms, 64-bit profile counters are never used Assert this and remove the commented code. src/hotspot/cpu/arm/c1_LIRGenerator_arm.cpp line 778: > 776: // Call out to runtime because we don't have enough registers to > 777: // expand compareAndSet(long) inline. > 778: arm_cas_long(addr, cmp_value, new_value, result); Um. Is this still true at the end? How does this manifest? C1 regalloc cannot give you enough registers when you are emitting the profiling counter update, or? src/hotspot/cpu/arm/c1_LIRGenerator_arm.cpp line 1307: > 1305: __ cmp(lir_cond(cond), left, right); > 1306: profile_branch(x, cond); > 1307: // If we're subsampling counter updates, then profiling code kills > flags And are we sure on AArch64 it does not? I don't see a similar change there. I see at least this in `LIR_Assembler::increment_profile_ctr` on AArch64: if (!step->is_constant()) { // If step is 0, make sure the stub check below always fails __ cmp(step->as_register(), (u1)0); __ mov(rscratch1, InvocationCounter::count_increment * ProfileCaptureRatio); __ csel(dest, dest, rscratch1, __ NE); } Also ugh, doing two tests just to restore the flags. But that one seems unavoidable. src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 256: > 254: > 255: void C1_MacroAssembler::step_random(Register state, Register temp, > Register data) { > 256: // if (VM_Version::supports_crc32()) { Can be just deleted, IMO. src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 1285: > 1283: auto threshold = (1ull << 32) >> ratio_shift; > 1284: __ cmpl(r_profile_rng, threshold); > 1285: __ jcc(Assembler::aboveEqual, nope); `jccb` src/hotspot/share/c1/c1_Compilation.cpp line 366: > 364: } > 365: > 366: THREAD_LOCAL const char *compilation_name; Debugging leftover? src/hotspot/share/c1/c1_LIR.cpp line 1082: > 1080: (_step, _result, _freq_op, > 1081: _md_reg, _md_op, _md_offset_op, _overflow_stub); > 1082: if (overflow_stub()) { `overflow_stub() != nullptr` src/hotspot/share/compiler/compiler_globals.hpp line 391: > 389: "information at the bailout point") > \ > 390: > \ > 391: product(int, ProfileCaptureRatio, 1, EXPERIMENTAL, > \ Put it at `64`, if you believe this is a right tradeoff. src/hotspot/share/runtime/javaThread.cpp line 546: > 544: do { > 545: state = os::random(); > 546: } while (state == 0); Curious: why should not the initial state be zero? ------------- PR Review: https://git.openjdk.org/jdk/pull/28541#pullrequestreview-4030481858 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009725177 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009990535 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009993506 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010000039 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010014074 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010038774 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010051720 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010062967 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010088991 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010116717 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010122301 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010159281 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010194360 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010200886 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010219672 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010251921 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010260753 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010283586 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010294250
