On Mon, 30 Mar 2026 13:58:02 GMT, Aleksey Shipilev <[email protected]> wrote:

> OK, so we reserve `r26` or `r27` for RNG counter, right?
> 
> Is this a good trade for level=1 C1 code that users run with 
> `-XX:TieredStopAtLevel=1` for the sake of startup performance?

No. I'll fix it.

> Why can't / shouldn't we use the PRNG state straight in `JavaThread`? That 
> would also obviate the need to save/restore this register, thus simplifying 
> the machinery and avoiding subtle bugs.

The random generator is _extremely_ hot, probably(?) more so than any local 
variable. 

(Re AArch64, a confession: when I wrote the C1 port I forgot to allocate any 
upper registers. No one noticed until Dmitry Chyuko fixed it in February 2015.) 
I guess 16 regs is usually enough.

> It is even worse on x86 that does not have too many registers to begin with. 
> I wonder if there is a way to sense on this level if we are compiling for 
> tier=2,3 or tier=1, and only reserve on tier=2,3?

I'll check that this works correctly.

> If we are going to reserve more registers, maybe start writing up a release
> note with possible caveats.

There is one other thing I could do: use a vector register instead of a core 
register. Vector registers tend to be used far less than core registers, so 
(arch-dependent) this could be a win. I'll try this on x86.

> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1316:
> 
>> 1314:       = __ form_address(rscratch2, mdo,
>> 1315:                         md->byte_offset_of_slot(data, DataLayout::flags_offset()),
>> 1316:                         LogBytesPerWord);
> 
> Oh, OK, cute. So this encodes the address more efficiently, knowing the 
> address is in the units of heap words? Can be upstreamed separately then?

Seems like make-work, but I guess so.
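
For context, a rough sketch of why passing the access size can help. It assumes the last argument names the log2 of the access size, and it relies on the AArch64 load/store forms taking either a 12-bit unsigned immediate scaled by the access size or a 9-bit signed unscaled immediate. The function names here are hypothetical; this is a toy model, not the `form_address` implementation.

```cpp
#include <cassert>
#include <cstdint>

// Scaled form: offset must be non-negative, a multiple of the access
// size (1 << shift), and at most 4095 units of that size.
bool fits_scaled_unsigned(int64_t byte_offset, unsigned shift) {
  const int64_t mask = (int64_t{1} << shift) - 1;
  return byte_offset >= 0 &&
         (byte_offset & mask) == 0 &&
         (byte_offset >> shift) < 4096;
}

// Unscaled form: any byte offset in [-256, 255].
bool fits_unscaled_signed(int64_t byte_offset) {
  return byte_offset >= -256 && byte_offset <= 255;
}

// True if base + byte_offset can be addressed by a single load/store
// without first materialising the address in a scratch register.
bool fits_in_one_access(int64_t byte_offset, unsigned shift) {
  return fits_scaled_unsigned(byte_offset, shift) ||
         fits_unscaled_signed(byte_offset);
}
```

With a shift of 3 (8-byte words) the reachable range grows from 4095 to 32760 bytes, so more slots can be addressed without the extra instruction.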

> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1987:
> 
>> 1985: 
>> 1986: void LIR_Assembler::align_call(LIR_Code code) {
>> 1987:   __ save_profile_rng();
> 
> Looks misplaced. Do you want to do it in `LIR_Assembler::call`, or something 
> gets in the way?

Something does indeed. The size of the call is known and fixed, so we need to 
get anything variably sized in first.
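
A toy model of the ordering constraint, for illustration only. `ToyBuffer`, `kCallSize` and `emit_aligned_call` are hypothetical names, not HotSpot code, and the 4-byte call size is an assumption. The point is that once padding has been computed, only fixed-size code may follow, so anything whose length varies has to be emitted before the alignment step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a code buffer.
struct ToyBuffer {
  std::vector<uint8_t> bytes;
  size_t offset() const { return bytes.size(); }
  void emit(size_t n, uint8_t fill) { bytes.insert(bytes.end(), n, fill); }
  // Pad until the current offset is a multiple of `alignment`.
  void align(size_t alignment) {
    while (offset() % alignment != 0) emit(1, 0x00);
  }
};

constexpr size_t kCallSize = 4;  // assumed fixed size of the call sequence

// Emits variably sized code, then padding, then the fixed-size call.
// Returns the offset at which the call starts.
size_t emit_aligned_call(ToyBuffer& buf, size_t variable_size,
                         size_t alignment) {
  buf.emit(variable_size, 0xAA);  // variably sized code goes in first
  buf.align(alignment);           // padding depends on everything before it
  const size_t call_start = buf.offset();
  buf.emit(kCallSize, 0xCC);      // nothing of unknown size after this point
  assert(buf.offset() - call_start == kCallSize);
  return call_start;
}
```

If the variable part were emitted after `align`, the call would land at an offset the padding never accounted for.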

> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 2828:
> 
>> 2826:   if (value < 0)  { incrementw(reg, -value);      return; }
>> 2827:   if (value == 0) {                               return; }
>> 2828:   if (value < (1 << 24)) { subw(reg, reg, value); return; }
> 
> Oh, another nice micro-optimization, I see? More stuff to upstream separately?

Eh, I guess so.
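
For readers wondering about the `1 << 24` bound: AArch64 add/sub immediates are 12 bits wide, optionally shifted left by 12, so any value below 2^24 can be subtracted with at most two instructions and no scratch register. A minimal sketch of that split follows; `SubImm`, `split_sub_immediate` and `apply` are hypothetical names, and this is a toy model rather than the MacroAssembler code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models one "sub Wd, Wn, #imm12 {, lsl #12}" instruction.
struct SubImm {
  uint32_t imm12;  // 0..4095
  bool shifted;    // true if the immediate is shifted left by 12
};

// Split an immediate below 2^24 into at most two encodable pieces.
std::vector<SubImm> split_sub_immediate(uint32_t value) {
  assert(value < (1u << 24));
  std::vector<SubImm> out;
  const uint32_t hi = value >> 12;    // upper 12 bits
  const uint32_t lo = value & 0xFFF;  // lower 12 bits
  if (hi != 0) out.push_back({hi, true});
  if (lo != 0) out.push_back({lo, false});
  return out;
}

// Apply the pieces to a 32-bit value, wrapping like a 32-bit subtract.
uint32_t apply(uint32_t reg, const std::vector<SubImm>& insns) {
  for (const SubImm& i : insns) {
    reg -= i.shifted ? (i.imm12 << 12) : i.imm12;
  }
  return reg;
}
```

Values that fit in one of the two 12-bit fields need a single instruction; everything else under 2^24 needs two.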

> src/hotspot/cpu/arm/c1_LIRGenerator_arm.cpp line 778:
> 
>> 776:       // Call out to runtime because we don't have enough registers to
>> 777:       // expand compareAndSet(long) inline.
>> 778:       arm_cas_long(addr, cmp_value, new_value, result);
> 
> Um. Is this still true at the end? How does this manifest? C1 regalloc cannot 
> give you enough registers when you are emitting the profiling counter update, 
> or?

`arm_cas_long` expanded inline uses every register we have. Regalloc fails.

I timed the CAS and this callout takes it from 17ns to 19ns on Graviton 3. I 
think this is in the noise.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010764702
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010773441
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010778848
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010784713
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010797445
