On Tue, 6 Aug 2024 17:26:55 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

> As discussed in the JBS issue:
> 
> FFM upcall stubs embed a `Method*` of the target method in the stub. This 
> `Method*` is read from the `LambdaForm::vmentry` field associated with the 
> target method handle at the time when the upcall stub is generated. The MH 
> instance itself is stashed in a global JNI ref. So, there should be a 
> reachability chain to the holder class: `MH (receiver) -> LF (form) -> 
> MemberName (vmentry) -> ResolvedMethodName (method) -> Class<?> (vmholder)`
> 
> However, it appears that, due to multiple threads racing to initialize the 
> `vmentry` field of the `LambdaForm` of the target method handle of an upcall 
> stub, it is possible that the `vmentry` is updated _after_ we embed the 
> corresponding `Method`* into an upcall stub (or rather, the latest update is 
> not visible to the thread generating the upcall stub). Technically, it is 
> fine to keep using a 'stale' `vmentry`, but the problem is that now the 
> reachability chain is broken, since the upcall stub only extracts the target 
> `Method*`, and doesn't keep the stale `vmentry` reachable. The holder class 
> can then be unloaded, resulting in a crash.
> 
> The fix I've chosen for this is to mimic what we already do in 
> `MethodHandles::jump_to_lambda_form`, and re-load the `vmentry` field from 
> the target method handle each time. Luckily, this does not really seem to 
> impact performance.
> 
> <details>
> <summary>Performance numbers</summary>
> x64:
> 
> before:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  69.216 ± 1.791  ns/op
> 
> 
> after:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  67.787 ± 0.684  ns/op
> 
> 
> aarch64:
> 
> before:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  61.574 ± 0.801  ns/op
> 
> 
> after:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  61.218 ± 0.554  ns/op
> 
> </details>
> 
> As for the added TestUpcallStress test, it takes about 800 seconds to run 
> this test on the dev machine I'm using, so I've set the timeout quite high. 
> Since it runs for so long, I've dropped it from the default `jdk_foreign` 
> test suite, which runs in tier2. Instead the new test will run in tier4.
> 
> Testing: tier 1-4

Can't we do these nasty loads in C++ code and use `set_vm_result_2` in 
`UpcallLinker::on_entry`?
The GC barriers can generate excessive amounts of code with some GCs. I guess 
that upcalls are less performance critical, so I'd prefer the other solution. 
Maybe the C++ code can get optimized better, too.

Some of the `DecoratorSet` should be applicable and improve performance. If 
that doesn't help enough, maybe we should implement a dedicated static stub? 
There's no need to have the code replicated in each upcall stub.

Also note that each `load_heap_oop` may save and restore registers which is 
actually only needed once.

Regarding PPC64, I think that we could avoid PRESERVATION_FRAME_LR_GP_FP_REGS 
if we rearrange it such that the `load_heap_oop` is done at a place where the 
volatile regs are not live. But seems like this optimization is not available 
for other platforms.

Some performance related remarks:

- You could use `resolve_global_jobject` which is shorter and faster and exists 
on most platforms.
- Using `vm_result_2` is no longer needed. The Method* can be directly passed 
in the method reg (or loaded from `callee_target`).
- If you call the stub from assembler instead of from C++ you should be able to 
save some extra stuff like the frame.

I'll check the PPC64 code later.

@offamitkumar: Can you take a look at the s390 code, please? The cross build 
has failed. For the future: You may want to implement `resolve_global_jobject` 
which is shorter and faster and available on the other platforms.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 4778:

> 4776:     StubCodeMark mark(this, "StubRoutines", "upcall_stub_load_target");
> 4777:     address start = __ pc();
> 4778:     __ save_LR_CR(R0);

I think `save_LR_CR` and `restore_LR_CR` should get removed, too. CR is not 
live and LR is preserved everywhere below. But, I'll check this later.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2275524582
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2275707529
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2278240985
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2325103457
PR Review Comment: https://git.openjdk.org/jdk/pull/20479#discussion_r1713844568

Reply via email to