On Mon, 31 Jul 2023 12:22:00 GMT, Yasumasa Suenaga <ysuen...@openjdk.org> wrote:
> In FFM, native function would be called via `nep_invoker_blob`. If the
> function has two arguments, it would be following:
>
> Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
> --------------------------------------------------------------------------------
>   0x00007fcae394cd80:   pushq   %rbp
>   0x00007fcae394cd81:   movq    %rsp, %rbp
>   0x00007fcae394cd84:   subq    $0, %rsp
>  ;; { argument shuffle
>   0x00007fcae394cd88:   movq    %r8, %rax
>   0x00007fcae394cd8b:   movq    %rsi, %r10
>   0x00007fcae394cd8e:   movq    %rcx, %rsi
>   0x00007fcae394cd91:   movq    %rdx, %rdi
>  ;; } argument shuffle
>   0x00007fcae394cd94:   callq   *%r10
>   0x00007fcae394cd97:   leave
>   0x00007fcae394cd98:   retq
>
> `subq $0, %rsp` is for shadow space on stack, and `movq %r8, %rax` is number
> of args for variadic function. So they are not necessary in some case. They
> should be remove following if they are not needed:
>
> Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
> --------------------------------------------------------------------------------
>   0x00007fd8778e2880:   pushq   %rbp
>   0x00007fd8778e2881:   movq    %rsp, %rbp
>  ;; { argument shuffle
>   0x00007fd8778e2884:   movq    %rsi, %r10
>   0x00007fd8778e2887:   movq    %rcx, %rsi
>   0x00007fd8778e288a:   movq    %rdx, %rdi
>  ;; } argument shuffle
>   0x00007fd8778e288d:   callq   *%r10
>   0x00007fd8778e2890:   leave
>   0x00007fd8778e2891:   retq
>
> All java/foreign jtreg tests are passed.
>
> We can see these stub code on [ffmasm testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/examples/cpumodel)
> with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode` and hsdis library.
> This testcase linked the code with `Linker.Option.isTrivial()`.
>
> After this change, FFM performance on [another ffmasm testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/benchmarks/funccall)
> was improved:
>
> before:
>
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
> FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s
>
> after:
>
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
> FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s
>
> Environment:
> * CPU: AMD Ryzen 3 3300X
> * OS: Fedora 38 x86_64 (Kernel 6.3.8-200.fc38.x86_64)
> * Hyper-V 4vCPU, 8GB mem

FWIW, if you want to look into reducing the generated code further, I think we can potentially reduce the amount of shuffling between registers that's needed by reordering the arguments on the Java side so that each VMStorage corresponding to an argument of the leaf method handle is the same as the register for that argument in the Java calling convention.

I think the right place to do this is in DowncallLinker where we are creating the NativeEntryPoint. The way I think it should work:

1. compute the Java calling convention's argument registers for the leaf method type.
2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered method type, such that the VMStorage/type for a particular argument index matches the register for the same index used in the Java calling convention as much as possible.
3. use these 2 to create the native entry point + native method handle
4. apply the same reordering to the created native method handle (using MethodHandles::permuteArguments) so that the resulting method handle has the original argument order/method type.

Pushing this shuffling to the Java side will allow the JIT to reduce data motion, and this should result in reduced shuffling being needed overall I think.
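
To illustrate step 4 only, here is a rough sketch of how `MethodHandles::permuteArguments` can hide a re-ordered leaf behind the original signature. This is not the DowncallLinker code: `PermuteSketch`, `reorderedLeaf`, and the chosen argument types are made up for the example, and the plain Java method merely stands in for the native method handle that would be created in step 3.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class PermuteSketch {
    // Hypothetical stand-in for the native method handle built from the
    // re-ordered VMStorage[] / method type: it takes (c, a, b) instead of (a, b, c).
    public static String reorderedLeaf(double c, long a, int b) {
        return "a=" + a + " b=" + b + " c=" + c;
    }

    // Wraps the re-ordered leaf so that callers see the original (a, b, c) order.
    public static MethodHandle restoreOriginalOrder() throws ReflectiveOperationException {
        MethodHandle leaf = MethodHandles.lookup().findStatic(
                PermuteSketch.class, "reorderedLeaf",
                MethodType.methodType(String.class, double.class, long.class, int.class));

        // The method type callers expect: (long a, int b, double c) -> String.
        MethodType original =
                MethodType.methodType(String.class, long.class, int.class, double.class);

        // reorder[N] is the index of the incoming argument that feeds the leaf's
        // N-th parameter: leaf arg 0 (c) <- incoming 2, arg 1 (a) <- incoming 0,
        // arg 2 (b) <- incoming 1.
        return MethodHandles.permuteArguments(leaf, original, 2, 0, 1);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = restoreOriginalOrder();
        String result = (String) mh.invokeExact(1L, 2, 3.0);
        System.out.println(result); // prints "a=1 b=2 c=3.0"
    }
}
```

In the real linker the permutation would be derived from the register assignment computed in steps 1 and 2 rather than hard-coded as it is here.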
-------------

PR Comment: https://git.openjdk.org/jdk/pull/15089#issuecomment-1661382669