On Mon, 31 Jul 2023 12:22:00 GMT, Yasumasa Suenaga <ysuen...@openjdk.org> wrote:

> In FFM, a native function is called via `nep_invoker_blob`. If the 
> function has two arguments, the stub looks like this:
> 
> 
> Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
> --------------------------------------------------------------------------------
>   0x00007fcae394cd80: pushq %rbp
>   0x00007fcae394cd81: movq %rsp, %rbp
>   0x00007fcae394cd84: subq $0, %rsp
>  ;; { argument shuffle
>   0x00007fcae394cd88: movq %r8, %rax
>   0x00007fcae394cd8b: movq %rsi, %r10
>   0x00007fcae394cd8e: movq %rcx, %rsi
>   0x00007fcae394cd91: movq %rdx, %rdi
>  ;; } argument shuffle
>   0x00007fcae394cd94: callq *%r10
>   0x00007fcae394cd97: leave
>   0x00007fcae394cd98: retq
> 
> 
> `subq $0, %rsp` reserves shadow space on the stack, and `movq %r8, %rax` sets 
> `%rax`, which variadic functions use for the number of vector arguments. Neither 
> is needed in some cases, so they should be removed when unnecessary, as follows:
> 
> 
> Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
> --------------------------------------------------------------------------------
>   0x00007fd8778e2880: pushq %rbp
>   0x00007fd8778e2881: movq %rsp, %rbp
>  ;; { argument shuffle
>   0x00007fd8778e2884: movq %rsi, %r10
>   0x00007fd8778e2887: movq %rcx, %rsi
>   0x00007fd8778e288a: movq %rdx, %rdi
>  ;; } argument shuffle
>   0x00007fd8778e288d: callq *%r10
>   0x00007fd8778e2890: leave
>   0x00007fd8778e2891: retq
> 
> 
> All java/foreign jtreg tests pass.
> 
> The stub code can be inspected with the [ffmasm 
> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/examples/cpumodel)
>  using `-XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode` and the hsdis library. 
> This testcase links the code with `Linker.Option.isTrivial()`.
> 
> After this change, FFM performance on [another ffmasm 
> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/benchmarks/funccall)
>  was improved:
> 
> before:
> 
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
> FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s
> 
> 
> after:
> 
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
> FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s
> 
> 
> Environment:
> * CPU: AMD Ryzen 3 3300X
> * OS: Fedora 38 x86_64 (Kernel 6.3.8-200.fc38.x86_64)
> * Hyper-V 4vCPU, 8GB mem

FWIW, if you want to look into reducing the generated code further, I think we 
can potentially reduce the amount of shuffling between registers that's needed 
by reordering the arguments on the Java side so that each VMStorage 
corresponding to an argument of the leaf method handle is the same as the 
register for that argument in the Java calling convention.

I think the right place to do this is in DowncallLinker where we are creating 
the NativeEntryPoint. The way I think it should work: 
1. compute the Java calling convention's argument registers for the leaf method 
type.
2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered method 
type, such that the VMStorage/type for a particular argument index matches the 
register for the same index used in the Java calling convention as much as 
possible.
3. use the re-ordered VMStorage[] and method type to create the native entry point + native method handle
4. apply the same reordering to the created native method handle (using 
MethodHandles::permuteArguments) so that the resulting method handle has the 
original argument order/method type.
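
To illustrate step 4 only, here is a minimal sketch of restoring the original argument order with MethodHandles::permuteArguments. It is not the DowncallLinker code: the class `PermuteSketch`, the `leaf` method (a plain Java stand-in for the native method handle), and the chosen reordering are all hypothetical, and the argument types are picked just to make the permutation visible.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PermuteSketch {
    // Hypothetical stand-in for the native method handle. It takes its
    // arguments in the re-ordered form (double, long, int).
    static String leaf(double d, long l, int i) {
        return i + "," + l + "," + d;
    }

    // Wraps the re-ordered handle so callers see the original (int, long, double) type.
    static MethodHandle restoreOriginalOrder() throws ReflectiveOperationException {
        MethodHandle reordered = MethodHandles.lookup().findStatic(
                PermuteSketch.class, "leaf",
                MethodType.methodType(String.class, double.class, long.class, int.class));

        // The method type the caller originally asked for.
        MethodType original =
                MethodType.methodType(String.class, int.class, long.class, double.class);

        // reorder[i] is the index in 'original' of the argument that feeds
        // parameter i of the re-ordered handle.
        int[] reorder = {2, 1, 0};
        return MethodHandles.permuteArguments(reordered, original, reorder);
    }

    // Convenience wrapper for trying the sketch out.
    static String call(int i, long l, double d) {
        try {
            return (String) restoreOriginalOrder().invokeExact(i, l, d);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Calling `PermuteSketch.call(1, 2L, 3.0)` passes the arguments in the original order, and the permuted handle forwards them to `leaf` as `(3.0, 2L, 1)`. In the real linker the reordering array would presumably be derived from the comparison in steps 1 and 2 rather than hard-coded.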

Pushing this shuffling to the Java side will allow the JIT to reduce data 
motion, which I think should reduce the amount of shuffling needed overall.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15089#issuecomment-1661382669
