On Wed, 2 Aug 2023 02:12:43 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

>> In FFM, a native function is called via `nep_invoker_blob`. If the 
>> function has two arguments, the blob looks like this:
>> 
>> 
>> Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
>> --------------------------------------------------------------------------------
>>   0x00007fcae394cd80: pushq %rbp
>>   0x00007fcae394cd81: movq %rsp, %rbp
>>   0x00007fcae394cd84: subq $0, %rsp
>>  ;; { argument shuffle
>>   0x00007fcae394cd88: movq %r8, %rax
>>   0x00007fcae394cd8b: movq %rsi, %r10
>>   0x00007fcae394cd8e: movq %rcx, %rsi
>>   0x00007fcae394cd91: movq %rdx, %rdi
>>  ;; } argument shuffle
>>   0x00007fcae394cd94: callq *%r10
>>   0x00007fcae394cd97: leave
>>   0x00007fcae394cd98: retq
>> 
>> 
>> `subq $0, %rsp` is for shadow space on the stack, and `movq %r8, %rax` is 
>> the number of args for a variadic function, so they are not necessary in 
>> some cases. They should be removed, as follows, when they are not needed:
>> 
>> 
>> Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
>> --------------------------------------------------------------------------------
>>   0x00007fd8778e2880: pushq %rbp
>>   0x00007fd8778e2881: movq %rsp, %rbp
>>  ;; { argument shuffle
>>   0x00007fd8778e2884: movq %rsi, %r10
>>   0x00007fd8778e2887: movq %rcx, %rsi
>>   0x00007fd8778e288a: movq %rdx, %rdi
>>  ;; } argument shuffle
>>   0x00007fd8778e288d: callq *%r10
>>   0x00007fd8778e2890: leave
>>   0x00007fd8778e2891: retq
>> 
>> 
>> All java/foreign jtreg tests pass.
>> 
>> This stub code can be seen by running the [ffmasm 
>> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/examples/cpumodel)
>>  with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode` and the hsdis 
>> library. This testcase links the code with `Linker.Option.isTrivial()`.
>> 
>> After this change, FFM performance on [another ffmasm 
>> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/benchmarks/funccall)
>>  was improved:
>> 
>> before:
>> 
>> Benchmark                           Mode  Cnt          Score          Error  Units
>> FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
>> FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s
>> 
>> 
>> after:
>> 
>> Benchmark                           Mode  Cnt          Score          Error  Units
>> FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
>> FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s
>> 
>> 
>> Environment:
>> * CPU: AMD Ry...
>
> FWIW, if you want to look into reducing the generated code further, I think 
> we can potentially reduce the amount of shuffling between registers that's 
> needed by reordering the arguments on the Java side so that each VMStorage 
> corresponding to an argument of the leaf method handle is the same as the 
> register for that argument in the Java calling convention.
> 
> I think the right place to do this is in DowncallLinker where we are creating 
> the NativeEntryPoint. The way I think it should work: 
> 1. compute the Java calling convention's argument registers for the leaf 
> method type.
> 2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered 
> method type, such that the VMStorage/type for a particular argument index 
> matches the register for the same index used in the Java calling convention 
> as much as possible.
> 3. use the re-ordered VMStorage[] + MethodType to create the native entry 
> point + native method handle
> 4. apply the same reordering in reverse to the arguments of the created 
> native method handle (using MethodHandles::permuteArguments) so that the 
> resulting method handle has the original argument order/method type.
> 
> Pushing this shuffling to the Java side will allow the JIT to reduce data 
> motion, and this should result in reduced shuffling being needed overall I 
> think.

@JornVernee Thanks for your review! I will integrate this once I get a second 
reviewer.

> I think we can potentially reduce the amount of shuffling between registers 
> that's needed by reordering the arguments on the Java side so that each 
> VMStorage corresponding to an argument of the leaf method handle is the same 
> as the register for that argument in the Java calling convention.

That would be great! I guess you are suggesting that `ArgumentShuffle` in 
HotSpot should move into `DowncallLinker`, right? To be honest, I don't 
understand this area well yet, and I have no testbed other than Linux x64, so 
it is difficult for me to work on this now.

Again, this idea is great. I'd like to call native functions via FFM with less 
overhead, so I'm happy to help if I can.
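
To make step 4 of the suggestion concrete, here is a minimal sketch of how 
`MethodHandles::permuteArguments` can restore the original argument order on 
top of a handle whose parameters were re-ordered. It is only an illustration, 
not the `DowncallLinker` implementation: the class `PermuteSketch`, the method 
`leaf` (a plain Java stand-in for the native method handle), and the chosen 
reordering are all hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PermuteSketch {
    // Hypothetical stand-in for the native method handle created from the
    // re-ordered MethodType: it takes (double, long, int).
    static String leaf(double d, long l, int i) {
        return "i=" + i + " l=" + l + " d=" + d;
    }

    // Wraps the re-ordered handle so callers see the original
    // (int, long, double) order again.
    static MethodHandle restoreOriginalOrder() throws ReflectiveOperationException {
        MethodHandle reordered = MethodHandles.lookup().findStatic(
                PermuteSketch.class, "leaf",
                MethodType.methodType(String.class, double.class, long.class, int.class));
        MethodType original =
                MethodType.methodType(String.class, int.class, long.class, double.class);
        // reorder[k] is the index in the original argument list that feeds
        // parameter k of the re-ordered handle.
        int[] reorder = {2, 1, 0};
        return MethodHandles.permuteArguments(reordered, original, reorder);
    }

    // Calls the wrapped handle using the original argument order.
    static String call(int i, long l, double d) {
        try {
            return (String) restoreOriginalOrder().invokeExact(i, l, d);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(1, 2L, 3.0));
    }
}
```

In the real change, the reordering would presumably be derived from the Java 
calling convention's argument registers (steps 1 and 2) rather than hard-coded 
as it is here.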

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15089#issuecomment-1662132113
