On Fri, 12 Dec 2025 15:42:24 GMT, Bhavana Kilambi <[email protected]> wrote:

>> I mean we do not expect a data dependence between two `ins` 
>> operations, but there is one now. We do not recommend using instructions 
>> that write only part of a register, since this can introduce unexpected 
>> dependences between them. I suggest using `ext` instead; I observe about a 
>> 20% performance improvement over the current version on V2. I did not check 
>> the correctness, but it looks right to me. Could you please help check on 
>> other machines? Thanks!
>> 
>> The change might look like:
>> Suggestion:
>> 
>>         fmulh(dst, fsrc, vsrc);
>>         ext(vtmp, T8B, vsrc, vsrc, 2);
>>         fmulh(dst, dst, vtmp);
>>         ext(vtmp, T8B, vsrc, vsrc, 4);
>>         fmulh(dst, dst, vtmp);
>>         ext(vtmp, T8B, vsrc, vsrc, 6);
>>         fmulh(dst, dst, vtmp);
>>         if (isQ) {
>>           ext(vtmp, T16B, vsrc, vsrc, 8);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 10);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 12);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 14);
>>           fmulh(dst, dst, vtmp);
>
> Hi @XiaohongGong Thanks for this suggestion. I understand that `ins` has a 
> read-modify-write dependency while `ext` does not have that as we are not 
> reading the `vtmp` register in this case.
> 
> I made changes to both the add and mul reduction implementations and could 
> see some perf gains on Neoverse V1 and Neoverse V2 for mul reduction, but none 
> on Neoverse N1. The following is the ratio of throughput with `ext` to 
> throughput with `ins` (`>1` means `ext` is better) on Neoverse V2 - 
> 
> Benchmark | vectorDim | 8B | 16B
> -- | -- | -- | --
> Float16OperationsBenchmark.ReductionAddFP16 | 256 | 1.0022509 | 0.99938584
> Float16OperationsBenchmark.ReductionAddFP16 | 512 | 1.05157946 | 1.00262025
> Float16OperationsBenchmark.ReductionAddFP16 | 1024 | 1.02392196 | 1.00187924
> Float16OperationsBenchmark.ReductionAddFP16 | 2048 | 1.01219315 | 0.99964493
> Float16OperationsBenchmark.ReductionMulFP16 | 256 | 0.99729809 | 1.19006546
> Float16OperationsBenchmark.ReductionMulFP16 | 512 | 1.03897347 | 1.0689105
> Float16OperationsBenchmark.ReductionMulFP16 | 1024 | 1.01822982 | 1.01509971
> Float16OperationsBenchmark.ReductionMulFP16 | 2048 | 1.0086255 | 1.0032434
> 
> The 20% gain you mentioned is reproducible, but only for the smallest array 
> size. The gains taper off for larger array sizes. My wild guess is that for 
> smaller array sizes the loop is latency-bound, so removing the dependency 
> introduced by the `ins` chains brings down the total latency, whereas for 
> larger array sizes the loop becomes more memory-bound, with more loads/stores, 
> and removing the `ins` dependency chains probably doesn't help much there.
> 
> 
> Similar numbers for Neoverse V1 -
> 
> ...

Thanks for your testing!
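
On the correctness question for the `ext` sequence quoted above, a small scalar model may help. This is only a sketch under stated assumptions: `Vec`, `pack`, `ext_same`, `lane0` and `reduce_mul_ext` are hypothetical names and not HotSpot APIs, and the 16-bit lanes hold small integers as stand-ins for FP16 values, so it checks which lanes are selected but not FP16 rounding.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model; none of these names are HotSpot APIs.
using Vec = std::array<uint8_t, 16>;  // one 128-bit SIMD register as raw bytes

// Pack eight 16-bit lanes into the register, little-endian.
Vec pack(const std::array<uint16_t, 8>& lanes) {
    Vec v{};
    for (std::size_t i = 0; i < 8; ++i) {
        v[2 * i] = static_cast<uint8_t>(lanes[i] & 0xff);
        v[2 * i + 1] = static_cast<uint8_t>(lanes[i] >> 8);
    }
    return v;
}

// Model of EXT with both sources equal: byte i of the result is byte
// (i + index) mod size of src, where size is 8 (T8B) or 16 (T16B).
Vec ext_same(const Vec& src, std::size_t size, std::size_t index) {
    assert((size == 8 || size == 16) && index < size);
    Vec out{};  // bytes above `size` stay zero
    for (std::size_t i = 0; i < size; ++i) {
        out[i] = src[(i + index) % size];
    }
    return out;
}

// Read 16-bit lane 0, the lane a scalar H-register operation would see.
uint16_t lane0(const Vec& v) {
    return static_cast<uint16_t>(v[0] | (v[1] << 8));
}

// Mirrors the quoted sequence: multiply by lane 0, then rotate each
// remaining lane into lane 0 with ext and multiply again.
// Assumption: integer multiply stands in for fmulh.
uint64_t reduce_mul_ext(uint64_t fsrc, const Vec& vsrc, bool isQ) {
    uint64_t dst = fsrc * lane0(vsrc);               // fmulh(dst, fsrc, vsrc)
    for (std::size_t idx = 2; idx <= 6; idx += 2) {  // ext(vtmp, T8B, ...)
        dst *= lane0(ext_same(vsrc, 8, idx));
    }
    if (isQ) {
        for (std::size_t idx = 8; idx <= 14; idx += 2) {  // ext(vtmp, T16B, ...)
            dst *= lane0(ext_same(vsrc, 16, idx));
        }
    }
    return dst;
}
```

Filling the lanes with distinct primes, e.g. `pack({2, 3, 5, 7, 11, 13, 17, 19})`, makes the check stricter: the result equals the product of all eight primes (or of the first four when `isQ` is false) only if every lane is used exactly once.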

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2634220120
