On Tue, 7 Oct 2025 02:47:50 GMT, Xiaohong Gong <[email protected]> wrote:
>> Are you referring to the N1 numbers? The add reduction operation has gains
>> around ~40% while the mul reduction is around ~20% on N1. On V1 and V2 they
>> look comparable (not considering the cases where we generate `fadda`
>> instructions for add reduction).
>>
>>> Seems instructions between different ins instructions will have a
>>> data-dependence, which is not expected
>>
>> Why do you think it's not expected? We have the exact same sequence for Neon
>> add reduction as well. There's back to back dependency there as well and yet
>> it shows better performance. The N1 optimization guide shows 2 cyc latency
>> for `fadd` and 3 cyc latency for `fmul`. Could this be the reason? WDYT?
>
> I mean we do not expect there is data-dependence between two `ins`
> operations, but it has now. We do not recommend use the instructions that
> just write part of a register. This might involve un-expected dependence
> between. I suggest to use `ext` instead, and I can observe about 20%
> performance improvement compared with current version on V2. I did not check
> the correctness, but it looks right to me. Could you please help check on
> other machines? Thanks!
>
> The change might look like:
> Suggestion:
>
> fmulh(dst, fsrc, vsrc);
> ext(vtmp, T8B, vsrc, vsrc, 2);
> fmulh(dst, dst, vtmp);
> ext(vtmp, T8B, vsrc, vsrc, 4);
> fmulh(dst, dst, vtmp);
> ext(vtmp, T8B, vsrc, vsrc, 6);
> fmulh(dst, dst, vtmp);
> if (isQ) {
>   ext(vtmp, T16B, vsrc, vsrc, 8);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 10);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 12);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 14);
>   fmulh(dst, dst, vtmp);
> }
Hi @XiaohongGong, thanks for this suggestion. I understand that `ins` has a
read-modify-write dependency on its destination register, whereas `ext` does
not, because `ext` overwrites `vtmp` completely and never reads its previous
value.
I made this change to both the add and mul reduction implementations, and I
see some perf gains for mul reduction on V1 and V2:
Benchmark | vectorDim | 8B | 16B
-- | -- | -- | --
Float16OperationsBenchmark.ReductionAddFP16 | 256 | 1.0022509 | 0.99938584
Float16OperationsBenchmark.ReductionAddFP16 | 512 | 1.05157946 | 1.00262025
Float16OperationsBenchmark.ReductionAddFP16 | 1024 | 1.02392196 | 1.00187924
Float16OperationsBenchmark.ReductionAddFP16 | 2048 | 1.01219315 | 0.99964493
Float16OperationsBenchmark.ReductionMulFP16 | 256 | 0.99729809 | 1.19006546
Float16OperationsBenchmark.ReductionMulFP16 | 512 | 1.03897347 | 1.0689105
Float16OperationsBenchmark.ReductionMulFP16 | 1024 | 1.01822982 | 1.01509971
Float16OperationsBenchmark.ReductionMulFP16 | 2048 | 1.0086255 | 1.0032434
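
To illustrate the `ins` versus `ext` point above, here is a minimal standalone C++ model (not HotSpot code) of what the `ext`-based sequence does with the lanes. It assumes little-endian 16-bit lanes and models only the 16-byte form of `ext`; the names `Vec128`, `ext16b`, `lane0`, `fromLanes` and `lanesInReductionOrder` are hypothetical helpers made up for this sketch. Each `ext(vtmp, ..., vsrc, vsrc, 2 * i)` rotates lane `i` of `vsrc` into lane 0 of `vtmp` and writes every byte of `vtmp`, so the result never depends on the old contents of `vtmp`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of a 128-bit vector register as 16 bytes (hypothetical helper type).
using Vec128 = std::array<std::uint8_t, 16>;

// Models EXT with the 16B arrangement: take 16 consecutive bytes starting at
// byte `index` of `lo`, continuing into `hi`. Every output byte is written,
// so the previous contents of the destination are never read.
Vec128 ext16b(const Vec128& lo, const Vec128& hi, std::size_t index) {
    assert(index < 16);
    Vec128 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::size_t pos = index + i;
        out[i] = pos < 16 ? lo[pos] : hi[pos - 16];
    }
    return out;
}

// Reads the 16-bit lane 0 (the scalar half-precision slot) as raw bits.
std::uint16_t lane0(const Vec128& v) {
    return static_cast<std::uint16_t>(v[0] | (v[1] << 8));
}

// Packs eight 16-bit lanes into a register image, little-endian per lane.
Vec128 fromLanes(const std::array<std::uint16_t, 8>& lanes) {
    Vec128 v{};
    for (std::size_t i = 0; i < 8; ++i) {
        v[2 * i] = static_cast<std::uint8_t>(lanes[i] & 0xFF);
        v[2 * i + 1] = static_cast<std::uint8_t>(lanes[i] >> 8);
    }
    return v;
}

// Returns the lanes in the order the ext-based sequence would hand them to
// the scalar multiply: lane 0 directly, then lane i via a rotate by 2*i bytes.
std::array<std::uint16_t, 8> lanesInReductionOrder(const Vec128& vsrc) {
    std::array<std::uint16_t, 8> order{};
    order[0] = lane0(vsrc);
    for (std::size_t i = 1; i < 8; ++i) {
        Vec128 vtmp = ext16b(vsrc, vsrc, 2 * i);  // fresh value, no old vtmp read
        order[i] = lane0(vtmp);
    }
    return order;
}
```

In this model the lanes come out in their original order, which is what a strictly ordered reduction needs. The multiplies still form one dependency chain through `dst`, but each `ext` depends only on `vsrc`.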
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2614674991