Hi Kyrill,

> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Friday, July 18, 2025 10:40 AM
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
> <alex.cop...@arm.com>
> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> instructions
>
> Hi all,
>
> For inserting zero into a vector lane we usually use an instruction like:
>   ins v0.h[2], wzr
>
> This, however, has not-so-great performance on some CPUs.
> On Grace, for example, it has a latency of 5 and a throughput of 1.
> The alternative sequence:
>   movi v31.8b, #0
>   ins v0.h[2], v31.h[0]
> is preferable because the MOVI-0 is often a zero-latency operation that is
> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
> and a throughput of 4.
> We can avoid the merging of the two instructions into the
> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
> cost field ins_gp is introduced to describe this operation.
> According to a similar LLVM PR: https://github.com/llvm/llvm-project/pull/146538
> and looking at some Arm SWOGs I expect the Neoverse-derived cores to benefit
> from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
> respectable latency of 3 on Cortex-A510).
>
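(For reference, the kind of testcase I've been looking at for this is along
the lines of the following; this is my own reduced example, the function name
is just illustrative and it isn't necessarily the testcase from your patch:

#include <arm_neon.h>

__attribute__((noipa))
uint16x8_t foo1(uint16x8_t a)
{
    a[2] = 0;   /* Set a single lane to zero.  */
    return a;
}

At -O2 today this gives the single "ins v0.h[2], wzr", and with your patch on
the affected tunings it would become the MOVI plus lane-to-lane INS pair.)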
I'm not really sure about the Cortex-A510; the entries there are a bit
suspect.  I think they haven't added an entry for INS from GPR, which is an
alias for MOV (element, GPR), and the throughput for the INS they did add
(element to element) also looks suspect.  If I assume the same latency as an
FMOV for the INS-from-GPR with WZR, then yes, the MOVI sequence is slower (at
least on paper), but I'd expect throughput to matter more in real code.

So I think we should just do this for every core and by default.  I have a
couple of arguments for this:

1. We already do this for every constant but 0, since we prefer the
   throughput advantage there.  See https://godbolt.org/z/9PKhEax8G where, if
   you change your testcase to anything but 0, you get the MOVI, even though
   strictly speaking the MOVI is more expensive than the MOV.

2. I've run this sequence change over various cores, and they all showed
   either an improvement or very little difference, e.g. comparing the MOVI
   sequence against INS-WZR:

   cortex-a510 │ little │ -0.1%
   cortex-a520 │ little │ -0.09%
   cortex-a710 │ mid    │ -0.03%

   etc.  However, the advantage grows significantly when there is more than
   one of the same constant to insert, e.g.:

   __attribute__((noipa))
   uint16x8_t foo2(uint16x8_t a)
   {
       a[2] = 0;
       a[5] = 0;
       return a;
   }

   Now the MOVI pays for itself and the INS-WZR sequence is serialized on
   most cores.  Benchmarking this gives, as expected for in-order cores:

   cortex-a510 │ little │ -0.1%
   cortex-a520 │ little │ -0.05%

   but for out-of-order cores:

   cortex-a710 │ mid │ ✔ -49.49%
   neoverse v2 │ big │ ✔ -49.57%

   etc., which is why I think we should do this by default.

As a side note, the current patch only seems to work outside loops, e.g.:

__attribute__((noipa))
uint16x8_t foos(uint16x8_t a, int n)
{
    for (int i = 0; i < n; i++)
        a[2] = 0;
    return a;
}

does not seem to generate the MOVI, and I think that's because after the MOVI
is rejected in the vec_merge and floated into the loop preheader, combine
then tries to force it back down and ultimately succeeds.

Given the above, what do you think about instead just disabling the
aarch64_simd_vec_set_zero pattern when not compiling for size?  That should
do what we want in all cases; a rough sketch of what I mean is at the end of
this mail.

Thanks,
Tamar

> Practically, a value of COSTS_N_INSNS (2) and higher for ins_gp causes the
> split into two instructions, values lower than that retain the INS-WZR form.
> cortexa76_extra_costs, from which Grace and many other Neoverse cores
> derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.
> 3 is the number of cycles above the normal cheapest SIMD instruction on such
> cores (which take 2 cycles for the cheapest one).
>
> cortexa53_extra_costs and all other cost tables set ins_gp to
> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
> increase it for generic tuning.
>
> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
> always generated.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
> 	* config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
> 	field.  Add comments to other vector cost fields.
> 	* config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
> 	case.
> 	* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
> 	thunderx_extra_costs, thunderx2t99_extra_costs,
> 	thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
> 	ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
> 	Specify ins_gp cost.
> 	* config/arm/aarch-cost-tables.h (generic_extra_costs,
> 	cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
> 	exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>
> gcc/testsuite/
>
> 	* gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
> 	test_setq_lane4): Relax allowed assembly.
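P.S. To make the suggestion above concrete, this is roughly the shape of
change I had in mind.  It's only a sketch: I haven't checked the pattern's
exact current condition in aarch64-simd.md, so "<existing condition>" below
stands for whatever is there today; the RTL template and output would stay
unchanged.

-  "TARGET_SIMD && <existing condition>"
+  "TARGET_SIMD && <existing condition> && optimize_function_for_size_p (cfun)"

i.e. keep the single INS-WZR form only when optimizing for size, and let the
MOVI plus INS pair be generated everywhere else, which I think would also
remove the need for the per-CPU ins_gp cost.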