Hi Tamar,

> On 21 Jul 2025, at 11:12, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi Kyrill,
>
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Friday, July 18, 2025 10:40 AM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
>> <alex.cop...@arm.com>
>> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
>> instructions
>>
>> Hi all,
>>
>> For inserting zero into a vector lane we usually use an instruction like:
>>   ins v0.h[2], wzr
>>
>> This, however, has not-so-great performance on some CPUs.
>> On Grace, for example, it has a latency of 5 and a throughput of 1.
>> The alternative sequence:
>>   movi v31.8b, #0
>>   ins  v0.h[2], v31.h[0]
>> is preferable because the MOVI-0 is often a zero-latency operation that is
>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
>> and a throughput of 4.
>> We can avoid the merging of the two instructions into the
>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
>> cost field ins_gp is introduced to describe this operation.
>> According to a similar LLVM PR, https://github.com/llvm/llvm-project/pull/146538,
>> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to
>> benefit from this, whereas little cores like Cortex-A510 won't (INS-WZR has
>> a respectable latency of 3 on Cortex-A510).
>>
>
> I'm not really sure about the Cortex-A510; the entries are a bit suspect.  I
> think they haven't added an entry for INS from GPR, which is an alias for
> MOV element, GPR.  And the throughput for the INS they did add (element to
> element) seems suspect.
>
> If I assume the same latency as an FMOV for INS GPR Wzr, then yes, the MOVI
> is slower (at least on paper), but I'd expect the throughput to matter more
> in actual code.
>
> So I think we should just do this for every core and by default.  I have a
> couple of arguments for this:
>
> 1. We already do this for every constant but 0, since we prefer the
> throughput advantage here.  See https://godbolt.org/z/9PKhEax8G where, if
> you change your testcase to anything but 0, you get the movi, even though
> strictly speaking the movi is more expensive than the mov.
>
> 2. I've run this sequence change over various cores, and they all showed
> either an improvement or very little difference, e.g. Cortex-A510 shows:
>
> cortex-a510 │ little │ -0.1%
> cortex-a520 │ little │ -0.09%
> cortex-a710 │ mid    │ -0.03%
>
> etc.  This is comparing using movi over the INS, wzr.

Thanks for the broader benchmarking!
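To make the comparison concrete, the single-lane case behind these numbers is
essentially the function below (an illustrative sketch of mine, not one of the
functions from the patch or the testsuite).  Today it compiles to a single INS
from WZR; with the movi preference it becomes the MOVI plus lane-to-lane INS
pair from the patch description:

#include <arm_neon.h>

__attribute__((noipa))
uint16x8_t foo1(uint16x8_t a) {
    /* Insert zero into lane 2.  Current codegen: ins v0.h[2], wzr.
       Preferred codegen per this discussion: movi v31.8b, #0 followed by
       ins v0.h[2], v31.h[0].  */
    a[2] = 0;
    return a;
}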
> However the advantage grows significantly when there is more than one of the
> same constant to insert, e.g.
>
> __attribute__((noipa))
> uint16x8_t foo2(uint16x8_t a) {
>     a[2] = 0;
>     a[5] = 0;
>     return a;
> }
>
> Now the movi pays for itself and the INS wzr is serial over most cores.
>
> Benchmarking this gives you, as expected for in-order cores:
>
> cortex-a510 │ little │ -0.1%
> cortex-a520 │ little │ -0.05%
>
> but then for out-of-order cores:
>
> cortex-a710 │ mid │ ✔ -49.49%
> neoverse v2 │ big │ ✔ -49.57%
>
> etc.
>
> Showing that I think we should do this by default.

Yeah, that's fine with me.

> As a side note, the current patch only seems to work outside loops, e.g.
>
> __attribute__((noipa))
> uint16x8_t foos(uint16x8_t a, int n) {
>     for (int i = 0; i < n; i++)
>         a[2] = 0;
>     return a;
> }
>
> does not seem to make the movi, and I think that's because after rejecting
> the movi in the vec_merge and floating it into the loop preheader, combine
> now tries to force it down and ultimately succeeds.
>
> Given the above, what do you think about instead just disabling the
> aarch64_simd_vec_set_zero pattern when not compiling for size?

Yes, that would make for a simpler patch and wouldn't rely on the rtx cost
users doing the right thing.  I'll respin accordingly.
Thanks,
Kyrill

> That should do what we want in all cases.
>
> Thanks,
> Tamar
>
>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
>> split into two instructions; values lower than that retain the INS-WZR form.
>> cortexa76_extra_costs, from which Grace and many other Neoverse cores
>> derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.
>> 3 is the number of cycles above the normal cheapest SIMD instruction on
>> such cores (which take 2 cycles for the cheapest one).
>>
>> cortexa53_extra_costs and all the other cost tables set ins_gp to
>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
>> increase it for generic tuning.
>>
>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
>> always generated.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> 	* config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>> 	field.  Add comments to other vector cost fields.
>> 	* config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
>> 	case.
>> 	* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>> 	thunderx_extra_costs, thunderx2t99_extra_costs,
>> 	thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>> 	ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>> 	Specify ins_gp cost.
>> 	* config/arm/aarch-cost-tables.h (generic_extra_costs,
>> 	cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>> 	exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>>
>> gcc/testsuite/
>>
>> 	* gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>> 	test_setq_lane4): Relax allowed assembly.
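P.S. To spell out what I mean by the respin: keep the
aarch64_simd_vec_set_zero pattern, but only let it match when optimizing for
size, so that -Os keeps the single INS-WZR and the default (speed) tuning gets
the MOVI + INS pair.  From the user's side the intent is roughly the following
(a hedged sketch; the function name is mine and the intrinsic route is just
one way to exercise the lane insertion):

#include <arm_neon.h>

__attribute__((noipa))
uint16x8_t zero_lane5(uint16x8_t a) {
    /* Insert 0 into lane 5 via the ACLE intrinsic.  Intended codegen after
       the respin: a single "ins v0.h[5], wzr" at -Os, and a "movi" plus
       lane-to-lane "ins" when optimizing for speed.  */
    return vsetq_lane_u16(0, a, 5);
}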