Hi Kyrill,

> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Friday, July 18, 2025 10:40 AM
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
> <alex.cop...@arm.com>
> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> instructions
> 
> Hi all,
> 
> For inserting zero into a vector lane we usually use an instruction like:
>       ins     v0.h[2], wzr
> 
> This, however, has not-so-great performance on some CPUs.
> On Grace, for example, it has a latency of 5 and a throughput of 1.
> The alternative sequence:
>       movi    v31.8b, #0
>       ins     v0.h[2], v31.h[0]
> is preferable because the MOVI-0 is often a zero-latency operation that is
> eliminated by the CPU frontend and the lane-to-lane INS has a latency of 2 and
> throughput of 4.
> We can avoid the merging of the two instructions into the
> aarch64_simd_vec_set_zero<mode>
> insn through rtx costs.  We just need to handle the right VEC_MERGE form in
> aarch64_rtx_costs.  The new CPU-specific cost field ins_gp is introduced to
> describe this operation.
> According to a similar LLVM PR: https://github.com/llvm/llvm-project/pull/146538
> and looking at some Arm SWOGs I expect the Neoverse-derived cores to benefit
> from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
> respectable latency of 3 on Cortex-A510).
> 

I'm not really sure about the Cortex-A510; its SWOG entries look a bit suspect.  I think
they haven't added an entry for INS from GPR, which is an alias for MOV element, GPR,
and the throughput for the INS they did add (element to element) also seems suspect.

If I assume the same latency as an FMOV for the INS GPR, WZR case, then yes, the MOVI
sequence is slower (at least on paper), but I'd expect throughput to matter more in
actual code.

So I think we should just do this for every core, and by default.  I have a couple
of arguments for this:

1. We already do this for every constant but 0, since we prefer the throughput
     advantage here.  See https://godbolt.org/z/9PKhEax8G: if you change your
     testcase to anything but 0 you get the movi, even though strictly speaking
     the movi is more expensive than the mov (see the sketch after this list).

2. I've run this sequence change over various cores, and they all showed either
    an improvement or very little difference, e.g.:

cortex-a510   │ little │         -0.1%
cortex-a520   │ little │        -0.09%
cortex-a710   │  mid   │        -0.03%

etc.  This compares the movi sequence against the INS wzr one.
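
To make argument 1 concrete (this is the sketch referenced above; the godbolt
link has the exact output, and the register choice and movi width here are only
illustrative): changing the testcase to insert, say, 1 instead of 0 already
gives something roughly like

        movi    v31.4h, 0x1
        ins     v0.h[2], v31.h[0]

whereas the zero case today collapses to the single

        ins     v0.h[2], wzr

so for every value except zero we already accept the extra
constant-materialising instruction.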

However, the advantage grows significantly when there is more than one of the
same constant to insert, e.g.:

#include <arm_neon.h>

__attribute__((noipa))
uint16x8_t foo2(uint16x8_t a) {
  a[2] = 0;
  a[5] = 0;
  return a;
}
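
Roughly, the two competing sequences for foo2 would be something like (a
sketch, exact register allocation may differ):

        ins     v0.h[2], wzr
        ins     v0.h[5], wzr

versus

        movi    v31.8b, #0
        ins     v0.h[2], v31.h[0]
        ins     v0.h[5], v31.h[0]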

Now the movi pays for itself, and the INS wzr writes are serial on most cores.

Benchmarking this gives, as expected for the in-order cores:

cortex-a510   │ little │         -0.1%
cortex-a520   │ little │        -0.05%

but then for the out-of-order cores:

cortex-a710   │  mid   │      ✔  -49.49%
neoverse v2   │  big   │      ✔  -49.57%

etc.

This is why I think we should do this by default.

As a side note, the current patch only seems to work outside loops, e.g.:

__attribute__((noipa))
uint16x8_t foos(uint16x8_t a, int n) {
  for (int i = 0; i < n; i++)
    a[2] = 0;
  return a;
}

This does not seem to produce the movi, and I think that's because, after the
movi is rejected in the vec_merge and floated into the loop preheader, combine
now tries to force it back down and ultimately succeeds.
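
In other words, what we'd want is roughly (a sketch, not actual compiler
output):

        movi    v31.8b, #0              // hoisted to the loop preheader
.loop:
        ins     v0.h[2], v31.h[0]
        ...

but instead the INS wzr ends up back inside the loop:

.loop:
        ins     v0.h[2], wzr
        ...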

Given the above, what do you think about instead just disabling the
aarch64_simd_vec_set_zero pattern when not compiling for size?

That should do what we want in all cases.

Thanks,
Tamar

> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the split
> into two instructions; values lower than that retain the INS-WZR form.
> cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
> sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.  3 is the number
> of cycles above the cheapest SIMD instruction on such cores (the cheapest one
> takes 2 cycles).
> 
> cortexa53_extra_costs and all other cost tables set ins_gp to COSTS_N_INSNS (1) to
> preserve the current codegen, though I'd be happy to increase it for generic tuning.
> 
> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
> always generated.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill
> 
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> 
> gcc/
> 
>       * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>       field.  Add comments to other vector cost fields.
>       * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
> case.
>       * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>       thunderx_extra_costs, thunderx2t99_extra_costs,
>       thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>       ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>       Specify ins_gp cost.
>       * config/arm/aarch-cost-tables.h (generic_extra_costs,
>       cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>       exynosm1_extra_costs, xgene1_extra_costs): Likewise.
> 
> gcc/testsuite/
> 
>       * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>       test_setq_lane4): Relax allowed assembly.
