[SUMMARY] Re: [PATCH v11 0/8] aarch64: port NEON intrinsics to pragma-based framework

Karl Meakin via Sourceware Forge Tue, 30 Jun 2026 07:48:54 -0700


This is a summary of discussions relative to the merge request created by Karl 
Meakin (karmea01) <[email protected]> titled
aarch64: port NEON intrinsics to pragma-based framework
since its creation.


Description: This patch is a proof of concept patch which ports a few NEON 
intrinsics (intrinsics defined in `arm_neon.h`) to the "pragma-based" framework 
used by SVE/SME intrinsics.
If successful, I will follow up with further patches porting the rest.

tested with `make check`

changelog:
* v1: Initial revision
* v2: Appease `check_GNU_style.py`
* v3: Drop unrelated `.editorconfig` changes which were included by mistake
* v4: 
    * Address review comments
    * Move reformatting of `config.gcc` into its own commit.
    * Merge `aarch64-neon-builtins.cc` into `aarch64-sve-builtins.cc` and 
rename it to `aarch64-acle-builtins.cc`
* v5: Fix codegen for big-endian targets
* v6 Improve codegen for `FEAT_SHA3` intrinsics (`veor3`, `vbcax`, `vrax1` and 
`vxar`) at `-O0`.
* v7 Delete `aarch64-neon-builtins.cc` again after it somehow got reintroduced 
in v6
* v8 Remove RFC tag, rebase against master
* v9 Use the new `IFN_BITREVERSE` when lowering `rbit`
* v10:
  * Address review comments
  * Split the commit porting vector manipulation intrinsics into two commits: 
one for vector creation, and one for lane getters and setters

CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]

The full and up to date discussion can be found at 
https://forge.sourceware.org/gcc/gcc-TEST/pulls/158

The merge request has been closed without being merged directly on the forge 
repository.

On 2026-05-13 16:28:51+00:00, Claudio Bantaloukas (rdfm) <[email protected]> 
requested changes to the code:
This is a great start and I've very excited to see the patch series land. I 
have some comments and hopefully more people will chime in.

> +++ .editorconfig

Seems unrelated :)
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3333,3 +2529,1 @@
> -  /* The type and range are unsigned, so read the argument as an
> -     unsigned rather than signed HWI.  */
> -  if (!tree_fits_uhwi_p (arg))
> +  if (tree_fits_shwi_p (arg))
should this function be moved in a more generic file than sve?
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3333,3 +2529,1 @@
> -  /* The type and range are unsigned, so read the argument as an
> -     unsigned rather than signed HWI.  */
> -  if (!tree_fits_uhwi_p (arg))
> +  if (tree_fits_shwi_p (arg))
What about renaming the whole file to `aarch64-acle-builtins.cc`?
> +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_dup.c
> @@ -42,3 +42,3 @@
>    return vdupq_lane_bf16 (a, 1);
>  }
> -/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h, 
> v\[0-9\]+.h\\\[0\\\]" 2 } } */
> +/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h, 
> v\[0-9\]+.h\\\[0\\\]" 1 } } */
This is surprising, if this was a bug in the previous implementation or a 
wanted change, please document it in the patch description.
> +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_dup.c
> @@ -42,3 +42,3 @@
>    return vdupq_lane_bf16 (a, 1);
>  }
> -/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h, 
> v\[0-9\]+.h\\\[0\\\]" 2 } } */
> +/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h, 
> v\[0-9\]+.h\\\[0\\\]" 1 } } */
It's an improvement in the generated code. [Previously these two functions 
generated](https://godbolt.org/z/jnEbKjqYf)
```asm
vdupq_test:
        dup     v0.8h, v0.h[0]
        ret
test_vdupq_lane_bf16:
        dup     h0, v0.h[1]
        dup     v0.8h, v0.h[0]
        ret
```

Now they generate
```asm
vdupq_test:
        dup     v0.8h, v0.h[0]
        ret

test_vdupq_lane_bf16:
        dup     v0.8h, v0.h[1]
        ret
```
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> -   /* Use vgetq_lane_u64 to get a 
> -     __builtin_aarch64_im_lane_boundsi */
> -   vgetq_lane_u64(c, __b);
> +    __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
@pinskia will this still trigger the case you fixed for 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117665

> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> -   /* Use vgetq_lane_u64 to get a 
> -     __builtin_aarch64_im_lane_boundsi */
> -   vgetq_lane_u64(c, __b);
> +    __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
Yes replacing it with __builtin_aarch64_im_lane_boundsi will invoke the ICE 
that was previously there.  But I suspect we should have 2 versions of the 
testcase, one that still uses the intrinsics and one that uses the 
`__builtin_aarch64_im_lane_boundsi` builtin. Since the original code used the 
intrinsics and not the builtin directly. We want to make sure the original 
testcase does not regress either.
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> -   /* Use vgetq_lane_u64 to get a 
> -     __builtin_aarch64_im_lane_boundsi */
> -   vgetq_lane_u64(c, __b);
> +    __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
The `vgetq_lane_u64` intrinsic does not call 
`__builtin_aarch64_im_lane_boundsi` anymore, since the checking of the 
arguments is now done directly. I could replace it with another intrinsic that 
hasn't been ported yet, but then I would have to update the test again once 
that intrinsic is ported to the new framework. Ultimately, once all the 
intrinsics have been ported, `__builtin_aarch64_im_lane_boundsi` and this test 
can be deleted.
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> -   /* Use vgetq_lane_u64 to get a 
> -     __builtin_aarch64_im_lane_boundsi */
> -   vgetq_lane_u64(c, __b);
> +    __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
Actually thinking about this slightly more. We should do both testcases. One 
with the original `vgetq_lane_u64` instrinsics and one with 
`__builtin_aarch64_im_lane_boundsi`. To make sure the original testcase that 
was provided in the bug report does not regress and one for the 
`__builtin_aarch64_im_lane_boundsi` which we caused the issue.
> +++ gcc/testsuite/gcc.target/aarch64/sha3_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=armv8.2-a+sha3" } */
> +/* { dg-options "-O1 -march=armv8.2-a+sha3" } */
What kind of issues does the absence of -O1 cause?
> +++ gcc/testsuite/gcc.target/aarch64/sha3_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=armv8.2-a+sha3" } */
> +/* { dg-options "-O1 -march=armv8.2-a+sha3" } */
There is a regression in code quality at `-O0`. Since the intrinsics now expand 
to several GIMPLE statements, they produce assembly like

https://godbolt.org/z/jsjzbz19P
```asm
xor3(__Uint8x16_t, __Uint8x16_t, __Uint8x16_t):
        sub     sp, sp, #48
        str     q0, [sp, 32]
        str     q1, [sp, 16]
        str     q2, [sp]
        ldr     q30, [sp, 32]
        ldr     q31, [sp, 16]
        eor     v30.16b, v30.16b, v31.16b
        ldr     q31, [sp]
        eor     v31.16b, v30.16b, v31.16b
        mov     v0.16b, v31.16b
        add     sp, sp, 48
        ret
```

instead of
```asm
xor3(__Uint8x16_t, __Uint8x16_t, __Uint8x16_t):
        sub     sp, sp, #96
        str     q0, [sp, 32]
        str     q1, [sp, 16]
        str     q2, [sp]
        ldr     q31, [sp, 32]
        str     q31, [sp, 80]
        ldr     q31, [sp, 16]
        str     q31, [sp, 64]
        ldr     q31, [sp]
        str     q31, [sp, 48]
        ldr     q31, [sp, 80]
        ldr     q29, [sp, 64]
        ldr     q30, [sp, 48]
        eor3    v31.16b, v31.16b, v29.16b, v30.16b
        nop
        mov     v0.16b, v31.16b
        add     sp, sp, 96
        ret
```

The two EORs get combined into EOR3 in the RTL combine pass, which doesn't run 
at `-O0`
> +++ gcc/testsuite/gcc.target/aarch64/sme/inlining_10.c
> @@ -51,7 +45,6 @@ void
>  sc_caller () [[arm::inout("za"), arm::streaming_compatible]]
>  {
>    call_vadd ();
> -  call_vbsl ();
bsl and all the other AdvSIMD functions are not compatible with straming mode. 
See CheckFPAdvSIMDEnabled in 
https://developer.arm.com/documentation/ddi0602/2026-03/Shared-Pseudocode/aarch64-exceptions-traps?lang=en#func_AArch64_CheckFPAdvSIMDEnabled_0

I think the patch should be amended to maintain the existing behaviour.
> +++ gcc/config.gcc
> @@ -365,0 +365,4 @@
> +             'arm_fp16.h'
> +             'arm_neon.h'
> +             'arm_bf16.h'
> +             'arm_acle.h'
This is not portable to /bin/sh (sorry!)
you could do something like `extra_objs="${extra_objs} aarch64-sve-builtins.o"` 
instead

> +++ gcc/config.gcc
> @@ -362,3 +362,3 @@
>  aarch64*-*-*)
>       cpu_type=aarch64
> -     extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h 
> arm_sme.h arm_neon_sve_bridge.h arm_private_fp8.h arm_private_neon_types.h"
> +     extra_headers=(
See comment on previous patch. These should be /bin/sh compliant.

On 2026-05-14 18:23:18+00:00, Drea Pinski (pinskia) <[email protected]> 
requested changes to the code:
Move the config.gcc reformating to a seperate patch. I will let others decide 
if the reformating is ok though.
I am ok with it but others might not be.

> +++ gcc/config.gcc
> @@ -365,0 +370,4 @@
> +     extra_headers="${extra_headers} arm_sme.h"
> +     extra_headers="${extra_headers} arm_neon_sve_bridge.h"
> +     extra_headers="${extra_headers} arm_private_fp8.h"
> +     extra_headers="${extra_headers} arm_private_neon_types.h"
Can you do reformating this as the first patch?
> +++ gcc/config.gcc
> @@ -370,0 +395,4 @@
> +     extra_objs="${extra_objs} aarch64-narrow-gp-writes.o"
> +     extra_objs="${extra_objs} aarch64-neon-builtins.o"
> +     extra_objs="${extra_objs} aarch64-neon-builtins-base.o"
> +     extra_objs="${extra_objs} aarch64-neon-builtins-shapes.o"
Likewise.
> +++ gcc/config.gcc
> @@ -370,0 +404,4 @@
> +     target_gtfiles="${target_gtfiles} 
> \$(srcdir)/config/aarch64/aarch64-acle-builtins.h"
> +     target_gtfiles="${target_gtfiles} 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> +     target_gtfiles="${target_gtfiles} 
> \$(srcdir)/config/aarch64/aarch64-neon-builtins.cc"
> +     target_gtfiles="${target_gtfiles} 
> \$(srcdir)/config/aarch64/aarch64-neon-builtins.h"
Likewise.

On 2026-05-27 15:00:15+00:00, Karl Meakin (karmea01) <[email protected]> 
commented on the code:


> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +328,4 @@
> +public:
> +  constexpr gimple_not_rhs (tree_code code)
> +    : m_code (code)
> +    {}
An IFN for bitreverse over vectors may be coming soon. Keep an eye on 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481


On 2026-06-01 18:05:35+00:00, Karl Meakin (karmea01) wrote:

I've removed the RFC tag; I think its good enough to merge now. Ok to merge?

On 2026-06-02 08:35:16+00:00, Claudio Bantaloukas (rdfm) <[email protected]> 
approved the changes:
LGTM apart from a small formatting nit. But I'm not a maintainer.
Thank you for addressing concerns.

> +++ gcc/config/aarch64/aarch64-builtins.cc

Spurious spacing?


On 2026-06-08 14:19:35+00:00, Karl Meakin (karmea01) wrote:

@pinskia ping?

On 2026-06-17 13:05:57+00:00, Kyrill Tkachov (ktkachov) <[email protected]> 
commented on the code:
Thanks, this is a long-awaited transition. Some comments I've found

> +++ gcc/config/aarch64/aarch64-neon-builtins-base.def
> @@ -0,0 +71,4 @@
> +#undef REQUIRED_EXTENSIONS
> +
> +// Lanewise arithmetic (FP16)
> +#define REQUIRED_EXTENSIONS nonstreaming_only (AARCH64_FL_F16)
I think this needs to be AARCH64_FL_SIMD | AARCH64_FL_F16
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +49,4 @@
> +/* Build a cast expression, `(TYPE)EXPR`, if necessary to make an expression
> +   with type TYPE.  */
> +tree
> +build_cast (tree type, tree expr)
This and other functions here have quite generic names and sit at global scope 
with external linkage. I think these should be wrapped in the aarch64_acle 
namespace or something
> +++ gcc/config/aarch64/t-aarch64
> @@ -71,0 +94,4 @@
> +     $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +             $(srcdir)/config/aarch64/aarch64-neon-builtins-base.cc
> +
> +aarch64-acle-builtins.o: $(srcdir)/config/aarch64/aarch64-acle-builtins.cc \
I think this rule should also include the new *neon*.def/.h files, as well as 
gimple-fold.h
> +++ gcc/config/aarch64/aarch64-builtins.cc
> @@ -1974,3 +1731,3 @@
>    aarch64_init_simd_builtin_functions (false);
>    if (in_lto_p)
> -    handle_arm_neon_h ();
> +    init_arm_neon_builtins ();
Do we get in a situation where init_arm_neon_builtins is called twice? Does 
this need to take to take arm_neon_h_handled into account?
> +++ gcc/config/aarch64/aarch64-builtins.cc
> @@ -1974,3 +1731,3 @@
>    aarch64_init_simd_builtin_functions (false);
>    if (in_lto_p)
> -    handle_arm_neon_h ();
> +    init_arm_neon_builtins ();
No, it is called either from `handle_arm_neon_h` (when encountering `#pragma 
GCC aarch64 "arm_neon.h"`) or from `aarch64_init_simd_builtins` (when 
initialising LTO). So it is called in two separate places, but will not be 
called in both places in the same compilation.

`handle_arm_neon_h` checks for `arm_neon_h_handled`

On 2026-06-23 07:10:40+00:00, Tamar Christina (tnfchris) <[email protected]> 
requested changes to the code:
Looks pretty good, just some minor changes. 

> +++ gcc/testsuite/gcc.target/aarch64/sme/inlining_10.c
> @@ -21,1 +20,3 @@
> -call_vbsl () // { dg-error "inlining failed" }
> +// Gets expanded to bitwise select early, so no error.  An error would be
> +// more correct though.
> +inline void __attribute__ ((always_inline))
I think BSL was chosen here not because we wanted to test `bsl` itself but 
because we wanted to test the inlining behavior of a non-lowered intrinsics.

Now that you lower `bsl` instead of removing the error you should pick another 
intrinsics that isn't lowered.  Otherwise we both checks here doesn't check 
inlining errors.

Same with the below.
> +++ gcc/testsuite/gcc.target/aarch64/neon/aarch64-neon.exp
> @@ -0,0 +33,4 @@
> +
> +# Main loop.
> +dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*\[cCs\]]] \
> +     " -ansi -pedantic-errors -std=c23 -O3 -march=armv8-a+simd" ""
I think we just want -O2 here, since that's the standard compile flag for most. 
Or even -O1 since intrinsics shouldn't rely on optimizations to produce the 
result stated by ACLE.
> +++ gcc/config/aarch64/aarch64-neon-builtins-shapes.cc
> @@ -0,0 +108,4 @@
> +  check_fn_t m_check_fn1;
> +  check_fn_t m_check_fn2;
> +
> +  void build (function_builder &b,
In the removed code above the comment said
```
    case UNSPEC_VEC_COPY:               
      /* & rather than && so that we report errors against both indices.  */    
        
      return (require_immediate_lane_index (1, 0)               
              & require_immediate_lane_index (3, 2));
```
and now we only report the first error.  I think we should restore that 
behavior, so just use `&` instead of `&&` here to report all errors at the same 
time.
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +302,4 @@
> +};
> +
> +struct gimple_dup_lane : public gimple_function_base
> +{
Doesn't this also need
```
    if (optimize == 0)
      return nullptr;
```

> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +302,4 @@
> +};
> +
> +struct gimple_dup_lane : public gimple_function_base
> +{
No, we only emit RTL for SHA3 intrinsics because we want to ensure they emit a 
single instruction even at `-O0`
> +++ gcc/config/aarch64/aarch64-simd.md
> @@ -9854,3 +9855,4 @@
>    [(set_attr "type" "crypto_sha3")]
>  )
>  
> +(define_insn "aarch64_rax1qv2di"
this one isn't needed, just make the `*aarch64_rax1qv2di` above not anonymous.

On 2026-06-29 19:27:31+00:00, Tamar Christina (tnfchris) <[email protected]> 
approved the changes:
Thanks! Lets get this in.



On 2026-06-30 14:48:21+00:00, Karl Meakin (karmea01) wrote:

Merged

[SUMMARY] Re: [PATCH v11 0/8] aarch64: port NEON intrinsics to pragma-based framework

Reply via email to