https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125744
Bug ID: 125744
Summary: [arm] -mtune/-mcpu with -fno-fuse-ops-with-volatile-access allows for `add r3, sp, #6` when it should not
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: azoff at gcc dot gnu.org
Target Milestone: ---
In r16-5947-ga6c50ec2c6ebcb, the flag -ffuse-ops-with-volatile-access was
added. With this new flag, an extra optimization step is introduced to remove
unnecessary registers in a function body.
An example where this matters is gcc.target/arm/bfloat16_scalar_1_2.c, where
the function "stacktest1" produces two different outputs across the
permutations shown below. To simplify the output, I've removed everything below
the "stacktest1" function in the input.
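For readers without the testsuite at hand, the following is a minimal sketch of the kind of function that yields the store/load pair seen in the listings. It is an assumption, not a copy of the test file: the volatile local is inferred from the flag name and from the strh/ldrh pair on the stack slot, and `bf16_bits` is a hypothetical stand-in for `__bf16` so the sketch builds on any target.

```c
#include <stdint.h>

/* Hypothetical stand-in for __bf16 (assumption): a 16-bit container,
   so the sketch compiles without the Arm bf16 extension. */
typedef uint16_t bf16_bits;

/* Assumed shape of stacktest1: the volatile local forces one 16-bit
   store to the stack frame followed by one 16-bit load from it. */
bf16_bits stacktest1(bf16_bits a)
{
    volatile bf16_bits b = a;  /* corresponds to the strh in the listings */
    return b;                  /* corresponds to the ldrh in the listings */
}
```

With the real `__bf16` type and the options shown in the commands below, this store is the access whose address is either folded into `[sp, #6]` or computed separately into a scratch register.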
First, -mtune=cortex-m0 influences which instructions are used and,
regardless of -ffuse-ops-with-volatile-access or
-fno-fuse-ops-with-volatile-access, the same assembler is produced.
$ /build/r17-1372-g80b78b2504fba0/bin/arm-none-eabi-gcc -mthumb
-march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto
-fdiagnostics-plain-output -ansi -pedantic-errors -mcpu=unset
-march=armv8.2-a+bf16 -mfpu=auto -mfloat-abi=softfp -O3 -std=gnu90 -S -x c -o -
-fno-ident -ffuse-ops-with-volatile-access <(sed -ne '1,22p'
/build/gcc_src/gcc/testsuite/gcc.target/arm/bfloat16_scalar_1_2.c)
.arch armv8.2-a
.fpu neon-fp-armv8
.arch_extension bf16
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "63"
.text
.align 1
.p2align 2,,3
.global stacktest1
.syntax unified
.thumb
.thumb_func
.type stacktest1, %function
stacktest1:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
strh r0, [sp, #6] @ __bf16
ldrh r0, [sp, #6] @ __bf16
add sp, sp, #8
@ sp needed
bx lr
.size stacktest1, .-stacktest1
Switching over to -mcpu=cortex-m0 -ffuse-ops-with-volatile-access gives the
same result:
$ /build/r17-1372-g80b78b2504fba0/bin/arm-none-eabi-gcc -mthumb
-march=armv6s-m -mcpu=cortex-m0 -mfloat-abi=soft -mfpu=auto
-fdiagnostics-plain-output -ansi -pedantic-errors -mcpu=unset
-march=armv8.2-a+bf16 -mfpu=auto -mfloat-abi=softfp -O3 -std=gnu90 -S -x c -o -
-fno-ident -ffuse-ops-with-volatile-access <(sed -ne '1,22p'
/build/gcc_src/gcc/testsuite/gcc.target/arm/bfloat16_scalar_1_2.c)
.arch armv8.2-a
.fpu neon-fp-armv8
.arch_extension bf16
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "63"
.text
.align 1
.p2align 2,,3
.global stacktest1
.syntax unified
.thumb
.thumb_func
.type stacktest1, %function
stacktest1:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
strh r0, [sp, #6] @ __bf16
ldrh r0, [sp, #6] @ __bf16
add sp, sp, #8
@ sp needed
bx lr
.size stacktest1, .-stacktest1
But using -mcpu=cortex-m0 -fno-fuse-ops-with-volatile-access gives a different
outcome:
$ /build/r17-1372-g80b78b2504fba0/bin/arm-none-eabi-gcc -mthumb
-march=armv6s-m -mcpu=cortex-m0 -mfloat-abi=soft -mfpu=auto
-fdiagnostics-plain-output -ansi -pedantic-errors -mcpu=unset
-march=armv8.2-a+bf16 -mfpu=auto -mfloat-abi=softfp -O3 -std=gnu90 -S -x c -o -
-fno-ident -fno-fuse-ops-with-volatile-access <(sed -ne '1,22p'
/build/gcc_src/gcc/testsuite/gcc.target/arm/bfloat16_scalar_1_2.c)
.arch armv8.2-a
.fpu neon-fp-armv8
.arch_extension bf16
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "63"
.text
.align 1
.p2align 2,,3
.global stacktest1
.syntax unified
.thumb
.thumb_func
.type stacktest1, %function
stacktest1:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
add r3, sp, #6
strh r0, [r3] @ __bf16
ldrh r0, [sp, #6] @ __bf16
add sp, sp, #8
@ sp needed
bx lr
.size stacktest1, .-stacktest1
All four permutations (-mtune or -mcpu, each with and without the fuse flag)
should ideally give the same assembler: the one that stores directly to
[sp, #6] without using r3 and the extra add instruction.
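As a rough illustration of how the unwanted sequence could be detected in compiler output, here is a small sketch. The helper name `uses_scratch_sp_address` is hypothetical, and the match is an assumption about how the output is laid out: it only looks for an `add rN, sp, #imm` line and does not verify that the following store actually uses that register.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns true when any line of the assembly text
   computes a stack address into a general register, e.g.
   "add r3, sp, #6". Stack adjustments such as "add sp, sp, #8" do not
   match, because the destination there is sp rather than rN. */
static bool uses_scratch_sp_address(const char *asm_text)
{
    const char *line = asm_text;

    while (line != NULL && *line != '\0') {
        int reg, offset;

        /* Whitespace in the format matches spaces and tabs alike. */
        if (sscanf(line, " add r%d, sp, #%d", &reg, &offset) == 2)
            return true;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return false;
}
```

Applied to the listings above, such a check would be expected to flag only the -fno-fuse-ops-with-volatile-access output. In the GCC testsuite itself the equivalent check would more naturally be expressed with the existing assembler-scanning directives.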
This ticket is derived from the review comments in
https://gcc.gnu.org/pipermail/gcc-patches/2026-June/719922.html.
There are similar issues with the following test cases in trunk as of
r17-1372-g80b78b2504fba0:
* gcc.target/arm/bfloat16_scalar_1_2.c
* gcc.target/arm/bfloat16_scalar_2_2.c
* gcc.target/arm/bfloat16_scalar_3_2.c
* gcc.target/arm/bfloat16_simd_1_2.c
* gcc.target/arm/bfloat16_simd_2_2.c
* gcc.target/arm/bfloat16_simd_3_2.c
In GCC 15, the -ffuse-ops-with-volatile-access feature is not implemented (it
was added in r16-5947-ga6c50ec2c6ebcb), so the produced assembler is likely to
contain the add instruction.