> -----Original Message-----
> From: Andrew Pinski <[email protected]>
> Sent: 27 October 2025 21:30
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> <[email protected]>; [email protected]; Alex Coplan
> <[email protected]>; Wilco Dijkstra <[email protected]>; Alice
> Carlotti <[email protected]>
> Subject: Re: [PATCH]AArch64: support bf16 to sf extensions [PR121853]
>
> On Mon, Oct 27, 2025 at 2:11 PM Tamar Christina
> <[email protected]> wrote:
> >
> > > -----Original Message-----
> > > From: Andrew Pinski <[email protected]>
> > > Sent: 27 October 2025 21:00
> > > To: Tamar Christina <[email protected]>
> > > Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> > > <[email protected]>; [email protected]; Alex Coplan
> > > <[email protected]>; Wilco Dijkstra <[email protected]>; Alice
> > > Carlotti <[email protected]>
> > > Subject: Re: [PATCH]AArch64: support bf16 to sf extensions [PR121853]
> > >
> > > On Mon, Oct 27, 2025 at 1:22 PM Tamar Christina
> > > <[email protected]> wrote:
> > > >
> > > > It looks like during the upstreaming of BF16 we didn't implement the
> > > > extend optab for it.
> > > >
> > > > As a result we go through soft-float emulation which results in massive
> > > > performance drop in projects using BF16.
> > > >
> > > > As an example, for
> > > >
> > > > float convert(__bf16 value) {
> > > >   return (float)value;
> > > > }
> > > >
> > > > we generate:
> > > >
> > > > convert(__bf16):
> > > >         stp     x29, x30, [sp, -16]!
> > > >         mov     x29, sp
> > > >         bl      __extendbfsf2
> > > >         ldp     x29, x30, [sp], 16
> > > >         ret
> > > >
> > > > and after this patch
> > > >
> > > > convert:
> > > >         movi    v31.4s, 0
> > > >         ext     v0.16b, v31.16b, v0.16b, #14
> > > >         ret
> > > >
> > > > We generate an EXT with a MOVI because it has the same latency as a shift
> > > > but twice the throughput.  The zero vector has zero latency, so in real
> > > > workloads this codegen is much better than using shifts.
> > > >
> > > > As a reminder, BF16 -> FP32 is just shifting left 16 bits.
> > > >
> > > > The expand pattern has to rely on generating multiple subregs due to a
> > > > restriction that subregs can't change floating-point size and type at the
> > > > same time.
> > > >
> > > > I've tried alternative approaches like using the EXT in SF mode, but the
> > > > paradoxical subreg of BF -> SF isn't allowed, and using an extend doesn't
> > > > work because extend is what we're defining.
> > > >
> > > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > > >
> > > > Ok for master?
> > >
> > > There is generic code in expr.cc that should have handled this:
> > >   /* If we don't expect sNaNs, for BFmode -> SFmode we can just
> > >      shift the bits up.  */
> > >
> > > Why is this not being used here?
> > >
> >
> > Because GCC honors NaNs by default, and IEEE semantics require signaling
> > NaNs to be quieted on conversions.  However, BF16 conversions don't require
> > this, as BF16 is not an IEEE standard format.
> >
> > So the generic code has no bearing here.
>
> I am still confused, as the generic code is specifically for BF16->SF
> rather than anything else.
> Is the problem that the generic BF16->SF code should be using HONOR_SNANS
> rather than HONOR_NANS?
>
> The non-target specific code does:
> ```
>   if (REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
>       && REAL_MODE_FORMAT (SFmode) == &ieee_single_format)
>     {
>       ...
>       if (to_mode == SFmode
>           && !HONOR_NANS (from_mode)
>           && !HONOR_NANS (to_mode)
>           && optimize_insn_for_speed_p ())
>         {
>           /* If we don't expect sNaNs, for BFmode -> SFmode we can just
>              shift the bits up.  */
> ```
> I am asking because the comment talks about sNaNs but the check is
> HONOR_NANS, which seems to be a disconnect.
> Plus, now you are adding an optab which does exactly what the generic
> BFmode->SFmode code does for aarch64 when ignoring NaNs.
>
> The commit message from r13-3292-gc2565a31c1622a says the following:
> ```
> expr.cc then contains a -ffast-math optimization of the BF -> SF and
> SF -> BF conversions if we don't optimize for space (and for the latter
> if -frounding-math isn't enabled either).
> ```
> I can understand SF->BF having issues with respect to NaNs and such,
> but BF->SF should only have issues with sNaNs.
> GCC's default for sNaNs is to ignore that they exist (that is,
> flag_signaling_nans is 0 by default).
That commit adds support for std::bfloat16_t, and I don't know what
semantics that type has.  The Arm format is a storage-only format and doesn't
support arithmetic aside from defined instructions like matrix multiplication.

In addition, the generic code converts the floating-point value to an integer
mode to perform the shift.  This results in two subregs that must be matched
in a combine pattern, where it's more difficult to create the zero for the
extraction.

So even if the generic code were working, it would always be suboptimal for
Arm, so the extend optab would still be needed.  Especially inside loops,
combine is too late to lift out the zero.

Again, we want the optab: we don't want a shift here, we want a vector extract.
Thanks,
Tamar
>
> Thanks,
> Andrew
>
>
> >
> > Thanks,
> > Tamar
> >
> > > Also this bit is a bit odd:
> > >       && optimize_insn_for_speed_p ())
> > > since using a shift will always be smaller than a call.
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > PR target/121853
> > > > * config/aarch64/aarch64-simd.md (extendbfsf2): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > PR target/121853
> > > > * gcc.target/aarch64/pr121853_1.c: New test.
> > > > * gcc.target/aarch64/pr121853_2.c: New test.
> > > >
> > > > ---
> > > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > > > index a121a18f9a09bc0670a6e71ad7eb2b442eb85bf8..243d413551ddd6eb7ff77ba14c7a5cb55706ab18 100644
> > > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > > @@ -3223,6 +3223,7 @@ (define_expand "vec_unpacks_hi_<mode>"
> > > > DONE;
> > > > }
> > > > )
> > > > +
> > > > (define_insn "extend<mode><Vwide>2"
> > > > [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> > > > (float_extend:<VWIDE>
> > > > @@ -3232,6 +3233,25 @@ (define_insn "extend<mode><Vwide>2"
> > > > [(set_attr "type" "neon_fp_cvt_widen_s")]
> > > > )
> > > >
> > > > +(define_expand "extendbfsf2"
> > > > +  [(set (match_operand:SF 0 "register_operand" "=w")
> > > > +	(float_extend:SF
> > > > +	  (match_operand:BF 1 "register_operand" "w")))]
> > > > +  "TARGET_SIMD"
> > > > +{
> > > > +  rtx tmp0 = aarch64_gen_shareable_zero (V8BFmode);
> > > > +  rtx op0 = force_lowpart_subreg (V8BFmode, operands[1], BFmode);
> > > > +  rtx res = gen_reg_rtx (V8BFmode);
> > > > +  emit_insn (gen_aarch64_extv8bf (res, tmp0, op0, gen_int_mode (7, SImode)));
> > > > +  /* Subregs between floating point modes aren't allowed to change size,
> > > > +     so go through V4SFmode.  */
> > > > +  res = force_lowpart_subreg (V4SFmode, res, V8BFmode);
> > > > +  res = force_lowpart_subreg (SFmode, res, V4SFmode);
> > > > +  emit_move_insn (operands[0], res);
> > > > +  DONE;
> > > > +})
> > > > +
> > > > +
> > > > ;; Float narrowing operations.
> > > >
> > > > (define_insn "aarch64_float_trunc_rodd_df"
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr121853_1.c b/gcc/testsuite/gcc.target/aarch64/pr121853_1.c
> > > > new file mode 100644
> > > > index 0000000000000000000000000000000000000000..24b2fdc8cea1d0610812665e8faafa34f57b90b2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/pr121853_1.c
> > > > @@ -0,0 +1,64 @@
> > > > +/* { dg-do run } */
> > > > +/* { dg-additional-options "-O2 -std=c99" } */
> > > > +
> > > > +#include <stdint.h>
> > > > +#include <stdio.h>
> > > > +#include <string.h>
> > > > +
> > > > +__attribute__ ((noipa))
> > > > +float convert(__bf16 value) {
> > > > + return (float)value;
> > > > +}
> > > > +
> > > > +static inline uint32_t f32_bits(float f) {
> > > > + uint32_t u; memcpy(&u, &f, sizeof u); return u;
> > > > +}
> > > > +static inline __bf16 bf16_from_bits(uint16_t u) {
> > > > + __bf16 b; memcpy(&b, &u, sizeof b); return b;
> > > > +}
> > > > +
> > > > +/* Fixed bf16 inputs (as raw 16-bit payloads) covering edge cases. */
> > > > +static const uint16_t inputs[] = {
> > > > + 0x0000, // +0
> > > > + 0x8000, // -0
> > > > + 0x7F80, // +inf
> > > > + 0xFF80, // -inf
> > > > + 0x7FC0, // qNaN (+) (quiet bit set in bf16)
> > > > + 0xFFC0, // qNaN (-)
> > > > + 0x7F01, // sNaN (+) (will be quieted by conversion)
> > > > + 0xFF01, // sNaN (-)
> > > > + 0x0001, // smallest +subnormal
> > > > + 0x007F, // largest +subnormal
> > > > + 0x8001, // smallest -subnormal
> > > > + 0x807F, // largest -subnormal
> > > > + 0x0080, // smallest +normal
> > > > + 0x3F80, // +1.0
> > > > + 0xBF80, // -1.0
> > > > + 0x3F00, // +0.5
> > > > + 0xBF00, // -0.5
> > > > + 0x3FC0, // +1.5
> > > > + 0x7F7F, // max finite +
> > > > + 0xFF7F, // max finite -
> > > > +};
> > > > +
> > > > +int main(void) {
> > > > + const size_t N = sizeof(inputs)/sizeof(inputs[0]);
> > > > + size_t fails = 0;
> > > > +
> > > > +  for (size_t i = 0; i < N; ++i) {
> > > > +    __bf16 in = bf16_from_bits(inputs[i]);
> > > > +    float out = convert(in);
> > > > +    uint32_t got = f32_bits(out);
> > > > +    uint32_t exp = inputs[i] << 16;
> > > > +
> > > > +    if (got != exp) {
> > > > +      printf("FAIL[%zu]: in_bf16=0x%04X exp_f32=0x%08X got_f32=0x%08X\n",
> > > > +             i, inputs[i], exp, got);
> > > > +      ++fails;
> > > > +    }
> > > > +  }
> > > > +
> > > > + if (fails != 0)
> > > > + __builtin_abort ();
> > > > +}
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr121853_2.c b/gcc/testsuite/gcc.target/aarch64/pr121853_2.c
> > > > new file mode 100644
> > > > index 0000000000000000000000000000000000000000..e9fb4016bcb7cb92b4d132da5f6fc07eeebd248a
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/pr121853_2.c
> > > > @@ -0,0 +1,14 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "-O1" } */
> > > > +/* { dg-final { check-function-bodies "**" "" } } */
> > > > +
> > > > +float convert(__bf16 value) {
> > > > + return (float)value;
> > > > +}
> > > > +
> > > > +/*
> > > > +** convert:
> > > > +** movi v[0-9]+.4s, 0
> > > > +** ext v[0-9]+.16b, v[0-9]+.16b, v[0-9]+.16b, #14
> > > > +** ret
> > > > +*/
> > > >
> > > >
> > > > --