`x` is store_bit_field's `str_rtx`, the V4x4BF register. You can see ASF's dumps in this comment in the PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125988#c6
`x` is `(reg:V4x4BF 60 v28)` in this case. Konstantinos. On Tue, Jun 30, 2026 at 3:36 PM Richard Biener <[email protected]> wrote: > On Tue, Jun 30, 2026 at 2:20 PM Konstantinos Eleftheriou > <[email protected]> wrote: > > > > The call chain for our testcase is the following: > > gen_lowpart -> gen_lowpart_general -> gen_lowpart_common -> > lowpart_subreg -> simplify_gen_subreg -> validate_subreg. > > > > `validate_subreg` for OI and V4x4BF modes fails, leading to this > fallback in `gen_lowpart_general`: > > > > /* Handle SUBREGs and hard REGs that were rejected by > > simplify_gen_subreg. */ > > else if (REG_P (x) || GET_CODE (x) == SUBREG) > > { > > result = gen_lowpart_common (mode, copy_to_reg (x)); > > gcc_assert (result != 0); > > return result; > > } > > > > `copy_to_reg (x)` creates the new reg. > > And what's 'x' here? That said, either gen_lowpart () isn't supposed > to be used in > lvalue context or there's sth else wrong around store_bit_field > (not to say, OI and V4x4BF? really?) > > Richard. > > > > > > > > > > > On Tue, Jun 30, 2026 at 1:57 PM Richard Biener < > [email protected]> wrote: > >> > >> On Tue, Jun 30, 2026 at 11:22 AM Konstantinos Eleftheriou > >> <[email protected]> wrote: > >> > > >> > Hi Richard, > >> > > >> > We've been sending this as part of the ASF default enablement > patchset for nearly a year now. > >> > We are splitting it now as we judged that it requires separate > attention. > >> > > >> > The idea is to overwrite the original to-be-updated register -- > stripping the subregs to extract it -- > >> > with the newly generated one. Otherwise, str_rtx is left unchanged. > >> > The copy is done by `emit_mov_insn`. > >> > > >> > Any ideas on how we could handle this better? It definitely looks > like the right place for > >> > the fix though. > >> > >> I don't think changing the destination was intended when doing > >> > >> op0 = gen_lowpart (int_mode_for_mode (GET_MODE (op0), op0); > >> > >> but instead a subreg was intended (and it's full-size). I wasn't > >> aware that gen_lowpart > >> eventually creates a new reg -- when would it do that? > >> > >> > > >> > Thanks, > >> > Konstantinos > >> > > >> > On Tue, Jun 30, 2026 at 10:22 AM Richard Biener < > [email protected]> wrote: > >> >> > >> >> On Mon, Jun 29, 2026 at 4:13 PM Konstantinos Eleftheriou > >> >> <[email protected]> wrote: > >> >> > > >> >> > The call to `gen_lowpart` in `store_bit_field_1` might copy the > destination > >> >> > register into a new one, which may lead to wrong code generation, > as the bit > >> >> > insertions update the new register instead of updating `str_rtx`. > >> >> > > >> >> > This patch copies back the new destination register into `str_rtx` > when needed. > >> >> > > >> >> > Bootstrapped/regtested on AArch64 and x86-64. > >> >> > > >> >> > PR rtl-optimization/125988 > >> >> > > >> >> > gcc/ChangeLog: > >> >> > > >> >> > * expmed.cc (store_bit_field_1): Copy back the new > destination > >> >> > register into `str_rtx` when needed. > >> >> > > >> >> > gcc/testsuite/ChangeLog: > >> >> > > >> >> > * gcc.target/aarch64/pr125988.c: New test. > >> >> > --- > >> >> > gcc/expmed.cc | 22 +++++++-- > >> >> > gcc/testsuite/gcc.target/aarch64/pr125988.c | 51 > +++++++++++++++++++++ > >> >> > 2 files changed, 70 insertions(+), 3 deletions(-) > >> >> > create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125988.c > >> >> > > >> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc > >> >> > index da1b5b632876..1f4611a6ed89 100644 > >> >> > --- a/gcc/expmed.cc > >> >> > +++ b/gcc/expmed.cc > >> >> > @@ -888,9 +888,25 @@ store_bit_field_1 (rtx str_rtx, poly_uint64 > bitsize, poly_uint64 bitnum, > >> >> > op0 = gen_lowpart (op0_mode.require (), op0); > >> >> > } > >> >> > > >> >> > - return store_integral_bit_field (op0, op0_mode, ibitsize, > ibitnum, > >> >> > - bitregion_start, bitregion_end, > >> >> > - fieldmode, value, reverse, > fallback_p); > >> >> > + if (!store_integral_bit_field (op0, op0_mode, ibitsize, ibitnum, > >> >> > + bitregion_start, bitregion_end, > >> >> > + fieldmode, value, reverse, > fallback_p)) > >> >> > + return false; > >> >> > + > >> >> > + rtx op0_reg = op0; > >> >> > + rtx str_rtx_reg = str_rtx; > >> >> > + while (SUBREG_P (op0_reg)) > >> >> > + op0_reg = SUBREG_REG (op0_reg); > >> >> > + while (SUBREG_P (str_rtx_reg)) > >> >> > + str_rtx_reg = SUBREG_REG (str_rtx_reg); > >> >> > >> >> That looks definitely fishy. > >> >> > >> >> You do not quote the part that does the copy, but stripping all > subregs > >> >> and then copying looks wrong. It also looks this was produced by an > AI? > >> >> > >> >> > + > >> >> > + /* If a new destination register has been generated, copy the > value back > >> >> > + into str_rtx. */ > >> >> > + if (REG_P (op0_reg) && REG_P (str_rtx_reg) > >> >> > + && REGNO (op0_reg) != REGNO (str_rtx_reg)) > >> >> > + emit_move_insn (str_rtx_reg, op0_reg); > >> >> > + > >> >> > + return true; > >> >> > } > >> >> > > >> >> > /* Subroutine of store_bit_field_1, with the same arguments, > except > >> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr125988.c > b/gcc/testsuite/gcc.target/aarch64/pr125988.c > >> >> > new file mode 100644 > >> >> > index 000000000000..3ac7be9b7b99 > >> >> > --- /dev/null > >> >> > +++ b/gcc/testsuite/gcc.target/aarch64/pr125988.c > >> >> > @@ -0,0 +1,51 @@ > >> >> > +/* PR rtl-optimization/125988 */ > >> >> > +/* { dg-do run } */ > >> >> > +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */ > >> >> > +/* { dg-options "-O3 -favoid-store-forwarding" } */ > >> >> > +/* { dg-add-options arm_v8_2a_bf16_neon } */ > >> >> > + > >> >> > +/* Verify that the lane inserted by vld4_lane_bf16 survives > >> >> > + avoid-store-forwarding's bit-insert rewrite. */ > >> >> > + > >> >> > +#include <arm_neon.h> > >> >> > + > >> >> > +extern void abort (void); > >> >> > + > >> >> > +typedef union > >> >> > +{ > >> >> > + bfloat16_t bf; > >> >> > + unsigned short u; > >> >> > +} bf16_u; > >> >> > + > >> >> > +__attribute__((noipa)) static int > >> >> > +test (const bf16_u *data, const bf16_u *overwrite) > >> >> > +{ > >> >> > + bfloat16x4x4_t v; > >> >> > + bf16_u t[4]; > >> >> > + int i, j; > >> >> > + for (i = 0; i < 4; i++, data += 4) > >> >> > + v.val[i] = vld1_bf16 (&data->bf); > >> >> > + v = vld4_lane_bf16 (&overwrite->bf, v, 3); > >> >> > + while (--i >= 0) > >> >> > + { > >> >> > + vst1_bf16 (&t[0].bf, v.val[i]); > >> >> > + data -= 4; > >> >> > + for (j = 0; j < 4; j++) > >> >> > + if (t[j].u != (j == 3 ? overwrite[i].u : data[j].u)) > >> >> > + return 1; > >> >> > + } > >> >> > + return 0; > >> >> > +} > >> >> > + > >> >> > +int > >> >> > +main (void) > >> >> > +{ > >> >> > + bf16_u d[16]; > >> >> > + for (int i = 0; i < 16; i++) > >> >> > + d[i].u = 0x1000 + i; > >> >> > + bf16_u ov[4] = { {.u = 0xABCD}, {.u = 0x1234}, > >> >> > + {.u = 0xCAFE}, {.u = 0xBEEF} }; > >> >> > + if (test (d, ov)) > >> >> > + abort (); > >> >> > + return 0; > >> >> > +} > >> >> > -- > >> >> > 2.52.0 > >> >> > >
