On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubiz...@gmail.com> wrote:
> >>
> >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> >> <richard.guent...@gmail.com> wrote:
> >> >
> >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> >> > >
> >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> >> > > <richard.guent...@gmail.com> wrote:
> >> > > >
> >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> >> > > > <gcc-patches@gcc.gnu.org> wrote:
> >> > > > >
> >> > > > > As shown in the PR, simplify_gen_subreg call in 
> >> > > > > simplify_replace_fn_rtx:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 469           if (code == SUBREG)
> >> > > > > 470             {
> >> > > > > 471               op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> >> > > > > old_rtx, fn, data);
> >> > > > > 472               if (op0 == SUBREG_REG (x))
> >> > > > > 473                 return x;
> >> > > > > 474               op0 = simplify_gen_subreg (GET_MODE (x), op0,
> >> > > > > 475                                          GET_MODE (SUBREG_REG 
> >> > > > > (x)),
> >> > > > > 476                                          SUBREG_BYTE (x));
> >> > > > > 477               return op0 ? op0 : x;
> >> > > > > 478             }
> >> > > > >
> >> > > > > simplifies with following arguments:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V4QI [
> >> > > > >         (const_int -52 [0xffffffffffffffcc]) repeated x4
> >> > > > >     ])
> >> > > > > (gdb) p debug_rtx (x)
> >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> >> > > > >
> >> > > > > to:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V16QI [
> >> > > > >         (const_int -52 [0xffffffffffffffcc]) repeated x16
> >> > > > >     ])
> >> > > > >
> >> > > > > This simplification is invalid, it is not possible to get 
> >> > > > > V16QImode vector
> >> > > > > from V4QImode vector, even when all elements are duplicates.
> >> >
> >> > ^^^
> >> >
> >> > I think this simplification is valid.  A simplification to
> >> >
> >> > (const_vector:V16QI [
> >> >          (const_int -52 [0xffffffffffffffcc]) repeated x4
> >> >          (const_int 0 [0]) repeated x12
> >> >      ])
> >> >
> >> > would be valid as well.
> >> >
> >> > > > > The simplification happens in simplify_context::simplify_subreg:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 7558          if (VECTOR_MODE_P (outermode)
> >> > > > > 7559              && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> >> > > > > (innermode)
> >> > > > > 7560              && vec_duplicate_p (op, &elt))
> >> > > > > 7561            return gen_vec_duplicate (outermode, elt);
> >> > > > >
> >> > > > > but the above simplification is valid only for non-paradoxical 
> >> > > > > registers,
> >> > > > > where outermode <= innermode.  We should not assume that elements 
> >> > > > > outside
> >> > > > > the original register are valid, let alone all duplicates.
> >> > > >
> >> > > > Hmm, but looking at the audit trail the x86 backend expects them to 
> >> > > > be zero?
> >> > > > Isn't that wrong as well?
> >> > >
> >> > > If you mean Comment #10, it is just an observation that
> >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >> > >
> >> > > (gdb) p debug_rtx (src)
> >> > > (const_vector:V8HI [
> >> > >         (const_int 204 [0xcc]) repeated x4
> >> > >         (const_int 0 [0]) repeated x4
> >> > >     ])
> >> > >
> >> > > instead of:
> >> > >
> >> > > (gdb) p debug_rtx (src)
> >> > > (const_vector:V8HI [
> >> > >         (const_int 204 [0xcc]) repeated x8
> >> > >     ])
> >> > >
> >> > > which is in line with the statement below.
> >> > > >
> >> > > > That is, I think putting any random value into the upper lanes when
> >> > > > constant folding
> >> > > > a paradoxical subreg sounds OK to me, no?
> >> > >
> >> > > The compiler is putting zero there as can be seen from the above new 
> >> > > RTX.
> >> > >
> >> > > > Of course we might choose to not do such constant propagation for
> >> > > > efficiency reason - at least
> >> > > > when the resulting CONST_* would require a larger constant pool entry
> >> > > > or more costly
> >> > > > construction.
> >> > >
> >> > > This is probably a follow-up improvement, where this patch tries to
> >> > > fix a specific invalid simplification of simplify_replace_rtx that is
> >> > > invalid universally.
> >> >
> >> > How so?  What specifies the values of the paradoxical subreg for the
> >> > bytes not covered by the subreg operand?
> >>
> >> I don't know why 0 is generated here (and if it is valid) for
> >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> >> the wrong constant and triggers unwanted propagation later on.
> >
> > Quoting what I wrote in the PR below.  I think pragmatically the fix is
> > good - we might miss some opportunistic folding this way but we for
> > sure may not optimistically register an equality via REG_EQUAL without
> > enforcing it (removing the producer and replacing it with the optimistic
> > constant).
> >
> > So consider the patch approved if no other RTL maintainer chimes in
> > within 48h.
>
> Sorry, can you hold off a bit longer?  Wanted to have a look but the
> deadline is about to expire.

No problem, I will wait for you.

Thanks,
Uros.

>
> I think at least a comment is needed, since like Richard says,
> the transformation seems correct for paradoxical subregs, even if it
> isn't wanted for other reasons.
>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Richard.
> >
> >
> > I can see cprop1 adds the REG_EQUAL note:
> >
> > (insn 22 21 23 4 (set (reg:V8HI 100)
> >         (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
> >                 (parallel [
> >                         (const_int 0 [0])
> >                         (const_int 1 [0x1])
> >                         (const_int 2 [0x2])
> >                         (const_int 3 [0x3])
> >                         (const_int 4 [0x4])
> >                         (const_int 5 [0x5])
> >                          (const_int 6 [0x6])
> >                          (const_int 7 [0x7])
> >                      ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
> > -     (expr_list:REG_DEAD (reg:V4QI 98)
> > -        (nil)))
> > +     (expr_list:REG_EQUAL (const_vector:V8HI [
> > +                (const_int 204 [0xcc]) repeated x8
> > +            ])
> > +        (expr_list:REG_DEAD (reg:V4QI 98)
> > +            (nil))))
> >
> > but I don't see yet what the actual wrong transform based on this
> > REG_EQUAL note is?
> >
> > It looks like we CSE the above with
> >
> > -   46: r122:V8QI=[`*.LC3']
> > -      REG_EQUAL const_vector
> > -   48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
> > -      REG_EQUAL const_vector
> > -      REG_DEAD r122:V8QI
> > -   49: r126:V8HI=r124:V8HI*r125:V8HI
> > -      REG_DEAD r125:V8HI
> > +   49: r126:V8HI=r124:V8HI*r100:V8HI
> >
> > but otherwise do nothing.  So the issue is that we rely on the "undefined"
> > vals to have a specific value (from the earlier REG_EQUAL note) but actual
> > code generation doesn't ensure this (it doesn't need to).  That said,
> > the issue isn't the constant folding per-se but that we do not actually
> > constant fold but register an equality that doesn't hold.
> >
> >
> >
> >> Uros.

Reply via email to