[Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2021-09-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Andrew Pinski  changed:

   What|Removed |Added

   Last reconfirmed|2019-10-14 00:00:00 |2021-9-4
   Severity|normal  |enhancement

--- Comment #5 from Andrew Pinski  ---
This gives good code:
#include 

__m512i sinkz;
__m256i sinky;
void foo(char c) {
__m512i a = _mm512_set1_epi8(c);
sinkz = a;
sinky = *((__m256i*));
}

[Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #4 from rguenther at suse dot de  ---
On Mon, 14 Oct 2019, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
> 
> --- Comment #3 from rguenther at suse dot de  ---
> On Mon, 14 Oct 2019, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
> > 
> > Jakub Jelinek  changed:
> > 
> >What|Removed |Added
> > 
> >  CC||jakub at gcc dot gnu.org
> > 
> > --- Comment #2 from Jakub Jelinek  ---
> > Yeah, it isn't e.g. something RTL CSE would naturally do, because there is 
> > no
> > common subexpression, this needs to know that a narrower broadcast is a 
> > part of
> > a wider broadcast of the same argument and know how to replace that with a
> > backend instruction that takes the low bits from it (while it actually 
> > usually
> > expands to no code, at least before RA it needs to be expressed some way 
> > and is
> > very backend specific, we don't allow a vector mode to vector mode subreg 
> > with
> > different size).  So the only place to deal with this in RTL would be some
> > backend specific pass I'm afraid.
> 
> So what RTL CSE would need to do is when seeing
> 
>  (set reg:VNQI ...)
> 
> know (via a target hook?) which subregs can be accessed at zero-cost
> and register the apropriate smaller vector sets with a subreg value.
> That probably makes sense only after reload to not constrain RA
> too much.  It could be restricted to vec_duplicate since there
> it's easy to derive the lowpart expression to register.

Or IRA/LRA rematerialization / inheritance could be teached to do this.

[Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #3 from rguenther at suse dot de  ---
On Mon, 14 Oct 2019, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||jakub at gcc dot gnu.org
> 
> --- Comment #2 from Jakub Jelinek  ---
> Yeah, it isn't e.g. something RTL CSE would naturally do, because there is no
> common subexpression, this needs to know that a narrower broadcast is a part 
> of
> a wider broadcast of the same argument and know how to replace that with a
> backend instruction that takes the low bits from it (while it actually usually
> expands to no code, at least before RA it needs to be expressed some way and 
> is
> very backend specific, we don't allow a vector mode to vector mode subreg with
> different size).  So the only place to deal with this in RTL would be some
> backend specific pass I'm afraid.

So what RTL CSE would need to do is when seeing

 (set reg:VNQI ...)

know (via a target hook?) which subregs can be accessed at zero-cost
and register the apropriate smaller vector sets with a subreg value.
That probably makes sense only after reload to not constrain RA
too much.  It could be restricted to vec_duplicate since there
it's easy to derive the lowpart expression to register.

[Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek  ---
Yeah, it isn't e.g. something RTL CSE would naturally do, because there is no
common subexpression, this needs to know that a narrower broadcast is a part of
a wider broadcast of the same argument and know how to replace that with a
backend instruction that takes the low bits from it (while it actually usually
expands to no code, at least before RA it needs to be expressed some way and is
very backend specific, we don't allow a vector mode to vector mode subreg with
different size).  So the only place to deal with this in RTL would be some
backend specific pass I'm afraid.

[Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-10-14
 CC||rguenth at gcc dot gnu.org
  Component|tree-optimization   |middle-end
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Interestingly enough with just -mavx512f we get

vmovd   %edi, %xmm0
vpbroadcastb%xmm0, %ymm0
vinserti64x4$0x1, %ymm0, %zmm0, %zmm1
vmovdqa %ymm0, sinky(%rip)
vmovdqa64   %zmm1, sinkz(%rip)

the GIMPLE we expand from is

  _7 = {c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D)};
  _8 = VIEW_CONVERT_EXPR<__m512i>(_7);
  sinkz = _8;
  _3 = {c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D),
c_1(D), c_1(D), c_1(D)};
  _6 = VIEW_CONVERT_EXPR<__m256i>(_3);
  sinky = _6;

where we could replace _6 with a BIT_FIELD_REF but it will be a quite
costly thing to do in general.  Our representation for the splats isn't
too nice either...

So without avx512bw we seem miss the splat on V64QI and do a V32QI splat
plus a concat.  On the RTL side optimizing this isn't any less awkward
than on GIMPLE I guess.