Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-25 Thread Hongtao Liu
On Fri, Jan 26, 2024 at 3:03 AM Roger Sayle  wrote:
>
>
> Hi Hongtao,
> Many thanks for the review.  Here's a revised version of my patch
> that addresses (most of) the issues you've raised.  Firstly the
> handling of zero and all_ones in this function is mostly for
> completeness/documentation, these standard_sse_constant_p
> values are (currently/normally) handled elsewhere.  But I have
> added an "n_var == 0" optimization to ix86_expand_vector_init.
>
> As you've suggested I've added explicit TARGET_SSE2 tests where
> required, and for consistency I've also added support for AVX512's
> V16SImode.
>
> As you've predicted, the eventual goal is to move this after combine
> (or reload) using define_insn_and_split, but that requires a significant
> restructuring that should be done in steps.  This also interacts with
> a similar planned reorganization of TImode constant handling.  If
> all 128-bit (vector) constants are acceptable before combine, then
> STV has the freedom to chose V1TImode (and this broadcast
> functionality) to implement TImode operations on immediate
> constants.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline (in stage 1)?
Ok, thanks for handling this.
>
>
> 2024-01-25  Roger Sayle  
> Hongtao Liu  
>
> gcc/ChangeLog
> PR target/106060
> * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
> (struct ix86_vec_bcast_map_simode_t): New type for table below.
> (ix86_vec_bcast_map_simode): Table of SImode constants that may
> be efficiently synthesized by a ix86_vec_bcast_alg method.
> (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
> (ix86_vector_duplicate_simode_const): Efficiently synthesize
> V4SImode and V8SImode constants that duplicate special constants.
> (ix86_vector_duplicate_value): Attempt to synthesize "special"
> vector constants using ix86_vector_duplicate_simode_const.
> * config/i386/i386.cc (ix86_rtx_costs) : ABS of a
> vector integer mode costs with a single SSE instruction.
>
> gcc/testsuite/ChangeLog
> PR target/106060
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512fp16-3.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr101796-1.c: Likewise.
> * gcc.target/i386/pr106060-1.c: New test case.
> * gcc.target/i386/pr106060-2.c: Likewise.
> * gcc.target/i386/pr106060-3.c: Likewise.
> * gcc.target/i386/pr70314.c: Update test case.
> * gcc.target/i386/vect-shiftv4qi.c: Likewise.
> * gcc.target/i386/vect-shiftv8qi.c: Likewise.
>
>
> Roger
> --
>
> > -----Original Message-
> > From: Hongtao Liu 
> > Sent: 17 January 2024 03:13
> > To: Roger Sayle 
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak 
> > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant
> > materialization.
> >
> > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle 
> > wrote:
> > >
> > >
> > > I thought I'd just missed the bug fixing season of stage3, but there
> > > appears to a little latitude in early stage4 (for vector patches), so
> > > I'll post this now.
> > >
> > > This patch resolves PR target/106060 by providing efficient methods
> > > for materializing/synthesizing special "vector" constants on x86.
> > > Currently there are three methods of materializing a vector constant;
> > > the most general is to load a vector from the constant pool, secondly
> > "duplicated"
> > > constants can be synthesized by moving an integer between units and
> > > broadcasting (or shuffling it), and finally the special cases of the
> > > all-zeros vector and all-ones vectors can be loaded via a single SSE
> > > instruction.   This patch handles additional cases that can be synthesized
> > > in two instructions, loading an all-ones vector followed by another
> > > SSE instruction.  Following my recent patch for PR target/112992,
> > > there's conveniently a single place in i386-expand.cc where these
> > > special cases can be handled.
> > >
> > > Two examples are given in the original bugzilla PR for 106060.
> > >
> > > __m256i
> > > should_be_cmpeq_abs ()
> > > {
> > >   return _mm256_set1_epi8 (1);
> > > }
> > >
> > > is now generated (with -O3 -march=x

RE: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-25 Thread Roger Sayle

Hi Hongtao,
Many thanks for the review.  Here's a revised version of my patch
that addresses (most of) the issues you've raised.  Firstly the
handling of zero and all_ones in this function is mostly for 
completeness/documentation, these standard_sse_constant_p
values are (currently/normally) handled elsewhere.  But I have
added an "n_var == 0" optimization to ix86_expand_vector_init.

As you've suggested I've added explicit TARGET_SSE2 tests where
required, and for consistency I've also added support for AVX512's
V16SImode.

As you've predicted, the eventual goal is to move this after combine
(or reload) using define_insn_and_split, but that requires a significant
restructuring that should be done in steps.  This also interacts with
a similar planned reorganization of TImode constant handling.  If
all 128-bit (vector) constants are acceptable before combine, then
STV has the freedom to chose V1TImode (and this broadcast
functionality) to implement TImode operations on immediate
constants.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline (in stage 1)?


2024-01-25  Roger Sayle  
Hongtao Liu  

gcc/ChangeLog
PR target/106060
* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
(struct ix86_vec_bcast_map_simode_t): New type for table below.
(ix86_vec_bcast_map_simode): Table of SImode constants that may
be efficiently synthesized by a ix86_vec_bcast_alg method.
(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
(ix86_vector_duplicate_simode_const): Efficiently synthesize
V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special"
vector constants using ix86_vector_duplicate_simode_const.
* config/i386/i386.cc (ix86_rtx_costs) : ABS of a
vector integer mode costs with a single SSE instruction.

gcc/testsuite/ChangeLog
PR target/106060
* gcc.target/i386/auto-init-8.c: Update test case.
* gcc.target/i386/avx512fp16-3.c: Likewise.
* gcc.target/i386/pr100865-9a.c: Likewise.
* gcc.target/i386/pr101796-1.c: Likewise.
* gcc.target/i386/pr106060-1.c: New test case.
* gcc.target/i386/pr106060-2.c: Likewise.
* gcc.target/i386/pr106060-3.c: Likewise.
* gcc.target/i386/pr70314.c: Update test case.
* gcc.target/i386/vect-shiftv4qi.c: Likewise.
* gcc.target/i386/vect-shiftv8qi.c: Likewise.


Roger
--

> -Original Message-
> From: Hongtao Liu 
> Sent: 17 January 2024 03:13
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak 
> Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant
> materialization.
> 
> On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle 
> wrote:
> >
> >
> > I thought I'd just missed the bug fixing season of stage3, but there
> > appears to a little latitude in early stage4 (for vector patches), so
> > I'll post this now.
> >
> > This patch resolves PR target/106060 by providing efficient methods
> > for materializing/synthesizing special "vector" constants on x86.
> > Currently there are three methods of materializing a vector constant;
> > the most general is to load a vector from the constant pool, secondly
> "duplicated"
> > constants can be synthesized by moving an integer between units and
> > broadcasting (or shuffling it), and finally the special cases of the
> > all-zeros vector and all-ones vectors can be loaded via a single SSE
> > instruction.   This patch handles additional cases that can be synthesized
> > in two instructions, loading an all-ones vector followed by another
> > SSE instruction.  Following my recent patch for PR target/112992,
> > there's conveniently a single place in i386-expand.cc where these
> > special cases can be handled.
> >
> > Two examples are given in the original bugzilla PR for 106060.
> >
> > __m256i
> > should_be_cmpeq_abs ()
> > {
> >   return _mm256_set1_epi8 (1);
> > }
> >
> > is now generated (with -O3 -march=x86-64-v3) as:
> >
> > vpcmpeqd%ymm0, %ymm0, %ymm0
> > vpabsb  %ymm0, %ymm0
> > ret
> >
> > and
> >
> > __m256i
> > should_be_cmpeq_add ()
> > {
> >   return _mm256_set1_epi8 (-2);
> > }
> >
> > is now generated as:
> >
> > vpcmpeqd%ymm0, %ymm0, %ymm0
> > vpaddb  %ymm0, %ymm0, %ymm0
> > ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with

Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-16 Thread Hongtao Liu
On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle  wrote:
>
>
> I thought I'd just missed the bug fixing season of stage3, but there
> appears to a little latitude in early stage4 (for vector patches), so
> I'll post this now.
>
> This patch resolves PR target/106060 by providing efficient methods for
> materializing/synthesizing special "vector" constants on x86.  Currently
> there are three methods of materializing a vector constant; the most
> general is to load a vector from the constant pool, secondly "duplicated"
> constants can be synthesized by moving an integer between units and
> broadcasting (or shuffling it), and finally the special cases of the
> all-zeros vector and all-ones vectors can be loaded via a single SSE
> instruction.   This patch handles additional cases that can be synthesized
> in two instructions, loading an all-ones vector followed by another SSE
> instruction.  Following my recent patch for PR target/112992, there's
> conveniently a single place in i386-expand.cc where these special cases
> can be handled.
>
> Two examples are given in the original bugzilla PR for 106060.
>
> __m256i
> should_be_cmpeq_abs ()
> {
>   return _mm256_set1_epi8 (1);
> }
>
> is now generated (with -O3 -march=x86-64-v3) as:
>
> vpcmpeqd%ymm0, %ymm0, %ymm0
> vpabsb  %ymm0, %ymm0
> ret
>
> and
>
> __m256i
> should_be_cmpeq_add ()
> {
>   return _mm256_set1_epi8 (-2);
> }
>
> is now generated as:
>
> vpcmpeqd%ymm0, %ymm0, %ymm0
> vpaddb  %ymm0, %ymm0, %ymm0
> ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-01-16  Roger Sayle  
>
> gcc/ChangeLog
> PR target/106060
> * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
> (struct ix86_vec_bcast_map_simode_t): New type for table below.
> (ix86_vec_bcast_map_simode): Table of SImode constants that may
> be efficiently synthesized by a ix86_vec_bcast_alg method.
> (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
> (ix86_vector_duplicate_simode_const): Efficiently synthesize
> V4SImode and V8SImode constants that duplicate special constants.
> (ix86_vector_duplicate_value): Attempt to synthesize "special"
> vector constants using ix86_vector_duplicate_simode_const.
> * config/i386/i386.cc (ix86_rtx_costs) : ABS of a
> vector integer mode costs with a single SSE instruction.
>

+  switch (entry->alg)
+{
+case VEC_BCAST_PXOR:
+  if (mode == V8SImode && !TARGET_AVX2)
+ return false;
+  emit_move_insn (target, CONST0_RTX (mode));
+  return true;
+case VEC_BCAST_PCMPEQ:
+  if ((mode == V4SImode && !TARGET_SSE2)
+  || (mode == V8SImode && !TARGET_AVX2))
+ return false;
+  emit_move_insn (target, CONSTM1_RTX (mode));
+  return true;

I think we need to prevent those standard_sse_constant_p getting in
ix86_expand_vector_init_duplicate by below codes.

  /* If all values are identical, broadcast the value.  */
  if (all_same
  && (nvars != 0 || !standard_sse_constant_p (gen_rtx_CONST_VECTOR
(mode, XVEC (vals, 0)), mode))
  && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
XVECEXP (vals, 0, 0)))
return;

+case VEC_BCAST_PABSB:
+  if (mode == V4SImode)
+ {
+  tmp1 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp1, CONSTM1_RTX (V16QImode));
+  tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_absv16qi2 (tmp2, tmp1));
Shouldn't it rely on TARGET_SSE2?

+case VEC_BCAST_PADDB:
+  if (mode == V4SImode)
+ {
+  tmp1 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp1, CONSTM1_RTX (V16QImode));
+  tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_addv16qi3 (tmp2, tmp1, tmp1));
Ditto here and for all logic shift cases.
+ }

+
+  if ((mode == V4SImode || mode == V8SImode)
+  && CONST_INT_P (val)
+  && ix86_vector_duplicate_simode_const (mode, target, INTVAL (val)))
+return true;
+
The alternative way is adding a pre_reload define_insn_and_split to
match specific const_vector and splitt it into new instructions.
In theoritically, the constant info can be retained before combine and
will enable more simplication.

Also the patch can be extend to V16SImode, but it can be a separate patch.

> gcc/testsuite/ChangeLog
> PR target/106060
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512fp16-3.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr106060-1.c: New test case.
> * gcc.target/i386/pr106060-2.c: Likewise.
> * gcc.target/i386/pr106060-3.c: Likewise.
> * gcc.target/i386/pr70314-3.c: Update test case.
> * gcc.target/i386/vect-shiftv4qi.c: Likewise.
> * gcc.target/i386/vect-shiftv8qi.c: Likewise.
>
>
> Thanks in advance,
> Roger
> 

[x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-16 Thread Roger Sayle

I thought I'd just missed the bug fixing season of stage3, but there
appears to a little latitude in early stage4 (for vector patches), so
I'll post this now.

This patch resolves PR target/106060 by providing efficient methods for
materializing/synthesizing special "vector" constants on x86.  Currently
there are three methods of materializing a vector constant; the most
general is to load a vector from the constant pool, secondly "duplicated"
constants can be synthesized by moving an integer between units and
broadcasting (or shuffling it), and finally the special cases of the
all-zeros vector and all-ones vectors can be loaded via a single SSE
instruction.   This patch handles additional cases that can be synthesized
in two instructions, loading an all-ones vector followed by another SSE
instruction.  Following my recent patch for PR target/112992, there's
conveniently a single place in i386-expand.cc where these special cases
can be handled.

Two examples are given in the original bugzilla PR for 106060.

__m256i
should_be_cmpeq_abs ()
{
  return _mm256_set1_epi8 (1);
}

is now generated (with -O3 -march=x86-64-v3) as:

vpcmpeqd%ymm0, %ymm0, %ymm0
vpabsb  %ymm0, %ymm0
ret

and

__m256i
should_be_cmpeq_add ()
{
  return _mm256_set1_epi8 (-2);
}

is now generated as:

vpcmpeqd%ymm0, %ymm0, %ymm0
vpaddb  %ymm0, %ymm0, %ymm0
ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-01-16  Roger Sayle  

gcc/ChangeLog
PR target/106060
* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
(struct ix86_vec_bcast_map_simode_t): New type for table below.
(ix86_vec_bcast_map_simode): Table of SImode constants that may
be efficiently synthesized by a ix86_vec_bcast_alg method.
(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
(ix86_vector_duplicate_simode_const): Efficiently synthesize
V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special"
vector constants using ix86_vector_duplicate_simode_const.
* config/i386/i386.cc (ix86_rtx_costs) : ABS of a
vector integer mode costs with a single SSE instruction.

gcc/testsuite/ChangeLog
PR target/106060
* gcc.target/i386/auto-init-8.c: Update test case.
* gcc.target/i386/avx512fp16-3.c: Likewise.
* gcc.target/i386/pr100865-9a.c: Likewise.
* gcc.target/i386/pr106060-1.c: New test case.
* gcc.target/i386/pr106060-2.c: Likewise.
* gcc.target/i386/pr106060-3.c: Likewise.
* gcc.target/i386/pr70314-3.c: Update test case.
* gcc.target/i386/vect-shiftv4qi.c: Likewise.
* gcc.target/i386/vect-shiftv8qi.c: Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 52754e1..f8f8af6 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -15638,6 +15638,288 @@ s4fma_expand:
   gcc_unreachable ();
 }
 
+/* See below where shifts are handled for explanation of this enum.  */
+enum ix86_vec_bcast_alg
+{
+  VEC_BCAST_PXOR,
+  VEC_BCAST_PCMPEQ,
+  VEC_BCAST_PABSB,
+  VEC_BCAST_PADDB,
+  VEC_BCAST_PSRLW,
+  VEC_BCAST_PSRLD,
+  VEC_BCAST_PSLLW,
+  VEC_BCAST_PSLLD
+};
+
+struct ix86_vec_bcast_map_simode_t
+{
+  unsigned int key;
+  enum ix86_vec_bcast_alg alg;
+  unsigned int arg;
+};
+
+/* This table must be kept sorted as values are looked-up using bsearch.  */
+static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = {
+  { 0x, VEC_BCAST_PXOR,0 },
+  { 0x0001, VEC_BCAST_PSRLD,  31 },
+  { 0x0003, VEC_BCAST_PSRLD,  30 },
+  { 0x0007, VEC_BCAST_PSRLD,  29 },
+  { 0x000f, VEC_BCAST_PSRLD,  28 },
+  { 0x001f, VEC_BCAST_PSRLD,  27 },
+  { 0x003f, VEC_BCAST_PSRLD,  26 },
+  { 0x007f, VEC_BCAST_PSRLD,  25 },
+  { 0x00ff, VEC_BCAST_PSRLD,  24 },
+  { 0x01ff, VEC_BCAST_PSRLD,  23 },
+  { 0x03ff, VEC_BCAST_PSRLD,  22 },
+  { 0x07ff, VEC_BCAST_PSRLD,  21 },
+  { 0x0fff, VEC_BCAST_PSRLD,  20 },
+  { 0x1fff, VEC_BCAST_PSRLD,  19 },
+  { 0x3fff, VEC_BCAST_PSRLD,  18 },
+  { 0x7fff, VEC_BCAST_PSRLD,  17 },
+  { 0x, VEC_BCAST_PSRLD,  16 },
+  { 0x00010001, VEC_BCAST_PSRLW,  15 },
+  { 0x0001, VEC_BCAST_PSRLD,  15 },
+  { 0x00030003, VEC_BCAST_PSRLW,  14 },
+  { 0x0003, VEC_BCAST_PSRLD,  14 },
+  { 0x00070007, VEC_BCAST_PSRLW,  13 },
+  { 0x0007, VEC_BCAST_PSRLD,  13 },
+  { 0x000f000f, VEC_BCAST_PSRLW,  12 },
+  { 0x000f, VEC_BCAST_PSRLD,  12 },
+  { 0x001f001f, VEC_BCAST_PSRLW,  11 },
+  { 0x001f, VEC_BCAST_PSRLD,  11 },
+  { 0x003f003f, VEC_BCAST_PSRLW,  10 },
+  { 0x003f, VEC_BCAST_PSRLD,  10 },
+  { 0x007f007f, VEC_BCAST_PSRLW,   9 },
+  {