Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
On Fri, Jan 26, 2024 at 3:03 AM Roger Sayle wrote: > > > Hi Hongtao, > Many thanks for the review. Here's a revised version of my patch > that addresses (most of) the issues you've raised. Firstly the > handling of zero and all_ones in this function is mostly for > completeness/documentation, these standard_sse_constant_p > values are (currently/normally) handled elsewhere. But I have > added an "n_var == 0" optimization to ix86_expand_vector_init. > > As you've suggested I've added explicit TARGET_SSE2 tests where > required, and for consistency I've also added support for AVX512's > V16SImode. > > As you've predicted, the eventual goal is to move this after combine > (or reload) using define_insn_and_split, but that requires a significant > restructuring that should be done in steps. This also interacts with > a similar planned reorganization of TImode constant handling. If > all 128-bit (vector) constants are acceptable before combine, then > STV has the freedom to chose V1TImode (and this broadcast > functionality) to implement TImode operations on immediate > constants. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline (in stage 1)? Ok, thanks for handling this. > > > 2024-01-25 Roger Sayle > Hongtao Liu > > gcc/ChangeLog > PR target/106060 > * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. > (struct ix86_vec_bcast_map_simode_t): New type for table below. > (ix86_vec_bcast_map_simode): Table of SImode constants that may > be efficiently synthesized by a ix86_vec_bcast_alg method. > (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. > (ix86_vector_duplicate_simode_const): Efficiently synthesize > V4SImode and V8SImode constants that duplicate special constants. > (ix86_vector_duplicate_value): Attempt to synthesize "special" > vector constants using ix86_vector_duplicate_simode_const. > * config/i386/i386.cc (ix86_rtx_costs) : ABS of a > vector integer mode costs with a single SSE instruction. > > gcc/testsuite/ChangeLog > PR target/106060 > * gcc.target/i386/auto-init-8.c: Update test case. > * gcc.target/i386/avx512fp16-3.c: Likewise. > * gcc.target/i386/pr100865-9a.c: Likewise. > * gcc.target/i386/pr101796-1.c: Likewise. > * gcc.target/i386/pr106060-1.c: New test case. > * gcc.target/i386/pr106060-2.c: Likewise. > * gcc.target/i386/pr106060-3.c: Likewise. > * gcc.target/i386/pr70314.c: Update test case. > * gcc.target/i386/vect-shiftv4qi.c: Likewise. > * gcc.target/i386/vect-shiftv8qi.c: Likewise. > > > Roger > -- > > > -----Original Message- > > From: Hongtao Liu > > Sent: 17 January 2024 03:13 > > To: Roger Sayle > > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant > > materialization. > > > > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle > > wrote: > > > > > > > > > I thought I'd just missed the bug fixing season of stage3, but there > > > appears to a little latitude in early stage4 (for vector patches), so > > > I'll post this now. > > > > > > This patch resolves PR target/106060 by providing efficient methods > > > for materializing/synthesizing special "vector" constants on x86. > > > Currently there are three methods of materializing a vector constant; > > > the most general is to load a vector from the constant pool, secondly > > "duplicated" > > > constants can be synthesized by moving an integer between units and > > > broadcasting (or shuffling it), and finally the special cases of the > > > all-zeros vector and all-ones vectors can be loaded via a single SSE > > > instruction. This patch handles additional cases that can be synthesized > > > in two instructions, loading an all-ones vector followed by another > > > SSE instruction. Following my recent patch for PR target/112992, > > > there's conveniently a single place in i386-expand.cc where these > > > special cases can be handled. > > > > > > Two examples are given in the original bugzilla PR for 106060. > > > > > > __m256i > > > should_be_cmpeq_abs () > > > { > > > return _mm256_set1_epi8 (1); > > > } > > > > > > is now generated (with -O3 -march=x
RE: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
Hi Hongtao, Many thanks for the review. Here's a revised version of my patch that addresses (most of) the issues you've raised. Firstly the handling of zero and all_ones in this function is mostly for completeness/documentation, these standard_sse_constant_p values are (currently/normally) handled elsewhere. But I have added an "n_var == 0" optimization to ix86_expand_vector_init. As you've suggested I've added explicit TARGET_SSE2 tests where required, and for consistency I've also added support for AVX512's V16SImode. As you've predicted, the eventual goal is to move this after combine (or reload) using define_insn_and_split, but that requires a significant restructuring that should be done in steps. This also interacts with a similar planned reorganization of TImode constant handling. If all 128-bit (vector) constants are acceptable before combine, then STV has the freedom to chose V1TImode (and this broadcast functionality) to implement TImode operations on immediate constants. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline (in stage 1)? 2024-01-25 Roger Sayle Hongtao Liu gcc/ChangeLog PR target/106060 * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. (struct ix86_vec_bcast_map_simode_t): New type for table below. (ix86_vec_bcast_map_simode): Table of SImode constants that may be efficiently synthesized by a ix86_vec_bcast_alg method. (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. (ix86_vector_duplicate_simode_const): Efficiently synthesize V4SImode and V8SImode constants that duplicate special constants. (ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr101796-1.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Roger -- > -Original Message- > From: Hongtao Liu > Sent: 17 January 2024 03:13 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant > materialization. > > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle > wrote: > > > > > > I thought I'd just missed the bug fixing season of stage3, but there > > appears to a little latitude in early stage4 (for vector patches), so > > I'll post this now. > > > > This patch resolves PR target/106060 by providing efficient methods > > for materializing/synthesizing special "vector" constants on x86. > > Currently there are three methods of materializing a vector constant; > > the most general is to load a vector from the constant pool, secondly > "duplicated" > > constants can be synthesized by moving an integer between units and > > broadcasting (or shuffling it), and finally the special cases of the > > all-zeros vector and all-ones vectors can be loaded via a single SSE > > instruction. This patch handles additional cases that can be synthesized > > in two instructions, loading an all-ones vector followed by another > > SSE instruction. Following my recent patch for PR target/112992, > > there's conveniently a single place in i386-expand.cc where these > > special cases can be handled. > > > > Two examples are given in the original bugzilla PR for 106060. > > > > __m256i > > should_be_cmpeq_abs () > > { > > return _mm256_set1_epi8 (1); > > } > > > > is now generated (with -O3 -march=x86-64-v3) as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpabsb %ymm0, %ymm0 > > ret > > > > and > > > > __m256i > > should_be_cmpeq_add () > > { > > return _mm256_set1_epi8 (-2); > > } > > > > is now generated as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpaddb %ymm0, %ymm0, %ymm0 > > ret > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with
Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle wrote: > > > I thought I'd just missed the bug fixing season of stage3, but there > appears to a little latitude in early stage4 (for vector patches), so > I'll post this now. > > This patch resolves PR target/106060 by providing efficient methods for > materializing/synthesizing special "vector" constants on x86. Currently > there are three methods of materializing a vector constant; the most > general is to load a vector from the constant pool, secondly "duplicated" > constants can be synthesized by moving an integer between units and > broadcasting (or shuffling it), and finally the special cases of the > all-zeros vector and all-ones vectors can be loaded via a single SSE > instruction. This patch handles additional cases that can be synthesized > in two instructions, loading an all-ones vector followed by another SSE > instruction. Following my recent patch for PR target/112992, there's > conveniently a single place in i386-expand.cc where these special cases > can be handled. > > Two examples are given in the original bugzilla PR for 106060. > > __m256i > should_be_cmpeq_abs () > { > return _mm256_set1_epi8 (1); > } > > is now generated (with -O3 -march=x86-64-v3) as: > > vpcmpeqd%ymm0, %ymm0, %ymm0 > vpabsb %ymm0, %ymm0 > ret > > and > > __m256i > should_be_cmpeq_add () > { > return _mm256_set1_epi8 (-2); > } > > is now generated as: > > vpcmpeqd%ymm0, %ymm0, %ymm0 > vpaddb %ymm0, %ymm0, %ymm0 > ret > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > > 2024-01-16 Roger Sayle > > gcc/ChangeLog > PR target/106060 > * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. > (struct ix86_vec_bcast_map_simode_t): New type for table below. > (ix86_vec_bcast_map_simode): Table of SImode constants that may > be efficiently synthesized by a ix86_vec_bcast_alg method. > (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. > (ix86_vector_duplicate_simode_const): Efficiently synthesize > V4SImode and V8SImode constants that duplicate special constants. > (ix86_vector_duplicate_value): Attempt to synthesize "special" > vector constants using ix86_vector_duplicate_simode_const. > * config/i386/i386.cc (ix86_rtx_costs) : ABS of a > vector integer mode costs with a single SSE instruction. > + switch (entry->alg) +{ +case VEC_BCAST_PXOR: + if (mode == V8SImode && !TARGET_AVX2) + return false; + emit_move_insn (target, CONST0_RTX (mode)); + return true; +case VEC_BCAST_PCMPEQ: + if ((mode == V4SImode && !TARGET_SSE2) + || (mode == V8SImode && !TARGET_AVX2)) + return false; + emit_move_insn (target, CONSTM1_RTX (mode)); + return true; I think we need to prevent those standard_sse_constant_p getting in ix86_expand_vector_init_duplicate by below codes. /* If all values are identical, broadcast the value. */ if (all_same && (nvars != 0 || !standard_sse_constant_p (gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0)), mode)) && ix86_expand_vector_init_duplicate (mmx_ok, mode, target, XVECEXP (vals, 0, 0))) return; +case VEC_BCAST_PABSB: + if (mode == V4SImode) + { + tmp1 = gen_reg_rtx (V16QImode); + emit_move_insn (tmp1, CONSTM1_RTX (V16QImode)); + tmp2 = gen_reg_rtx (V16QImode); + emit_insn (gen_absv16qi2 (tmp2, tmp1)); Shouldn't it rely on TARGET_SSE2? +case VEC_BCAST_PADDB: + if (mode == V4SImode) + { + tmp1 = gen_reg_rtx (V16QImode); + emit_move_insn (tmp1, CONSTM1_RTX (V16QImode)); + tmp2 = gen_reg_rtx (V16QImode); + emit_insn (gen_addv16qi3 (tmp2, tmp1, tmp1)); Ditto here and for all logic shift cases. + } + + if ((mode == V4SImode || mode == V8SImode) + && CONST_INT_P (val) + && ix86_vector_duplicate_simode_const (mode, target, INTVAL (val))) +return true; + The alternative way is adding a pre_reload define_insn_and_split to match specific const_vector and splitt it into new instructions. In theoritically, the constant info can be retained before combine and will enable more simplication. Also the patch can be extend to V16SImode, but it can be a separate patch. > gcc/testsuite/ChangeLog > PR target/106060 > * gcc.target/i386/auto-init-8.c: Update test case. > * gcc.target/i386/avx512fp16-3.c: Likewise. > * gcc.target/i386/pr100865-9a.c: Likewise. > * gcc.target/i386/pr106060-1.c: New test case. > * gcc.target/i386/pr106060-2.c: Likewise. > * gcc.target/i386/pr106060-3.c: Likewise. > * gcc.target/i386/pr70314-3.c: Update test case. > * gcc.target/i386/vect-shiftv4qi.c: Likewise. > * gcc.target/i386/vect-shiftv8qi.c: Likewise. > > > Thanks in advance, > Roger >
[x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
I thought I'd just missed the bug fixing season of stage3, but there appears to a little latitude in early stage4 (for vector patches), so I'll post this now. This patch resolves PR target/106060 by providing efficient methods for materializing/synthesizing special "vector" constants on x86. Currently there are three methods of materializing a vector constant; the most general is to load a vector from the constant pool, secondly "duplicated" constants can be synthesized by moving an integer between units and broadcasting (or shuffling it), and finally the special cases of the all-zeros vector and all-ones vectors can be loaded via a single SSE instruction. This patch handles additional cases that can be synthesized in two instructions, loading an all-ones vector followed by another SSE instruction. Following my recent patch for PR target/112992, there's conveniently a single place in i386-expand.cc where these special cases can be handled. Two examples are given in the original bugzilla PR for 106060. __m256i should_be_cmpeq_abs () { return _mm256_set1_epi8 (1); } is now generated (with -O3 -march=x86-64-v3) as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpabsb %ymm0, %ymm0 ret and __m256i should_be_cmpeq_add () { return _mm256_set1_epi8 (-2); } is now generated as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpaddb %ymm0, %ymm0, %ymm0 ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-01-16 Roger Sayle gcc/ChangeLog PR target/106060 * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. (struct ix86_vec_bcast_map_simode_t): New type for table below. (ix86_vec_bcast_map_simode): Table of SImode constants that may be efficiently synthesized by a ix86_vec_bcast_alg method. (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. (ix86_vector_duplicate_simode_const): Efficiently synthesize V4SImode and V8SImode constants that duplicate special constants. (ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314-3.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 52754e1..f8f8af6 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -15638,6 +15638,288 @@ s4fma_expand: gcc_unreachable (); } +/* See below where shifts are handled for explanation of this enum. */ +enum ix86_vec_bcast_alg +{ + VEC_BCAST_PXOR, + VEC_BCAST_PCMPEQ, + VEC_BCAST_PABSB, + VEC_BCAST_PADDB, + VEC_BCAST_PSRLW, + VEC_BCAST_PSRLD, + VEC_BCAST_PSLLW, + VEC_BCAST_PSLLD +}; + +struct ix86_vec_bcast_map_simode_t +{ + unsigned int key; + enum ix86_vec_bcast_alg alg; + unsigned int arg; +}; + +/* This table must be kept sorted as values are looked-up using bsearch. */ +static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = { + { 0x, VEC_BCAST_PXOR,0 }, + { 0x0001, VEC_BCAST_PSRLD, 31 }, + { 0x0003, VEC_BCAST_PSRLD, 30 }, + { 0x0007, VEC_BCAST_PSRLD, 29 }, + { 0x000f, VEC_BCAST_PSRLD, 28 }, + { 0x001f, VEC_BCAST_PSRLD, 27 }, + { 0x003f, VEC_BCAST_PSRLD, 26 }, + { 0x007f, VEC_BCAST_PSRLD, 25 }, + { 0x00ff, VEC_BCAST_PSRLD, 24 }, + { 0x01ff, VEC_BCAST_PSRLD, 23 }, + { 0x03ff, VEC_BCAST_PSRLD, 22 }, + { 0x07ff, VEC_BCAST_PSRLD, 21 }, + { 0x0fff, VEC_BCAST_PSRLD, 20 }, + { 0x1fff, VEC_BCAST_PSRLD, 19 }, + { 0x3fff, VEC_BCAST_PSRLD, 18 }, + { 0x7fff, VEC_BCAST_PSRLD, 17 }, + { 0x, VEC_BCAST_PSRLD, 16 }, + { 0x00010001, VEC_BCAST_PSRLW, 15 }, + { 0x0001, VEC_BCAST_PSRLD, 15 }, + { 0x00030003, VEC_BCAST_PSRLW, 14 }, + { 0x0003, VEC_BCAST_PSRLD, 14 }, + { 0x00070007, VEC_BCAST_PSRLW, 13 }, + { 0x0007, VEC_BCAST_PSRLD, 13 }, + { 0x000f000f, VEC_BCAST_PSRLW, 12 }, + { 0x000f, VEC_BCAST_PSRLD, 12 }, + { 0x001f001f, VEC_BCAST_PSRLW, 11 }, + { 0x001f, VEC_BCAST_PSRLD, 11 }, + { 0x003f003f, VEC_BCAST_PSRLW, 10 }, + { 0x003f, VEC_BCAST_PSRLD, 10 }, + { 0x007f007f, VEC_BCAST_PSRLW, 9 }, + {