Re: RFC: Introduce -fhardened to enable security-related flags

2023-09-14 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 30, 2023 at 3:42 AM Marek Polacek via Gcc-patches
 wrote:
>
> Improving the security of software has been a major trend in recent
> years.  Fortunately, GCC offers a wide variety of flags that enable extra
> hardening.  These flags aren't enabled by default, though.  And since
> there are a lot of hardening flags, with more to come, it's been difficult
> to keep on top of them; more so for the users of GCC who ought not to be
> expected to keep track of all the new options.
>
> To alleviate some of the problems I mentioned, we thought it would
> be useful to provide a new umbrella option that enables a reasonable set
> of hardening flags.  What's "reasonable" in this context is not easy to
> pin down.  Surely, there must be no ABI impact, the option cannot cause
> severe performance issues, and, I suspect, it should not cause build
> errors by enabling stricter compile-time errors (such as, -Wimplicit-int,
> -Wint-conversion).  Including a controversial option in -fhardened
> would likely lead users to avoid -fhardened altogether.  It's
> roughly akin to -Wall or -O2 -- those also enable a reasonable set of
> options, and evolve over time, and are not kept in sync with other
> compilers.
>
> Currently, -fhardened enables:
>
>   -D_FORTIFY_SOURCE=3 (or =2 for older glibcs)
>   -D_GLIBCXX_ASSERTIONS
>   -ftrivial-auto-var-init=zero
>   -fPIE  -pie  -Wl,-z,relro,-z,now
>   -fstack-protector-strong
>   -fstack-clash-protection
>   -fcf-protection=full (x86 GNU/Linux only)
>
> -fsanitize=undefined is specifically not enabled.  -fstrict-flex-arrays is
> also liable to break a lot of code so I didn't include it.
>
> Appended is a proof-of-concept patch.  It doesn't implement --help=hardened
> yet.  A fairly crucial point is that -fhardened will not override options
> that were specified on the command line (before or after -fhardened).  For
> example,
>
>  -D_FORTIFY_SOURCE=1 -fhardened
>
> means that _FORTIFY_SOURCE=1 will be used.  Similarly,
>
>   -fhardened -fstack-protector
>
> will not enable -fstack-protector-strong.
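The precedence rule described above can be sketched as follows. This is only an illustrative model, not the code from the patch: the struct `opt_state` and the helpers `apply_hardened_default` and `effective_fortify_level` are hypothetical names, and the default level 3 is taken from the list quoted above.

```c
#include <stdbool.h>

/* Hypothetical model of one option: its value and whether the user
   set it explicitly on the command line.  */
struct opt_state
{
  int value;
  bool set_by_user;
};

/* Apply the default that an umbrella option such as -fhardened would
   choose, but only when the user did not specify the option, no matter
   where on the command line it appeared.  */
void
apply_hardened_default (struct opt_state *opt, int hardened_value)
{
  if (!opt->set_by_user)
    opt->value = hardened_value;
}

/* Effective _FORTIFY_SOURCE level.  USER_LEVEL is the explicit -D value,
   or -1 when none was given (an assumption of this sketch).  */
int
effective_fortify_level (int user_level, bool hardened)
{
  struct opt_state fortify;
  fortify.set_by_user = user_level >= 0;
  fortify.value = fortify.set_by_user ? user_level : 0;
  if (hardened)
    apply_hardened_default (&fortify, 3);
  return fortify.value;
}
```

With this model, `effective_fortify_level (1, true)` stays at 1, matching the `-D_FORTIFY_SOURCE=1 -fhardened` example, while `effective_fortify_level (-1, true)` yields 3.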
>
> Thoughts?
>
> ---
>  gcc/c-family/c-opts.cc | 25 
>  gcc/common.opt |  4 +++
>  gcc/config/i386/i386-options.cc| 11 ++-
>  gcc/doc/invoke.texi| 29 +-
>  gcc/gcc.cc | 35 +-
>  gcc/opts.cc| 15 --
>  gcc/testsuite/c-c++-common/fhardened-1.S   |  6 
>  gcc/testsuite/c-c++-common/fhardened-1.c   | 18 +++
>  gcc/testsuite/c-c++-common/fhardened-10.c  | 10 +++
>  gcc/testsuite/c-c++-common/fhardened-2.c   | 12 
>  gcc/testsuite/c-c++-common/fhardened-3.c   | 12 
>  gcc/testsuite/c-c++-common/fhardened-5.c   | 11 +++
>  gcc/testsuite/c-c++-common/fhardened-6.c   | 11 +++
>  gcc/testsuite/c-c++-common/fhardened-7.c   |  7 +
>  gcc/testsuite/c-c++-common/fhardened-8.c   |  7 +
>  gcc/testsuite/c-c++-common/fhardened-9.c   |  6 
>  gcc/testsuite/gcc.misc-tests/help.exp  |  2 ++
>  gcc/testsuite/gcc.target/i386/cf_check-6.c | 12 
>  gcc/toplev.cc  |  6 
>  19 files changed, 233 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-1.S
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-1.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-10.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-2.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-3.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-5.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-6.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-7.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-8.c
>  create mode 100644 gcc/testsuite/c-c++-common/fhardened-9.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cf_check-6.c
>
> diff --git a/gcc/c-family/c-opts.cc b/gcc/c-family/c-opts.cc
> index 4961af63de8..764714ba8a5 100644
> --- a/gcc/c-family/c-opts.cc
> +++ b/gcc/c-family/c-opts.cc
> @@ -1514,6 +1514,9 @@ c_finish_options (void)
>cb_file_change (parse_in, cmd_map);
>linemap_line_start (line_table, 0, 1);
>
> +  bool fortify_seen_p = false;
> +  bool cxx_assert_seen_p = false;
> +
>/* All command line defines must have the same location.  */
>cpp_force_token_locations (parse_in, line_table->highest_line);
>for (size_t i = 0; i < deferred_count; i++)
> @@ -1531,6 +1534,28 @@ c_finish_options (void)
>   else
> cpp_assert (parse_in, opt->arg);
> }
> +
> + if (UNLIKELY (flag_hardened)
> + && (opt->code == OPT_D || opt->code == OPT_U))
> +   {
> + if (!fortify_seen_p)
> +   fortify_seen_p = !strncmp (opt->arg, "_FORTIFY_SOURCE", 15);
> + if 

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-04 Thread Hongtao Liu via Gcc-patches
On Mon, Sep 4, 2023 at 4:57 PM Uros Bizjak  wrote:
>
> On Mon, Sep 4, 2023 at 2:28 AM Hongtao Liu  wrote:
>
> > > > > > > > I think there should be some constraint which explicitly has 
> > > > > > > > all the 32
> > > > > > > > GPRs, like there is one for just all 16 GPRs (h), so that 
> > > > > > > > regardless of
> > > > > > > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline 
> > > > > > > > asm wants.
> > > > > > > >
> > > > > > > > Also, what about the "g" constraint?  Shouldn't there be 
> > > > > > > > another for "g"
> > > > > > > > without r16..r31?  What about the various other memory
> > > > > > > > constraints ("<", "o", ...)?
> > > > > > >
> > > > > > > I think we should leave all existing constraints as they are, so 
> > > > > > > "r"
> > > > > > > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > > > > > > introduce "h" to instructions that have the ability to handle 
> > > > > > > EGPR.
> > > > > > > This would be somehow similar to the SSE -> AVX512F transition, 
> > > > > > > where
> > > > > > > we still have "x" for SSE16 and "v" was introduced as a separate
> > > > > > > register class for EVEX SSE registers. This way, asm will be
> > > > > > > compatible, when "r", "m", "o" and "g" are used. The new memory
> > > > > > > constraint "Bt", should allow new registers, and should be added 
> > > > > > > to
> > > > > > > the constraint string as a separate constraint, and conditionally
> > > > > > > enabled by relevant "isa" (AKA "enabled") attribute.
> > > > > >
> > > > > > The extended constraint can work for registers, but for memory it 
> > > > > > is more
> > > > > > complicated.
> > > > >
> > > > > Yes, unfortunately. The compiler assumes that an unchangeable register
> > > > > class is used for BASE/INDEX registers. I have hit this limitation
> > > > > when trying to implement memory support for instructions involving
> > > > > 8-bit high registers (%ah, %bh, %ch, %dh), which do not support REX
> > > > > registers, also inside memory operand. (You can see the "hack" in e.g.
> > > > > *extzvqi_mem_rex64" and corresponding peephole2 with the original
> > > > > *extzvqi pattern). I am aware that dynamic insn-dependent BASE/INDEX
> > > > > register class is the major limitation in the compiler, so perhaps the
> > > > > strategy on how to override this limitation should be discussed with
> > > > > the register allocator author first. Perhaps adding an insn attribute
> > > > > to insn RTX pattern to specify different BASE/INDEX register sets can
> > > > > be a better solution than passing insn RTX to the register allocator.
> > > > >
> > > > > The above idea still does not solve the asm problem on how to select
> > > > > correct BASE/INDEX register set for memory operands.
> > > > > The current approach disables gpr32 for memory operands in asm_operand
> > > > > by default, but it can be turned on with the option
> > > > > ix86_apx_inline_asm_use_gpr32 (users need to guarantee that the
> > > > > instruction supports gpr32).
> > > > > Only ~5% of all instructions don't support gpr32, so the reversed
> > > > > approach would only get more complicated.
> > >
> > > I'm not referring to the reversed approach, just want to point out
> > > that the same approach as you proposed w.r.t. to memory operand can be
> > > achieved using some named insn attribute that would affect BASE/INDEX
> > > register class selection. The attribute could default to gpr32 with
> > > APX, unless the insn specific attribute has e.g. nogpr32 value. See
> > > for example how "enabled" and "preferred_for_*" attributes are used.
> > > Perhaps this new attribute can also be applied to separate
> > > alternatives.
> > Yes, for xop/fma4/3dnow instructions, I think we can use isa attr like
> > (define_attr "gpr32" "0, 1"
> >   (cond [(eq_attr "isa" "fma4")
> >(const_string "0")]
> >   (const_string "1")))
>
> Just a nit, can the member be named "map0" and "map1"? The code will
> then look like:
>
> if (get_attr_gpr32 (insn) == GPR32_MAP0) ...
>
> instead of:
>
> if (get_attr_gpr32 (insn) == GPR32_0) ...
>
> > But still, we need to adjust memory constraints in the pattern.
>
> I guess the gpr32 property is the same for all alternatives of the
> insn pattern. In this case,  "m" "g" and "a" constraints could remain
> as they are, the final register class will be adjusted (by some target
> hook?) based on the value of gpr32 attribute.
I'm worried that not all rtl optimizers after post_reload will respect
base/index_reg_class with regard to the insn the address belongs to.
If they just check whether it's a legitimate memory/address (the current
legitimate_address doesn't have a corresponding insn to pass down),
m/g/a will still generate invalid instructions.
So a defensive approach is to explicitly modify the constraint.
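The concern can be illustrated with a small model. This is a sketch under stated assumptions, not GCC code: `enum gpr32_attr`, `struct mem_addr`, `addr_ok_generic` and `addr_ok_for_insn` are hypothetical names, and register numbers 16..31 simply stand in for the APX EGPRs.

```c
#include <stdbool.h>

/* Hypothetical values for a per-insn "gpr32" attribute.  */
enum gpr32_attr { GPR32_NO, GPR32_YES };

/* Simplified address: hard register numbers of base and index,
   or -1 when absent.  Registers 16..31 stand for the APX EGPRs.  */
struct mem_addr
{
  int base;
  int index;
};

static bool
is_egpr (int regno)
{
  return regno >= 16 && regno <= 31;
}

/* Insn-unaware check: any of the 32 GPRs is accepted, because no
   insn is available to consult.  */
bool
addr_ok_generic (const struct mem_addr *a)
{
  return a->base < 32 && a->index < 32;
}

/* Insn-aware check: additionally reject EGPRs when the insn cannot
   encode them.  */
bool
addr_ok_for_insn (const struct mem_addr *a, enum gpr32_attr attr)
{
  if (!addr_ok_generic (a))
    return false;
  if (attr == GPR32_NO && (is_egpr (a->base) || is_egpr (a->index)))
    return false;
  return true;
}
```

A pass that only calls something like `addr_ok_generic` would accept an address based on register 17 even for an insn whose attribute is `GPR32_NO`, which is the case an explicit constraint is meant to rule out.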
>
> > Ideally, gcc would include encoding information for every instruction
> > (i.e. map0/map1), so that we can determine the attribute value of
> > gpr32 directly from this information.
>
> I think the right 

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-03 Thread Hongtao Liu via Gcc-patches
On Fri, Sep 1, 2023 at 7:03 PM Richard Sandiford via Gcc-patches
 wrote:
>
> Uros Bizjak via Gcc-patches  writes:
> > On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
> >  wrote:
> >>
> >> On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches 
> >> wrote:
> >> > From: Kong Lingling 
> >> >
> >> > In inline asm, we do not know if the insn can use EGPR, so disable EGPR
> >> > usage by default from mapping the common reg/mem constraint to non-EGPR
> >> > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
> >> > for inline asm.
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> >> >   ix86_md_asm_adjust.
> >> >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying the
> >> >   target option, map reg/mem constraints to non-EGPR constraints.
> >> >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> >> > ---
> >> >  gcc/config/i386/i386.cc   |  44 +++
> >> >  gcc/config/i386/i386.opt  |   5 +
> >> >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 ++
> >> >  3 files changed, 156 insertions(+)
> >> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> >> >
> >> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> >> > index d26d9ab0d9d..9460ebbfda4 100644
> >> > --- a/gcc/config/i386/i386.cc
> >> > +++ b/gcc/config/i386/i386.cc
> >> > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General 
> >> > Public License
> >> >  along with GCC; see the file COPYING3.  If not see
> >> >  .  */
> >> >
> >> > +#define INCLUDE_STRING
> >> >  #define IN_TARGET_CODE 1
> >> >
> >> >  #include "config.h"
> >> > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec , vec 
> >> > & /*inputs*/,
> >> >bool saw_asm_flag = false;
> >> >
> >> >start_sequence ();
> >> > +  /* TODO: Here we just mapped the general r/m constraints to non-EGPR
> >> > +   constraints, will eventually map all the usable constraints in the 
> >> > future. */
> >>
> >> I think there should be some constraint which explicitly has all the 32
> >> GPRs, like there is one for just all 16 GPRs (h), so that regardless of
> >> -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm wants.
> >>
> >> Also, what about the "g" constraint?  Shouldn't there be another for "g"
> >> without r16..r31?  What about the various other memory
> >> constraints ("<", "o", ...)?
> >
> > I think we should leave all existing constraints as they are, so "r"
> > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > introduce "h" to instructions that have the ability to handle EGPR.
>
> Yeah.  I'm jumping in without having read the full thread, sorry,
> but the current mechanism for handling this is TARGET_MEM_CONSTRAINT
> (added for s390).  That is, TARGET_MEM_CONSTRAINT can be defined to some
Thanks for the comments.
> new constraint that is more general than the traditional "m" constraint.
> This constraint is then the one that is associated with memory_operand
> etc.  "m" can then be defined explicitly to the old definition,
> so that existing asms continue to work.
>
> So if the port wants generic internal memory addresses to use the
> EGPR set (sounds reasonable), then TARGET_MEM_CONSTRAINT would be
> a new constraint that maps to those addresses.
But we still need to enhance the current reload infrastructure to support
selective base_reg_class/index_reg_class, refer to [1].
The good thing about using TARGET_MEM_CONSTRAINT is that we don't have
to remap memory constraints for inline asm, but the bad thing about
it is that we need to modify the backend patterns a lot, because only
5% of the instructions don't support gpr32, and 95% of them need to be
changed to the new memory constraint.
It feels like the cons outweigh the pros.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629040.html

>
> Thanks,
> Richard



-- 
BR,
Hongtao


Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-03 Thread Hongtao Liu via Gcc-patches
On Fri, Sep 1, 2023 at 7:27 PM Uros Bizjak  wrote:
>
> On Fri, Sep 1, 2023 at 12:36 PM Hongtao Liu  wrote:
> >
> > On Fri, Sep 1, 2023 at 5:38 PM Uros Bizjak via Gcc-patches
> >  wrote:
> > >
> > > On Fri, Sep 1, 2023 at 11:10 AM Hongyu Wang  
> > > wrote:
> > > >
> > > > Uros Bizjak via Gcc-patches  于2023年8月31日周四 
> > > > 18:01写道:
> > > > >
> > > > > On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via 
> > > > > > Gcc-patches wrote:
> > > > > > > From: Kong Lingling 
> > > > > > >
> > > > > > > In inline asm, we do not know if the insn can use EGPR, so 
> > > > > > > disable EGPR
> > > > > > > usage by default from mapping the common reg/mem constraint to 
> > > > > > > non-EGPR
> > > > > > > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR 
> > > > > > > usage
> > > > > > > for inline asm.
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> > > > > > >   ix86_md_asm_adjust.
> > > > > > >   (ix86_md_asm_adjust): When APX EGPR enabled without 
> > > > > > > specifying the
> > > > > > >   target option, map reg/mem constraints to non-EGPR 
> > > > > > > constraints.
> > > > > > >   * config/i386/i386.opt: Add option 
> > > > > > > mapx-inline-asm-use-gpr32.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> > > > > > > ---
> > > > > > >  gcc/config/i386/i386.cc   |  44 +++
> > > > > > >  gcc/config/i386/i386.opt  |   5 +
> > > > > > >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 
> > > > > > > ++
> > > > > > >  3 files changed, 156 insertions(+)
> > > > > > >  create mode 100644 
> > > > > > > gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> > > > > > >
> > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > > > index d26d9ab0d9d..9460ebbfda4 100644
> > > > > > > --- a/gcc/config/i386/i386.cc
> > > > > > > +++ b/gcc/config/i386/i386.cc
> > > > > > > @@ -17,6 +17,7 @@ You should have received a copy of the GNU 
> > > > > > > General Public License
> > > > > > >  along with GCC; see the file COPYING3.  If not see
> > > > > > >  .  */
> > > > > > >
> > > > > > > +#define INCLUDE_STRING
> > > > > > >  #define IN_TARGET_CODE 1
> > > > > > >
> > > > > > >  #include "config.h"
> > > > > > > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec , 
> > > > > > > vec & /*inputs*/,
> > > > > > >bool saw_asm_flag = false;
> > > > > > >
> > > > > > >start_sequence ();
> > > > > > > +  /* TODO: Here we just mapped the general r/m constraints to 
> > > > > > > non-EGPR
> > > > > > > +   constraints, will eventually map all the usable constraints 
> > > > > > > in the future. */
> > > > > >
> > > > > > I think there should be some constraint which explicitly has all 
> > > > > > the 32
> > > > > > GPRs, like there is one for just all 16 GPRs (h), so that 
> > > > > > regardless of
> > > > > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm 
> > > > > > wants.
> > > > > >
> > > > > > Also, what about the "g" constraint?  Shouldn't there be another 
> > > > > > for "g"
> > > > > > without r16..r31?  What about the various other memory
> > > > > > constraints ("<", "o", ...)?
> > > > >
> > > > > I think we should leave all existing constraints as they are, so "r"
> > > > > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > > > > introduce "h" to instructions that have the ability to handle EGPR.
> > > > > This would be somehow similar to the SSE -> AVX512F transition, where
> > > > > we still have "x" for SSE16 and "v" was introduced as a separate
> > > > > register class for EVEX SSE registers. This way, asm will be
> > > > > compatible, when "r", "m", "o" and "g" are used. The new memory
> > > > > constraint "Bt", should allow new registers, and should be added to
> > > > > the constraint string as a separate constraint, and conditionally
> > > > > enabled by relevant "isa" (AKA "enabled") attribute.
> > > >
> > > > The extended constraint can work for registers, but for memory it is 
> > > > more
> > > > complicated.
> > >
> > > Yes, unfortunately. The compiler assumes that an unchangeable register
> > > class is used for BASE/INDEX registers. I have hit this limitation
> > > when trying to implement memory support for instructions involving
> > > 8-bit high registers (%ah, %bh, %ch, %dh), which do not support REX
> > > registers, also inside memory operand. (You can see the "hack" in e.g.
> > > *extzvqi_mem_rex64" and corresponding peephole2 with the original
> > > *extzvqi pattern). I am aware that dynamic insn-dependent BASE/INDEX
> > > register class is the major limitation in the compiler, so perhaps the
> > > strategy on how to 

Re: [PATCH 11/13] [APX EGPR] Handle legacy insns that only support GPR16 (3/5)

2023-09-01 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 31, 2023 at 5:31 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Aug 31, 2023 at 11:26 AM Richard Biener
>  wrote:
> >
> > On Thu, Aug 31, 2023 at 10:25 AM Hongyu Wang via Gcc-patches
> >  wrote:
> > >
> > > From: Kong Lingling 
> > >
> > > Disable EGPR usage for below legacy insns in opcode map2/3 that have vex
> > > but no evex counterpart.
> > >
> > > insn list:
> > > 1. phminposuw/vphminposuw
> > > 2. ptest/vptest
> > > 3. roundps/vroundps, roundpd/vroundpd,
> > >roundss/vroundss, roundsd/vroundsd
> > > 4. pcmpestri/vpcmpestri, pcmpestrm/vpcmpestrm
> > > 5. pcmpistri/vpcmpistri, pcmpistrm/vpcmpistrm
> >
> > How are GPRs involved in the above?  Or did I misunderstand something?
>
> Following up myself - for the memory operand alternatives I guess.  How about
> simply disabling the memory alternatives when EGPR is active?  Wouldn't
> that simplify the initial patchset a lot?  Re-enabling them when
> deemed important
> could be done as followup then?
>
There are instructions that only support a memory operand but don't
support gpr32 (e.g. xsave).
We still need to handle them in the initial patch.
> Richard.
>
> > > 6. aesimc/vaesimc, aeskeygenassist/vaeskeygenassist
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386-protos.h (x86_evex_reg_mentioned_p): New
> > > prototype.
> > > * config/i386/i386.cc (x86_evex_reg_mentioned_p): New
> > > function.
> > > * config/i386/i386.md (sse4_1_round2): Set attr gpr32 0
> > > and constraint Bt/BM to all non-evex alternatives, adjust
> > > alternative outputs if evex reg is mentioned.
> > > * config/i386/sse.md (_ptest): Set attr gpr32 0
> > > and constraint Bt/BM to all non-evex alternatives.
> > > (ptesttf2): Likewise.
> > > (_round > > (sse4_1_round): Likewise.
> > > (sse4_2_pcmpestri): Likewise.
> > > (sse4_2_pcmpestrm): Likewise.
> > > (sse4_2_pcmpestr_cconly): Likewise.
> > > (sse4_2_pcmpistr): Likewise.
> > > (sse4_2_pcmpistri): Likewise.
> > > (sse4_2_pcmpistrm): Likewise.
> > > (sse4_2_pcmpistr_cconly): Likewise.
> > > (aesimc): Likewise.
> > > (aeskeygenassist): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/apx-legacy-insn-check-norex2.c: Add intrinsic
> > > tests.
> > > ---
> > >  gcc/config/i386/i386-protos.h |  1 +
> > >  gcc/config/i386/i386.cc   | 13 +++
> > >  gcc/config/i386/i386.md   |  3 +-
> > >  gcc/config/i386/sse.md| 93 +--
> > >  .../i386/apx-legacy-insn-check-norex2.c   | 55 ++-
> > >  5 files changed, 132 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > index 78eb3e0f584..bbb219e3039 100644
> > > --- a/gcc/config/i386/i386-protos.h
> > > +++ b/gcc/config/i386/i386-protos.h
> > > @@ -65,6 +65,7 @@ extern bool extended_reg_mentioned_p (rtx);
> > >  extern bool x86_extended_QIreg_mentioned_p (rtx_insn *);
> > >  extern bool x86_extended_reg_mentioned_p (rtx);
> > >  extern bool x86_extended_rex2reg_mentioned_p (rtx);
> > > +extern bool x86_evex_reg_mentioned_p (rtx [], int);
> > >  extern bool x86_maybe_negate_const_int (rtx *, machine_mode);
> > >  extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index f5d642948bc..ec93c5bab97 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -22936,6 +22936,19 @@ x86_extended_rex2reg_mentioned_p (rtx insn)
> > >return false;
> > >  }
> > >
> > > +/* Return true when rtx operands mentions register that must be encoded 
> > > using
> > > +   evex prefix.  */
> > > +bool
> > > +x86_evex_reg_mentioned_p (rtx operands[], int nops)
> > > +{
> > > +  int i;
> > > +  for (i = 0; i < nops; i++)
> > > +if (EXT_REX_SSE_REG_P (operands[i])
> > > +   || x86_extended_rex2reg_mentioned_p (operands[i]))
> > > +  return true;
> > > +  return false;
> > > +}
> > > +
> > >  /* If profitable, negate (without causing overflow) integer constant
> > > of mode MODE at location LOC.  Return true in this case.  */
> > >  bool
> > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > > index 83ad01b43c1..4c305e72389 100644
> > > --- a/gcc/config/i386/i386.md
> > > +++ b/gcc/config/i386/i386.md
> > > @@ -21603,7 +21603,7 @@ (define_expand "significand2"
> > >  (define_insn "sse4_1_round2"
> > >[(set (match_operand:MODEFH 0 "register_operand" "=x,x,x,v,v")
> > > (unspec:MODEFH
> > > - [(match_operand:MODEFH 1 "nonimmediate_operand" "0,x,m,v,m")
> > > + [(match_operand:MODEFH 1 "nonimmediate_operand" "0,x,Bt,v,m")
> > >(match_operand:SI 2 "const_0_to_15_operand")]
> > >   UNSPEC_ROUND))]
> > >"TARGET_SSE4_1"
> > > 

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-01 Thread Hongtao Liu via Gcc-patches
On Fri, Sep 1, 2023 at 5:38 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Fri, Sep 1, 2023 at 11:10 AM Hongyu Wang  wrote:
> >
> > Uros Bizjak via Gcc-patches  于2023年8月31日周四 18:01写道:
> > >
> > > On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
> > >  wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches 
> > > > wrote:
> > > > > From: Kong Lingling 
> > > > >
> > > > > In inline asm, we do not know if the insn can use EGPR, so disable 
> > > > > EGPR
> > > > > usage by default from mapping the common reg/mem constraint to 
> > > > > non-EGPR
> > > > > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
> > > > > for inline asm.
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> > > > >   ix86_md_asm_adjust.
> > > > >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying 
> > > > > the
> > > > >   target option, map reg/mem constraints to non-EGPR constraints.
> > > > >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc   |  44 +++
> > > > >  gcc/config/i386/i386.opt  |   5 +
> > > > >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 
> > > > > ++
> > > > >  3 files changed, 156 insertions(+)
> > > > >  create mode 100644 
> > > > > gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index d26d9ab0d9d..9460ebbfda4 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General 
> > > > > Public License
> > > > >  along with GCC; see the file COPYING3.  If not see
> > > > >  .  */
> > > > >
> > > > > +#define INCLUDE_STRING
> > > > >  #define IN_TARGET_CODE 1
> > > > >
> > > > >  #include "config.h"
> > > > > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec , 
> > > > > vec & /*inputs*/,
> > > > >bool saw_asm_flag = false;
> > > > >
> > > > >start_sequence ();
> > > > > +  /* TODO: Here we just mapped the general r/m constraints to 
> > > > > non-EGPR
> > > > > +   constraints, will eventually map all the usable constraints in 
> > > > > the future. */
> > > >
> > > > I think there should be some constraint which explicitly has all the 32
> > > > GPRs, like there is one for just all 16 GPRs (h), so that regardless of
> > > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm 
> > > > wants.
> > > >
> > > > Also, what about the "g" constraint?  Shouldn't there be another for "g"
> > > > without r16..r31?  What about the various other memory
> > > > constraints ("<", "o", ...)?
> > >
> > > I think we should leave all existing constraints as they are, so "r"
> > > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > > introduce "h" to instructions that have the ability to handle EGPR.
> > > This would be somehow similar to the SSE -> AVX512F transition, where
> > > we still have "x" for SSE16 and "v" was introduced as a separate
> > > register class for EVEX SSE registers. This way, asm will be
> > > compatible, when "r", "m", "o" and "g" are used. The new memory
> > > constraint "Bt", should allow new registers, and should be added to
> > > the constraint string as a separate constraint, and conditionally
> > > enabled by relevant "isa" (AKA "enabled") attribute.
> >
> > The extended constraint can work for registers, but for memory it is more
> > complicated.
>
> Yes, unfortunately. The compiler assumes that an unchangeable register
> class is used for BASE/INDEX registers. I have hit this limitation
> when trying to implement memory support for instructions involving
> 8-bit high registers (%ah, %bh, %ch, %dh), which do not support REX
> registers, also inside memory operand. (You can see the "hack" in e.g.
> *extzvqi_mem_rex64" and corresponding peephole2 with the original
> *extzvqi pattern). I am aware that dynamic insn-dependent BASE/INDEX
> register class is the major limitation in the compiler, so perhaps the
> strategy on how to override this limitation should be discussed with
> the register allocator author first. Perhaps adding an insn attribute
> to insn RTX pattern to specify different BASE/INDEX register sets can
> be a better solution than passing insn RTX to the register allocator.
>
> The above idea still does not solve the asm problem on how to select
> correct BASE/INDEX register set for memory operands.
The current approach disables gpr32 for memory operands in asm_operand
by default, but it can be turned on with the option
ix86_apx_inline_asm_use_gpr32 (users need to guarantee that the instruction
supports gpr32).
Only ~ 5% of total 

Re: [PATCH] Adjust costing of emulated vectorized gather/scatter

2023-08-31 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 30, 2023 at 8:18 PM Richard Biener via Gcc-patches
 wrote:
>
> On Wed, Aug 30, 2023 at 12:38 PM liuhongt via Gcc-patches
>  wrote:
> >
> > r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
> > gather/scatter.
> > 
> > commit 24905a4bd1375ccd99c02510b9f9529015a48315
> > Author: Richard Biener 
> > Date:   Wed Jan 18 11:04:49 2023 +0100
> >
> > Adjust costing of emulated vectorized gather/scatter
> >
> > Emulated gather/scatter behave similar to strided elementwise
> > accesses in that they need to decompose the offset vector
> > and construct or decompose the data vector so handle them
> > the same way, pessimizing the cases with many elements.
> > 
> >
> > But for emulated gather/scatter, the offset vector load/vec_construct has
> > already been counted, and in real cases it's probably eliminated by
> > later optimizers.
> > Also, after decomposing, element loads from contiguous memory could be
> > less bounded compared to normal elementwise loads.
> > The patch decreases the cost a little bit.
> >
> > This will enable gather emulation for below loop with VF=8(ymm)
> >
> > double
> > foo (double* a, double* b, unsigned int* c, int n)
> > {
> >   double sum = 0;
> >   for (int i = 0; i != n; i++)
> > sum += a[i] * b[c[i]];
> >   return sum;
> > }
> >
> > For the loop above, a microbenchmark on ICX shows that
> > emulated gather with VF=8 is 30% faster than emulated gather with
> > VF=4 when the trip count is big enough.
> > It brings back ~4% for 510.parest, which still leaves a ~5% regression
> > compared to the gather instruction due to the throughput bound.
> >
> > For -march=znver1/2/3/4, the change doesn't enable VF=8(ymm) for the
> > loop, VF remains 4(xmm) as before(guess related to their own cost
> > model).
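What an emulated gather does for the loop quoted above can be written out by hand. The following is a portable sketch, not the code GCC generates: `foo_scalar` repeats the quoted loop, while `foo_emulated_gather` and the `LANES` macro are hypothetical and model one vector as a small array.

```c
#define LANES 4

/* Scalar reference: the loop from the message above.  */
double
foo_scalar (const double *a, const double *b, const unsigned int *c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}

/* Hand-written model of an emulated gather with LANES elements.  */
double
foo_emulated_gather (const double *a, const double *b,
                     const unsigned int *c, int n)
{
  double acc[LANES] = { 0 };
  int i = 0;
  for (; i + LANES <= n; i += LANES)
    {
      unsigned int off[LANES];
      double gathered[LANES];
      /* vec_to_scalar: extract each offset from the offset vector.  */
      for (int l = 0; l < LANES; l++)
        off[l] = c[i + l];
      /* scalar_load + vec_construct: load each element and build
         the data vector from the loaded scalars.  */
      for (int l = 0; l < LANES; l++)
        gathered[l] = b[off[l]];
      /* Vector multiply-accumulate with the contiguous load of a[].  */
      for (int l = 0; l < LANES; l++)
        acc[l] += a[i + l] * gathered[l];
    }
  /* Reduce the per-lane partial sums.  */
  double sum = 0;
  for (int l = 0; l < LANES; l++)
    sum += acc[l];
  /* Scalar epilogue for the remaining iterations.  */
  for (; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}
```

Accumulating per lane reassociates the additions, so for general floating-point inputs the result can differ slightly from the scalar loop.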
> >
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/111064
> > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > Decrease cost a little bit for vec_to_scalar(offset vector) in
> > emulated gather.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr111064.c: New test.
> > ---
> >  gcc/config/i386/i386.cc  | 11 ++-
> >  gcc/testsuite/gcc.target/i386/pr111064.c | 12 
> >  2 files changed, 22 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 1bc3f11ff07..337e0f1bfbb 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, 
> > vect_cost_for_stmt kind,
> >   || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == 
> > VMAT_GATHER_SCATTER))
> >  {
> >stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, 
> > misalign);
> > -  stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > +  /* For emulated gather/scatter, offset vector load/vec_construct has
> > +already been counted and in real case, it's probably eliminated by
> > +later optimizer.
> > +Also after decomposing, element loads from contiguous memory
> > +could be less bounded compared to normal elementwise load.  */
> > +  if (kind == vec_to_scalar
> > + && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == 
> > VMAT_GATHER_SCATTER)
> > +   stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
>
> For gather we cost N vector extracts (from the offset vector), N scalar loads
> (the actual data loads) and one vec_construct.
>
> For scatter we cost N vector extracts (from the offset vector),
> N vector extracts (from the data vector) and N scalar stores.
>
> It was intended to penalize the extracts the same way as vector construction.
>
> Your change will adjust all three different decomposition kinds "a bit".
> I realize the scaling by (TYPE_VECTOR_SUBPARTS + 1) is kind-of arbitrary,
> but so is your adjustment, and I don't see why VMAT_GATHER_SCATTER is
> special to your adjustment.
>
> So the comment you put before the special-casing doesn't really make
> sense to me.
>
> For zen4 costing we currently have
>
> *_11 8 times vec_to_scalar costs 576 in body
> *_11 8 times scalar_load costs 96 in body
> *_11 1 times vec_construct costs 792 in body
>
> for zmm
>
> *_11 4 times vec_to_scalar costs 80 in body
> *_11 4 times scalar_load costs 48 in body
> *_11 1 times vec_construct costs 100 in body
>
> for ymm and
>
> *_11 2 times vec_to_scalar costs 24 in body
> *_11 2 times scalar_load costs 24 in body
> *_11 1 times vec_construct costs 12 in body
>
> for xmm.  Even with your adjustment if we were to enable cost comparison 
> between
> vector sizes we'd choose xmm I bet (you can try by re-ordering the modes in
> the ix86_autovectorize_vector_modes hook).  So it feels like a hack.  If you
> think that Icelake should enable 4 element vectorized emulated gather then
> we should disable this 

Re: [PATCH] Fix avx512ne2ps2bf16 wrong code [PR 111127]

2023-08-24 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 24, 2023 at 5:05 PM Hongyu Wang via Gcc-patches
 wrote:
>
> Hi,
>
> For PR 111127, the wrong code was caused by a wrong expander for maskz.
> Correct the parameter order for the avx512ne2ps2bf16_maskz expander.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}.
> OK for master and backport to GCC13?
Ok.
>
> gcc/ChangeLog:
>
> PR target/27
> * config/i386/sse.md (avx512f_cvtne2ps2bf16__maskz):
> Adjust parameter order.
>
> gcc/testsuite/ChangeLog:
>
> PR target/27
> * gcc.target/i386/pr27.c: New test.
> ---
>  gcc/config/i386/sse.md   |  4 ++--
>  gcc/testsuite/gcc.target/i386/pr27.c | 24 
>  2 files changed, 26 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr27.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index da85223a9b4..194dab9a9d0 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -30006,8 +30006,8 @@ (define_expand "avx512f_cvtne2ps2bf16__maskz"
> (match_operand: 3 "register_operand")]
>"TARGET_AVX512BF16"
>  {
> -  emit_insn (gen_avx512f_cvtne2ps2bf16__mask(operands[0], operands[2],
> -operands[1], CONST0_RTX(mode), operands[3]));
> +  emit_insn (gen_avx512f_cvtne2ps2bf16__mask(operands[0], operands[1],
> +operands[2], CONST0_RTX(mode), operands[3]));
>DONE;
>  })
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr27.c 
> b/gcc/testsuite/gcc.target/i386/pr27.c
> new file mode 100644
> index 000..c124bc18bc4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr27.c
> @@ -0,0 +1,24 @@
> +/* PR target/27 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512bf16 -mavx512vl" } */
> +/* { dg-final { scan-assembler-times "vcvtne2ps2bf16\[ 
> \\t\]+\[^\{\n\]*%zmm1, %zmm0, %zmm0\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ 
> \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vcvtne2ps2bf16\[ 
> \\t\]+\[^\{\n\]*%ymm1, %ymm0, %ymm0\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ 
> \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vcvtne2ps2bf16\[ 
> \\t\]+\[^\{\n\]*%xmm1, %xmm0, %xmm0\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ 
> \\t\]+#)" 1 } } */
> +
> +#include <immintrin.h>
> +
> +__m512bh cvttest(__mmask32 k, __m512 a, __m512 b)
> +{
> +  return _mm512_maskz_cvtne2ps_pbh (k,a,b);
> +}
> +
> +__m256bh cvttest2(__mmask16 k, __m256 a, __m256 b)
> +{
> +  return _mm256_maskz_cvtne2ps_pbh (k,a,b);
> +}
> +
> +__m128bh cvttest3(__mmask8 k, __m128 a, __m128 b)
> +{
> +  return _mm_maskz_cvtne2ps_pbh (k,a,b);
> +}
> +
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] Fix target_clone ("arch=graniterapids-d") and target_clone ("arch=arrowlake-s")

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 12:31 PM liuhongt  wrote:
>
> Both "graniterapids-d" and "graniterapids" are attached to
> PROCESSOR_GRANITERAPIDS in processor_alias_table but mapped to
> different __cpu_subtype values in get_intel_cpu.
>
> And get_builtin_code_for_version will try to match the first
> PROCESSOR_GRANITERAPIDS in processor_alias_table, which maps to
> "graniterapids" here.
>
>   else if (new_target->arch_specified && new_target->arch > 0)
>     for (i = 0; i < pta_size; i++)
>       if (processor_alias_table[i].processor == new_target->arch)
>         {
>           const pta *arch_info = &processor_alias_table[i];
>           switch (arch_info->priority)
>             {
>             default:
>               arg_str = arch_info->name;
>
> This mismatch makes dispatch_function_versions check the predicate
> of __builtin_cpu_is ("graniterapids") for "graniterapids-d" and causes
> the issue.
> The patch explicitly adds PROCESSOR_ARROWLAKE_S and
> PROCESSOR_GRANITERAPIDS_D to make a distinction.
>
> "alderlake", "raptorlake" and "meteorlake" share the same ISA, cost and
> tuning, and are mapped to the same __cpu_type/__cpu_subtype in
> get_intel_cpu, so there is no need to add PROCESSOR_RAPTORLAKE and others.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu.
> Ok for trunk(and backport graniterapids-d part to GCC13)?
Push to trunk and backport to GCC13 release branch.
>
> gcc/ChangeLog:
>
> * common/config/i386/i386-common.cc (processor_names): Add new
> member graniterapids-d and arrowlake-s.
> * config/i386/i386-options.cc (processor_alias_table): Update
> table with PROCESSOR_ARROWLAKE_S and
> PROCESSOR_GRANITERAPIDS_D.
> (m_GRANITERAPIDS_D): New macro.
> (m_ARROWLAKE_S): Ditto.
> (m_CORE_AVX512): Add m_GRANITERAPIDS_D.
> (processor_cost_table): Add icelake_cost for
> PROCESSOR_GRANITERAPIDS_D and alderlake_cost for
> PROCESSOR_ARROWLAKE_S.
> * config/i386/x86-tune.def: Handle m_ARROWLAKE_S same as
> m_ARROWLAKE.
> * config/i386/i386.h (enum processor_type): Add new member
> PROCESSOR_GRANITERAPIDS_D and PROCESSOR_ARROWLAKE_S.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Handle
> PROCESSOR_GRANITERAPIDS_D and PROCESSOR_ARROWLAKE_S
> ---
>  gcc/common/config/i386/i386-common.cc | 11 +++--
>  gcc/config/i386/i386-c.cc | 15 +++
>  gcc/config/i386/i386-options.cc   |  6 ++-
>  gcc/config/i386/i386.h|  4 +-
>  gcc/config/i386/x86-tune.def  | 63 ++-
>  5 files changed, 62 insertions(+), 37 deletions(-)
>
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index 12a01704a73..1e11163004b 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -2155,7 +2155,9 @@ const char *const processor_names[] =
>"alderlake",
>"rocketlake",
>"graniterapids",
> +  "graniterapids-d",
>"arrowlake",
> +  "arrowlake-s",
>"intel",
>"lujiazui",
>"geode",
> @@ -2279,13 +2281,14 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, PTA_GRANITERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F},
> -  {"graniterapids-d", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, 
> PTA_GRANITERAPIDS_D,
> -M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D), P_PROC_AVX512F},
> +  {"graniterapids-d", PROCESSOR_GRANITERAPIDS_D, CPU_HASWELL,
> +PTA_GRANITERAPIDS_D, M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D),
> +P_PROC_AVX512F},
>{"arrowlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE,
>  M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE), P_PROC_AVX2},
> -  {"arrowlake-s", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +  {"arrowlake-s", PROCESSOR_ARROWLAKE_S, CPU_HASWELL, PTA_ARROWLAKE_S,
>  M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
> -  {"lunarlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +  {"lunarlake", PROCESSOR_ARROWLAKE_S, CPU_HASWELL, PTA_ARROWLAKE_S,
>  M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
>{"bonnell", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
>  M_CPU_TYPE (INTEL_BONNELL), P_PROC_SSSE3},
> diff --git a/gcc/config/i386/i386-c.cc b/gcc/config/i386/i386-c.cc
> index caef5531593..0e11709ebc5 100644
> --- a/gcc/config/i386/i386-c.cc
> +++ b/gcc/config/i386/i386-c.cc
> @@ -258,6 +258,10 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
>def_or_undef (parse_in, "__graniterapids");
>def_or_undef (parse_in, "__graniterapids__");
>break;
> +case PROCESSOR_GRANITERAPIDS_D:
> +  def_or_undef (parse_in, "__graniterapids_d");
> +  def_or_undef (parse_in, "__graniterapids_d__");
> +  break;
>  case PROCESSOR_ALDERLAKE:
>

Re: [committed] i386: Fix grammar typo in diagnostic

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 4:08 PM Hongtao Liu  wrote:
>
> On Wed, Aug 23, 2023 at 3:02 PM Jonathan Wakely  wrote:
> >
> >
> >
> > On Wed, 23 Aug 2023, 06:15 Hongtao Liu via Libstdc++, 
> >  wrote:
> >>
> >> On Wed, Aug 23, 2023 at 7:28 AM Hongtao Liu  wrote:
> >> >
> >> > On Tue, Aug 8, 2023 at 5:22 AM Marek Polacek via Libstdc++
> >> >  wrote:
> >> > >
> >> > > On Mon, Aug 07, 2023 at 10:12:35PM +0100, Jonathan Wakely via 
> >> > > Gcc-patches wrote:
> >> > > > Committed as obvious.
> >> > > >
> >> > > > Less obvious (to me) is whether it's correct to say "GCC V13" here. I
> >> > > > don't think we refer to a version that way anywhere else, do we?
> >> > > >
> >> > > > Would "since GCC 13.1.0" be better?
> >> > >
> >> > > x86_field_alignment uses
> >> > >
> >> > >   inform (input_location, "the alignment of %<_Atomic %T%> 
> >> > > "
> >> > >   "fields changed in %{GCC 11.1%}",
> >> > >
> >> > > so maybe the below should use %{GCC 13.1%}.  "GCC V13" looks unusual
> >> > > to me.
> >> >  %{GCC 13.1%} sounds reasonable.
> >> Looks like %{ can't be used in a const char*, so use %< instead.
> >>
> >> How about:
> >>
> >> Author: liuhongt 
> >> Date:   Wed Aug 23 07:31:13 2023 +0800
> >>
> >> Adjust GCC V13 to GCC 13.1 in diagnostic.
> >>
> >> gcc/ChangeLog:
> >>
> >> * config/i386/i386.cc (ix86_invalid_conversion): Adjust GCC
> >> V13 to GCC 13.1.
> >>
> >> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> >> index e7822ef6500..88d9d7d537f 100644
> >> --- a/gcc/config/i386/i386.cc
> >> +++ b/gcc/config/i386/i386.cc
> >> @@ -22899,7 +22899,7 @@ ix86_invalid_conversion (const_tree fromtype,
> >> const_tree totype)
> >>   || (TYPE_MODE (totype) == BFmode
> >>   && TYPE_MODE (fromtype) == HImode))
> >> warning (0, "%<__bfloat16%> is redefined from typedef % "
> >> -   "to real %<__bf16%> since GCC V13, be careful of "
> >> +   "to real %<__bf16%> since %, be careful of "
> >>  "implicit conversion between %<__bf16%> and %; "
> >>  "an explicit bitcast may be needed here");
> >>  }
> >
> >
> >
> > Why does it need to be quoted? What's wrong with just saying GCC 13.1 
> > without the %< decoration?
> I'll just remove that.
Pushed to trunk and backported to the GCC13 release branch.
> >
> >
> >
> >>
> >> > >
> >> > > > -- >8 --
> >> > > >
> >> > > > gcc/ChangeLog:
> >> > > >
> >> > > >   * config/i386/i386.cc (ix86_invalid_conversion): Fix grammar.
> >> > > > ---
> >> > > >  gcc/config/i386/i386.cc | 2 +-
> >> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> >> > > > index 50860050049..5d57726e22c 100644
> >> > > > --- a/gcc/config/i386/i386.cc
> >> > > > +++ b/gcc/config/i386/i386.cc
> >> > > > @@ -22890,7 +22890,7 @@ ix86_invalid_conversion (const_tree 
> >> > > > fromtype, const_tree totype)
> >> > > >   warning (0, "%<__bfloat16%> is redefined from typedef 
> >> > > > % "
> >> > > >   "to real %<__bf16%> since GCC V13, be careful of "
> >> > > >"implicit conversion between %<__bf16%> and 
> >> > > > %; "
> >> > > > -  "a explicit bitcast may be needed here");
> >> > > > +  "an explicit bitcast may be needed here");
> >> > > >  }
> >> > > >
> >> > > >/* Conversion allowed.  */
> >> > > > --
> >> > > > 2.41.0
> >> > > >
> >> > >
> >> > > Marek
> >> > >
> >> >
> >> >
> >> > --
> >> > BR,
> >> > Hongtao
> >>
> >>
> >>
> >> --
> >> BR,
> >> Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 4:31 PM Jakub Jelinek  wrote:
>
> On Wed, Aug 23, 2023 at 08:03:58AM +, Jiang, Haochen wrote:
> > We could first work on -mevex512 and then further discuss
> > -mavx10.1-256/512, since -mavx10.1-256/512 are quite controversial.
> >
> > Just to clarify, -mno-evex512 -mavx512f should not enable 512-bit vectors,
> > right?
>
> I think it should enable them because -mavx512f is after it.  But it seems the
> option handling is more complex than I thought, e.g. -mavx512bw -mno-avx512bw
> just cancels each other, rather than
> enabling AVX512BW, AVX512F, AVX2 and all its dependencies (like -mavx512bw
> alone does) and then just disabling AVX512BW (like -mno-avx512bw does).
> But, if one uses separate target pragmas, it behaves like that:
> #pragma GCC target ("avx512bw")
> #ifdef __AVX512F__
> int a;
> #endif
> #ifdef __AVX512BW__
> int b;
> #endif
> #pragma GCC target ("no-avx512bw")
> #ifdef __AVX512F__
> int c;
> #endif
> #ifdef __AVX512BW__
> int d;
> #endif
> The above defines a, b and c vars even without any special -march= or other
> command line option.
>
> So, first important decision would be whether to make EVEX512
> OPTION_MASK_ISA_EVEX512 or OPTION_MASK_ISA2_EVEX512, the former would need
> to move some other ISA flag from the first to second set.
> That OPTION_MASK_ISA*_EVEX512 then should be added to
> OPTION_MASK_ISA_AVX512F_SET or OPTION_MASK_ISA2_AVX512F_SET (but, if it is
> the latter, we also need to do that for tons of other AVX512*_SET),
> and then just arrange for -mavx10.1-256 to enable
> OPTION_MASK_ISA*_AVX512*_SET of everything it needs except the EVEX512 set
> (but, only disable it from the newly added set, not actually act as
> -mavx512{f,bw,...} -mno-evex512).
> OPTION_MASK_ISA*_EVEX512_SET dunno, should it enable OPTION_MASK_ISA_AVX512F
> or just EVEX512?
> And then the UNSET cases...
We can make it OPTION_MASK_ISA2_EVEX512, but not set/unset it in
ix86_handle_option. Instead, in ix86_option_override_internal, after all
the set/unset handling for the existing AVX512*** options, if
OPTION_MASK_ISA_AVX512F is still set and there was no explicit
set/unset of OPTION_MASK_ISA2_EVEX512, we set OPTION_MASK_ISA2_EVEX512.
That would make -mavx512*** implicitly set -mevex512, but when
there's an explicit -mno-evex512, -mavx512f won't set -mevex512 no matter
where -mno-evex512 is put (-mno-evex512 -mavx512f still disables
512-bit).
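
To make the two-phase idea concrete, here is a minimal standalone sketch. It
is only a model of the proposal, not GCC code: struct isa_state,
handle_option, override_internal and evex512_enabled are hypothetical names,
and real option handling works on ISA bitmasks rather than strings. It also
assumes that an explicit -mevex512 simply turns 512-bit support on, which the
thread has not settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the proposal; none of these names exist in GCC.  */
struct isa_state
{
  bool avx512f;          /* Some -mavx512* option is in effect.  */
  bool evex512;          /* 512-bit vectors / 64-bit kmasks allowed.  */
  bool evex512_explicit; /* User wrote -mevex512 or -mno-evex512.  */
};

/* Phase 1, standing in for ix86_handle_option: record each option as
   it is seen, but never derive evex512 from -mavx512* here.  */
static void
handle_option (struct isa_state *s, const char *opt)
{
  if (strcmp (opt, "-mevex512") == 0)
    {
      s->evex512 = true;
      s->evex512_explicit = true;
    }
  else if (strcmp (opt, "-mno-evex512") == 0)
    {
      s->evex512 = false;
      s->evex512_explicit = true;
    }
  else if (strncmp (opt, "-mavx512", 8) == 0)
    s->avx512f = true;
}

/* Phase 2, standing in for ix86_option_override_internal: only if the
   user never mentioned evex512 does AVX512F imply it.  */
static void
override_internal (struct isa_state *s)
{
  if (s->avx512f && !s->evex512_explicit)
    s->evex512 = true;
}

/* Return whether 512-bit support ends up enabled for a command line.  */
bool
evex512_enabled (const char *const *opts, size_t n)
{
  assert (opts != NULL || n == 0);
  struct isa_state s = { false, false, false };
  for (size_t i = 0; i < n; i++)
    handle_option (&s, opts[i]);
  override_internal (&s);
  return s.evex512;
}
```

Because the implication is applied once, after all options have been seen,
the position of -mno-evex512 relative to -mavx512f does not matter in this
model.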
>
> Jakub
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 4:16 PM Jakub Jelinek  wrote:
>
> On Wed, Aug 23, 2023 at 01:57:59AM +, Jiang, Haochen wrote:
> > > > Let's assume there's no delta now, AVX10.1-512 is equal to
> > > > AVX512{F,VL,BW,DQ,CD,BF16,FP16,VBMI,VBMI2,VNNI,IFMA,BITALG,VPOPCNTDQ}
> > > > > other stuff.
> > > > > The current common/config/i386/i386-common.cc OPTION_MASK_ISA*SET* 
> > > > > would be
> > > > > like now, except that the current AVX512* sets imply also 
> > > > > EVEX512/whatever
> > > > > it will be called, that option itself enables nothing (or 
> > > > > TARGET_AVX512F),
> > > > > and unsetting it doesn't disable all the TARGET_AVX512*.
> > > > > -mavx10.1 would enable the AVX512* sets without EVEX512/whatever.
> > > > So for -mavx512bw -mavx10.1-256, -mavx512bw will set EVEX512, but
> > > > -mavx10.1-256 doesn't clear EVEX512 but just enable all AVX512* sets?.
> > > > then the combination basically is equal to AVX10.1-512(AVX512* sets +
> > > > EVEX512)
> > > > If this is your assumption, yes, there's no need for TARGET_AVX10_1.
> >
> > I think we still need that since the current w/o AVX512VL, we will not only
> > enable 512 bit vector instructions but also enable scalar instructions, 
> > which
> > means when it comes to -mavx512bw -mno-evex512, we should enable
> > the scalar function.
> >
> > And scalar functions will also be enabled in AVX10.1-256, we need something
> > to distinguish them out from the ISA set w/o AVX512VL.
>
> Ah, forgot about scalar instructions, even better, then we don't have to do
> that special case.  So, I think TARGET_AVX512F && !TARGET_EVEX512 && 
> !TARGET_AVX512VL
> in general should disable 512-bit modes in ix86_hard_regno_mode_ok.  That
> should prevent the need to replace TARGET_AVX512F to TARGET_EVEX512 on all
> the patterns which refer to 512-bit modes.  Also wonder if it
> wouldn't be easiest to make "v" constraint in that case be equivalent to
> just "x" so that all those hacks to make xmm16+ registers working in various
We can disable the EVEX SSE registers in ix86_conditional_register_usage
when TARGET_AVX512F && !TARGET_EVEX512 && !TARGET_AVX512VL, if we don't
care much about the scalar ones.
> instructions through g modifiers wouldn't trigger.  Sure, that would
> penalize also scalar instructions, but the above case wouldn't be something
> any CPU actually supports, it would be only the common subset of say XeonPhi
> and AVX10.1-256.
>
> Jakub
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 3:33 PM Richard Biener
 wrote:
>
> On Tue, Aug 22, 2023 at 4:36 PM Hongtao Liu  wrote:
> >
> > On Tue, Aug 22, 2023 at 9:54 PM Jakub Jelinek  wrote:
> > >
> > > On Tue, Aug 22, 2023 at 09:35:44PM +0800, Hongtao Liu wrote:
> > > > Ok, then we can't avoid TARGET_AVX10_1 in those existing 256/128-bit
> > > > evex instruction patterns.
> > >
> > > Why?
> > > Internally for md etc. purposes, we should have the current
> > > TARGET_AVX512* etc. ISA flags, plus one new one, whatever we call it
> > > (TARGET_EVEX512 even if it is not completely descriptive because of kandq
> > > etc., or some other name) which says if 512-bit vector modes can be used,
> > > if g modifier can be used, if the 64-bit mask operations can be used etc.
> > > Plus, if AVX10.1 contains any instructions not covered in the preexisting
> > > TARGET_AVX512* sets, TARGET_AVX10_1 which covers that delta, otherwise
> > > keep -mavx10.1 just as an command line option which enables/disables
> > Let's assume there's no delta now, AVX10.1-512 is equal to
> > AVX512{F,VL,BW,DQ,CD,BF16,FP16,VBMI,VBMI2,VNNI,IFMA,BITALG, VPOPCNTDQ}
> > > other stuff.
> > > The current common/config/i386/i386-common.cc OPTION_MASK_ISA*SET* would 
> > > be
> > > like now, except that the current AVX512* sets imply also EVEX512/whatever
> > > it will be called, that option itself enables nothing (or TARGET_AVX512F),
> > > and unsetting it doesn't disable all the TARGET_AVX512*.
> > > -mavx10.1 would enable the AVX512* sets without EVEX512/whatever.
> > So for -mavx512bw -mavx10.1-256, -mavx512bw will set EVEX512, but
> > -mavx10.1-256 doesn't clear EVEX512 but just enable all AVX512* sets?.
>
> As I said earlier -mavx10.1-256 (and -mavx10.1-512) should not exist.
> So instead
> we'd have -mavx512bw -mavx10.1 where -mavx512bw enables evex512 and
> -mavx10.1 will enable the 10.1 ISAs _not affecting_ whether evex512 is
> set or not.
>
> We then have the -mevex512 flag (or whatever name we agree to) to enable
> (or disable) 512bit support.
>
> If you insist on having -mavx10.1-256 that should alias to -mavx10.1 +
> -mno-evex512,
> but Jakub disagrees here, so I'd rather not have it at all.  We could have
I think we can just support -mevex512 for now; avx10.1-256/512 can wait
for a while, considering that it doesn't add new instructions and is
controversial.
Basically, -mno-evex512 is good enough for most needs.
The only part where I disagree with Jakub is that I think for -mavx512f
-mno-evex512 -mavx512bw we need to disable 512-bit; an explicit
-mno-evex512 should take precedence over an implicit yes.
> -mavx10.1-512 aliasing to -mavx10.1 + -mevex512 (Jakub would agree here).
>
> Richard.



-- 
BR,
Hongtao


Re: [committed] i386: Fix grammar typo in diagnostic

2023-08-23 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 3:02 PM Jonathan Wakely  wrote:
>
>
>
> On Wed, 23 Aug 2023, 06:15 Hongtao Liu via Libstdc++,  
> wrote:
>>
>> On Wed, Aug 23, 2023 at 7:28 AM Hongtao Liu  wrote:
>> >
>> > On Tue, Aug 8, 2023 at 5:22 AM Marek Polacek via Libstdc++
>> >  wrote:
>> > >
>> > > On Mon, Aug 07, 2023 at 10:12:35PM +0100, Jonathan Wakely via 
>> > > Gcc-patches wrote:
>> > > > Committed as obvious.
>> > > >
>> > > > Less obvious (to me) is whether it's correct to say "GCC V13" here. I
>> > > > don't think we refer to a version that way anywhere else, do we?
>> > > >
>> > > > Would "since GCC 13.1.0" be better?
>> > >
>> > > x86_field_alignment uses
>> > >
>> > >   inform (input_location, "the alignment of %<_Atomic %T%> "
>> > >   "fields changed in %{GCC 11.1%}",
>> > >
>> > > so maybe the below should use %{GCC 13.1%}.  "GCC V13" looks unusual
>> > > to me.
>> >  %{GCC 13.1%} sounds reasonable.
>> Looks like %{ can't be used in a const char*, so use %< instead.
>>
>> How about:
>>
>> Author: liuhongt 
>> Date:   Wed Aug 23 07:31:13 2023 +0800
>>
>> Adjust GCC V13 to GCC 13.1 in diagnostic.
>>
>> gcc/ChangeLog:
>>
>> * config/i386/i386.cc (ix86_invalid_conversion): Adjust GCC
>> V13 to GCC 13.1.
>>
>> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> index e7822ef6500..88d9d7d537f 100644
>> --- a/gcc/config/i386/i386.cc
>> +++ b/gcc/config/i386/i386.cc
>> @@ -22899,7 +22899,7 @@ ix86_invalid_conversion (const_tree fromtype,
>> const_tree totype)
>>   || (TYPE_MODE (totype) == BFmode
>>   && TYPE_MODE (fromtype) == HImode))
>> warning (0, "%<__bfloat16%> is redefined from typedef % "
>> -   "to real %<__bf16%> since GCC V13, be careful of "
>> +   "to real %<__bf16%> since %, be careful of "
>>  "implicit conversion between %<__bf16%> and %; "
>>  "an explicit bitcast may be needed here");
>>  }
>
>
>
> Why does it need to be quoted? What's wrong with just saying GCC 13.1 without 
> the %< decoration?
I'll just remove that.
>
>
>
>>
>> > >
>> > > > -- >8 --
>> > > >
>> > > > gcc/ChangeLog:
>> > > >
>> > > >   * config/i386/i386.cc (ix86_invalid_conversion): Fix grammar.
>> > > > ---
>> > > >  gcc/config/i386/i386.cc | 2 +-
>> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> > > > index 50860050049..5d57726e22c 100644
>> > > > --- a/gcc/config/i386/i386.cc
>> > > > +++ b/gcc/config/i386/i386.cc
>> > > > @@ -22890,7 +22890,7 @@ ix86_invalid_conversion (const_tree fromtype, 
>> > > > const_tree totype)
>> > > >   warning (0, "%<__bfloat16%> is redefined from typedef % "
>> > > >   "to real %<__bf16%> since GCC V13, be careful of "
>> > > >"implicit conversion between %<__bf16%> and %; "
>> > > > -  "a explicit bitcast may be needed here");
>> > > > +  "an explicit bitcast may be needed here");
>> > > >  }
>> > > >
>> > > >/* Conversion allowed.  */
>> > > > --
>> > > > 2.41.0
>> > > >
>> > >
>> > > Marek
>> > >
>> >
>> >
>> > --
>> > BR,
>> > Hongtao
>>
>>
>>
>> --
>> BR,
>> Hongtao



-- 
BR,
Hongtao


Re: [committed] i386: Fix grammar typo in diagnostic

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 7:28 AM Hongtao Liu  wrote:
>
> On Tue, Aug 8, 2023 at 5:22 AM Marek Polacek via Libstdc++
>  wrote:
> >
> > On Mon, Aug 07, 2023 at 10:12:35PM +0100, Jonathan Wakely via Gcc-patches 
> > wrote:
> > > Committed as obvious.
> > >
> > > Less obvious (to me) is whether it's correct to say "GCC V13" here. I
> > > don't think we refer to a version that way anywhere else, do we?
> > >
> > > Would "since GCC 13.1.0" be better?
> >
> > x86_field_alignment uses
> >
> >   inform (input_location, "the alignment of %<_Atomic %T%> "
> >   "fields changed in %{GCC 11.1%}",
> >
> > so maybe the below should use %{GCC 13.1%}.  "GCC V13" looks unusual
> > to me.
>  %{GCC 13.1%} sounds reasonable.
Looks like %{ can't be used in a const char*, so use %< instead.

How about:

Author: liuhongt 
Date:   Wed Aug 23 07:31:13 2023 +0800

Adjust GCC V13 to GCC 13.1 in diagnostic.

gcc/ChangeLog:

* config/i386/i386.cc (ix86_invalid_conversion): Adjust GCC
V13 to GCC 13.1.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e7822ef6500..88d9d7d537f 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22899,7 +22899,7 @@ ix86_invalid_conversion (const_tree fromtype,
const_tree totype)
  || (TYPE_MODE (totype) == BFmode
  && TYPE_MODE (fromtype) == HImode))
warning (0, "%<__bfloat16%> is redefined from typedef %<short%> "
-   "to real %<__bf16%> since GCC V13, be careful of "
+   "to real %<__bf16%> since %<GCC 13.1%>, be careful of "
 "implicit conversion between %<__bf16%> and %<short%>; "
 "an explicit bitcast may be needed here");
 }


> >
> > > -- >8 --
> > >
> > > gcc/ChangeLog:
> > >
> > >   * config/i386/i386.cc (ix86_invalid_conversion): Fix grammar.
> > > ---
> > >  gcc/config/i386/i386.cc | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 50860050049..5d57726e22c 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -22890,7 +22890,7 @@ ix86_invalid_conversion (const_tree fromtype, 
> > > const_tree totype)
> > >   warning (0, "%<__bfloat16%> is redefined from typedef % "
> > >   "to real %<__bf16%> since GCC V13, be careful of "
> > >"implicit conversion between %<__bf16%> and %; "
> > > -  "a explicit bitcast may be needed here");
> > > +  "an explicit bitcast may be needed here");
> > >  }
> > >
> > >/* Conversion allowed.  */
> > > --
> > > 2.41.0
> > >
> >
> > Marek
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 23, 2023 at 9:58 AM Jiang, Haochen  wrote:
>
> > -Original Message-
> > From: Jakub Jelinek 
> > Sent: Tuesday, August 22, 2023 11:02 PM
> > To: Hongtao Liu 
> > Cc: Richard Biener ; Jiang, Haochen
> > ; ZiNgA BuRgA ; gcc-
> > patc...@gcc.gnu.org
> > Subject: Re: Intel AVX10.1 Compiler Design and Support
> >
> > On Tue, Aug 22, 2023 at 10:35:55PM +0800, Hongtao Liu wrote:
> > > Let's assume there's no delta now, AVX10.1-512 is equal to
> > > AVX512{F,VL,BW,DQ,CD,BF16,FP16,VBMI,VBMI2,VNNI,IFMA,BITALG,VPOPCNTDQ}
> > > > other stuff.
> > > > The current common/config/i386/i386-common.cc OPTION_MASK_ISA*SET* 
> > > > would be
> > > > like now, except that the current AVX512* sets imply also 
> > > > EVEX512/whatever
> > > > it will be called, that option itself enables nothing (or 
> > > > TARGET_AVX512F),
> > > > and unsetting it doesn't disable all the TARGET_AVX512*.
> > > > -mavx10.1 would enable the AVX512* sets without EVEX512/whatever.
> > > So for -mavx512bw -mavx10.1-256, -mavx512bw will set EVEX512, but
> > > -mavx10.1-256 doesn't clear EVEX512 but just enable all AVX512* sets?.
> > > then the combination basically is equal to AVX10.1-512(AVX512* sets +
> > > EVEX512)
> > > If this is your assumption, yes, there's no need for TARGET_AVX10_1.
>
> I think we still need that since the current w/o AVX512VL, we will not only
> enable 512 bit vector instructions but also enable scalar instructions, which
> means when it comes to -mavx512bw -mno-evex512, we should enable
> the scalar function.
>
> And scalar functions will also be enabled in AVX10.1-256, we need something
> to distinguish them out from the ISA set w/o AVX512VL.
Why do we need to distinguish scalar EVEX instructions?
As long as -mavx512XXX -mno-evex512 does not generate zmm/64-bit kmask,
it should be ok.

Assuming there's no delta in AVX10.1, it sounds to me like the design should be:

avx512*          <== mno-evex512 ==  avx512* + mevex512
(no-evex512)                         (original AVX512 stuff)
    /\                                   /\
    || (equal)                           || (equal)
    \/                                   \/
avx10.1-256                          avx10.1-512
    /\                                   /\
    ||                                   ||
    ||                                   ||
  implied                              implied
    ||                                   ||
    ||                                   ||
avx10.2-256      <== implied ==      avx10.2-512
    /\                                   /\
    ||                                   ||
    ||                                   ||
  implied                              implied
    ||                                   ||
    ||                                   ||
avx10.3-256      <== implied ==      avx10.3-512

1. The new instructions in avx10.x should be put in either avx10.x-256
or avx10.x-512 according to their vector/kmask size.
2. -mno-evex512 should disable -mavx10.x-512.
3. -mavx512* will enable -mevex512 by default, but -mavx10.1-256 will
just enable -mavx512* and not -mevex512.
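
One way to read the diagram and the three points above is as a partial order
on (version, width) pairs. The sketch below is only my interpretation of it:
struct avx10_isa, avx10_implies and avx10_apply_no_evex512 are made-up names
for illustration, nothing here is an existing GCC interface, and treating a
512-bit level as a superset of the 256-bit level of the same version is an
assumption.

```c
#include <stdbool.h>

/* Hypothetical encoding: version N stands for avx10.N, width is the
   maximum vector length in bits (256 or 512).  */
struct avx10_isa
{
  int version;
  int width;
};

/* True if enabling HAVE also provides everything in WANT: a newer
   version implies the older ones, and the 512-bit flavour implies
   the 256-bit flavour.  */
bool
avx10_implies (struct avx10_isa have, struct avx10_isa want)
{
  return have.version >= want.version && have.width >= want.width;
}

/* Model of -mno-evex512: keep the version but drop to 256-bit, so
   avx10.x-512 becomes avx10.x-256.  */
struct avx10_isa
avx10_apply_no_evex512 (struct avx10_isa isa)
{
  if (isa.width > 256)
    isa.width = 256;
  return isa;
}
```

Under this reading, avx10.2-512 implies avx10.1-256, while avx10.1-512 does
not imply avx10.2-256, since the latter may contain newer instructions.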

>
> Thx,
> Haochen
>
> >
> > I think that would be my expectation.  -mavx512bw currently implies
> > 512-bit vector support of avx512f and avx512bw, and with -mavx512{bw,vl}
> > also 128-bit/256-bit vector support.  All pre-AVX10 chips which do support
> > AVX512BW support 512-bit vectors.  Now, -mavx10.1 will bring in also
> > vl,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq as you wrote
> > which weren't enabled before, but unless there is some existing or planned
> > CPU which would support 512-bit vectors in avx512f and avx512bw ISAs and
> > only support 128/256-bit vectors in those
> > dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq isas, I think there
> > is no need to differentiate further; the only CPUs which will support both
> > what -mavx512bw and -mavx10.1 requires will be (if there is no delta)
> > either CPUs with 128/256/512-bit vector support of those
> > f,vl,bw,dq,cd,...vpopcntdq ISAs, or AVX10.1-512 ISAs.
> > -mavx512vl -mavx512bw -mno-evex512 -mavx10.1-256 would on the other side
> > disable all 512-bit vector instructions and in the end just mean the
> > same as -mavx10.1-256.
> > For just
> > -mavx512bw -mno-evex512 -mavx10.1-256
> > the question is if that -mno-evex512 turns off also avx512bw/avx512f because
> > avx512vl isn't enabled at that point during processing, or if we do that
> > only at the end as a special case.  Of course, in this exact case there is
> > no difference, because -mavx10.1-256 turns that back on.
> > But it would make a difference on
> > -mavx512bw -mno-evex512 -mavx512vl
> > (when processed right away would disable AVX512BW (because VL isn't on)
> > and in the end enable VL,F including EVEX512, or be equivalent to just
> > -mavx512bw -mavx512vl if processed at the end, because -mavx512vl 

Re: [committed] i386: Fix grammar typo in diagnostic

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 5:22 AM Marek Polacek via Libstdc++
 wrote:
>
> On Mon, Aug 07, 2023 at 10:12:35PM +0100, Jonathan Wakely via Gcc-patches 
> wrote:
> > Committed as obvious.
> >
> > Less obvious (to me) is whether it's correct to say "GCC V13" here. I
> > don't think we refer to a version that way anywhere else, do we?
> >
> > Would "since GCC 13.1.0" be better?
>
> x86_field_alignment uses
>
>   inform (input_location, "the alignment of %<_Atomic %T%> "
>   "fields changed in %{GCC 11.1%}",
>
> so maybe the below should use %{GCC 13.1%}.  "GCC V13" looks unusual
> to me.
 %{GCC 13.1%} sounds reasonable.
>
> > -- >8 --
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386.cc (ix86_invalid_conversion): Fix grammar.
> > ---
> >  gcc/config/i386/i386.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 50860050049..5d57726e22c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22890,7 +22890,7 @@ ix86_invalid_conversion (const_tree fromtype, 
> > const_tree totype)
> >   warning (0, "%<__bfloat16%> is redefined from typedef % "
> >   "to real %<__bf16%> since GCC V13, be careful of "
> >"implicit conversion between %<__bf16%> and %; "
> > -  "a explicit bitcast may be needed here");
> > +  "an explicit bitcast may be needed here");
> >  }
> >
> >/* Conversion allowed.  */
> > --
> > 2.41.0
> >
>
> Marek
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 22, 2023 at 9:35 PM Hongtao Liu  wrote:
>
> On Tue, Aug 22, 2023 at 9:24 PM Richard Biener
>  wrote:
> >
> > On Tue, Aug 22, 2023 at 3:16 PM Jakub Jelinek  wrote:
> > >
> > > On Tue, Aug 22, 2023 at 09:02:29PM +0800, Hongtao Liu wrote:
> > > > > Agreed.  And I still think -mevex512 vs. -mno-evex512 is the best 
> > > > > option
> > > > > name to represent whether the effective ISA set allows 512-bit 
> > > > > vectors or
> > > > > not.  I think -mavx10.1 -mno-avx512cd should be fine.  And, 
> > > > > -mavx10.1-256
> > > > > option IMHO should be in the same spirit to all the others a positive 
> > > > > enablement,
> > > > > not both positive (enable avx512{f,cd,bw,dq,...} and negative 
> > > > > (disallow
> > > > > 512-bit vectors).  So, if one uses -mavx512f -mavx10.1-256, because 
> > > > > the
> > > > > former would allow 512-bit vectors, the latter shouldn't disable 
> > > > > those again
> > > > > because it isn't a -mno-* option.  Sure, instructions which are 
> > > > > specific to
> > > > But there's implicit negative (disallow 512-bit vector), I think
> > >
> > > That is wrong.
> > >
> > > > -mav512f -mavx10.1-256 or -mavx10.1-256 -mavx512f shouldn't enable
> > > > 512-bit vector.
> > >
> > > Because then the -mavx10.1-256 option behaves completely differently from
> > > all the other isa options.
> > >
> > > We have the -march= options which are processed separately, but the normal
> > > ISA options either only enable something (when -mwhatever), or only 
> > > disable something
> > > (when -mno-whatever). -mavx512f -mavx10.1-256 should be a union of those
> > > ISAs, like say -mavx2 -mbmi is, not an intersection or something even
> > > harder to understand.
> > >
> > > > Further, we should disallow a mix of exex512 and non-evex512 (e.g.
> > > > -mavx10.1-512 -mavx10.2-256),they should be a unified separate switch
> > > > that either disallows both or allows both. Instead of some isa
> > > > allowing it and some isa disallowing it.
> > >
> > > No, it will be really terrible user experience if the new options behave
> > > completely differently from everything else.  Because then we'll need to
> Ok, then we can't avoid TARGET_AVX10_1 in those existing 256/128-bit
> evex instruction patterns.
> > > document it in detail how it behaves and users will have hard time to 
> > > figure
> > > it out, and specify what it does not just on the command line, but also 
> > > when
> > > mixing with target attribute or pragmas.  -mavx10.1-512 -mavx10.2-256 
> > > should
> > > be a union of those two ISAs.  Either internally there is an ISA flag 
> > > whether
> > > the instructions in the avx10.2 ISA but not avx10.1 ISA can operate on
> > > 512-bit vectors or not, in that case -mavx10.1-512 -mavx10.2-256 should
> > > enable the AVX10.1 set including 512-bit vectors + just the < 512-bit
> > > instructions from the 10.1 to 10.2 delta, or if there is no such 
> > > separation
> > > internally, it will just enable full AVX10.2-512.  User has asked for it.
> >
> > I think having all three -mavx10.1, -mavx10.1-256 and -mavx10.1-512 is just
> > confusing.  Please separate ISA (avx10.1) from size.  If -m[no-]evex512 
> > isn't
> > good propose something else.  -mavx512f will enable 512bits, -mavx10.1
> > will not unless -mevex512.  -mavx512f -mavx512vl -mno-evex512 will disable
> > 512bits.
> >
> > So scrap -mavx10.1-256 and -mavx10.1-512 please.
The related issue is what -mno-avx10.1-256 and -mno-avx10.1-512 should mean.
For -mno-avx10.1-256, maybe it just disables the whole of avx10.1.
But for avx10.1-512, should it disable the whole of avx10.1 or just EVEX512?
Or maybe we just don't provide -mno-avx10.1-512 and only provide
-mno-avx10.1-256,
and use -mno-evex512 to disable 512-bit vectors.
>
> It sounds to me we would have something like
> avx512XXX
>    ^
>    |
> "independent": TARGET_AVX512VL || TARGET_AVX10_1 will enable
> 128/256-bit instruction.
>    |
> avx10.1-256 <---implied--- avx10.1-512
>      ^                          ^
>      |                          |
>   implied                    implied
>      |                          |
> avx10.2-256 <---implied--- avx10.2-512
>      ^                          ^
>      |                          |
>   implied                    implied
>      |                          |
> avx10.3-256 <---implied--- avx10.3-512
>   .
>
> And put every existing and new instruction under those flags
>
> >
> > Richard.
> >
> > > Jakub
> > >
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 22, 2023 at 9:54 PM Jakub Jelinek  wrote:
>
> On Tue, Aug 22, 2023 at 09:35:44PM +0800, Hongtao Liu wrote:
> > Ok, then we can't avoid TARGET_AVX10_1 in those existing 256/128-bit
> > evex instruction patterns.
>
> Why?
> Internally for md etc. purposes, we should have the current
> TARGET_AVX512* etc. ISA flags, plus one new one, whatever we call it
> (TARGET_EVEX512 even if it is not completely descriptive because of kandq
> etc., or some other name) which says if 512-bit vector modes can be used,
> if g modifier can be used, if the 64-bit mask operations can be used etc.
> Plus, if AVX10.1 contains any instructions not covered in the preexisting
> TARGET_AVX512* sets, TARGET_AVX10_1 which covers that delta, otherwise
> keep -mavx10.1 just as an command line option which enables/disables
Let's assume there's no delta for now; AVX10.1-512 is then equal to
AVX512{F,VL,BW,DQ,CD,BF16,FP16,VBMI,VBMI2,VNNI,IFMA,BITALG,VPOPCNTDQ}
> other stuff.
> The current common/config/i386/i386-common.cc OPTION_MASK_ISA*SET* would be
> like now, except that the current AVX512* sets imply also EVEX512/whatever
> it will be called, that option itself enables nothing (or TARGET_AVX512F),
> and unsetting it doesn't disable all the TARGET_AVX512*.
> -mavx10.1 would enable the AVX512* sets without EVEX512/whatever.
So for -mavx512bw -mavx10.1-256, -mavx512bw will set EVEX512, and
-mavx10.1-256 doesn't clear EVEX512 but just enables all the AVX512* sets?
Then the combination is basically equal to AVX10.1-512 (AVX512* sets +
EVEX512).
If this is your assumption, then yes, there's no need for TARGET_AVX10_1.
(My former understanding was that you wanted -mavx512bw -mavx10.1-256 to
enable all 128/256-bit/scalar variants but only the avx512bw 512-bit
variants; this can't be done without TARGET_AVX10_1.)
So the whole point is that -mavx10.x-256 should neither clear nor set EVEX512,
and -mavx10.x-512 should set EVEX512.
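
To check that I read the proposal correctly, here is a rough model in C of
the "options only ever add to a union" behaviour described above.  It is only
a sketch: the ISA_* bits, apply_option and process_options are hypothetical
names made up for illustration, not GCC's OPTION_MASK_ISA_* handling, and
only a handful of options are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for illustration only; these are not GCC's
   real OPTION_MASK_ISA_* values.  */
enum
{
  ISA_AVX512F = 1u << 0,
  ISA_AVX512VL = 1u << 1,
  ISA_AVX512BW = 1u << 2,
  ISA_EVEX512 = 1u << 3
};

/* Simplified stand-in for "all AVX512* sets implied by AVX10.1".  */
#define ISA_AVX10_1_SETS (ISA_AVX512F | ISA_AVX512VL | ISA_AVX512BW)

/* Apply a single option to the running ISA mask.  Positive options only
   set bits, -mno-* options only clear bits.  */
static unsigned
apply_option (unsigned isa, const char *opt)
{
  assert (opt != NULL);
  /* Existing AVX512* options also imply EVEX512.  */
  if (strcmp (opt, "-mavx512f") == 0)
    return isa | ISA_AVX512F | ISA_EVEX512;
  if (strcmp (opt, "-mavx512vl") == 0)
    return isa | ISA_AVX512F | ISA_AVX512VL | ISA_EVEX512;
  if (strcmp (opt, "-mavx512bw") == 0)
    return isa | ISA_AVX512F | ISA_AVX512BW | ISA_EVEX512;
  /* -mavx10.1-256 neither sets nor clears EVEX512.  */
  if (strcmp (opt, "-mavx10.1-256") == 0)
    return isa | ISA_AVX10_1_SETS;
  /* -mavx10.1-512 additionally sets EVEX512.  */
  if (strcmp (opt, "-mavx10.1-512") == 0)
    return isa | ISA_AVX10_1_SETS | ISA_EVEX512;
  if (strcmp (opt, "-mno-evex512") == 0)
    return isa & ~(unsigned) ISA_EVEX512;
  /* Options not modelled here leave the mask unchanged.  */
  return isa;
}

/* Process the options left to right, as on the command line.  */
static unsigned
process_options (const char **opts, size_t n)
{
  unsigned isa = 0;
  for (size_t i = 0; i < n; ++i)
    isa = apply_option (isa, opts[i]);
  return isa;
}
```

Under this model -mavx512bw -mavx10.1-256 ends up with EVEX512 set,
-mavx10.1-256 alone does not, and -mavx512f -mavx512vl -mno-evex512 ends up
with EVEX512 cleared.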
> At the end of the option processing, if EVEX512/whatever is set but
> TARGET_AVX512VL is not, disable TARGET_AVX512F with all its dependencies,
> because VL is a precondition of 128/256-bit EVEX and if 512-bit EVEX is not
> enabled, there is nothing left.
There are still scalar EVEX instructions under TARGET_AVX512F (and other
non-AVX512VL sets) without EVEX512, so it's not true that nothing is left.
>
> Jakub
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 22, 2023 at 9:24 PM Richard Biener
 wrote:
>
> On Tue, Aug 22, 2023 at 3:16 PM Jakub Jelinek  wrote:
> >
> > On Tue, Aug 22, 2023 at 09:02:29PM +0800, Hongtao Liu wrote:
> > > > Agreed.  And I still think -mevex512 vs. -mno-evex512 is the best option
> > > > name to represent whether the effective ISA set allows 512-bit vectors 
> > > > or
> > > > not.  I think -mavx10.1 -mno-avx512cd should be fine.  And, 
> > > > -mavx10.1-256
> > > > option IMHO should be in the same spirit to all the others a positive 
> > > > enablement,
> > > > not both positive (enable avx512{f,cd,bw,dq,...} and negative (disallow
> > > > 512-bit vectors).  So, if one uses -mavx512f -mavx10.1-256, because the
> > > > former would allow 512-bit vectors, the latter shouldn't disable those 
> > > > again
> > > > because it isn't a -mno-* option.  Sure, instructions which are 
> > > > specific to
> > > But there's implicit negative (disallow 512-bit vector), I think
> >
> > That is wrong.
> >
> > > -mav512f -mavx10.1-256 or -mavx10.1-256 -mavx512f shouldn't enable
> > > 512-bit vector.
> >
> > Because then the -mavx10.1-256 option behaves completely differently from
> > all the other isa options.
> >
> > We have the -march= options which are processed separately, but the normal
> > ISA options either only enable something (when -mwhatever), or only disable 
> > something
> > (when -mno-whatever). -mavx512f -mavx10.1-256 should be a union of those
> > ISAs, like say -mavx2 -mbmi is, not an intersection or something even
> > harder to understand.
> >
> > > Further, we should disallow a mix of exex512 and non-evex512 (e.g.
> > > -mavx10.1-512 -mavx10.2-256),they should be a unified separate switch
> > > that either disallows both or allows both. Instead of some isa
> > > allowing it and some isa disallowing it.
> >
> > No, it will be really terrible user experience if the new options behave
> > completely differently from everything else.  Because then we'll need to
Ok, then we can't avoid TARGET_AVX10_1 in those existing 256/128-bit
evex instruction patterns.
> > document it in detail how it behaves and users will have hard time to figure
> > it out, and specify what it does not just on the command line, but also when
> > mixing with target attribute or pragmas.  -mavx10.1-512 -mavx10.2-256 should
> > be a union of those two ISAs.  Either internally there is an ISA flag 
> > whether
> > the instructions in the avx10.2 ISA but not avx10.1 ISA can operate on
> > 512-bit vectors or not, in that case -mavx10.1-512 -mavx10.2-256 should
> > enable the AVX10.1 set including 512-bit vectors + just the < 512-bit
> > instructions from the 10.1 to 10.2 delta, or if there is no such separation
> > internally, it will just enable full AVX10.2-512.  User has asked for it.
>
> I think having all three -mavx10.1, -mavx10.1-256 and -mavx10.1-512 is just
> confusing.  Please separate ISA (avx10.1) from size.  If -m[no-]evex512 isn't
> good propose something else.  -mavx512f will enable 512bits, -mavx10.1
> will not unless -mevex512.  -mavx512f -mavx512vl -mno-evex512 will disable
> 512bits.
>
> So scrap -mavx10.1-256 and -mavx10.1-512 please.

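It sounds to me we would have something like
avx512XXX
   ^
   |
"independent": TARGET_AVX512VL || TARGET_AVX10_1 will enable
128/256-bit instruction.
   |
avx10.1-256 <---implied--- avx10.1-512
     ^                          ^
     |                          |
  implied                    implied
     |                          |
avx10.2-256 <---implied--- avx10.2-512
     ^                          ^
     |                          |
  implied                    implied
     |                          |
avx10.3-256 <---implied--- avx10.3-512
  .

And put every existing and new instruction under those flags

>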
> Richard.
>
> > Jakub
> >



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 22, 2023 at 4:34 PM Jakub Jelinek  wrote:
>
> On Tue, Aug 22, 2023 at 09:36:15AM +0200, Richard Biener via Gcc-patches 
> wrote:
> > I think internally we should have conditional 512bit support work across
> > AVX512 and AVX10.
> >
> > I also think it makes sense to _internally_ have AVX10.1 (10.1!) just
> > enable the respective AVX512 features.  AVX10.2 would then internally
> > cover the ISA extensions added in 10.2 only.  Both would reduce the
> > redundancy and possibly make providing inter-operation between
> > AVX10.1 (10.1!) and AVX512 to the user easier.  I see AVX 10.1 (10.1!)
> > just as "re-branding" latest AVX512, so we should treat it that way
> > (making it an alias to the AVX512 features).
> >
> > Whether we want allow -mavx10.1 -mno-avx512cd or whether
> > we only allow the "positive" -mavx512f -mavx512... (omitting avx512cd)
> > is an entirely separate
> > question.  But I think to not wreck the core idea (more interoperability,
> > here between small/big cores) we absolutely have to
> > provide a subset of avx10.1 but with disabled 512bit vectors which
> > effectively means AVX512 with disabled 512bit support.
>
> Agreed.  And I still think -mevex512 vs. -mno-evex512 is the best option
> name to represent whether the effective ISA set allows 512-bit vectors or
> not.  I think -mavx10.1 -mno-avx512cd should be fine.  And, -mavx10.1-256
> option IMHO should be in the same spirit to all the others a positive 
> enablement,
> not both positive (enable avx512{f,cd,bw,dq,...} and negative (disallow
> 512-bit vectors).  So, if one uses -mavx512f -mavx10.1-256, because the
> former would allow 512-bit vectors, the latter shouldn't disable those again
> because it isn't a -mno-* option.  Sure, instructions which are specific to
But there's an implicit negative (disallow 512-bit vectors); I think
-mavx512f -mavx10.1-256 or -mavx10.1-256 -mavx512f shouldn't enable
512-bit vectors.
Further, we should disallow a mix of evex512 and non-evex512 (e.g.
-mavx10.1-512 -mavx10.2-256); there should be a unified separate switch
that either disallows both or allows both, instead of some ISAs
allowing it and some ISAs disallowing it.
> AVX10.1 (aren't present in any currently existing AVX512* ISA set) might be
> enabled only in 128/256 bit variants if we differentiate that level.
> But, if one uses -mavx2 -mavx10.1-256, because no AVX512* has been enabled
> it can enable all the AVX10.1 implied AVX512* parts without EVEX.512.
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-08-22 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 22, 2023 at 5:05 PM Richard Biener via Gcc-patches
 wrote:
>
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF ;
>   result_4 = BIT_INSERT_EXPR ;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> The patch FAILs one case in gcc.target/i386/avx512fp16-vmovsh-1a.c
> where we now expand from
>
>  __A_28 = VEC_PERM_EXPR ;
>
> instead of
>
>  _28 = BIT_FIELD_REF ;
>  __A_29 = BIT_INSERT_EXPR ;
>
> producing a vpblendw instruction instead of the expected vmovsh.  That's
> either a missed vec_perm_const expansion optimization or even better,
> an improvement - Zen4 for example has 4 ports to execute vpblendw
> but only 3 for executing vmovsh and both instructions have the same size.
Looks like Sapphire Rapids has only 2 ports for executing vpblendw
but 3 for vmovsh. I guess we may need micro-architecture tuning for
this specific permutation.
As for vmovss/vpblendd, they're equivalent on SPR; both have 3 ports.
The change for the testcase is OK; I'll handle it with an incremental patch.
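
For readers following along, the selector that the match.pd pattern quoted
below builds can be modelled in plain C.  This is only a sketch under
assumptions: build_insert_extract_sel, vec_perm and MAX_NUNITS are
hypothetical names, fixed-size int arrays stand in for vector types, and
element indices (relt, ielt) are passed directly instead of being derived
from bit positions.

```c
#include <assert.h>

/* Arbitrary upper bound on lanes for this sketch.  */
#define MAX_NUNITS 16

/* Build the permute selector for "insert b[relt] into lane ielt of a".
   Lane i keeps a[i], except lane ielt, which selects element relt of the
   second input, encoded as nunits + relt.  */
static void
build_insert_extract_sel (unsigned nunits, unsigned relt, unsigned ielt,
                          unsigned sel[])
{
  assert (nunits <= MAX_NUNITS && relt < nunits && ielt < nunits);
  for (unsigned i = 0; i < nunits; ++i)
    sel[i] = (i == ielt) ? nunits + relt : i;
}

/* Two-input permute: indices 0..nunits-1 select from a, indices
   nunits..2*nunits-1 select from b.  */
static void
vec_perm (unsigned nunits, const int a[], const int b[],
          const unsigned sel[], int out[])
{
  for (unsigned i = 0; i < nunits; ++i)
    out[i] = sel[i] < nunits ? a[sel[i]] : b[sel[i] - nunits];
}
```

For example, with a = {10, 11, 12, 13}, b = {20, 21, 22, 23}, relt = 2 and
ielt = 1, the selector is {0, 6, 2, 3} and the permute yields
{10, 22, 12, 13}, the same as extracting b[2] and inserting it into a[1].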
>
> The patch XFAILs the sub-testcase - is that OK or should I update
> the expected instruction to a vpblend?
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
> PR tree-optimization/94864
> PR tree-optimization/94865
> PR tree-optimization/93080
> * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> for vector insertion from vector extraction.
>
> * gcc.target/i386/pr94864.c: New testcase.
> * gcc.target/i386/pr94865.c: Likewise.
> * gcc.target/i386/avx512fp16-vmovsh-1a.c: XFAIL.
> * gcc.dg/tree-ssa/forwprop-40.c: Likewise.
> * gcc.dg/tree-ssa/forwprop-41.c: Likewise.
> ---
>  gcc/match.pd  | 25 +++
>  gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c   | 14 +++
>  gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c   | 16 
>  .../gcc.target/i386/avx512fp16-vmovsh-1a.c|  2 +-
>  gcc/testsuite/gcc.target/i386/pr94864.c   | 13 ++
>  gcc/testsuite/gcc.target/i386/pr94865.c   | 13 ++
>  6 files changed, 82 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 86fdc606a79..6e083021b27 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8006,6 +8006,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   wi::to_wide (@ipos) + isize))
>  (BIT_FIELD_REF @0 @rsize @rpos)
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +  && types_match (@0, @1)
> +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> + unsigned HOST_WIDE_INT elsz
> +   = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> + poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> + poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> + unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> + vec_perm_builder builder;
> + builder.new_vector (nunits, nunits, 1);
> + for (unsigned i = 0; i < nunits; ++i)
> +   builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> + vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, 
> false))
> +(vec_perm @0 @1 { vec_perm_indices_to_tree
> +(build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>(simplify
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
> new file mode 100644
> index 000..7513497f552
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized -Wno-psabi -w" } */
> +
> +#define vector __attribute__((__vector_size__(16) ))
> +
> +vector int g(vector int a)
> +{
> +  int b = a[0];
> +  a[0] = b;
> +  return a;
> +}
> +
> +/* { dg-final { 

Re: Loop-ch improvements, part 3

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Jul 17, 2023 at 5:18 PM Richard Biener via Gcc-patches
 wrote:
>
> On Fri, 14 Jul 2023, Jan Hubicka wrote:
>
> > Hi,
> > loop-ch currently does analysis using ranger for all loops to identify
> > candidates and then follows by phase where headers are duplicated (which
> > breaks SSA and ranger).  The second stage does more analysis (to see how
> > many BBs we want to duplicate) but can't use ranger and thus misses
> > information about static conditionals.
> >
> > This patch pushes all analysis into the first stage. We record how many
> > BBs to duplicate and the second stage just duplicates as it is told.
> > This makes it possible to extend the range query to basic
> > blocks that are not headers.  This is easy to do, since we already do
> > a path-specific query, so we only need to extend the path by the headers we
> > decided to duplicate earlier.
> >
> > This makes it possible to track situations where an exit is always
> > false in the first iteration for tests not in the original loop header.
> > Doing so lets us update the profile better and use better heuristics.  In
> > particular I changed the logic as follows
> >   1) should_duplicate_loop_header_p counts size of duplicated region.  When 
> > we
> >  know that a given conditional will be constant true or constant false 
> > either
> >  in the duplicated region, by range query, or in the loop body after
> >  duplication (since it is loop invariant), we do not account it to code 
> > size
> >  costs
> >   2) we don't need to account for loop invariant computations that will be duplicated
> >  as they will become fully invariant
> >  (maybe we want to have some cap for register pressure eventually?)
> >   3) optimize_size logic is now different.  Originally we started 
> > duplicating
> >  iff the first conditional was known to be true by ranger query, but 
> > then
> >  we used same limits as for -O2.
> >
> >  I now simply lower limits to 0. This means that every conditional
> >  in duplicated sequence must be either loop invariant or constant when
> >  duplicated and we only duplicate statements computing loop invariants
> >  and those we account to 0 size anyway,
> >
> > This makes the code IMO more streamlined (and hopefully will let us merge
> > bits with the loop peeling logic), but makes little difference in practice.
> > The problem is that in loop:
> >
> > void test2();
> > void test(int n)
> > {
> >   for (int i = 0; n && i < 10; i++)
> > test2();
> > }
> >
> > We produce:
> >[local count: 1073741824 freq: 9.090909]:
> >   # i_4 = PHI <0(2), i_9(3)>
> >   _1 = n_7(D) != 0;
> >   _2 = i_4 <= 9;
> >   _3 = _1 & _2;
> >   if (_3 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> > and do not understand that the final conditional is a combination of a 
> > conditional
> > that is always true in first iteration and a conditional that is loop 
> > invariant.
> >
> > This is also the case of
> > void test2();
> > void test(int n)
> > {
> >   for (int i = 0; n; i++)
> > {
> >   if (i > 10)
> > break;
> >   test2();
> > }
> > }
> > Which we turn to the earlier case in ifcombine.
> >
> > With ifcombine disabled, however, things work as expected.  This is something
> > I plan to handle incrementally.  However, extending the loop-ch and peeling
> > passes to understand such combined conditionals is still not good enough: at
> > the time ifcombine merged the two conditionals we lost profile information
> > on how often n is 0, so we can't recover the correct profile or know the
> > expected number of iterations after the transform.
This patch caused the following regressions:
FAIL: gcc.target/i386/pr93089-2.c scan-assembler vmulps[^\n\r]*zmm
FAIL: gcc.target/i386/pr93089-3.c scan-assembler vmulps[^\n\r]*zmm
The testcase is quite simple; I'm not sure why it regressed.

/* PR target/93089 */
/* { dg-do compile } */
/* { dg-options "-O2 -fopenmp-simd -mtune=znver1" } */
/* { dg-final { scan-assembler "vmulps\[^\n\r]*zmm" } } */
/* { dg-final { scan-assembler "vmulps\[^\n\r]*ymm" } } */

#pragma omp declare simd notinbranch
float
foo (float x, float y)
{
  return x * y;
}

> >
> > Bootstrapped/regtested x86_64-linux, OK?
>
> OK.
>
> Thanks,
> Richard.
>
> > Honza
> >
> >
> > gcc/ChangeLog:
> >
> >   * tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
> >   for queries not in headers.
> >   (static_loop_exit): Add basic block parameter; update use of
> >   edge_range_query.
> >   (should_duplicate_loop_header_p): Add ranger and static_exits
> >   parameter.  Do not account statements that will be optimized
> >   out after duplication in overall size. Add ranger query to
> >   find static exits.
> >   (update_profile_after_ch):  Take static_exits hash set instead of
> >   single eliminated_edge.
> >   (ch_base::copy_headers): Do all analysis in the first pass;
> >   remember invariant_exits and 

Re: [PATCH] Fix FAIL: gcc.target/i386/pr87007-5.c

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 8:59 PM Richard Biener  wrote:
>
> On Mon, 21 Aug 2023, Hongtao Liu wrote:
>
> > On Mon, Aug 21, 2023 at 8:25 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > The following fixes the gcc.target/i386/pr87007-5.c testcase which
> > > changed code generation again after the recent sinking improvements.
> > > We now have
> > >
> > > vxorps  %xmm0, %xmm0, %xmm0
> > > vsqrtsd d2(%rip), %xmm0, %xmm0
> > >
> > > and an unnecessary xor again in one case, the other vsqrtsd has
> > > a register source and a properly zeroing load:
> > >
> > > vmovsd  d3(%rip), %xmm0
> > > testl   %esi, %esi
> > > jg  .L11
> > > .L3:
> > > vsqrtsd %xmm0, %xmm0, %xmm0
> > >
> > > the following patch XFAILs the scan.  I'm not sure what's at
> > > fault here, there are no loops in the CFG, but somehow
> > > r84:DF=sqrt(['d2']) gets a pxor but r84:DF=sqrt(r83:DF)
> > > doesn't.  I guess I don't really understand what
> > > remove_partial_avx_dependency is supposed to do so can't
> > > really assess whether the pxor is necessary or not.
> > There's a false dependency on xmm0 when the source operand in the
> > pattern is memory, the pattern only takes xmm0 as dest, but the output
> > instruction takes xmm0 also as input(the second source operand),
> > that's why we need an pxor here.
>
> OK, so XFAIL is wrong, we should instead scan for one xorps then
> (like it was in the past).
>
> > When the source operand in the pattern is register_operand, we can
> > reuse the register_operand for the second source operand. The
> > instructions here are not very obvious, the more representative one
> > should be vsqrtsd %xmm1, %xmm1(rused one), %xmm0.
> > >
> > > OK?
> > Can we add -fno-XXX to disable the optimization to make the assembly
> > more stable?
>
> Not sure, we could feed GIMPLE IR to RTL expansion instead of
> feeding a complex testcase through the pipeline, but I'm not sure
> what we were originally supposed to test (the PR trail is a bit
> large).
>
> > Or current codegen should be optimal(for the sinking), then Ok for the 
> > patch.
>
> So like the following (I've just adjusted the comments to reflect the
> pxor is necessary).
>
> OK?
OK.
>
> Richard.
>
> From 7bed9399ae736c20a677ccf7e7fc4d2751a32327 Mon Sep 17 00:00:00 2001
> From: Richard Biener 
> Date: Mon, 21 Aug 2023 14:09:48 +0200
> Subject: [PATCH] Fix FAIL: gcc.target/i386/pr87007-5.c
> To: gcc-patches@gcc.gnu.org
>
> The following fixes the gcc.target/i386/pr87007-5.c testcase which
> changed code generation again after the recent sinking improvements.
> We now have
>
> vxorps  %xmm0, %xmm0, %xmm0
> vsqrtsd d2(%rip), %xmm0, %xmm0
>
> and a necessary xor again in one case, the other vsqrtsd has
> a register source and a properly zeroing load:
>
> vmovsd  d3(%rip), %xmm0
> testl   %esi, %esi
> jg  .L11
> .L3:
> vsqrtsd %xmm0, %xmm0, %xmm0
>
> the following patch adjusts the scan.
>
> * gcc.target/i386/pr87007-5.c: Update comment, adjust subtest.
> ---
>  gcc/testsuite/gcc.target/i386/pr87007-5.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr87007-5.c 
> b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> index a6cdf11522e..8f2dc947f6c 100644
> --- a/gcc/testsuite/gcc.target/i386/pr87007-5.c
> +++ b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> @@ -1,6 +1,8 @@
>  /* { dg-do compile } */
>  /* { dg-options "-Ofast -march=skylake-avx512 -mfpmath=sse 
> -fno-tree-vectorize -fdump-tree-cddce3-details -fdump-tree-lsplit-optimized" 
> } */
> -/* Load of d2/d3 is hoisted out, vrndscalesd will reuse loades register to 
> avoid partial dependence.  */
> +/* Load of d2/d3 is hoisted out, the loop is split, store of d1 and sqrt
> +   are sunk out of the loop and the loop is elided.  One vsqrtsd with
> +   memory operand needs a xor to avoid partial dependence.  */
>
>  #include
>
> @@ -17,4 +19,4 @@ foo (int n, int k)
>
>  /* { dg-final { scan-tree-dump "optimized: loop split" "lsplit" } } */
>  /* { dg-final { scan-tree-dump-times "removing loop" 2 "cddce3" } } */
> -/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> --
> 2.35.3
>


-- 
BR,
Hongtao


Re: [PATCH] Fix FAIL: gcc.target/i386/pr87007-5.c

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 8:40 PM Hongtao Liu  wrote:
>
> On Mon, Aug 21, 2023 at 8:25 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > The following fixes the gcc.target/i386/pr87007-5.c testcase which
> > changed code generation again after the recent sinking improvements.
> > We now have
> >
> > vxorps  %xmm0, %xmm0, %xmm0
> > vsqrtsd d2(%rip), %xmm0, %xmm0
> >
> > and an unnecessary xor again in one case, the other vsqrtsd has
> > a register source and a properly zeroing load:
> >
> > vmovsd  d3(%rip), %xmm0
> > testl   %esi, %esi
> > jg  .L11
> > .L3:
> > vsqrtsd %xmm0, %xmm0, %xmm0
> >
> > the following patch XFAILs the scan.  I'm not sure what's at
> > fault here, there are no loops in the CFG, but somehow
> > r84:DF=sqrt(['d2']) gets a pxor but r84:DF=sqrt(r83:DF)
> > doesn't.  I guess I don't really understand what
> > remove_partial_avx_dependency is supposed to do so can't
> > really assess whether the pxor is necessary or not.
> There's a false dependency on xmm0 when the source operand in the
> pattern is memory, the pattern only takes xmm0 as dest, but the output
> instruction takes xmm0 also as input(the second source operand),
> that's why we need an pxor here.
> When the source operand in the pattern is register_operand, we can
> reuse the register_operand for the second source operand. The
> instructions here are not very obvious, the more representative one
> should be vsqrtsd %xmm1, %xmm1(rused one), %xmm0.
And there's no false dependence here.
> >
> > OK?
> Can we add -fno-XXX to disable the optimization to make the assembly
> more stable?
> Or current codegen should be optimal(for the sinking), then Ok for the patch.
>
> >
> > * gcc.target/i386/pr87007-5.c: Update comment, XFAIL
> > subtest.
> > ---
> >  gcc/testsuite/gcc.target/i386/pr87007-5.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr87007-5.c 
> > b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> > index a6cdf11522e..5902616d1f1 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr87007-5.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> > @@ -1,6 +1,8 @@
> >  /* { dg-do compile } */
> >  /* { dg-options "-Ofast -march=skylake-avx512 -mfpmath=sse 
> > -fno-tree-vectorize -fdump-tree-cddce3-details 
> > -fdump-tree-lsplit-optimized" } */
> > -/* Load of d2/d3 is hoisted out, vrndscalesd will reuse loades register to 
> > avoid partial dependence.  */
> > +/* Load of d2/d3 is hoisted out, the loop is split, store of d1 and sqrt
> > +   are sunk out of the loop and the loop is elided.  One vsqrtsd with
> > +   memory operand will need a xor to avoid partial dependence.  */
> >
> >  #include
> >
> > @@ -17,4 +19,4 @@ foo (int n, int k)
> >
> >  /* { dg-final { scan-tree-dump "optimized: loop split" "lsplit" } } */
> >  /* { dg-final { scan-tree-dump-times "removing loop" 2 "cddce3" } } */
> > -/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > --
> > 2.35.3
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Fix FAIL: gcc.target/i386/pr87007-5.c

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 8:25 PM Richard Biener via Gcc-patches
 wrote:
>
> The following fixes the gcc.target/i386/pr87007-5.c testcase which
> changed code generation again after the recent sinking improvements.
> We now have
>
> vxorps  %xmm0, %xmm0, %xmm0
> vsqrtsd d2(%rip), %xmm0, %xmm0
>
> and an unnecessary xor again in one case, the other vsqrtsd has
> a register source and a properly zeroing load:
>
> vmovsd  d3(%rip), %xmm0
> testl   %esi, %esi
> jg  .L11
> .L3:
> vsqrtsd %xmm0, %xmm0, %xmm0
>
> the following patch XFAILs the scan.  I'm not sure what's at
> fault here, there are no loops in the CFG, but somehow
> r84:DF=sqrt(['d2']) gets a pxor but r84:DF=sqrt(r83:DF)
> doesn't.  I guess I don't really understand what
> remove_partial_avx_dependency is supposed to do so can't
> really assess whether the pxor is necessary or not.
There's a false dependency on xmm0 when the source operand in the
pattern is memory: the pattern only takes xmm0 as the destination, but the
output instruction also takes xmm0 as an input (the second source operand).
That's why we need a pxor here.
When the source operand in the pattern is a register_operand, we can
reuse that register for the second source operand. The
instructions here are not very obvious; a more representative one
would be vsqrtsd %xmm1, %xmm1 (the reused one), %xmm0.
>
> OK?
Can we add -fno-XXX to disable the optimization and make the assembly
more stable?
Or, if the current codegen is optimal (for the sinking), then the patch is OK.

>
> * gcc.target/i386/pr87007-5.c: Update comment, XFAIL
> subtest.
> ---
>  gcc/testsuite/gcc.target/i386/pr87007-5.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr87007-5.c 
> b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> index a6cdf11522e..5902616d1f1 100644
> --- a/gcc/testsuite/gcc.target/i386/pr87007-5.c
> +++ b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> @@ -1,6 +1,8 @@
>  /* { dg-do compile } */
>  /* { dg-options "-Ofast -march=skylake-avx512 -mfpmath=sse 
> -fno-tree-vectorize -fdump-tree-cddce3-details -fdump-tree-lsplit-optimized" 
> } */
> -/* Load of d2/d3 is hoisted out, vrndscalesd will reuse loades register to 
> avoid partial dependence.  */
> +/* Load of d2/d3 is hoisted out, the loop is split, store of d1 and sqrt
> +   are sunk out of the loop and the loop is elided.  One vsqrtsd with
> +   memory operand will need a xor to avoid partial dependence.  */
>
>  #include
>
> @@ -17,4 +19,4 @@ foo (int n, int k)
>
>  /* { dg-final { scan-tree-dump "optimized: loop split" "lsplit" } } */
>  /* { dg-final { scan-tree-dump-times "removing loop" 2 "cddce3" } } */
> -/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> --
> 2.35.3



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 5:35 PM Richard Biener
 wrote:
>
> On Mon, Aug 21, 2023 at 10:28 AM Hongtao Liu  wrote:
> >
> > On Mon, Aug 21, 2023 at 4:09 PM Jakub Jelinek  wrote:
> > >
> > > On Mon, Aug 21, 2023 at 09:36:16AM +0200, Richard Biener via Gcc-patches 
> > > wrote:
> > > > > On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > With the proposed design of these switches, how would I restrict 
> > > > > > AVX10.1
> > > > > > to particular AVX-512 subsets?
> > > > > We can't, avx10.1 is taken as an indivisible ISA which contains all
> > > > > AVX512 related instructions.
> > > > >
> > > > > > We’ve been taking these cases as bugs (but yes, intrinsics are 
> > > > > > still allowed, so in some cases it might prove difficult to 
> > > > > > guarantee this).
> > > > > intel sde support avx10.1-256 target which can be used to validate the
> > > > > binary(if there's invalid 512-bit vector register or 64-bit kmask
> > > > > register is used).
> > > > > > I don’t see any other way of doing what you want within the 
> > > > > > constraints of this design.
> > > > > It looks like the requirement is that we want a
> > > > > -mavx10-vector-width=256(or maybe reuse -mprefer-vector-width=256)
> > > > > option that acts on the original -mavx512XXX option to produce
> > > > > avx10.1-256 compatible binary. we can't use -mavx10.1-256 since it may
> > > > > include avx512fp16 directives and thus not be backward compatible
> > > > > SKX/CLX/ICX.
> > > >
> > > > Yes.  Note we cannot really re-purpose -mprefer-vector-width=256 since 
> > > > that
> > > > would also make uses of 512bit intrinsics ill-formed.  So we'd need a 
> > > > new
> > > > flag that would restrict AVX512VL to 256bit, possibly using a common 
> > > > internal
> > > > flag for this and the -mavx10.1-256 vector size effect.
> > > >
> > > > Maybe -mdisable-vector-width-512 or -mavx512vl-for-avx10.1-256 or
> > > > -mavx512vl-256?  Writing these the last looks most sensible to me?
> > > > Note it should combine with -mavx512vl to -mavx512vl-256 to make
> > > > -march=native -mavx512vl-256 work (I think we should also allow the
> > > > flag together with -mavx10.1*?)
> > > >
> > > > mavx512vl-256
> > > > Target ...
> > > > Disable the 512bit vector ISA subset of AVX512 or AVX10, enable
> > > > the 256bit vector ISA subset of AVX512.
> > >
> > > Wouldn't it be better to have it similarly to other ISA options as 
> > > something
> > > positive, say -mevex512 (the ISA docs talk about EVEX.512, EVEX.256 and
> > > EVEX.128)?
> > > Have -mavx512f (and anything that implies it right now) imply also 
> > > -mevex512
> > > but allow -mno-evex512 which wouldn't unset everything dependent on
> > > -mavx512f.  There is one gotcha, if -mavx512vl isn't enabled in the end,
> > > then -mavx512f -mno-evex512 should disable whole TARGET_AVX512F because
> > > nothing is left.
> > > TARGET_EVEX512 then would guard all TARGET_AVX512* intrinsics which 
> > > operate
> > > on 512-bit vector registers or 64-bit mask registers (in addition to the
> > > other TARGET_AVX512* options, perhaps except TARGET_AVX512F), whether the
> > > 512-bit modes can be used etc.
> > We have an undocumented option mavx10-max-512bit.
> >
> > ;; Only for implementation use
> > mavx10-max-512bit
> > Target Mask(ISA2_AVX10_512BIT) Var(ix86_isa_flags2) Undocumented Save
> > Indicates 512 bit vector width support for AVX10.
>
> Ah, missed that, but ...
>
> > Currently it's only used for AVX10 only, maybe we can extend it to
> > existing AVX512*** FLAGS.
> > so users can use -mavx512XXX -mno-avx10-max-512bit to get avx10.1-256
> > compatible binaries.
>
> ... -mno-avx10-max-512bit sounds awkward, no-..-max implies the max doesn't
> apply, so what is it then?
>
> If you think -mavx512vl-256 isn't good then maybe -mavx-width-512
> and -mno-avx-width-512 would be better (applying to both avx512 and avx10).
> I chose -mavx512vl-256 because of the existing -mavx10.1-256.  Btw,
> will we then have -mavx10.2-256 as well?  Do we allow -mavx10.1-512
> -mavx10.2-256 then, thus just enable 256bit for 10.2 extensions to 10.1?!
We're only allowing a single vector width.
-mavx10.1-512 -mavx10.2-256 will only enable -mavx10.2-256 + -mavx10.1-256.
> I think we opened up too many holes here and the options should be fixed
> to decouple the size from the base ISA.
I see, we can try to use -mavx-max-512bit (maybe under another name) to
decouple the size from the base ISA,
and make
 -mavx10.1-256 imply all of -mavx512XXX + -mno-avx-max-512bit,
 -mavx10.1-512 imply all of -mavx512XXX + -mavx-max-512bit.
Then -mavx512vl-256 is simply -mavx512vl + -mno-avx-max-512bit.
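
A rough C model of that decoupling, purely as a sketch: the ISA_* bits
and the opt_* helpers are hypothetical names for illustration, not GCC's
actual option masks, and the three listed subsets stand in for "all
-mavx512XXX". The only point is that the vector width lives in one bit
that is independent of the base ISA bits.

```c
#include <stdbool.h>

/* Hypothetical flag bits, not GCC's real masks.  */
enum
{
  ISA_AVX512F    = 1u << 0,
  ISA_AVX512VL   = 1u << 1,
  ISA_AVX512FP16 = 1u << 2,
  ISA_MAX_512BIT = 1u << 3   /* the size bit, decoupled from the ISA */
};

#define ISA_ALL_AVX512 (ISA_AVX512F | ISA_AVX512VL | ISA_AVX512FP16)

/* -mavx10.1-256: all AVX512 subsets, 512-bit width disabled.  */
static unsigned
opt_avx10_1_256 (unsigned flags)
{
  return (flags | ISA_ALL_AVX512) & ~(unsigned) ISA_MAX_512BIT;
}

/* -mavx10.1-512: all AVX512 subsets, 512-bit width enabled.  */
static unsigned
opt_avx10_1_512 (unsigned flags)
{
  return flags | ISA_ALL_AVX512 | ISA_MAX_512BIT;
}

/* -mavx512vl-256: -mavx512vl (assumed to imply AVX512F) without the
   512-bit width.  */
static unsigned
opt_avx512vl_256 (unsigned flags)
{
  return (flags | ISA_AVX512F | ISA_AVX512VL) & ~(unsigned) ISA_MAX_512BIT;
}

/* True when 512-bit vectors may be used under this flag set.  */
static bool
has_512bit (unsigned flags)
{
  return (flags & ISA_AVX512F) && (flags & ISA_MAX_512BIT);
}
```

In this model -mavx512vl-256 yields a subset of what -mavx10.1-256
enables, which is the common subset the thread is looking for.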

Lots of work to do, but still not too late for GCC14.1
>
> What variable we map this to internally doesn't really matter but yes,
> we'd need to guard 512bit patterns with (AVX512VL || AVX10) && 
> 512-enabled-flag
>
> Richard.
>
> > From the implementation perspective, we 

Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 4:38 PM Jakub Jelinek  wrote:
>
> On Mon, Aug 21, 2023 at 04:28:20PM +0800, Hongtao Liu wrote:
> > We have an undocumented option mavx10-max-512bit.
>
> How it is called internally is one thing, but it is weird to use
> avx10 in an option name which would be meant for finding common subset
> of -mavx512xxx and -mavx10.1-256.
We can have an alias for the name, but internally use the same bit
since they're doing the same thing.
And the option is somewhat orthogonal to AVX512XXX/AVX10, it only
cares about the vector/kmask size.
>
> Jakub
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 21, 2023 at 4:09 PM Jakub Jelinek  wrote:
>
> On Mon, Aug 21, 2023 at 09:36:16AM +0200, Richard Biener via Gcc-patches 
> wrote:
> > > On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > With the proposed design of these switches, how would I restrict AVX10.1
> > > > to particular AVX-512 subsets?
> > > We can't, avx10.1 is taken as an indivisible ISA which contains all
> > > AVX512 related instructions.
> > >
> > > > We’ve been taking these cases as bugs (but yes, intrinsics are still 
> > > > allowed, so in some cases it might prove difficult to guarantee this).
> > > intel sde support avx10.1-256 target which can be used to validate the
> > > binary(if there's invalid 512-bit vector register or 64-bit kmask
> > > register is used).
> > > > I don’t see any other way of doing what you want within the constraints 
> > > > of this design.
> > > It looks like the requirement is that we want a
> > > -mavx10-vector-width=256(or maybe reuse -mprefer-vector-width=256)
> > > option that acts on the original -mavx512XXX option to produce
> > > avx10.1-256 compatible binary. we can't use -mavx10.1-256 since it may
> > > include avx512fp16 directives and thus not be backward compatible
> > > SKX/CLX/ICX.
> >
> > Yes.  Note we cannot really re-purpose -mprefer-vector-width=256 since that
> > would also make uses of 512bit intrinsics ill-formed.  So we'd need a new
> > flag that would restrict AVX512VL to 256bit, possibly using a common 
> > internal
> > flag for this and the -mavx10.1-256 vector size effect.
> >
> > Maybe -mdisable-vector-width-512 or -mavx512vl-for-avx10.1-256 or
> > -mavx512vl-256?  Writing these the last looks most sensible to me?
> > Note it should combine with -mavx512vl to -mavx512vl-256 to make
> > -march=native -mavx512vl-256 work (I think we should also allow the
> > flag together with -mavx10.1*?)
> >
> > mavx512vl-256
> > Target ...
> > Disable the 512bit vector ISA subset of AVX512 or AVX10, enable
> > the 256bit vector ISA subset of AVX512.
>
> Wouldn't it be better to have it similarly to other ISA options as something
> positive, say -mevex512 (the ISA docs talk about EVEX.512, EVEX.256 and
> EVEX.128)?
> Have -mavx512f (and anything that implies it right now) imply also -mevex512
> but allow -mno-evex512 which wouldn't unset everything dependent on
> -mavx512f.  There is one gotcha, if -mavx512vl isn't enabled in the end,
> then -mavx512f -mno-evex512 should disable whole TARGET_AVX512F because
> nothing is left.
> TARGET_EVEX512 then would guard all TARGET_AVX512* intrinsics which operate
> on 512-bit vector registers or 64-bit mask registers (in addition to the
> other TARGET_AVX512* options, perhaps except TARGET_AVX512F), whether the
> 512-bit modes can be used etc.
We have an undocumented option mavx10-max-512bit.

;; Only for implementation use
mavx10-max-512bit
Target Mask(ISA2_AVX10_512BIT) Var(ix86_isa_flags2) Undocumented Save
Indicates 512 bit vector width support for AVX10.

Currently it's used for AVX10 only; maybe we can extend it to the
existing AVX512*** flags,
so users can use -mavx512XXX -mno-avx10-max-512bit to get avx10.1-256
compatible binaries.

From the implementation perspective, we need to guard all 512-bit
vector patterns/builtins/intrinsics with both AVX512XXX and
TARGET_AVX10_512BIT,
and similarly for register allocation, parameter passing, return values,
vector_mode_supported_p, the gather/scatter hooks, and all other hooks.
After that, -mavx10-max-512bit will divide the existing AVX512 into 2
parts: AVX512XXX-256 and AVX512XXX-512.
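
As a sketch of what such a guard could look like in a
vector_mode_supported_p-style check (all names below are hypothetical,
none of this is existing GCC code): 512-bit modes need both the ISA bit
and the width bit, 128/256-bit EVEX modes need AVX512VL, and Jakub's
gotcha is handled by dropping AVX512F when neither the 512-bit width nor
AVX512VL is left.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration only.  */
enum
{
  F_AVX512F    = 1u << 0,
  F_AVX512VL   = 1u << 1,
  F_MAX_512BIT = 1u << 2
};

/* If AVX512F is on but neither 512-bit vectors nor AVX512VL is
   available, nothing usable remains, so AVX512F is dropped.  */
static unsigned
normalize_flags (unsigned flags)
{
  if ((flags & F_AVX512F)
      && !(flags & F_MAX_512BIT)
      && !(flags & F_AVX512VL))
    flags &= ~(unsigned) F_AVX512F;
  return flags;
}

/* Whether an EVEX vector of BITS bits may be used under FLAGS.  */
static bool
evex_vector_bits_supported (unsigned flags, int bits)
{
  flags = normalize_flags (flags);
  if (!(flags & F_AVX512F))
    return false;
  switch (bits)
    {
    case 512:
      return (flags & F_MAX_512BIT) != 0;
    case 128:
    case 256:
      return (flags & F_AVX512VL) != 0;
    default:
      return false;
    }
}
```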


>
> Jakub
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-20 Thread Hongtao Liu via Gcc-patches
On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
 wrote:
>
> Hi,
>
> With the proposed design of these switches, how would I restrict AVX10.1
> to particular AVX-512 subsets?
We can't, avx10.1 is taken as an indivisible ISA which contains all
AVX512 related instructions.

> We’ve been taking these cases as bugs (but yes, intrinsics are still allowed, 
> so in some cases it might prove difficult to guarantee this).
Intel SDE supports the avx10.1-256 target, which can be used to validate the
binary (i.e. whether any invalid 512-bit vector register or 64-bit kmask
register is used).
> I don’t see any other way of doing what you want within the constraints of 
> this design.
It looks like the requirement is that we want a
-mavx10-vector-width=256 (or maybe reuse -mprefer-vector-width=256)
option that acts on the original -mavx512XXX options to produce an
avx10.1-256 compatible binary. We can't use -mavx10.1-256 since it may
include avx512fp16 instructions and thus not be backward compatible with
SKX/CLX/ICX.
>
> For example, usage of the |_mm256_rol_epi32| intrinsic should be
> compatible on any AVX10/256 implementation, /as well as /any AVX-512VL
> without AVX10 implementation (e.g. Skylake-X).  But how do I signal that
> I want compatibility with both these targets?
>
>   * |-mavx512vl| lets the compiler use 512-bit registers -> incompatible
> with 256-bit AVX10.
>   * |-mavx512vl -mprefer-vector-width=256| might steer the compiler away
> from 512-bit registers, but I don't think it guarantees it.
>   * |-mavx10.1-256| lets the compiler use all Sapphire Rapids AVX-512
> features at 256-bit wide (so in theory, it could choose to compile
> it with |vpshldd|) -> incompatible with Skylake-X.
>   * |-mavx10.1-256 -mno-avx512fp16 -mno-avx512...| will emit a warning
> and ignore the attempts at disabling AVX-512 subsets.
>   * |-mavx10.1-256 -mavx512vl| takes the /union/ of the features, not
> the /intersection./
>
> Is there something like |-mavx512vl -mmax-vector-width=256|, or am I
> misunderstanding the situation?
>
> Thanks!



-- 
BR,
Hongtao


Re: [PATCH] i386: Add AVX2 pragma wrapper for AVX512DQVL intrins

2023-08-18 Thread Hongtao Liu via Gcc-patches
On Fri, Aug 18, 2023 at 2:01 PM Haochen Jiang via Gcc-patches
 wrote:
>
> Hi all,
>
> This patch aims to fix PR111051, which actually make sure that AVX2
> intrins are visible to AVX512/AVX10 intrins under any circumstances.
>
> I will also apply the same fix on AVX512DQ scalar intrins.
>
> Regtested on on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> PR target/111051
>
> gcc/ChangeLog:
>
> * config/i386/avx512vldqintrin.h: Push AVX2 when AVX2 is
> disabled.
>
> gcc/testsuite/ChangeLog:
>
> PR target/111051
> * gcc.target/i386/pr111051-1.c: New test.
> ---
>  gcc/config/i386/avx512vldqintrin.h | 11 +++
>  gcc/testsuite/gcc.target/i386/pr111051-1.c | 11 +++
>  2 files changed, 22 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr111051-1.c
>
> diff --git a/gcc/config/i386/avx512vldqintrin.h 
> b/gcc/config/i386/avx512vldqintrin.h
> index 1fbf93a0b52..db900ebf467 100644
> --- a/gcc/config/i386/avx512vldqintrin.h
> +++ b/gcc/config/i386/avx512vldqintrin.h
> @@ -28,6 +28,12 @@
>  #ifndef _AVX512VLDQINTRIN_H_INCLUDED
>  #define _AVX512VLDQINTRIN_H_INCLUDED
>
> +#if !defined(__AVX2__)
> +#pragma GCC push_options
> +#pragma GCC target("avx2")
> +#define __DISABLE_AVX2__
> +#endif /* __AVX2__ */
> +
>  extern __inline __m256i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>  _mm256_cvttpd_epi64 (__m256d __A)
> @@ -2002,4 +2008,9 @@ _mm256_maskz_insertf64x2 (__mmask8 __U, __m256d __A, 
> __m128d __B,
>
>  #endif
>
> +#ifdef __DISABLE_AVX2__
> +#undef __DISABLE_AVX2__
> +#pragma GCC pop_options
> +#endif /* __DISABLE_AVX2__ */
> +
>  #endif /* _AVX512VLDQINTRIN_H_INCLUDED */
> diff --git a/gcc/testsuite/gcc.target/i386/pr111051-1.c 
> b/gcc/testsuite/gcc.target/i386/pr111051-1.c
> new file mode 100644
> index 000..973007043cb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr111051-1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +
> +#include 
> +
> +#pragma GCC target("avx512vl,avx512dq")
> +
> +void foo (__m256i i)
> +{
> +  volatile __m256d v1 = _mm256_cvtepi64_pd (i);
> +}
> +
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-16 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 14, 2023 at 10:40 AM Hongtao Liu  wrote:
>
> On Fri, Aug 11, 2023 at 2:02 PM liuhongt via Gcc-patches
>  wrote:
> >
> > Rename original use_gather to use_gather_8parts, Support
> > -mtune-ctrl={,^}use_gather to set/clear tune features
> > use_gather_{2parts, 4parts, 8parts}. Support the new option -mgather
> > as alias of -mtune-ctrl=, use_gather, ^use_gather.
> >
> > Similar for use_scatter.
> >
> > How about this version?
> I'll commit the patch if there's no objections in the next 24 hours.
Pushed to trunk and backported to releases/gcc-{13,12,11}.
Note that for GCC11, the backported patch only supports -m{no-,}gather since
the branch doesn't have scatter tunings.
For GCC12/GCC13, both -m{no-,}gather and -m{no-,}scatter are supported.
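
For readers skimming the patch, the umbrella behaviour of use_gather can
be sketched as a small standalone C function. This is a simplified
model of the idea, not the code from parse_mtune_ctrl_str: the enum,
the array and apply_tune_token are illustrative names, and only the
gather case is modelled.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative stand-ins for the three gather tunings.  */
enum tune_feature
{
  USE_GATHER_2PARTS,
  USE_GATHER_4PARTS,
  USE_GATHER_8PARTS,
  TUNE_LAST
};

static bool tune_features[TUNE_LAST];

/* Apply one -mtune-ctrl= token.  A leading '^' clears instead of sets.
   "use_gather" acts on all three per-size gather tunings at once.
   Returns false for tokens this sketch does not know.  */
static bool
apply_tune_token (const char *token)
{
  bool clear = false;

  if (*token == '^')
    {
      token++;
      clear = true;
    }

  if (strcmp (token, "use_gather") != 0)
    return false;

  for (int i = USE_GATHER_2PARTS; i <= USE_GATHER_8PARTS; i++)
    tune_features[i] = !clear;
  return true;
}
```

Under this model -mgather corresponds to the token "use_gather" and
-mno-gather to "^use_gather".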
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386-builtins.cc
> > (ix86_vectorize_builtin_gather): Adjust for use_gather_8parts.
> > * config/i386/i386-options.cc (parse_mtune_ctrl_str):
> > Set/Clear tune features use_{gather,scatter}_{2parts, 4parts,
> > 8parts} for -mtune-crtl={,^}{use_gather,use_scatter}.
> > * config/i386/i386.cc (ix86_vectorize_builtin_scatter): Adjust
> > for use_scatter_8parts
> > * config/i386/i386.h (TARGET_USE_GATHER): Rename to ..
> > (TARGET_USE_GATHER_8PARTS): .. this.
> > (TARGET_USE_SCATTER): Rename to ..
> > (TARGET_USE_SCATTER_8PARTS): .. this.
> > * config/i386/x86-tune.def (X86_TUNE_USE_GATHER): Rename to
> > (X86_TUNE_USE_GATHER_8PARTS): .. this.
> > (X86_TUNE_USE_SCATTER): Rename to
> > (X86_TUNE_USE_SCATTER_8PARTS): .. this.
> > * config/i386/i386.opt: Add new options mgather, mscatter.
> > ---
> >  gcc/config/i386/i386-builtins.cc |  2 +-
> >  gcc/config/i386/i386-options.cc  | 54 +++-
> >  gcc/config/i386/i386.cc  |  2 +-
> >  gcc/config/i386/i386.h   |  8 ++---
> >  gcc/config/i386/i386.opt |  8 +
> >  gcc/config/i386/x86-tune.def |  4 +--
> >  6 files changed, 56 insertions(+), 22 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-builtins.cc 
> > b/gcc/config/i386/i386-builtins.cc
> > index 356b6dfd5fb..8a0b8dfe073 100644
> > --- a/gcc/config/i386/i386-builtins.cc
> > +++ b/gcc/config/i386/i386-builtins.cc
> > @@ -1657,7 +1657,7 @@ ix86_vectorize_builtin_gather (const_tree mem_vectype,
> >   ? !TARGET_USE_GATHER_2PARTS
> >   : (known_eq (TYPE_VECTOR_SUBPARTS (mem_vectype), 4u)
> >  ? !TARGET_USE_GATHER_4PARTS
> > -: !TARGET_USE_GATHER)))
> > +: !TARGET_USE_GATHER_8PARTS)))
> >  return NULL_TREE;
> >
> >if ((TREE_CODE (index_type) != INTEGER_TYPE
> > diff --git a/gcc/config/i386/i386-options.cc 
> > b/gcc/config/i386/i386-options.cc
> > index 127ee24203c..b8d038af69d 100644
> > --- a/gcc/config/i386/i386-options.cc
> > +++ b/gcc/config/i386/i386-options.cc
> > @@ -1731,20 +1731,46 @@ parse_mtune_ctrl_str (struct gcc_options *opts, 
> > bool dump)
> >curr_feature_string++;
> >clear = true;
> >  }
> > -  for (i = 0; i < X86_TUNE_LAST; i++)
> > -{
> > -  if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> > -{
> > -  ix86_tune_features[i] = !clear;
> > -  if (dump)
> > -fprintf (stderr, "Explicitly %s feature %s\n",
> > - clear ? "clear" : "set", 
> > ix86_tune_feature_names[i]);
> > -  break;
> > -}
> > -}
> > -  if (i == X86_TUNE_LAST)
> > -   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> > -  clear ? curr_feature_string - 1 : curr_feature_string);
> > +
> > +  if (!strcmp (curr_feature_string, "use_gather"))
> > +   {
> > + ix86_tune_features[X86_TUNE_USE_GATHER_2PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_GATHER_4PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_GATHER_8PARTS] = !clear;
> > + if (dump)
> > +   fprintf (stderr, "Explicitly %s features use_gather_2parts,"
> > +" use_gather_4parts, use_gather_8parts\n",
> > +clear ? "clear" : "set");
> > +
> > +   }
> > +  else if (!strcmp (curr_feature_string, "use_scatter"))
> > +   {
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_2PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_4PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_8PARTS] = !clear;
> > + if (dump)
> > +   fprintf (stderr, "Explicitly %s features use_scatter_2parts,"
> > +" use_scatter_4parts, use_scatter_8parts\n",
> > +clear ? "clear" : "set");
> > +   }
> > +  else
> > +   {
> > + for (i = 0; i < X86_TUNE_LAST; i++)
> > +   {
> > + if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> > +   {

Re: [PATCH] Software mitigation: Disable gather generation in vectorization for GDS affected Intel Processors.

2023-08-16 Thread Hongtao Liu via Gcc-patches
On Fri, Aug 11, 2023 at 8:38 AM liuhongt  wrote:
>
> For more details of GDS (Gather Data Sampling), refer to
> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
>
> After microcode update, there's performance regression. To avoid that,
> the patch disables gather generation in autovectorization but uses
> gather scalar emulation instead.
>
> Ready push to trunk and backport.
> any comments?
Pushed to trunk and backported to releases/gcc-{11,12,13}.
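
The effect of adding m_GDS to the tuning masks can be illustrated with a
toy model. The processor list and names below are a made-up subset for
the example, not the real PROCESSOR_* enumeration: a tuning is enabled
for a processor when that processor's bit is set in the tuning mask, so
or-ing m_GDS into the negated set turns gather off for the affected
processors.

```c
#include <stdbool.h>

/* Toy processor list, for illustration only.  */
enum proc
{
  PROC_HASWELL,
  PROC_SKYLAKE,
  PROC_ICELAKE_SERVER,
  PROC_ZNVER3
};

#define M(p) (1ull << (p))

/* Processors treated as GDS affected in this toy model.  */
#define M_GDS (M (PROC_SKYLAKE) | M (PROC_ICELAKE_SERVER))

/* Gather is used for every processor except the listed ones.  */
static const unsigned long long use_gather_mask
  = ~(M (PROC_ZNVER3) | M_GDS);

/* Whether the vectorizer would use gather when tuning for TUNE.  */
static bool
use_gather_for (enum proc tune)
{
  return (use_gather_mask & M (tune)) != 0;
}
```

This is also why the adjusted testcases add -mtune=haswell: the ISA
still comes from -march=skylake, but the tuning decision is taken for a
processor that is not in the m_GDS set.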
>
> gcc/ChangeLog:
>
> * config/i386/i386-options.cc (m_GDS): New macro.
> * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Don't
> enable for m_GDS.
> (X86_TUNE_USE_GATHER_4PARTS): Ditto.
> (X86_TUNE_USE_GATHER): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx2-gather-2.c: Adjust options to keep
> gather vectorization.
> * gcc.target/i386/avx2-gather-6.c: Ditto.
> * gcc.target/i386/avx512f-pr88464-1.c: Ditto.
> * gcc.target/i386/avx512f-pr88464-5.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-1.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-11.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-3.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-9.c: Ditto.
> * gcc.target/i386/pr88531-1b.c: Ditto.
> * gcc.target/i386/pr88531-1c.c: Ditto.
> ---
>  gcc/config/i386/i386-options.cc | 5 +
>  gcc/config/i386/x86-tune.def| 6 +++---
>  gcc/testsuite/gcc.target/i386/avx2-gather-2.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx2-gather-6.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512f-pr88464-1.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512f-pr88464-5.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-1.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-11.c | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-3.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-9.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/pr88531-1b.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/pr88531-1c.c  | 2 +-
>  12 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 127ee24203c..e6ba33c370d 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -141,6 +141,11 @@ along with GCC; see the file COPYING3.  If not see
>  #define m_ARROWLAKE (HOST_WIDE_INT_1U<<PROCESSOR_ARROWLAKE)
>  #define m_CORE_ATOM (m_SIERRAFOREST | m_GRANDRIDGE)
>  #define m_INTEL (HOST_WIDE_INT_1U<<PROCESSOR_INTEL)
> +/* Gather Data Sampling / CVE-2022-40982 / INTEL-SA-00828.
> +   Software mitigation.  */
> +#define m_GDS (m_SKYLAKE | m_SKYLAKE_AVX512 | m_CANNONLAKE \
> +  | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
> +  | m_TIGERLAKE | m_COOPERLAKE | m_ROCKETLAKE)
>
>  #define m_LUJIAZUI (HOST_WIDE_INT_1U<<PROCESSOR_LUJIAZUI)
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 40e04ecddbf..22d26bb0030 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -491,7 +491,7 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, 
> "avoid_4byte_prefixes",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE
> -   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC))
> +   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
> elements.  */
> @@ -502,7 +502,7 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, 
> "use_scatter_2parts",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE
> -   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC))
> +   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
> elements.  */
> @@ -513,7 +513,7 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, 
> "use_scatter_4parts",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_ALDERLAKE | m_ARROWLAKE
> -   | m_CORE_ATOM | m_GENERIC))
> +   | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
> elements.  */
> diff --git a/gcc/testsuite/gcc.target/i386/avx2-gather-2.c 
> b/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> index ad5ef73107c..978924b0f57 100644
> --- a/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -fdump-tree-vect-details -march=skylake" } */
> +/* { dg-options "-O3 -fdump-tree-vect-details -march=skylake -mtune=haswell" 
> } */
>
>  #include 

Re: [PATCH 6/6] Support AVX10.1 for AVX512DQ+AVX512VL intrins

2023-08-15 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 3:23 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-vextractf64x2-1.c: New test.
> * gcc.target/i386/avx10_1-vextracti64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vfpclasspd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vfpclassps-1.c: Ditto.
> * gcc.target/i386/avx10_1-vinsertf64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vinserti64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vrangepd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vrangeps-1.c: Ditto.
> * gcc.target/i386/avx10_1-vreducepd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vreduceps-1.c: Ditto.
OK for all 6 patches (please wait an extra 24 hours before committing,
in case there are objections).
> ---
>  .../gcc.target/i386/avx10_1-vextractf64x2-1.c | 18 
>  .../gcc.target/i386/avx10_1-vextracti64x2-1.c | 19 
>  .../gcc.target/i386/avx10_1-vfpclasspd-1.c| 21 ++
>  .../gcc.target/i386/avx10_1-vfpclassps-1.c| 21 ++
>  .../gcc.target/i386/avx10_1-vinsertf64x2-1.c  | 18 
>  .../gcc.target/i386/avx10_1-vinserti64x2-1.c  | 18 
>  .../gcc.target/i386/avx10_1-vrangepd-1.c  | 27 +
>  .../gcc.target/i386/avx10_1-vrangeps-1.c  | 27 +
>  .../gcc.target/i386/avx10_1-vreducepd-1.c | 29 +++
>  .../gcc.target/i386/avx10_1-vreduceps-1.c | 29 +++
>  10 files changed, 227 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vfpclassps-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vinsertf64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vinserti64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vrangepd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vrangeps-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vreducepd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vreduceps-1.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
> new file mode 100644
> index 000..4c7e54dc198
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"  1 } } */
> +
> +#include 
> +
> +volatile __m256d x;
> +volatile __m128d y;
> +
> +void extern
> +avx10_1_test (void)
> +{
> +  y = _mm256_extractf64x2_pd (x, 1);
> +  y = _mm256_mask_extractf64x2_pd (y, 2, x, 1);
> +  y = _mm256_maskz_extractf64x2_pd (2, x, 1);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
> new file mode 100644
> index 000..c0bd7700d52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"  1 } } */
> +
> +#include 
> +
> +volatile __m256i x;
> +volatile __m128i y;
> +
> +void extern
> +avx10_1_test (void)
> +{
> +  y = _mm256_extracti64x2_epi64 (x, 1);
> +  y = _mm256_mask_extracti64x2_epi64 (y, 2, x, 1);
> +  y = _mm256_maskz_extracti64x2_epi64 (2, x, 1);
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
> new file mode 100644
> index 000..806ba800023
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vfpclasspdy\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n^k\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vfpclasspdx\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n^k\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vfpclasspdy\[ 
> 

Re: [PATCH 3/3] Emit a warning when AVX10 options conflict in vector width

2023-08-15 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 3:13 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * config/i386/driver-i386.cc (host_detect_local_cpu):
> Do not append -mno-avx10-max-512bit for -march=native.
> * common/config/i386/i386-common.cc
> (ix86_check_avx10_vector_width): New function to check isa_flags
> to emit a warning when there is a conflict in AVX10 options for
> vector width.
> (ix86_handle_option): Add check for avx10.1-256 and avx10.1-512.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-15.c: New test.
> * gcc.target/i386/avx10_1-16.c: Ditto.
> * gcc.target/i386/avx10_1-17.c: Ditto.
> * gcc.target/i386/avx10_1-18.c: Ditto.
> ---
OK (please wait an extra 24 hours before committing, in case there are objections).
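
The "latter option wins, with a warning" rule from the ChangeLog can be
summarised in a small standalone sketch. This is a simplified model, not
the code in ix86_check_avx10_vector_width: struct width_state and
select_width are hypothetical names, and the real patch works on the
isa_flags2 bit masks rather than two booleans.

```c
#include <stdbool.h>

/* Simplified stand-in for the width-related option state.  */
struct width_state
{
  bool max_512bit;    /* current width: true = 512, false = 256 */
  bool explicit_set;  /* a width was already chosen explicitly */
};

/* Record a new width choice.  Returns true when it conflicts with an
   earlier explicit choice, i.e. when a warning would be emitted.
   The later option always takes effect.  */
static bool
select_width (struct width_state *s, bool want_512)
{
  bool conflict = s->explicit_set && s->max_512bit != want_512;

  s->max_512bit = want_512;
  s->explicit_set = true;
  return conflict;
}
```

For example, -mavx10.1-256 followed by -mavx10.1-512 reports a conflict
on the second option and ends up with 512 as the vector width.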
>  gcc/common/config/i386/i386-common.cc  | 20 
>  gcc/config/i386/driver-i386.cc |  3 ++-
>  gcc/config/i386/i386-options.cc|  2 +-
>  gcc/testsuite/gcc.target/i386/avx10_1-15.c |  5 +
>  gcc/testsuite/gcc.target/i386/avx10_1-16.c |  5 +
>  gcc/testsuite/gcc.target/i386/avx10_1-17.c | 13 +
>  gcc/testsuite/gcc.target/i386/avx10_1-18.c | 13 +
>  7 files changed, 59 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-15.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-16.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-17.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-18.c
>
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index ec94251dd4c..db88befc9b8 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -428,6 +428,24 @@ ix86_check_avx512 (struct gcc_options *opts)
>return true;
>  }
>
> +/* Emit a warning when there is a conflict vector width in AVX10 options.  */
> +static void
> +ix86_check_avx10_vector_width (struct gcc_options *opts, bool avx10_max_512)
> +{
> +  if (avx10_max_512)
> +{
> +  if (((opts->x_ix86_isa_flags2 | ~OPTION_MASK_ISA2_AVX10_512BIT)
> +  == ~OPTION_MASK_ISA2_AVX10_512BIT)
> + && (opts->x_ix86_isa_flags2_explicit & 
> OPTION_MASK_ISA2_AVX10_512BIT))
> +   warning (0, "The options used for AVX10 have conflict vector width, "
> +"using the latter 512 as vector width");
> +}
> +  else if (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> +  & OPTION_MASK_ISA2_AVX10_512BIT)
> +warning (0, "The options used for AVX10 have conflict vector width, "
> +"using the latter 256 as vector width");
> +}
> +
>  /* Implement TARGET_HANDLE_OPTION.  */
>
>  bool
> @@ -1415,6 +1433,7 @@ ix86_handle_option (struct gcc_options *opts,
>return true;
>
>  case OPT_mavx10_1_256:
> +  ix86_check_avx10_vector_width (opts, false);
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2_explicit |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2 &= ~OPTION_MASK_ISA2_AVX10_512BIT_SET;
> @@ -1424,6 +1443,7 @@ ix86_handle_option (struct gcc_options *opts,
>return true;
>
>  case OPT_mavx10_1_512:
> +  ix86_check_avx10_vector_width (opts, true);
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2_explicit |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_512BIT_SET;
> diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc
> index 227ace6ff83..f4551a74e3a 100644
> --- a/gcc/config/i386/driver-i386.cc
> +++ b/gcc/config/i386/driver-i386.cc
> @@ -854,7 +854,8 @@ const char *host_detect_local_cpu (int argc, const char 
> **argv)
>   options = concat (options, " ",
> isa_names_table[i].option, NULL);
>   }
> -   else if (isa_names_table[i].feature != FEATURE_AVX10_1)
> +   else if ((isa_names_table[i].feature != FEATURE_AVX10_1)
> +&& (isa_names_table[i].feature != FEATURE_AVX10_512BIT))
>   options = concat (options, neg_option,
> isa_names_table[i].option + 2, NULL);
>   }
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index b2281fbd4b5..8f9b825b527 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -985,7 +985,7 @@ ix86_valid_target_attribute_inner_p (tree fndecl, tree 
> args, char *p_strings[],
>  ix86_opt_ix86_no,
>  ix86_opt_str,
>  ix86_opt_enum,
> -ix86_opt_isa,
> +ix86_opt_isa
>};
>
>static const struct
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-15.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-15.c
> new file mode 100644
> index 000..fd873c9694c
> --- /dev/null
> +++ 

Re: [PATCH 2/3] Emit a warning when disabling AVX512 with AVX10 enabled or disabling AVX10 with AVX512 enabled

2023-08-15 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 3:15 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * config/i386/driver-i386.cc (host_detect_local_cpu):
> Do not append -mno-avx10.1 for -march=native.
> * config/i386/i386-options.cc
> (ix86_check_avx10): New function to check isa_flags and
> isa_flags_explicit to emit warning when AVX10 is enabled
> by "-m" option.
> (ix86_check_avx512):  New function to check isa_flags and
> isa_flags_explicit to emit warning when AVX512 is enabled
> by "-m" option.
> (ix86_handle_option): Do not change the flags when warning
> is emitted.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-11.c: New test.
> * gcc.target/i386/avx10_1-12.c: Ditto.
> * gcc.target/i386/avx10_1-13.c: Ditto.
> * gcc.target/i386/avx10_1-14.c: Ditto.
OK (please wait an extra 24 hours before committing, in case there are objections).
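
As a rough illustration of the check described in the ChangeLog above (a sketch, not the patch itself): an ISA counts as "enabled by an -m option" when its bit is set in both the flags word and the explicit word, and the negative option is then ignored with a warning instead of changing the flags. The struct, the mask values, and the helper names below are hypothetical stand-ins, not GCC's real definitions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for GCC's ISA bitmasks; the real values differ.  */
#define MASK_AVX10_1  (UINT64_C (1) << 0)
#define MASK_AVX512F  (UINT64_C (1) << 1)

/* Hypothetical, reduced version of the option state.  */
struct opts_sketch
{
  uint64_t isa_flags;          /* ISAs currently enabled.  */
  uint64_t isa_flags_explicit; /* ISAs the user touched with an -m option.  */
};

/* Return true if MASK is both enabled and explicitly requested.  */
static bool
enabled_explicitly (const struct opts_sketch *o, uint64_t mask)
{
  return (o->isa_flags & o->isa_flags_explicit & mask) != 0;
}

/* Model -mno-avx512f: warn and leave the flags alone when AVX10.1 was
   explicitly enabled, otherwise clear the bit and record the option as
   explicit.  Returns true if the flags were changed.  */
static bool
handle_mno_avx512f (struct opts_sketch *o)
{
  if (enabled_explicitly (o, MASK_AVX10_1))
    {
      fprintf (stderr, "warning: -mno-avx512f is ignored with -mavx10.1\n");
      return false;
    }
  o->isa_flags &= ~MASK_AVX512F;
  o->isa_flags_explicit |= MASK_AVX512F;
  return true;
}
```

The -mno-avx10.1 direction would mirror this, testing the AVX512 bits instead of the AVX10.1 bit.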
> ---
>  gcc/common/config/i386/i386-common.cc  | 68 +-
>  gcc/config/i386/driver-i386.cc |  2 +-
>  gcc/testsuite/gcc.target/i386/avx10_1-11.c |  5 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-12.c | 13 +
>  gcc/testsuite/gcc.target/i386/avx10_1-13.c |  5 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-14.c | 13 +
>  6 files changed, 91 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-12.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-13.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-14.c
>
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index 6c3bebb1846..ec94251dd4c 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -388,6 +388,46 @@ set_malign_value (const char **flag, unsigned value)
>*flag = r;
>  }
>
> +/* Emit a warning when using -mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,
> +   vnni,ifma,bitalg,vpopcntdq} with -mavx10.1 and above.  */
> +static bool
> +ix86_check_avx10 (struct gcc_options *opts)
> +{
> +  if (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> +  & OPTION_MASK_ISA2_AVX10_1)
> +{
> +  warning (0, 
> "%<-mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,"
> +  "bitalg,vpopcntdq}%> are ignored with %<-mavx10.1%> and 
> above");
> +  return false;
> +}
> +
> +  return true;
> +}
> +
> +/* Emit a warning when using -mno-avx10.1 with -mavx512{f,vl,bw,dq,cd,bf16,
> +   fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq}.  */
> +static bool
> +ix86_check_avx512 (struct gcc_options *opts)
> +{
> +  if ((opts->x_ix86_isa_flags & opts->x_ix86_isa_flags_explicit
> +   & (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_AVX512CD
> + | OPTION_MASK_ISA_AVX512DQ | OPTION_MASK_ISA_AVX512BW
> + | OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA_AVX512IFMA
> + | OPTION_MASK_ISA_AVX512VBMI | OPTION_MASK_ISA_AVX512VBMI2
> + | OPTION_MASK_ISA_AVX512VNNI | OPTION_MASK_ISA_AVX512VPOPCNTDQ
> + | OPTION_MASK_ISA_AVX512BITALG))
> +  || (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> + & (OPTION_MASK_ISA2_AVX512FP16 | OPTION_MASK_ISA2_AVX512BF16)))
> +{
> +  warning (0, "%<-mno-avx10.1%> is ignored when using with "
> +  "%<-mavx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,"
> +  "ifma,bitalg,vpopcntdq}%>");
> +  return false;
> +}
> +
> +  return true;
> +}
> +
>  /* Implement TARGET_HANDLE_OPTION.  */
>
>  bool
> @@ -609,7 +649,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512F_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512F_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512F_UNSET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512F_UNSET;
> @@ -624,7 +664,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512CD_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512CD_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512CD_UNSET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512CD_UNSET;
> @@ -898,7 +938,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512VBMI2_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512VBMI2_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512VBMI2_UNSET;
>   opts->x_ix86_isa_flags_explicit |= 
> OPTION_MASK_ISA_AVX512VBMI2_UNSET;
> @@ -913,7 +953,7 

Re: [PATCH 1/3] Initial support for AVX10.1

2023-08-15 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 3:16 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Add avx10_set and version and detect avx10.1.
> (cpu_indicator_init): Handle avx10.1-512.
> * common/config/i386/i386-common.cc
> (OPTION_MASK_ISA2_AVX10_512BIT_SET): New.
> (OPTION_MASK_ISA2_AVX10_1_SET): Ditto.
> (OPTION_MASK_ISA2_AVX10_512BIT_UNSET): Ditto.
> (OPTION_MASK_ISA2_AVX10_1_UNSET): Ditto.
> (OPTION_MASK_ISA2_AVX2_UNSET): Modify for AVX10_1.
> (ix86_handle_option): Handle -mavx10.1, -mavx10.1-256 and
> -mavx10.1-512.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_AVX10_512BIT, FEATURE_AVX10_1 and
> FEATURE_AVX10_512BIT.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> AVX10_512BIT, AVX10_1 and AVX10_1_512.
> * config/i386/constraints.md (Yk): Add AVX10_1.
> (Yv): Ditto.
> (k): Ditto.
> * config/i386/cpuid.h (bit_AVX10): New.
> (bit_AVX10_256): Ditto.
> (bit_AVX10_512): Ditto.
> * config/i386/i386-c.cc (ix86_target_macros_internal):
> Define AVX10_512BIT and AVX10_1.
> * config/i386/i386-isa.def
> (AVX10_512BIT): Add DEF_PTA(AVX10_512BIT).
> (AVX10_1): Add DEF_PTA(AVX10_1).
> * config/i386/i386-options.cc (isa2_opts): Add -mavx10.1.
> (ix86_valid_target_attribute_inner_p): Handle avx10-512bit, avx10.1
> and avx10.1-512.
> (ix86_option_override_internal): Enable AVX512{F,VL,BW,DQ,CD,BF16,
> FP16,VBMI,VBMI2,VNNI,IFMA,BITALG,VPOPCNTDQ} features for avx10.1-512.
> (ix86_valid_target_attribute_inner_p): Handle AVX10_1.
> * config/i386/i386.cc (ix86_get_ssemov): Add AVX10_1.
> (ix86_conditional_register_usage): Ditto.
> (ix86_hard_regno_mode_ok): Ditto.
> (ix86_rtx_costs): Ditto.
> * config/i386/i386.h (VALID_MASK_AVX10_MODE): New macro.
> * config/i386/i386.opt: Add option -mavx10.1, -mavx10.1-256 and
> -mavx10.1-512.
> * doc/extend.texi: Document avx10.1, avx10.1-256 and avx10.1-512.
> * doc/invoke.texi: Document -mavx10.1, -mavx10.1-256 and 
> -mavx10.1-512.
> * doc/sourcebuild.texi: Document target avx10.1, avx10.1-256
> and avx10.1-512.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/mv33.C: New test.
> * gcc.target/i386/avx10_1-1.c: Ditto.
> * gcc.target/i386/avx10_1-2.c: Ditto.
> * gcc.target/i386/avx10_1-3.c: Ditto.
> * gcc.target/i386/avx10_1-4.c: Ditto.
> * gcc.target/i386/avx10_1-5.c: Ditto.
> * gcc.target/i386/avx10_1-6.c: Ditto.
> * gcc.target/i386/avx10_1-7.c: Ditto.
> * gcc.target/i386/avx10_1-8.c: Ditto.
> * gcc.target/i386/avx10_1-9.c: Ditto.
> * gcc.target/i386/avx10_1-10.c: Ditto.
OK (please wait an extra 24 hours before committing, in case there are objections).
> ---
>  gcc/common/config/i386/cpuinfo.h   | 36 +++
>  gcc/common/config/i386/i386-common.cc  | 53 +-
>  gcc/common/config/i386/i386-cpuinfo.h  |  3 ++
>  gcc/common/config/i386/i386-isas.h |  5 ++
>  gcc/config/i386/constraints.md |  6 +--
>  gcc/config/i386/cpuid.h|  6 +++
>  gcc/config/i386/i386-c.cc  |  4 ++
>  gcc/config/i386/i386-isa.def   |  2 +
>  gcc/config/i386/i386-options.cc| 26 ++-
>  gcc/config/i386/i386.cc| 18 ++--
>  gcc/config/i386/i386.h |  3 ++
>  gcc/config/i386/i386.opt   | 19 
>  gcc/doc/extend.texi| 13 ++
>  gcc/doc/invoke.texi| 16 +--
>  gcc/doc/sourcebuild.texi   |  9 
>  gcc/testsuite/g++.target/i386/mv33.C   | 30 
>  gcc/testsuite/gcc.target/i386/avx10_1-1.c  | 22 +
>  gcc/testsuite/gcc.target/i386/avx10_1-10.c | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-2.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-3.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-4.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-5.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-6.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-7.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-8.c  |  4 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-9.c  | 13 ++
>  26 files changed, 366 insertions(+), 13 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/mv33.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-3.c
>  create mode 100644 

Re: [PATCH] Generate vmovapd instead of vmovsd for moving DFmode between SSE_REGS.

2023-08-13 Thread Hongtao Liu via Gcc-patches
cc

On Mon, Aug 14, 2023 at 10:46 AM liuhongt  wrote:
>
> vmovapd can enable register renaming and has the same code size as
> vmovsd. Similarly for vmovsh vs. vmovaps: vmovaps is 1 byte shorter than
> vmovsh.
>
> When TARGET_AVX512VL is not available, still generate
> vmovsd/vmovss/vmovsh to avoid using vmovapd/vmovaps with zmm16-31.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (movdf_internal): Generate vmovapd instead of
> vmovsd when moving DFmode between SSE_REGS.
> (movhi_internal): Generate vmovdqa instead of vmovsh when
> moving HImode between SSE_REGS.
> (mov_internal): Use vmovaps instead of vmovsh when
> moving HF/BFmode between SSE_REGS.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr89229-4a.c: Adjust testcase.
> ---
>  gcc/config/i386/i386.md| 20 +---
>  gcc/testsuite/gcc.target/i386/pr89229-4a.c |  4 +---
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index c906d75b13e..77182e34fe1 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2961,8 +2961,12 @@ (define_insn "*movhi_internal"
> ]
> (const_string "TI"))
> (eq_attr "alternative" "12")
> - (cond [(match_test "TARGET_AVX512FP16")
> + (cond [(match_test "TARGET_AVX512VL")
> +  (const_string "TI")
> +(match_test "TARGET_AVX512FP16")
>(const_string "HF")
> +(match_test "TARGET_AVX512F")
> +  (const_string "SF")
>  (match_test "TARGET_AVX")
>(const_string "TI")
>  (ior (not (match_test "TARGET_SSE2"))
> @@ -4099,8 +4103,12 @@ (define_insn "*movdf_internal"
>
>/* movaps is one byte shorter for non-AVX targets.  */
>(eq_attr "alternative" "13,17")
> -(cond [(match_test "TARGET_AVX")
> +(cond [(match_test "TARGET_AVX512VL")
> + (const_string "V2DF")
> +   (match_test "TARGET_AVX512F")
>   (const_string "DF")
> +   (match_test "TARGET_AVX")
> + (const_string "V2DF")
> (ior (not (match_test "TARGET_SSE2"))
>  (match_test "optimize_function_for_size_p 
> (cfun)"))
>   (const_string "V4SF")
> @@ -4380,8 +4388,14 @@ (define_insn "*mov_internal"
>(const_string "HI")
>(const_string "TI"))
>(eq_attr "alternative" "5")
> -(cond [(match_test "TARGET_AVX512FP16")
> +(cond [(match_test "TARGET_AVX512VL")
> +   (const_string "V4SF")
> +   (match_test "TARGET_AVX512FP16")
>   (const_string "HF")
> +   (match_test "TARGET_AVX512F")
> + (const_string "SF")
> +   (match_test "TARGET_AVX")
> + (const_string "V4SF")
> (ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY")
>  (match_test "TARGET_SSE_SPLIT_REGS"))
>   (const_string "V4SF")
> diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4a.c 
> b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> index 5bc10d25619..8869650b0ad 100644
> --- a/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> +++ b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-do assemble { target { ! ia32 } } } */
>  /* { dg-options "-O2 -march=skylake-avx512" } */
>
>  extern double d;
> @@ -12,5 +12,3 @@ foo1 (double x)
>asm volatile ("" : "+v" (xmm17));
>d = xmm17;
>  }
> -
> -/* { dg-final { scan-assembler-not "vmovapd" } } */
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-13 Thread Hongtao Liu via Gcc-patches
On Fri, Aug 11, 2023 at 2:02 PM liuhongt via Gcc-patches
 wrote:
>
> Rename the original use_gather to use_gather_8parts. Support
> -mtune-ctrl={,^}use_gather to set/clear the tune features
> use_gather_{2parts, 4parts, 8parts}. Support the new options -mgather
> and -mno-gather as aliases of -mtune-ctrl=use_gather and
> -mtune-ctrl=^use_gather.
>
> Similar for use_scatter.
>
> How about this version?
I'll commit the patch if there are no objections in the next 24 hours.
>
> gcc/ChangeLog:
>
> * config/i386/i386-builtins.cc
> (ix86_vectorize_builtin_gather): Adjust for use_gather_8parts.
> * config/i386/i386-options.cc (parse_mtune_ctrl_str):
> Set/Clear tune features use_{gather,scatter}_{2parts, 4parts,
> 8parts} for -mtune-ctrl={,^}{use_gather,use_scatter}.
> * config/i386/i386.cc (ix86_vectorize_builtin_scatter): Adjust
> for use_scatter_8parts
> * config/i386/i386.h (TARGET_USE_GATHER): Rename to ..
> (TARGET_USE_GATHER_8PARTS): .. this.
> (TARGET_USE_SCATTER): Rename to ..
> (TARGET_USE_SCATTER_8PARTS): .. this.
> * config/i386/x86-tune.def (X86_TUNE_USE_GATHER): Rename to
> (X86_TUNE_USE_GATHER_8PARTS): .. this.
> (X86_TUNE_USE_SCATTER): Rename to
> (X86_TUNE_USE_SCATTER_8PARTS): .. this.
> * config/i386/i386.opt: Add new options mgather, mscatter.
> ---
>  gcc/config/i386/i386-builtins.cc |  2 +-
>  gcc/config/i386/i386-options.cc  | 54 +++-
>  gcc/config/i386/i386.cc  |  2 +-
>  gcc/config/i386/i386.h   |  8 ++---
>  gcc/config/i386/i386.opt |  8 +
>  gcc/config/i386/x86-tune.def |  4 +--
>  6 files changed, 56 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/config/i386/i386-builtins.cc 
> b/gcc/config/i386/i386-builtins.cc
> index 356b6dfd5fb..8a0b8dfe073 100644
> --- a/gcc/config/i386/i386-builtins.cc
> +++ b/gcc/config/i386/i386-builtins.cc
> @@ -1657,7 +1657,7 @@ ix86_vectorize_builtin_gather (const_tree mem_vectype,
>   ? !TARGET_USE_GATHER_2PARTS
>   : (known_eq (TYPE_VECTOR_SUBPARTS (mem_vectype), 4u)
>  ? !TARGET_USE_GATHER_4PARTS
> -: !TARGET_USE_GATHER)))
> +: !TARGET_USE_GATHER_8PARTS)))
>  return NULL_TREE;
>
>if ((TREE_CODE (index_type) != INTEGER_TYPE
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 127ee24203c..b8d038af69d 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1731,20 +1731,46 @@ parse_mtune_ctrl_str (struct gcc_options *opts, bool 
> dump)
>curr_feature_string++;
>clear = true;
>  }
> -  for (i = 0; i < X86_TUNE_LAST; i++)
> -{
> -  if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> -{
> -  ix86_tune_features[i] = !clear;
> -  if (dump)
> -fprintf (stderr, "Explicitly %s feature %s\n",
> - clear ? "clear" : "set", 
> ix86_tune_feature_names[i]);
> -  break;
> -}
> -}
> -  if (i == X86_TUNE_LAST)
> -   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> -  clear ? curr_feature_string - 1 : curr_feature_string);
> +
> +  if (!strcmp (curr_feature_string, "use_gather"))
> +   {
> + ix86_tune_features[X86_TUNE_USE_GATHER_2PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_GATHER_4PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_GATHER_8PARTS] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s features use_gather_2parts,"
> +" use_gather_4parts, use_gather_8parts\n",
> +clear ? "clear" : "set");
> +
> +   }
> +  else if (!strcmp (curr_feature_string, "use_scatter"))
> +   {
> + ix86_tune_features[X86_TUNE_USE_SCATTER_2PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_SCATTER_4PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_SCATTER_8PARTS] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s features use_scatter_2parts,"
> +" use_scatter_4parts, use_scatter_8parts\n",
> +clear ? "clear" : "set");
> +   }
> +  else
> +   {
> + for (i = 0; i < X86_TUNE_LAST; i++)
> +   {
> + if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> +   {
> + ix86_tune_features[i] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s feature %s\n",
> +clear ? "clear" : "set", 
> ix86_tune_feature_names[i]);
> + break;
> +   }
> +   }
> +
> + if (i == X86_TUNE_LAST)
> +   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> +  clear ? curr_feature_string - 1 : curr_feature_string);
> +

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 7:13 PM Richard Biener
 wrote:
>
> On Thu, Aug 10, 2023 at 11:16 AM Hongtao Liu  wrote:
> >
> > On Thu, Aug 10, 2023 at 4:07 PM Hongtao Liu  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > > > > >
> > > > > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Currently we have 3 different independent tunes for gather
> > > > > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > > > > similar for scatter, there're
> > > > > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > > > > >
> > > > > > > > > The patch supports 2 standardized options to enable/disable
> > > > > > > > > vectorization for all gather/scatter instructions. The options
> > > > > > > > > are interpreted by the driver as 3 tunes.
> > > > > > > >
> > > > > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > > > > >
> > > > > > > May I suggest to invent a more generic "short-cut" to
> > > > > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > > > > change what use_gather controls - it seems we changed its
> > > > > > > meaning before, and instead add use_gather_8parts and
> > > > > > > use_gather_16parts)
> > > > > > >
> > > > > > > That is, what's the point of this?
> > The point of this is to stay consistent across GCC, LLVM, and
> > ICX (Intel® oneAPI DPC++/C++ Compiler).
> > LLVM and ICX will support that option.
>
> GCC has very many options that are not the same as LLVM or ICX,
> I don't see a good reason to special case this one.  As said, it's
> a very bad name IMHO.
In general terms, yes.
But this is a new option; wouldn't it be better to be consistent?
The problem with -mfma is mainly that the cpuid is also called fma,
but we don't have a cpuid called gather/scatter. With clear documentation
that the option is only for auto-vectorization,
-m{no-,}{gather,scatter} looks fine to me.
As Honza mentioned, users need an option to turn gather/scatter
auto-vectorization on or off; I don't think they will expect the option
to also apply to intrinsics.
If -mtune-ctrl= is not suitable for direct exposure to users,
then the original proposal should be OK?
Developers will maintain the relation between -mgather/-mscatter and
-mtune-ctrl=XXX to keep it consistent between GCC versions.
>
> Richard.
>
> > > > > >
> > > > > > https://www.phoronix.com/review/downfall
> > > > > >
> > > > > > that caused:
> > > > > >
> > > > > > https://www.phoronix.com/review/intel-downfall-benchmarks
> > > > >
> > > > > Yes, I know.  But there's -mtune-ctrl= doing the trick.
> > > > > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > > > > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > > > > gather works only on SI/SFmode or larger).
> > > > >
> > > > > Then -mtune-ctrl=^use_gather works which I think is nice enough?
> > > > So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> > > > vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> > > > We don't have an extra explicit flag for target tunes, just a single bit:
> > > > ix86_tune_features[X86_TUNE_USE_GATHER]
> > > Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Uros.
> > > >
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 4:07 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
> >
> > On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > > >  wrote:
> > > > >
> > > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  
> > > > > wrote:
> > > > > >
> > > > > > Currently we have 3 different independent tunes for gather
> > > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > > similar for scatter, there're
> > > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > > >
> > > > > > > The patch supports 2 standardized options to enable/disable
> > > > > > > vectorization for all gather/scatter instructions. The options are
> > > > > > > interpreted by the driver as 3 tunes.
> > > > > >
> > > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > > Ok for trunk?
> > > > >
> > > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > > >
> > > > > May I suggest to invent a more generic "short-cut" to
> > > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > > change what use_gather controls - it seems we changed its
> > > > > meaning before, and instead add use_gather_8parts and
> > > > > use_gather_16parts)
> > > > >
> > > > > That is, what's the point of this?
The point of this is to stay consistent across GCC, LLVM, and
ICX (Intel® oneAPI DPC++/C++ Compiler).
LLVM and ICX will support that option.
> > > >
> > > > https://www.phoronix.com/review/downfall
> > > >
> > > > that caused:
> > > >
> > > > https://www.phoronix.com/review/intel-downfall-benchmarks
> > >
> > > Yes, I know.  But there's -mtune-ctrl= doing the trick.
> > > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > > gather works only on SI/SFmode or larger).
> > >
> > > Then -mtune-ctrl=^use_gather works which I think is nice enough?
> > So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> > vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> > We don't have an extra explicit flag for target tunes, just a single bit:
> > ix86_tune_features[X86_TUNE_USE_GATHER]
> Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
> > >
> > > Richard.
> > >
> > > > Uros.
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > >  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> > > > >
> > > > > Currently we have 3 different independent tunes for gather
> > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > similar for scatter, there're
> > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > >
> > > > > > The patch supports 2 standardized options to enable/disable
> > > > > > vectorization for all gather/scatter instructions. The options are
> > > > > > interpreted by the driver as 3 tunes.
> > > > >
> > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > Ok for trunk?
> > > >
> > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > >
> > > > May I suggest to invent a more generic "short-cut" to
> > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > change what use_gather controls - it seems we changed its
> > > > meaning before, and instead add use_gather_8parts and
> > > > use_gather_16parts)
> > > >
> > > > That is, what's the point of this?
> > >
> > > https://www.phoronix.com/review/downfall
> > >
> > > that caused:
> > >
> > > https://www.phoronix.com/review/intel-downfall-benchmarks
> >
> > Yes, I know.  But there's -mtune-ctrl= doing the trick.
> > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > gather works only on SI/SFmode or larger).
> >
> > Then -mtune-ctrl=^use_gather works which I think is nice enough?
> So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> We don't have an extra explicit flag for target tunes, just a single bit:
> ix86_tune_features[X86_TUNE_USE_GATHER]
Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
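
For illustration only, a minimal sketch of what such special handling could look like: the umbrella name "use_gather" fans out to every per-width gather tune, in both the set and the clear ("^") direction, while other names go through the ordinary table lookup. The enum, the table, and apply_tune_item are hypothetical and heavily trimmed; they are not the real parse_mtune_ctrl_str.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, trimmed-down tune table; GCC's real list is much longer.  */
enum tune_sketch
{
  TUNE_USE_GATHER_2PARTS,
  TUNE_USE_GATHER_4PARTS,
  TUNE_USE_GATHER_8PARTS,
  TUNE_LAST
};

static const char *const tune_names[TUNE_LAST] =
{
  "use_gather_2parts", "use_gather_4parts", "use_gather_8parts"
};

static bool tune_features[TUNE_LAST];

/* Apply one -mtune-ctrl item such as "use_gather" or "^use_gather_4parts".
   Returns false for an unknown name.  */
static bool
apply_tune_item (const char *item)
{
  bool clear = false;

  /* A leading '^' means "clear" instead of "set".  */
  if (*item == '^')
    {
      clear = true;
      item++;
    }

  /* The umbrella name fans out to every per-width gather tune.  */
  if (!strcmp (item, "use_gather"))
    {
      tune_features[TUNE_USE_GATHER_2PARTS] = !clear;
      tune_features[TUNE_USE_GATHER_4PARTS] = !clear;
      tune_features[TUNE_USE_GATHER_8PARTS] = !clear;
      return true;
    }

  /* Everything else is looked up by its exact name.  */
  for (int i = 0; i < TUNE_LAST; i++)
    if (!strcmp (item, tune_names[i]))
      {
        tune_features[i] = !clear;
        return true;
      }

  return false;
}
```

With this shape, "use_gather" turns all three on and "^use_gather" turns all three off, so the set and clear directions stay symmetric; use_scatter could be handled the same way.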
> >
> > Richard.
> >
> > > Uros.
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> >
> > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> >  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> > > >
> > > > Currently we have 3 different independent tunes for gather
> > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > similar for scatter, there're
> > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > >
> > > > > The patch supports 2 standardized options to enable/disable
> > > > > vectorization for all gather/scatter instructions. The options are
> > > > > interpreted by the driver as 3 tunes.
> > > >
> > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > Ok for trunk?
> > >
> > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > enable part of an ISA but they won't disable the use of intrinsics
> > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > >
> > > May I suggest to invent a more generic "short-cut" to
> > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > tunables add ^use_gather_any to cover all cases?  (or
> > > change what use_gather controls - it seems we changed its
> > > meaning before, and instead add use_gather_8parts and
> > > use_gather_16parts)
> > >
> > > That is, what's the point of this?
> >
> > https://www.phoronix.com/review/downfall
> >
> > that caused:
> >
> > https://www.phoronix.com/review/intel-downfall-benchmarks
>
> Yes, I know.  But there's -mtune-ctrl= doing the trick.
> GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> gather works only on SI/SFmode or larger).
>
> Then -mtune-ctrl=^use_gather works which I think is nice enough?
So basically, -mtune-ctrl=^use_gather is used to turn off all gather
vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
We don't have an extra explicit flag for target tunes, just a single bit:
ix86_tune_features[X86_TUNE_USE_GATHER]
>
> Richard.
>
> > Uros.



-- 
BR,
Hongtao


Re: [PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 2:06 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 2:01 PM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > On Thu, Aug 10, 2023 at 2:49 AM liuhongt  wrote:
> > >
> > > > Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named
> > > patterns in order to avoid generation of partial vector V8HFmode
> > > trapping instructions.
> > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/110832
> > > * config/i386/mmx.md: (movq__to_sse): Also do not
> > > sanitize upper part of V4HFmode register with
> > > -fno-trapping-math.
> > > (v4hf3): Enable for ix86_partial_vec_fp_math.
> > > ( > > (v2hf3): Ditto.
> > > (divv2hf3): Ditto.
> > > (movd_v2hf_to_sse): Do not sanitize upper part of V2HFmode
> > > register with -fno-trapping-math.
> >
> > OK.
> >
> > BTW: I would just like to mention that plenty of instructions can be
> > enabled for V4HF/V2HFmode besides arithmetic insns. At least
> > conversions, comparisons, FMA and min/max (to name some of them) can
> > be enabled by introducing expanders that expand to V8HFmode
> > instruction.
> Yes, I'll try to support that in GCC 14.
I would wait for avx10's patch to go in first, so as to avoid extra
rebases and conflicts.
> >
> > Uros.
> > >
> > > ---
> > >  gcc/config/i386/mmx.md | 20 ++--
> > >  1 file changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > > index d51b3b9dc71..170432a7128 100644
> > > --- a/gcc/config/i386/mmx.md
> > > +++ b/gcc/config/i386/mmx.md
> > > @@ -596,7 +596,7 @@ (define_expand "movq__to_sse"
> > >   (match_dup 2)))]
> > >"TARGET_SSE2"
> > >  {
> > > -  if (mode == V2SFmode
> > > +  if (mode != V2SImode
> > >&& !flag_trapping_math)
> > >  {
> > >rtx op1 = force_reg (mode, operands[1]);
> > > @@ -1941,7 +1941,7 @@ (define_expand "v4hf3"
> > > (plusminusmult:V4HF
> > >   (match_operand:V4HF 1 "nonimmediate_operand")
> > >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -1961,7 +1961,7 @@ (define_expand "divv4hf3"
> > > (div:V4HF
> > >   (match_operand:V4HF 1 "nonimmediate_operand")
> > >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -1983,14 +1983,22 @@ (define_expand "movd_v2hf_to_sse"
> > > (match_operand:V2HF 1 "nonimmediate_operand"))
> > >   (match_operand:V8HF 2 "reg_or_0_operand")
> > >   (const_int 3)))]
> > > -  "TARGET_SSE")
> > > +  "TARGET_SSE"
> > > +{
> > > +  if (!flag_trapping_math && operands[2] == CONST0_RTX (V8HFmode))
> > > +  {
> > > +rtx op1 = force_reg (V2HFmode, operands[1]);
> > > > +emit_move_insn (operands[0], lowpart_subreg (V8HFmode, op1, V2HFmode));
> > > +DONE;
> > > +  }
> > > +})
> > >
> > >  (define_expand "v2hf3"
> > >[(set (match_operand:V2HF 0 "register_operand")
> > > (plusminusmult:V2HF
> > >   (match_operand:V2HF 1 "nonimmediate_operand")
> > >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -2009,7 +2017,7 @@ (define_expand "divv2hf3"
> > > (div:V2HF
> > >   (match_operand:V2HF 1 "nonimmediate_operand")
> > >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 2:04 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> >
> > Currently we have 3 independent tunes for gather,
> > "use_gather,use_gather_2parts,use_gather_4parts",
> > and similarly 3 for scatter,
> > "use_scatter,use_scatter_2parts,use_scatter_4parts".
> >
> > The patch supports 2 standardized options to enable/disable
> > vectorization for all gather/scatter instructions. The options are
> > interpreted by the driver into the 3 tunes.
> >
> > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.h (DRIVER_SELF_SPECS): Add
> > GATHER_SCATTER_DRIVER_SELF_SPECS.
> > (GATHER_SCATTER_DRIVER_SELF_SPECS): New macro.
> > * config/i386/i386.opt (mgather): New option.
> > (mscatter): Ditto.
> > ---
> >  gcc/config/i386/i386.h   | 12 +++-
> >  gcc/config/i386/i386.opt |  8 
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index ef342fcee9b..d9ac2c29bde 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -565,7 +565,17 @@ extern GTY(()) tree x86_mfence;
> >  # define SUBTARGET_DRIVER_SELF_SPECS ""
> >  #endif
> >
> > -#define DRIVER_SELF_SPECS SUBTARGET_DRIVER_SELF_SPECS
> > +#ifndef GATHER_SCATTER_DRIVER_SELF_SPECS
> > +# define GATHER_SCATTER_DRIVER_SELF_SPECS \
> > +  "%{mno-gather:-mtune-ctrl=^use_gather_2parts,^use_gather_4parts,^use_gather} \
> > +   %{mgather:-mtune-ctrl=use_gather_2parts,use_gather_4parts,use_gather} \
> > +   %{mno-scatter:-mtune-ctrl=^use_scatter_2parts,^use_scatter_4parts,^use_scatter} \
> > +   %{mscatter:-mtune-ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter}"
> > +#endif
> > +
> > +#define DRIVER_SELF_SPECS \
> > +  SUBTARGET_DRIVER_SELF_SPECS " " \
> > +  GATHER_SCATTER_DRIVER_SELF_SPECS
> >
> >  /* -march=native handling only makes sense with compiler running on
> > an x86 or x86_64 chip.  If changing this condition, also change
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index ddb7f110aa2..99948644a8d 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -424,6 +424,14 @@ mdaz-ftz
> >  Target
> >  Set the FTZ and DAZ Flags.
> >
> > +mgather
> > +Target
> > +Enable vectorization for gather instruction.
> > +
> > +mscatter
> > +Target
> > +Enable vectorization for scatter instruction.
>
> Are gather and scatter instructions affected in a separate way, or
> should we use one -mgather-scatter option to cover all gather/scatter
> tunings?
They are affected separately.
Gather Data Sampling applies only to gather instructions:
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
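
For illustration, the option-to-tune mapping in the quoted GATHER_SCATTER_DRIVER_SELF_SPECS can be written out as a plain lookup table. This is only a sketch: gather_scatter_tune_ctrl is a hypothetical helper, not a GCC function, and in the patch the substitution is done by the driver's spec machinery. It shows that -mno-gather only touches the three gather tunes, so scatter vectorization is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GCC): return the -mtune-ctrl= argument
   that the quoted self-spec adds for OPT, or NULL if OPT is not one of
   the four gather/scatter options.  */
const char *
gather_scatter_tune_ctrl (const char *opt)
{
  static const struct
  {
    const char *opt;
    const char *tune;
  } map[] = {
    { "-mno-gather",
      "-mtune-ctrl=^use_gather_2parts,^use_gather_4parts,^use_gather" },
    { "-mgather",
      "-mtune-ctrl=use_gather_2parts,use_gather_4parts,use_gather" },
    { "-mno-scatter",
      "-mtune-ctrl=^use_scatter_2parts,^use_scatter_4parts,^use_scatter" },
    { "-mscatter",
      "-mtune-ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter" },
  };

  assert (opt != NULL);
  for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
    if (strcmp (opt, map[i].opt) == 0)
      return map[i].tune;
  return NULL;
}
```

With this table, passing "-mno-gather" yields a string that mentions only the use_gather* tunes, which matches the point that a Gather Data Sampling mitigation does not need to disable scatter.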
>
> Uros.
>
> > +
> >  mpreferred-stack-boundary=
> >  Target RejectNegative Joined UInteger Var(ix86_preferred_stack_boundary_arg)
> >  Attempt to keep stack aligned to this power of 2.
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


Re: [PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-10 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 10, 2023 at 2:01 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 2:49 AM liuhongt  wrote:
> >
> > Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named
> > patterns in order to avoid generation of partial vector V8HFmode
> > trapping instructions.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/110832
> > * config/i386/mmx.md: (movq__to_sse): Also do not
> > sanitize upper part of V4HFmode register with
> > -fno-trapping-math.
> > (v4hf3): Enable for ix86_partial_vec_fp_math.
> > (divv4hf3): Ditto.
> > (v2hf3): Ditto.
> > (divv2hf3): Ditto.
> > (movd_v2hf_to_sse): Do not sanitize upper part of V2HFmode
> > register with -fno-trapping-math.
>
> OK.
>
> BTW: I would just like to mention that plenty of instructions can be
> enabled for V4HF/V2HFmode besides arithmetic insns. At least
> conversions, comparisons, FMA and min/max (to name some of them) can
> be enabled by introducing expanders that expand to V8HFmode
> instruction.
Yes, I will try to support that in GCC 14.
>
> Uros.
> >
> > ---
> >  gcc/config/i386/mmx.md | 20 ++--
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index d51b3b9dc71..170432a7128 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -596,7 +596,7 @@ (define_expand "movq__to_sse"
> >   (match_dup 2)))]
> >"TARGET_SSE2"
> >  {
> > -  if (mode == V2SFmode
> > +  if (mode != V2SImode
> >&& !flag_trapping_math)
> >  {
> >rtx op1 = force_reg (mode, operands[1]);
> > @@ -1941,7 +1941,7 @@ (define_expand "v4hf3"
> > (plusminusmult:V4HF
> >   (match_operand:V4HF 1 "nonimmediate_operand")
> >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -1961,7 +1961,7 @@ (define_expand "divv4hf3"
> > (div:V4HF
> >   (match_operand:V4HF 1 "nonimmediate_operand")
> >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -1983,14 +1983,22 @@ (define_expand "movd_v2hf_to_sse"
> > (match_operand:V2HF 1 "nonimmediate_operand"))
> >   (match_operand:V8HF 2 "reg_or_0_operand")
> >   (const_int 3)))]
> > -  "TARGET_SSE")
> > +  "TARGET_SSE"
> > +{
> > +  if (!flag_trapping_math && operands[2] == CONST0_RTX (V8HFmode))
> > +  {
> > +rtx op1 = force_reg (V2HFmode, operands[1]);
> > +emit_move_insn (operands[0], lowpart_subreg (V8HFmode, op1, V2HFmode));
> > +DONE;
> > +  }
> > +})
> >
> >  (define_expand "v2hf3"
> >[(set (match_operand:V2HF 0 "register_operand")
> > (plusminusmult:V2HF
> >   (match_operand:V2HF 1 "nonimmediate_operand")
> >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -2009,7 +2017,7 @@ (define_expand "divv2hf3"
> > (div:V2HF
> >   (match_operand:V2HF 1 "nonimmediate_operand")
> >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 5:15 PM Florian Weimer  wrote:
>
> * Hongtao Liu:
>
> > On Wed, Aug 9, 2023 at 3:17 PM Jan Beulich  wrote:
> >> Aiui these ABI levels were intended to be incremental, i.e. higher versions
> >> would include everything earlier ones cover. Without such a guarantee, how
> >> would you propose compatibility checks to be implemented in a way
>
> Correct, this was the intent.  But it's mostly to foster adoption and
> make it easier for developers to pick the variants that they want to
> target custom builds.  If it's an ascending chain, the trade-offs are
> simpler.
>
> > Are there many software implementations based on this assumption?
> > At least in GCC, it's not a big problem, we can adjust code for the
> > new micro-architecture level.
>
> The glibc framework can deal with alternate choices in principle,
> although I'd prefer not to go there for the reasons indicated.
>
> >> applicable both forwards and backwards? If a new level is wanted here, then
> >> I guess it could only be something like v3.5.
>
> > But if we use avx10.1 as v3.5, it's still not a subset of
> > x86-64-v4 (avx10.1 contains avx512fp16, avx512bf16, etc., which are not in
> > x86-64-v4), so there will still be a divergence.
> > Then the 256-bit part of x86-64-v4 as v3.5? That's too weird to me.
>
> The question is whether you want to mandate the 16-bit floating point
> extensions.  You might get better adoption if you stay compatible with
> shipping CPUs.  Furthermore, the 256-bit tuning apparently benefits
> current Intel CPUs, even though they can do 512-bit vectors.
It's not only 16-bit floating point. The whole picture of AVX512 -> AVX10 is in
Figure 1-1, "Intel® AVX-512 Feature Flags Across Intel® Xeon® Processor
Generations vs. Intel® AVX10",
and Figure 1-2, "Intel® ISA Families and Features",
at https://cdrdv2.intel.com/v1/dl/getContent/784343 (this link is a
direct download of a PDF).



>
> (The thread subject is a bit misleading for this sub-topic, by the way.)
>
> Thanks,
> Florian
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 4:14 PM Florian Weimer  wrote:
>
> * Richard Biener via Gcc-patches:
>
> > I don’t think we can realistically change the ABI.  If we could
> > passing them in two 256bit registers would be possible as well.
> >
> > Note I fully expect intel to turn around and implement 512 bits on a
> > 256 bit data path on the E cores in 5 years.  And it will take at
> > least that time for AVX10 to take off (look at AVX512 for this and how
> > they cautiously chose to include bf16 to cut off Zen4).  So IMHO we
> > shouldn’t worry at all and just wait and see for AVX42 to arrive.
>
> Yes, the direction is a bit unclear.  In retrospect, we could have
> defined x86-64-v4 to use 256 bit vector width, so it could eventually be
> compatible with AVX10; it's also what current Intel CPUs prefer (and
NOTE: avx10.x-256 also inhibits the use of 64-bit kmasks, which are
supposed to be used only by zmm instructions.
But in theory, the 64-bit kmask intrinsics can be used standalone,
i.e. kshift/kand/kor.
> past, with the exception of the Xeon Phi line).  But in the meantime,
> AMD has started to ship CPUs that seem to prefer 512 bit vectors,
> despite having a double pumped implementation.  (Disclaimer: All CPU
> preferences inferred from current compiler tuning defaults, not actual
> experiments. 8-/)
>
> To me, this looks like we may have defined x86-64-v4 prematurely, and
> this suggests we should wait a bit to see where things are heading.
>
> Thanks,
> Florian
>


-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 3:17 PM Jan Beulich  wrote:
>
> On 09.08.2023 04:14, Hongtao Liu wrote:
> > On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
> >>
> >> On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
> >>>
> >>> Do you have any comments on the interaction of AVX10 with the
> >>> micro-architecture levels defined in the ABI (and supported with
> >>> glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> >>> should we take it that any future levels will be ones supporting 512-bit
> >>> vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> >>> AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> >>> that only support 256-bit vector width will be considered to match the
> >>> x86-64-v3 micro-architecture level but not any higher level?
> >> This is actually something we really want to discuss with the community.
> >> Our proposal for x86-64-v5 is AVX10.2-256 (implying AVX10.1-256) + APX.
> >> One big reason is that Intel E-cores will only support 256-bit AVX10; if we
> >> want to use x86-64-v5 across server and client, it's better to
> >> default to 256-bit.
>
> Aiui these ABI levels were intended to be incremental, i.e. higher versions
> would include everything earlier ones cover. Without such a guarantee, how
> would you propose compatibility checks to be implemented in a way
Are there many software implementations based on this assumption?
At least in GCC, it's not a big problem; we can adjust the code for the
new micro-architecture level.
> applicable both forwards and backwards? If a new level is wanted here, then
> I guess it could only be something like v3.5.
But if we use avx10.1 as v3.5, it's still not a subset of
x86-64-v4 (avx10.1 contains avx512fp16, avx512bf16, etc., which are not in
x86-64-v4), so there will still be a divergence.
Then the 256-bit part of x86-64-v4 as v3.5? That's too weird to me.

Our main proposal is to make AVX10.x a new micro-architecture level
with a 256-bit default; either v3.5 or v5 would be acceptable if it's
just the name.
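
To make the subset argument concrete, here is a minimal sketch that models each level as a bitmask of feature groups. The groups and names (FEAT_*, LEVEL_*, level_is_subset) are hypothetical and much coarser than the real psABI definitions; they only encode what is stated in this thread: x86-64-v4 requires 512-bit AVX512, and avx10.1 carries fp16/bf16 instructions that x86-64-v4 does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Coarse, hypothetical feature groups.  The real levels list
   individual CPUID features.  */
enum
{
  FEAT_V3_BASE  = 1 << 0, /* everything x86-64-v3 requires */
  FEAT_EVEX_256 = 1 << 1, /* 128/256-bit forms of the v4 AVX512 subsets */
  FEAT_EVEX_512 = 1 << 2, /* 512-bit forms of the v4 AVX512 subsets */
  FEAT_FP16     = 1 << 3, /* avx512fp16-style instructions */
  FEAT_BF16     = 1 << 4  /* avx512bf16-style instructions */
};

enum
{
  LEVEL_V3          = FEAT_V3_BASE,
  LEVEL_V4          = LEVEL_V3 | FEAT_EVEX_256 | FEAT_EVEX_512,
  LEVEL_V4_256      = LEVEL_V3 | FEAT_EVEX_256,
  LEVEL_AVX10_1_256 = LEVEL_V3 | FEAT_EVEX_256 | FEAT_FP16 | FEAT_BF16
};

/* The existing levels are cumulative: v3 is contained in v4.  */
static_assert ((LEVEL_V3 & ~LEVEL_V4) == 0, "v3 is contained in v4");

/* True if every feature group required by level A is also in level B.  */
bool
level_is_subset (unsigned a, unsigned b)
{
  return (a & ~b) == 0;
}
```

Under this model, LEVEL_AVX10_1_256 and LEVEL_V4 are not subsets of each other, so neither can sit "between" v3 and v4 in a cumulative chain, while LEVEL_V4_256 is a subset of both.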
>
> Jan



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 10:14 AM Hongtao Liu  wrote:
>
> On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
> >
> > On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
> > >
> > > Do you have any comments on the interaction of AVX10 with the
> > > micro-architecture levels defined in the ABI (and supported with
> > > glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> > > should we take it that any future levels will be ones supporting 512-bit
> > > vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> > > AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> > > that only support 256-bit vector width will be considered to match the
> > > x86-64-v3 micro-architecture level but not any higher level?
> > > This is actually something we really want to discuss with the community.
> > > Our proposal for x86-64-v5 is AVX10.2-256 (implying AVX10.1-256) + APX.
> > > One big reason is that Intel E-cores will only support 256-bit AVX10; if we
> > > want to use x86-64-v5 across server and client, it's better to
> > > default to 256-bit.
> + ABI and LLVM folked for this topic.
s/folked/folks/

> > >
> > > --
> > > Joseph S. Myers
> > > jos...@codesourcery.com
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



--
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
>
> On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
> >
> > Do you have any comments on the interaction of AVX10 with the
> > micro-architecture levels defined in the ABI (and supported with
> > glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> > should we take it that any future levels will be ones supporting 512-bit
> > vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> > AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> > that only support 256-bit vector width will be considered to match the
> > x86-64-v3 micro-architecture level but not any higher level?
> This is actually something we really want to discuss with the community.
> Our proposal for x86-64-v5 is AVX10.2-256 (implying AVX10.1-256) + APX.
> One big reason is that Intel E-cores will only support 256-bit AVX10; if we
> want to use x86-64-v5 across server and client, it's better to
> default to 256-bit.
+ ABI and LLVM folked for this topic.
> >
> > --
> > Joseph S. Myers
> > jos...@codesourcery.com
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 10:06 AM Hongtao Liu  wrote:
>
> On Tue, Aug 8, 2023 at 8:45 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > On Tue, Aug 8, 2023 at 10:15 AM Jiang, Haochen via Gcc-patches
> >  wrote:
> > >
> > > Hi Jakub,
> > >
> > > > So, what does this imply for the current ISAs?
> > >
> > > AVX10 will imply AVX2 on the ISA level. And we suppose AVX10 is an
> > > independent ISA feature set. Although sharing the same instructions and
> > > encodings, AVX10 and AVX512 are conceptual independent features, which
> > > means they are orthogonal.
> > >
> > > > > The expectations in lots of config/i386/* is that -mavx512f / TARGET_AVX512F
> > > > > means 512 bit vector support is available and most of the various -mavx512XXX
> > > > > options imply -mavx512f (and -mno-avx512f turns those off).  And if
> > > > > -mavx512vl / TARGET_AVX512VL isn't available, tons of places just use
> > > > > 512-bit EVEX instructions for 256-bit or 128-bit stuff (mostly to be able to
> > > > > access [xy]mm16+).
> > > >
> > > > For AVX10, the 128/256/scalar version of the instructions are always there, and
> > > > also for [xy]mm16+. 512 version is "optional", which needs user to indicate them
> > > > in options. When 512 version is enabled, 128/256/scalar version is also enabled,
> > > > which is kind of reverse relation between the current AVX512F/AVX512VL.
> > > >
> > > > Since we take AVX10 and AVX512 are orthogonal, we will add OR logic for the current
> > > > pattern, which is shown in our AVX512DQ+VL sample patches.
> >
> > > Hmm, so it sounds like AVX10 is currently, at the 10.1 level, a way to specify
> > > AVX512F and AVX512VL "differently", so wouldn't it make sense to make it
> > In the future there will be platforms that only support AVX10.x-256 and no
> > AVX512 at all; on such a platform it doesn't make much sense to disable
> > parts of AVX512.
> > We really want to make AVX10.x an indivisible feature, just like any other
> > individual CPUID feature.
> > > complement those only so one can use, say, -mavx10 -mno-avx512bf16 to disable
> > > parts of the former AVX512 ISA one doesn't like to get code generated for?
> > > -mavx10 would then enable all the existing sub-AVX512 ISAs?
> Another alternative solution is
to split AVX512 into AVX512-256 and AVX512-512 (e.g. AVX512F-256,
AVX512FP16-256, AVX512FP16-512), and make AVX10.1-256
imply the AVX512-256 parts and AVX10.1-512 imply the AVX512-512 parts.
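
A rough sketch of how such an implication scheme could look, assuming ISAs are tracked as bits in a mask. All names here (ISA_*, expand_implied_isas) are made up for illustration and none of them exist in GCC. The sketch also assumes, following the quoted description of AVX10, that enabling the 512-bit variant enables the 256-bit one as well.

```c
#include <assert.h>

/* Hypothetical ISA bits for illustration only.  */
enum
{
  ISA_AVX512F_256    = 1 << 0,
  ISA_AVX512F_512    = 1 << 1,
  ISA_AVX512FP16_256 = 1 << 2,
  ISA_AVX512FP16_512 = 1 << 3,
  ISA_AVX10_1_256    = 1 << 4,
  ISA_AVX10_1_512    = 1 << 5
};

/* Return ISA plus everything it implies under the split scheme.  */
unsigned
expand_implied_isas (unsigned isa)
{
  unsigned out = isa;

  /* AVX10.1-512 implies the 512-bit AVX512 parts and AVX10.1-256.  */
  if (out & ISA_AVX10_1_512)
    out |= ISA_AVX10_1_256 | ISA_AVX512F_512 | ISA_AVX512FP16_512;

  /* AVX10.1-256 implies the 256-bit AVX512 parts.  */
  if (out & ISA_AVX10_1_256)
    out |= ISA_AVX512F_256 | ISA_AVX512FP16_256;

  /* Expansion only ever adds bits.  */
  assert ((out & isa) == isa);
  return out;
}
```

In this model a 256-bit-only platform never gets any of the *_512 bits, and a legacy AVX512 bit on its own implies nothing about AVX10.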
> >
> > > > > Sure, I expect all AVX10.N CPUs will have AVX512VL CPUID, will they have
> > > > > AVX512F CPUID even when the 512-bit vectors aren't present? What happens if
> > > > > one mixes the -mavx10* options together with -mno-avx512vl or similar
> > > > > options?  Will -mno-avx512f still imply -mno-avx512vl etc.?
> > > >
> > > > For the CPUID part, AVX10 and AVX512 have different emulation. Only Xeon Server
> > > > will have AVX512 related CPUIDs for backward compatibility. For GNR, it will be
> > > > AVX512F, AVX512VL, AVX512CD, AVX512BW, AVX512DQ, AVX512_IFMA, AVX512_VBMI,
> > > > AVX512_VNNI, AVX512_BF16, AVX512_BITALG, AVX512_VPOPCNTDQ, AVX512_VBMI2,
> > > > AVX512_FP16. Also, it will have AVX10 CPUIDs with 512 bit support set. Atom Server and
> > > > client will only have AVX10 CPUIDs with 256 bit support set.
> > > >
> > > > -mno-avx512f will still imply -mno-avx512vl.
> > > >
> > > > As we mentioned below, we don't recommend users to combine the AVX10 and legacy
> > > > AVX512 options. We understand that there will be different opinions on what should
> > > > compiler behave on some controversial option combinations.
> > > >
> > > > If there is someone mixes the options, the golden rule is that we are using OR logic.
> > > > Therefore, enabling either feature will turn on the shared instructions, no matter the other
> > > > feature is not mentioned or closed. That is why we are emitting warning for some scenarios,
> > > > which is also mentioned in the letter.
> >
> > > I'm refraining from commenting on the senselessness of AVX10 as you're
> > > likely on the same receiving side as us.
> >
> > Thanks,
> > Richard.
> >
> > > Thx,
> > > Haochen
> > >
> > > >
> > > >   Jakub
> > >
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 8, 2023 at 8:45 PM Richard Biener via Gcc-patches
 wrote:
>
> On Tue, Aug 8, 2023 at 10:15 AM Jiang, Haochen via Gcc-patches
>  wrote:
> >
> > Hi Jakub,
> >
> > > So, what does this imply for the current ISAs?
> >
> > AVX10 will imply AVX2 on the ISA level. And we suppose AVX10 is an
> > independent ISA feature set. Although sharing the same instructions and
> > encodings, AVX10 and AVX512 are conceptual independent features, which
> > means they are orthogonal.
> >
> > > > The expectations in lots of config/i386/* is that -mavx512f / TARGET_AVX512F
> > > > means 512 bit vector support is available and most of the various -mavx512XXX
> > > > options imply -mavx512f (and -mno-avx512f turns those off).  And if
> > > > -mavx512vl / TARGET_AVX512VL isn't available, tons of places just use
> > > > 512-bit EVEX instructions for 256-bit or 128-bit stuff (mostly to be able to
> > > > access [xy]mm16+).
> > >
> > > For AVX10, the 128/256/scalar version of the instructions are always there, and
> > > also for [xy]mm16+. 512 version is "optional", which needs user to indicate them
> > > in options. When 512 version is enabled, 128/256/scalar version is also enabled,
> > > which is kind of reverse relation between the current AVX512F/AVX512VL.
> > >
> > > Since we take AVX10 and AVX512 are orthogonal, we will add OR logic for the current
> > > pattern, which is shown in our AVX512DQ+VL sample patches.
>
> Hmm, so it sounds like AVX10 is currently, at the 10.1 level, a way to specify
> AVX512F and AVX512VL "differently", so wouldn't it make sense to make it
In the future there will be platforms that only support AVX10.x-256 and no
AVX512 at all; on such a platform it doesn't make much sense to disable
parts of AVX512.
We really want to make AVX10.x an indivisible feature, just like any other
individual CPUID feature.
> complement those only so one can use, say, -mavx10 -mno-avx512bf16 to disable
> parts of the former AVX512 ISA one doesn't like to get code generated for?
> -mavx10 would then enable all the existing sub-AVX512 ISAs?
Another alternative solution is
>
> > > > Sure, I expect all AVX10.N CPUs will have AVX512VL CPUID, will they have
> > > > AVX512F CPUID even when the 512-bit vectors aren't present? What happens if
> > > > one mixes the -mavx10* options together with -mno-avx512vl or similar
> > > > options?  Will -mno-avx512f still imply -mno-avx512vl etc.?
> > >
> > > For the CPUID part, AVX10 and AVX512 have different emulation. Only Xeon Server
> > > will have AVX512 related CPUIDs for backward compatibility. For GNR, it will be
> > > AVX512F, AVX512VL, AVX512CD, AVX512BW, AVX512DQ, AVX512_IFMA, AVX512_VBMI,
> > > AVX512_VNNI, AVX512_BF16, AVX512_BITALG, AVX512_VPOPCNTDQ, AVX512_VBMI2,
> > > AVX512_FP16. Also, it will have AVX10 CPUIDs with 512 bit support set. Atom Server and
> > > client will only have AVX10 CPUIDs with 256 bit support set.
> > >
> > > -mno-avx512f will still imply -mno-avx512vl.
> > >
> > > As we mentioned below, we don't recommend users to combine the AVX10 and legacy
> > > AVX512 options. We understand that there will be different opinions on what should
> > > compiler behave on some controversial option combinations.
> > >
> > > If there is someone mixes the options, the golden rule is that we are using OR logic.
> > > Therefore, enabling either feature will turn on the shared instructions, no matter the other
> > > feature is not mentioned or closed. That is why we are emitting warning for some scenarios,
> > > which is also mentioned in the letter.
>
> I'm refraining from commenting on the senselessness of AVX10 as you're
> likely on the same receiving side as us.
>
> Thanks,
> Richard.
>
> > Thx,
> > Haochen
> >
> > >
> > >   Jakub
> >



-- 
BR,
Hongtao


Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
>
> Do you have any comments on the interaction of AVX10 with the
> micro-architecture levels defined in the ABI (and supported with
> glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> should we take it that any future levels will be ones supporting 512-bit
> vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> that only support 256-bit vector width will be considered to match the
> x86-64-v3 micro-architecture level but not any higher level?
This is actually something we really want to discuss with the community.
Our proposal for x86-64-v5 is AVX10.2-256 (implying AVX10.1-256) + APX.
One big reason is that Intel E-cores will only support 256-bit AVX10; if we
want to use x86-64-v5 across server and client, it's better to
default to 256-bit.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
BR,
Hongtao


Re: [PATCH] Fix ICE in rtl check when bootstrap.

2023-08-07 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 7, 2023 at 4:54 PM liuhongt  wrote:
>
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c: In function ‘matmul_i1_avx512f’:
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_mask, at config/i386/i386.cc:19460
>  1781 | }
>   | ^
> 0x5559de26dc2d rtl_check_failed_type2(rtx_def const*, int, int, int, char const*, int, char const*)
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/rtl.cc:761
> 0x5559de340bfe vpternlog_redundant_operand_mask(rtx_def**)
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/i386.cc:19460
> 0x5559dfec67a6 split_44
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/sse.md:12730
> 0x5559dfec67a6 split_63
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/sse.md:28428
> 0x5559deb8a682 try_split(rtx_def*, rtx_insn*, int)
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/emit-rtl.cc:3800
> 0x5559deb8adf2 try_split(rtx_def*, rtx_insn*, int)
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/emit-rtl.cc:3972
> 0x5559def69194 split_insn
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:3385
> 0x5559def70c57 split_all_insns()
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:3489
> 0x5559def70d0c execute
>         /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:4413
>
> Use INTVAL (imm_op) instead of XINT (imm_op, 0).
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
Pushed to trunk as an obvious fix.
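
For readers following along: per the ICE message, a const_int stores its value in a wide-int ('w') slot, so the checked XINT accessor (which expects 'i' or 'n') aborts and INTVAL is the matching accessor. The mask computation itself can be sketched standalone on a plain integer. In the sketch below only the first comparison is taken from the quoted diff; the second and third are written by analogy and are an assumption here, and redundant_operand_mask is a hypothetical stand-in for the GCC function.

```c
#include <assert.h>

/* Standalone sketch: imm8 is the VPTERNLOG truth table, indexed by
   (op1 << 2) | (op2 << 1) | op3.  Bit N of the result is set when
   operand N+1 does not influence the output.  */
int
redundant_operand_mask (int imm8)
{
  int mask = 0;

  assert (imm8 >= 0 && imm8 <= 0xFF);

  /* Operand 1 is redundant if the op1=1 half equals the op1=0 half.  */
  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
    mask |= 1;
  /* Operand 2 (assumed by analogy): compare entries 2 apart.  */
  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
    mask |= 2;
  /* Operand 3 (assumed by analogy): compare adjacent entries.  */
  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
    mask |= 4;

  return mask;
}
```

For example, imm8 = 0xAA is the truth table of "result = operand 3", so operands 1 and 2 are redundant and the mask is 3, which is the case the quoted define_split tests for.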
>
> gcc/ChangeLog:
>
> * config/i386/i386-protos.h (vpternlog_redundant_operand_mask):
>   Adjust parameter type.
> * config/i386/i386.cc (vpternlog_redundant_operand_mask): Use
>   INTVAL instead of XINT, also adjust parameter type from rtx* to
>   rtx since the function only needs operands[4] in vpternlog
>   pattern.
> (substitute_vpternlog_operands): Pass operands[4] instead of
>   operands to vpternlog_redundant_operand_mask
> * config/i386/sse.md: Ditto.
> ---
>  gcc/config/i386/i386-protos.h | 2 +-
>  gcc/config/i386/i386.cc   | 6 +++---
>  gcc/config/i386/sse.md| 4 ++--
>  3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index e547ee64587..fc2f1f13b78 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -70,7 +70,7 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
>  extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
>  extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
>
> -extern int vpternlog_redundant_operand_mask (rtx[]);
> +extern int vpternlog_redundant_operand_mask (rtx);
>  extern void substitute_vpternlog_operands (rtx[]);
>
>  extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 8cd26eb54fa..50860050049 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -19454,10 +19454,10 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
>  /* Return a mask of VPTERNLOG operands that do not affect output.  */
>
>  int
> -vpternlog_redundant_operand_mask (rtx *operands)
> +vpternlog_redundant_operand_mask (rtx pternlog_imm)
>  {
>int mask = 0;
> -  int imm8 = XINT (operands[4], 0);
> +  int imm8 = INTVAL (pternlog_imm);
>
>if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
>  mask |= 1;
> @@ -19475,7 +19475,7 @@ vpternlog_redundant_operand_mask (rtx *operands)
>  void
>  substitute_vpternlog_operands (rtx *operands)
>  {
> -  int mask = vpternlog_redundant_operand_mask (operands);
> +  int mask = vpternlog_redundant_operand_mask (operands[4]);
>
>if (mask & 1) /* The first operand is redundant.  */
>  operands[1] = operands[2];
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 7e2aa3f995c..c53450fd965 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12706,7 +12706,7 @@ (define_split
>(match_operand:V 3 "memory_operand")
>(match_operand:SI 4 "const_0_to_255_operand")]
>   UNSPEC_VTERNLOG))]
> -  "!reload_completed && vpternlog_redundant_operand_mask (operands) == 3"
> +  "!reload_completed && vpternlog_redundant_operand_mask (operands[4]) == 3"
>[(set (match_dup 0)
> (match_dup 3))
> (set (match_dup 0)
> @@ -12727,7 +12727,7 @@ (define_split
>
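
For readers following the refactoring above, here is a minimal standalone
sketch, in plain C, of the check this function performs on the VPTERNLOG
immediate. Only the first test (mask |= 1) is visible in the quoted hunk; the
other two tests are an assumption, derived from the truth-table layout in
which the imm8 bit index is (A << 2) | (B << 1) | C. The name
redundant_operand_mask is hypothetical, and the sketch takes a plain int
rather than an rtx.

```c
#include <assert.h>

/* Return a bitmask of ternary-logic operands that cannot affect the result.
   Bit 0 stands for the first operand (A), bit 1 for the second (B) and
   bit 2 for the third (C).  IMM8 is the 8-entry truth table, indexed by
   (A << 2) | (B << 1) | C.  */
int
redundant_operand_mask (int imm8)
{
  int mask = 0;

  /* The immediate is a single byte.  */
  assert (imm8 >= 0 && imm8 <= 255);

  /* A is redundant when the A=1 half of the table equals the A=0 half.
     This test is the one shown in the quoted hunk.  */
  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
    mask |= 1;

  /* Assumption: B is redundant when the B=1 entries match the B=0 ones.  */
  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
    mask |= 2;

  /* Assumption: C is redundant when the C=1 entries match the C=0 ones.  */
  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
    mask |= 4;

  return mask;
}
```

With this layout an immediate of 0xAA (the result equals the third operand)
yields a mask of 3, which is the value the define_split condition above
compares against.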

Re: [PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-07 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 7, 2023 at 5:19 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Mon, Aug 7, 2023 at 10:57 AM liuhongt  wrote:
> >
> > Similar to r14-2786-gade30fad6669e5, this patch is for V4HFmode/V2HFmode.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/110762
> > * config/i386/mmx.md (3): Changed from define_insn
> > to define_expand and break into ..
> > (v4hf3): .. this.
> > (divv4hf3): .. this.
> > (v2hf3): .. this.
> > (divv2hf3): .. this.
> > (movd_v2hf_to_sse): New define_expand.
> > (movq__to_sse): Extend to V4HFmode.
> > (mmxdoublevecmode): Ditto.
> > (V2FI_V4HF): New mode iterator.
> > * config/i386/sse.md (*vec_concatv4sf): Extend to handle V8HF
> > by using mode iterator V4SF_V8HF, renamed to ..
> > (*vec_concat): .. this.
> > (*vec_concatv4sf_0): Extend to handle V8HF by using mode
> > iterator V4SF_V8HF, renamed to ..
> > (*vec_concat_0): .. this.
> > (*vec_concatv8hf_movss): New define_insn.
> > (V4SF_V8HF): New mode iterator.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110762-v4hf.c: New test.
>
> LGTM.
>
> Please also note the RFC patch [1] that relaxes clears for V2SFmode
> with -fno-trapping-math. The patched compiler will then emit the same
> code as clang does for -O2. Which raises another question - should gcc
> default to -fno-trapping-math?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625795.html
>
I can create another patch to handle my part of the -fno-trapping-math
optimization.
> Thanks,
> Uros.
>
> > ---
> >  gcc/config/i386/mmx.md| 109 +++---
> >  gcc/config/i386/sse.md|  40 +--
> >  gcc/testsuite/gcc.target/i386/pr110762-v4hf.c |  57 +
> >  3 files changed, 177 insertions(+), 29 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110762-v4hf.c
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index 896af76a33f..88bdf084f54 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -79,9 +79,7 @@ (define_mode_iterator V_16_32_64
> >  ;; V2S* modes
> >  (define_mode_iterator V2FI [V2SF V2SI])
> >
> > -;; 4-byte and 8-byte float16 vector modes
> > -(define_mode_iterator VHF_32_64 [V4HF V2HF])
> > -
> > +(define_mode_iterator V2FI_V4HF [V2SF V2SI V4HF])
> >  ;; Mapping from integer vector mode to mnemonic suffix
> >  (define_mode_attr mmxvecsize
> >[(V8QI "b") (V4QI "b") (V2QI "b")
> > @@ -108,7 +106,7 @@ (define_mode_attr mmxintvecmodelower
> >
> >  ;; Mapping of vector modes to a vector mode of double size
> >  (define_mode_attr mmxdoublevecmode
> > -  [(V2SF "V4SF") (V2SI "V4SI")])
> > +  [(V2SF "V4SF") (V2SI "V4SI") (V4HF "V8HF")])
> >
> >  ;; Mapping of vector modes back to the scalar modes
> >  (define_mode_attr mmxscalarmode
> > @@ -594,7 +592,7 @@ (define_insn "sse_movntq"
> >  (define_expand "movq__to_sse"
> >[(set (match_operand: 0 "register_operand")
> > (vec_concat:
> > - (match_operand:V2FI 1 "nonimmediate_operand")
> > + (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
> >   (match_dup 2)))]
> >"TARGET_SSE2"
> >"operands[2] = CONST0_RTX (mode);")
> > @@ -1927,21 +1925,94 @@ (define_expand "lroundv2sfv2si2"
> >  ;;
> >  ;
> >
> > -(define_insn "3"
> > -  [(set (match_operand:VHF_32_64 0 "register_operand" "=v")
> > -   (plusminusmultdiv:VHF_32_64
> > - (match_operand:VHF_32_64 1 "register_operand" "v")
> > - (match_operand:VHF_32_64 2 "register_operand" "v")))]
> > +(define_expand "v4hf3"
> > +  [(set (match_operand:V4HF 0 "register_operand")
> > +   (plusminusmult:V4HF
> > + (match_operand:V4HF 1 "nonimmediate_operand")
> > + (match_operand:V4HF 2 "nonimmediate_operand")))]
> >"TARGET_AVX512FP16 && TARGET_AVX512VL"
> > -  "vph\t{%2, %1, %0|%0, %1, %2}"
> > -  [(set (attr "type")
> > -  (cond [(match_test " == MULT")
> > -   (const_string "ssemul")
> > -(match_test " == DIV")
> > -   (const_string "ssediv")]
> > -(const_string "sseadd")))
> > -   (set_attr "prefix" "evex")
> > -   (set_attr "mode" "V8HF")])
> > +{
> > +  rtx op2 = gen_reg_rtx (V8HFmode);
> > +  rtx op1 = gen_reg_rtx (V8HFmode);
> > +  rtx op0 = gen_reg_rtx (V8HFmode);
> > +
> > +  emit_insn (gen_movq_v4hf_to_sse (op2, operands[2]));
> > +  emit_insn (gen_movq_v4hf_to_sse (op1, operands[1]));
> > +
> > +  emit_insn (gen_v8hf3 (op0, op1, op2));
> > +
> > +  emit_move_insn (operands[0], lowpart_subreg (V4HFmode, op0, V8HFmode));
> > +  DONE;
> > +})
> > +
> > +(define_expand "divv4hf3"
> > +  [(set (match_operand:V4HF 0 "register_operand")
> > +   (div:V4HF
> > + (match_operand:V4HF 1 

Re: [PATCH 00/10] x86: (mainly) "prefix_extra" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:09 PM Jan Beulich via Gcc-patches
 wrote:
>
> Having noticed various bogus uses, I thought I'd go through and audit
> them all. This is the result, with some other attributes also adjusted
> as noticed in the process. (I think this tidying also is a good thing
> to have ahead of APX further complicating insn length calculations.)
Thanks for doing this.
I'm only checking how the attributes are modified; I haven't gone into the
details of the instruction encodings (I think you know those better than I do).
>
> 01: "prefix_extra" tidying
> 02: "sse4arg" adjustments
> 03: "ssemuladd" adjustments
> 04: "prefix_extra" can't really be "2"
> 05: replace/correct bogus "prefix_extra"
> 06: drop stray "prefix_extra"
> 07: add (adjust) XOP insn attributes
> 08: add missing "prefix" attribute to VF{,C}MULC
> 09: correct "length_immediate" in a few cases
> 10: drop redundant "prefix_data16" attributes
>
> Jan



-- 
BR,
Hongtao


Re: [PATCH 08/10] x86: add missing "prefix" attribute to VF{,C}MULC

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:16 PM Jan Beulich via Gcc-patches
 wrote:
>
> gcc/
>
> * config/i386/sse.md
> (__): Add
> "prefix" attribute.
> 
> (avx512fp16_sh_v8hf):
> Likewise.
Ok.
> ---
> Talking of "prefix": Shouldn't at least V32HF and V32BF have it also
> default to "evex"? (It won't matter right here, but it may matter
> elsewhere.)
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -6790,6 +6790,7 @@
>return "v\t{%2, %1, 
> %0|%0, %1, %2}";
>  }
>[(set_attr "type" "ssemul")
> +   (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
>  (define_expand "avx512fp16_fmaddcsh_v8hf_maskz"
> @@ -6993,6 +6994,7 @@
>return "vsh\t{%2, %1, 
> %0|%0, %1, 
> %2}";
>  }
>[(set_attr "type" "ssemul")
> +   (set_attr "prefix" "evex")
> (set_attr "mode" "V8HF")])
>
>  ;
>


-- 
BR,
Hongtao


Re: [PATCH 10/10] x86: drop redundant "prefix_data16" attributes

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:17 PM Jan Beulich via Gcc-patches
 wrote:
>
> The attribute defaults to 1 for TI-mode insns of type sselog, sselog1,
> sseiadd, sseimul, and sseishft.
>
> In *v8hi3 [smaxmin] and *v16qi3 [umaxmin] also drop the
> similarly stray "prefix_extra" on this occasion. These two max/min
> flavors are encoded in 0f space.
Ok.
>
> gcc/
>
> * config/i386/mmx.md (*mmx_pinsrd): Drop "prefix_data16".
> (*mmx_pinsrb): Likewise.
> (*mmx_pextrb): Likewise.
> (*mmx_pextrb_zext): Likewise.
> (mmx_pshufbv8qi3): Likewise.
> (mmx_pshufbv4qi3): Likewise.
> (mmx_pswapdv2si2): Likewise.
> (*pinsrb): Likewise.
> (*pextrb): Likewise.
> (*pextrb_zext): Likewise.
> * config/i386/sse.md (*sse4_1_mulv2siv2di3): Likewise.
> (*sse2_eq3): Likewise.
> (*sse2_gt3): Likewise.
> (_pinsr): Likewise.
> (*vec_extract): Likewise.
> (*vec_extract_zext): Likewise.
> (*vec_extractv16qi_zext): Likewise.
> (ssse3_phwv8hi3): Likewise.
> (ssse3_pmaddubsw128): Likewise.
> (*_pmulhrsw3): Likewise.
> (_pshufb3): Likewise.
> (_psign3): Likewise.
> (_palignr): Likewise.
> (*abs2): Likewise.
> (sse4_2_pcmpestr): Likewise.
> (sse4_2_pcmpestri): Likewise.
> (sse4_2_pcmpestrm): Likewise.
> (sse4_2_pcmpestr_cconly): Likewise.
> (sse4_2_pcmpistr): Likewise.
> (sse4_2_pcmpistri): Likewise.
> (sse4_2_pcmpistrm): Likewise.
> (sse4_2_pcmpistr_cconly): Likewise.
> (vgf2p8affineinvqb_): Likewise.
> (vgf2p8affineqb_): Likewise.
> (vgf2p8mulb_): Likewise.
> (*v8hi3 [smaxmin]): Drop "prefix_data16" and
> "prefix_extra".
> (*v16qi3 [umaxmin]): Likewise.
>
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -3863,7 +3863,6 @@
>  }
>  }
>[(set_attr "isa" "noavx,avx")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "type" "sselog")
> (set_attr "length_immediate" "1")
> @@ -3950,7 +3949,6 @@
>  }
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,vex")
> @@ -4002,7 +4000,6 @@
> %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
> %vpextrb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4017,7 +4014,6 @@
>"TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
>"%vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4035,7 +4031,6 @@
> vpshufb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,maybe_evex")
> (set_attr "btver2_decode" "vector")
> @@ -4053,7 +4048,6 @@
> vpshufb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,maybe_evex")
> (set_attr "btver2_decode" "vector")
> @@ -4191,7 +4185,6 @@
> (set_attr "mmx_isa" "native,*")
> (set_attr "type" "mmxcvt,sselog1")
> (set_attr "prefix_extra" "1,*")
> -   (set_attr "prefix_data16" "*,1")
> (set_attr "length_immediate" "*,1")
> (set_attr "mode" "DI,TI")])
>
> @@ -4531,7 +4524,6 @@
>  }
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,vex")
> @@ -4575,7 +4567,6 @@
> %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
> %vpextrb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4590,7 +4581,6 @@
>"TARGET_SSE4_1"
>"%vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15614,7 +15614,6 @@
> vpmuldq\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseimul")
> -   (set_attr "prefix_data16" "1,1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
> @@ -16688,8 +16687,6 @@
>

Re: [PATCH 07/10] x86: add (adjust) XOP insn attributes

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> Many were lacking "prefix" and "prefix_extra", some had a bogus value of
> 2 for "prefix_extra" (presumably inherited from their SSE5 counterparts,
> which are long gone) and a meaningless "prefix_data16" one. Where
> missing, "mode" attributes are also added. (Note that "sse4arg" and
> "ssemuladd" ones don't need further adjustment in this regard.)
Ok.
>
> gcc/
>
> * config/i386/sse.md (xop_phaddbw): Add "prefix",
> "prefix_extra", and "mode" attributes.
> (xop_phaddbd): Likewise.
> (xop_phaddbq): Likewise.
> (xop_phaddwd): Likewise.
> (xop_phaddwq): Likewise.
> (xop_phadddq): Likewise.
> (xop_phsubbw): Likewise.
> (xop_phsubwd): Likewise.
> (xop_phsubdq): Likewise.
> (xop_rotl3): Add "prefix" and "prefix_extra" attributes.
> (xop_rotr3): Likewise.
> (xop_frcz2): Likewise.
> (*xop_vmfrcz2): Likewise.
> (xop_vrotl3): Add "prefix" attribute. Change
> "prefix_extra" to 1.
> (xop_sha3): Likewise.
> (xop_shl3): Likewise.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -24897,7 +24897,10 @@
>   (const_int 13) (const_int 15)])]
>"TARGET_XOP"
>"vphaddbw\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddbd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -24926,7 +24929,10 @@
>(const_int 11) (const_int 15)]))]
>"TARGET_XOP"
>"vphaddbd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddbq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -24971,7 +24977,10 @@
>  (parallel [(const_int 7) (const_int 15)])))]
>"TARGET_XOP"
>"vphaddbq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddwd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -24988,7 +24997,10 @@
>   (const_int 5) (const_int 7)])]
>"TARGET_XOP"
>"vphaddwd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddwq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25013,7 +25025,10 @@
> (parallel [(const_int 3) (const_int 7)]))]
>"TARGET_XOP"
>"vphaddwq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phadddq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25028,7 +25043,10 @@
>(parallel [(const_int 1) (const_int 3)])]
>"TARGET_XOP"
>"vphadddq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubbw"
>[(set (match_operand:V8HI 0 "register_operand" "=x")
> @@ -25049,7 +25067,10 @@
>   (const_int 13) (const_int 15)])]
>"TARGET_XOP"
>"vphsubbw\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubwd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -25066,7 +25087,10 @@
>   (const_int 5) (const_int 7)])]
>"TARGET_XOP"
>"vphsubwd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubdq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25081,7 +25105,10 @@
>(parallel [(const_int 1) (const_int 3)])]
>"TARGET_XOP"
>"vphsubdq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  ;; XOP permute instructions
>  (define_insn "xop_pperm"
> @@ -25209,6 +25236,8 @@
>"TARGET_XOP"
>"vprot\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sseishft")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" 

Re: [PATCH 09/10] x86: correct "length_immediate" in a few cases

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> When first added explicitly in 3ddffba914b2 ("i386.md
> (sse4_1_round2): Add avx512f alternative"), "*" should not have
> been used for the pre-existing alternative. The attribute was plain
> missing. Subsequent changes adding more alternatives then generously
> extended the bogus pattern.
>
> Apparently something similar happened to the two mmx_pblendvb_* insns.
Ok.
>
> gcc/
>
> * config/i386/i386.md (sse4_1_round2): Make
> "length_immediate" uniformly 1.
> * config/i386/mmx.md (mmx_pblendvb_v8qi): Likewise.
> (mmx_pblendvb_): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -21594,7 +21594,7 @@
> vrndscale\t{%2, %1, %d0|%d0, %1, %2}"
>[(set_attr "type" "ssecvt")
> (set_attr "prefix_extra" "1,1,1,*,*")
> -   (set_attr "length_immediate" "*,*,*,1,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex,maybe_vex,maybe_vex,evex,evex")
> (set_attr "isa" "noavx512f,noavx512f,noavx512f,avx512f,avx512f")
> (set_attr "avx_partial_xmm_update" "false,false,true,false,true")
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -3094,7 +3094,7 @@
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "ssemov")
> (set_attr "prefix_extra" "1")
> -   (set_attr "length_immediate" "*,*,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "TI")])
> @@ -3114,7 +3114,7 @@
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "ssemov")
> (set_attr "prefix_extra" "1")
> -   (set_attr "length_immediate" "*,*,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "TI")])
>


-- 
BR,
Hongtao
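
The "*,*,1" versus "1" distinction in this patch relies on how per-alternative
attribute values are written in a machine description: a comma-separated
set_attr value gives one entry per alternative, a single value applies to all
alternatives, and "*" means "use the attribute's default". The following is a
small sketch of that lookup in plain C. It is not GCC code; the function names
attr_for_alternative and uses_default are hypothetical, and whitespace inside
the list is not handled.

```c
#include <assert.h>
#include <string.h>

/* Copy the attribute value for alternative ALT out of SPEC into BUF.
   SPEC is either a single value that applies to every alternative ("1")
   or a comma-separated list with one entry per alternative ("*,*,1").
   Return 0 on success, -1 if ALT is out of range or BUF is too small.  */
int
attr_for_alternative (const char *spec, int alt, char *buf, size_t buflen)
{
  assert (spec != NULL && buf != NULL && buflen > 0);

  if (alt < 0)
    return -1;

  /* No comma: the one value is shared by all alternatives.  */
  if (strchr (spec, ',') == NULL)
    {
      if (strlen (spec) >= buflen)
        return -1;
      strcpy (buf, spec);
      return 0;
    }

  /* Skip ALT entries of the list.  */
  const char *p = spec;
  for (int i = 0; i < alt; i++)
    {
      p = strchr (p, ',');
      if (p == NULL)
        return -1;
      p++;
    }

  /* Copy the entry up to the next comma or the end of the string.  */
  size_t len = strcspn (p, ",");
  if (len >= buflen)
    return -1;
  memcpy (buf, p, len);
  buf[len] = '\0';
  return 0;
}

/* Return nonzero if VALUE asks for the attribute's default.  */
int
uses_default (const char *value)
{
  return strcmp (value, "*") == 0;
}
```

Under this reading, the old "*,*,1" left the first two alternatives at the
default immediate length, while the new "1" states it for every alternative.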


Re: [PATCH 04/10] x86: "prefix_extra" can't really be "2"

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:11 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the three remaining instances separate "prefix_0f" and "prefix_rep"
> are what is wanted instead.
Ok.
>
> gcc/
>
> * config/i386/i386.md (rdbase): Add "prefix_0f" and
> "prefix_rep". Drop "prefix_extra".
> (wrbase): Likewise.
> (ptwrite): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -25914,7 +25914,8 @@
>"TARGET_64BIT && TARGET_FSGSBASE"
>"rdbase\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "wrbase"
>[(unspec_volatile [(match_operand:SWI48 0 "register_operand" "r")]
> @@ -25922,7 +25923,8 @@
>"TARGET_64BIT && TARGET_FSGSBASE"
>"wrbase\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "ptwrite"
>[(unspec_volatile [(match_operand:SWI48 0 "nonimmediate_operand" "rm")]
> @@ -25930,7 +25932,8 @@
>"TARGET_PTWRITE"
>"ptwrite\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "@rdrand"
>[(set (match_operand:SWI248 0 "register_operand" "=r")
>


-- 
BR,
Hongtao
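
As background for the change above: RDFSBASE/RDGSBASE, WRFSBASE/WRGSBASE and
PTWRITE are all encoded as F3 0F AE, that is, one F3 prefix followed by the 0F
escape byte. The sketch below is a deliberately simplified model in plain C of
how the two descriptions account for those bytes. It is not GCC's actual
length computation; the struct and function names are hypothetical, and REX,
opcode and ModRM bytes are left out.

```c
#include <assert.h>

/* Hypothetical model: each field counts bytes emitted ahead of the main
   opcode byte.  */
struct prefix_attrs
{
  int prefix_rep;    /* 1 if an F3/F2 prefix byte is emitted.  */
  int prefix_0f;     /* 1 if the 0F escape byte is emitted.  */
  int prefix_extra;  /* Further escape bytes, as modelled here.  */
};

/* Sum of the modelled prefix and escape bytes.  */
int
prefix_byte_count (struct prefix_attrs a)
{
  assert (a.prefix_rep >= 0 && a.prefix_0f >= 0 && a.prefix_extra >= 0);
  return a.prefix_rep + a.prefix_0f + a.prefix_extra;
}

/* Old description: two anonymous "extra" bytes.  */
struct prefix_attrs
old_rdbase_attrs (void)
{
  struct prefix_attrs a = { 0, 0, 2 };
  return a;
}

/* New description: one F3 prefix plus the 0F escape.  */
struct prefix_attrs
new_rdbase_attrs (void)
{
  struct prefix_attrs a = { 1, 1, 0 };
  return a;
}
```

In this model both descriptions add up to two bytes; the difference is that
the new one names what each byte actually is.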


Re: [PATCH 06/10] x86: drop stray "prefix_extra"

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:16 PM Jan Beulich via Gcc-patches
 wrote:
>
> While the attribute is relevant for legacy- and VEX-encoded insns, it is
> of no relevance for EVEX-encoded ones.
>
> While there in avx512dq_broadcast_1 add
> the missing "length_immediate".
Ok.
>
> gcc/
>
> * config/i386/sse.md
> (*_eq3_1): Drop
> "prefix_extra".
> (avx512dq_vextract64x2_1_mask): Likewise.
> (*avx512dq_vextract64x2_1): Likewise.
> (avx512f_vextract32x4_1_mask): Likewise.
> (*avx512f_vextract32x4_1): Likewise.
> (vec_extract_lo__mask [AVX512 forms]): Likewise.
> (vec_extract_lo_ [AVX512 forms]): Likewise.
> (vec_extract_hi__mask [AVX512 forms]): Likewise.
> (vec_extract_hi_ [AVX512 forms]): Likewise.
> (@vec_extract_lo_ [AVX512 forms]): Likewise.
> (@vec_extract_hi_ [AVX512 forms]): Likewise.
> (vec_extract_lo_v64qi): Likewise.
> (vec_extract_hi_v64qi): Likewise.
> (*vec_widen_umult_even_v16si): Likewise.
> (*vec_widen_smult_even_v16si): Likewise.
> (*avx512f_3): Likewise.
> (*vec_extractv4ti): Likewise.
> (avx512bw_v32qiv32hi2): Likewise.
> (avx512dq_broadcast_1): Likewise.
> Add "length_immediate".
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -4030,7 +4030,6 @@
> vpcmpeq\t{%2, %1, 
> %0|%0, %1, %2}
> vptestnm\t{%1, %1, 
> %0|%0, %1, %1}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
> @@ -4128,7 +4127,6 @@
> vpcmpeq\t{%2, %1, 
> %0|%0, %1, %2}
> vptestnm\t{%1, %1, 
> %0|%0, %1, %1}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
> @@ -11487,7 +11485,6 @@
>return "vextract64x2\t{%2, %1, %0%{%5%}%N4|%0%{%5%}%N4, %1, 
> %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11506,7 +11503,6 @@
>return "vextract64x2\t{%2, %1, %0|%0, %1, %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11554,7 +11550,6 @@
>return "vextract32x4\t{%2, %1, %0%{%7%}%N6|%0%{%7%}%N6, %1, 
> %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11577,7 +11572,6 @@
>return "vextract32x4\t{%2, %1, %0|%0, %1, %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11671,7 +11665,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract64x4\t{$0x0, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x0}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "memory" "none,store")
> (set_attr "prefix" "evex")
> @@ -11691,7 +11684,6 @@
>  return "#";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "memory" "none,store,load")
> (set_attr "prefix" "evex")
> @@ -11710,7 +11702,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract64x4\t{$0x1, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11724,7 +11715,6 @@
>"TARGET_AVX512F"
>"vextract64x4\t{$0x1, %1, %0|%0, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11744,7 +11734,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract32x8\t{$0x1, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11762,7 +11751,6 @@
> vextract32x8\t{$0x1, %1, %0|%0, %1, 0x1}
> vextracti64x4\t{$0x1, %1, %0|%0, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "isa" "avx512dq,noavx512dq")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> @@ -11850,7 +11838,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract32x8\t{$0x0, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x0}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr 

Re: [PATCH 05/10] x86: replace/correct bogus "prefix_extra"

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the rdrand and rdseed cases "prefix_0f" is meant instead. For
> mmx_floatv2siv2sf2 1 is correct only for the first alternative. For
> the integer min/max cases 1 uniformly applies to legacy and VEX
> encodings (the UB and SW variants are dealt with separately anyway).
> Same for {,V}MOVNTDQA.
>
> Unlike {,V}PEXTRW, which has two encoding forms, {,V}PINSRW only has
> a single form in 0f space. (In *vec_extract note that the
> dropped part of the condition also referenced non-existing alternative
> 2.)
>
> Of the integer compare insns, only the 64-bit element forms are encoded
> in 0f38 space.
Ok.
>
> gcc/
>
> * config/i386/i386.md (@rdrand): Add "prefix_0f". Drop
> "prefix_extra".
> (@rdseed): Likewise.
> * config/i386/mmx.md (3 [smaxmin and umaxmin cases]):
> Adjust "prefix_extra".
> * config/i386/sse.md (@vec_set_0): Likewise.
> (*sse4_1_3): Likewise.
> (*avx2_eq3): Likewise.
> (avx2_gt3): Likewise.
> (_pinsr): Likewise.
> (*vec_extract): Likewise.
> (_movntdqa): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -25943,7 +25943,7 @@
>"TARGET_RDRND"
>"rdrand\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "1")])
> +   (set_attr "prefix_0f" "1")])
>
>  (define_insn "@rdseed"
>[(set (match_operand:SWI248 0 "register_operand" "=r")
> @@ -25953,7 +25953,7 @@
>"TARGET_RDSEED"
>"rdseed\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "1")])
> +   (set_attr "prefix_0f" "1")])
>
>  (define_expand "pause"
>[(set (match_dup 0)
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2483,7 +2483,7 @@
> vp\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2532,7 +2532,7 @@
> vpb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2561,7 +2561,7 @@
> vp\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2623,7 +2623,7 @@
> vpw\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -11064,7 +11064,7 @@
>(const_string "1")
>(const_string "*")))
> (set (attr "prefix_extra")
> - (if_then_else (eq_attr "alternative" "5,6,7,8,9")
> + (if_then_else (eq_attr "alternative" "5,6,9")
>(const_string "1")
>(const_string "*")))
> (set (attr "length_immediate")
> @@ -16779,7 +16779,7 @@
> vp\t{%2, %1, 
> %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -16813,7 +16813,10 @@
>"TARGET_AVX2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>"vpcmpeq\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> +   (set (attr "prefix_extra")
> + (if_then_else (eq (const_string "mode") (const_string "V4DImode"))
> +  (const_string "1")
> +  (const_string "*")))
> (set_attr "prefix" "vex")
> (set_attr "mode" "OI")])
>
> @@ -17048,7 +17051,10 @@
>"TARGET_AVX2"
>"vpcmpgt\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> +   (set (attr "prefix_extra")
> + (if_then_else (eq (const_string "mode") (const_string "V4DImode"))
> +  (const_string "1")
> +  (const_string "*")))
> (set_attr "prefix" "vex")
> (set_attr "mode" "OI")])
>
> @@ -18843,7 +18849,7 @@
> (const_string "*")))
> (set (attr "prefix_extra")
>   (if_then_else
> -   (and (not (match_test "TARGET_AVX"))
> +   (ior (eq_attr "prefix" "evex")
> (match_test "GET_MODE_NUNITS (mode) == 8"))
> (const_string "*")
> (const_string "1")))
> @@ -20004,8 +20010,7 @@
> (set_attr "prefix_data16" "1")
> (set (attr "prefix_extra")
>   (if_then_else
> -   (and (eq_attr "alternative" "0,2")
> -   

Re: [PATCH 03/10] x86: "ssemuladd" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:11 PM Jan Beulich via Gcc-patches
 wrote:
>
> They're all VEX3- (also covering XOP) or EVEX-encoded. Express that in
> the default calculation of "prefix". FMA4 insns also all have a 1-byte
> immediate operand.
>
> Where the default calculation is not sufficient / applicable, add
> explicit "prefix" attributes. While there also add a "mode" attribute to
> fma___pair.
Ok.
>
> gcc/
>
> * config/i386/i386.md (isa): Move up.
> (length_immediate): Handle "fma4".
> (prefix): Handle "ssemuladd".
> * config/i386/sse.md (*fma_fmadd_): Add "prefix" attribute.
> (fma_fmadd_):
> Likewise.
> (_fmadd__mask): Likewise.
> (_fmadd__mask3): Likewise.
> (fma_fmsub_):
> Likewise.
> (_fmsub__mask): Likewise.
> (_fmsub__mask3): Likewise.
> (*fma_fnmadd_): Likewise.
> (fma_fnmadd_):
> Likewise.
> (_fnmadd__mask): Likewise.
> (_fnmadd__mask3): Likewise.
> (fma_fnmsub_):
> Likewise.
> (_fnmsub__mask): Likewise.
> (_fnmsub__mask3): Likewise.
> (fma_fmaddsub_):
> Likewise.
> (_fmaddsub__mask): Likewise.
> (_fmaddsub__mask3): Likewise.
> (fma_fmsubadd_):
> Likewise.
> (_fmsubadd__mask): Likewise.
> (_fmsubadd__mask3): Likewise.
> (*fmai_fmadd_): Likewise.
> (*fmai_fmsub_): Likewise.
> (*fmai_fnmadd_): Likewise.
> (*fmai_fnmsub_): Likewise.
> (avx512f_vmfmadd__mask): Likewise.
> (avx512f_vmfmadd__mask3): Likewise.
> (avx512f_vmfmadd__maskz_1): Likewise.
> (*avx512f_vmfmsub__mask): Likewise.
> (avx512f_vmfmsub__mask3): Likewise.
> (*avx512f_vmfmsub__maskz_1): Likewise.
> (avx512f_vmfnmadd__mask): Likewise.
> (avx512f_vmfnmadd__mask3): Likewise.
> (avx512f_vmfnmadd__maskz_1): Likewise.
> (*avx512f_vmfnmsub__mask): Likewise.
> (*avx512f_vmfnmsub__mask3): Likewise.
> (*avx512f_vmfnmsub__maskz_1): Likewise.
> (*fma4i_vmfmadd_): Likewise.
> (*fma4i_vmfmsub_): Likewise.
> (*fma4i_vmfnmadd_): Likewise.
> (*fma4i_vmfnmsub_): Likewise.
> (fma__): Likewise.
> (___mask): Likewise.
> 
> (avx512fp16_fma_sh_v8hf):
> Likewise.
> (avx512fp16_sh_v8hf_mask): Likewise.
> (xop_p): Likewise.
> (xop_pdql): Likewise.
> (xop_pdqh): Likewise.
> (xop_pwd): Likewise.
> (xop_pwd): Likewise.
> (fma___pair): Likewise. Add "mode" attribute.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -531,12 +531,23 @@
>(const_string "unknown")]
>  (const_string "integer")))
>
> +;; Used to control the "enabled" attribute on a per-instruction basis.
> +(define_attr "isa" "base,x64,nox64,x64_sse2,x64_sse4,x64_sse4_noavx,
> +   x64_avx,x64_avx512bw,x64_avx512dq,aes,
> +   sse_noavx,sse2,sse2_noavx,sse3,sse3_noavx,sse4,sse4_noavx,
> +   avx,noavx,avx2,noavx2,bmi,bmi2,fma4,fma,avx512f,noavx512f,
> +   avx512bw,noavx512bw,avx512dq,noavx512dq,fma_or_avx512vl,
> +   
> avx512vl,noavx512vl,avxvnni,avx512vnnivl,avx512fp16,avxifma,
> +   avx512ifmavl,avxneconvert,avx512bf16vl,vpclmulqdqvl"
> +  (const_string "base"))
> +
>  ;; The (bounding maximum) length of an instruction immediate.
>  (define_attr "length_immediate" ""
>(cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
>   bitmanip,imulx,msklog,mskmov")
>(const_int 0)
> -(eq_attr "type" "sse4arg")
> +(ior (eq_attr "type" "sse4arg")
> + (eq_attr "isa" "fma4"))
>(const_int 1)
>  (eq_attr "unit" "i387,sse,mmx")
>(const_int 0)
> @@ -637,6 +648,10 @@
> (const_string "vex")
>   (eq_attr "mode" "XI,V16SF,V8DF")
> (const_string "evex")
> +(eq_attr "type" "ssemuladd")
> +  (if_then_else (eq_attr "isa" "fma4")
> +(const_string "vex")
> +(const_string "maybe_evex"))
>  (eq_attr "type" "sse4arg")
>(const_string "vex")
>  ]
> @@ -842,16 +857,6 @@
>  ;; Define attribute to indicate unaligned ssemov insns
>  (define_attr "movu" "0,1" (const_string "0"))
>
> -;; Used to control the "enabled" attribute on a per-instruction basis.
> -(define_attr "isa" "base,x64,nox64,x64_sse2,x64_sse4,x64_sse4_noavx,
> -   x64_avx,x64_avx512bw,x64_avx512dq,aes,
> -   sse_noavx,sse2,sse2_noavx,sse3,sse3_noavx,sse4,sse4_noavx,
> -   avx,noavx,avx2,noavx2,bmi,bmi2,fma4,fma,avx512f,noavx512f,
> -   avx512bw,noavx512bw,avx512dq,noavx512dq,fma_or_avx512vl,
> -   avx512vl,noavx512vl,avxvnni,avx512vnnivl,avx512fp16,avxifma,
> -

Re: [PATCH 02/10] x86: "sse4arg" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:10 PM Jan Beulich via Gcc-patches
 wrote:
>
> Record common properties in other attributes' default calculations:
> There's always a 1-byte immediate, and they're always encoded in a VEX3-
> like manner (note that "prefix_extra" already evaluates to 1 in this
> case). Then drop now (or already previously) redundant explicit
> attributes, adding "mode" ones where they were missing.
>
> Furthermore use "sse4arg" consistently for all VPCOM* insns; so far
> signed comparisons did use it, while unsigned ones used "ssecmp". Note
> that while (not counting the explicit or implicit immediate operand)
> they really only have 3 operands, the operator is also counted
> in those patterns. That's relevant for establishing the "memory"
> attribute's value, and at the same time benign when there are only
> register operands.
>
> Note that despite also having 4 operands, multiply-add insns aren't
> affected by this change, as they use "ssemuladd" for "type".
Ok. (I'm not very familiar with the encoding of those XOP instructions;
you must understand it better than I do, so I'm just rubber-stamping the
patch.)
>
> gcc/
>
> * config/i386/i386.md (length_immediate): Handle "sse4arg".
> (prefix): Likewise.
> (*xop_pcmov_): Add "mode" attribute.
> * config/i386/mmx.md (*xop_maskcmp3): Drop "prefix_data16",
> "prefix_rep", "prefix_extra", and "length_immediate" attributes.
> (*xop_maskcmp_uns3): Likewise. Switch "type" to "sse4arg".
> (*xop_pcmov_): Add "mode" attribute.
> * config/i386/sse.md (xop_pcmov_): Add "mode"
> attribute.
> (xop_maskcmp3): Drop "prefix_data16", "prefix_rep",
> "prefix_extra", and "length_immediate" attributes.
> (xop_maskcmp_uns3): Likewise. Switch "type" to "sse4arg".
> (xop_maskcmp_uns23): Drop "prefix_data16", "prefix_extra",
> and "length_immediate" attributes. Switch "type" to "sse4arg".
> (xop_pcom_tf3): Likewise.
> (xop_vpermil23): Drop "length_immediate" attribute.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -536,6 +536,8 @@
>(cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
>   bitmanip,imulx,msklog,mskmov")
>(const_int 0)
> +(eq_attr "type" "sse4arg")
> +  (const_int 1)
>  (eq_attr "unit" "i387,sse,mmx")
>(const_int 0)
>  (eq_attr "type" "alu,alu1,negnot,imovx,ishift,ishiftx,ishift1,
> @@ -635,6 +637,8 @@
> (const_string "vex")
>   (eq_attr "mode" "XI,V16SF,V8DF")
> (const_string "evex")
> +(eq_attr "type" "sse4arg")
> +  (const_string "vex")
>  ]
>  (const_string "orig")))
>
> @@ -23286,7 +23290,8 @@
>   (match_operand:MODEF 3 "register_operand" "x")))]
>"TARGET_XOP"
>"vpcmov\t{%1, %3, %2, %0|%0, %2, %3, %1}"
> -  [(set_attr "type" "sse4arg")])
> +  [(set_attr "type" "sse4arg")
> +   (set_attr "mode" "TI")])
>
>  ;; These versions of the min/max patterns are intentionally ignorant of
>  ;; their behavior wrt -0.0 and NaN (via the commutative operand mark).
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2909,10 +2909,6 @@
>"TARGET_XOP"
>"vpcom%Y1\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr "type" "sse4arg")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp3"
> @@ -2923,10 +2919,6 @@
>"TARGET_XOP"
>"vpcom%Y1\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr "type" "sse4arg")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp_uns3"
> @@ -2936,11 +2928,7 @@
>   (match_operand:MMXMODEI 3 "register_operand" "x")]))]
>"TARGET_XOP"
>"vpcom%Y1u\t{%3, %2, %0|%0, %2, %3}"
> -  [(set_attr "type" "ssecmp")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> +  [(set_attr "type" "sse4arg")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp_uns3"
> @@ -2950,11 +2938,7 @@
>   (match_operand:VI_16_32 3 "register_operand" "x")]))]
>"TARGET_XOP"
>"vpcom%Y1u\t{%3, %2, %0|%0, %2, %3}"
> -  [(set_attr "type" "ssecmp")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> +  [(set_attr "type" "sse4arg")
> (set_attr "mode" "TI")])
>
>  (define_expand "vec_cmp"
> @@ -3144,7 +3128,8 @@
>(match_operand:MMXMODE124 2 "register_operand" "x")))]
>"TARGET_XOP && TARGET_MMX_WITH_SSE"
>"vpcmov\t{%3, %2, %1, %0|%0, %1, %2, %3}"
> -  [(set_attr "type" "sse4arg")])
> +  

Re: [PATCH 01/10] x86: "prefix_extra" tidying

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Thu, Aug 3, 2023 at 4:10 PM Jan Beulich via Gcc-patches
 wrote:
>
> Drop SSE5 leftovers from both its comment and its default calculation.
> A value of 2 simply cannot occur anymore. Instead extend the comment to
> mention the use of the attribute in "length_vex", clarifying why
> "prefix_extra" can actually be meaningful on VEX-encoded insns despite
> those not having any real prefixes except possibly segment overrides.
>
Ok.
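
For readers less familiar with the VEX2/VEX3 distinction the new comment
refers to, here is a minimal sketch in C. It is not GCC's "length_vex"
code; the function name and the enum are made up for illustration. It only
models the encoding rule that the 2-byte VEX prefix can express nothing but
opcode map 0F with W, X and B all clear, which is why "prefix_extra"
evaluating to zero signals that VEX2 may be usable. Note that it counts the
prefix bytes only, not the opcode byte.

```c
/* Hypothetical names, for illustration only.  The enumerator values match
   the VEX.mmmmm opcode-map field.  */
enum opcode_map { MAP_0F = 1, MAP_0F38 = 2, MAP_0F3A = 3 };

/* Number of bytes taken by the VEX prefix itself.  The 2-byte form (C5)
   has no map field and no W/X/B bits, so it is only available when the
   insn lives in map 0F and needs none of those bits.  */
int vex_prefix_bytes(enum opcode_map map, int w, int x, int b)
{
    if (map == MAP_0F && !w && !x && !b)
        return 2;   /* VEX2: C5 xx */
    return 3;       /* VEX3: C4 xx xx */
}
```

An insn from map 0F38 or 0F3A (the "prefix_extra" = 1 case) therefore always
needs the 3-byte form, regardless of its operands.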
> gcc/
>
> * config/i386/i386.md (prefix_extra): Correct comment. Fold
> cases yielding 2 into ones yielding 1.
> ---
> I question the 3DNow! aspect here: There's no extra prefix there. It's
> an immediate instead which "sub-divides" major opcode 0f0f.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -620,13 +620,11 @@
> (const_int 0)))
>
>  ;; There are also additional prefixes in 3DNOW, SSSE3.
> -;; ssemuladd,sse4arg default to 0f24/0f25 and DREX byte,
> -;; sseiadd1,ssecvt1 to 0f7a with no DREX byte.
>  ;; 3DNOW has 0f0f prefix, SSSE3 and SSE4_{1,2} 0f38/0f3a.
> +;; While generally inapplicable to VEX/XOP/EVEX encodings, "length_vex" uses
> +;; the attribute evaluating to zero to know that VEX2 encoding may be usable.
>  (define_attr "prefix_extra" ""
> -  (cond [(eq_attr "type" "ssemuladd,sse4arg")
> -  (const_int 2)
> -(eq_attr "type" "sseiadd1,ssecvt1")
> +  (cond [(eq_attr "type" "ssemuladd,sse4arg,sseiadd1,ssecvt1")
>(const_int 1)
> ]
> (const_int 0)))
>


-- 
BR,
Hongtao


Re: [PATCH] Replace invariant ternlog operands

2023-08-03 Thread Hongtao Liu via Gcc-patches
On Fri, Aug 4, 2023 at 1:30 AM Alexander Monakov  wrote:
>
>
> On Thu, 27 Jul 2023, Liu, Hongtao via Gcc-patches wrote:
>
> > > +;; If the first and the second operands of ternlog are invariant and ;;
> > > +the third operand is memory ;; then we should add load third operand
> > > +from memory to register and ;; replace first and second operands with
> > > +this register (define_split
> > > +  [(set (match_operand:V 0 "register_operand")
> > > +   (unspec:V
> > > + [(match_operand:V 1 "register_operand")
> > > +  (match_operand:V 2 "register_operand")
> > > +  (match_operand:V 3 "memory_operand")
> > > +  (match_operand:SI 4 "const_0_to_255_operand")]
> > > + UNSPEC_VTERNLOG))]
> > > +  "ternlog_invariant_operand_mask (operands) == 3 && !reload_completed"
> > > Maybe better with "!reload_completed && ternlog_invariant_operand_mask (operands) == 3"
>
> I made this change (in both places), plus some style TLC. Ok to apply?
Ok.
>
> From d24304a9efd049e8db6df5ac78de8ca2d941a3c7 Mon Sep 17 00:00:00 2001
> From: Yan Simonaytes 
> Date: Tue, 25 Jul 2023 20:43:19 +0300
> Subject: [PATCH] Eliminate irrelevant operands of VPTERNLOG
>
> As mentioned in PR 110202, GCC may be presented with input where control
> word of the VPTERNLOG intrinsic implies that some of its operands do not
> affect the result.  In that case, we can eliminate irrelevant operands
> of the instruction by substituting any other operand in their place.
> This removes false dependencies.
>
> For instance, instead of (252 = 0xfc = _MM_TERNLOG_A | _MM_TERNLOG_B)
>
> vpternlogq  $252, %zmm2, %zmm1, %zmm0
>
> emit
>
> vpternlogq  $252, %zmm0, %zmm1, %zmm0
>
> When VPTERNLOG is invariant w.r.t first and second operands, and the
> third operand is memory, load memory into the output operand first, i.e.
> instead of (85 = 0x55 = ~_MM_TERNLOG_C)
>
> vpternlogq  $85, (%rdi), %zmm1, %zmm0
>
> emit
>
> vmovdqa64   (%rdi), %zmm0
> vpternlogq  $85, %zmm0, %zmm0, %zmm0
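
The two immediates above can be checked outside the compiler. The following
is a standalone sketch, not part of the patch: it copies the three bit tests
from vpternlog_irrelevant_operand_mask but takes the immediate as a plain
int, and compares them against a brute-force walk over the 8-entry truth
table. The names ternlog_bit and irrelevant_operand_mask_bruteforce are
made up for this illustration.

```c
/* Truth-table lookup: the result bit of VPTERNLOG for input bits a, b, c
   is bit ((a << 2) | (b << 1) | c) of the immediate.  */
static int ternlog_bit(int imm8, int a, int b, int c)
{
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1;
}

/* Same bit tests as in the quoted patch, on a plain integer.  Bit 0 of the
   result means the first operand is irrelevant, bit 1 the second, bit 2
   the third.  */
int irrelevant_operand_mask(int imm8)
{
    int mask = 0;

    if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
        mask |= 1;
    if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
        mask |= 2;
    if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
        mask |= 4;

    return mask;
}

/* Reference version: an operand is irrelevant if flipping it never changes
   the result, for any value of the other two operands.  */
int irrelevant_operand_mask_bruteforce(int imm8)
{
    int mask = 7;

    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            for (int c = 0; c < 2; c++) {
                int r = ternlog_bit(imm8, a, b, c);
                if (r != ternlog_bit(imm8, !a, b, c))
                    mask &= ~1;
                if (r != ternlog_bit(imm8, a, !b, c))
                    mask &= ~2;
                if (r != ternlog_bit(imm8, a, b, !c))
                    mask &= ~4;
            }

    return mask;
}
```

For imm8 = 252 (A | B) only the third operand drops out (mask 4), and for
imm8 = 85 (~C) the first and second operands drop out (mask 3), which is the
"== 3" condition the new define_split tests for.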
>
> gcc/ChangeLog:
>
> * config/i386/i386-protos.h (vpternlog_irrelevant_operand_mask):
> Declare.
> (substitute_vpternlog_operands): Declare.
> * config/i386/i386.cc (vpternlog_irrelevant_operand_mask): New
> helper.
> (substitute_vpternlog_operands): New function.  Use them...
> * config/i386/sse.md: ... here in new VPTERNLOG define_splits.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/invariant-ternlog-1.c: New test.
> * gcc.target/i386/invariant-ternlog-2.c: New test.
> ---
>  gcc/config/i386/i386-protos.h |  3 ++
>  gcc/config/i386/i386.cc   | 43 +++
>  gcc/config/i386/sse.md| 42 ++
>  .../gcc.target/i386/invariant-ternlog-1.c | 21 +
>  .../gcc.target/i386/invariant-ternlog-2.c | 12 ++
>  5 files changed, 121 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-2.c
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 27fe73ca65..12e6ff0ebc 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -70,6 +70,9 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
>  extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
>  extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
>
> +extern int vpternlog_irrelevant_operand_mask (rtx[]);
> +extern void substitute_vpternlog_operands (rtx[]);
> +
>  extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
>  extern bool ix86_expand_set_or_cpymem (rtx, rtx, rtx, rtx, rtx, rtx,
>rtx, rtx, rtx, rtx, bool);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 32851a514a..9a7c1135a0 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -19420,6 +19420,49 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
>return mask + 1;
>  }
>
> +/* Return a mask of VPTERNLOG operands that do not affect output.  */
> +
> +int
> +vpternlog_irrelevant_operand_mask (rtx *operands)
> +{
> +  int mask = 0;
> +  int imm8 = XINT (operands[4], 0);
> +
> +  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
> +mask |= 1;
> +  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
> +mask |= 2;
> +  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
> +mask |= 4;
> +
> +  return mask;
> +}
> +
> +/* Eliminate false dependencies on operands that do not affect output
> +   by substituting other operands of a VPTERNLOG.  */
> +
> +void
> +substitute_vpternlog_operands (rtx *operands)
> +{
> +  int mask = vpternlog_irrelevant_operand_mask (operands);
> +
> +  if (mask & 1) /* The first operand is irrelevant.  */
> +operands[1] = operands[2];
> +
> +  if (mask & 2) /* The second operand is irrelevant.  */
> +operands[2] = operands[1];
> +
> +  

Re: [r14-2834 Regression] FAIL: gcc.target/i386/pr87007-5.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 1 on Linux/x86_64

2023-07-31 Thread Hongtao Liu via Gcc-patches
On Sat, Jul 29, 2023 at 11:55 AM haochen.jiang via Gcc-regression
 wrote:
>
> On Linux/x86_64,
>
> b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit
> commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb
> Author: Jan Hubicka 
> Date:   Fri Jul 28 09:16:09 2023 +0200
>
> loop-split improvements, part 1
>
> caused
>
> FAIL: gcc.target/i386/pr87007-4.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 1
> FAIL: gcc.target/i386/pr87007-5.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 1
>
> with GCC configured with
I'll adjust the testcase for this one.
Now we have
vpbroadcastd %ecx, %xmm0
vpaddd .LC3(%rip), %xmm0, %xmm0
vpextrd $3, %xmm0, %eax
vmovddup %xmm3, %xmm0
vrndscalepd $9, %xmm0, %xmm0
vunpckhpd %xmm0, %xmm0, %xmm3

For vrndscalepd, there is no need to insert pxor, since it reuses the
input operand xmm0, which is loaded from memory.
>
> ../../gcc/configure --prefix=/export/users/haochenj/src/gcc-bisect/master/master/r14-2834/usr --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl --enable-libmpx x86_64-linux --disable-bootstrap
>
> To reproduce:
>
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m32}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m32\ -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m64}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m64\ -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m32}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m32\ -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m64}'"
> $ cd {build_dir}/gcc && make check RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m64\ -march=cascadelake}'"
>
> (Please do not reply to this email; for questions about this report,
> contact me at haochen dot jiang at intel.com.)
> (If you run into cascadelake-related problems, disabling AVX512F on the
> command line might help.)
> (However, please make sure that there are no potential problems with AVX512.)



-- 
BR,
Hongtao


Re: [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX

2023-07-20 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 20, 2023 at 4:11 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Jul 20, 2023 at 9:35 AM liuhongt  wrote:
> >
> > On Intel processors with AVX, vmovdqu is as fast as vlddqu, so
> > UNSPEC_LDDQU can be removed to enable more optimizations.
> > Can someone confirm this with AMD folks?
> > If AMD doesn't like such an optimization, I'll put it under
> > micro-architecture tuning.
>
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart and
> change the instruction behind the programmer's back. The caveats are
> also explained at length in the ISA manual.
Fine.
>
> Uros.
>
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > If AMD also likes such an optimization, OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/sse.md (_lddqu): Change to
> > define_expand, expand as simple move when TARGET_AVX
> > && ( == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
> > The original define_insn is renamed to
> > ..
> > (_lddqu): .. this.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/vlddqu_vinserti128.c: New test.
> > ---
> >  gcc/config/i386/sse.md| 15 ++-
> >  .../gcc.target/i386/vlddqu_vinserti128.c  | 11 +++
> >  2 files changed, 25 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 2d81347c7b6..d571a78f4c4 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1835,7 +1835,20 @@ (define_peephole2
> >[(set (match_dup 4) (match_dup 1))]
> >"operands[4] = adjust_address (operands[0], V2DFmode, 0);")
> >
> > -(define_insn "_lddqu"
> > +(define_expand "_lddqu"
> > +  [(set (match_operand:VI1 0 "register_operand")
> > +   (unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
> > +   UNSPEC_LDDQU))]
> > +  "TARGET_SSE3"
> > +{
> > +  if (TARGET_AVX && ( == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
> > +{
> > +  emit_move_insn (operands[0], operands[1]);
> > +  DONE;
> > +}
> > +})
> > +
> > +(define_insn "*_lddqu"
> >[(set (match_operand:VI1 0 "register_operand" "=x")
> > (unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
> > UNSPEC_LDDQU))]
> > diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > new file mode 100644
> > index 000..29699a5fa7f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx2 -O2" } */
> > +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> > +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> > +
> > +#include <immintrin.h>
> > +__m256i foo(void *data) {
> > +__m128i X1 = _mm_lddqu_si128((__m128i*)data);
> > +__m256i V1 = _mm256_broadcastsi128_si256 (X1);
> > +return V1;
> > +}
> > --
> > 2.39.1.388.g2fc9e9ca3c
> >



-- 
BR,
Hongtao


Re: [PATCH V2] Provide -fcf-protection=branch,return.

2023-07-19 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 12, 2023 at 3:27 PM Hongtao Liu  wrote:
>
> ping.
>
> On Mon, May 22, 2023 at 4:08 PM Hongtao Liu  wrote:
> >
> > ping.
> >
> > On Sat, May 13, 2023 at 5:20 PM liuhongt  wrote:
> > >
> > > > I think this could be simplified if you use either EnumSet or
> > > > EnumBitSet instead in common.opt for `-fcf-protection=`.
> > >
> > > Use EnumSet instead of EnumBitSet, since CF_FULL is not a power of 2.
> > > Classifying the values into sets is a bit tricky: cf_branch and
> > > cf_return should be in different sets, but both "conflict" with
> > > cf_full and cf_none, and the current EnumSet doesn't handle this well.
> > >
> > > So in the current implementation, only cf_full and cf_none are exclusive
> > > to each other, but each can be combined with any of cf_branch,
> > > cf_return, and cf_check. It's not perfect, but still an improvement
> > > over the original.
> > >
I'm going to commit this patch if there are no objections; it's just a
refactoring of the -fcf-protection= option.
If any regression is observed, I will fix it (or revert the patch).
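
To make the set classification above concrete, here is a rough standalone
model in C of how a comma-separated argument would be combined under the
Set() numbers used in the patch below. It is only a sketch of the intended
semantics, not the option-handling code GCC generates: the function name
parse_cf_protection and the table are hypothetical, and the numeric CF_*
values are an assumption about enum cf_protection_level (only the relation
CF_FULL == CF_BRANCH | CF_RETURN matters here).

```c
#include <string.h>

/* Assumed values; the real enum lives in GCC's flag-types.h.  */
enum cf_protection_level {
    CF_NONE   = 0,
    CF_BRANCH = 1 << 0,
    CF_RETURN = 1 << 1,
    CF_FULL   = CF_BRANCH | CF_RETURN,
    CF_CHECK  = 1 << 3
};

struct cf_value {
    const char *name;
    int value;
    int set;    /* Set(N) from common.opt; same set = mutually exclusive */
};

static const struct cf_value cf_values[] = {
    { "full",   CF_FULL,   1 },
    { "branch", CF_BRANCH, 2 },
    { "return", CF_RETURN, 3 },
    { "check",  CF_CHECK,  4 },
    { "none",   CF_NONE,   1 },
};

/* OR together the values named in a comma-separated list.  Returns -1 for
   an unknown name or when two names come from the same set.  */
int parse_cf_protection(const char *arg)
{
    int result = 0;
    unsigned seen_sets = 0;
    const size_t n = sizeof cf_values / sizeof cf_values[0];

    for (;;) {
        size_t len = strcspn(arg, ",");
        const struct cf_value *match = NULL;

        for (size_t i = 0; i < n; i++)
            if (strlen(cf_values[i].name) == len
                && strncmp(arg, cf_values[i].name, len) == 0)
                match = &cf_values[i];

        if (match == NULL || (seen_sets & (1u << match->set)))
            return -1;

        seen_sets |= 1u << match->set;
        result |= match->value;

        if (arg[len] == '\0')
            break;
        arg += len + 1;
    }
    return result;
}
```

In this model "branch,return" yields the same value as "full", while
"full,none" is rejected because both names sit in set 1. "full,branch" is
accepted, which is the imperfection mentioned in the quoted text.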
> > > gcc/ChangeLog:
> > >
> > > * common.opt: (fcf-protection=): Add EnumSet attribute to
> > > support combination of params.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * c-c++-common/fcf-protection-10.c: New test.
> > > * c-c++-common/fcf-protection-11.c: New test.
> > > * c-c++-common/fcf-protection-12.c: New test.
> > > * c-c++-common/fcf-protection-8.c: New test.
> > > * c-c++-common/fcf-protection-9.c: New test.
> > > * gcc.target/i386/pr89701-1.c: New test.
> > > * gcc.target/i386/pr89701-2.c: New test.
> > > * gcc.target/i386/pr89701-3.c: New test.
> > > ---
> > >  gcc/common.opt | 12 ++--
> > >  gcc/testsuite/c-c++-common/fcf-protection-10.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-11.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-12.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-8.c  |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-9.c  |  2 ++
> > >  gcc/testsuite/gcc.target/i386/pr89701-1.c  |  4 
> > >  gcc/testsuite/gcc.target/i386/pr89701-2.c  |  4 
> > >  gcc/testsuite/gcc.target/i386/pr89701-3.c  |  4 
> > >  9 files changed, 28 insertions(+), 6 deletions(-)
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-10.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-11.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-12.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-8.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-9.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-3.c
> > >
> > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > index a28ca13385a..02f2472959a 100644
> > > --- a/gcc/common.opt
> > > +++ b/gcc/common.opt
> > > @@ -1886,7 +1886,7 @@ fcf-protection
> > >  Common RejectNegative Alias(fcf-protection=,full)
> > >
> > >  fcf-protection=
> > > > -Common Joined RejectNegative Enum(cf_protection_level) Var(flag_cf_protection) Init(CF_NONE)
> > > > +Common Joined RejectNegative Enum(cf_protection_level) EnumSet Var(flag_cf_protection) Init(CF_NONE)
> > > >  -fcf-protection=[full|branch|return|none|check]	Instrument functions with checks to verify jump/call/return control-flow transfer
> > >  instructions have valid targets.
> > >
> > > @@ -1894,19 +1894,19 @@ Enum
> > > >  Name(cf_protection_level) Type(enum cf_protection_level) UnknownError(unknown Control-Flow Protection Level %qs)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(full) Value(CF_FULL)
> > > +Enum(cf_protection_level) String(full) Value(CF_FULL) Set(1)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(branch) Value(CF_BRANCH)
> > > +Enum(cf_protection_level) String(branch) Value(CF_BRANCH) Set(2)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(return) Value(CF_RETURN)
> > > +Enum(cf_protection_level) String(return) Value(CF_RETURN) Set(3)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(check) Value(CF_CHECK)
> > > +Enum(cf_protection_level) String(check) Value(CF_CHECK) Set(4)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(none) Value(CF_NONE)
> > > +Enum(cf_protection_level) String(none) Value(CF_NONE) Set(1)
> > >
> > >  finstrument-functions
> > >  Common Var(flag_instrument_function_entry_exit,1)
> > > > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-10.c b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > > new file mode 100644
> > > index 000..b271d134e52
> > > --- /dev/null
> > > +++ b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > > @@ -0,0 +1,2 @@
> > > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > > +/* { 

Re: [PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-07-18 Thread Hongtao Liu via Gcc-patches
On Mon, Jul 17, 2023 at 7:38 PM Uros Bizjak  wrote:
>
> On Mon, Jul 17, 2023 at 10:28 AM Hongtao Liu  wrote:
> >
> > I'd like to ping for this patch (only patch 1/2, for patch 2/2, I
> > think that may not be necessary).
> >
> > On Mon, May 15, 2023 at 9:20 AM Hongtao Liu  wrote:
> > >
> > > ping.
> > >
> > > On Fri, Apr 21, 2023 at 9:55 PM liuhongt  wrote:
> > > >
> > > > > > +  if (!TARGET_SSE2)
> > > > > > +{
> > > > > > +  if (c_dialect_cxx ()
> > > > > > +   && cxx_dialect > cxx20)
> > > > >
> > > > > Formatting, both conditions are short, so just put them on one line.
> > > > Changed.
> > > >
> > > > > But for the C++23 macros, more importantly I think we really should
> > > > > also in ix86_target_macros_internal add
> > > > >   if (c_dialect_cxx ()
> > > > >   && cxx_dialect > cxx20
> > > > >   && (isa_flag & OPTION_MASK_ISA_SSE2))
> > > > > {
> > > > >   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> > > > >   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> > > > > }
> > > > > > plus associated libstdc++ changes.  It can be done incrementally though.
> > > > Added in PATCH 2/2
> > > >
> > > > > > +  if (flag_building_libgcc)
> > > > > > + {
> > > > > > > +   /* libbid uses __LIBGCC_HAS_HF_MODE__ and __LIBGCC_HAS_BF_MODE__
> > > > > > +  to check backend support of _Float16 and __bf16 type.  */
> > > > >
> > > > > > That is actually the case only for HFmode, but not for BFmode right now.
> > > > > So, we need further work.  One is to add the BFmode support in there,
> > > > > and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> > > > > _Decimal* conversions are compiled in also if not -msse2 by default.
> > > > > One way to do that is wrap the HF and BF mode related functions on x86
> > > > > #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> > > > > perhaps we don't need to undef this stuff here), another is not 
> > > > > provide
> > > > > the hf/bf support in that case from the TUs where they are provided 
> > > > > now,
> > > > > but from a different one which would be compiled with -msse2.
> > > > Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, 
> > > > just like
> > > > we did before for HFtype softfp. Then no need to undef libgcc macros.
> > > >
> > > > > > >/* We allowed the user to turn off SSE for kernel mode.  Don't crash if
> > > > > > >   some less clueful developer tries to use floating-point anyway.  */
> > > > > > -  if (needed_sseregs && !TARGET_SSE)
> > > > > > +  if (needed_sseregs
> > > > > > +  && (!TARGET_SSE
> > > > > > +   || (VALID_SSE2_TYPE_MODE (mode)
> > > > > > +   && !TARGET_SSE2)))
> > > > >
> > > > > Formatting, no need to split this up that much.
> > > > >   if (needed_sseregs
> > > > >   && (!TARGET_SSE
> > > > >   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > > > or even better
> > > > >   if (needed_sseregs
> > > > > >   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > > > will do it.
> > > > Changed.
> > > >
> > > > > Instead of this, just use
> > > > >   if (!float16_type_node)
> > > > > {
> > > > >   float16_type_node = ix86_float16_type_node;
> > > > >   callback (float16_type_node);
> > > > >   float16_type_node = NULL_TREE;
> > > > > }
> > > > >   if (!bfloat16_type_node)
> > > > > {
> > > > >   bfloat16_type_node = ix86_bf16_type_node;
> > > > >   callback (bfloat16_type_node);
> > > > >   bfloat16_type_node = NULL_TREE;
> > > > > }
> > > > Changed.
> > > >
> > > >
> > > > > > +static const char *
> > > > > > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > > > > > +{
> > > > > > +  if (element_mode (fromtype) != element_mode (totype))
> > > > > > +{
> > > > > > > +  /* Do not allow conversions to/from BFmode/HFmode scalar types
> > > > > > +  when TARGET_SSE2 is not available.  */
> > > > > > +  if ((TYPE_MODE (fromtype) == BFmode
> > > > > > +|| TYPE_MODE (fromtype) == HFmode)
> > > > > > +   && !TARGET_SSE2)
> > > > >
> > > > > First of all, not really sure if this should be purely about scalar
> > > > > modes, not also complex and vector modes involving those inner modes.
> > > > > Because complex or vector modes with BF/HF elements will be without
> > > > > > TARGET_SSE2 for sure lowered into scalar code and that can't be handled
> > > > > > either.
> > > > > > So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == BFmode)
> > > > > or even better
> > > > > if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> > > > > ?
> > > > > > Or even better remember the 2 modes above into machine_mode temporaries
> > > > > and just use those in the != comparison and for the checks?
> > > > >
> > > > > Also, I think it is weird to tell user 

Re: [PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-07-17 Thread Hongtao Liu via Gcc-patches
I'd like to ping for this patch (only patch 1/2, for patch 2/2, I
think that may not be necessary).

On Mon, May 15, 2023 at 9:20 AM Hongtao Liu  wrote:
>
> ping.
>
> On Fri, Apr 21, 2023 at 9:55 PM liuhongt  wrote:
> >
> > > > +  if (!TARGET_SSE2)
> > > > +{
> > > > +  if (c_dialect_cxx ()
> > > > +   && cxx_dialect > cxx20)
> > >
> > > Formatting, both conditions are short, so just put them on one line.
> > Changed.
> >
> > > But for the C++23 macros, more importantly I think we really should
> > > also in ix86_target_macros_internal add
> > >   if (c_dialect_cxx ()
> > >   && cxx_dialect > cxx20
> > >   && (isa_flag & OPTION_MASK_ISA_SSE2))
> > > {
> > >   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> > >   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> > > }
> > > plus associated libstdc++ changes.  It can be done incrementally though.
> > Added in PATCH 2/2
> >
> > > > +  if (flag_building_libgcc)
> > > > + {
> > > > +   /* libbid uses __LIBGCC_HAS_HF_MODE__ and __LIBGCC_HAS_BF_MODE__
> > > > +  to check backend support of _Float16 and __bf16 type.  */
> > >
> > > That is actually the case only for HFmode, but not for BFmode right now.
> > > So, we need further work.  One is to add the BFmode support in there,
> > > and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> > > _Decimal* conversions are compiled in also if not -msse2 by default.
> > > One way to do that is wrap the HF and BF mode related functions on x86
> > > #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> > > perhaps we don't need to undef this stuff here), another is not provide
> > > the hf/bf support in that case from the TUs where they are provided now,
> > > but from a different one which would be compiled with -msse2.
> > Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, just like
> > we did before for HFtype softfp. Then no need to undef libgcc macros.
> >
> > > >/* We allowed the user to turn off SSE for kernel mode.  Don't crash if
> > > >   some less clueful developer tries to use floating-point anyway.  */
> > > > -  if (needed_sseregs && !TARGET_SSE)
> > > > +  if (needed_sseregs
> > > > +  && (!TARGET_SSE
> > > > +   || (VALID_SSE2_TYPE_MODE (mode)
> > > > +   && !TARGET_SSE2)))
> > >
> > > Formatting, no need to split this up that much.
> > >   if (needed_sseregs
> > >   && (!TARGET_SSE
> > >   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > or even better
> > >   if (needed_sseregs
> > >   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > will do it.
> > Changed.
> >
> > > Instead of this, just use
> > >   if (!float16_type_node)
> > > {
> > >   float16_type_node = ix86_float16_type_node;
> > >   callback (float16_type_node);
> > >   float16_type_node = NULL_TREE;
> > > }
> > >   if (!bfloat16_type_node)
> > > {
> > >   bfloat16_type_node = ix86_bf16_type_node;
> > >   callback (bfloat16_type_node);
> > >   bfloat16_type_node = NULL_TREE;
> > > }
> > Changed.
> >
> >
> > > > +static const char *
> > > > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > > > +{
> > > > +  if (element_mode (fromtype) != element_mode (totype))
> > > > +{
> > > > +  /* Do not allow conversions to/from BFmode/HFmode scalar types
> > > > +  when TARGET_SSE2 is not available.  */
> > > > +  if ((TYPE_MODE (fromtype) == BFmode
> > > > +|| TYPE_MODE (fromtype) == HFmode)
> > > > +   && !TARGET_SSE2)
> > >
> > > First of all, not really sure if this should be purely about scalar
> > > modes, not also complex and vector modes involving those inner modes.
> > > Because complex or vector modes with BF/HF elements will be without
> > > TARGET_SSE2 for sure lowered into scalar code and that can't be handled
> > > either.
> > > So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == BFmode)
> > > or even better
> > > if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> > > ?
> > > Or even better remember the 2 modes above into machine_mode temporaries
> > > and just use those in the != comparison and for the checks?
> > >
> > > Also, I think it is weird to tell user %<__bf16%> or %<_Float16%> when
> > > we know which one it is.  Just return separate messages?
> > Changed.
> >
> > > > +  /* Reject all single-operand operations on BFmode/HFmode except for &
> > > > + when TARGET_SSE2 is not available.  */
> > > > +  if ((element_mode (type) == BFmode || element_mode (type) == HFmode)
> > > > +  && !TARGET_SSE2 && op != ADDR_EXPR)
> > > > +return N_("operation not permitted on type %<__bf16%> "
> > > > +   "or %<_Float16%> without option %<-msse2%>");
> > >
> > > Similarly.  Also, check !TARGET_SSE2 first as inexpensive one.
> > Changed.
> >
> >
> > Bootstrapped and regtested 

Re: [PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-17 Thread Hongtao Liu via Gcc-patches
Ping.

On Tue, Jul 11, 2023 at 5:16 PM liuhongt via Gcc-patches
 wrote:
>
> Similar to what we did for CMPXCHG, but extended to all
> ix86_comparison_int_operator since CMPCCXADD sets EFLAGS exactly the
> same way as CMP.
>
> When the operand order in the CMP insn is the same as that in CMPCCXADD,
> the CMP insn can be eliminated directly.
>
> When the operand order is swapped in the CMP insn, only optimize
> cmpccxadd + cmpl + jcc/setcc to cmpccxadd + jcc/setcc when FLAGS_REG is dead
> after the jcc/setcc, and adjust the condition code of the jcc/setcc to match.
>
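The swapped-operand case in the description above relies on the fact that
"cmp a, b" followed by a condition is equivalent to "cmp b, a" followed by the
mirrored condition.  A minimal standalone sketch of that rule follows; the enum
and helper names are hypothetical and are not the rtx codes or functions used
in the patch.

```c
#include <stdbool.h>

/* Hypothetical condition codes for this sketch; not GCC's rtx codes.  */
enum cond { C_EQ, C_NE, C_LT, C_LE, C_GT, C_GE };

/* Evaluate "a COND b" for signed operands.  */
static bool
eval_cond (enum cond c, long a, long b)
{
  switch (c)
    {
    case C_EQ: return a == b;
    case C_NE: return a != b;
    case C_LT: return a < b;
    case C_LE: return a <= b;
    case C_GT: return a > b;
    case C_GE: return a >= b;
    }
  return false;
}

/* Condition to use when the two compare operands are exchanged.  */
static enum cond
swap_cond (enum cond c)
{
  switch (c)
    {
    case C_LT: return C_GT;
    case C_LE: return C_GE;
    case C_GT: return C_LT;
    case C_GE: return C_LE;
    default:   return c;	/* EQ and NE are symmetric.  */
    }
}

/* Check the rule exhaustively over a small range of operand values.  */
static bool
swap_rule_holds (void)
{
  for (int c = C_EQ; c <= C_GE; c++)
    for (long a = -2; a <= 2; a++)
      for (long b = -2; b <= 2; b++)
	if (eval_cond ((enum cond) c, a, b)
	    != eval_cond (swap_cond ((enum cond) c), b, a))
	  return false;
  return true;
}
```

Only signed conditions are modelled here; the same mirroring applies to the
unsigned ones.  Because the flags written by the remaining instruction now
describe the swapped comparison, any later reader of FLAGS_REG would see
different values, which is presumably why the description requires FLAGS_REG
to be dead after the jcc/setcc.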
> gcc/ChangeLog:
>
> PR target/110591
> * config/i386/sync.md (cmpccxadd_): Adjust the pattern
> to explicitly set FLAGS_REG like *cmp_1, also add extra
> 3 define_peephole2 after the pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110591.c: New test.
> * gcc.target/i386/pr110591-2.c: New test.
> ---
>  gcc/config/i386/sync.md| 160 -
>  gcc/testsuite/gcc.target/i386/pr110591-2.c |  90 
>  gcc/testsuite/gcc.target/i386/pr110591.c   |  66 +
>  3 files changed, 315 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c
>
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index e1fa1504deb..e84226cf895 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -1093,7 +1093,9 @@ (define_insn "cmpccxadd_"
>   UNSPECV_CMPCCXADD))
> (set (match_dup 1)
> (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> -   (clobber (reg:CC FLAGS_REG))]
> +   (set (reg:CC FLAGS_REG)
> +   (compare:CC (match_dup 1)
> +   (match_dup 2)))]
>"TARGET_CMPCCXADD && TARGET_64BIT"
>  {
>char buf[128];
> @@ -1105,3 +1107,159 @@ (define_insn "cmpccxadd_"
>output_asm_insn (buf, operands);
>return "";
>  })
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && rtx_equal_p (operands[0], operands[5])
> +   && rtx_equal_p (operands[1], operands[6])"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_dup 2)
> + (match_dup 0)
> + (match_dup 3)
> + (match_dup 4)]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (match_dup 7)
> +   (match_op_dup 8
> + [(match_dup 9) (const_int 0)]))])
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))
> +   (set (match_operand:QI 7 "nonimmediate_operand")
> +   (match_operator:QI 8 "ix86_comparison_int_operator"
> + [(reg FLAGS_REG) (const_int 0)]))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && rtx_equal_p (operands[0], operands[6])
> +   && rtx_equal_p (operands[1], operands[5])
> +   && peep2_regno_dead_p (4, FLAGS_REG)"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 

Re: [PATCH] x86: slightly enhance "vec_dupv2df"

2023-07-17 Thread Hongtao Liu via Gcc-patches
On Mon, Jul 17, 2023 at 2:20 PM Jan Beulich  wrote:
>
> On 17.07.2023 08:09, Hongtao Liu wrote:
> > On Fri, Jul 14, 2023 at 5:40 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> Introduce a new alternative permitting all 32 registers to be used as
> >> source without AVX512VL, by broadcasting to the full 512 bits in that
> >> case. (The insn would also permit all registers to be used as
> >> destination, but V2DFmode doesn't.)
> > The patch looks technically ok, but considering we don't have a real
> > CPU with only AVX512F but no AVX512VL, these optimisations for AVX512F
> > only don't make much sense, but rather increase the burden for
> > maintenance.
>
> Well, I can of course ignore this aspect going forward. It seemed
> relevant to me for two reasons: For one, I expect I'm not the only
> one to simply pass -mavx512f when caring about basic AVX512. And
You're not; AFAIK, some users use target ("avx512f") for FMV.  But I'd
rather persuade them to use target ("arch=x86-64-v4") than optimize
for AVX512F only.
> then isn't the Knights line of processors (Xeon Phi) lacking VL?
> (I'm getting the impression though that this line is discontinued
> now.)
KNL is deprecated, and yes it doesn't support AVX512VL.
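For anyone who wants to check whether a given machine is in the "AVX512F but
no AVX512VL" category discussed here, a minimal sketch using the existing
__builtin_cpu_supports builtin is below.  The function name is made up for
illustration, and the fallback value of 0 on non-x86 targets is an assumption
of this sketch.

```c
/* Return 1 if the host CPU reports AVX512F without AVX512VL, else 0.
   avx512f_without_vl is a hypothetical helper name for this sketch.  */
static int
avx512f_without_vl (void)
{
#if (defined (__GNUC__) || defined (__clang__)) \
    && (defined (__x86_64__) || defined (__i386__))
  /* Make sure the CPU model data is initialised before querying it.  */
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx512f")
	 && !__builtin_cpu_supports ("avx512vl");
#else
  /* Not an x86 target: no AVX512 at all.  */
  return 0;
#endif
}
```

On a KNL-class machine this would be expected to return 1, and on
skylake-avx512 or later it would return 0.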
>
> >> Can't the latter two of the original alternatives be folded, by using
> >> Yvm instead of xm/vm?
> > I think yes.
>
> I guess I'll make a follow-on patch for that then.
>
> Jan



-- 
BR,
Hongtao


Re: [PATCH] x86: avoid maybe_gen_...()

2023-07-17 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 14, 2023 at 5:42 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the (however unlikely) event that no insn can be found for the
> requested mode, using maybe_gen_...() without (really) checking its
> result for being a null rtx would lead to silent bad code generation.
Ok.
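To spell out the failure mode, here is a minimal standalone sketch.  It is not
GCC code: every name in it is a hypothetical stand-in, and it assumes an emit
routine that quietly ignores a null insn.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not GCC functions.  */
enum mode { MODE_V4SI, MODE_V2DF, MODE_UNSUPPORTED };

struct insn { const char *name; };

static struct insn set_v4si = { "vec_set_v4si" };
static struct insn set_v2df = { "vec_set_v2df" };

/* "maybe" flavour: returns NULL when no pattern exists for mode M.  */
static struct insn *
maybe_gen_set (enum mode m)
{
  switch (m)
    {
    case MODE_V4SI: return &set_v4si;
    case MODE_V2DF: return &set_v2df;
    default:        return NULL;
    }
}

/* Strict flavour: a missing pattern is a hard failure, not a NULL.  */
static struct insn *
gen_set (enum mode m)
{
  struct insn *insn = maybe_gen_set (m);
  assert (insn != NULL);
  return insn;
}

/* Number of insns actually emitted so far.  */
static int emitted_count;

/* Emit routine that silently drops a NULL insn (assumption of this
   sketch).  */
static void
emit (struct insn *insn)
{
  if (insn != NULL)
    emitted_count++;
}
```

With these definitions, emit (maybe_gen_set (MODE_UNSUPPORTED)) emits nothing
and reports nothing, so the surrounding sequence is silently incomplete, while
emit (gen_set (MODE_UNSUPPORTED)) stops at the assertion.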
>
> gcc/
>
> * config/i386/i386-expand.cc (ix86_expand_vector_init_duplicate):
> Use gen_vec_set_0.
> (ix86_expand_vector_extract): Use gen_vec_extract_lo /
> gen_vec_extract_hi.
> (expand_vec_perm_broadcast_1): Use gen_vec_interleave_high /
> gen_vec_interleave_low. Rename local variable.
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -15456,8 +15456,7 @@ ix86_expand_vector_init_duplicate (bool
> {
>   tmp1 = force_reg (GET_MODE_INNER (mode), val);
>   tmp2 = gen_reg_rtx (mode);
> - emit_insn (maybe_gen_vec_set_0 (mode, tmp2,
> - CONST0_RTX (mode), tmp1));
> + emit_insn (gen_vec_set_0 (mode, tmp2, CONST0_RTX (mode), tmp1));
>   tmp1 = gen_lowpart (mode, tmp2);
> }
>   else
> @@ -17419,9 +17418,9 @@ ix86_expand_vector_extract (bool mmx_ok,
>  ? gen_reg_rtx (V16HFmode)
>  : gen_reg_rtx (V16BFmode));
>   if (elt < 16)
> -   emit_insn (maybe_gen_vec_extract_lo (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_lo (mode, tmp, vec));
>   else
> -   emit_insn (maybe_gen_vec_extract_hi (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_hi (mode, tmp, vec));
>   ix86_expand_vector_extract (false, target, tmp, elt & 15);
>   return;
> }
> @@ -17435,9 +17434,9 @@ ix86_expand_vector_extract (bool mmx_ok,
>  ? gen_reg_rtx (V8HFmode)
>  : gen_reg_rtx (V8BFmode));
>   if (elt < 8)
> -   emit_insn (maybe_gen_vec_extract_lo (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_lo (mode, tmp, vec));
>   else
> -   emit_insn (maybe_gen_vec_extract_hi (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_hi (mode, tmp, vec));
>   ix86_expand_vector_extract (false, target, tmp, elt & 7);
>   return;
> }
> @@ -22501,18 +22500,18 @@ expand_vec_perm_broadcast_1 (struct expa
>if (d->testing_p)
> return true;
>
> -  rtx (*maybe_gen) (machine_mode, int, rtx, rtx, rtx);
> +  rtx (*gen_interleave) (machine_mode, int, rtx, rtx, rtx);
>if (elt >= nelt2)
> {
> - maybe_gen = maybe_gen_vec_interleave_high;
> + gen_interleave = gen_vec_interleave_high;
>   elt -= nelt2;
> }
>else
> -   maybe_gen = maybe_gen_vec_interleave_low;
> +   gen_interleave = gen_vec_interleave_low;
>nelt2 /= 2;
>
>dest = gen_reg_rtx (vmode);
> -  emit_insn (maybe_gen (vmode, 1, dest, op0, op0));
> +  emit_insn (gen_interleave (vmode, 1, dest, op0, op0));
>
>vmode = V4SImode;
>op0 = gen_lowpart (vmode, dest);



-- 
BR,
Hongtao


Re: [PATCH] x86: slightly enhance "vec_dupv2df"

2023-07-17 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 14, 2023 at 5:40 PM Jan Beulich via Gcc-patches
 wrote:
>
> Introduce a new alternative permitting all 32 registers to be used as
> source without AVX512VL, by broadcasting to the full 512 bits in that
> case. (The insn would also permit all registers to be used as
> destination, but V2DFmode doesn't.)
The patch looks technically ok, but considering we don't have a real
CPU with only AVX512F but no AVX512VL, these optimisations for AVX512F
only don't make much sense, but rather increase the burden for
maintenance.
(For now, AVX512VL+AVX512CD+AVX512BW+AVX512DQ has been the base set since
skylake-avx512, so users are more likely to use -march=$PROCESSOR for
AVX512.)
For this (and the previous AVX512F-only) optimisation patches, I think
they help in understanding the pattern, so I'll approve this
patch. But I hope we don't spend too much time on such optimisations
(unless there is an AVX512F-only processor).
>
> gcc/
>
> * config/i386/sse.md (vec_dupv2df): Add new AVX512F
> alternative. Move AVX512VL part of condition to new "enabled"
> attribute.
> ---
> Because of the V2DF restriction, in principle the new source constraint
> could also omit 'm'.
>
> Can't the latter two of the original alternatives be folded, by using
> Yvm instead of xm/vm?
I think yes.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -13761,18 +13761,27 @@
> (set_attr "mode" "DF,DF,V1DF,V1DF,V1DF,V2DF,V1DF,V1DF,V1DF")])
>
>  (define_insn "vec_dupv2df"
> -  [(set (match_operand:V2DF 0 "register_operand" "=x,x,v")
> +  [(set (match_operand:V2DF 0 "register_operand" "=x,x,v,v")
> (vec_duplicate:V2DF
> - (match_operand:DF 1 "nonimmediate_operand" " 0,xm,vm")))]
> -  "TARGET_SSE2 && "
> + (match_operand:DF 1 "nonimmediate_operand" "0,xm,vm,vm")))]
> +  "TARGET_SSE2"
>"@
> unpcklpd\t%0, %0
> %vmovddup\t{%1, %0|%0, %1}
> -   vmovddup\t{%1, %0|%0, %1}"
> -  [(set_attr "isa" "noavx,sse3,avx512vl")
> -   (set_attr "type" "sselog1")
> -   (set_attr "prefix" "orig,maybe_vex,evex")
> -   (set_attr "mode" "V2DF,DF,DF")])
> +   vmovddup\t{%1, %0|%0, %1}
> +   vbroadcastsd\t{%1, }%g0{|, %1}"
> +  [(set_attr "isa" "noavx,sse3,avx512vl,*")
> +   (set_attr "type" "sselog1,ssemov,ssemov,ssemov")
> +   (set_attr "prefix" "orig,maybe_vex,evex,evex")
> +   (set_attr "mode" "V2DF,DF,DF,V8DF")
> +   (set (attr "enabled")
> +   (cond [(eq_attr "alternative" "3")
> +(symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> + && !TARGET_PREFER_AVX256")
> +  (match_test "")
> +(const_string "*")
> + ]
> + (symbol_ref "false")))])
>
>  (define_insn "vec_concatv2df"
>[(set (match_operand:V2DF 0 "register_operand" "=x,x,v,x,x, v,x,x")



-- 
BR,
Hongtao


Re: [PATCH] Initial Lunar Lake, Arrow Lake and Arrow Lake S Support

2023-07-16 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 14, 2023 at 10:55 AM Mo, Zewei via Gcc-patches
 wrote:
>
> Hi all,
>
> This patch is to add initial support for Lunar Lake, Arrow Lake and Arrow Lake
> S for GCC.
>
> The link to the related information is listed below:
> https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> This has been tested on x86_64-pc-linux-gnu. Is this ok for trunk? Thank you.
Ok.
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_intel_cpu): Handle Lunar Lake,
> Arrow Lake and Arrow Lake S.
> * common/config/i386/i386-common.cc:
> (processor_name): Add arrowlake.
> (processor_alias_table): Add arrow lake, arrow lake s and lunar
> lake.
> * common/config/i386/i386-cpuinfo.h (enum processor_subtypes):
> Add INTEL_COREI7_ARROWLAKE and INTEL_COREI7_ARROWLAKE_S.
> * config.gcc: Add -march=arrowlake and -march=arrowlake-s.
> * config/i386/driver-i386.cc (host_detect_local_cpu): Handle
> arrowlake-s.
> * config/i386/i386-options.cc (m_ARROWLAKE): New.
> (processor_cost_table): Add arrowlake.
> * config/i386/i386.h (enum processor_type):
> Add PROCESSOR_ARROWLAKE.
> * doc/extend.texi: Add arrowlake and arrowlake-s.
> * doc/invoke.texi: Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/mv16.C: Add arrowlake and arrowlake-s.
> * gcc.target/i386/funcspec-56.inc: Handle new march.
> ---
>  gcc/common/config/i386/cpuinfo.h  | 18 
>  gcc/common/config/i386/i386-common.cc |  7 ++
>  gcc/common/config/i386/i386-cpuinfo.h |  2 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/driver-i386.cc|  5 +-
>  gcc/config/i386/i386-c.cc |  7 ++
>  gcc/config/i386/i386-options.cc   |  2 +
>  gcc/config/i386/i386.h|  4 +
>  gcc/config/i386/x86-tune.def  | 92 +++
>  gcc/doc/extend.texi   |  6 ++
>  gcc/doc/invoke.texi   | 17 
>  gcc/testsuite/g++.target/i386/mv16.C  | 12 +++
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  2 +
>  13 files changed, 135 insertions(+), 41 deletions(-)
>
> diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
> index 159e5f03f0b..e6f1a0ac0a1 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -579,6 +579,24 @@ get_intel_cpu (struct __processor_model *cpu_model,
>CHECK___builtin_cpu_is ("grandridge");
>cpu_model->__cpu_type = INTEL_GRANDRIDGE;
>break;
> +case 0xc5:
> +  /* Arrow Lake.  */
> +  cpu = "arrowlake";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("arrowlake");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_ARROWLAKE;
> +  break;
> +case 0xc6:
> +  /* Arrow Lake S.  */
> +case 0xbd:
> +  /* Lunar Lake.  */
> +  cpu = "arrowlake-s";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("arrowlake-s");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_ARROWLAKE_S;
> +  break;
>  case 0x17:
>  case 0x1d:
>/* Penryn.  */
> diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> index 9b45ad61239..541f1441db8 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -2044,6 +2044,7 @@ const char *const processor_names[] =
>"alderlake",
>"rocketlake",
>"graniterapids",
> +  "arrowlake",
>"intel",
>"lujiazui",
>"geode",
> @@ -2167,6 +2168,12 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, PTA_GRANITERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F},
> +  {"arrowlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE), P_PROC_AVX2},
> +  {"arrowlake-s", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
> +  {"lunarlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
>{"bonnell", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
>  M_CPU_TYPE (INTEL_BONNELL), P_PROC_SSSE3},
>{"atom", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
> diff --git a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-cpuinfo.h
> index e6385dd56a3..b371fb792ec 100644
> --- a/gcc/common/config/i386/i386-cpuinfo.h
> +++ b/gcc/common/config/i386/i386-cpuinfo.h
> @@ -98,6 +98,8 @@ enum processor_subtypes
>ZHAOXIN_FAM7H_LUJIAZUI,
>

Re: [PATCH 4/4] Support Intel SM4

2023-07-16 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 2:04 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect SM4.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM4_SET,
> OPTION_MASK_ISA2_SM4_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SM4.
> (ix86_handle_option): Handle -msm4.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SM4.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> sm4.
> * config.gcc: Add sm4intrin.h.
> * config/i386/cpuid.h (bit_SM4): New.
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SM4__.
> * config/i386/i386-isa.def (SM4): Add DEF_PTA(SM4).
> * config/i386/i386-options.cc (isa2_opts): Add -msm4.
> (ix86_valid_target_attribute_inner_p): Handle sm4.
> * config/i386/i386.opt: Add option -msm4.
> * config/i386/immintrin.h: Include sm4intrin.h
> * config/i386/sse.md (vsm4key4_): New define insn.
> (vsm4rnds4_): Ditto.
> * doc/extend.texi: Document sm4.
> * doc/invoke.texi: Document -msm4.
> * doc/sourcebuild.texi: Document target sm4.
> * config/i386/sm4intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/other/i386-2.C: Add -msm4.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msm4.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sm4.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sm4): New.
> * gcc.target/i386/sm4-1.c: New test.
> * gcc.target/i386/sm4-check.h: Ditto.
> * gcc.target/i386/sm4key4-2.c: Ditto.
> * gcc.target/i386/sm4rnds4-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  20 +-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   1 +
>  gcc/config.gcc|   2 +-
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin.def  |   6 +
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   4 +-
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sm4intrin.h   |  70 +++
>  gcc/config/i386/sse.md|  26 +++
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |   9 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sm4-1.c |  20 ++
>  gcc/testsuite/gcc.target/i386/sm4-check.h | 183 ++
>  gcc/testsuite/gcc.target/i386/sm4key4-2.c |  14 ++
>  gcc/testsuite/gcc.target/i386/sm4rnds4-2.c|  14 ++
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   2 +-
>  gcc/testsuite/lib/target-supports.exp |  14 ++
>  30 files changed, 409 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/config/i386/sm4intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4key4-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4rnds4-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
> index 0cfde3ebccd..f9434f038ea 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -881,6 +881,8 @@ get_available_features (struct __processor_model *cpu_model,
> set_feature (FEATURE_SM3);
>   if (eax & bit_SHA512)
> set_feature (FEATURE_SHA512);
> + if (eax & bit_SM4)
> +   set_feature (FEATURE_SM4);
> }
>if (avx512_usable)
> {
> diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> index 97c3cdfe5e1..610cabe52c1 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -122,6 +122,7 @@ 

Re: [PATCH 2/4] Support Intel SM3

2023-07-16 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 2:04 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect SM3.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM3_SET,
> OPTION_MASK_ISA2_SM3_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SM3.
> (ix86_handle_option): Handle -msm3.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SM3.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> SM3.
> * config.gcc: Add sm3intrin.h
> * config/i386/cpuid.h (bit_SM3): New.
> * config/i386/i386-builtin-types.def:
> Add DEF_FUNCTION_TYPE (V4SI, V4SI, V4SI, V4SI, INT).
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SM3__.
> * config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
> V4SI_FTYPE_V4SI_V4SI_V4SI_INT.
> * config/i386/i386-isa.def (SM3): Add DEF_PTA(SM3).
> * config/i386/i386-options.cc (isa2_opts): Add -msm3.
> (ix86_valid_target_attribute_inner_p): Handle sm3.
> * config/i386/i386.opt: Add option -msm3.
> * config/i386/immintrin.h: Include sm3intrin.h.
> * config/i386/sse.md (vsm3msg1): New define insn.
> (vsm3msg2): Ditto.
> (vsm3rnds2): Ditto.
> * doc/extend.texi: Document sm3.
> * doc/invoke.texi: Document -msm3.
> * doc/sourcebuild.texi: Document target sm3.
> * config/i386/sm3intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/other/i386-2.C: Add -msm3.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/avx-1.c: Add new define for immediate.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msm3.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sm3.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sm3): New.
> * gcc.target/i386/sm3-1.c: New test.
> * gcc.target/i386/sm3-check.h: Ditto.
> * gcc.target/i386/sm3msg1-2.c: Ditto.
> * gcc.target/i386/sm3msg2-2.c: Ditto.
> * gcc.target/i386/sm3rnds2-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  20 +++-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   1 +
>  gcc/config.gcc|   3 +-
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin-types.def|   3 +
>  gcc/config/i386/i386-builtin.def  |   5 +
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-expand.cc|   1 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   2 +
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sm3intrin.h   |  72 
>  gcc/config/i386/sse.md|  43 
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |   7 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-1.c |   3 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sm3-1.c |  17 +++
>  gcc/testsuite/gcc.target/i386/sm3-check.h |  37 +++
>  gcc/testsuite/gcc.target/i386/sm3msg1-2.c |  54 +
>  gcc/testsuite/gcc.target/i386/sm3msg2-2.c |  57 ++
>  gcc/testsuite/gcc.target/i386/sm3rnds2-2.c| 104 ++
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   5 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   5 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   7 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   5 +-
>  gcc/testsuite/lib/target-supports.exp |  15 +++
>  34 files changed, 484 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/config/i386/sm3intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3msg1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3msg2-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3rnds2-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
> index 

Re: [PATCH 3/4] Support Intel SHA512

2023-07-16 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 2:06 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect SHA512.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SHA512_SET,
> OPTION_MASK_ISA2_SHA512_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SHA512.
> (ix86_handle_option): Handle -msha512.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SHA512.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> sha512.
> * config.gcc: Add sha512intrin.h.
> * config/i386/cpuid.h (bit_SHA512): New.
> * config/i386/i386-builtin-types.def:
> Add DEF_FUNCTION_TYPE (V4DI, V4DI, V4DI, V2DI).
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SHA512__.
> * config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
> V4DI_FTYPE_V4DI_V4DI_V2DI and V4DI_FTYPE_V4DI_V2DI.
> * config/i386/i386-isa.def (SHA512): Add DEF_PTA(SHA512).
> * config/i386/i386-options.cc (isa2_opts): Add -msha512.
> (ix86_valid_target_attribute_inner_p): Handle sha512.
> * config/i386/i386.opt: Add option -msha512.
> * config/i386/immintrin.h: Include sha512intrin.h.
> * config/i386/sse.md (vsha512msg1): New define insn.
> (vsha512msg2): Ditto.
> (vsha512rnds2): Ditto.
> * doc/extend.texi: Document sha512.
> * doc/invoke.texi: Document -msha512.
> * doc/sourcebuild.texi: Document target sha512.
> * config/i386/sha512intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/other/i386-2.C: Add -msha512.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msha512.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sha512.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sha512): New.
> * gcc.target/i386/sha512-1.c: New test.
> * gcc.target/i386/sha512-check.h: Ditto.
> * gcc.target/i386/sha512msg1-2.c: Ditto.
> * gcc.target/i386/sha512msg2-2.c: Ditto.
> * gcc.target/i386/sha512rnds2-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |  2 +
>  gcc/common/config/i386/i386-common.cc | 19 -
>  gcc/common/config/i386/i386-cpuinfo.h |  1 +
>  gcc/common/config/i386/i386-isas.h|  1 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/cpuid.h   |  1 +
>  gcc/config/i386/i386-builtin-types.def|  3 +
>  gcc/config/i386/i386-builtin.def  |  5 ++
>  gcc/config/i386/i386-c.cc |  2 +
>  gcc/config/i386/i386-expand.cc|  2 +
>  gcc/config/i386/i386-isa.def  |  1 +
>  gcc/config/i386/i386-options.cc   |  4 +-
>  gcc/config/i386/i386.opt  | 10 +++
>  gcc/config/i386/immintrin.h   |  2 +
>  gcc/config/i386/sha512intrin.h| 64 ++
>  gcc/config/i386/sse.md| 40 +
>  gcc/doc/extend.texi   |  5 ++
>  gcc/doc/invoke.texi   | 10 ++-
>  gcc/doc/sourcebuild.texi  |  3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |  2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |  2 +-
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  2 +
>  gcc/testsuite/gcc.target/i386/sha512-1.c  | 18 
>  gcc/testsuite/gcc.target/i386/sha512-check.h  | 43 ++
>  gcc/testsuite/gcc.target/i386/sha512msg1-2.c  | 48 +++
>  gcc/testsuite/gcc.target/i386/sha512msg2-2.c  | 47 ++
>  gcc/testsuite/gcc.target/i386/sha512rnds2-2.c | 85 +++
>  gcc/testsuite/gcc.target/i386/sse-12.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|  4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|  2 +-
>  gcc/testsuite/lib/target-supports.exp | 14 +++
>  33 files changed, 436 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/config/i386/sha512intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512msg1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512msg2-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512rnds2-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
> index 

Re: [PATCH 1/4] Support Intel AVX-VNNI-INT16

2023-07-16 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 2:06 PM Haochen Jiang via Gcc-patches
 wrote:
>
> From: Kong Lingling 
>
> gcc/ChangeLog
>
> * common/config/i386/cpuinfo.h (get_available_features): Detect
> avxvnniint16.
> * common/config/i386/i386-common.cc
> (OPTION_MASK_ISA2_AVXVNNIINT16_SET): New.
> (OPTION_MASK_ISA2_AVXVNNIINT16_UNSET): Ditto.
> (ix86_handle_option): Handle -mavxvnniint16.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_AVXVNNIINT16.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> avxvnniint16.
> * config.gcc: Add avxvnniint16intrin.h.
> * config/i386/avxvnniint16intrin.h: New file.
> * config/i386/cpuid.h (bit_AVXVNNIINT16): New.
> * config/i386/i386-builtin.def: Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __AVXVNNIINT16__.
> * config/i386/i386-options.cc (isa2_opts): Add -mavxvnniint16.
> (ix86_valid_target_attribute_inner_p): Handle avxvnniint16.
> * config/i386/i386-isa.def: Add DEF_PTA(AVXVNNIINT16).
> * config/i386/i386.opt: Add option -mavxvnniint16.
> * config/i386/immintrin.h: Include avxvnniint16intrin.h.
> * config/i386/sse.md
> (vpdp_): New define_insn.
> * doc/extend.texi: Document avxvnniint16.
> * doc/invoke.texi: Document -mavxvnniint16.
> * doc/sourcebuild.texi: Document target avxvnniint16.
Ok.
>
> gcc/testsuite/ChangeLog
>
> * g++.dg/other/i386-2.C: Add -mavxvnniint16.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/avx-check.h: Add avxvnniint16 check.
> * gcc.target/i386/sse-12.c: Add -mavxvnniint16.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Ditto.
> * gcc.target/i386/sse-23.c: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * lib/target-supports.exp
> (check_effective_target_avxvnniint16): New.
> * gcc.target/i386/avxvnniint16-1.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwusd-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwusds-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwsud-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwsuds-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwuud-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwuuds-2.c: Ditto.
>
> Co-authored-by: Haochen Jiang 
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  22 ++-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   2 +
>  gcc/config.gcc|   2 +-
>  gcc/config/i386/avxvnniint16intrin.h  | 138 ++
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin.def  |  14 ++
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   4 +-
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sse.md|  32 
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |  10 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-check.h |   3 +
>  .../gcc.target/i386/avxvnniint16-1.c  |  43 ++
>  .../gcc.target/i386/avxvnniint16-vpdpwsud-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwsuds-2.c   |  72 +
>  .../gcc.target/i386/avxvnniint16-vpdpwusd-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwusds-2.c   |  72 +
>  .../gcc.target/i386/avxvnniint16-vpdpwuud-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwuuds-2.c   |  71 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   2 +-
>  gcc/testsuite/lib/target-supports.exp |  12 ++
>  34 files changed, 735 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/config/i386/avxvnniint16intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-vpdpwsud-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-vpdpwsuds-2.c
>  create mode 100644 

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-13 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 2:32 PM Richard Biener  wrote:
>
> On Thu, 13 Jul 2023, Hongtao Liu wrote:
>
> > > On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu  wrote:
> > > >
> > > > On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > The PRs ask for optimizing of
> > > >
> > > >   _1 = BIT_FIELD_REF ;
> > > >   result_4 = BIT_INSERT_EXPR ;
> > > >
> > > > to a vector permutation.  The following implements this as
> > > > match.pd pattern, improving code generation on x86_64.
> > > >
> > > > On the RTL level we face the issue that backend patterns inconsistently
> > > > use vec_merge and vec_select of vec_concat to represent permutes.
> > > >
> > > > I think using a (supported) permute is almost always better
> > > > than an extract plus insert, maybe excluding the case we extract
> > > > element zero and that's aliased to a register that can be used
> > > > directly for insertion (not sure how to query that).
> > > >
> > > > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > > > now realizes that
> > > >
> > > >   _1 = BIT_FIELD_REF ;
> > > >   if (_1 > a_4(D))
> > > > goto ; [50.00%]
> > > >   else
> > > > goto ; [50.00%]
> > > >
> > > >[local count: 536870913]:
> > > >
> > > >[local count: 1073741824]:
> > > >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> > > >   x_5 = BIT_INSERT_EXPR ;
> > > >
> > > > is equal to
> > > >
> > > >[local count: 1073741824]:
> > > >   _1 = BIT_FIELD_REF ;
> > > >   if (_1 > a_4(D))
> > > > goto ; [50.00%]
> > > >   else
> > > > goto ; [50.00%]
> > > >
> > > >[local count: 536870912]:
> > > >   _7 = BIT_INSERT_EXPR ;
> > > >
> > > >[local count: 1073741824]:
> > > >   # prephitmp_8 = PHI 
> > > >
> > > > and that no longer produces the desired maxsd operation at the RTL
> > > > The comparison is in scalar mode, but the operations in then_bb are in
> > > > vector mode, so if_convert can't eliminate the condition any more (and
> > > > won't go into the backend's ix86_expand_sse_fp_minmax).
> > > > I think ordered comparisons like _1 > a_4 don't match fmin/fmax,
> > > > but do match SSE MINSS/MAXSS, since those always return the second
> > > > operand (not the other operand) when there's NONE.
> > I mean NANs.
>
> Btw, I once tried to recognize MAX here at the GIMPLE level but
> while the x86 (vector) max insns are fine for x > y ? x : y we
> have no tree code or optab for exactly that, we have MAX_EXPR
> which behaves differently for NaN and .FMAX which is exactly IEEE
> which the x86 ISA isn't.
>
> I wonder if we thus should if-convert this on the GIMPLE level
> but to x > y ? x : y, thus a COND_EXPR?
COND_EXPR maps to movcc; on x86 it is expanded by
ix86_expand_fp_movcc, which will try to detect FP min/max.
It's probably OK.
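
To make the NaN point concrete, here is a small C sketch (not part of the patch). select_max is the x > y ? x : y form discussed in this thread, which returns its second operand whenever the comparison is unordered. ieee_fmax_model is a hand-written stand-in for IEEE fmax semantics, written only for this illustration (it is not the libm implementation), and nan_case_check is a hypothetical helper name.

```c
#include <math.h>   /* NAN */

/* The form discussed in the thread: an ordered comparison plus a select.
   If either input is a NaN the comparison is false, so Y is returned.  */
static double select_max (double x, double y)
{
  return x > y ? x : y;
}

/* Hypothetical model of IEEE fmax: a NaN operand is ignored when the
   other operand is a number.  (v != v) is true only for a NaN.  */
static double ieee_fmax_model (double x, double y)
{
  if (x != x)
    return y;
  if (y != y)
    return x;
  return x > y ? x : y;
}

/* Returns 1 when the two forms behave as described for NaN inputs.  */
static int nan_case_check (void)
{
  double a = select_max (NAN, 1.0);       /* comparison false: second operand, 1.0 */
  double b = select_max (1.0, NAN);       /* comparison false: second operand, NaN */
  double c = ieee_fmax_model (1.0, NAN);  /* NaN ignored: 1.0 */
  return a == 1.0 && b != b && c == 1.0;
}
```

The second call is where the two forms diverge: the select returns the NaN, while the fmax model returns 1.0, which is why the select cannot simply be rewritten as .FMAX.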
>
> Richard.
>
> > > > level (we fail to match .FMAX at the GIMPLE level earlier).
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> > > >
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> > > >
> > > > I think this is also PR88540 (the lack of min/max detection, not
> > > > sure if the SSE min/max are suitable here)
> > > >
> > > > PR tree-optimization/94864
> > > > PR tree-optimization/94865
> > > > * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > > > for vector insertion from vector extraction.
> > > >
> > > > * gcc.target/i386/pr94864.c: New testcase.
> > > > * gcc.target/i386/pr94865.c: Likewise.
> > > > ---
> > > >  gcc/match.pd| 25 +
> > > >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
> > > >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
> > > >  3 files changed, 51 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 8543f777a28..8cc106049c4 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >   wi::to_wide (@ipos) + isize))
> > > >  (BIT_FIELD_REF @0 @rsize @rpos)
> > > >
> > > > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > > > +(simplify
> > > > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > > > + (if (VECTOR_TYPE_P (type)
> > > > +  && types_match (@0, @1)
> > > > +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > > > +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > > > +  (with
> 

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-12 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu  wrote:
>
> On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > The PRs ask for optimizing of
> >
> >   _1 = BIT_FIELD_REF ;
> >   result_4 = BIT_INSERT_EXPR ;
> >
> > to a vector permutation.  The following implements this as
> > match.pd pattern, improving code generation on x86_64.
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> >
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> >
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF ;
> >   if (_1 > a_4(D))
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> >
> >[local count: 536870913]:
> >
> >[local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR ;
> >
> > is equal to
> >
> >[local count: 1073741824]:
> >   _1 = BIT_FIELD_REF ;
> >   if (_1 > a_4(D))
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> >
> >[local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR ;
> >
> >[local count: 1073741824]:
> >   # prephitmp_8 = PHI 
> >
> > and that no longer produces the desired maxsd operation at the RTL
> The comparison is in scalar mode, but the operations in then_bb are in
> vector mode, so if_convert can't eliminate the condition any more (and
> won't go into the backend's ix86_expand_sse_fp_minmax).
> I think ordered comparisons like _1 > a_4 don't match fmin/fmax,
> but do match SSE MINSS/MAXSS, since those always return the second
> operand (not the other operand) when there's NONE.
I mean NANs.
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> > PR tree-optimization/94864
> > PR tree-optimization/94865
> > * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > for vector insertion from vector extraction.
> >
> > * gcc.target/i386/pr94864.c: New testcase.
> > * gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd| 25 +
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >   wi::to_wide (@ipos) + isize))
> >  (BIT_FIELD_REF @0 @rsize @rpos)
> >
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +  && types_match (@0, @1)
> > +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > + unsigned HOST_WIDE_INT elsz
> > > +   = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > + poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > + poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > + unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > + vec_perm_builder builder;
> > + builder.new_vector (nunits, nunits, 1);
> > + for (unsigned i = 0; i < nunits; ++i)
> > +   builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > + vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > > +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > > +(vec_perm @0 @1 { vec_perm_indices_to_tree
> > > +(build_vector_type (ssizetype, nunits), sel); })))))
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >(simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> > 

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-12 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
 wrote:
>
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF ;
>   result_4 = BIT_INSERT_EXPR ;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF ;
>   if (_1 > a_4(D))
> goto ; [50.00%]
>   else
> goto ; [50.00%]
>
>[local count: 536870913]:
>
>[local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR ;
>
> is equal to
>
>[local count: 1073741824]:
>   _1 = BIT_FIELD_REF ;
>   if (_1 > a_4(D))
> goto ; [50.00%]
>   else
> goto ; [50.00%]
>
>[local count: 536870912]:
>   _7 = BIT_INSERT_EXPR ;
>
>[local count: 1073741824]:
>   # prephitmp_8 = PHI 
>
> and that no longer produces the desired maxsd operation at the RTL
The comparison is in scalar mode, but the operations in then_bb are in
vector mode, so if_convert can't eliminate the condition any more (and
won't go into the backend's ix86_expand_sse_fp_minmax).
I think ordered comparisons like _1 > a_4 don't match fmin/fmax,
but do match SSE MINSS/MAXSS, since those always return the second
operand (not the other operand) when there's NONE.
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
> PR tree-optimization/94864
> PR tree-optimization/94865
> * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> for vector insertion from vector extraction.
>
> * gcc.target/i386/pr94864.c: New testcase.
> * gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd| 25 +
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   wi::to_wide (@ipos) + isize))
>  (BIT_FIELD_REF @0 @rsize @rpos)
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +  && types_match (@0, @1)
> +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> + unsigned HOST_WIDE_INT elsz
> +   = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> + poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> + poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> + unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> + vec_perm_builder builder;
> + builder.new_vector (nunits, nunits, 1);
> + for (unsigned i = 0; i < nunits; ++i)
> +   builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> + vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> +(vec_perm @0 @1 { vec_perm_indices_to_tree
> +(build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>(simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df 
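
The selector that the quoted match.pd pattern builds can be modelled in plain C. This is only a sketch under stated assumptions: NUNITS is a hypothetical fixed vector length, arrays stand in for vectors, and build_selector, vec_perm2, extract_insert and selector_matches_reference are illustrative names, not GCC functions. Lane ielt of the result takes lane relt of the second input (index NUNITS + relt), and every other lane i takes lane i of the first input.

```c
#define NUNITS 4  /* hypothetical vector length for the illustration */

/* Build the two-input permute selector, mirroring the loop in the
   quoted pattern: sel[i] = (i == ielt) ? nunits + relt : i.  */
static void build_selector (unsigned sel[NUNITS], unsigned ielt, unsigned relt)
{
  for (unsigned i = 0; i < NUNITS; ++i)
    sel[i] = (i == ielt) ? NUNITS + relt : i;
}

/* Apply a two-input permute: indices below NUNITS select from A,
   the remaining indices select from B.  */
static void vec_perm2 (double out[NUNITS], const double a[NUNITS],
                       const double b[NUNITS], const unsigned sel[NUNITS])
{
  for (unsigned i = 0; i < NUNITS; ++i)
    out[i] = sel[i] < NUNITS ? a[sel[i]] : b[sel[i] - NUNITS];
}

/* Reference semantics: extract lane RELT of B and insert it into
   lane IELT of a copy of A.  */
static void extract_insert (double out[NUNITS], const double a[NUNITS],
                            const double b[NUNITS], unsigned ielt,
                            unsigned relt)
{
  for (unsigned i = 0; i < NUNITS; ++i)
    out[i] = a[i];
  out[ielt] = b[relt];
}

/* Returns 1 if the permute matches extract-plus-insert for every
   combination of insert lane and extract lane.  */
static int selector_matches_reference (void)
{
  const double a[NUNITS] = { 1, 2, 3, 4 };
  const double b[NUNITS] = { 10, 20, 30, 40 };
  for (unsigned ielt = 0; ielt < NUNITS; ++ielt)
    for (unsigned relt = 0; relt < NUNITS; ++relt)
      {
        unsigned sel[NUNITS];
        double perm[NUNITS], ref[NUNITS];
        build_selector (sel, ielt, relt);
        vec_perm2 (perm, a, b, sel);
        extract_insert (ref, a, b, ielt, relt);
        for (unsigned i = 0; i < NUNITS; ++i)
          if (perm[i] != ref[i])
            return 0;
      }
  return 1;
}
```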

Re: [PATCH V2] Provide -fcf-protection=branch,return.

2023-07-12 Thread Hongtao Liu via Gcc-patches
ping.

On Mon, May 22, 2023 at 4:08 PM Hongtao Liu  wrote:
>
> ping.
>
> On Sat, May 13, 2023 at 5:20 PM liuhongt  wrote:
> >
> > > I think this could be simplified if you use either EnumSet or
> > > EnumBitSet instead in common.opt for `-fcf-protection=`.
> >
> > > Use EnumSet instead of EnumBitSet, since CF_FULL is not a power of 2.
> > > Classifying the values into sets is a bit tricky: cf_branch and cf_return
> > > should be in different sets, but they both "conflict" with cf_full and
> > > cf_none, and the current EnumSet doesn't handle this well.
> > >
> > > So in the current implementation, only cf_full and cf_none are mutually
> > > exclusive, but either can be combined with any of cf_branch, cf_return,
> > > and cf_check. It's not perfect, but it is still an improvement over the
> > > original.
> >
> > gcc/ChangeLog:
> >
> > * common.opt: (fcf-protection=): Add EnumSet attribute to
> > support combination of params.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * c-c++-common/fcf-protection-10.c: New test.
> > * c-c++-common/fcf-protection-11.c: New test.
> > * c-c++-common/fcf-protection-12.c: New test.
> > * c-c++-common/fcf-protection-8.c: New test.
> > * c-c++-common/fcf-protection-9.c: New test.
> > * gcc.target/i386/pr89701-1.c: New test.
> > * gcc.target/i386/pr89701-2.c: New test.
> > * gcc.target/i386/pr89701-3.c: New test.
> > ---
> >  gcc/common.opt | 12 ++--
> >  gcc/testsuite/c-c++-common/fcf-protection-10.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-11.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-12.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-8.c  |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-9.c  |  2 ++
> >  gcc/testsuite/gcc.target/i386/pr89701-1.c  |  4 
> >  gcc/testsuite/gcc.target/i386/pr89701-2.c  |  4 
> >  gcc/testsuite/gcc.target/i386/pr89701-3.c  |  4 
> >  9 files changed, 28 insertions(+), 6 deletions(-)
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-10.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-11.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-12.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-8.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-9.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-3.c
> >
> > diff --git a/gcc/common.opt b/gcc/common.opt
> > index a28ca13385a..02f2472959a 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -1886,7 +1886,7 @@ fcf-protection
> >  Common RejectNegative Alias(fcf-protection=,full)
> >
> >  fcf-protection=
> > > -Common Joined RejectNegative Enum(cf_protection_level) Var(flag_cf_protection) Init(CF_NONE)
> > > +Common Joined RejectNegative Enum(cf_protection_level) EnumSet Var(flag_cf_protection) Init(CF_NONE)
> > >  -fcf-protection=[full|branch|return|none|check] Instrument functions with checks to verify jump/call/return control-flow transfer
> >  instructions have valid targets.
> >
> > @@ -1894,19 +1894,19 @@ Enum
> > >  Name(cf_protection_level) Type(enum cf_protection_level) UnknownError(unknown Control-Flow Protection Level %qs)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(full) Value(CF_FULL)
> > +Enum(cf_protection_level) String(full) Value(CF_FULL) Set(1)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(branch) Value(CF_BRANCH)
> > +Enum(cf_protection_level) String(branch) Value(CF_BRANCH) Set(2)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(return) Value(CF_RETURN)
> > +Enum(cf_protection_level) String(return) Value(CF_RETURN) Set(3)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(check) Value(CF_CHECK)
> > +Enum(cf_protection_level) String(check) Value(CF_CHECK) Set(4)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(none) Value(CF_NONE)
> > +Enum(cf_protection_level) String(none) Value(CF_NONE) Set(1)
> >
> >  finstrument-functions
> >  Common Var(flag_instrument_function_entry_exit,1)
> > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-10.c 
> > b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > new file mode 100644
> > index 000..b271d134e52
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > @@ -0,0 +1,2 @@
> > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > +/* { dg-options "-fcf-protection=branch,check" } */
> > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-11.c 
> > b/gcc/testsuite/c-c++-common/fcf-protection-11.c
> > new file mode 100644
> > index 000..2e566350ccd
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/fcf-protection-11.c
> > @@ -0,0 +1,2 @@
> > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > +/* { dg-options "-fcf-protection=branch,return" } */
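
The set behaviour described in the quoted message can be sketched as a small stand-alone C model. This is not GCC's EnumSet implementation, which lives in the option-handling code and differs in detail. The numeric values below are assumptions chosen so that CF_FULL equals CF_BRANCH | CF_RETURN, and parse_cf_protection is a hypothetical helper. The Set numbers are the ones from the quoted common.opt hunk: two keywords from the same set are rejected, and keywords from different sets are ORed together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-ins for enum cf_protection_level; values are illustrative.  */
enum { CF_NONE = 0, CF_BRANCH = 1, CF_RETURN = 2, CF_FULL = 3, CF_CHECK = 8 };

struct cf_value { const char *name; unsigned value; unsigned set; };

/* Set numbers follow the Set(N) annotations in the quoted patch.  */
static const struct cf_value cf_values[] = {
  { "full", CF_FULL, 1 },
  { "none", CF_NONE, 1 },
  { "branch", CF_BRANCH, 2 },
  { "return", CF_RETURN, 3 },
  { "check", CF_CHECK, 4 },
};

/* Parse a comma-separated argument such as "branch,return".
   Returns 0 and stores the ORed value in *OUT on success; returns -1 for
   an unknown keyword or for two keywords that belong to the same set.  */
static int parse_cf_protection (const char *arg, unsigned *out)
{
  assert (arg != NULL && out != NULL);
  unsigned result = 0, seen_sets = 0;
  while (*arg)
    {
      size_t len = strcspn (arg, ",");
      const struct cf_value *match = NULL;
      for (size_t i = 0; i < sizeof cf_values / sizeof cf_values[0]; i++)
        if (strlen (cf_values[i].name) == len
            && strncmp (arg, cf_values[i].name, len) == 0)
          match = &cf_values[i];
      if (match == NULL || (seen_sets & (1u << match->set)))
        return -1;
      seen_sets |= 1u << match->set;
      result |= match->value;
      arg += len;
      if (*arg == ',')
        arg++;
    }
  *out = result;
  return 0;
}
```

In this model "full,none" is rejected because both keywords are in set 1, while "full,branch" is accepted, which matches the "not perfect" caveat in the message.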

Re: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

2023-07-11 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle  wrote:
>
>
> > From: Hongtao Liu 
> > Sent: 28 June 2023 04:23
> > > From: Roger Sayle 
> > > Sent: 27 June 2023 20:28
> > >
> > > I've also come up with an alternate/complementary/supplementary
> > > fix of generating the PTEST during RTL expansion, rather than rely on
> > > this being caught/optimized later during STV.
> > >
> > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > > appear last.  When I was writing this, I initially also added support
> > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > > support 256-bit OImode (which also explains why we don't have an
> > > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > > ordering should minimize the lines changed if things change in future.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-06-27  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > testing a TImode SUBREG of a 128-bit vector register against
> > > > zero, use a PTEST instruction instead of first moving it
> > > > to scalar registers.
> > >
> >
> > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > + equality/inequality against zero.  */
> > +  if (op1 == const0_rtx
> > +  && SUBREG_P (op0)
> > +  && cmpmode == CCZmode
> > +  && SUBREG_BYTE (op0) == 0
> > +  && REG_P (SUBREG_REG (op0))
> > Just register_operand (op0, TImode),
>
> I completely agree that in most circumstances, the early RTL optimizers
> should use standard predicates, such as register_operand, that don't
> distinguish between REG and SUBREG, allowing the choice (assignment)
> to be left to register allocation (reload).
>
> However, in this case, unusually, the presence of the SUBREG, and treating
> it differently from a REG, is critical (in fact, it is the reason for the
> patch).  x86_64 can very efficiently test whether a 128-bit value is zero,
> setting ZF, either in TImode, using orq %rax,%rdx in a single cycle/single
> instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a single
> cycle/single instruction.  There's no reason to prefer one form over the
> other.  A SUBREG, however, that moves the value from the scalar registers
> to a vector register, or from a vector register to scalar registers,
> requires two or three instructions, often reading and writing values via
> memory, at a huge performance penalty.  Hence the goal is to eliminate the
> (VIEW_CONVERT) SUBREG, and choose the appropriate single-cycle test
> instruction for where the data is located.  Hence we want to leave REG_P
> alone, but optimize (only) the SUBREG_P cases.  register_operand doesn't
> help with this.
>
> Note this is counter to the usual advice.  Normally, a SUBREG between scalar
> registers is cheap (in fact free) on x86, hence it is safe for predicates
> to ignore them prior to register allocation.  But another use of SUBREG, to
> represent a VIEW_CONVERT_EXPR/transfer between processing units, is closer
> to a conversion, and a very expensive one (going via memory with
> different-sized reads vs. writes) at that.
>
>
> > +  && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > +  && TARGET_SSE4_1
> > +  && GET_MODE (op0) == TImode
> > +  && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > +{
> > +  tmp = SUBREG_REG (op0);
> > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > I think RA can handle SUBREG correctly, no need for extra predicates.
>
> Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> forces there to always be an inter-unit transfer/pipeline stall, when this
> is the idiom that we're trying to eliminate.
>
> I should have repeated the motivating example from my original post at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
>
> typedef long long __m128i __attribute__ ((__vector_size__ (16)));
> int foo (__m128i x, __m128i y) {
>   return (__int128)x == (__int128)y;
> }
>
> is currently generated as:
> foo:    movaps  %xmm0, -40(%rsp)
>         movq    -32(%rsp), %rdx
>         movq    %xmm0, %rax
>         movq    %xmm1, %rsi
>         movaps  %xmm1, -24(%rsp)
>         movq    -16(%rsp), %rcx
>         xorq    %rsi, %rax
>         xorq    %rcx, %rdx
>         orq     %rdx, %rax
>         sete    %al
>         movzbl  %al, %eax
>         ret
>
> with this patch (to eliminate the interunit SUBREG) this becomes:
>
> foo:    pxor    %xmm1, %xmm0
>         xorl    %eax, %eax
>         ptest   %xmm0, %xmm0
>         sete    %al
>         ret
>
> Hopefully, this clarifies things a little.
Thanks for the explanation, the patch LGTM.
One curious question, is there any case SUBREG_BYTE != 0 when inner
and outer mode(TImode) have the 
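
The equivalence behind the two quoted assembly sequences can be sketched in portable C. This is an illustration only, assuming a hypothetical struct u128 that splits the 128-bit value into two 64-bit halves. equal_scalar mirrors the xorq/xorq/orq/sete sequence, and equal_vector_style mirrors pxor followed by ptest of the result against itself, where ZF is set exactly when the AND of the two operands is all zeros.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves.  */
struct u128 { uint64_t lo, hi; };

/* Scalar form: XOR each half, OR the halves together, test for zero.  */
static int equal_scalar (struct u128 x, struct u128 y)
{
  return ((x.lo ^ y.lo) | (x.hi ^ y.hi)) == 0;
}

/* Vector-style form: one 128-bit XOR, then a PTEST-like check.  */
static int equal_vector_style (struct u128 x, struct u128 y)
{
  struct u128 d = { x.lo ^ y.lo, x.hi ^ y.hi };        /* pxor */
  int zf = (d.lo & d.lo) == 0 && (d.hi & d.hi) == 0;   /* ptest d, d */
  return zf;                                           /* sete */
}
```

Both functions compute the same predicate; the difference discussed in the thread is only where the data lives when the test is performed.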

Re: [PATCH] i386: Guard 128 bit VAES builtins with AVX512VL

2023-07-10 Thread Hongtao Liu via Gcc-patches
On Tue, Jul 11, 2023 at 11:40 AM Haochen Jiang via Gcc-patches
 wrote:
>
> Hi all,
>
> Currently on trunk, using either the intrinsic or the builtin for the
> 128-bit VAES ISA results in an ICE, because AVX512VL is not checked until
> the pattern is reached, which is not what users expect. This patch fixes
> that ICE and reports an error in this scenario instead.
>
> Regtested on x86-64-linux-gnu{-m32,}. Ok for trunk?
>
Ok.
> BRs,
> Haochen
>
> Since commit 24a8acc, the 128-bit intrinsics are enabled for VAES. However,
> AVX512VL is not checked until we reach the pattern, which results in an
> ICE.
>
> Add an AVX512VL guard at the builtin so that an error is reported when
> checking ISA flags.
>
> gcc/ChangeLog:
>
> * config/i386/i386-builtins.cc (ix86_init_mmx_sse_builtins):
> Add OPTION_MASK_ISA_AVX512VL.
> * config/i386/i386-expand.cc (ix86_check_builtin_isa_match):
> Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx512vl-vaes-1.c: New test.
> ---
>  gcc/config/i386/i386-builtins.cc| 12 
>  gcc/config/i386/i386-expand.cc  |  4 +++-
>  gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c | 12 
>  3 files changed, 23 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c
>
> diff --git a/gcc/config/i386/i386-builtins.cc 
> b/gcc/config/i386/i386-builtins.cc
> index 28f404da288..e436ca4e5b1 100644
> --- a/gcc/config/i386/i386-builtins.cc
> +++ b/gcc/config/i386/i386-builtins.cc
> @@ -662,19 +662,23 @@ ix86_init_mmx_sse_builtins (void)
>VOID_FTYPE_UNSIGNED_UNSIGNED, IX86_BUILTIN_MWAIT);
>
>/* AES */
> -  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
> +  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
> +| OPTION_MASK_ISA_AVX512VL,
>  OPTION_MASK_ISA2_VAES,
>  "__builtin_ia32_aesenc128",
>  V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESENC128);
> -  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
> +  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
> +| OPTION_MASK_ISA_AVX512VL,
>  OPTION_MASK_ISA2_VAES,
>  "__builtin_ia32_aesenclast128",
>  V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESENCLAST128);
> -  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
> +  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
> +| OPTION_MASK_ISA_AVX512VL,
>  OPTION_MASK_ISA2_VAES,
>  "__builtin_ia32_aesdec128",
>  V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESDEC128);
> -  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
> +  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
> +| OPTION_MASK_ISA_AVX512VL,
>  OPTION_MASK_ISA2_VAES,
>  "__builtin_ia32_aesdeclast128",
>  V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESDECLAST128);
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 567248d6830..9a04bf4455b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -12626,6 +12626,7 @@ ix86_check_builtin_isa_match (unsigned int fcode,
> OPTION_MASK_ISA2_AVXIFMA
>   (OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA2_AVX512BF16) or
> OPTION_MASK_ISA2_AVXNECONVERT
> + OPTION_MASK_ISA_AES or (OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA2_VAES)
>   where for each such pair it is sufficient if either of the ISAs is
>   enabled, plus if it is ored with other options also those others.
>   OPTION_MASK_ISA_MMX in bisa is satisfied also if TARGET_MMX_WITH_SSE.  */
> @@ -12649,7 +12650,8 @@ ix86_check_builtin_isa_match (unsigned int fcode,
>  OPTION_MASK_ISA2_AVXIFMA);
>SHARE_BUILTIN (OPTION_MASK_ISA_AVX512VL, OPTION_MASK_ISA2_AVX512BF16, 0,
>  OPTION_MASK_ISA2_AVXNECONVERT);
> -  SHARE_BUILTIN (OPTION_MASK_ISA_AES, 0, 0, OPTION_MASK_ISA2_VAES);
> +  SHARE_BUILTIN (OPTION_MASK_ISA_AES, 0, OPTION_MASK_ISA_AVX512VL,
> +OPTION_MASK_ISA2_VAES);
>isa = tmp_isa;
>isa2 = tmp_isa2;
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c
> new file mode 100644
> index 000..fabb170a031
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mvaes -mno-avx512vl -mno-aes" } */
> +
> +#include 
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +
> +v2di
> +f1 (v2di x, v2di y)
> +{
> +  return __builtin_ia32_aesenc128 (x, y); /* { dg-error "needs isa option" } */
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao
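
As a reading aid for the guard added by the quoted patch, the availability rule stated in the updated comment ("OPTION_MASK_ISA_AES or (OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA2_VAES)", plus any other ORed options) can be written as a plain predicate. This is a simplified model and not GCC code: struct isa_flags and aes128_builtin_available are hypothetical names, and treating SSE2 as an additional requirement is an interpretation of the builtin's mask in the patch.

```c
#include <stdbool.h>

/* Hypothetical flag set standing in for the enabled ISA options.  */
struct isa_flags { bool sse2, aes, vaes, avx512vl; };

/* Simplified model: the 128-bit AES builtins are usable with AES, or
   with the VAES + AVX512VL pair, and SSE2 is needed in either case.  */
static bool aes128_builtin_available (struct isa_flags f)
{
  return f.sse2 && (f.aes || (f.avx512vl && f.vaes));
}
```

With the options of the new test (-mvaes -mno-avx512vl -mno-aes) the predicate is false, which corresponds to the expected "needs isa option" error.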


Re: [PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-10 Thread Hongtao Liu via Gcc-patches
Please ignore this patch; I'm testing another patch to separate out the
non-swapped-operands case, where a setcc is not needed in the peephole2.
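
For readers following the swapped-operands handling in the quoted peephole2 below, here is a minimal C model of the idea behind swapping a comparison code when its operands are exchanged. It is a sketch, not GCC's swap_condition: enum cmp_code, swap_cmp, eval_cmp and swap_preserves_result are hypothetical names, and only signed comparisons are modelled.

```c
/* Hypothetical comparison codes for the illustration.  */
enum cmp_code { CMP_EQ, CMP_NE, CMP_GT, CMP_GE, CMP_LT, CMP_LE };

/* Return the code that gives the same answer once the two operands
   have been exchanged.  */
static enum cmp_code swap_cmp (enum cmp_code c)
{
  switch (c)
    {
    case CMP_GT: return CMP_LT;
    case CMP_LT: return CMP_GT;
    case CMP_GE: return CMP_LE;
    case CMP_LE: return CMP_GE;
    default: return c;  /* EQ and NE are symmetric */
    }
}

/* Evaluate A <code> B.  */
static int eval_cmp (enum cmp_code c, int a, int b)
{
  switch (c)
    {
    case CMP_EQ: return a == b;
    case CMP_NE: return a != b;
    case CMP_GT: return a > b;
    case CMP_GE: return a >= b;
    case CMP_LT: return a < b;
    case CMP_LE: return a <= b;
    }
  return 0;
}

/* Returns 1 if (a <c> b) equals (b <swap(c)> a) for all tested inputs.  */
static int swap_preserves_result (void)
{
  for (int c = CMP_EQ; c <= CMP_LE; ++c)
    for (int a = -1; a <= 1; ++a)
      for (int b = -1; b <= 1; ++b)
        if (eval_cmp ((enum cmp_code) c, a, b)
            != eval_cmp (swap_cmp ((enum cmp_code) c), b, a))
          return 0;
  return 1;
}
```

EQ and NE map to themselves, which is the situation the quoted condition "swap_condition (code) != code" filters out.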

On Tue, Jul 11, 2023 at 11:14 AM liuhongt via Gcc-patches
 wrote:
>
> Similar to what we did for cmpxchg, but extended to all
> ix86_comparison_int_operator codes, since cmpccxadd sets EFLAGS exactly
> the same way as CMP.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,},
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/110591
> * config/i386/sync.md (cmpccxadd_): Add a new
> define_peephole2 after the pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110591.c: New test.
> ---
>  gcc/config/i386/sync.md  | 56 
>  gcc/testsuite/gcc.target/i386/pr110591.c | 66 
>  2 files changed, 122 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c
>
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index e1fa1504deb..43f6421bcb8 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -1105,3 +1105,59 @@ (define_insn "cmpccxadd_"
>output_asm_insn (buf, operands);
>return "";
>  })
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (clobber (reg:CC FLAGS_REG))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))
> +   (set (match_operand:QI 7 "nonimmediate_operand")
> +   (match_operator:QI 8 "ix86_comparison_int_operator"
> + [(reg FLAGS_REG) (const_int 0)]))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && ((rtx_equal_p (operands[0], operands[5])
> +   && rtx_equal_p (operands[1], operands[6]))
> +   || ((rtx_equal_p (operands[0], operands[6])
> +   && rtx_equal_p (operands[1], operands[5]))
> +  && peep2_regno_dead_p (4, FLAGS_REG)))"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_dup 2)
> + (match_dup 0)
> + (match_dup 3)
> + (match_dup 4)]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (clobber (reg:CC FLAGS_REG))])
> +   (set (match_dup 7)
> +   (match_op_dup 8
> + [(match_dup 9) (const_int 0)]))]
> +{
> +  operands[9] = gen_rtx_REG (GET_MODE (XEXP (operands[8], 0)), FLAGS_REG);
> +  if (rtx_equal_p (operands[0], operands[6])
> + && rtx_equal_p (operands[1], operands[5])
> + && swap_condition (GET_CODE (operands[8])) != GET_CODE (operands[8]))
> + {
> +   operands[8] = shallow_copy_rtx (operands[8]);
> +   enum rtx_code ccode = swap_condition (GET_CODE (operands[8]));
> +   PUT_CODE (operands[8], ccode);
> +   operands[9] = gen_rtx_REG (SELECT_CC_MODE (ccode,
> + operands[6],
> + operands[5]),
> +  FLAGS_REG);
> + }
> +})
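
The second arm of the peephole2 condition above covers a compare whose operands are in the opposite order from the cmpccxadd, in which case the setcc code has to be swapped as well. A minimal standalone sketch of that idea follows. The names model_code, model_swap_condition and model_eval are hypothetical stand-ins for illustration only; they are not GCC's enum rtx_code or swap_condition, and only signed comparisons are modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for RTL comparison codes; not GCC's enum rtx_code.  */
enum model_code { M_EQ, M_NE, M_GT, M_GE, M_LT, M_LE };

/* Return the comparison code to use once the two compared operands
   have been exchanged.  */
static enum model_code
model_swap_condition (enum model_code code)
{
  switch (code)
    {
    case M_GT: return M_LT;
    case M_GE: return M_LE;
    case M_LT: return M_GT;
    case M_LE: return M_GE;
    default:   return code;   /* EQ and NE are symmetric.  */
    }
}

/* Evaluate "a CODE b" on plain signed integers.  */
static bool
model_eval (enum model_code code, int a, int b)
{
  switch (code)
    {
    case M_EQ: return a == b;
    case M_NE: return a != b;
    case M_GT: return a > b;
    case M_GE: return a >= b;
    case M_LT: return a < b;
    case M_LE: return a <= b;
    }
  return false;
}
```

For any code c and values a, b, model_eval (c, a, b) equals model_eval (model_swap_condition (c), b, a), which is the property the swapped arm relies on.
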
> diff --git a/gcc/testsuite/gcc.target/i386/pr110591.c 
> b/gcc/testsuite/gcc.target/i386/pr110591.c
> new file mode 100644
> index 000..32a515b429e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110591.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mcmpccxadd -O2" } */
> +/* { dg-final { scan-assembler-not {cmp[lq]?[ \t]+} } } */
> +/* { dg-final { scan-assembler-times {cmpoxadd[ \t]+} 12 } } */
> +
> +#include 
> +
> +_Bool foo_setg (int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) > v;
> +}
> +
> +_Bool foo_setl (int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) < v;
> +}
> +
> +_Bool foo_sete(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) == v;
> +}
> +
> +_Bool foo_setne(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) != v;
> +}
> +
> +_Bool foo_setge(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) >= v;
> +}
> +
> +_Bool foo_setle(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) <= v;
> +}
> +
> +_Bool fooq_setg (long long *ptr, long long v)
> +{
> +return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) 

Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread Hongtao Liu via Gcc-patches
On Tue, Jul 11, 2023 at 12:24 AM Alexander Monakov via Gcc-patches
 wrote:
>
>
> On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote:
>
> > False dependency happens when destination is only updated by
> > pternlog. There is no false dependency when destination is also used
> > in source. So either a pxor should be inserted, or input operand
> > should be set with constraint '0'.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready to push to trunk.
>
> Shouldn't this patch also remove uses of vpternlog in
> standard_sse_constant_opcode?
It's still needed when !optimize_function_for_speed_p (cfun).
>
> A couple more questions below:
>
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
> > ]
> > (symbol_ref "true")))])
> >
> > +; False dependency happens on destination register which is not really
> > +; used when moving all ones to vector register
> > +(define_split
> > +  [(set (match_operand:VMOVE 0 "register_operand")
> > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> > +  "TARGET_AVX512F && reload_completed
> > +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
> > +  && optimize_function_for_speed_p (cfun)"
>
> Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate.
> Doesn't it work here as well?
I just aligned it with the lzcnt/popcnt case. The difference between
optimize_insn_for_speed_p and optimize_function_for_speed_p is that the
former also considers
!crtl->maybe_hot_insn_p, while the latter just returns
!optimize_function_for_size_p (cfun). It looks like
optimize_insn_for_speed_p () is more reasonable for a single insn.

optimize_insn_for_size_p (void)
{
  enum optimize_size_level ret = optimize_function_for_size_p (cfun);
  if (ret < OPTIMIZE_SIZE_BALANCED && !crtl->maybe_hot_insn_p)
    ret = OPTIMIZE_SIZE_BALANCED;
  return ret;
}
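
A minimal standalone model of that difference, as a sketch only. The struct model_state and the model_* functions are hypothetical; the enum ordering (NO < BALANCED < MAX) and the assumption that the insn-level speed predicate is simply the negation of the insn-level size predicate are assumptions made for this illustration, not a copy of GCC's sources.

```c
#include <stdbool.h>

/* Assumed ordering: NO < BALANCED < MAX.  */
enum optimize_size_level
{
  OPTIMIZE_SIZE_NO,
  OPTIMIZE_SIZE_BALANCED,
  OPTIMIZE_SIZE_MAX
};

/* Hypothetical stand-in for the state GCC keeps in cfun and crtl.  */
struct model_state
{
  enum optimize_size_level function_level; /* function-wide size level */
  bool maybe_hot_insn_p;                   /* is the current insn maybe hot? */
};

static enum optimize_size_level
model_function_for_size (const struct model_state *s)
{
  return s->function_level;
}

/* Function-level predicate: ignores the hotness of the current insn.  */
static bool
model_function_for_speed (const struct model_state *s)
{
  return !model_function_for_size (s);
}

/* Insn-level predicate: a cold insn is bumped to at least BALANCED.  */
static enum optimize_size_level
model_insn_for_size (const struct model_state *s)
{
  enum optimize_size_level ret = model_function_for_size (s);
  if (ret < OPTIMIZE_SIZE_BALANCED && !s->maybe_hot_insn_p)
    ret = OPTIMIZE_SIZE_BALANCED;
  return ret;
}

static bool
model_insn_for_speed (const struct model_state *s)
{
  return !model_insn_for_size (s);
}
```

In this model the two predicates disagree only for a cold insn in a function that is otherwise optimized for speed: the function-level check still says "speed", while the insn-level check does not, so a split guarded by the insn-level check would not add the extra pxor there.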

>
> > +  [(set (match_dup 0) (match_dup 2))
> > +   (parallel
> > + [(set (match_dup 0) (match_dup 1))
> > +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> > +  "operands[2] = CONST0_RTX (mode);")
> > +
> > +(define_insn "*vmov_constm1_pternlog_false_dep"
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=v")
> > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
> > ""))
> > +   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
> > UNSPEC_INSN_FALSE_DEP)]
> > +   "TARGET_AVX512VL ||  == 64"
> > +   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
> > +  [(set_attr "type" "sselog1")
> > +   (set_attr "prefix" "evex")])
> > +
> >  ;; If mem_addr points to a memory region with less than whole vector size 
> > bytes
> >  ;; of accessible memory and k is a mask that would prevent reading the 
> > inaccessible
> >  ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be 
> > transformed to vpblendd
> > @@ -9336,7 +9359,7 @@ (define_expand 
> > "_cvtmask2"
> >  operands[3] = CONST0_RTX (mode);
> >}")
> >
> > -(define_insn "*_cvtmask2"
> > +(define_insn_and_split "*_cvtmask2"
> >[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
> >   (vec_merge:VI48_AVX512VL
> > (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> > @@ -9346,11 +9369,35 @@ (define_insn 
> > "*_cvtmask2"
> >"@
> > vpmovm2\t{%1, %0|%0, %1}
> > vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> > %0, %0, 0x81}"
> > +  "&& !TARGET_AVX512DQ && reload_completed
> > +   && optimize_function_for_speed_p (cfun)"
> > +  [(set (match_dup 0) (match_dup 4))
> > +   (parallel
> > +[(set (match_dup 0)
> > +   (vec_merge:VI48_AVX512VL
> > + (match_dup 2)
> > + (match_dup 3)
> > + (match_dup 1)))
> > + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> > +  "operands[4] = CONST0_RTX (mode);"
> >[(set_attr "isa" "avx512dq,*")
> > (set_attr "length_immediate" "0,1")
> > (set_attr "prefix" "evex")
> > (set_attr "mode" "")])
> >
> > +(define_insn "*_cvtmask2_pternlog_false_dep"
> > +  [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
> > + (vec_merge:VI48_AVX512VL
> > +   (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> > +   (match_operand:VI48_AVX512VL 3 "const0_operand")
> > +   (match_operand: 1 "register_operand" "Yk")))
> > +   (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] 
> > UNSPEC_INSN_FALSE_DEP)]
> > +  "TARGET_AVX512F && !TARGET_AVX512DQ"
> > +  "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> > %0, %0, 0x81}"
> > +  [(set_attr "length_immediate" "1")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "")])
> > +
> >  (define_expand "extendv2sfv2df2"
> >[(set (match_operand:V2DF 0 "register_operand")
> >   (float_extend:V2DF
> > @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2"
> >  operands[2] = force_reg (mode, operands[2]);
> >  })
> >
> > -(define_insn "one_cmpl2"
> > -  [(set 

Re: [r14-2314 Regression] FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8 on Linux/x86_64

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 3:50 PM Hongtao Liu  wrote:
>
> On Fri, Jul 7, 2023 at 3:50 PM Jan Beulich  wrote:
> >
> > On 07.07.2023 09:46, Hongtao Liu wrote:
> > > On Fri, Jul 7, 2023 at 3:18 PM Jan Beulich via Gcc-regression
> > >  wrote:
> > >>
> > >> On 06.07.2023 13:57, haochen.jiang wrote:
> > >>> On Linux/x86_64,
> > >>>
> > >>> e007369c8b67bcabd57c4fed8cff2a6db82e78e6 is the first bad commit
> > >>> commit e007369c8b67bcabd57c4fed8cff2a6db82e78e6
> > >>> Author: Jan Beulich 
> > >>> Date:   Wed Jul 5 09:49:16 2023 +0200
> > >>>
> > >>> x86: yet more PR target/100711-like splitting
> > >>>
> > >>> caused
> > >>>
> > >>> FAIL: gcc.target/i386/pr100711-1.c scan-assembler-times pandn 2
> > >>> FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8
> > >>
> > >> I expect the same applies here - -mno-avx512f (or -mno-avx512vl) might
> > > > For this one, we can just add -mno-avx512f to the testcase; it aims to
> > > > optimize pandn for AVX2 targets.
> > >> address this failure. But whether that's really the way to go I'm not
> > >> sure of. Plus of course such adjustments should have been done ahead
> > >> of time, when it was decided that testing with certain -march= settings
> > >> is a goal. My changes have merely uncovered the prior omissions.
> > > > It's not a standard request; it's just our private tester, which is
> > > > used to find GCC bugs and missed optimizations.
> > > > It sometimes generates false-positive reports (adding
> > > > -mno-avx512f to the testcase usually fixes that); I hope that's not too
> > > > annoying.
> >
> > Wouldn't that then better be done once uniformly for all affected tests,
> > rather than being discovered piecemeal?
This also prevents us from finding potential problems.
> >
> > Anyway, in this case: Since you said you'd take care of the other test,
> > will/can you do so for the two ones here as well, or am I on the hook?
> I'll do that.
> >
> > Jan
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [r14-2314 Regression] FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8 on Linux/x86_64

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 3:50 PM Jan Beulich  wrote:
>
> On 07.07.2023 09:46, Hongtao Liu wrote:
> > On Fri, Jul 7, 2023 at 3:18 PM Jan Beulich via Gcc-regression
> >  wrote:
> >>
> >> On 06.07.2023 13:57, haochen.jiang wrote:
> >>> On Linux/x86_64,
> >>>
> >>> e007369c8b67bcabd57c4fed8cff2a6db82e78e6 is the first bad commit
> >>> commit e007369c8b67bcabd57c4fed8cff2a6db82e78e6
> >>> Author: Jan Beulich 
> >>> Date:   Wed Jul 5 09:49:16 2023 +0200
> >>>
> >>> x86: yet more PR target/100711-like splitting
> >>>
> >>> caused
> >>>
> >>> FAIL: gcc.target/i386/pr100711-1.c scan-assembler-times pandn 2
> >>> FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8
> >>
> >> I expect the same applies here - -mno-avx512f (or -mno-avx512vl) might
> > > For this one, we can just add -mno-avx512f to the testcase; it aims to
> > > optimize pandn for AVX2 targets.
> >> address this failure. But whether that's really the way to go I'm not
> >> sure of. Plus of course such adjustments should have been done ahead
> >> of time, when it was decided that testing with certain -march= settings
> >> is a goal. My changes have merely uncovered the prior omissions.
> > > It's not a standard request; it's just our private tester, which is
> > > used to find GCC bugs and missed optimizations.
> > > It sometimes generates false-positive reports (adding
> > > -mno-avx512f to the testcase usually fixes that); I hope that's not too
> > > annoying.
>
> Wouldn't that then better be done once uniformly for all affected tests,
> rather than being discovered piecemeal?
>
> Anyway, in this case: Since you said you'd take care of the other test,
> will/can you do so for the two ones here as well, or am I on the hook?
I'll do that.
>
> Jan



-- 
BR,
Hongtao


Re: [r14-2310 Regression] FAIL: gcc.target/i386/pr53652-1.c scan-assembler-times pandn[ \\t] 2 on Linux/x86_64

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 3:34 PM Jan Beulich  wrote:
>
> On 07.07.2023 09:30, Hongtao Liu wrote:
> > On Fri, Jul 7, 2023 at 3:13 PM Jan Beulich via Gcc-regression
> >  wrote:
> >>
> >> On 06.07.2023 13:57, haochen.jiang wrote:
> >>> On Linux/x86_64,
> >>>
> >>> 2d11c99dfca3cc603dbbfafb3afc41689a68e40f is the first bad commit
> >>> commit 2d11c99dfca3cc603dbbfafb3afc41689a68e40f
> >>> Author: Jan Beulich 
> >>> Date:   Wed Jul 5 09:41:09 2023 +0200
> >>>
> >>> x86: use VPTERNLOG also for certain andnot forms
> >>>
> >>> caused
> >>>
> >>> FAIL: gcc.target/i386/pr53652-1.c scan-assembler-not vpternlogq[ \\t]
> >>
> >> The respective expectation was never valid to add without excluding
> >> cases where -march= overrides (extends) the -msse2 that the test
> >> specifies explicitly. I'm afraid I don't know how to tweak a testcase
> >> to properly deal with that. Perhaps (like iirc was suggested elsewhere)
> >> -mno-avx512f, but honestly this approach feels clumsy to me. Cc-ing
> >> Hongtao, who I think suggested that approach elsewhere.
> >>
> >>> FAIL: gcc.target/i386/pr53652-1.c scan-assembler-times pandn[ \\t] 2
> > > There's a false dependence when using pternlog for the andnot (and other
> > > newly added) patterns; I'm working on a patch to avoid that (PR110438).
> > > Let me handle the test case.
>
> Of course I'm happy to see you handle the testcase, but if you don't
> mind I'm curious towards the connection you see between that false
> dependency issue and the adjustments missing in this (and other)
> testcase(s).
For the sake of simplicity, adding -mno-avx512f should be OK; the
testcase is used to detect optimization on non-AVX512 targets.
I'll add extra testcases to cover the false dependence case.
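
As an illustration of that kind of adjustment, here is a hypothetical testcase sketch. It is not the content of pr53652-1.c; the function name, the option list beyond -msse2 and -mno-avx512f, and the scan pattern are assumptions. The point is only that an explicit -mno-avx512f in dg-options rules out vpternlog even when the tester's target board passes -march=cascadelake.

```c
/* { dg-do compile } */
/* { dg-options "-O2 -msse2 -mno-avx512f" } */
/* vpternlog is an AVX512F instruction, so it must not appear once
   AVX512F is explicitly disabled.  */
/* { dg-final { scan-assembler-not "vpternlog" } } */

/* Hypothetical function: element-wise and-not, a candidate for
   pandn/vpandn after vectorization.  */
void
andnot_u32 (unsigned int *dst, const unsigned int *a,
            const unsigned int *b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = ~a[i] & b[i];
}
```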
>
> Jan



-- 
BR,
Hongtao


Re: [r14-2314 Regression] FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8 on Linux/x86_64

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 3:18 PM Jan Beulich via Gcc-regression
 wrote:
>
> On 06.07.2023 13:57, haochen.jiang wrote:
> > On Linux/x86_64,
> >
> > e007369c8b67bcabd57c4fed8cff2a6db82e78e6 is the first bad commit
> > commit e007369c8b67bcabd57c4fed8cff2a6db82e78e6
> > Author: Jan Beulich 
> > Date:   Wed Jul 5 09:49:16 2023 +0200
> >
> > x86: yet more PR target/100711-like splitting
> >
> > caused
> >
> > FAIL: gcc.target/i386/pr100711-1.c scan-assembler-times pandn 2
> > FAIL: gcc.target/i386/pr100711-2.c scan-assembler-times vpandn 8
>
> I expect the same applies here - -mno-avx512f (or -mno-avx512vl) might
For this one, we can just add -mno-avx512f to the testcase; it aims to
optimize pandn for AVX2 targets.
> address this failure. But whether that's really the way to go I'm not
> sure of. Plus of course such adjustments should have been done ahead
> of time, when it was decided that testing with certain -march= settings
> is a goal. My changes have merely uncovered the prior omissions.
It's not a standard request; it's just our private tester, which is
used to find GCC bugs and missed optimizations.
It sometimes generates false-positive reports (adding
-mno-avx512f to the testcase usually fixes that); I hope that's not too
annoying.
>
> Jan
>
> > with GCC configured with
> >
> > ../../gcc/configure 
> > --prefix=/export/users/haochenj/src/gcc-bisect/master/master/r14-2314/usr 
> > --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
> > --with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet 
> > --without-isl --enable-libmpx x86_64-linux --disable-bootstrap
> >
> > To reproduce:
> >
> > $ cd {build_dir}/gcc && make check 
> > RUNTESTFLAGS="i386.exp=gcc.target/i386/pr100711-1.c 
> > --target_board='unix{-m32\ -march=cascadelake}'"
> > $ cd {build_dir}/gcc && make check 
> > RUNTESTFLAGS="i386.exp=gcc.target/i386/pr100711-2.c 
> > --target_board='unix{-m32\ -march=cascadelake}'"
> >
> > (Please do not reply to this email, for question about this report, contact 
> > me at haochen dot jiang at intel.com)
>


-- 
BR,
Hongtao


Re: [r14-2310 Regression] FAIL: gcc.target/i386/pr53652-1.c scan-assembler-times pandn[ \\t] 2 on Linux/x86_64

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 3:13 PM Jan Beulich via Gcc-regression
 wrote:
>
> On 06.07.2023 13:57, haochen.jiang wrote:
> > On Linux/x86_64,
> >
> > 2d11c99dfca3cc603dbbfafb3afc41689a68e40f is the first bad commit
> > commit 2d11c99dfca3cc603dbbfafb3afc41689a68e40f
> > Author: Jan Beulich 
> > Date:   Wed Jul 5 09:41:09 2023 +0200
> >
> > x86: use VPTERNLOG also for certain andnot forms
> >
> > caused
> >
> > FAIL: gcc.target/i386/pr53652-1.c scan-assembler-not vpternlogq[ \\t]
>
> The respective expectation was never valid to add without excluding
> cases where -march= overrides (extends) the -msse2 that the test
> specifies explicitly. I'm afraid I don't know how to tweak a testcase
> to properly deal with that. Perhaps (like iirc was suggested elsewhere)
> -mno-avx512f, but honestly this approach feels clumsy to me. Cc-ing
> Hongtao, who I think suggested that approach elsewhere.
>
> > FAIL: gcc.target/i386/pr53652-1.c scan-assembler-times pandn[ \\t] 2
There's a false dependence when using pternlog for the andnot (and other
newly added) patterns; I'm working on a patch to avoid that (PR110438).
Let me handle the test case.
>
> Aiui this is merely a knock-on effect.
>
> Jan
>
> > with GCC configured with
> >
> > ../../gcc/configure 
> > --prefix=/export/users/haochenj/src/gcc-bisect/master/master/r14-2310/usr 
> > --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
> > --with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet 
> > --without-isl --enable-libmpx x86_64-linux --disable-bootstrap
> >
> > To reproduce:
> >
> > $ cd {build_dir}/gcc && make check 
> > RUNTESTFLAGS="i386.exp=gcc.target/i386/pr53652-1.c 
> > --target_board='unix{-m32\ -march=cascadelake}'"
> > $ cd {build_dir}/gcc && make check 
> > RUNTESTFLAGS="i386.exp=gcc.target/i386/pr53652-1.c 
> > --target_board='unix{-m64\ -march=cascadelake}'"
> >
> > (Please do not reply to this email, for question about this report, contact 
> > me at haochen dot jiang at intel.com)
>


-- 
BR,
Hongtao


Re: [PATCH] Break false dependence for vpternlog by inserting vpxor.

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 6, 2023 at 11:46 PM  wrote:
>
> > +; False dependency happens on destination register which is not really
> > +; used when moving all ones to vector register
> > +(define_split
> > +  [(set (match_operand:VMOVE 0 "register_operand")
> > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> > +  "TARGET_AVX512F && reload_completed
> > +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))"
> > +  [(set (match_dup 0) (match_dup 2))
> > +   (parallel
> > + [(set (match_dup 0) (match_dup 1))
> > +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> > +  "operands[2] = CONST0_RTX (mode);")
>
> I think we shouldn't emit PXOR when optimizing for size. So the
> define_split should be changed to:
> define_split
>[(set (match_operand:VMOVE 0 "register_operand")
> (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
>"TARGET_AVX512F && reload_completed
>&& ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
>&& optimize_insn_for_speed_p ()"
>[(set (match_dup 0) (match_dup 2))
> (parallel
>   [(set (match_dup 0) (match_dup 1))
>(unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
>"operands[2] = CONST0_RTX (mode);")
Yes, will do. I'm still working on breaking the false dependence for
pternlog in the newly added patterns *iornot3, *xnor3 and
*3.
I'll repost the patch when it's done.



-- 
BR,
Hongtao


Re: [PATCH V2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-07 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 7, 2023 at 2:02 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Fri, Jul 7, 2023 at 7:31 AM liuhongt  wrote:
> >
> > > Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX
> > > and the other emitting UNSPEC_IEEE_MIN.
> > > Split.
> >
> > > The test involves blendv instruction, which is SSE4.1, so it is
> > > pointless to test it without -msse4.1. Please add -msse4.1 instead of
> > > -march=x86_64 and use sse4_runtime target selector, as is the case
> > > with gcc.target/i386/pr90358.c.
> > Changed.
> >
> > > Please also use -msse4.1 instead of -march here. With -mfpmath=sse,
> > > the test is valid also for 32bit targets, you should use -msseregparm
> > > additional options for ia32 (please see gcc.target/i386/pr43546.c
> > > testcase) in the same way as -mregparm to pass SSE arguments in
> > > registers.
> > > The 32-bit target still fails to do condition elimination for DFmode due
> > > to the code below in rtx_cost:
> >
> >   /* A size N times larger than UNITS_PER_WORD likely needs N times as
> >  many insns, taking N times as long.  */
> >   factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
> >
> > > It looks like a separate issue for DFmode operations on 32-bit targets.
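
A worked instance of that factor, as a small sketch. The function cost_factor is a hypothetical wrapper around the quoted expression; the sizes used are the usual ones (SFmode 4 bytes, DFmode 8 bytes; UNITS_PER_WORD 4 for -m32 and 8 for -m64).

```c
/* Hypothetical wrapper around the expression quoted from rtx_cost:
   a mode N times wider than a word is assumed to cost N times as much.  */
static unsigned int
cost_factor (unsigned int mode_size, unsigned int units_per_word)
{
  return mode_size > units_per_word ? mode_size / units_per_word : 1;
}
```

With these values, cost_factor (8, 4) is 2 for DFmode on a 32-bit target, while cost_factor (4, 4) for SFmode and cost_factor (8, 8) for DFmode on a 64-bit target are both 1. That doubling would be consistent with only the DFmode case being affected on 32-bit.
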
> >
> > > I've enabled 32-bit for the testcase, but it currently only scans for
> > > minss/maxss.
> >
> > Here's updated patch.
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > > We have ix86_expand_sse_fp_minmax to detect min/max semantics, but
> > > it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false. For
> > > the testcase in the PR, there's an extra move from cmp_op0 to if_true,
> > > so ix86_expand_sse_fp_minmax fails.
> > >
> > > This patch adds a pre_reload splitter to detect the min/max pattern.
> > >
> > > Operand order in MINSS matters for signed zeros and NaNs, since the
> > > instruction always returns the second operand when either operand is a
> > > NaN or both operands are zero.
> >
> > gcc/ChangeLog:
> >
> > PR target/110170
> > * config/i386/i386.md (*ieee_max3_1): New pre_reload
> > splitter to detect fp max pattern.
> > (*ieee_min3_1): Ditto, but for fp min pattern.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.target/i386/pr110170.C: New test.
> > * gcc.target/i386/pr110170.c: New test.
>
> OK with a testcase fix below.
>
> Uros.
>
> > ---
> >  gcc/config/i386/i386.md  | 43 +
> >  gcc/testsuite/g++.target/i386/pr110170.C | 78 
> >  gcc/testsuite/gcc.target/i386/pr110170.c | 21 +++
> >  3 files changed, 142 insertions(+)
> >  create mode 100644 gcc/testsuite/g++.target/i386/pr110170.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170.c
> >
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index a82cc353cfd..6f415f899ae 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -23163,6 +23163,49 @@ (define_insn "*ieee_s3"
> > (set_attr "type" "sseadd")
> > (set_attr "mode" "")])
> >
> > +;; Operands order in min/max instruction matters for signed zero and NANs.
> > +(define_insn_and_split "*ieee_max3_1"
> > +  [(set (match_operand:MODEF 0 "register_operand")
> > +   (unspec:MODEF
> > + [(match_operand:MODEF 1 "register_operand")
> > +  (match_operand:MODEF 2 "register_operand")
> > +  (lt:MODEF
> > +(match_operand:MODEF 3 "register_operand")
> > +(match_operand:MODEF 4 "register_operand"))]
> > + UNSPEC_BLENDV))]
> > +  "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH
> > +  && (rtx_equal_p (operands[1], operands[3])
> > +  && rtx_equal_p (operands[2], operands[4]))
> > +  && ix86_pre_reload_split ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (match_dup 0)
> > +   (unspec:MODEF
> > + [(match_dup 2)
> > +  (match_dup 1)]
> > +UNSPEC_IEEE_MAX))])
> > +
> > +(define_insn_and_split "*ieee_min3_1"
> > +  [(set (match_operand:MODEF 0 "register_operand")
> > +   (unspec:MODEF
> > + [(match_operand:MODEF 1 "register_operand")
> > +  (match_operand:MODEF 2 "register_operand")
> > +  (lt:MODEF
> > +(match_operand:MODEF 3 "register_operand")
> > +(match_operand:MODEF 4 "register_operand"))]
> > + UNSPEC_BLENDV))]
> > +  "SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH
> > +  && (rtx_equal_p (operands[1], operands[4])
> > +  && rtx_equal_p (operands[2], operands[3]))
> > +  && ix86_pre_reload_split ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (match_dup 0)
> > +   (unspec:MODEF
> > + [(match_dup 2)
> > +  (match_dup 1)]
> > +UNSPEC_IEEE_MIN))])
> > +
> >  ;; Make two stack loads independent:
> >  ;;   fld aa  fld aa
> >  ;;   fld %st(0) ->   fld bb
> > diff --git a/gcc/testsuite/g++.target/i386/pr110170.C 
> > b/gcc/testsuite/g++.target/i386/pr110170.C
> > new file mode 100644
> > index 000..5d6842270d0
> > --- 

Re: [PATCH 2/2] x86: slightly correct / simplify *vec_extractv2ti

2023-07-05 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 5, 2023 at 6:22 PM Hongtao Liu  wrote:
>
> On Wed, Jul 5, 2023 at 5:03 PM Jan Beulich  wrote:
> >
> > On 05.07.2023 10:47, Hongtao Liu wrote:
> > > On Wed, Jul 5, 2023 at 4:01 PM Jan Beulich via Gcc-patches
> > >  wrote:
> > >>
> > >> V2TImode values cannot appear in the upper 16 YMM registers without
> > >> AVX512VL being enabled. Therefore forcing 512-bit mode (also not
> > >> reflected in the "mode" attribute) is pointless.
> > > Please set isa attribute for alternative 1 to avx512vl.
> >
> > Since that looks redundant to me (as per the description), would you
> > mind explaining why that's necessary / wanted? It also feels orthogonal
> > to the change I'm making, as there was no "isa" attribute so far (which
> > would have wanted to be "avx512f" as per what you ask for, prior to the
> > change I'm making). Again me asking back is primarily to properly
> > describe the changes I'm making, of course along with me still needing
> > to properly understand when what attribute needs specifying explicitly.
It's decided by many factors: the instruction's ISA requirement, the possible
register allocation for the alternative, and how
recog_memoized (constrain_operands) decides which_alternative.
For *vec_extractv2ti the alternative is implicitly guarded by
ix86_hard_regno_ok, so there is no need for an explicit isa attribute.
> I checked ix86_hard_regno_ok: TImode/V2TImode will be allocated
> to EVEX SSE registers only under TARGET_AVX512VL; otherwise
> alternative 0 is matched.
> So yes, there is no need to set the isa attribute here. Patch LGTM.
> >
> > Jan
>
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH 2/2] x86: slightly correct / simplify *vec_extractv2ti

2023-07-05 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 5, 2023 at 5:03 PM Jan Beulich  wrote:
>
> On 05.07.2023 10:47, Hongtao Liu wrote:
> > On Wed, Jul 5, 2023 at 4:01 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> V2TImode values cannot appear in the upper 16 YMM registers without
> >> AVX512VL being enabled. Therefore forcing 512-bit mode (also not
> >> reflected in the "mode" attribute) is pointless.
> > Please set isa attribute for alternative 1 to avx512vl.
>
> Since that looks redundant to me (as per the description), would you
> mind explaining why that's necessary / wanted? It also feels orthogonal
> to the change I'm making, as there was no "isa" attribute so far (which
> would have wanted to be "avx512f" as per what you ask for, prior to the
> change I'm making). Again me asking back is primarily to properly
> describe the changes I'm making, of course along with me still needing
> to properly understand when what attribute needs specifying explicitly.
I checked ix86_hard_regno_ok: TImode/V2TImode will be allocated
to EVEX SSE registers only under TARGET_AVX512VL; otherwise
alternative 0 is matched.
So yes, there is no need to set the isa attribute here. Patch LGTM.
>
> Jan




--
BR,
Hongtao


Re: [PATCH 1/2] x86: correct / simplify @vec_extract_hi_ and vec_extract_hi_v32qi

2023-07-05 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 5, 2023 at 4:55 PM Jan Beulich  wrote:
>
> On 05.07.2023 10:40, Hongtao Liu wrote:
> > On Wed, Jul 5, 2023 at 4:00 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> The middle alternative each was unusable without enabling AVX512DQ (in
> >> addition to AVX512VL), which is entirely unrelated here. The last
> >> alternative is usable with AVX512VL only (due to type restrictions on
> >> what may be put in the upper 16 YMM registers), and hence is pointlessly
> >> forcing 512-bit mode (without actually reflecting that in the "mode"
> >> attribute).
> > Ok.
>
> Thanks.
>
> >> ---
> >> Like elsewhere I suspect "prefix_extra" is bogus here and should be
> >> dropped.
> >>
> >> Is "sselog1" actually appropriate here? Extracts are special forms of
> >> moves after all, not logical operations. Even "sseshuf1" would seem to
> >> come closer.
> > > Honestly, I don't know why it's marked as sselog1, but looking at the
> > > code, almost all vec_extract patterns are marked as sselog1; I guess
> > > it originally came from pextr.
> > > I agree that it should be closer to the shuffle instructions.
>
> Yet as said I think these are special forms of moves. To me "shuffle"
> involves more than one element. Yet then I don't really know what
I think if it only extracts from the low part, it's close to a move;
otherwise it's more like a shuffle (shuffling the specific elements to the
low part).
I guess one possible reason it's marked as sselog1 is that, from a port
usage perspective, it's closer to the vector logic instructions?
> the "type" attributes are used for (other than vaguely "for
> scheduling"), and hence whether treating extracts as shuffles would
AFAIK, it's only used by scheduling; I don't know whether there are tools
based on the GCC scheduling model.
> be more appropriate. (IOW I'd be happy to make a patch to convert all
> extracts, but I'd need to know whether the conversion should be to
> "sseshuf", "sseshuf1", or "ssemov". In the former two cases knowing
> the "Why?" would also help, especially for writing a sensible
> description. I also haven't found any explanation towards the
> difference between sse and sse1: The "memory" attribute
> evaluates to "both" for the 1 forms if operand 1 is in memory, yet
> that doesn't seem to fit any of the uses here.)
I think sse1 only has one input operand, but sse may have
two or more.
From an instruction perspective, they're the same type; sse1 was
introduced to avoid a segmentation fault in define_memory_attr, which
checks operands[2] or operands[3].
(Similarly for other attribute default settings.)
>
> Jan




--
BR,
Hongtao

