[committed] i386: Handle CONST_WIDE_INT in output_pic_addr_const [PR111340]

2023-09-11 Thread Uros Bizjak via Gcc-patches
PR target/111340

gcc/ChangeLog:

* config/i386/i386.cc (output_pic_addr_const): Handle CONST_WIDE_INT.
Call output_addr_const for CASE_CONST_SCALAR_INT.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr111340.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}, will
be backported to release branches.
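
For reference, CASE_CONST_SCALAR_INT is the rtl.h convenience macro that
covers both scalar integer constant codes, so the single case below now
handles CONST_WIDE_INT as well as CONST_INT:

/* From gcc/rtl.h.  */
#define CASE_CONST_SCALAR_INT \
   case CONST_INT: \
   case CONST_WIDE_INT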

Uros.
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1cef7ee8f1a..477e6cecc38 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -12344,8 +12344,8 @@ output_pic_addr_const (FILE *file, rtx x, int code)
   assemble_name (asm_out_file, buf);
   break;
 
-case CONST_INT:
-  fprintf (file, HOST_WIDE_INT_PRINT_DEC, INTVAL (x));
+CASE_CONST_SCALAR_INT:
+  output_addr_const (file, x);
   break;
 
 case CONST:
diff --git a/gcc/testsuite/gcc.target/i386/pr111340.c b/gcc/testsuite/gcc.target/i386/pr111340.c
new file mode 100644
index 000..6539ae566c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111340.c
@@ -0,0 +1,9 @@
+/* PR target/111340 */
+/* { dg-do compile { target { fpic && int128 } } } */
+/* { dg-options "-O2 -fpic" } */
+
+void
+bar (void)
+{
+  __asm ("# %0" : : "g" ((((unsigned __int128) 0x123456789abcdef0ULL) << 64) | 0x0fedcba987654321ULL));
+}


Re: [PATCH 01/13] [APX EGPR] middle-end: Add insn argument to base_reg_class

2023-09-07 Thread Uros Bizjak via Gcc-patches
On Wed, Sep 6, 2023 at 9:43 PM Vladimir Makarov  wrote:
>
>
> On 9/1/23 05:07, Hongyu Wang wrote:
> > Uros Bizjak via Gcc-patches wrote on Thu, Aug 31, 2023 at 18:16:
> >> On Thu, Aug 31, 2023 at 10:20 AM Hongyu Wang  wrote:
> >>> From: Kong Lingling 
> >>>
> >>> Current reload infrastructure does not support selective base_reg_class
> >>> for backend insn. Add insn argument to base_reg_class for
> >>> lra/reload usage.
> >> I don't think this is the correct approach. Ideally, a memory
> >> constraint should somehow encode its BASE/INDEX register class.
> >> Instead of passing "insn", simply a different constraint could be used
> >> in the constraint string of the relevant insn.
> > We tried constraint only at the beginning, but then we found the
> > reload infrastructure
> > does not work like that.
> >
> > The BASE/INDEX reg classes are determined before choosing alternatives, in
> > process_address under curr_insn_transform. Process_address creates the mem
> > operand according to the BASE/INDEX reg class. Then, the memory operand
> > constraint check will evaluate the mem op with targetm.legitimate_address_p.
> >
> > If we want to make use of EGPR in base/index we need to either extend 
> > BASE/INDEX
> > reg class in the backend, or, for specific insns, add a target hook to
> > tell reload
> > that the extended reg class with EGPR can be used to construct memory 
> > operand.
> >
> > CC'd Vladimir as git send-email failed to add the recipient.
> >
> >
> I think the approach proposed by the Intel developers is better.  In some
> way we already use such an approach when we pass the memory mode to get
> the base reg class, although we could use different memory constraints
> for different modes when the possible base reg differs for some memory
> modes.
>
> Using special memory constraints can probably be implemented too (I
> understand the attractiveness of such an approach for readability of the
> machine description).  But in my opinion it would require much bigger
> work in IRA/LRA/reload.  It would also significantly slow down RA, as we
> would need to process insn constraints when processing each memory
> operand in many places (e.g. for the calculation of reg classes and
> costs in IRA).  Still, I think with this approach there would be a few
> cases with a higher probability of assigning a hard reg outside the
> specific base reg class, which would result in additional reloads.
>
> So the approach proposed by Intel is OK for me.  Although, if the x86
> maintainers are strongly against this approach and the changes in the
> x86 machine-dependent code, and the Intel developers implement Uros'
> approach, I am ready to review that.  But I still prefer the current
> Intel developers' approach for the reasons I mentioned above.

My proposal above is more or less a wish from a target maintainer's PoV.
Ideally, we would have a bunch of different memory constraints, and a
target hook that returns corresponding BASE/INDEX reg classes.
However, I have no idea about the complexity of the implementation in
the infrastructure part of the compiler.
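
A minimal sketch of what I have in mind (purely hypothetical; the hook
name, its signature and the GENERAL_GPR16 class name are invented here
for illustration only):

/* Hypothetical target hook: return the register class usable as a
   BASE register inside a memory operand that matched constraint CN.  */
static reg_class_t
ix86_constraint_base_reg_class (enum constraint_num cn)
{
  switch (cn)
    {
    case CONSTRAINT_Bt:        /* EGPR-capable memory operand.  */
      return GENERAL_REGS;     /* r0..r31 under APX.  */
    default:
      return GENERAL_GPR16;    /* legacy r0..r15 only.  */
    }
}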

Uros.


Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-04 Thread Uros Bizjak via Gcc-patches
On Mon, Sep 4, 2023 at 2:28 AM Hongtao Liu  wrote:

> > > > > > > I think there should be some constraint which explicitly has all 
> > > > > > > the 32
> > > > > > > GPRs, like there is one for just all 16 GPRs (h), so that 
> > > > > > > regardless of
> > > > > > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline 
> > > > > > > asm wants.
> > > > > > >
> > > > > > > Also, what about the "g" constraint?  Shouldn't there be another 
> > > > > > > for "g"
> > > > > > > without r16..r31?  What about the various other memory
> > > > > > > constraints ("<", "o", ...)?
> > > > > >
> > > > > > I think we should leave all existing constraints as they are, so "r"
> > > > > > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > > > > > introduce "h" to instructions that have the ability to handle EGPR.
> > > > > > This would be somehow similar to the SSE -> AVX512F transition, 
> > > > > > where
> > > > > > we still have "x" for SSE16 and "v" was introduced as a separate
> > > > > > register class for EVEX SSE registers. This way, asm will be
> > > > > > compatible, when "r", "m", "o" and "g" are used. The new memory
> > > > > > constraint "Bt", should allow new registers, and should be added to
> > > > > > the constraint string as a separate constraint, and conditionally
> > > > > > enabled by relevant "isa" (AKA "enabled") attribute.
> > > > >
> > > > > The extended constraint can work for registers, but for memory it is 
> > > > > more
> > > > > complicated.
> > > >
> > > > Yes, unfortunately. The compiler assumes that an unchangeable register
> > > > class is used for BASE/INDEX registers. I have hit this limitation
> > > > when trying to implement memory support for instructions involving
> > > > 8-bit high registers (%ah, %bh, %ch, %dh), which do not support REX
> > > > registers, also inside memory operand. (You can see the "hack" in e.g.
> > > > *extzvqi_mem_rex64" and corresponding peephole2 with the original
> > > > *extzvqi pattern). I am aware that dynamic insn-dependent BASE/INDEX
> > > > register class is the major limitation in the compiler, so perhaps the
> > > > strategy on how to override this limitation should be discussed with
> > > > the register allocator author first. Perhaps adding an insn attribute
> > > > to insn RTX pattern to specify different BASE/INDEX register sets can
> > > > be a better solution than passing insn RTX to the register allocator.
> > > >
> > > > The above idea still does not solve the asm problem on how to select
> > > > correct BASE/INDEX register set for memory operands.
> > > The current approach disables gpr32 for memory operands in asm_operand
> > > by default, but it can be turned on by the option
> > > ix86_apx_inline_asm_use_gpr32 (users need to guarantee the instruction
> > > supports gpr32).
> > > Only ~5% of all instructions don't support gpr32; the reversed approach
> > > would only get more complicated.
> >
> > I'm not referring to the reversed approach; I just want to point out
> > that the same approach as you proposed w.r.t. the memory operand can be
> > achieved using some named insn attribute that would affect BASE/INDEX
> > register class selection. The attribute could default to gpr32 with
> > APX, unless the insn-specific attribute has e.g. the nogpr32 value. See
> > for example how the "enabled" and "preferred_for_*" attributes are used.
> > Perhaps this new attribute can also be applied to separate
> > alternatives.
> Yes, for xop/fma4/3dnow instructions, I think we can use the isa attr like:
> (define_attr "gpr32" "0,1"
>   (cond [(eq_attr "isa" "fma4")
>            (const_string "0")]
>         (const_string "1")))

Just a nit, can the members be named "map0" and "map1"? The code will
then look like:

if (get_attr_gpr32 (insn) == GPR32_MAP0) ...

instead of:

if (get_attr_gpr32 (insn) == GPR32_0) ...

> But still, we need to adjust memory constraints in the pattern.

I guess the gpr32 property is the same for all alternatives of the
insn pattern. In this case, the "m", "g" and "a" constraints could remain
as they are; the final register class would be adjusted (by some target
hook?) based on the value of the gpr32 attribute.

> Ideally, gcc would include encoding information for every instruction
> (i.e. map0/map1), so that we could determine the attribute value of
> gpr32 directly from this information.

I think the right tool for this is the attribute infrastructure of insn
patterns. We can set the default, set a precise value for individual
insns, or calculate the attribute from some other attribute in a quite
flexible way. Other than that, adjusting the BASE/INDEX register class
in the RA pass is the infrastructure change, but perhaps similar to the
one you proposed.
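
For illustration, the RA-side adjustment could look something like this
(a hypothetical sketch only; GENERAL_GPR16 is an assumed class name, and
GPR32_0/GPR32_1 would be the enum values generated from the gpr32
attribute above):

/* Pick the BASE/INDEX register class for addresses inside INSN
   from the insn's gpr32 attribute.  */
static enum reg_class
apx_base_reg_class_for_insn (rtx_insn *insn)
{
  /* Unrecognizable insns, e.g. asms, conservatively get the legacy set.  */
  if (insn == NULL || recog_memoized (insn) < 0)
    return GENERAL_GPR16;

  return (get_attr_gpr32 (insn) == GPR32_1
          ? GENERAL_REGS        /* full r0..r31 set */
          : GENERAL_GPR16);     /* legacy r0..r15 only */
}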

Uros.


Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-01 Thread Uros Bizjak via Gcc-patches
On Fri, Sep 1, 2023 at 12:36 PM Hongtao Liu  wrote:
>
> On Fri, Sep 1, 2023 at 5:38 PM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > On Fri, Sep 1, 2023 at 11:10 AM Hongyu Wang  wrote:
> > >
> > > Uros Bizjak via Gcc-patches wrote on Thu, Aug 31, 2023 at 18:01:
> > > >
> > > > On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches 
> > > > > wrote:
> > > > > > From: Kong Lingling 
> > > > > >
> > > > > > In inline asm, we do not know if the insn can use EGPR, so disable 
> > > > > > EGPR
> > > > > > usage by default from mapping the common reg/mem constraint to 
> > > > > > non-EGPR
> > > > > > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR 
> > > > > > usage
> > > > > > for inline asm.
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> > > > > >   ix86_md_asm_adjust.
> > > > > >   (ix86_md_asm_adjust): When APX EGPR enabled without 
> > > > > > specifying the
> > > > > >   target option, map reg/mem constraints to non-EGPR 
> > > > > > constraints.
> > > > > >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> > > > > > ---
> > > > > >  gcc/config/i386/i386.cc   |  44 +++
> > > > > >  gcc/config/i386/i386.opt  |   5 +
> > > > > >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 
> > > > > > ++
> > > > > >  3 files changed, 156 insertions(+)
> > > > > >  create mode 100644 
> > > > > > gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> > > > > >
> > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > > index d26d9ab0d9d..9460ebbfda4 100644
> > > > > > --- a/gcc/config/i386/i386.cc
> > > > > > +++ b/gcc/config/i386/i386.cc
> > > > > > @@ -17,6 +17,7 @@ You should have received a copy of the GNU 
> > > > > > General Public License
> > > > > >  along with GCC; see the file COPYING3.  If not see
> > > > > >  <http://www.gnu.org/licenses/>.  */
> > > > > >
> > > > > > +#define INCLUDE_STRING
> > > > > >  #define IN_TARGET_CODE 1
> > > > > >
> > > > > >  #include "config.h"
> > > > > > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec<rtx> &outputs, vec<rtx> & /*inputs*/,
> > > > > >bool saw_asm_flag = false;
> > > > > >
> > > > > >start_sequence ();
> > > > > > +  /* TODO: Here we just mapped the general r/m constraints to 
> > > > > > non-EGPR
> > > > > > +   constraints, will eventually map all the usable constraints in 
> > > > > > the future. */
> > > > >
> > > > > I think there should be some constraint which explicitly has all the 
> > > > > 32
> > > > > GPRs, like there is one for just all 16 GPRs (h), so that regardless 
> > > > > of
> > > > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm 
> > > > > wants.
> > > > >
> > > > > Also, what about the "g" constraint?  Shouldn't there be another for 
> > > > > "g"
> > > > > without r16..r31?  What about the various other memory
> > > > > constraints ("<", "o", ...)?
> > > >
> > > > I think we should leave all existing constraints as they are, so "r"
> > > > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > > > introduce "h" to instructions that have the ability to handle EGPR.
> > > > This would be somehow similar to the SSE -> AVX512F transition

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-01 Thread Uros Bizjak via Gcc-patches
On Fri, Sep 1, 2023 at 11:10 AM Hongyu Wang  wrote:
>
> Uros Bizjak via Gcc-patches wrote on Thu, Aug 31, 2023 at 18:01:
> >
> > On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches 
> > > wrote:
> > > > From: Kong Lingling 
> > > >
> > > > In inline asm, we do not know if the insn can use EGPR, so disable EGPR
> > > > usage by default from mapping the common reg/mem constraint to non-EGPR
> > > > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
> > > > for inline asm.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> > > >   ix86_md_asm_adjust.
> > > >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying the
> > > >   target option, map reg/mem constraints to non-EGPR constraints.
> > > >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386.cc   |  44 +++
> > > >  gcc/config/i386/i386.opt  |   5 +
> > > >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 ++
> > > >  3 files changed, 156 insertions(+)
> > > >  create mode 100644 
> > > > gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index d26d9ab0d9d..9460ebbfda4 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General 
> > > > Public License
> > > >  along with GCC; see the file COPYING3.  If not see
> > > >  <http://www.gnu.org/licenses/>.  */
> > > >
> > > > +#define INCLUDE_STRING
> > > >  #define IN_TARGET_CODE 1
> > > >
> > > >  #include "config.h"
> > > > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec<rtx> &outputs, vec<rtx> & /*inputs*/,
> > > >bool saw_asm_flag = false;
> > > >
> > > >start_sequence ();
> > > > +  /* TODO: Here we just mapped the general r/m constraints to non-EGPR
> > > > +   constraints, will eventually map all the usable constraints in the 
> > > > future. */
> > >
> > > I think there should be some constraint which explicitly has all the 32
> > > GPRs, like there is one for just all 16 GPRs (h), so that regardless of
> > > -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm wants.
> > >
> > > Also, what about the "g" constraint?  Shouldn't there be another for "g"
> > > without r16..r31?  What about the various other memory
> > > constraints ("<", "o", ...)?
> >
> > I think we should leave all existing constraints as they are, so "r"
> > covers only GPR16, "m" and "o" to only use GPR16. We can then
> > introduce "h" to instructions that have the ability to handle EGPR.
> > This would be somehow similar to the SSE -> AVX512F transition, where
> > we still have "x" for SSE16 and "v" was introduced as a separate
> > register class for EVEX SSE registers. This way, asm will be
> > compatible, when "r", "m", "o" and "g" are used. The new memory
> > constraint "Bt", should allow new registers, and should be added to
> > the constraint string as a separate constraint, and conditionally
> > enabled by relevant "isa" (AKA "enabled") attribute.
>
> The extended constraint can work for registers, but for memory it is more
> complicated.

Yes, unfortunately. The compiler assumes that an unchangeable register
class is used for BASE/INDEX registers. I have hit this limitation
when trying to implement memory support for instructions involving
8-bit high registers (%ah, %bh, %ch, %dh), which do not support REX
registers, also inside memory operand. (You can see the "hack" in e.g.
*extzvqi_mem_rex64" and corresponding peephole2 with the original
*extzvqi pattern). I am aware that dynamic insn-dependent BASE/INDEX
register

Re: [PATCH 01/13] [APX EGPR] middle-end: Add insn argument to base_reg_class

2023-08-31 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 31, 2023 at 10:20 AM Hongyu Wang  wrote:
>
> From: Kong Lingling 
>
> Current reload infrastructure does not support selective base_reg_class
> for backend insn. Add insn argument to base_reg_class for
> lra/reload usage.

I don't think this is the correct approach. Ideally, a memory
constraint should somehow encode its BASE/INDEX register class.
Instead of passing "insn", simply a different constraint could be used
in the constraint string of the relevant insn.

Uros.
>
> gcc/ChangeLog:
>
> * addresses.h (base_reg_class):  Add insn argument.
> Pass to MODE_CODE_BASE_REG_CLASS.
> (regno_ok_for_base_p_1): Add insn argument.
> Pass to REGNO_MODE_CODE_OK_FOR_BASE_P.
> (regno_ok_for_base_p): Add insn argument and pass it to ok_for_base_p_1.
> * config/avr/avr.h (MODE_CODE_BASE_REG_CLASS): Add insn argument.
> (REGNO_MODE_CODE_OK_FOR_BASE_P): Ditto.
> * config/gcn/gcn.h (MODE_CODE_BASE_REG_CLASS): Ditto.
> (REGNO_MODE_CODE_OK_FOR_BASE_P): Ditto.
> * config/rl78/rl78.h (REGNO_MODE_CODE_OK_FOR_BASE_P): Ditto.
> (MODE_CODE_BASE_REG_CLASS): Ditto.
> * doc/tm.texi: Add insn argument for MODE_CODE_BASE_REG_CLASS
> and REGNO_MODE_CODE_OK_FOR_BASE_P.
> * doc/tm.texi.in: Ditto.
> * lra-constraints.cc (process_address_1): Pass insn to
> base_reg_class.
> (curr_insn_transform): Ditto.
> * reload.cc (find_reloads): Ditto.
> (find_reloads_address): Ditto.
> (find_reloads_address_1): Ditto.
> (find_reloads_subreg_address): Ditto.
> * reload1.cc (maybe_fix_stack_asms): Ditto.
> ---
>  gcc/addresses.h| 15 +--
>  gcc/config/avr/avr.h   |  5 +++--
>  gcc/config/gcn/gcn.h   |  4 ++--
>  gcc/config/rl78/rl78.h |  6 --
>  gcc/doc/tm.texi|  8 ++--
>  gcc/doc/tm.texi.in |  8 ++--
>  gcc/lra-constraints.cc | 15 +--
>  gcc/reload.cc  | 30 ++
>  gcc/reload1.cc |  2 +-
>  9 files changed, 58 insertions(+), 35 deletions(-)
>
> diff --git a/gcc/addresses.h b/gcc/addresses.h
> index 3519c241c6d..08b100cfe6d 100644
> --- a/gcc/addresses.h
> +++ b/gcc/addresses.h
> @@ -28,11 +28,12 @@ inline enum reg_class
>  base_reg_class (machine_mode mode ATTRIBUTE_UNUSED,
> addr_space_t as ATTRIBUTE_UNUSED,
> enum rtx_code outer_code ATTRIBUTE_UNUSED,
> -   enum rtx_code index_code ATTRIBUTE_UNUSED)
> +   enum rtx_code index_code ATTRIBUTE_UNUSED,
> +   rtx_insn *insn ATTRIBUTE_UNUSED = NULL)
>  {
>  #ifdef MODE_CODE_BASE_REG_CLASS
>return MODE_CODE_BASE_REG_CLASS (MACRO_MODE (mode), as, outer_code,
> -  index_code);
> +  index_code, insn);
>  #else
>  #ifdef MODE_BASE_REG_REG_CLASS
>if (index_code == REG)
> @@ -56,11 +57,12 @@ ok_for_base_p_1 (unsigned regno ATTRIBUTE_UNUSED,
>  machine_mode mode ATTRIBUTE_UNUSED,
>  addr_space_t as ATTRIBUTE_UNUSED,
>  enum rtx_code outer_code ATTRIBUTE_UNUSED,
> -enum rtx_code index_code ATTRIBUTE_UNUSED)
> +enum rtx_code index_code ATTRIBUTE_UNUSED,
> +rtx_insn* insn ATTRIBUTE_UNUSED = NULL)
>  {
>  #ifdef REGNO_MODE_CODE_OK_FOR_BASE_P
>return REGNO_MODE_CODE_OK_FOR_BASE_P (regno, MACRO_MODE (mode), as,
> -   outer_code, index_code);
> +   outer_code, index_code, insn);
>  #else
>  #ifdef REGNO_MODE_OK_FOR_REG_BASE_P
>if (index_code == REG)
> @@ -79,12 +81,13 @@ ok_for_base_p_1 (unsigned regno ATTRIBUTE_UNUSED,
>
>  inline bool
>  regno_ok_for_base_p (unsigned regno, machine_mode mode, addr_space_t as,
> -enum rtx_code outer_code, enum rtx_code index_code)
> +enum rtx_code outer_code, enum rtx_code index_code,
> +rtx_insn* insn = NULL)
>  {
>if (regno >= FIRST_PSEUDO_REGISTER && reg_renumber[regno] >= 0)
>  regno = reg_renumber[regno];
>
> -  return ok_for_base_p_1 (regno, mode, as, outer_code, index_code);
> +  return ok_for_base_p_1 (regno, mode, as, outer_code, index_code, insn);
>  }
>
>  #endif /* GCC_ADDRESSES_H */
> diff --git a/gcc/config/avr/avr.h b/gcc/config/avr/avr.h
> index 8e7e00db13b..1d090fe0838 100644
> --- a/gcc/config/avr/avr.h
> +++ b/gcc/config/avr/avr.h
> @@ -280,12 +280,13 @@ enum reg_class {
>
>  #define REGNO_REG_CLASS(R) avr_regno_reg_class(R)
>
> -#define MODE_CODE_BASE_REG_CLASS(mode, as, outer_code, index_code)   \
> +#define MODE_CODE_BASE_REG_CLASS(mode, as, outer_code, index_code, insn)   \
>avr_mode_code_base_reg_class (mode, as, outer_code, index_code)
>
>  #define INDEX_REG_CLASS NO_REGS
>
> -#define REGNO_MODE_CODE_OK_FOR_BASE_P(num, mode, as, outer_code, index_code) 
> \
> +#define REGNO_MODE_CODE_OK_FOR_BASE_P(num, 

Re: [PATCH 09/13] [APX EGPR] Handle legacy insn that only support GPR16 (1/5)

2023-08-31 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 31, 2023 at 10:20 AM Hongyu Wang  wrote:
>
> From: Kong Lingling 
>
> These legacy insns in opcode map0/1 only support GPR16
> and do not have a vex/evex counterpart; directly adjust constraints and
> add the gpr32 attr to the patterns.
>
> insn list:
> 1. xsave/xsave64, xrstor/xrstor64
> 2. xsaves/xsaves64, xrstors/xrstors64
> 3. xsavec/xsavec64
> 4. xsaveopt/xsaveopt64
> 5. fxsave64/fxrstor64

IMO, instructions should be handled with a reversed approach. Add the "h"
constraint (and a memory constraint that can handle EGPR) to
instructions that CAN use EGPR (together with a relevant "enabled"
attribute).  We have had the same approach with the "x" to "v" transition
with SSE registers. If we "forgot" to add "v" to an instruction, it
still worked, but not to its full potential w.r.t. available registers.

Uros.
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (<xsave>): Set attr gpr32 0 and constraint
> Bt.
> (<xsave>_rex64): Likewise.
> (<xrstor>_rex64): Likewise.
> (<xrstor>64): Likewise.
> (fxsave64): Likewise.
> (fxstore64): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> * lib/target-supports.exp: Add apxf check.
> * gcc.target/i386/apx-legacy-insn-check-norex2.c: New test.
> * gcc.target/i386/apx-legacy-insn-check-norex2-asm.c: New assembler 
> test.
> ---
>  gcc/config/i386/i386.md   | 18 +++
>  .../i386/apx-legacy-insn-check-norex2-asm.c   |  5 
>  .../i386/apx-legacy-insn-check-norex2.c   | 30 +++
>  gcc/testsuite/lib/target-supports.exp | 10 +++
>  4 files changed, 57 insertions(+), 6 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/apx-legacy-insn-check-norex2-asm.c
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/apx-legacy-insn-check-norex2.c
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index b9eaea78f00..83ad01b43c1 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -25626,11 +25626,12 @@ (define_insn "fxsave"
>  (symbol_ref "ix86_attr_length_address_default (insn) + 3"))])
>
>  (define_insn "fxsave64"
> -  [(set (match_operand:BLK 0 "memory_operand" "=m")
> +  [(set (match_operand:BLK 0 "memory_operand" "=Bt")
> (unspec_volatile:BLK [(const_int 0)] UNSPECV_FXSAVE64))]
>"TARGET_64BIT && TARGET_FXSR"
>"fxsave64\t%0"
>[(set_attr "type" "other")
> +   (set_attr "gpr32" "0")
> (set_attr "memory" "store")
> (set (attr "length")
>  (symbol_ref "ix86_attr_length_address_default (insn) + 4"))])
> @@ -25646,11 +25647,12 @@ (define_insn "fxrstor"
>  (symbol_ref "ix86_attr_length_address_default (insn) + 3"))])
>
>  (define_insn "fxrstor64"
> -  [(unspec_volatile [(match_operand:BLK 0 "memory_operand" "m")]
> +  [(unspec_volatile [(match_operand:BLK 0 "memory_operand" "Bt")]
> UNSPECV_FXRSTOR64)]
>"TARGET_64BIT && TARGET_FXSR"
>"fxrstor64\t%0"
>[(set_attr "type" "other")
> +   (set_attr "gpr32" "0")
> (set_attr "memory" "load")
> (set (attr "length")
>  (symbol_ref "ix86_attr_length_address_default (insn) + 4"))])
> @@ -25704,7 +25706,7 @@ (define_insn "<xsave>"
>  (symbol_ref "ix86_attr_length_address_default (insn) + 3"))])
>
>  (define_insn "_rex64"
> -  [(set (match_operand:BLK 0 "memory_operand" "=m")
> +  [(set (match_operand:BLK 0 "memory_operand" "=Bt")
> (unspec_volatile:BLK
>  [(match_operand:SI 1 "register_operand" "a")
>   (match_operand:SI 2 "register_operand" "d")]
> @@ -25713,11 +25715,12 @@ (define_insn "<xsave>_rex64"
>   "<xsave>\t%0"
>[(set_attr "type" "other")
> (set_attr "memory" "store")
> +   (set_attr "gpr32" "0")
> (set (attr "length")
>  (symbol_ref "ix86_attr_length_address_default (insn) + 3"))])
>
>  (define_insn ""
> -  [(set (match_operand:BLK 0 "memory_operand" "=m")
> +  [(set (match_operand:BLK 0 "memory_operand" "=Bt")
> (unspec_volatile:BLK
>  [(match_operand:SI 1 "register_operand" "a")
>   (match_operand:SI 2 "register_operand" "d")]
> @@ -25726,6 +25729,7 @@ (define_insn "<xsave>64"
>   "<xsave>64\t%0"
>[(set_attr "type" "other")
> (set_attr "memory" "store")
> +   (set_attr "gpr32" "0")
> (set (attr "length")
>  (symbol_ref "ix86_attr_length_address_default (insn) + 4"))])
>
> @@ -25743,7 +25747,7 @@ (define_insn ""
>
>  (define_insn "_rex64"
> [(unspec_volatile:BLK
> - [(match_operand:BLK 0 "memory_operand" "m")
> + [(match_operand:BLK 0 "memory_operand" "Bt")
>(match_operand:SI 1 "register_operand" "a")
>(match_operand:SI 2 "register_operand" "d")]
>   ANY_XRSTOR)]
> @@ -25751,12 +25755,13 @@ (define_insn "<xrstor>_rex64"
>   "<xrstor>\t%0"
>[(set_attr "type" "other")
> (set_attr "memory" "load")
> +   (set_attr "gpr32" "0")
> (set (attr "length")
>  (symbol_ref "ix86_attr_length_address_default (insn) + 3"))])
>
>  (define_insn "64"
> [(unspec_volatile:BLK
> - [(match_operand:BLK 0 

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-08-31 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches wrote:
> > From: Kong Lingling 
> >
> > In inline asm, we do not know if the insn can use EGPR, so disable EGPR
> > usage by default from mapping the common reg/mem constraint to non-EGPR
> > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
> > for inline asm.
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
> >   ix86_md_asm_adjust.
> >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying the
> >   target option, map reg/mem constraints to non-EGPR constraints.
> >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
> > ---
> >  gcc/config/i386/i386.cc   |  44 +++
> >  gcc/config/i386/i386.opt  |   5 +
> >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 ++
> >  3 files changed, 156 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d26d9ab0d9d..9460ebbfda4 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General Public 
> > License
> >  along with GCC; see the file COPYING3.  If not see
> > <http://www.gnu.org/licenses/>.  */
> >
> > +#define INCLUDE_STRING
> >  #define IN_TARGET_CODE 1
> >
> >  #include "config.h"
> > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec<rtx> &outputs, vec<rtx> & /*inputs*/,
> >bool saw_asm_flag = false;
> >
> >start_sequence ();
> > +  /* TODO: Here we just mapped the general r/m constraints to non-EGPR
> > +   constraints, will eventually map all the usable constraints in the 
> > future. */
>
> I think there should be some constraint which explicitly has all the 32
> GPRs, like there is one for just all 16 GPRs (h), so that regardless of
> -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm wants.
>
> Also, what about the "g" constraint?  Shouldn't there be another for "g"
> without r16..r31?  What about the various other memory
> constraints ("<", "o", ...)?

I think we should leave all existing constraints as they are, so "r"
covers only GPR16, "m" and "o" to only use GPR16. We can then
introduce "h" to instructions that have the ability to handle EGPR.
This would be somehow similar to the SSE -> AVX512F transition, where
we still have "x" for SSE16 and "v" was introduced as a separate
register class for EVEX SSE registers. This way, asm will be
compatible, when "r", "m", "o" and "g" are used. The new memory
constraint "Bt", should allow new registers, and should be added to
the constraint string as a separate constraint, and conditionally
enabled by relevant "isa" (AKA "enabled") attribute.

Uros.

> > +  if (TARGET_APX_EGPR && !ix86_apx_inline_asm_use_gpr32)
> > +{
> > +  /* Map "r" constraint in inline asm to "h" that disallows r16-r31
> > +  and replace only r, exclude Br and Yr.  */
> > +  for (unsigned i = 0; i < constraints.length (); i++)
> > + {
> > +   std::string *s = new std::string (constraints[i]);
>
> Doesn't this leak memory (all the time)?
> I must say I don't really understand why you need to use std::string here,
> but certainly it shouldn't leak.
>
> > +   size_t pos = s->find ('r');
> > +   while (pos != std::string::npos)
> > + {
> > +   if (pos > 0
> > +   && (s->at (pos - 1) == 'Y' || s->at (pos - 1) == 'B'))
> > + pos = s->find ('r', pos + 1);
> > +   else
> > + {
> > +   s->replace (pos, 1, "h");
> > +   constraints[i] = (const char*) s->c_str ();
>
> Formatting (space before *).  The usual way for constraints is ggc_strdup on
> some string in a buffer.  Also, one could have several copies of r (or m,
> memory (doesn't
> that appear just in clobbers?  And that doesn't look like something that
> should be replaced), Bm, e.g. in various alternatives.  So, you
> need to change them all, not just the first hit.  "r,r,r,m" and the like.
> Normally, one would simply walk the constraint string, parsing the special
> letters (+, =, & etc.) and single letter constraints and 2 letter
> constraints using CONSTRAINT_LEN macro (tons of examples in GCC sources).
> Either do it in 2 passes, first one counts how long constraint string one
> will need after the adjustments (and whether to adjust something at all),
> then if needed XALLOCAVEC it and adjust in there, or say use a
> auto_vec for
> it.
>
> > +   break;
> > + }
> > + }
> > + }
> > +  /* Also map "m/memory/Bm" constraint that may use GPR32, replace 
> > them with
> > +  "Bt/Bt/BT".  */
> > +  for (unsigned i = 

[PATCH] fortran: Rename TRUE/FALSE to true/false in *.cc files

2023-08-25 Thread Uros Bizjak via Gcc-patches
gcc/fortran/ChangeLog:

* match.cc (gfc_match_equivalence): Rename TRUE/FALSE to true/false.
* module.cc (check_access): Ditto.
* primary.cc (match_real_constant): Ditto.
* trans-array.cc (gfc_trans_allocate_array_storage): Ditto.
(get_array_ctor_strlen): Ditto.
* trans-common.cc (find_equivalence): Ditto.
(add_equivalences): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

OK for master?

Uros.
diff --git a/gcc/fortran/match.cc b/gcc/fortran/match.cc
index ba23bcd9692..c926f38058f 100644
--- a/gcc/fortran/match.cc
+++ b/gcc/fortran/match.cc
@@ -5788,7 +5788,7 @@ gfc_match_equivalence (void)
goto syntax;
 
   set = eq;
-  common_flag = FALSE;
+  common_flag = false;
   cnt = 0;
 
   for (;;)
@@ -5829,7 +5829,7 @@ gfc_match_equivalence (void)
 
  if (sym->attr.in_common)
{
- common_flag = TRUE;
+ common_flag = true;
  common_head = sym->common_head;
}
 
diff --git a/gcc/fortran/module.cc b/gcc/fortran/module.cc
index 95fdda6b2aa..c07e9dc9ba2 100644
--- a/gcc/fortran/module.cc
+++ b/gcc/fortran/module.cc
@@ -5744,9 +5744,9 @@ check_access (gfc_access specific_access, gfc_access 
default_access)
 return true;
 
   if (specific_access == ACCESS_PUBLIC)
-return TRUE;
+return true;
   if (specific_access == ACCESS_PRIVATE)
-return FALSE;
+return false;
 
   if (flag_module_private)
 return default_access == ACCESS_PUBLIC;
diff --git a/gcc/fortran/primary.cc b/gcc/fortran/primary.cc
index 0bb440b85a9..d3aeeb89362 100644
--- a/gcc/fortran/primary.cc
+++ b/gcc/fortran/primary.cc
@@ -530,13 +530,13 @@ match_real_constant (gfc_expr **result, int signflag)
   seen_dp = 0;
   seen_digits = 0;
   exp_char = ' ';
-  negate = FALSE;
+  negate = false;
 
   c = gfc_next_ascii_char ();
   if (signflag && (c == '+' || c == '-'))
 {
   if (c == '-')
-   negate = TRUE;
+   negate = true;
 
   gfc_gobble_whitespace ();
   c = gfc_next_ascii_char ();
diff --git a/gcc/fortran/trans-array.cc b/gcc/fortran/trans-array.cc
index 951cecfa5d5..90a7d4e9aef 100644
--- a/gcc/fortran/trans-array.cc
+++ b/gcc/fortran/trans-array.cc
@@ -1121,7 +1121,7 @@ gfc_trans_allocate_array_storage (stmtblock_t * pre, 
stmtblock_t * post,
 {
   /* A callee allocated array.  */
   gfc_conv_descriptor_data_set (pre, desc, null_pointer_node);
-  onstack = FALSE;
+  onstack = false;
 }
   else
 {
@@ -2481,7 +2481,7 @@ get_array_ctor_strlen (stmtblock_t *block, 
gfc_constructor_base base, tree * len
   gfc_constructor *c;
   bool is_const;
 
-  is_const = TRUE;
+  is_const = true;
 
   if (gfc_constructor_first (base) == NULL)
 {
diff --git a/gcc/fortran/trans-common.cc b/gcc/fortran/trans-common.cc
index c83b6f930eb..91a98b30b8d 100644
--- a/gcc/fortran/trans-common.cc
+++ b/gcc/fortran/trans-common.cc
@@ -1048,7 +1048,7 @@ find_equivalence (segment_info *n)
   gfc_equiv *e1, *e2, *eq;
   bool found;
 
-  found = FALSE;
+  found = false;
 
   for (e1 = n->sym->ns->equiv; e1; e1 = e1->next)
 {
@@ -1083,7 +1083,7 @@ find_equivalence (segment_info *n)
{
  add_condition (n, eq, e2);
  e2->used = 1;
- found = TRUE;
+ found = true;
}
}
 }
@@ -1102,11 +1102,11 @@ static void
 add_equivalences (bool *saw_equiv)
 {
   segment_info *f;
-  bool more = TRUE;
+  bool more = true;
 
   while (more)
 {
-  more = FALSE;
+  more = false;
   for (f = current_segment; f; f = f->next)
{
  if (!f->sym->equiv_built)


[committed] treewide: Rename TRUE/FALSE to true/false in *.cc files

2023-08-25 Thread Uros Bizjak via Gcc-patches
gcc/c-family/ChangeLog:

* c-format.cc (read_any_format_width):
Rename TRUE/FALSE to true/false.

gcc/ChangeLog:

* caller-save.cc (new_saved_hard_reg):
Rename TRUE/FALSE to true/false.
(setup_save_areas): Ditto.
* gcc.cc (set_collect_gcc_options): Ditto.
(driver::build_multilib_strings): Ditto.
(print_multilib_info): Ditto.
* genautomata.cc (gen_cpu_unit): Ditto.
(gen_query_cpu_unit): Ditto.
(gen_bypass): Ditto.
(gen_excl_set): Ditto.
(gen_presence_absence_set): Ditto.
(gen_presence_set): Ditto.
(gen_final_presence_set): Ditto.
(gen_absence_set): Ditto.
(gen_final_absence_set): Ditto.
(gen_automaton): Ditto.
(gen_regexp_repeat): Ditto.
(gen_regexp_allof): Ditto.
(gen_regexp_oneof): Ditto.
(gen_regexp_sequence): Ditto.
(process_decls): Ditto.
(reserv_sets_are_intersected): Ditto.
(initiate_excl_sets): Ditto.
(form_reserv_sets_list): Ditto.
(check_presence_pattern_sets): Ditto.
(check_absence_pattern_sets): Ditto.
(check_regexp_units_distribution): Ditto.
(check_unit_distributions_to_automata): Ditto.
(create_ainsns): Ditto.
(output_insn_code_cases): Ditto.
(output_internal_dead_lock_func): Ditto.
(form_important_insn_automata_lists): Ditto.
* gengtype-state.cc (read_state_files_list): Ditto.
* gengtype.cc (main): Ditto.
* gimple-array-bounds.cc (array_bounds_checker::check_array_bounds):
Ditto.
* gimple.cc (gimple_build_call_from_tree): Ditto.
(preprocess_case_label_vec_for_gimple): Ditto.
* gimplify.cc (gimplify_call_expr): Ditto.
* ordered-hash-map-tests.cc (test_map_of_int_to_strings): Ditto.

gcc/cp/ChangeLog:

* call.cc (build_conditional_expr):
Rename TRUE/FALSE to true/false.
(build_new_op): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/c-family/c-format.cc b/gcc/c-family/c-format.cc
index b3ef2d44ce9..b9906ecc171 100644
--- a/gcc/c-family/c-format.cc
+++ b/gcc/c-family/c-format.cc
@@ -2325,13 +2325,13 @@ read_any_format_width (tree ,
 {
   /* Possibly read a numeric width.  If the width is zero,
 we complain if appropriate.  */
-  int non_zero_width_char = FALSE;
-  int found_width = FALSE;
+  int non_zero_width_char = false;
+  int found_width = false;
   while (ISDIGIT (*format_chars))
{
- found_width = TRUE;
+ found_width = true;
  if (*format_chars != '0')
-   non_zero_width_char = TRUE;
+   non_zero_width_char = true;
  ++format_chars;
}
   if (found_width && !non_zero_width_char &&
diff --git a/gcc/caller-save.cc b/gcc/caller-save.cc
index b8915dab128..9ddf9b70b21 100644
--- a/gcc/caller-save.cc
+++ b/gcc/caller-save.cc
@@ -342,7 +342,7 @@ new_saved_hard_reg (int regno, int call_freq)
   saved_reg->num = saved_regs_num++;
   saved_reg->hard_regno = regno;
   saved_reg->call_freq = call_freq;
-  saved_reg->first_p = FALSE;
+  saved_reg->first_p = false;
   saved_reg->next = -1;
 }
 
@@ -558,7 +558,7 @@ setup_save_areas (void)
+ saved_reg2->num]
  = saved_reg_conflicts[saved_reg2->num * saved_regs_num
+ saved_reg->num]
- = TRUE;
+ = true;
  }
}
}
@@ -608,7 +608,7 @@ setup_save_areas (void)
}
  if (j == i)
{
- saved_reg->first_p = TRUE;
+ saved_reg->first_p = true;
  for (best_slot_num = -1, j = 0; j < prev_save_slots_num; j++)
{
  slot = prev_save_slots[j];
diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index 673ec91d60e..23e458d3252 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -6058,7 +6058,7 @@ build_conditional_expr (const op_location_t &loc,
   if (complain & tf_error)
 {
   auto_diagnostic_group d;
-  op_error (loc, COND_EXPR, NOP_EXPR, arg1, arg2, arg3, FALSE);
+ op_error (loc, COND_EXPR, NOP_EXPR, arg1, arg2, arg3, false);
   print_z_candidates (loc, candidates);
 }
  return error_mark_node;
@@ -7129,7 +7129,7 @@ build_new_op (const op_location_t &loc, enum tree_code
code, int flags,
/* ... Otherwise, report the more generic
   "no matching operator found" error */
auto_diagnostic_group d;
-   op_error (loc, code, code2, arg1, arg2, arg3, FALSE);
+   op_error (loc, code, code2, arg1, arg2, arg3, false);
print_z_candidates (loc, candidates);
  }
}
@@ -7145,7 +7145,7 @@ build_new_op (const op_location_t &loc, enum tree_code
code, int flags,
  if (complain & tf_error)
{
  auto_diagnostic_group d;
- op_error (loc, code, code2, 

[committed] i386: Optimize pinsrq of 0 with index 1 into movq [PR94866]

2023-08-24 Thread Uros Bizjak via Gcc-patches
Add a new pattern involving a vec_merge RTX that is produced by combine from
the combination of sse4_1_pinsrq and *movdi_internal:

7: r86:DI=0
8: r85:V2DI=vec_merge(vec_duplicate(r86:DI),r87:V2DI,0x2)
  REG_DEAD r87:V2DI
  REG_DEAD r86:DI
Successfully matched this instruction:
(set (reg:V2DI 85 [ a ])
(vec_merge:V2DI (reg:V2DI 87)
(const_vector:V2DI [
(const_int 0 [0]) repeated x2
])
(const_int 1 [0x1])))

PR target/94866

gcc/ChangeLog:

* config/i386/sse.md (*sse2_movq128_<mode>_1): New insn pattern.

gcc/testsuite/ChangeLog:

* g++.target/i386/pr94866.C: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index da85223a9b4..52104f8d1c9 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1770,6 +1770,18 @@ (define_insn "*sse2_movq128_<mode>"
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "TI")])
 
+(define_insn "*sse2_movq128_<mode>_1"
+  [(set (match_operand:VI8F_128 0 "register_operand" "=v")
+        (vec_merge:VI8F_128
+          (match_operand:VI8F_128 1 "nonimmediate_operand" "vm")
+          (match_operand:VI8F_128 2 "const0_operand")
+          (const_int 1)))]
+  "TARGET_SSE2"
+  "%vmovq\t{%1, %0|%0, %q1}"
+  [(set_attr "type" "ssemov")
+   (set_attr "prefix" "maybe_vex")
+   (set_attr "mode" "TI")])
+
 ;; Move a DI from a 32-bit register pair (e.g. %edx:%eax) to an xmm.
 ;; We'd rather avoid this entirely; if the 32-bit reg pair was loaded
 ;; from memory, we'd prefer to load the memory directly into the %xmm
diff --git a/gcc/testsuite/g++.target/i386/pr94866.C b/gcc/testsuite/g++.target/i386/pr94866.C
new file mode 100644
index 000..eb0f5ef11c5
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr94866.C
@@ -0,0 +1,13 @@
+// PR target/94866
+// { dg-do compile }
+// { dg-options "-O2 -msse4.1" }
+// { dg-require-effective-target c++11 }
+
+typedef long long v2di __attribute__((vector_size(16)));
+
+v2di _mm_move_epi64(v2di a)
+{
+return v2di{a[0], 0LL};
+}
+
+// { dg-final { scan-assembler-times "movq\[ \\t\]+\[^\n\]*%xmm" 1 } }


Re: [PATCH 6/12] i386: Enable _BitInt on x86-64 [PR102989]

2023-08-23 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 9, 2023 at 8:19 PM Jakub Jelinek  wrote:
>
> Hi!
>
> The following patch enables _BitInt support on x86-64, the only
> target which has _BitInt specified in psABI.
>
> 2023-08-09  Jakub Jelinek  
>
> PR c/102989
> * config/i386/i386.cc (classify_argument): Handle BITINT_TYPE.
> (ix86_bitint_type_info): New function.
> (TARGET_C_BITINT_TYPE_INFO): Redefine.

LGTM, with a nit.

Thanks,
Uros.

>
> --- gcc/config/i386/i386.cc.jj  2023-08-08 15:55:05.627176766 +0200
> +++ gcc/config/i386/i386.cc 2023-08-08 16:12:02.308940091 +0200
> @@ -2121,7 +2121,8 @@ classify_argument (machine_mode mode, co
> return 0;
>  }
>
> -  if (type && AGGREGATE_TYPE_P (type))
> +  if (type && (AGGREGATE_TYPE_P (type)
> +  || (TREE_CODE (type) == BITINT_TYPE && words > 1)))
>  {
>int i;
>tree field;
> @@ -2270,6 +2271,14 @@ classify_argument (machine_mode mode, co
> }
>   break;
>
> +   case BITINT_TYPE:
> + /* _BitInt(N) for N > 64 is passed as structure containing
> +(N + 63) / 64 64-bit elements.  */
> + if (words > 2)
> +   return 0;
> + classes[0] = classes[1] = X86_64_INTEGER_CLASS;
> + return 2;
> +
> default:
>   gcc_unreachable ();
> }
> @@ -24842,6 +24851,26 @@ ix86_get_excess_precision (enum excess_p
>return FLT_EVAL_METHOD_UNPREDICTABLE;
>  }
>
> +/* Return true if _BitInt(N) is supported and fill details about it into
> +   *INFO.  */

The above comment should fit into one line.

> +bool
> +ix86_bitint_type_info (int n, struct bitint_info *info)
> +{
> +  if (!TARGET_64BIT)
> +return false;
> +  if (n <= 8)
> +info->limb_mode = QImode;
> +  else if (n <= 16)
> +info->limb_mode = HImode;
> +  else if (n <= 32)
> +info->limb_mode = SImode;
> +  else
> +info->limb_mode = DImode;
> +  info->big_endian = false;
> +  info->extended = false;
> +  return true;
> +}
> +
>  /* Implement PUSH_ROUNDING.  On 386, we have pushw instruction that
> decrements by exactly 2 no matter what the position was, there is no 
> pushb.
>
> @@ -25446,6 +25475,8 @@ ix86_run_selftests (void)
>
>  #undef TARGET_C_EXCESS_PRECISION
>  #define TARGET_C_EXCESS_PRECISION ix86_get_excess_precision
> +#undef TARGET_C_BITINT_TYPE_INFO
> +#define TARGET_C_BITINT_TYPE_INFO ix86_bitint_type_info
>  #undef TARGET_PROMOTE_PROTOTYPES
>  #define TARGET_PROMOTE_PROTOTYPES hook_bool_const_tree_true
>  #undef TARGET_PUSH_ARGUMENT
>
> Jakub
>
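
To make the limb mapping concrete (a hypothetical user-level example,
not part of the patch): _BitInt(100) selects DImode limbs, so a value
is stored as two little-endian 64-bit limbs and, per the
classify_argument hunk above, passed like a structure of two
INTEGER-class eightbytes.

/* Hypothetical example; compile with a C23-enabled gcc on x86-64.  */
_BitInt(100)
add100 (_BitInt(100) a, _BitInt(100) b)
{
  return a + b;
}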


[committed] i386: Fix register spill failure with concat RTX [PR111010]

2023-08-23 Thread Uros Bizjak via Gcc-patches
Disable the (=&r,m,m) alternative for 32-bit targets.  The combination of two
memory operands (possibly with a complex addressing mode), an early clobbered
output, and the frame pointer and PIC registers uses too many registers on
a register constrained 32-bit target.  (With %esp unavailable and %ebp and
%ebx potentially taken by the frame pointer and the PIC register, as few as
five allocatable GPRs can remain, while the two addresses and the early
clobbered double-register output alone may need six.)

Also merge two similar patterns using the DWIH mode iterator.

PR target/111010

gcc/ChangeLog:

* config/i386/i386.md (*concat<mode><dwi>3_3):
Merge pattern from *concatditi3_3 and *concatsidi3_3 using
DWIH mode iterator.  Disable (=&r,m,m) alternative for
32-bit targets.
(*concat<mode><dwi>3_4): Disable (=&r,m,m)
alternative for 32-bit targets.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Also regtested by Rainer on i386-pc-solaris2.11 where the patch fixes
the failure.

(I didn't find a nice testcase; the failure is very sensitive to
perturbations in the code.)

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 108f4af8552..50794ed7bed 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -12435,17 +12435,16 @@ (define_insn_and_split "*concat<mode><dwi>3_2"
   DONE;
 })
 
-(define_insn_and_split "*concatditi3_3"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=ro,r,r,&r,x")
-        (any_or_plus:TI
-          (ashift:TI
-            (zero_extend:TI
-              (match_operand:DI 1 "nonimmediate_operand" "r,m,r,m,x"))
+(define_insn_and_split "*concat<mode><dwi>3_3"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r,r,&r,x")
+        (any_or_plus:<DWI>
+          (ashift:<DWI>
+            (zero_extend:<DWI>
+              (match_operand:DWIH 1 "nonimmediate_operand" "r,m,r,m,x"))
             (match_operand:QI 2 "const_int_operand"))
-          (zero_extend:TI
-            (match_operand:DI 3 "nonimmediate_operand" "r,r,m,m,0"))))]
-  "TARGET_64BIT
-   && INTVAL (operands[2]) == 64"
+          (zero_extend:<DWI>
+            (match_operand:DWIH 3 "nonimmediate_operand" "r,r,m,m,0"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT"
   "#"
   "&& reload_completed"
   [(const_int 0)]
@@ -12456,28 +12455,10 @@ (define_insn_and_split "*concatditi3_3"
       emit_insn (gen_vec_concatv2di (tmp, operands[3], operands[1]));
     }
   else
-    split_double_concat (TImode, operands[0], operands[3], operands[1]);
-  DONE;
-})
-
-(define_insn_and_split "*concatsidi3_3"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r,r,&r")
-        (any_or_plus:DI
-          (ashift:DI
-            (zero_extend:DI
-              (match_operand:SI 1 "nonimmediate_operand" "r,m,r,m"))
-            (match_operand:QI 2 "const_int_operand"))
-          (zero_extend:DI
-            (match_operand:SI 3 "nonimmediate_operand" "r,r,m,m"))))]
-  "!TARGET_64BIT
-   && INTVAL (operands[2]) == 32"
-  "#"
-  "&& reload_completed"
-  [(const_int 0)]
-{
-  split_double_concat (DImode, operands[0], operands[3], operands[1]);
+    split_double_concat (<DWI>mode, operands[0], operands[3], operands[1]);
   DONE;
-})
+}
+  [(set_attr "isa" "*,*,*,x64,x64")])
 
 (define_insn_and_split "*concat<mode><dwi>3_4"
   [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r,r,&r")
@@ -12495,7 +12476,8 @@ (define_insn_and_split "*concat<mode><dwi>3_4"
 {
   split_double_concat (<DWI>mode, operands[0], operands[1], operands[2]);
   DONE;
-})
+}
+  [(set_attr "isa" "*,*,*,x64")])
 
 (define_insn_and_split "*concat<half><mode>3_5"
   [(set (match_operand:DWI 0 "nonimmediate_operand" "=r,o,o")


[committed] i386: Micro-optimize ix86_expand_sse_extend

2023-08-20 Thread Uros Bizjak via Gcc-patches
The partial vector SRC is already forced to a register as ops[1], so we can
use it instead of SRC in the call to ix86_expand_sse_cmp.  This change
avoids forcing operands[1] to a register in the sign/zero-extend expanders.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_sse_extend): Use ops[1]
instead of src in the call to ix86_expand_sse_cmp.
* config/i386/sse.md (<insn>v8qiv8hi2): Do not
force operands[1] to a register.
(<insn>v4hiv4si2): Ditto.
(<insn>v2siv2di2): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 460d496ef22..031e2f72d15 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -5667,7 +5667,7 @@ ix86_expand_sse_extend (rtx dest, rtx src, bool 
unsigned_p)
 ops[2] = force_reg (imode, CONST0_RTX (imode));
   else
 ops[2] = ix86_expand_sse_cmp (gen_reg_rtx (imode), GT, CONST0_RTX (imode),
- src, pc_rtx, pc_rtx);
+ ops[1], pc_rtx, pc_rtx);
 
   ix86_split_mmx_punpck (ops, false);
   emit_move_insn (dest, lowpart_subreg (GET_MODE (dest), ops[0], imode));
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 87c3bf07020..da85223a9b4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -22923,8 +22923,7 @@ (define_expand "<insn>v8qiv8hi2"
 {
   if (!TARGET_SSE4_1)
     {
-      rtx op1 = force_reg (V8QImode, operands[1]);
-      ix86_expand_sse_extend (operands[0], op1, <u_bool>);
+      ix86_expand_sse_extend (operands[0], operands[1], <u_bool>);
       DONE;
     }
 
@@ -23240,8 +23239,7 @@ (define_expand "<insn>v4hiv4si2"
 {
   if (!TARGET_SSE4_1)
     {
-      rtx op1 = force_reg (V4HImode, operands[1]);
-      ix86_expand_sse_extend (operands[0], op1, <u_bool>);
+      ix86_expand_sse_extend (operands[0], operands[1], <u_bool>);
       DONE;
     }
 
@@ -23846,8 +23844,7 @@ (define_expand "<insn>v2siv2di2"
 {
   if (!TARGET_SSE4_1)
     {
-      rtx op1 = force_reg (V2SImode, operands[1]);
-      ix86_expand_sse_extend (operands[0], op1, <u_bool>);
+      ix86_expand_sse_extend (operands[0], operands[1], <u_bool>);
       DONE;
     }
 


[committed]: i386: Use PUNPCKL?? to implement vector extend and zero_extend for TARGET_SSE2 [PR111023]

2023-08-18 Thread Uros Bizjak via Gcc-patches
Implement vector extend and zero_extend functionality for TARGET_SSE2 using
the PUNPCKL?? family of instructions. The code for e.g. zero-extend from
V2SImode to V2DImode improves from:

movd%xmm0, %edx
pshufd  $85, %xmm0, %xmm0
movd%xmm0, %eax
movq%rdx, (%rdi)
movq%rax, 8(%rdi)

to:
pxor%xmm1, %xmm1
punpckldq   %xmm1, %xmm0
movaps  %xmm0, (%rdi)

And the code for sign-extend from V2SI to V2DImode from:

movd%xmm0, %edx
pshufd  $85, %xmm0, %xmm0
movd%xmm0, %eax
movslq  %edx, %rdx
cltq
movq%rdx, (%rdi)
movq%rax, 8(%rdi)

to:
pxor%xmm1, %xmm1
pcmpgtd %xmm0, %xmm1
punpckldq   %xmm1, %xmm0
movaps  %xmm0, (%rdi)
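
A source fragment of the kind the patch targets (an assumed example in
the spirit of the new tests, using the generic vector extension):

typedef unsigned int v2si __attribute__((vector_size (8)));
typedef unsigned long long v2di __attribute__((vector_size (16)));

/* Zero-extend two 32-bit lanes to two 64-bit lanes; with -msse2 but
   without SSE4.1 this can now expand via pxor + punpckldq as above.  */
void
zext (v2di *dst, v2si src)
{
  *dst = __builtin_convertvector (src, v2di);
}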

PR target/111023

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_split_mmx_punpck):
Also handle V2QImode.
(ix86_expand_sse_extend): New function.
* config/i386/i386-protos.h (ix86_expand_sse_extend): New prototype.
* config/i386/mmx.md (<insn>v4qiv4hi2): Enable for
TARGET_SSE2.  Expand through ix86_expand_sse_extend for !TARGET_SSE4_1.
(<insn>v2hiv2si2): Ditto.
(<insn>v2qiv2hi2): Ditto.
* config/i386/sse.md (<insn>v8qiv8hi2): Ditto.
(<insn>v4hiv4si2): Ditto.
(<insn>v2siv2di2): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr111023-2.c: New test.
* gcc.target/i386/pr111023-4b.c: New test.
* gcc.target/i386/pr111023-8b.c: New test.
* gcc.target/i386/pr111023.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 85e30552d6f..460d496ef22 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -1124,8 +1124,9 @@ ix86_split_mmx_punpck (rtx operands[], bool high_p)
 
   switch (mode)
 {
-case E_V4QImode:
 case E_V8QImode:
+case E_V4QImode:
+case E_V2QImode:
   sse_mode = V16QImode;
   double_sse_mode = V32QImode;
   mask = gen_rtx_PARALLEL (VOIDmode,
@@ -5636,7 +5637,43 @@ ix86_expand_vec_perm (rtx operands[])
 }
 }
 
-/* Unpack OP[1] into the next wider integer vector type.  UNSIGNED_P is
+/* Extend SRC into next wider integer vector type.  UNSIGNED_P is
+   true if we should do zero extension, else sign extension.  */
+
+void
+ix86_expand_sse_extend (rtx dest, rtx src, bool unsigned_p)
+{
+  machine_mode imode = GET_MODE (src);
+  rtx ops[3];
+
+  switch (imode)
+{
+case E_V8QImode:
+case E_V4QImode:
+case E_V2QImode:
+case E_V4HImode:
+case E_V2HImode:
+case E_V2SImode:
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  ops[0] = gen_reg_rtx (imode);
+
+  ops[1] = force_reg (imode, src);
+
+  if (unsigned_p)
+ops[2] = force_reg (imode, CONST0_RTX (imode));
+  else
+ops[2] = ix86_expand_sse_cmp (gen_reg_rtx (imode), GT, CONST0_RTX (imode),
+ src, pc_rtx, pc_rtx);
+
+  ix86_split_mmx_punpck (ops, false);
+  emit_move_insn (dest, lowpart_subreg (GET_MODE (dest), ops[0], imode));
+}
+
+/* Unpack SRC into the next wider integer vector type.  UNSIGNED_P is
true if we should do zero extension, else sign extension.  HIGH_P is
true if we want the N/2 high elements, else the low elements.  */
 
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index fc2f1f13b78..9ffb125fc2b 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -155,6 +155,7 @@ extern bool ix86_expand_mask_vec_cmp (rtx, enum rtx_code, 
rtx, rtx);
 extern bool ix86_expand_int_vec_cmp (rtx[]);
 extern bool ix86_expand_fp_vec_cmp (rtx[]);
 extern void ix86_expand_sse_movcc (rtx, rtx, rtx, rtx);
+extern void ix86_expand_sse_extend (rtx, rtx, bool);
 extern void ix86_expand_sse_unpack (rtx, rtx, bool, bool);
 extern void ix86_expand_fp_spaceship (rtx, rtx, rtx);
 extern bool ix86_expand_int_addcc (rtx[]);
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 170432a7128..ef578222945 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -3744,8 +3744,14 @@ (define_expand "<insn>v4qiv4hi2"
   [(set (match_operand:V4HI 0 "register_operand")
        (any_extend:V4HI
          (match_operand:V4QI 1 "register_operand")))]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE"
 {
+  if (!TARGET_SSE4_1)
+    {
+      ix86_expand_sse_extend (operands[0], operands[1], <u_bool>);
+      DONE;
+    }
+
   rtx op1 = force_reg (V4QImode, operands[1]);
   op1 = lowpart_subreg (V8QImode, op1, V4QImode);
   emit_insn (gen_sse4_1_<code>v4qiv4hi2 (operands[0], op1));
@@ -3770,8 +3776,14 @@ (define_expand "<insn>v2hiv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
        (any_extend:V2SI
          (match_operand:V2HI 1 "register_operand")))]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE"
 {
+  if (!TARGET_SSE4_1)
+    {
+      ix86_expand_sse_extend (operands[0], operands[1], <u_bool>);
+      DONE;

Re: [PATCH] Generate vmovapd instead of vmovsd for moving DFmode between SSE_REGS.

2023-08-15 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 14, 2023 at 4:46 AM liuhongt via Gcc-patches
 wrote:
>
> vmovapd can enable register renaming and has the same code size as
> vmovsd. Similarly for vmovsh vs. vmovaps: vmovaps is 1 byte shorter than
> vmovsh.
>
> When TARGET_AVX512VL is not available, still generate
> vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (movdf_internal): Generate vmovapd instead of
> vmovsd when moving DFmode between SSE_REGS.
> (movhi_internal): Generate vmovdqa instead of vmovsh when
> moving HImode between SSE_REGS.
> (mov_internal): Use vmovaps instead of vmovsh when
> moving HF/BFmode between SSE_REGS.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr89229-4a.c: Adjust testcase.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.md| 20 +---
>  gcc/testsuite/gcc.target/i386/pr89229-4a.c |  4 +---
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index c906d75b13e..77182e34fe1 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2961,8 +2961,12 @@ (define_insn "*movhi_internal"
> ]
> (const_string "TI"))
> (eq_attr "alternative" "12")
> - (cond [(match_test "TARGET_AVX512FP16")
> + (cond [(match_test "TARGET_AVX512VL")
> +  (const_string "TI")
> +(match_test "TARGET_AVX512FP16")
>(const_string "HF")
> +(match_test "TARGET_AVX512F")
> +  (const_string "SF")
>  (match_test "TARGET_AVX")
>(const_string "TI")
>  (ior (not (match_test "TARGET_SSE2"))
> @@ -4099,8 +4103,12 @@ (define_insn "*movdf_internal"
>
>/* movaps is one byte shorter for non-AVX targets.  */
>(eq_attr "alternative" "13,17")
> -(cond [(match_test "TARGET_AVX")
> +(cond [(match_test "TARGET_AVX512VL")
> + (const_string "V2DF")
> +   (match_test "TARGET_AVX512F")
>   (const_string "DF")
> +   (match_test "TARGET_AVX")
> + (const_string "V2DF")
> (ior (not (match_test "TARGET_SSE2"))
>  (match_test "optimize_function_for_size_p 
> (cfun)"))
>   (const_string "V4SF")
> @@ -4380,8 +4388,14 @@ (define_insn "*mov_internal"
>(const_string "HI")
>(const_string "TI"))
>(eq_attr "alternative" "5")
> -(cond [(match_test "TARGET_AVX512FP16")
> +(cond [(match_test "TARGET_AVX512VL")
> +   (const_string "V4SF")
> +   (match_test "TARGET_AVX512FP16")
>   (const_string "HF")
> +   (match_test "TARGET_AVX512F")
> + (const_string "SF")
> +   (match_test "TARGET_AVX")
> + (const_string "V4SF")
> (ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY")
>  (match_test "TARGET_SSE_SPLIT_REGS"))
>   (const_string "V4SF")
> diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4a.c 
> b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> index 5bc10d25619..8869650b0ad 100644
> --- a/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> +++ b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-do assemble { target { ! ia32 } } } */
>  /* { dg-options "-O2 -march=skylake-avx512" } */
>
>  extern double d;
> @@ -12,5 +12,3 @@ foo1 (double x)
>asm volatile ("" : "+v" (xmm17));
>d = xmm17;
>  }
> -
> -/* { dg-final { scan-assembler-not "vmovapd" } } */
> --
> 2.31.1
>


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
 wrote:
>
> On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> >
> > Currently we have three different independent tunings for gather:
> > "use_gather, use_gather_2parts, use_gather_4parts";
> > similarly for scatter, there are
> > "use_scatter, use_scatter_2parts, use_scatter_4parts".
> >
> > The patch adds two standardized options to enable/disable
> > vectorization for all gather/scatter instructions.  The options are
> > expanded by the driver into the three tunings.
> >
> > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > Ok for trunk?
>
> I think -mgather/-mscatter are too close to -mfma suggesting they
> enable part of an ISA but they won't disable the use of intrinsics
> or enable gather/scatter on CPUs where the ISA doesn't have them.
>
> May I suggest to invent a more generic "short-cut" to
> -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> tunables add ^use_gather_any to cover all cases?  (or
> change what use_gather controls - it seems we changed its
> meaning before, and instead add use_gather_8parts and
> use_gather_16parts)
>
> That is, what's the point of this?

https://www.phoronix.com/review/downfall

that caused:

https://www.phoronix.com/review/intel-downfall-benchmarks

Uros.
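
For reference, a hedged sketch of the kind of loop these tunings
control (the function name and flags are assumptions, not from the patch):

/* With -O3 -march=skylake-avx512 the non-affine index makes this loop
   a candidate for vgatherdps; -mno-gather, i.e. the expanded
   -mtune-ctrl=^use_gather... form, makes the vectorizer fall back to
   scalar loads instead.  */
void
gather_use (float *restrict dst, const float *restrict src,
	    const int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[idx[i]];
}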


Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
>
> Currently we have three different independent tunings for gather:
> "use_gather, use_gather_2parts, use_gather_4parts";
> similarly for scatter, there are
> "use_scatter, use_scatter_2parts, use_scatter_4parts".
>
> The patch adds two standardized options to enable/disable
> vectorization for all gather/scatter instructions.  The options are
> expanded by the driver into the three tunings.
>
> bootstrapped and regtested on x86_64-pc-linux-gnu.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.h (DRIVER_SELF_SPECS): Add
> GATHER_SCATTER_DRIVER_SELF_SPECS.
> (GATHER_SCATTER_DRIVER_SELF_SPECS): New macro.
> * config/i386/i386.opt (mgather): New option.
> (mscatter): Ditto.
> ---
>  gcc/config/i386/i386.h   | 12 +++-
>  gcc/config/i386/i386.opt |  8 
>  2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index ef342fcee9b..d9ac2c29bde 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -565,7 +565,17 @@ extern GTY(()) tree x86_mfence;
>  # define SUBTARGET_DRIVER_SELF_SPECS ""
>  #endif
>
> -#define DRIVER_SELF_SPECS SUBTARGET_DRIVER_SELF_SPECS
> +#ifndef GATHER_SCATTER_DRIVER_SELF_SPECS
> +# define GATHER_SCATTER_DRIVER_SELF_SPECS \
> +  
> "%{mno-gather:-mtune-ctrl=^use_gather_2parts,^use_gather_4parts,^use_gather} \
> +   %{mgather:-mtune-ctrl=use_gather_2parts,use_gather_4parts,use_gather} \
> +   
> %{mno-scatter:-mtune-ctrl=^use_scatter_2parts,^use_scatter_4parts,^use_scatter}
>  \
> +   %{mscatter:-mtune-ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter}"
> +#endif
> +
> +#define DRIVER_SELF_SPECS \
> +  SUBTARGET_DRIVER_SELF_SPECS " " \
> +  GATHER_SCATTER_DRIVER_SELF_SPECS
>
>  /* -march=native handling only makes sense with compiler running on
> an x86 or x86_64 chip.  If changing this condition, also change
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index ddb7f110aa2..99948644a8d 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -424,6 +424,14 @@ mdaz-ftz
>  Target
>  Set the FTZ and DAZ Flags.
>
> +mgather
> +Target
> +Enable vectorization for gather instruction.
> +
> +mscatter
> +Target
> +Enable vectorization for scatter instruction.

Are gather and scatter instructions affected in a separate way, or
should we use one -mgather-scatter option to cover all gather/scatter
tunings?

Uros.

> +
>  mpreferred-stack-boundary=
>  Target RejectNegative Joined UInteger Var(ix86_preferred_stack_boundary_arg)
>  Attempt to keep stack aligned to this power of 2.
> --
> 2.31.1
>


Re: [PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-10 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 10, 2023 at 2:49 AM liuhongt  wrote:
>
> Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named
> patterns in order to avoid generation of partial vector V8HFmode
> trapping instructions.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/110832
> * config/i386/mmx.md (movq_<mode>_to_sse): Also do not
> sanitize upper part of V4HFmode register with
> -fno-trapping-math.
> (<insn>v4hf3): Enable for ix86_partial_vec_fp_math.
> (<insn>v2hf3): Ditto.
> (divv2hf3): Ditto.
> (movd_v2hf_to_sse): Do not sanitize upper part of V2HFmode
> register with -fno-trapping-math.

OK.

BTW: I would just like to mention that plenty of instructions can be
enabled for V4HF/V2HFmode besides arithmetic insns. At least
conversions, comparisons, FMA and min/max (to name some of them) can
be enabled by introducing expanders that expand to V8HFmode
instructions.

Uros.
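
To make that concrete, a hedged user-level sketch (the typedef, function
name and flags are assumptions, not part of the patch):

/* With hypothetical V4HF FMA expanders this would contract to a single
   vfmadd...ph on the low half of an xmm register, assuming
   -O2 -mavx512fp16 -mavx512vl and the default -ffp-contract=fast;
   today only the add/sub/mul/div expanders exist.  */
typedef _Float16 v4hf __attribute__ ((vector_size (8)));

v4hf
fma_v4hf (v4hf a, v4hf b, v4hf c)
{
  return a * b + c;
}
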
>
> ---
>  gcc/config/i386/mmx.md | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index d51b3b9dc71..170432a7128 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -596,7 +596,7 @@ (define_expand "movq_<mode>_to_sse"
>   (match_dup 2)))]
>    "TARGET_SSE2"
>  {
> -  if (<MODE>mode == V2SFmode
> +  if (<MODE>mode != V2SImode
>&& !flag_trapping_math)
>  {
>rtx op1 = force_reg (mode, operands[1]);
> @@ -1941,7 +1941,7 @@ (define_expand "v4hf3"
> (plusminusmult:V4HF
>   (match_operand:V4HF 1 "nonimmediate_operand")
>   (match_operand:V4HF 2 "nonimmediate_operand")))]
> -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
>  {
>rtx op2 = gen_reg_rtx (V8HFmode);
>rtx op1 = gen_reg_rtx (V8HFmode);
> @@ -1961,7 +1961,7 @@ (define_expand "divv4hf3"
> (div:V4HF
>   (match_operand:V4HF 1 "nonimmediate_operand")
>   (match_operand:V4HF 2 "nonimmediate_operand")))]
> -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
>  {
>rtx op2 = gen_reg_rtx (V8HFmode);
>rtx op1 = gen_reg_rtx (V8HFmode);
> @@ -1983,14 +1983,22 @@ (define_expand "movd_v2hf_to_sse"
> (match_operand:V2HF 1 "nonimmediate_operand"))
>   (match_operand:V8HF 2 "reg_or_0_operand")
>   (const_int 3)))]
> -  "TARGET_SSE")
> +  "TARGET_SSE"
> +{
> +  if (!flag_trapping_math && operands[2] == CONST0_RTX (V8HFmode))
> +  {
> +rtx op1 = force_reg (V2HFmode, operands[1]);
> +emit_move_insn (operands[0], lowpart_subreg (V8HFmode, op1, V2HFmode));
> +DONE;
> +  }
> +})
>
>  (define_expand "v2hf3"
>[(set (match_operand:V2HF 0 "register_operand")
> (plusminusmult:V2HF
>   (match_operand:V2HF 1 "nonimmediate_operand")
>   (match_operand:V2HF 2 "nonimmediate_operand")))]
> -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
>  {
>rtx op2 = gen_reg_rtx (V8HFmode);
>rtx op1 = gen_reg_rtx (V8HFmode);
> @@ -2009,7 +2017,7 @@ (define_expand "divv2hf3"
> (div:V2HF
>   (match_operand:V2HF 1 "nonimmediate_operand")
>   (match_operand:V2HF 2 "nonimmediate_operand")))]
> -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
>  {
>rtx op2 = gen_reg_rtx (V8HFmode);
>rtx op1 = gen_reg_rtx (V8HFmode);
> --
> 2.31.1
>


Re: [PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-09 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 7, 2023 at 1:20 PM Richard Biener
 wrote:

> > Please also note the RFC patch [1] that relaxes clears for V2SFmode
> > with -fno-trapping-math. The patched compiler will then emit the same
> > code as clang does for -O2. Which raises another question - should gcc
> > default to -fno-trapping-math?
>
> I think we discussed this before and yes, IMHO we should default to
> -fno-trapping-math at least for C/C++ to be consistent with our other
> handling of the FP environment (default to -fno-rounding-math) and
> lack of proper FENV access barriers for inspecting the exceptions.
>
> Note Fortran has the -ffpe-trap= option which would then need to make
> sure to also enable -ftrapping-math.  Ada might have similar constraints
> (it also uses -fnon-call-exceptions, but unless it enables CPU traps for
> FP exceptions that would be a no-op).  Note this also shows we should
> possibly separate maintaining the IEEE exception state and considering
> changes in the IEEE exception states to cause CPU traps (that's also
> a source of common confusion on the user side).

FTR: PR54192, "-fno-trapping-math by default?" [1]

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54192

Uros.


Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-09 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 9, 2023 at 8:38 AM Uros Bizjak  wrote:
>
> On Wed, Aug 9, 2023 at 8:37 AM Liu, Hongtao  wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Uros Bizjak 
> > > Sent: Wednesday, August 9, 2023 2:33 PM
> > > To: Liu, Hongtao 
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy
> > > Bridge.
> > >
> > > On Wed, Aug 9, 2023 at 3:48 AM liuhongt  wrote:
> > > >
> > > > > Please rather do it in a more self-descriptive way, as proposed in
> > > > > the attached patch. You won't need a comment then.
> > > > >
> > > >
> > > > Adjusted in V2 patch.
> > > >
> > > > Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported
> > > > via EAX.
> > > >
> > > > Intel documentation says invalid subleaves return 0. We had been
> > > > relying on that behavior instead of checking the max subleaf number.
> > > >
> > > > It appears that some Sandy Bridge CPUs return at least the subleaf 0
> > > > EDX value for subleaf 1. Best guess is that this is a bug in a
> > > > microcode patch since all of the bits we're seeing set in EDX were
> > > > introduced after Sandy Bridge was originally released.
> > > >
> > > > This is causing avxvnniint16 to be incorrectly enabled with
> > > > -march=native on these CPUs.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * common/config/i386/cpuinfo.h (get_available_features): Check
> > > > EAX for valid subleaf before use CPUID.
> > > > ---
> > > >  gcc/common/config/i386/cpuinfo.h | 82
> > > > +---
> > > >  1 file changed, 43 insertions(+), 39 deletions(-)
> > > >
> > > > diff --git a/gcc/common/config/i386/cpuinfo.h
> > > > b/gcc/common/config/i386/cpuinfo.h
> > > > index 30ef0d334ca..9fa4dec2a7e 100644
> > > > --- a/gcc/common/config/i386/cpuinfo.h
> > > > +++ b/gcc/common/config/i386/cpuinfo.h
> > > > @@ -663,6 +663,7 @@ get_available_features (struct __processor_model
> > > *cpu_model,
> > > >unsigned int max_cpuid_level = cpu_model2->__cpu_max_level;
> > > >unsigned int eax, ebx;
> > > >unsigned int ext_level;
> > > > +  unsigned int subleaf_level;
> > >
> > > Oh, I missed this in my previous review. This variable should be named
> > > max_subleaf_level, as it represents the maximum supported ECX value.
> I've committed the previous patch, but haven't backported it yet.
> Guess I can just commit another patch to change the name?
> For the backport, I'll merge the changes together into a single commit.
>
> Yes. It is a trivial minor change.

I also think the declaration should go inside the (max_cpuid_level >= 7)
block, since it is used only there (and is irrelevant outside the block),
but it is your call...

Uros.


Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-09 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 9, 2023 at 8:37 AM Liu, Hongtao  wrote:
>
>
>
> > -Original Message-
> > From: Uros Bizjak 
> > Sent: Wednesday, August 9, 2023 2:33 PM
> > To: Liu, Hongtao 
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy
> > Bridge.
> >
> > On Wed, Aug 9, 2023 at 3:48 AM liuhongt  wrote:
> > >
> > > > Please rather do it in a more self-descriptive way, as proposed in
> > > > the attached patch. You won't need a comment then.
> > > >
> > >
> > > Adjusted in V2 patch.
> > >
> > > Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported
> > > via EAX.
> > >
> > > Intel documentation says invalid subleaves return 0. We had been
> > > relying on that behavior instead of checking the max subleaf number.
> > >
> > > It appears that some Sandy Bridge CPUs return at least the subleaf 0
> > > EDX value for subleaf 1. Best guess is that this is a bug in a
> > > microcode patch since all of the bits we're seeing set in EDX were
> > > introduced after Sandy Bridge was originally released.
> > >
> > > This is causing avxvnniint16 to be incorrectly enabled with
> > > -march=native on these CPUs.
> > >
> > > gcc/ChangeLog:
> > >
> > > * common/config/i386/cpuinfo.h (get_available_features): Check
> > > EAX for valid subleaf before use CPUID.
> > > ---
> > >  gcc/common/config/i386/cpuinfo.h | 82
> > > +---
> > >  1 file changed, 43 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/gcc/common/config/i386/cpuinfo.h
> > > b/gcc/common/config/i386/cpuinfo.h
> > > index 30ef0d334ca..9fa4dec2a7e 100644
> > > --- a/gcc/common/config/i386/cpuinfo.h
> > > +++ b/gcc/common/config/i386/cpuinfo.h
> > > @@ -663,6 +663,7 @@ get_available_features (struct __processor_model
> > *cpu_model,
> > >unsigned int max_cpuid_level = cpu_model2->__cpu_max_level;
> > >unsigned int eax, ebx;
> > >unsigned int ext_level;
> > > +  unsigned int subleaf_level;
> >
> > Oh, I missed this in my previous review. This variable should be named
> > max_subleaf_level, as it represents the maximum supported ECX value.
> I've committed the previous patch, but haven't backported it yet.
> Guess I can just commit another patch to change the name?
> For the backport, I'll merge the changes together into a single commit.

Yes. It is a trivial minor change.

Uros.


Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-09 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 9, 2023 at 3:48 AM liuhongt  wrote:
>
> > Please rather do it in a more self-descriptive way, as proposed in the
> > attached patch. You won't need a comment then.
> >
>
> Adjusted in V2 patch.
>
> Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
> supported via EAX.
>
> Intel documentation says invalid subleaves return 0. We had been
> relying on that behavior instead of checking the max subleaf number.
>
> It appears that some Sandy Bridge CPUs return at least the subleaf 0
> EDX value for subleaf 1. Best guess is that this is a bug in a
> microcode patch since all of the bits we're seeing set in EDX were
> introduced after Sandy Bridge was originally released.
>
> This is causing avxvnniint16 to be incorrectly enabled with
> -march=native on these CPUs.
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features): Check
> EAX for valid subleaf before use CPUID.
> ---
>  gcc/common/config/i386/cpuinfo.h | 82 +---
>  1 file changed, 43 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 30ef0d334ca..9fa4dec2a7e 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -663,6 +663,7 @@ get_available_features (struct __processor_model 
> *cpu_model,
>unsigned int max_cpuid_level = cpu_model2->__cpu_max_level;
>unsigned int eax, ebx;
>unsigned int ext_level;
> +  unsigned int subleaf_level;

Oh, I missed this in my previous review. This variable should be named
max_subleaf_level, as it represents the maximum supported ECX value.

Uros.

>
>/* Get XCR_XFEATURE_ENABLED_MASK register with xgetbv.  */
>  #define XCR_XFEATURE_ENABLED_MASK  0x0
> @@ -762,7 +763,7 @@ get_available_features (struct __processor_model 
> *cpu_model,
>/* Get Advanced Features at level 7 (eax = 7, ecx = 0/1). */
>if (max_cpuid_level >= 7)
>  {
> -  __cpuid_count (7, 0, eax, ebx, ecx, edx);
> +  __cpuid_count (7, 0, subleaf_level, ebx, ecx, edx);
>if (ebx & bit_BMI)
> set_feature (FEATURE_BMI);
>if (ebx & bit_SGX)
> @@ -874,45 +875,48 @@ get_available_features (struct __processor_model 
> *cpu_model,
> set_feature (FEATURE_AVX512FP16);
> }
>
> -  __cpuid_count (7, 1, eax, ebx, ecx, edx);
> -  if (eax & bit_HRESET)
> -   set_feature (FEATURE_HRESET);
> -  if (eax & bit_CMPCCXADD)
> -   set_feature(FEATURE_CMPCCXADD);
> -  if (edx & bit_PREFETCHI)
> -   set_feature (FEATURE_PREFETCHI);
> -  if (eax & bit_RAOINT)
> -   set_feature (FEATURE_RAOINT);
> -  if (avx_usable)
> -   {
> - if (eax & bit_AVXVNNI)
> -   set_feature (FEATURE_AVXVNNI);
> - if (eax & bit_AVXIFMA)
> -   set_feature (FEATURE_AVXIFMA);
> - if (edx & bit_AVXVNNIINT8)
> -   set_feature (FEATURE_AVXVNNIINT8);
> - if (edx & bit_AVXNECONVERT)
> -   set_feature (FEATURE_AVXNECONVERT);
> - if (edx & bit_AVXVNNIINT16)
> -   set_feature (FEATURE_AVXVNNIINT16);
> - if (eax & bit_SM3)
> -   set_feature (FEATURE_SM3);
> - if (eax & bit_SHA512)
> -   set_feature (FEATURE_SHA512);
> - if (eax & bit_SM4)
> -   set_feature (FEATURE_SM4);
> -   }
> -  if (avx512_usable)
> -   {
> - if (eax & bit_AVX512BF16)
> -   set_feature (FEATURE_AVX512BF16);
> -   }
> -  if (amx_usable)
> +  if (subleaf_level >= 1)
> {
> - if (eax & bit_AMX_FP16)
> -   set_feature (FEATURE_AMX_FP16);
> - if (edx & bit_AMX_COMPLEX)
> -   set_feature (FEATURE_AMX_COMPLEX);
> + __cpuid_count (7, 1, eax, ebx, ecx, edx);
> + if (eax & bit_HRESET)
> +   set_feature (FEATURE_HRESET);
> + if (eax & bit_CMPCCXADD)
> +   set_feature(FEATURE_CMPCCXADD);
> + if (edx & bit_PREFETCHI)
> +   set_feature (FEATURE_PREFETCHI);
> + if (eax & bit_RAOINT)
> +   set_feature (FEATURE_RAOINT);
> + if (avx_usable)
> +   {
> + if (eax & bit_AVXVNNI)
> +   set_feature (FEATURE_AVXVNNI);
> + if (eax & bit_AVXIFMA)
> +   set_feature (FEATURE_AVXIFMA);
> + if (edx & bit_AVXVNNIINT8)
> +   set_feature (FEATURE_AVXVNNIINT8);
> + if (edx & bit_AVXNECONVERT)
> +   set_feature (FEATURE_AVXNECONVERT);
> + if (edx & bit_AVXVNNIINT16)
> +   set_feature (FEATURE_AVXVNNIINT16);
> + if (eax & bit_SM3)
> +   set_feature (FEATURE_SM3);
> + if (eax & bit_SHA512)
> +   set_feature (FEATURE_SHA512);
> + if (eax & bit_SM4)
> +   set_feature (FEATURE_SM4);
> +   }
> + if (avx512_usable)
> +   {
> + if (eax & bit_AVX512BF16)
> +   set_feature (FEATURE_AVX512BF16);

Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 9, 2023 at 3:48 AM liuhongt  wrote:
>
> > Please rather do it in a more self-descriptive way, as proposed in the
> > attached patch. You won't need a comment then.
> >
>
> Adjusted in V2 patch.
>
> Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
> supported via EAX.
>
> Intel documentation says invalid subleaves return 0. We had been
> relying on that behavior instead of checking the max subleaf number.

Probably a documentation bug, even Wikipedia says about CPUID:

EAX=7, ECX=0: Extended Features

This returns extended feature flags in EBX, ECX, and EDX. Returns the
maximum ECX value for EAX=7 in EAX.
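
A hedged C sketch of the guard under discussion (the helper name is
illustrative; the __cpuid_count macro is the standard one from <cpuid.h>):

#include <cpuid.h>

/* Leaf 7, subleaf 1 may only be queried when the EAX output of
   (7, 0) -- the maximum valid subleaf index -- says it exists;
   relying on "invalid subleaves return 0" is, per this thread,
   unsafe on some Sandy Bridge parts.  */
static int
leaf7_subleaf1_supported (void)
{
  unsigned int max_subleaf, ebx, ecx, edx;
  __cpuid_count (7, 0, max_subleaf, ebx, ecx, edx);
  return max_subleaf >= 1;
}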

> It appears that some Sandy Bridge CPUs return at least the subleaf 0
> EDX value for subleaf 1. Best guess is that this is a bug in a
> microcode patch since all of the bits we're seeing set in EDX were
> introduced after Sandy Bridge was originally released.
>
> This is causing avxvnniint16 to be incorrectly enabled with
> -march=native on these CPUs.
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features): Check
> EAX for valid subleaf before use CPUID.

OK for mainline and backports.

Thanks,
Uros.

> ---
>  gcc/common/config/i386/cpuinfo.h | 82 +---
>  1 file changed, 43 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 30ef0d334ca..9fa4dec2a7e 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -663,6 +663,7 @@ get_available_features (struct __processor_model 
> *cpu_model,
>unsigned int max_cpuid_level = cpu_model2->__cpu_max_level;
>unsigned int eax, ebx;
>unsigned int ext_level;
> +  unsigned int subleaf_level;
>
>/* Get XCR_XFEATURE_ENABLED_MASK register with xgetbv.  */
>  #define XCR_XFEATURE_ENABLED_MASK  0x0
> @@ -762,7 +763,7 @@ get_available_features (struct __processor_model 
> *cpu_model,
>/* Get Advanced Features at level 7 (eax = 7, ecx = 0/1). */
>if (max_cpuid_level >= 7)
>  {
> -  __cpuid_count (7, 0, eax, ebx, ecx, edx);
> +  __cpuid_count (7, 0, subleaf_level, ebx, ecx, edx);
>if (ebx & bit_BMI)
> set_feature (FEATURE_BMI);
>if (ebx & bit_SGX)
> @@ -874,45 +875,48 @@ get_available_features (struct __processor_model 
> *cpu_model,
> set_feature (FEATURE_AVX512FP16);
> }
>
> -  __cpuid_count (7, 1, eax, ebx, ecx, edx);
> -  if (eax & bit_HRESET)
> -   set_feature (FEATURE_HRESET);
> -  if (eax & bit_CMPCCXADD)
> -   set_feature(FEATURE_CMPCCXADD);
> -  if (edx & bit_PREFETCHI)
> -   set_feature (FEATURE_PREFETCHI);
> -  if (eax & bit_RAOINT)
> -   set_feature (FEATURE_RAOINT);
> -  if (avx_usable)
> -   {
> - if (eax & bit_AVXVNNI)
> -   set_feature (FEATURE_AVXVNNI);
> - if (eax & bit_AVXIFMA)
> -   set_feature (FEATURE_AVXIFMA);
> - if (edx & bit_AVXVNNIINT8)
> -   set_feature (FEATURE_AVXVNNIINT8);
> - if (edx & bit_AVXNECONVERT)
> -   set_feature (FEATURE_AVXNECONVERT);
> - if (edx & bit_AVXVNNIINT16)
> -   set_feature (FEATURE_AVXVNNIINT16);
> - if (eax & bit_SM3)
> -   set_feature (FEATURE_SM3);
> - if (eax & bit_SHA512)
> -   set_feature (FEATURE_SHA512);
> - if (eax & bit_SM4)
> -   set_feature (FEATURE_SM4);
> -   }
> -  if (avx512_usable)
> -   {
> - if (eax & bit_AVX512BF16)
> -   set_feature (FEATURE_AVX512BF16);
> -   }
> -  if (amx_usable)
> +  if (subleaf_level >= 1)
> {
> - if (eax & bit_AMX_FP16)
> -   set_feature (FEATURE_AMX_FP16);
> - if (edx & bit_AMX_COMPLEX)
> -   set_feature (FEATURE_AMX_COMPLEX);
> + __cpuid_count (7, 1, eax, ebx, ecx, edx);
> + if (eax & bit_HRESET)
> +   set_feature (FEATURE_HRESET);
> + if (eax & bit_CMPCCXADD)
> +   set_feature(FEATURE_CMPCCXADD);
> + if (edx & bit_PREFETCHI)
> +   set_feature (FEATURE_PREFETCHI);
> + if (eax & bit_RAOINT)
> +   set_feature (FEATURE_RAOINT);
> + if (avx_usable)
> +   {
> + if (eax & bit_AVXVNNI)
> +   set_feature (FEATURE_AVXVNNI);
> + if (eax & bit_AVXIFMA)
> +   set_feature (FEATURE_AVXIFMA);
> + if (edx & bit_AVXVNNIINT8)
> +   set_feature (FEATURE_AVXVNNIINT8);
> + if (edx & bit_AVXNECONVERT)
> +   set_feature (FEATURE_AVXNECONVERT);
> + if (edx & bit_AVXVNNIINT16)
> +   set_feature (FEATURE_AVXVNNIINT16);
> + if (eax & bit_SM3)
> +   set_feature (FEATURE_SM3);
> + if (eax & bit_SHA512)
> +   set_feature (FEATURE_SHA512);
> + if (eax & bit_SM4)
> +   set_feature (FEATURE_SM4);
> +   }

[committed] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-08-08 Thread Uros Bizjak via Gcc-patches
Also introduce the -m[no-]partial-vector-fp-math option to disable trapping
V2SF named patterns in order to avoid generation of partial vector V4SFmode
trapping instructions.

The new option is enabled by default, because even with sanitization,
a small but consistent speed-up of 2 to 3% on the Polyhedron capacita
benchmark can be achieved vs. scalar code.

Using -fno-trapping-math improves Polyhedron capacita runtime by 8 to 9%
vs. scalar code.  This matches what clang generates by default, as it
defaults to -fno-trapping-math.

PR target/110832

gcc/ChangeLog:

* config/i386/i386.opt (mpartial-vector-fp-math): New option.
* config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
upper part of V2SFmode register with -fno-trapping-math.
(v2sf3): Enable for ix86_partial_vec_fp_math.
(divv2sf3): Ditto.
(v2sf3): Ditto.
(sqrtv2sf2): Ditto.
(*mmx_haddv2sf3_low): Ditto.
(*mmx_hsubv2sf3_low): Ditto.
(vec_addsubv2sf3): Ditto.
(vec_cmpv2sfv2si): Ditto.
(vcondv2sf): Ditto.
(fmav2sf4): Ditto.
(fmsv2sf4): Ditto.
(fnmav2sf4): Ditto.
(fnmsv2sf4): Ditto.
(fix_truncv2sfv2si2): Ditto.
(fixuns_truncv2sfv2si2): Ditto.
(floatv2siv2sf2): Ditto.
(floatunsv2siv2sf2): Ditto.
(nearbyintv2sf2): Ditto.
(rintv2sf2): Ditto.
(lrintv2sfv2si2): Ditto.
(ceilv2sf2): Ditto.
(lceilv2sfv2si2): Ditto.
(floorv2sf2): Ditto.
(lfloorv2sfv2si2): Ditto.
(btruncv2sf2): Ditto.
(roundv2sf2): Ditto.
(lroundv2sfv2si2): Ditto.
* doc/invoke.texi (x86 Options): Document
-mpartial-vector-fp-math option.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110832-1.c: New test.
* gcc.target/i386/pr110832-2.c: New test.
* gcc.target/i386/pr110832-3.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
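
To illustrate what the option gates, a hedged sketch (the typedef,
function name and flags are illustrative, not part of the patch):

typedef float v2sf __attribute__ ((vector_size (8)));

/* Compiled at -O2, this executes as a full-width addps/vaddps, so
   lanes 2 and 3 of the V4SF register participate.  With
   -mpartial-vector-fp-math (the default) the expander first zeroes
   the upper half so those lanes cannot raise spurious FP exceptions;
   with -fno-trapping-math the zeroing is skipped.  */
v2sf
add_v2sf (v2sf a, v2sf b)
{
  return a + b;
}
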
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..2feabc1bf32 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -632,6 +632,10 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+mpartial-vector-fp-math
+Target Var(ix86_partial_vec_fp_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial 
vectors
+
 mmove-max=
 Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) 
Init(PVW_NONE) Save
 Maximum number of bits that can be moved from memory to memory efficiently.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index b49554e9b8f..d51b3b9dc71 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -595,7 +595,18 @@ (define_expand "movq_<mode>_to_sse"
	  (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;
 ;;
@@ -648,7 +659,7 @@ (define_expand "v2sf3"
(plusminusmult:V2SF
  (match_operand:V2SF 1 "nonimmediate_operand")
  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_fp_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -726,7 +737,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
(div:V2SF (match_operand:V2SF 1 "register_operand")
  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_fp_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -748,7 +759,7 @@ (define_expand "v2sf3"
 (smaxmin:V2SF
  (match_operand:V2SF 1 "register_operand")
  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_fp_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -850,7 +861,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_fp_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -931,7 +942,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
  (vec_select:SF
(match_dup 1)
(parallel [(match_operand:SI 3 "const_0_to_1_operand")]]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_fp_math
&& INTVAL (operands[2]) != INTVAL (operands[3])
&& ix86_pre_reload_split ()"
   "#"
@@ -977,7 

Re: [PATCH] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread Uros Bizjak via Gcc-patches
On Tue, Aug 8, 2023 at 9:58 AM liuhongt  wrote:
>
> Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
> supported via EAX.
>
> Intel documentation says invalid subleaves return 0. We had been
> relying on that behavior instead of checking the max subleaf number.
>
> It appears that some Sandy Bridge CPUs return at least the subleaf 0
> EDX value for subleaf 1. Best guess is that this is a bug in a
> microcode patch since all of the bits we're seeing set in EDX were
> introduced after Sandy Bridge was originally released.
>
> This is causing avxvnniint16 to be incorrectly enabled with
> -march=native on these CPUs.
>
> BTW: Thanks for the reminder from LLVM folks Phoebe and Craig.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.

Please rather do it in a more self-descriptive way, as proposed in the
attached patch. You won't need a comment then.

(Please note that indentation is wrong in the patch in order to better
see the changes).

Uros.
diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index 30ef0d334ca..49724d2cba1 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -762,7 +762,9 @@ get_available_features (struct __processor_model *cpu_model,
   /* Get Advanced Features at level 7 (eax = 7, ecx = 0/1). */
   if (max_cpuid_level >= 7)
 {
-  __cpuid_count (7, 0, eax, ebx, ecx, edx);
+  unsigned subleaf_level;
+
+  __cpuid_count (7, 0, subleaf_level, ebx, ecx, edx);
   if (ebx & bit_BMI)
set_feature (FEATURE_BMI);
   if (ebx & bit_SGX)
@@ -873,8 +875,9 @@ get_available_features (struct __processor_model *cpu_model,
  if (edx & bit_AVX512FP16)
set_feature (FEATURE_AVX512FP16);
}
-
-  __cpuid_count (7, 1, eax, ebx, ecx, edx);
+  if (subleaf_level >= 1)
+   {
+ __cpuid_count (7, 1, eax, ebx, ecx, edx);
   if (eax & bit_HRESET)
set_feature (FEATURE_HRESET);
   if (eax & bit_CMPCCXADD)
@@ -914,6 +917,7 @@ get_available_features (struct __processor_model *cpu_model,
  if (edx & bit_AMX_COMPLEX)
set_feature (FEATURE_AMX_COMPLEX);
}
+   }
 }
 
   /* Get Advanced Features at level 0xd (eax = 0xd, ecx = 1). */


Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-08-08 Thread Uros Bizjak via Gcc-patches
On Tue, Aug 8, 2023 at 12:08 PM Richard Biener  wrote:

> > > > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > > > named patterns in order to avoid generation of partial vector 
> > > > > > V4SFmode
> > > > > > trapping instructions.
> > > > > >
> > > > > > The new option is enabled by default, because even with 
> > > > > > sanitization,
> > > > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > > > benchmark can be achieved vs. scalar code.
> > > > > >
> > > > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 
> > > > > > 9%
> > > > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > > > to -fno-trapping-math.
> > > > >
> > > > > I like the new option, note you lack invoke.texi documentation where
> > > > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > > > and the possible performance impact then NaNs or denormals leak
> > > > > into the upper halves and cross-reference -mdaz-ftz.
> > > >
> > > > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > > > option. It is written in a way to also cover half-float vectors. WDYT?
> > >
> > > "generate trapping floating-point operations"
> > >
> > > I'd say "generate floating-point operations that might affect the
> > > set of floating point status flags", the word "trapping" is IMHO
> > > misleading.
> > > Not sure if "set of floating point status flags" is the correct term,
> > > but it's what the C standard seems to refer to when talking about
> > > things you get with fegetexceptflag.  feraieexcept refers to
> > > "floating-point exceptions".  Unfortunately the -fno-trapping-math
> > > documentation is similarly confusing (and maybe even wrong, I read
> > > it to conform to 'non-stop' IEEE arithmetic).
> >
> > Thanks for suggesting the right terminology. I think that:
> >
> > +@opindex mpartial-vector-math
> > +@item -mpartial-vector-math
> > +This option enables GCC to generate floating-point operations that might
> > +affect the set of floating point status flags on partial vectors, where
> > +vector elements reside in the low part of the 128-bit SSE register.  Unless
> > +@option{-fno-trapping-math} is specified, the compiler guarantees correct
> > +behavior by sanitizing all input operands to have zeroes in the unused
> > +upper part of the vector register.  Note that by using built-in functions
> > +or inline assembly with partial vector arguments, NaNs, denormal or invalid
> > +values can leak into the upper part of the vector, causing possible
> > +performance issues when @option{-fno-trapping-math} is in effect.  These
> > +issues can be mitigated by manually sanitizing the upper part of the 
> > partial
> > +vector argument register or by using @option{-mdaz-ftz} to set
> > +denormals-are-zero (DAZ) flag in the MXCSR register.
> >
> > The above now explains in adequate detail what the option does. IMO, the
> > "floating-point operations that might affect the set of floating point
> > status flags" correctly identifies affected operations, so an example,
> > as suggested below, is not necessary.
> >
> > > I'd maybe give an example of a FP operation that's _not_ affected
> > > by the flag (copysign?).
> >
> > Please note that I have renamed the option to "-mpartial-vector-math"
> > with a short target-specific description:
>
> Ah yes, that's a less confusing name but then it might suggest
> that -mno-partial-vector-math would disable all of that, including
> integer ops, not only the patterns possibly affecting the exception
> flags?  Note I don't have a better suggestion and this is clearly
> better than the one mentioning mmx.

You are right, I think I'll rename the option to -mpartial-vector-fp-math.

Thanks,
Uros.
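
As an aside, a hedged sketch of the "manually sanitizing the upper part"
mitigation mentioned in the documentation text (the helper name is
illustrative; the intrinsics are the standard <immintrin.h> ones):

#include <immintrin.h>

/* Zero lanes 2 and 3 of a V4SF register before treating it as a
   2-element vector, so stray NaNs or denormals left there by earlier
   code cannot cause exception or performance surprises.  */
static __m128
sanitize_upper_v2sf (__m128 v)
{
  return _mm_movelh_ps (v, _mm_setzero_ps ());
}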


Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-08-08 Thread Uros Bizjak via Gcc-patches
On Tue, Aug 8, 2023 at 10:07 AM Richard Biener  wrote:
>
> On Mon, 7 Aug 2023, Uros Bizjak wrote:
>
> > On Mon, Jul 31, 2023 at 11:40?AM Richard Biener  wrote:
> > >
> > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > >
> > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > trapping instructions.
> > > >
> > > > The new option is enabled by default, because even with sanitization,
> > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > benchmark can be achieved vs. scalar code.
> > > >
> > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > to -fno-trapping-math.
> > >
> > > I like the new option, note you lack invoke.texi documentation where
> > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > and the possible performance impact when NaNs or denormals leak
> > > into the upper halves and cross-reference -mdaz-ftz.
> >
> > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > option. It is written in a way to also cover half-float vectors. WDYT?
>
> "generate trapping floating-point operations"
>
> I'd say "generate floating-point operations that might affect the
> set of floating point status flags", the word "trapping" is IMHO
> misleading.
> Not sure if "set of floating point status flags" is the correct term,
> but it's what the C standard seems to refer to when talking about
> things you get with fegetexceptflag.  feraieexcept refers to
> "floating-point exceptions".  Unfortunately the -fno-trapping-math
> documentation is similarly confusing (and maybe even wrong, I read
> it to conform to 'non-stop' IEEE arithmetic).

Thanks for suggesting the right terminology. I think that:

+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register.  Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register.  Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect.  These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.

The above now explains in adequate detail what the option does. IMO, the
"floating-point operations that might affect the set of floating point
status flags" correctly identifies affected operations, so an example,
as suggested below, is not necessary.

> I'd maybe give an example of a FP operation that's _not_ affected
> by the flag (copysign?).

Please note that I have renamed the option to "-mpartial-vector-math"
with a short target-specific description:

+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on
partial vectors

which I think summarises the option (without the word "trapping"). The
same approach will be taken for Float16 operations, so the approach is
not specific to MMX vectors.

> Otherwise it looks OK to me.

Thanks, I have attached the RFC V2 patch; I plan to submit a formal
patch later today.

Uros.
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..8d9a1ae93f3 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -632,6 +632,10 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial 
vectors
+
 mmove-max=
 Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) 
Init(PVW_NONE) Save
 Maximum number of bits that can be moved from memory to memory efficiently.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index b49554e9b8f..95f7a0113e7 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -595,7 +595,18 @@ (define_expand "movq_<mode>_to_sse"
	  (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})

Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-08-07 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 31, 2023 at 11:40 AM Richard Biener  wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option, note you lack invoke.texi documentation where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact when NaNs or denormals leak
> into the upper halves and cross-reference -mdaz-ftz.

The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
option. It is written so as to also cover half-float vectors. WDYT?

Uros.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fa765d5a0dd..99093172abe 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1417,6 +1417,7 @@ See RS/6000 and PowerPC Options.
 -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait
 -mrecip  -mrecip=@var{opt}
 -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt}
+-mmmxfp-with-sse
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx
 -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl
@@ -33708,6 +33709,22 @@ This option instructs GCC to use 128-bit AVX 
instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex -mmmxfp-with-sse
+@item -mmmxfp-with-sse
+This option enables GCC to generate trapping floating-point operations on
+partial vectors, where vector elements reside in the low part of the 128-bit
+SSE register.  Unless @option{-fno-trapping-math} is specified, the compiler
+guarantees correct trapping behavior by sanitizing all input operands to
+have zeroes in the upper part of the vector register.  Note that by using
+built-in functions or inline assembly with partial vector arguments, NaNs,
+denormal or invalid values can leak into the upper part of the vector,
+causing possible performance issues when @option{-fno-trapping-math} is in
+effect.  These issues can be mitigated by manually sanitizing the upper part
+of the partial vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be


Re: PR target/107671: Make more use of btl/btq on x86_64.

2023-08-07 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 7, 2023 at 9:37 AM Roger Sayle  wrote:
>
>
> This patch is a partial solution to PR target/107671, updating Uros'
> patch from comment #4, to catch both bit set (setc) and bit not set
> (setnc) cases from the code in comment #2, when compiled on x86_64.
> Unfortunately, this is a partial solution, as the pointer variants
> in comment #1, aren't yet all optimized, and my attempts to check
> whether the 32-bit versions are optimized with -m32 revealed they
> also need further improvement.  (Some of) These remaining issues
> might best be fixed in the middle-end, in either match.pd or the
> RTL optimizers, so I thought it reasonable to submit this independent
> backend piece, and gain/bank the improvements on x86_64.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-08-07  Roger Sayle  
> Uros Bizjak  
>
> gcc/ChangeLog
> PR target/107671
> * config/i386/i386.md (*bt_setc_mask): Allow the
> shift count to have a different mode (using SWI248) from either
> the bit-test or the result.
> (*bt_setnc_mask): New define_insn_and_split for the
> setnc (bit not set) case of the above pattern.
> (*btdi_setncsi_mask): New define_insn_and_split to handle the
> SImode result from a DImode bit-test variant of the above patterns.
> (*bt_setncqi_mask_2): New define_insn_and_split for the
> setnc (bit not set) version of *bt_setcqi_mask_2.
>
> gcc/testsuite/ChangeLog
> PR target/107671
> * gcc.target/i386/pr107671-1.c: New test case.
> * gcc.target/i386/pr107671-2.c: Likewise.

I am worried about the number of existing and new patterns that are
introduced to satisfy the creativity of the combine pass. The following
forms can be handled via zero_extract RTXes:

return ((v & (1 << (bitnum & 31))) != 0);
return ((v & (1L << (bitnum & 63))) != 0);
return (v >> (bitnum & 31)) & 1;
return (v >> (bitnum & 63)) & 1;

but there is no canonicalization for negative forms of the above constructs.
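
Spelled out as compilable C (a sketch reconstructed from the PR, not
verbatim from it), the positive and negative forms are:

/* Positive (setc) and negative (setnc) bit-test forms; ideally each
   compiles to bt followed by setc/setnc.  */
int bit_set32 (unsigned int v, int b)        { return (v >> (b & 31)) & 1; }
int bit_set64 (unsigned long long v, int b)  { return (v >> (b & 63)) & 1; }
int bit_clr32 (unsigned int v, int b)        { return !((v >> (b & 31)) & 1); }
int bit_clr64 (unsigned long long v, int b)  { return !((v >> (b & 63)) & 1); }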

For the above, the combine pass tries:

(set (reg:SI 95)
(zero_extract:SI (reg:SI 97)
(const_int 1 [0x1])
(and:SI (reg:SI 98)
(const_int 31 [0x1f]

that is necessary to handle the change of compare mode from CCZ to
CCC. However, negative forms try:

(set (reg:QI 96)
(eq:QI (zero_extract:SI (reg:SI 97)
(const_int 1 [0x1])
(and:SI (reg:SI 98)
(const_int 31 [0x1f])))
(const_int 0 [0])))

and:

(set (reg:SI 95)
(xor:SI (zero_extract:SI (reg:SI 97)
(const_int 1 [0x1])
(and:SI (reg:SI 98)
(const_int 31 [0x1f])))
(const_int 1 [0x1])))

and these are further different for SImode and DImode.

Ideally, we would simplify all forms to:

(set (reg:QI 96)
(eq:QI (zero_extract:SI (reg:SI 97)
(const_int 1 [0x1])
(and:SI (reg:SI 98)
(const_int 31 [0x1f])))
(const_int 0 [0])))

where inverted/non-inverted forms would emit ne/eq:QI ()
(const_int 0). The result would be zero-extended to DI or SImode,
depending on the target mode.

You can already see the problem of missing canonicalization in
i386.md, where define_insn_and_split patterns carry the comments:

;; Help combine recognize bt followed by setc
;; Help combine recognize bt followed by setnc

where totally different patterns are needed to match what combine produces.

The above problem is specific to setcc patterns, where output value is
derived from the input operand. jcc and cmov look OK.

If the following pattern would be tried by combine, then we would
handle all the testcases in the PR (plus some more, where output is in
different mode than input):

(set (reg:QI 96)
({eq,ne}:QI (zero_extract:SI (reg:SI 97)
(const_int 1 [0x1])
(and:SI (reg:SI 98)
(const_int 31 [0x1f])))
(const_int 0 [0])))

where QIreg is later zero-extended to the target width. In this case,
one define_insn_and_split pattern would handle all testcases from
PR107671.

Please also note that we can implement this transformation via a
combine splitter. The benefit of the combine splitter is that its
results are immediately "recognized", and new RTXes can be propagated
into subsequent combinations.

I have made some measurements with my proposed patch (as posted in the
PR), and the transformation never triggered (neither for a GCC build
nor when building the Linux kernel). So I wonder whether the added
number of patterns outweighs the benefits at all.

IMO, the correct way is to teach the combine pass some more about
bit-test functionality (to also pass negative forms via zero_extract
RTXes, see above). This would canonicalize the transformation and
prevent pattern explosion.

I consider bt to be quite an important instruction, 

Re: [PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-07 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 7, 2023 at 10:57 AM liuhongt  wrote:
>
> Similar to r14-2786-gade30fad6669e5, this patch handles V4HF/V2HFmode.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/110762
> * config/i386/mmx.md (3): Changed from define_insn
> to define_expand and break into ..
> (v4hf3): .. this.
> (divv4hf3): .. this.
> (v2hf3): .. this.
> (divv2hf3): .. this.
> (movd_v2hf_to_sse): New define_expand.
> (movq_<mode>_to_sse): Extend to V4HFmode.
> (mmxdoublevecmode): Ditto.
> (V2FI_V4HF): New mode iterator.
> * config/i386/sse.md (*vec_concatv4sf): Extend to handle V8HF
> by using mode iterator V4SF_V8HF, renamed to ..
> (*vec_concat): .. this.
> (*vec_concatv4sf_0): Extend to handle V8HF by using mode
> iterator V4SF_V8HF, renamed to ..
> (*vec_concat_0): .. this.
> (*vec_concatv8hf_movss): New define_insn.
> (V4SF_V8HF): New mode iterator.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110762-v4hf.c: New test.

LGTM.

Please also note the RFC patch [1] that relaxes clears for V2SFmode
with -fno-trapping-math. The patched compiler will then emit the same
code as clang does at -O2. This raises another question: should GCC
default to -fno-trapping-math?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625795.html

Thanks,
Uros.
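
For reference, a hedged sketch of the source-level effect (the typedef,
function name and flags are assumptions, not from the patch):

typedef _Float16 v4hf __attribute__ ((vector_size (8)));

/* With -O2 -mavx512fp16 -mavx512vl this becomes a single vmulph on an
   xmm register whose upper four lanes were sanitized to zero by the
   movq_v4hf_to_sse expansion, instead of being scalarized.  */
v4hf
mul_v4hf (v4hf a, v4hf b)
{
  return a * b;
}
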

> ---
>  gcc/config/i386/mmx.md| 109 +++---
>  gcc/config/i386/sse.md|  40 +--
>  gcc/testsuite/gcc.target/i386/pr110762-v4hf.c |  57 +
>  3 files changed, 177 insertions(+), 29 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110762-v4hf.c
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index 896af76a33f..88bdf084f54 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -79,9 +79,7 @@ (define_mode_iterator V_16_32_64
>  ;; V2S* modes
>  (define_mode_iterator V2FI [V2SF V2SI])
>
> -;; 4-byte and 8-byte float16 vector modes
> -(define_mode_iterator VHF_32_64 [V4HF V2HF])
> -
> +(define_mode_iterator V2FI_V4HF [V2SF V2SI V4HF])
>  ;; Mapping from integer vector mode to mnemonic suffix
>  (define_mode_attr mmxvecsize
>[(V8QI "b") (V4QI "b") (V2QI "b")
> @@ -108,7 +106,7 @@ (define_mode_attr mmxintvecmodelower
>
>  ;; Mapping of vector modes to a vector mode of double size
>  (define_mode_attr mmxdoublevecmode
> -  [(V2SF "V4SF") (V2SI "V4SI")])
> +  [(V2SF "V4SF") (V2SI "V4SI") (V4HF "V8HF")])
>
>  ;; Mapping of vector modes back to the scalar modes
>  (define_mode_attr mmxscalarmode
> @@ -594,7 +592,7 @@ (define_insn "sse_movntq"
>  (define_expand "movq_<mode>_to_sse"
>[(set (match_operand:<mmxdoublevecmode> 0 "register_operand")
> (vec_concat:<mmxdoublevecmode>
> - (match_operand:V2FI 1 "nonimmediate_operand")
> + (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
>   (match_dup 2)))]
>"TARGET_SSE2"
>"operands[2] = CONST0_RTX (<MODE>mode);")
> @@ -1927,21 +1925,94 @@ (define_expand "lroundv2sfv2si2"
>  ;;
>  ;
>
> -(define_insn "3"
> -  [(set (match_operand:VHF_32_64 0 "register_operand" "=v")
> -   (plusminusmultdiv:VHF_32_64
> - (match_operand:VHF_32_64 1 "register_operand" "v")
> - (match_operand:VHF_32_64 2 "register_operand" "v")))]
> +(define_expand "v4hf3"
> +  [(set (match_operand:V4HF 0 "register_operand")
> +   (plusminusmult:V4HF
> + (match_operand:V4HF 1 "nonimmediate_operand")
> + (match_operand:V4HF 2 "nonimmediate_operand")))]
>"TARGET_AVX512FP16 && TARGET_AVX512VL"
> -  "vph\t{%2, %1, %0|%0, %1, %2}"
> -  [(set (attr "type")
> -  (cond [(match_test " == MULT")
> -   (const_string "ssemul")
> -(match_test " == DIV")
> -   (const_string "ssediv")]
> -(const_string "sseadd")))
> -   (set_attr "prefix" "evex")
> -   (set_attr "mode" "V8HF")])
> +{
> +  rtx op2 = gen_reg_rtx (V8HFmode);
> +  rtx op1 = gen_reg_rtx (V8HFmode);
> +  rtx op0 = gen_reg_rtx (V8HFmode);
> +
> +  emit_insn (gen_movq_v4hf_to_sse (op2, operands[2]));
> +  emit_insn (gen_movq_v4hf_to_sse (op1, operands[1]));
> +
> +  emit_insn (gen_v8hf3 (op0, op1, op2));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (V4HFmode, op0, V8HFmode));
> +  DONE;
> +})
> +
> +(define_expand "divv4hf3"
> +  [(set (match_operand:V4HF 0 "register_operand")
> +   (div:V4HF
> + (match_operand:V4HF 1 "nonimmediate_operand")
> + (match_operand:V4HF 2 "nonimmediate_operand")))]
> +  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> +{
> +  rtx op2 = gen_reg_rtx (V8HFmode);
> +  rtx op1 = gen_reg_rtx (V8HFmode);
> +  rtx op0 = gen_reg_rtx (V8HFmode);
> +
> +  emit_insn (gen_movq_v4hf_to_sse (op1, operands[1]));
> +  rtx tmp = gen_rtx_VEC_CONCAT (V8HFmode, operands[2],
> +   

Re: [x86 PATCH] Split SUBREGs of SSE vector registers into vec_select insns.

2023-08-03 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 3, 2023 at 9:10 AM Roger Sayle  wrote:
>
>
> This patch is the final piece in the series to improve the ABI issues
> affecting PR 88873.  The previous patches tackled inserting DFmode
> values into V2DFmode registers, by introducing insvti_{low,high}part
> patterns.  This patch improves the extraction of DFmode values from
> v2DFmode registers via TImode intermediates.
>
> I'd initially thought this would require new extvti_{low,high}part
> patterns to be defined, but all that's required is to recognize that
> the SUBREG idioms produced by combine are equivalent to (forms of)
> vec_select patterns.  The target-independent middle-end can't be sure
> that the appropriate vec_select instruction exists on the target,
> hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
> but the backend can provide a define_split stating where and when
> this is useful, for example, considering whether the operand is in
> memory, or whether !TARGET_SSE_MATH and the destination is i387.
>
> For pr88873.c, gcc -O2 -march=cascadelake currently generates:
>
> foo:vpunpcklqdq %xmm3, %xmm2, %xmm7
> vpunpcklqdq %xmm1, %xmm0, %xmm6
> vpunpcklqdq %xmm5, %xmm4, %xmm2
> vmovdqa %xmm7, -24(%rsp)
> vmovdqa %xmm6, %xmm1
> movq-16(%rsp), %rax
> vpinsrq $1, %rax, %xmm7, %xmm4
> vmovapd %xmm4, %xmm6
> vfmadd132pd %xmm1, %xmm2, %xmm6
> vmovapd %xmm6, -24(%rsp)
> vmovsd  -16(%rsp), %xmm1
> vmovsd  -24(%rsp), %xmm0
> ret
>
> with this patch, we now generate:
>
> foo:vpunpcklqdq %xmm1, %xmm0, %xmm6
> vpunpcklqdq %xmm3, %xmm2, %xmm7
> vpunpcklqdq %xmm5, %xmm4, %xmm2
> vmovdqa %xmm6, %xmm1
> vfmadd132pd %xmm7, %xmm2, %xmm1
> vmovsd  %xmm1, %xmm1, %xmm0
> vunpckhpd   %xmm1, %xmm1, %xmm1
> ret
>
> The improvement is even more dramatic when compared to the original
> 29 instructions shown in comment #8.  GCC 13, for example, required
> 12 transfers to/from memory.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-08-03  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/sse.md (define_split): Convert highpart:DF extract
> from V2DFmode register into a sse2_storehpd instruction.
> (define_split): Likewise, convert lowpart:DF extract from V2DF
> register into a sse2_storelpd instruction.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/pr88873.c: Tweak to check for improved code.

OK.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>
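
For illustration, a hedged sketch of the extraction idiom involved (the
function is illustrative, not the PR's testcase):

#include <immintrin.h>

/* Reading the high double of a V2DF value; combine expresses this as
   a SUBREG of the vector register, which the new split turns into a
   single vunpckhpd (register) or vmovhpd/storehpd (memory) instead of
   a bounce through the stack.  */
double
high_df (__m128d v)
{
  return v[1];
}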


Re: [x86 PATCH] PR target/110792: Early clobber issues with rot32di2_doubleword.

2023-08-02 Thread Uros Bizjak via Gcc-patches
On Thu, Aug 3, 2023 at 12:18 AM Roger Sayle  wrote:
>
>
> This patch is a conservative fix for PR target/110792, a wrong-code
> regression affecting doubleword rotations by BITS_PER_WORD, which
> effectively swaps the highpart and lowpart words, when the source to be
> rotated resides in memory. The issue is that if the register used to
> hold the lowpart of the destination is mentioned in the address of
> the memory operand, the current define_insn_and_split unintentionally
> clobbers it before reading the highpart.
>
> Hence, for the testcase, the incorrectly generated code looks like:
>
> salq$4, %rdi// calculate address
> movqWHIRL_S+8(%rdi), %rdi   // accidentally clobber addr
> movqWHIRL_S(%rdi), %rbp // load (wrong) lowpart
>
> Traditionally, the textbook way to fix this would be to add an
> explicit early clobber to the instruction's constraints.
>
>  (define_insn_and_split "<insn>32di2_doubleword"
> - [(set (match_operand:DI 0 "register_operand" "=r,r,r")
> + [(set (match_operand:DI 0 "register_operand" "=r,r,&r")
> (any_rotate:DI (match_operand:DI 1 "nonimmediate_operand" "0,r,o")
>(const_int 32)))]
>
> but unfortunately this currently generates significantly worse code,
> due to a strange choice of reloads (effectively memcpy), which ends up
> looking like:
>
> salq$4, %rdi// calculate address
> movdqa  WHIRL_S(%rdi), %xmm0// load the double word in SSE reg.
> movaps  %xmm0, -16(%rsp)// store the SSE reg back to the
> stack
> movq-8(%rsp), %rdi  // load highpart
> movq-16(%rsp), %rbp // load lowpart
>
> Note that reload's "&" doesn't distinguish between the memory being
> early clobbered, vs the registers used in an addressing mode being
> early clobbered.
>
> The fix proposed in this patch is to remove the third alternative, that
> allowed offsetable memory as an operand, forcing reload to place the
> operand into a register before the rotation.  This results in:
>
> salq$4, %rdi
> movqWHIRL_S(%rdi), %rax
> movqWHIRL_S+8(%rdi), %rdi
> movq%rax, %rbp
>
> I believe there's a more advanced solution, by swapping the order of
> the loads (if first destination register is mentioned in the address),
> or inserting a lea insn (if both destination registers are mentioned
> in the address), but this fix is a minimal "safe" solution, that
> should hopefully be suitable for backporting.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-08-02  Roger Sayle  
>
> gcc/ChangeLog
> PR target/110792
> * config/i386/i386.md (<insn>ti3): For rotations by 64 bits
> place operand in a register before gen_<insn>64ti2_doubleword.
> (<insn>di3): Likewise, for rotations by 32 bits, place
> operand in a register before gen_<insn>32di2_doubleword.
> (<insn>32di2_doubleword): Constrain operand to be in register.
> (<insn>64ti2_doubleword): Likewise.
>
> gcc/testsuite/ChangeLog
> PR target/110792
> * g++.target/i386/pr110792.C: New 32-bit C++ test case.
> * gcc.target/i386/pr110792.c: New 64-bit C test case.

OK.

Thanks,
Uros.
>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128

2023-08-01 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 2, 2023 at 3:33 AM liuhongt  wrote:
>
> In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics
> after AVX2; it was rejected as
> > The instruction is reachable only as __builtin_ia32_lddqu* (aka
> > _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> > think that in this case, the compiler should not be too smart and
> > change the instruction behind the programmer's back. The caveats are
> > also explained at length in the ISA manual.
>
> So the patch is more conservative and only optimizes vlddqu + vinserti128
> to vbroadcasti128.
> vlddqu + vinserti128 uses a shuffle port in addition to the load port,
> compared to vbroadcasti128.  From a latency perspective, vbroadcasti128
> is no worse than vlddqu + vinserti128.
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625122.html
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (*avx2_lddqu_inserti_to_bcasti): New
> pre_reload define_insn_and_split.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/vlddqu_vinserti128.c: New test.

OK with a small change below.

Thanks,
Uros.

> ---
>  gcc/config/i386/sse.md | 18 ++
>  .../gcc.target/i386/vlddqu_vinserti128.c   | 11 +++
>  2 files changed, 29 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 2d81347c7b6..4bdd2b43ba7 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -26600,6 +26600,24 @@ (define_insn "avx2_vbroadcasti128_<mode>"
> (set_attr "prefix" "vex,evex,evex")
> (set_attr "mode" "OI")])
>
> +;; Optimize vlddqu + vinserti128 to vbroadcasti128; the former uses an
> +;; extra shuffle port in addition to the load port compared to the latter.
> +;; From a latency perspective, vbroadcasti128 is no worse.
> +(define_insn_and_split "*avx2_lddqu_inserti_to_bcasti"
> +  [(set (match_operand:V4DI 0 "register_operand" "=x,v,v")
> +   (vec_concat:V4DI
> + (subreg:V2DI
> +   (unspec:V16QI [(match_operand:V16QI 1 "memory_operand")]
> + UNSPEC_LDDQU) 0)
> + (subreg:V2DI (unspec:V16QI [(match_dup 1)]
> + UNSPEC_LDDQU) 0)))]
> +  "TARGET_AVX2 && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +   (vec_concat:V4DI (match_dup 1) (match_dup 1)))]
> +  "operands[1] = adjust_address (operands[1], V2DImode, 0);")

No need to validate address before reload, adjust_address_nv can be used.

> +
>  ;; Modes handled by AVX vec_dup patterns.
>  (define_mode_iterator AVX_VEC_DUP_MODE
>[V8SI V8SF V4DI V4DF])
> diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c 
> b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> new file mode 100644
> index 000..29699a5fa7f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> +
> +#include 
> +__m256i foo(void *data) {
> +__m128i X1 = _mm_lddqu_si128((__m128i*)data);
> +__m256i V1 = _mm256_broadcastsi128_si256 (X1);
> +return V1;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-07-31 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 31, 2023 at 11:40 AM Richard Biener  wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option, note you lack invoke.texi documentation where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact then NaNs or denormals leak
> into the upper halves and cross-reference -mdaz-ftz.

Yes, this is my plan (lack of documentation is due to RFC status of
the patch). OTOH, Hongtao has some other ideas in the PR, so I'll wait
with the patch a bit.

Thanks,
Uros.

> Thanks,
> Richard.
>
> > PR target/110832
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
> > * config/i386/i386.opt (mmmxfp-with-sse): New option.
> > * config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
> > upper part of V2SFmode register with -fno-trapping-math.
> > (v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
> > (divv2sf3): Ditto.
> > (v2sf3): Ditto.
> > (sqrtv2sf2): Ditto.
> > (*mmx_haddv2sf3_low): Ditto.
> > (*mmx_hsubv2sf3_low): Ditto.
> > (vec_addsubv2sf3): Ditto.
> > (vec_cmpv2sfv2si): Ditto.
> > (vcondv2sf): Ditto.
> > (fmav2sf4): Ditto.
> > (fmsv2sf4): Ditto.
> > (fnmav2sf4): Ditto.
> > (fnmsv2sf4): Ditto.
> > (fix_truncv2sfv2si2): Ditto.
> > (fixuns_truncv2sfv2si2): Ditto.
> > (floatv2siv2sf2): Ditto.
> > (floatunsv2siv2sf2): Ditto.
> > (nearbyintv2sf2): Ditto.
> > (rintv2sf2): Ditto.
> > (lrintv2sfv2si2): Ditto.
> > (ceilv2sf2): Ditto.
> > (lceilv2sfv2si2): Ditto.
> > (floorv2sf2): Ditto.
> > (lfloorv2sfv2si2): Ditto.
> > (btruncv2sf2): Ditto.
> > (roundv2sf2): Ditto.
> > (lroundv2sfv2si2): Ditto.
> >
> > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >
> > Uros.
> >
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]

2023-07-30 Thread Uros Bizjak via Gcc-patches
Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
named patterns in order to avoid generation of partial vector V4SFmode
trapping instructions.

The new option is enabled by default, because even with sanitization,
a small but consistent speed up of 2 to 3% with Polyhedron capacita
benchmark can be achieved vs. scalar code.

Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
vs. scalar code.  This is what clang does by default, as it defaults
to -fno-trapping-math.

PR target/110832

gcc/ChangeLog:

* config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
* config/i386/i386.opt (mmmxfp-with-sse): New option.
* config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
upper part of V2SFmode register with -fno-trapping-math.
(v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
(divv2sf3): Ditto.
(v2sf3): Ditto.
(sqrtv2sf2): Ditto.
(*mmx_haddv2sf3_low): Ditto.
(*mmx_hsubv2sf3_low): Ditto.
(vec_addsubv2sf3): Ditto.
(vec_cmpv2sfv2si): Ditto.
(vcondv2sf): Ditto.
(fmav2sf4): Ditto.
(fmsv2sf4): Ditto.
(fnmav2sf4): Ditto.
(fnmsv2sf4): Ditto.
(fix_truncv2sfv2si2): Ditto.
(fixuns_truncv2sfv2si2): Ditto.
(floatv2siv2sf2): Ditto.
(floatunsv2siv2sf2): Ditto.
(nearbyintv2sf2): Ditto.
(rintv2sf2): Ditto.
(lrintv2sfv2si2): Ditto.
(ceilv2sf2): Ditto.
(lceilv2sfv2si2): Ditto.
(floorv2sf2): Ditto.
(lfloorv2sfv2si2): Ditto.
(btruncv2sf2): Ditto.
(roundv2sf2): Ditto.
(lroundv2sfv2si2): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index ef342fcee9b..af72b6c48a9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -50,6 +50,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
 #define TARGET_16BIT_P(x)  TARGET_CODE16_P(x)
 
 #define TARGET_MMX_WITH_SSE    (TARGET_64BIT && TARGET_SSE2)
+#define TARGET_MMXFP_WITH_SSE  (TARGET_MMX_WITH_SSE && ix86_mmxfp_with_sse)
 
 #include "config/vxworks-dummy.h"
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..1b65fed5daf 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -670,6 +670,10 @@ m3dnowa
 Target Mask(ISA_3DNOW_A) Var(ix86_isa_flags) Save
 Support Athlon 3Dnow! built-in functions.
 
+mmmxfp-with-sse
+Target Var(ix86_mmxfp_with_sse) Init(1)
+Enable MMX floating point vectors in SSE registers
+
 msse
 Target Mask(ISA_SSE) Var(ix86_isa_flags) Save
 Support MMX and SSE built-in functions and code generation.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 896af76a33f..0555da9022b 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -597,7 +597,18 @@ (define_expand "movq_<mode>_to_sse"
  (match_operand:V2FI 1 "nonimmediate_operand")
  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (mode);")
+{
+  if (mode == V2SFmode
+  && !flag_trapping_math)
+{
+  rtx op1 = force_reg (mode, operands[1]);
+  emit_move_insn (operands[0], lowpart_subreg (mode,
+  op1, mode));
+  DONE;
+}
+
+  operands[2] = CONST0_RTX (mode);
+})
 
 ;
 ;;
@@ -650,7 +661,7 @@ (define_expand "v2sf3"
(plusminusmult:V2SF
  (match_operand:V2SF 1 "nonimmediate_operand")
  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -728,7 +739,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
(div:V2SF (match_operand:V2SF 1 "register_operand")
  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -750,7 +761,7 @@ (define_expand "v2sf3"
 (smaxmin:V2SF
  (match_operand:V2SF 1 "register_operand")
  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -852,7 +863,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -933,7 +944,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
  (vec_select:SF
(match_dup 1)
(parallel [(match_operand:SI 3 "const_0_to_1_operand")]]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE
&& INTVAL (operands[2]) != INTVAL (operands[3])
&& ix86_pre_reload_split ()"
   "#"
@@ -979,7 +990,7 

[committed] testsuite: Fix gfortran.dg/ieee/comparisons_3.F90 testsuite failures

2023-07-26 Thread Uros Bizjak via Gcc-patches
The testcase should use dg-additional-options instead of dg-options so as
not to overwrite the default compile flags, which include the path for
finding the IEEE modules.

gcc/testsuite/ChangeLog:

* gfortran.dg/ieee/comparisons_3.F90: Use dg-additional-options
instead of dg-options.

Tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/testsuite/gfortran.dg/ieee/comparisons_3.F90 
b/gcc/testsuite/gfortran.dg/ieee/comparisons_3.F90
index c15678fec35..40e8466c132 100644
--- a/gcc/testsuite/gfortran.dg/ieee/comparisons_3.F90
+++ b/gcc/testsuite/gfortran.dg/ieee/comparisons_3.F90
@@ -1,5 +1,5 @@
 ! { dg-do run }
-! { dg-options "-ffree-line-length-none" }
+! { dg-additional-options "-ffree-line-length-none" }
 program foo
   use ieee_arithmetic
   use iso_fortran_env


[committed] i386: Clear upper half of XMM register for V2SFmode operations [PR110762]

2023-07-26 Thread Uros Bizjak via Gcc-patches
Clear the upper half of a V4SFmode operand register in front of all
potentially trapping instructions. The testcase:

--cut here--
typedef float v2sf __attribute__((vector_size(8)));
typedef float v4sf __attribute__((vector_size(16)));

v2sf test(v4sf x, v4sf y)
{
  v2sf x2, y2;

  x2 = __builtin_shufflevector (x, x, 0, 1);
  y2 = __builtin_shufflevector (y, y, 0, 1);

  return x2 + y2;
}
--cut here--

now compiles to:

        movq    %xmm1, %xmm1    # 9  [c=4 l=4]  *vec_concatv4sf_0
        movq    %xmm0, %xmm0    # 10 [c=4 l=4]  *vec_concatv4sf_0
        addps   %xmm1, %xmm0    # 11 [c=12 l=3]  *addv4sf3/0

This approach addresses issues with exceptions, as well as issues with
denormal/invalid values.  An obvious exception to the rule is division,
where a value != 0.0 should be loaded into the upper half of the
denominator to avoid a division-by-zero exception.
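
For instance, a plain V2SF division (a hypothetical illustration, not
one of the new testcases) must not have 0.0 in the unused upper lanes
of the emulated V4SF denominator:

--cut here--
typedef float v2sf __attribute__((vector_size(8)));

v2sf vdiv (v2sf a, v2sf b)
{
  /* The upper two lanes of the V4SF denominator are loaded with a
     nonzero value before the division is emitted.  */
  return a / b;
}
--cut here--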

The patch effectively tightens the solution from PR95046 by clearing upper
halves of all operand registers before every potentially trapping instruction.
The testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf test (v2sf a, v2sf b, v2sf c)
{
  return a * b - c;
}
--cut here--

compiles to:

        movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
        movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
        movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
        mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
        movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0
        subps   %xmm2, %xmm0    # 14 [c=12 l=3]  *subv4sf3/0

The implementation emits V4SFmode operation, so we can remove all "emulated"
SSE2 V2SFmode trapping instructions and remove "emulated" SSE2 V2SFmode
alternatives from 3dNOW! insn patterns.

PR target/110762

gcc/ChangeLog:

* config/i386/i386.md (plusminusmult): New code iterator.
* config/i386/mmx.md (mmxdoublevecmode): New mode attribute.
(movq__to_sse): New expander.
(v2sf3): Macroize expander from addv2sf3,
subv2sf3 and mulv2sf3 using plusminusmult code iterator.  Rewrite
as a wrapper around V4SFmode operation.
(mmx_addv2sf3): Change operand 1 and operand 2 predicates to
nonimmediate_operand.
(*mmx_addv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(mmx_subv2sf3): Change operand 2 predicate to nonimmediate_operand.
(mmx_subrv2sf3): Change operand 1 predicate to nonimmediate_operand.
(*mmx_subv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(mmx_mulv2sf3): Change operand 1 and operand 2 predicates to
nonimmediate_operand.
(*mmx_mulv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(divv2sf3): Rewrite as a wrapper around V4SFmode operation.
(v2sf3): Ditto.
(mmx_v2sf3): Change operand 1 and operand 2
predicates to nonimmediate_operand.
(*mmx_v2sf3): Remove SSE alternatives.  Change
operand 1 and operand 2 predicates to nonimmediate_operand.
(mmx_ieee_v2sf3): Ditto.
(sqrtv2sf2): Rewrite as a wrapper around V4SFmode operation.
(*mmx_haddv2sf3_low): Ditto.
(*mmx_hsubv2sf3_low): Ditto.
(vec_addsubv2sf3): Ditto.
(*mmx_maskcmpv2sf3_comm): Remove.
(*mmx_maskcmpv2sf3): Remove.
(vec_cmpv2sfv2si): Rewrite as a wrapper around V4SFmode operation.
(vcondv2sf): Ditto.
(fmav2sf4): Ditto.
(fmsv2sf4): Ditto.
(fnmav2sf4): Ditto.
(fnmsv2sf4): Ditto.
(fix_truncv2sfv2si2): Ditto.
(fixuns_truncv2sfv2si2): Ditto.
(mmx_fix_truncv2sfv2si2): Remove SSE alternatives.
Change operand 1 predicate to nonimmediate_operand.
(floatv2siv2sf2): Rewrite as a wrapper around V4SFmode operation.
(floatunsv2siv2sf2): Ditto.
(mmx_floatv2siv2sf2): Remove SSE alternatives.
Change operand 1 predicate to nonimmediate_operand.
(nearbyintv2sf2): Rewrite as a wrapper around V4SFmode operation.
(rintv2sf2): Ditto.
(lrintv2sfv2si2): Ditto.
(ceilv2sf2): Ditto.
(lceilv2sfv2si2): Ditto.
(floorv2sf2): Ditto.
(lfloorv2sfv2si2): Ditto.
(btruncv2sf2): Ditto.
(roundv2sf2): Ditto.
(lroundv2sfv2si2): Ditto.
(*mmx_roundv2sf2): Remove.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110762.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 4db210cc795..cedba3b90f0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -933,6 +933,7 @@ (define_asm_attributes
(set_attr "type" "multi")])
 
 (define_code_iterator plusminus [plus minus])
+(define_code_iterator plusminusmult [plus minus mult])
 (define_code_iterator plusminusmultdiv [plus minus mult div])
 
 (define_code_iterator sat_plusminus [ss_plus us_plus ss_minus us_minus])
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md

Re: [x86 PATCH] Don't use insvti_{high, low}part with -O0 (for compile-time).

2023-07-22 Thread Uros Bizjak via Gcc-patches
On Sat, Jul 22, 2023 at 4:17 PM Roger Sayle  wrote:
>
>
> This patch attempts to help with PR rtl-optimization/110587, a regression
> of -O0 compile time for the pathological pr28071.c.  My recent patch helps
> a bit, but hasn't returned -O0 compile-time to where it was before my
> ix86_expand_move changes.  The obvious solution/workaround is to guard
> these new TImode parameter passing optimizations with "&& optimize", so
> they don't trigger when compiling with -O0.  The very minor complication
> is that "&& optimize" alone leads to the regression of pr110533.c, where
> our improved TImode parameter passing fixes a wrong-code issue with naked
> functions, importantly, when compiling with -O0.  This should explain
> the one line fix below "&& (optimize || ix86_function_naked (cfun))".
>
> I've an additional fix/tweak or two for this compile-time issue, but
> this change eliminates the part of the regression that I've caused.
>
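> The pr110533.c scenario is roughly of this shape (a hypothetical
> sketch, not the actual testcase): a naked function that takes a TImode
> argument and must still receive it correctly at -O0.
>
> --cut here--
> __attribute__((naked))
> void foo (__int128 x)
> {
>   __asm__ ("ret");
> }
> --cut here--
>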
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
> 2023-07-22  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_move): Disable the
> 64-bit insertions into TImode optimizations with -O0, unless
> the function has the "naked" attribute (for PR target/110533).

LGTM, but please add some comments explaining why this is done only when
optimizing (please mention PR110587), and especially mention PR110533 to
explain why the naked attribute is allowed.

Thanks,
Uros.

> Cheers,
> Roger
> --
>


Re: [x86 PATCH] Use QImode for offsets in zero_extract/sign_extract in i386.md

2023-07-22 Thread Uros Bizjak via Gcc-patches
On Sat, Jul 22, 2023 at 5:37 PM Roger Sayle  wrote:
>
>
> As suggested by Uros, this patch changes the ZERO_EXTRACTs and SIGN_EXTRACTs
> in i386.md to consistently use QImode for bit offsets (i.e. third and fourth
> operands), matching the use of QImode for bit counts in shifts and rotates.
>
> There's no change in functionality, and the new patterns simply ensure that
> we continue to generate the same code (match revised patterns) as before.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-07-22  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (extv): Use QImode for offsets.
> (extzv): Likewise.
> (insv): Likewise.
> (*testqi_ext_3): Likewise.
> (*btr_2): Likewise.
> (define_split): Likewise.
> (*btsq_imm): Likewise.
> (*btrq_imm): Likewise.
> (*btcq_imm): Likewise.
> (define_peephole2 x3): Likewise.
> (*bt): Likewise
> (*bt_mask): New define_insn_and_split.
> (*jcc_bt): Use QImode for offsets.
> (*jcc_bt_1): Delete obsolete pattern.
> (*jcc_bt_mask): Use QImode offsets.
> (*jcc_bt_mask_1): Likewise.
> (define_split): Likewise.
> (*bt_setcqi): Likewise.
> (*bt_setncqi): Likewise.
> (*bt_setnc): Likewise.
> (*bt_setncqi_2): Likewise.
> (*bt_setc_mask): New define_insn_and_split.
> (bmi2_bzhi_3): Use QImode offsets.
> (*bmi2_bzhi_3): Likewise.
> (*bmi2_bzhi_3_1): Likewise.
> (*bmi2_bzhi_3_1_ccz): Likewise.
> (@tbm_bextri_): Likewise.

OK.

Thanks,
Uros.

>
>
> Thanks,
> Roger
> --
>


[committed] i386: Double-word sign-extension missed-optimization [PR110717]

2023-07-20 Thread Uros Bizjak via Gcc-patches
When sign-extending the value in a double-word register pair using an
ashift/ashiftrt sequence with the same immediate count less than the word
width, there is no need to shift the lower word of the value.  The
sign-extension could be limited to the upper word, but we uselessly shift
the lower word along with it:
        movq    %rdi, %rax
        movq    %rsi, %rdx
        shldq   $59, %rdi, %rdx
        salq    $59, %rax
        shrdq   $59, %rdx, %rax
        sarq    $59, %rdx
        ret
for -m64 and
        movl    4(%esp), %eax
        movl    8(%esp), %edx
        shldl   $27, %eax, %edx
        sall    $27, %eax
        shrdl   $27, %edx, %eax
        sarl    $27, %edx
        ret
for -m32.
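
The kind of source that produces the sequences above is, for example
(a sketch; the actual pr110717.c testcase may differ):

--cut here--
__int128 foo (__int128 x)
{
  /* Sign-extend the low 69 bits; both immediate counts are 59,
     less than the 64-bit word width.  */
  return (x << 59) >> 59;
}
--cut here--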

The patch introduces a new post-reload splitter to provide the combined
ASHIFTRT/SHIFT instruction pattern.  The instruction is split into a
sequence of SAL and SAR insns with the same count immediate operand:
        movq    %rsi, %rdx
        movq    %rdi, %rax
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

Some complication is required to properly handle the STV transform, where
we emit a sequence of DImode PSLLQ and PSRAQ insns for 32-bit AVX512VL
targets when profitable.

The patch also fixes a small oversight and enables the STV transform of
SImode ASHIFTRT to PSRAD for SSE2 targets as well.

PR target/110717

gcc/ChangeLog:

* config/i386/i386-features.cc
(general_scalar_chain::compute_convert_gain): Calculate gain
for extend higpart case.
(general_scalar_chain::convert_op): Handle
ASHIFTRT/ASHIFT combined RTX.
(general_scalar_to_vector_candidate_p): Enable ASHIFTRT for
SImode for SSE2 targets.  Handle ASHIFTRT/ASHIFT combined RTX.
* config/i386/i386.md (*extend<dwi>2_doubleword_highpart):
New define_insn_and_split pattern.
(*extendv2di2_highpart_stv): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110717.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 4d69251d4f5..f801a8fc94a 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -572,6 +572,9 @@ general_scalar_chain::compute_convert_gain ()
  {
if (INTVAL (XEXP (src, 1)) >= 32)
  igain += ix86_cost->add;
+   /* Gain for extend highpart case.  */
+   else if (GET_CODE (XEXP (src, 0)) == ASHIFT)
+ igain += ix86_cost->shift_const - ix86_cost->sse_op;
else
  igain += ix86_cost->shift_const;
  }
@@ -951,7 +954,8 @@ general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
-  if (GET_CODE (*op) == NOT)
+  if (GET_CODE (*op) == NOT
+  || GET_CODE (*op) == ASHIFT)
 {
      convert_op (&XEXP (*op, 0), insn);
   PUT_MODE (*op, vmode);
@@ -2120,7 +2124,7 @@ general_scalar_to_vector_candidate_p (rtx_insn *insn, 
enum machine_mode mode)
   switch (GET_CODE (src))
 {
 case ASHIFTRT:
-  if (!TARGET_AVX512VL)
+  if (mode == DImode && !TARGET_AVX512VL)
return false;
   /* FALLTHRU */
 
@@ -2131,6 +2135,14 @@ general_scalar_to_vector_candidate_p (rtx_insn *insn, 
enum machine_mode mode)
   if (!CONST_INT_P (XEXP (src, 1))
  || !IN_RANGE (INTVAL (XEXP (src, 1)), 0, GET_MODE_BITSIZE (mode)-1))
return false;
+
+  /* Check for extend highpart case.  */
+  if (mode != DImode
+ || GET_CODE (src) != ASHIFTRT
+ || GET_CODE (XEXP (src, 0)) != ASHIFT)
+   break;
+
+  src = XEXP (src, 0);
   break;
 
 case SMAX:
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 8c54aa5e981..4db210cc795 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -15292,6 +15292,41 @@ (define_insn "*qi_ext_2"
(const_string "0")
(const_string "*")))
(set_attr "mode" "QI")])
+
+(define_insn_and_split "*extend2_doubleword_highpart"
+  [(set (match_operand: 0 "register_operand" "=r")
+   (ashiftrt:
+ (ashift: (match_operand: 1 "nonimmediate_operand" "0")
+   (match_operand:QI 2 "const_int_operand"))
+ (match_operand:QI 3 "const_int_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "INTVAL (operands[2]) == INTVAL (operands[3])
+   && UINTVAL (operands[2]) <  * BITS_PER_UNIT"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (match_dup 4)
+  (ashift:DWIH (match_dup 4) (match_dup 2)))
+ (clobber (reg:CC FLAGS_REG))])
+   (parallel [(set (match_dup 4)
+  (ashiftrt:DWIH (match_dup 4) (match_dup 2)))
+ (clobber (reg:CC FLAGS_REG))])]
+  "split_double_mode (mode, [0], 1, [0], 
[4]);")
+
+(define_insn_and_split "*extendv2di2_highpart_stv"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+   (ashiftrt:V2DI
+ (ashift:V2DI (match_operand:V2DI 1 "nonimmediate_operand" "vm")
+  (match_operand:QI 2 "const_int_operand"))
+ 

Re: [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX

2023-07-20 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 20, 2023 at 9:35 AM liuhongt  wrote:
>
> For Intel processors, from TARGET_AVX on, vmovdqu is optimized to be as
> fast as vlddqu, so UNSPEC_LDDQU can be removed to enable more
> optimizations.
> Can someone confirm this with AMD folks?
> If AMD doesn't like such an optimization, I'll put it under
> micro-architecture tuning.

The instruction is reachable only as __builtin_ia32_lddqu* (aka
_mm_lddqu_si*), so it was chosen by the programmer for a reason. I
think that in this case, the compiler should not be too smart and
change the instruction behind the programmer's back. The caveats are
also explained at length in the ISA manual.

Uros.

> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> If AMD also like such optimization, Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (_lddqu): Change to
> define_expand, expand as simple move when TARGET_AVX
> && ( == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
> The original define_insn is renamed to
> ..
> (_lddqu): .. this.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/vlddqu_vinserti128.c: New test.
> ---
>  gcc/config/i386/sse.md| 15 ++-
>  .../gcc.target/i386/vlddqu_vinserti128.c  | 11 +++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 2d81347c7b6..d571a78f4c4 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1835,7 +1835,20 @@ (define_peephole2
>[(set (match_dup 4) (match_dup 1))]
>"operands[4] = adjust_address (operands[0], V2DFmode, 0);")
>
> -(define_insn "_lddqu"
> +(define_expand "_lddqu"
> +  [(set (match_operand:VI1 0 "register_operand")
> +   (unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
> +   UNSPEC_LDDQU))]
> +  "TARGET_SSE3"
> +{
> +  if (TARGET_AVX && ( == 16 || 
> !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
> +{
> +  emit_move_insn (operands[0], operands[1]);
> +  DONE;
> +}
> +})
> +
> +(define_insn "*_lddqu"
>[(set (match_operand:VI1 0 "register_operand" "=x")
> (unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
> UNSPEC_LDDQU))]
> diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c 
> b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> new file mode 100644
> index 000..29699a5fa7f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> +
> +#include 
> +__m256i foo(void *data) {
> +__m128i X1 = _mm_lddqu_si128((__m128i*)data);
> +__m256i V1 = _mm256_broadcastsi128_si256 (X1);
> +return V1;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [x86_64 PATCH] More TImode parameter passing improvements.

2023-07-20 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 20, 2023 at 9:44 AM Roger Sayle  wrote:
>
>
> Hi Uros,
>
> > From: Uros Bizjak 
> > Sent: 20 July 2023 07:50
> >
> > On Wed, Jul 19, 2023 at 10:07 PM Roger Sayle 
> > wrote:
> > >
> > > This patch is the next piece of a solution to the x86_64 ABI issues in
> > > PR 88873.  This splits the *concat<mode><dwi>3_3 define_insn_and_split
> > > into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
> > > *concatsidi3_3.  This allows us to add an additional alternative to
> > > the 64-bit version, enabling the register allocator to perform
> > > this operation using SSE registers, which is implemented/split after
> > > reload using vec_concatv2di.
> > >
> > > To demonstrate the improvement, the test case from PR88873:
> > >
> > > typedef struct { double x, y; } s_t;
> > >
> > > s_t foo (s_t a, s_t b, s_t c)
> > > {
> > >   return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y,
> > > c.y) }; }
> > >
> > > when compiled with -O2 -march=cascadelake, currently generates:
> > >
> > > foo:    vmovq   %xmm2, -56(%rsp)
> > > movq    -56(%rsp), %rax
> > > vmovq   %xmm3, -48(%rsp)
> > > vmovq   %xmm4, -40(%rsp)
> > > movq    -48(%rsp), %rcx
> > > vmovq   %xmm5, -32(%rsp)
> > > vmovq   %rax, %xmm6
> > > movq    -40(%rsp), %rax
> > > movq    -32(%rsp), %rsi
> > > vpinsrq $1, %rcx, %xmm6, %xmm6
> > > vmovq   %xmm0, -24(%rsp)
> > > vmovq   %rax, %xmm7
> > > vmovq   %xmm1, -16(%rsp)
> > > vmovapd %xmm6, %xmm2
> > > vpinsrq $1, %rsi, %xmm7, %xmm7
> > > vfmadd132pd -24(%rsp), %xmm7, %xmm2
> > > vmovapd %xmm2, -56(%rsp)
> > > vmovsd  -48(%rsp), %xmm1
> > > vmovsd  -56(%rsp), %xmm0
> > > ret
> > >
> > > with this change, we avoid many of the reloads via memory,
> > >
> > > foo:    vpunpcklqdq %xmm3, %xmm2, %xmm7
> > > vpunpcklqdq %xmm1, %xmm0, %xmm6
> > > vpunpcklqdq %xmm5, %xmm4, %xmm2
> > > vmovdqa %xmm7, -24(%rsp)
> > > vmovdqa %xmm6, %xmm1
> > > movq    -16(%rsp), %rax
> > > vpinsrq $1, %rax, %xmm7, %xmm4
> > > vmovapd %xmm4, %xmm6
> > > vfmadd132pd %xmm1, %xmm2, %xmm6
> > > vmovapd %xmm6, -24(%rsp)
> > > vmovsd  -16(%rsp), %xmm1
> > > vmovsd  -24(%rsp), %xmm0
> > > ret
> > >
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-07-19  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386-expand.cc (ix86_expand_move): Don't call
> > > force_reg, to use SUBREG rather than create a new pseudo when
> > > inserting DFmode fields into TImode with insvti_{high,low}part.
> > > (*concat<mode><dwi>3_3): Split into two define_insn_and_split...
> > > (*concatditi3_3): 64-bit implementation.  Provide alternative
> > > that allows register allocation to use SSE registers that is
> > > split into vec_concatv2di after reload.
> > > (*concatsidi3_3): 32-bit implementation.
> > >
> > > gcc/testsuite/ChangeLog
> > > * gcc.target/i386/pr88873.c: New test case.
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index f9b0dc6..9c3febe 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx
> > operands[])
> >op0 = SUBREG_REG (op0);
> >tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
> >if (mode == DFmode)
> > -op1 = force_reg (DImode, gen_lowpart (DImode, op1));
> > +op1 = gen_lowpart (DImode, op1);
> >
> > Please note that gen_lowpart will ICE when op1 is a SUBREG. This is the 
> > reason
> > that we need to first force a SUBREG to a register and then perform 
> > gen_lowpart,
> > and it is necessary to avoid ICE.
>
> The good news is that we know op1 is a register, as this is tested by
> "&& REG_P (op1)" on line 551.  You'll also notice that I'm not removing
> the force_reg from before the call to gen_lowpart, but removing the call
> to force_reg after the call to gen_lowpart.  When I originally wrote this,
> the hope was that placing this SUBREG in its own pseudo would help
> with register allocation/CSE.  Unfortunately, increasing the number of
> pseudos (in this case) increases compile-time (due to quadratic behaviour
> in LRA), as shown by PR rtl-optimization/110587, and keeping the DF->DI
> conversion in a SUBREG inside the insvti_{high,low}part allows the
> register allocator to see the DF->DI->TI sequence in a single pattern,
> and hence choose to keep the TI mode in SSE registers, rather than use
> a pair of reloads, to write the DF value to memory, then read it back as
> a scalar in DImode, and perhaps the same again to go the other way.

This was my 

Re: [x86_64 PATCH] More TImode parameter passing improvements.

2023-07-20 Thread Uros Bizjak via Gcc-patches
On Wed, Jul 19, 2023 at 10:07 PM Roger Sayle  wrote:
>
>
> This patch is the next piece of a solution to the x86_64 ABI issues in
> PR 88873.  This splits the *concat<mode><dwi>3_3 define_insn_and_split
> into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
> *concatsidi3_3.  This allows us to add an additional alternative to
> the 64-bit version, enabling the register allocator to perform this
> operation using SSE registers, which is implemented/split after reload
> using vec_concatv2di.
>
> To demonstrate the improvement, the test case from PR88873:
>
> typedef struct { double x, y; } s_t;
>
> s_t foo (s_t a, s_t b, s_t c)
> {
>   return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y, c.y)
> };
> }
>
> when compiled with -O2 -march=cascadelake, currently generates:
>
> foo:    vmovq   %xmm2, -56(%rsp)
> movq    -56(%rsp), %rax
> vmovq   %xmm3, -48(%rsp)
> vmovq   %xmm4, -40(%rsp)
> movq    -48(%rsp), %rcx
> vmovq   %xmm5, -32(%rsp)
> vmovq   %rax, %xmm6
> movq    -40(%rsp), %rax
> movq    -32(%rsp), %rsi
> vpinsrq $1, %rcx, %xmm6, %xmm6
> vmovq   %xmm0, -24(%rsp)
> vmovq   %rax, %xmm7
> vmovq   %xmm1, -16(%rsp)
> vmovapd %xmm6, %xmm2
> vpinsrq $1, %rsi, %xmm7, %xmm7
> vfmadd132pd -24(%rsp), %xmm7, %xmm2
> vmovapd %xmm2, -56(%rsp)
> vmovsd  -48(%rsp), %xmm1
> vmovsd  -56(%rsp), %xmm0
> ret
>
> with this change, we avoid many of the reloads via memory,
>
> foo:    vpunpcklqdq %xmm3, %xmm2, %xmm7
> vpunpcklqdq %xmm1, %xmm0, %xmm6
> vpunpcklqdq %xmm5, %xmm4, %xmm2
> vmovdqa %xmm7, -24(%rsp)
> vmovdqa %xmm6, %xmm1
> movq    -16(%rsp), %rax
> vpinsrq $1, %rax, %xmm7, %xmm4
> vmovapd %xmm4, %xmm6
> vfmadd132pd %xmm1, %xmm2, %xmm6
> vmovapd %xmm6, -24(%rsp)
> vmovsd  -16(%rsp), %xmm1
> vmovsd  -24(%rsp), %xmm0
> ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-07-19  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_move): Don't call
> force_reg, to use SUBREG rather than create a new pseudo when
> inserting DFmode fields into TImode with insvti_{high,low}part.
> (*concat<mode><dwi>3_3): Split into two define_insn_and_split...
> (*concatditi3_3): 64-bit implementation.  Provide alternative
> that allows register allocation to use SSE registers that is
> split into vec_concatv2di after reload.
> (*concatsidi3_3): 32-bit implementation.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/pr88873.c: New test case.

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index f9b0dc6..9c3febe 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   op0 = SUBREG_REG (op0);
   tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
   if (mode == DFmode)
-op1 = force_reg (DImode, gen_lowpart (DImode, op1));
+op1 = gen_lowpart (DImode, op1);

Please note that gen_lowpart will ICE when op1 is a SUBREG. This is
the reason that we need to first force a SUBREG to a register and then
perform gen_lowpart, and it is necessary to avoid ICE.
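
A defensive sketch of the idiom being discussed (illustrative only, not
the actual ix86_expand_move code):

--cut here--
/* Force a possible SUBREG into a pseudo first, so that gen_lowpart
   is always handed a REG and cannot ICE.  */
if (SUBREG_P (op1))
  op1 = force_reg (GET_MODE (op1), op1);
op1 = gen_lowpart (DImode, op1);
--cut here--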

   op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
   op1 = gen_rtx_IOR (TImode, tmp, op1);
  }
@@ -570,7 +570,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   op0 = SUBREG_REG (op0);
   tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
   if (mode == DFmode)
-op1 = force_reg (DImode, gen_lowpart (DImode, op1));
+op1 = gen_lowpart (DImode, op1);

Also here.

   op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
   op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64));
   op1 = gen_rtx_IOR (TImode, tmp, op1);

Uros.


Re: [GCC 13 PATCH] PR target/109973: CCZmode and CCCmode variants of [v]ptest.

2023-07-19 Thread Uros Bizjak via Gcc-patches
On Wed, Jul 19, 2023 at 2:21 PM Richard Biener
 wrote:
>
> On Sun, Jun 11, 2023 at 12:55 AM Roger Sayle  
> wrote:
> >
> >
> > This is a backport of the fixes for PR target/109973 and PR target/110083.
> >
> > This backport to the releases/gcc-13 branch has been tested on
> > x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and
> > without --target_board=unix{-m32} with no new failures.  Ok for gcc-13,
> > or should we just close PR 109973 in Bugzilla?
>
> As an alternative solution for the GCC 13 branch I have tested reverting
> r13-2006-ga56c1641e9d25e successfully.  Can we choose between the
> options please?  Sorry I'm only bringing this up now but 13.2 RC is due
> tomorrow.
>
> Thank you,
> Richard.
>
> >
> >
> > 2023-06-10  Roger Sayle  
> > Uros Bizjak  
> >
> > gcc/ChangeLog
> > PR target/109973
> > PR target/110083
> > * config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new
> > CODE_for_sse4_1_ptestzv2di.
> > (__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di.
> > (__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di.
> > (__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di.
> > * config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode
> > when expanding UNSPEC_PTEST to compare against zero.
> > * config/i386/i386-features.cc (scalar_chain::convert_compare):
> > Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
> > Update or delete REG_EQUAL notes, converting CONST_INT and
> > CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR.
> > (general_scalar_chain::convert_insn): Use CCZmode for COMPARE
> > result.
> > (timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
> > * config/i386/i386-protos.h (ix86_match_ptest_ccmode): Prototype.
> > * config/i386/i386.cc (ix86_match_ptest_ccmode): New predicate to
> > check for suitable matching modes for the UNSPEC_PTEST pattern.
> > * config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK
> > to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ.
> > (*_ptest): Add asterisk to hide define_insn.  Remove
> > ":CC" mode of FLAGS_REG, instead use ix86_match_ptest_ccmode.
> > (_ptestz): New define_expand to specify CCZ.
> > (_ptestc): New define_expand to specify CCC.
> > (_ptest): A define_expand using CC to preserve the
> > current behavior.
> > (*ptest_and): Specify CCZ to only perform this optimization
> > when only the Z flag is required.
> >
> > gcc/testsuite/ChangeLog
> > PR target/109973
> > PR target/110083
> > * gcc.target/i386/pr109973-1.c: New test case.
> > * gcc.target/i386/pr109973-2.c: Likewise.
> > * gcc.target/i386/pr110083.c: Likewise.

Yes, I would rather have the offending patch reverted on gcc-13.

Uros.


[committed] dwarf2: Change return type of predicate functions from int to bool

2023-07-18 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* dwarf2asm.cc: Change FALSE to false.
* dwarf2cfi.cc (execute_dwarf2_frame): Change return type to void.
* dwarf2out.cc (matches_main_base): Change return type from
int to bool.  Change "last_match" variable to bool.
(dump_struct_debug): Change return type from int to bool.
Change "matches" and "result" function arguments to bool.
(is_pseudo_reg): Change return type from int to bool.
(is_tagged_type): Ditto.
(same_loc_p): Ditto.
(same_dw_val_p): Change return type from int to bool and adjust
function body accordingly.
(same_attr_p): Ditto.
(same_die_p): Ditto.
(is_type_die): Ditto.
(is_declaration_die): Ditto.
(should_move_die_to_comdat): Ditto.
(is_base_type): Ditto.
(is_based_loc): Ditto.
(local_scope_p): Ditto.
(class_scope_p): Ditto.
(class_or_namespace_scope_p): Ditto.
(is_tagged_type): Ditto.
(is_rust): Use void argument.
(is_nested_in_subprogram): Change return type from int to bool.
(contains_subprogram_definition): Ditto.
(gen_struct_or_union_type_die): Change "nested", "complete"
and "ns_decl" variables to bool.
(is_naming_typedef_decl): Change FALSE to false.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/dwarf2asm.cc b/gcc/dwarf2asm.cc
index 65b95fee243..ea3f3982347 100644
--- a/gcc/dwarf2asm.cc
+++ b/gcc/dwarf2asm.cc
@@ -52,7 +52,7 @@ dw2_assemble_integer (int size, rtx x)
 relocations usually result in assembler errors.  Assume
 all such values are positive and emit the relocation only
 in the least significant half.  */
-  const char *op = integer_asm_op (DWARF2_ADDR_SIZE, FALSE);
+  const char *op = integer_asm_op (DWARF2_ADDR_SIZE, false);
   if (BYTES_BIG_ENDIAN)
{
  if (op)
@@ -92,7 +92,7 @@ dw2_assemble_integer (int size, rtx x)
   return;
 }
 
-  const char *op = integer_asm_op (size, FALSE);
+  const char *op = integer_asm_op (size, false);
 
   if (op)
 {
@@ -142,7 +142,7 @@ dw2_asm_output_data (int size, unsigned HOST_WIDE_INT value,
 const char *comment, ...)
 {
   va_list ap;
-  const char *op = integer_asm_op (size, FALSE);
+  const char *op = integer_asm_op (size, false);
 
   va_start (ap, comment);
 
diff --git a/gcc/dwarf2cfi.cc b/gcc/dwarf2cfi.cc
index 57283c10a29..ddc728f4ad0 100644
--- a/gcc/dwarf2cfi.cc
+++ b/gcc/dwarf2cfi.cc
@@ -3291,7 +3291,7 @@ create_cie_data (void)
state at each location within the function.  These notes will be
emitted during pass_final.  */
 
-static unsigned int
+static void
 execute_dwarf2_frame (void)
 {
   /* Different HARD_FRAME_POINTER_REGNUM might coexist in the same file.  */
@@ -3322,8 +3322,6 @@ execute_dwarf2_frame (void)
 
   delete trace_index;
   trace_index = NULL;
-
-  return 0;
 }
 
 /* Convert a DWARF call frame info. operation to its string name */
@@ -3796,7 +3794,8 @@ public:
   bool gate (function *) final override;
   unsigned int execute (function *) final override
   {
-return execute_dwarf2_frame ();
+execute_dwarf2_frame ();
+return 0;
   }
 
 }; // class pass_dwarf2_frame
diff --git a/gcc/dwarf2out.cc b/gcc/dwarf2out.cc
index 238d0a94400..fa0fe4c41bb 100644
--- a/gcc/dwarf2out.cc
+++ b/gcc/dwarf2out.cc
@@ -339,12 +339,12 @@ static unsigned int rnglist_idx;
 
 /* Match the base name of a file to the base name of a compilation unit. */
 
-static int
+static bool
 matches_main_base (const char *path)
 {
   /* Cache the last query. */
   static const char *last_path = NULL;
-  static int last_match = 0;
+  static bool last_match = false;
   if (path != last_path)
 {
   const char *base;
@@ -358,10 +358,10 @@ matches_main_base (const char *path)
 
 #ifdef DEBUG_DEBUG_STRUCT
 
-static int
+static bool
 dump_struct_debug (tree type, enum debug_info_usage usage,
   enum debug_struct_file criterion, int generic,
-  int matches, int result)
+  bool matches, bool result)
 {
   /* Find the type name. */
   tree type_decl = TYPE_STUB_DECL (type);
@@ -3730,9 +3730,9 @@ enum dw_scalar_form
 
 /* Forward declarations for functions defined in this file.  */
 
-static int is_pseudo_reg (const_rtx);
+static bool is_pseudo_reg (const_rtx);
 static tree type_main_variant (tree);
-static int is_tagged_type (const_tree);
+static bool is_tagged_type (const_tree);
 static const char *dwarf_tag_name (unsigned);
 static const char *dwarf_attr_name (unsigned);
 static const char *dwarf_form_name (unsigned);
@@ -3805,14 +3805,14 @@ static void collect_checksum_attributes (struct 
checksum_attributes *, dw_die_re
 static void die_checksum_ordered (dw_die_ref, struct md5_ctx *, int *);
 static void checksum_die_context (dw_die_ref, struct md5_ctx *);
 static void generate_type_signature (dw_die_ref, comdat_type_node *);
-static int same_loc_p 

[committed] combine: Change return type of predicate functions from int to bool

2023-07-17 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* combine.cc (struct reg_stat_type): Change last_set_invalid to bool.
(cant_combine_insn_p): Change return type from int to bool and adjust
function body accordingly.
(can_combine_p): Ditto.
(combinable_i3pat): Ditto.  Change "i1_not_in_src" and "i0_not_in_src"
function arguments from int to bool.
(contains_muldiv): Change return type from int to bool and adjust
function body accordingly.
(try_combine): Ditto. Change "new_direct_jump" pointer function
argument from int to bool.  Change "substed_i2", "substed_i1",
"substed_i0", "added_sets_0", "added_sets_1", "added_sets_2",
"i2dest_in_i2src", "i1dest_in_i1src", "i2dest_in_i1src",
"i0dest_in_i0src", "i1dest_in_i0src", "i2dest_in_i0src",
"i2dest_killed", "i1dest_killed", "i0dest_killed", "i1_feeds_i2_n",
"i0_feeds_i2_n", "i0_feeds_i1_n", "i3_subst_into_i2", "have_mult",
"swap_i2i3", "split_i2i3" and "changed_i3_dest" variables
from int to bool.
(subst): Change "in_dest", "in_cond" and "unique_copy" function
arguments from int to bool.
(combine_simplify_rtx): Change "in_dest" and "in_cond" function
arguments from int to bool.
(make_extraction): Change "unsignedp", "in_dest" and "in_compare"
function argument from int to bool.
(force_int_to_mode): Change "just_select" function argument
from int to bool.  Change "next_select" variable to bool.
(rtx_equal_for_field_assignment_p): Change return type from
int to bool and adjust function body accordingly.
(merge_outer_ops): Ditto.  Change "pcomp_p" pointer function
argument from int to bool.
(get_last_value_validate): Change return type from int to bool
and adjust function body accordingly.
(reg_dead_at_p): Ditto.
(reg_bitfield_target_p): Ditto.
(combine_instructions): Ditto.  Change "new_direct_jump"
variable to bool.
(can_combine_p): Change return type from int to bool
and adjust function body accordingly.
(likely_spilled_retval_p): Ditto.
(can_change_dest_mode): Change "added_sets" function argument
from int to bool.
(find_split_point): Change "unsignedp" variable to bool.
(simplify_if_then_else): Change "comparison_p" and "swapped"
variables to bool.
(simplify_set): Change "other_changed" variable to bool.
(expand_compound_operation): Change "unsignedp" variable to bool.
(force_to_mode): Change "just_select" function argument
from int to bool.  Change "next_select" variable to bool.
(extended_count): Change "unsignedp" function argument to bool.
(simplify_shift_const_1): Change "complement_p" variable to bool.
(simplify_comparison): Change "changed" variable to bool.
(rest_of_handle_combine): Change return type to void.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/combine.cc b/gcc/combine.cc
index 304c020ec79..d9161b257e8 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -156,7 +156,7 @@ struct reg_stat_type {
register was assigned
  last_set_table_tick   records the value of label_tick when a
value using the register is assigned
- last_set_invalid  set to nonzero when it is not valid
+ last_set_invalid  set to true when it is not valid
to use the value of this register in some
register's value
 
@@ -202,11 +202,11 @@ struct reg_stat_type {
   char last_set_sign_bit_copies;
   ENUM_BITFIELD(machine_mode)  last_set_mode : MACHINE_MODE_BITSIZE;
 
-  /* Set nonzero if references to register n in expressions should not be
+  /* Set to true if references to register n in expressions should not be
  used.  last_set_invalid is set nonzero when this register is being
  assigned to and last_set_table_tick == label_tick.  */
 
-  char last_set_invalid;
+  bool last_set_invalid;
 
   /* Some registers that are set more than once and used in more than one
  basic block are nevertheless always set in similar ways.  For example,
@@ -416,35 +416,36 @@ static void do_SUBST_INT (int *, int);
 static void init_reg_last (void);
 static void setup_incoming_promotions (rtx_insn *);
 static void set_nonzero_bits_and_sign_copies (rtx, const_rtx, void *);
-static int cant_combine_insn_p (rtx_insn *);
-static int can_combine_p (rtx_insn *, rtx_insn *, rtx_insn *, rtx_insn *,
- rtx_insn *, rtx_insn *, rtx *, rtx *);
-static int combinable_i3pat (rtx_insn *, rtx *, rtx, rtx, rtx, int, int, rtx 
*);
-static int contains_muldiv (rtx);
+static bool cant_combine_insn_p (rtx_insn *);
+static bool can_combine_p (rtx_insn *, rtx_insn *, rtx_insn *, rtx_insn *,
+  rtx_insn *, rtx_insn *, rtx *, rtx *);
+static bool 

Re: [PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-07-17 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 17, 2023 at 10:28 AM Hongtao Liu  wrote:
>
> I'd like to ping this patch (only patch 1/2; for patch 2/2, I
> think that may not be necessary).
>
> On Mon, May 15, 2023 at 9:20 AM Hongtao Liu  wrote:
> >
> > ping.
> >
> > On Fri, Apr 21, 2023 at 9:55 PM liuhongt  wrote:
> > >
> > > > > +  if (!TARGET_SSE2)
> > > > > +{
> > > > > +  if (c_dialect_cxx ()
> > > > > +   && cxx_dialect > cxx20)
> > > >
> > > > Formatting, both conditions are short, so just put them on one line.
> > > Changed.
> > >
> > > > But for the C++23 macros, more importantly I think we really should
> > > > also in ix86_target_macros_internal add
> > > >   if (c_dialect_cxx ()
> > > >   && cxx_dialect > cxx20
> > > >   && (isa_flag & OPTION_MASK_ISA_SSE2))
> > > > {
> > > >   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> > > >   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> > > > }
> > > > plus associated libstdc++ changes.  It can be done incrementally though.
> > > Added in PATCH 2/2
> > >
> > > > > +  if (flag_building_libgcc)
> > > > > + {
> > > > > +   /* libbid uses __LIBGCC_HAS_HF_MODE__ and 
> > > > > __LIBGCC_HAS_BF_MODE__
> > > > > +  to check backend support of _Float16 and __bf16 type.  */
> > > >
> > > > That is actually the case only for HFmode, but not for BFmode right now.
> > > > So, we need further work.  One is to add the BFmode support in there,
> > > > and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> > > > _Decimal* conversions are compiled in also if not -msse2 by default.
> > > > One way to do that is wrap the HF and BF mode related functions on x86
> > > > #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> > > > perhaps we don't need to undef this stuff here), another is not provide
> > > > the hf/bf support in that case from the TUs where they are provided now,
> > > > but from a different one which would be compiled with -msse2.
> > > Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, just 
> > > like
> > > we did before for HFtype softfp. Then no need to undef libgcc macros.
> > >
> > > > >/* We allowed the user to turn off SSE for kernel mode.  Don't 
> > > > > crash if
> > > > >   some less clueful developer tries to use floating-point anyway. 
> > > > >  */
> > > > > -  if (needed_sseregs && !TARGET_SSE)
> > > > > +  if (needed_sseregs
> > > > > +  && (!TARGET_SSE
> > > > > +   || (VALID_SSE2_TYPE_MODE (mode)
> > > > > +   && !TARGET_SSE2)))
> > > >
> > > > Formatting, no need to split this up that much.
> > > >   if (needed_sseregs
> > > >   && (!TARGET_SSE
> > > >   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > > or even better
> > > >   if (needed_sseregs
> > > >   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > > will do it.
> > > Changed.
> > >
> > > > Instead of this, just use
> > > >   if (!float16_type_node)
> > > > {
> > > >   float16_type_node = ix86_float16_type_node;
> > > >   callback (float16_type_node);
> > > >   float16_type_node = NULL_TREE;
> > > > }
> > > >   if (!bfloat16_type_node)
> > > > {
> > > >   bfloat16_type_node = ix86_bf16_type_node;
> > > >   callback (bfloat16_type_node);
> > > >   bfloat16_type_node = NULL_TREE;
> > > > }
> > > Changed.
> > >
> > >
> > > > > +static const char *
> > > > > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > > > > +{
> > > > > +  if (element_mode (fromtype) != element_mode (totype))
> > > > > +{
> > > > > +  /* Do not allow conversions to/from BFmode/HFmode scalar types
> > > > > +  when TARGET_SSE2 is not available.  */
> > > > > +  if ((TYPE_MODE (fromtype) == BFmode
> > > > > +|| TYPE_MODE (fromtype) == HFmode)
> > > > > +   && !TARGET_SSE2)
> > > >
> > > > First of all, not really sure if this should be purely about scalar
> > > > modes, not also complex and vector modes involving those inner modes.
> > > > Because complex or vector modes with BF/HF elements will be without
> > > > TARGET_SSE2 for sure lowered into scalar code and that can't be handled
> > > > either.
> > > > So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == BFmode)
> > > > or even better
> > > > if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> > > > ?
> > > > Or even better remember the 2 modes above into machine_mode temporaries
> > > > and just use those in the != comparison and for the checks?
> > > >
> > > > Also, I think it is weird to tell user %<__bf16%> or %<_Float16%> when
> > > > we know which one it is.  Just return separate messages?
> > > Changed.
> > >
> > > > > +  /* Reject all single-operand operations on BFmode/HFmode except 
> > > > > for &
> > > > > + when TARGET_SSE2 is not available.  */
> > > > > +  if ((element_mode (type) == BFmode || element_mode (type) == 
> > > > > HFmode)
> 

Re: [PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-17 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 17, 2023 at 8:44 AM Hongtao Liu  wrote:
>
> Ping.
>
> On Tue, Jul 11, 2023 at 5:16 PM liuhongt via Gcc-patches
>  wrote:
> >
> > Similar to what we did for CMPXCHG, but extended to all
> > ix86_comparison_int_operator cases, since CMPCCXADD sets EFLAGS exactly
> > the same as CMP.
> >
> > When the operand order in the CMP insn is the same as in CMPCCXADD,
> > the CMP insn can be eliminated directly.
> >
> > When the operand order is swapped in the CMP insn, only optimize
> > cmpccxadd + cmpl + jcc/setcc to cmpccxadd + jcc/setcc when FLAGS_REG is
> > dead after jcc/setcc, adjusting the jcc/setcc code accordingly.
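> >
> > A sketch of the shape being optimized (hypothetical: it assumes the
> > _cmpccxadd_epi32 intrinsic and the _CMPCCX_Z enumerator from
> > immintrin.h, compiled with -mcmpccxadd on x86-64, and it is not the
> > pr110591 testcase):
> >
> > --cut here--
> > #include <immintrin.h>
> >
> > int foo (int *p, int old, int add)
> > {
> >   int r = _cmpccxadd_epi32 (p, old, add, _CMPCCX_Z);
> >   /* CMPCCXADD leaves EFLAGS set exactly as cmp would, so the
> >      comparison below can be folded into the flags it already set.  */
> >   return r == old;
> > }
> > --cut here--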
> >
> > gcc/ChangeLog:
> >
> > PR target/110591
> > * config/i386/sync.md (cmpccxadd_<mode>): Adjust the pattern
> > to explicitly set FLAGS_REG like *cmp<mode>_1, also add extra
> > 3 define_peephole2 after the pattern.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110591.c: New test.
> > * gcc.target/i386/pr110591-2.c: New test.

LGTM.

Thanks,
Uros.

> > ---
> >  gcc/config/i386/sync.md| 160 -
> >  gcc/testsuite/gcc.target/i386/pr110591-2.c |  90 
> >  gcc/testsuite/gcc.target/i386/pr110591.c   |  66 +
> >  3 files changed, 315 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c
> >
> > diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> > index e1fa1504deb..e84226cf895 100644
> > --- a/gcc/config/i386/sync.md
> > +++ b/gcc/config/i386/sync.md
> > @@ -1093,7 +1093,9 @@ (define_insn "cmpccxadd_<mode>"
> >   UNSPECV_CMPCCXADD))
> > (set (match_dup 1)
> > (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> > -   (clobber (reg:CC FLAGS_REG))]
> > +   (set (reg:CC FLAGS_REG)
> > +   (compare:CC (match_dup 1)
> > +   (match_dup 2)))]
> >"TARGET_CMPCCXADD && TARGET_64BIT"
> >  {
> >char buf[128];
> > @@ -1105,3 +1107,159 @@ (define_insn "cmpccxadd_"
> >output_asm_insn (buf, operands);
> >return "";
> >  })
> > +
> > +(define_peephole2
> > +  [(set (match_operand:SWI48x 0 "register_operand")
> > +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> > +   (parallel [(set (match_dup 0)
> > +  (unspec_volatile:SWI48x
> > +[(match_operand:SWI48x 2 "memory_operand")
> > + (match_dup 0)
> > + (match_operand:SWI48x 3 "register_operand")
> > + (match_operand:SI 4 "const_int_operand")]
> > +UNSPECV_CMPCCXADD))
> > + (set (match_dup 2)
> > +  (unspec_volatile:SWI48x [(const_int 0)] 
> > UNSPECV_CMPCCXADD))
> > + (set (reg:CC FLAGS_REG)
> > +  (compare:CC (match_dup 2)
> > +  (match_dup 0)))])
> > +   (set (reg FLAGS_REG)
> > +   (compare (match_operand:SWI48x 5 "register_operand")
> > +(match_operand:SWI48x 6 "x86_64_general_operand")))]
> > +  "TARGET_CMPCCXADD && TARGET_64BIT
> > +   && rtx_equal_p (operands[0], operands[5])
> > +   && rtx_equal_p (operands[1], operands[6])"
> > +  [(set (match_dup 0)
> > +   (match_dup 1))
> > +   (parallel [(set (match_dup 0)
> > +  (unspec_volatile:SWI48x
> > +[(match_dup 2)
> > + (match_dup 0)
> > + (match_dup 3)
> > + (match_dup 4)]
> > +UNSPECV_CMPCCXADD))
> > + (set (match_dup 2)
> > +  (unspec_volatile:SWI48x [(const_int 0)] 
> > UNSPECV_CMPCCXADD))
> > + (set (reg:CC FLAGS_REG)
> > +  (compare:CC (match_dup 2)
> > +  (match_dup 0)))])
> > +   (set (match_dup 7)
> > +   (match_op_dup 8
> > + [(match_dup 9) (const_int 0)]))])
> > +
> > +(define_peephole2
> > +  [(set (match_operand:SWI48x 0 "register_operand")
> > +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> > +   (parallel [(set (match_dup 0)
> > +  (unspec_volatile:SWI48x
> > +[(match_operand:SWI48x 2 "memory_operand")
> > + (match_dup 0)
> > + (match_operand:SWI48x 3 "register_operand")
> > + (match_operand:SI 4 "const_int_operand")]
> > +UNSPECV_CMPCCXADD))
> > + (set (match_dup 2)
> > +  (unspec_volatile:SWI48x [(const_int 0)] 
> > UNSPECV_CMPCCXADD))
> > + (set (reg:CC FLAGS_REG)
> > +  (compare:CC (match_dup 2)
> > +  (match_dup 0)))])
> > +   (set (reg FLAGS_REG)
> > +   (compare (match_operand:SWI48x 5 "register_operand")
> > +(match_operand:SWI48x 6 "x86_64_general_operand")))
> > +   (set (match_operand:QI 7 "nonimmediate_operand")
> > +   (match_operator:QI 8 
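
For context, here is a minimal C sketch of the shape these peepholes
target -- my own illustration, not part of the patch.  The
_cmpccxadd_epi32 intrinsic and the _CMPCCX_Z flag name below are
recalled from x86gprintrin.h and should be treated as assumptions:

#include <x86gprintrin.h>

/* Sketch only: with the new peepholes, the "old == a" test can reuse
   the flags that CMPCCXADD already set when comparing *p with a,
   instead of requiring a separate cmpl.  Intrinsic and enum names
   are assumed, not confirmed by this thread.  */
int
try_update (int *p, int a, int b)
{
  int old = _cmpccxadd_epi32 (p, a, b, _CMPCCX_Z);
  return old == a;
}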

Re: [PATCH] x86: replace "extendhfdf2" expander

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 14, 2023 at 11:44 AM Jan Beulich  wrote:
>
> The corresponding insn serves this purpose quite fine, and leads to
> slightly less (generated) code. All we need is the insn to not have a
> leading * in its name, while retaining that * for "extendhfsf2".
> Introduce a mode attribute in exchange to achieve that.
>
> gcc/
>
> * config/i386/i386.md (extendhfdf2): Delete expander.
> (extendhf): New mode attribute.
> (*extendhf<mode>2): Use it.

No, please leave the expander; it is there because extendhfsf2
prevents effective macroization.

FYI, the named pattern does not result in less generated code; the
same code is generated from the named pattern as from the expander.
The source code can be shrunk, but in this particular case, forced
macroization complicates things more.

Uros.

> ---
> Of course the mode attribute could as well supply the full names.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5221,13 +5221,9 @@
>  }
>  })
>
> -(define_expand "extendhfdf2"
> -  [(set (match_operand:DF 0 "register_operand")
> -   (float_extend:DF
> - (match_operand:HF 1 "nonimmediate_operand")))]
> -  "TARGET_AVX512FP16")
> +(define_mode_attr extendhf [(SF "*") (DF "")])
>
> -(define_insn "*extendhf2"
> +(define_insn "extendhf2"
>[(set (match_operand:MODEF 0 "register_operand" "=v")
>  (float_extend:MODEF
>   (match_operand:HF 1 "nonimmediate_operand" "vm")))]
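
As a point of reference, a minimal example (mine, not from the patch)
that is expanded through extendhfdf2; it assumes compilation with
-mavx512fp16 so that TARGET_AVX512FP16 holds:

/* Sketch: HFmode to DFmode float extension.  */
double
hf_to_df (_Float16 x)
{
  return x;
}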


Re: [x86 PATCH] PR target/110588: Add *bt<mode>_setncqi_2 to generate btl

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 14, 2023 at 11:27 AM Roger Sayle  wrote:
>
>
> > From: Uros Bizjak 
> > Sent: 13 July 2023 19:21
> >
> > On Thu, Jul 13, 2023 at 7:10 PM Roger Sayle 
> > wrote:
> > >
> > > This patch resolves PR target/110588 to catch another case in combine
> > > where the i386 backend should be generating a btl instruction.  This
> > > adds another define_insn_and_split to recognize the RTL representation
> > > for this case.
> > >
> > > I also noticed that two related define_insn_and_split weren't using
> > > the preferred string style for single statement
> > > preparation-statements, so I've reformatted these to be consistent in 
> > > style with
> > the new one.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-07-13  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > PR target/110588
> > > * config/i386/i386.md (*bt<mode>_setcqi): Prefer string form
> > > preparation statement over braces for a single statement.
> > > (*bt<mode>_setncqi): Likewise.
> > > (*bt<mode>_setncqi_2): New define_insn_and_split.
> > >
> > > gcc/testsuite/ChangeLog
> > > PR target/110588
> > > * gcc.target/i386/pr110588.c: New test case.
> >
> > +;; Help combine recognize bt followed by setnc (PR target/110588)
> > +(define_insn_and_split "*bt_setncqi_2"
> > +  [(set (match_operand:QI 0 "register_operand")  (eq:QI
> > +  (zero_extract:SWI48
> > +(match_operand:SWI48 1 "register_operand")
> > +(const_int 1)
> > +(zero_extend:SI (match_operand:QI 2 "register_operand")))
> > +  (const_int 0)))
> > +   (clobber (reg:CC FLAGS_REG))]
> > +  "TARGET_USE_BT && ix86_pre_reload_split ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (reg:CCC FLAGS_REG)
> > +(compare:CCC
> > + (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2))
> > + (const_int 0)))
> > +   (set (match_dup 0)
> > +(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))]
> > +  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
> >
> > I don't think the above transformation is 100% correct, mainly due to the 
> > use of
> > paradoxical subreg.
> >
> > The combined instruction is operating with a zero_extended QImode register, 
> > so
> > all bits of the register are well defined. You are splitting using 
> > paradoxical subreg,
> > so you don't know what garbage is there in the highpart of the count 
> > register.
> > However, BTL/BTQ uses modulo 64 (or 32) of this register, so even with a 
> > slightly
> > invalid RTX, everything checks out.
> >
> > +  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
> >
> > You probably need <MODE>mode instead of SImode here.
>
> The define_insn for *bt<mode> is:
>
> (define_insn "*bt<mode>"
>   [(set (reg:CCC FLAGS_REG)
> (compare:CCC
>   (zero_extract:SWI48
> (match_operand:SWI48 0 "nonimmediate_operand" "r,m")
> (const_int 1)
> (match_operand:SI 1 "nonmemory_operand" "r,"))
>   (const_int 0)))]
>
> So <MODE> isn't appropriate here.
>
> But now you've made me think about it, it's inconsistent that all of the 
> shifts
> and rotates in i386.md standardize on QImode for shift counts, but the bit 
> test
> instructions use SImode?  I think this explains where the paradoxical SUBREGs
> come from, and in theory any_extend from QImode to SImode here could/should
> be handled/unnecessary.
>
> Is it worth investigating a follow-up patch to convert all ZERO_EXTRACTs and
> SIGN_EXTRACTs in i386.md to use QImode (instead of SImode)?

IIRC, zero_extract was moved from modeless to a pattern with defined
mode a while ago. Perhaps SImode is just because of these ancient
times, and BT pattern was written that way to satisfy combine. I think
it is definitely worth investigating; perhaps some BT-related pattern
will become obsolete because of the change.

Uros.


Re: [PATCH] cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 14, 2023 at 10:53 AM Richard Biener  wrote:
>
> On Fri, 14 Jul 2023, Uros Bizjak wrote:
>
> > On Fri, Jul 14, 2023 at 10:31 AM Richard Biener  wrote:
> > >
> > > On Fri, 14 Jul 2023, Uros Bizjak wrote:
> > >
> > > > cprop1 pass does not consider paradoxical subreg and for (insn 22) 
> > > > claims
> > > > that it equals 8 elements of HImode by setting a REG_EQUAL note:
> > > >
> > > > (insn 21 19 22 4 (set (reg:V4QI 98)
> > > > (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S4
> > > > A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
> > > >  (expr_list:REG_EQUAL (const_vector:V4QI [
> > > > (const_int -52 [0xffcc]) repeated x4
> > > > ])
> > > > (nil)))
> > > > (insn 22 21 23 4 (set (reg:V8HI 100)
> > > > (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 
> > > > 0)
> > > > (parallel [
> > > > (const_int 0 [0])
> > > > (const_int 1 [0x1])
> > > > (const_int 2 [0x2])
> > > > (const_int 3 [0x3])
> > > > (const_int 4 [0x4])
> > > > (const_int 5 [0x5])
> > > > (const_int 6 [0x6])
> > > > (const_int 7 [0x7])
> > > > ] "pr110206.c":12:42 7471 
> > > > {sse4_1_zero_extendv8qiv8hi2}
> > > >  (expr_list:REG_EQUAL (const_vector:V8HI [
> > > > (const_int 204 [0xcc]) repeated x8
> > > > ])
> > > > (expr_list:REG_DEAD (reg:V4QI 98)
> > > > (nil
> > > >
> > > > We rely on the "undefined" vals to have a specific value (from the 
> > > > earlier
> > > > REG_EQUAL note) but actual code generation doesn't ensure this (it 
> > > > doesn't
> > > > need to).  That said, the issue isn't the constant folding per-se but 
> > > > that
> > > > we do not actually constant fold but register an equality that doesn't 
> > > > hold.
> > > >
> > > > PR target/110206
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * fwprop.cc (contains_paradoxical_subreg_p): Move to ...
> > > > * rtlanal.cc (contains_paradoxical_subreg_p): ... here.
> > > > * rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
> > > > * cprop.cc (try_replace_reg): Do not set REG_EQUAL note
> > > > when the original source contains a paradoxical subreg.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > * gcc.dg/torture/pr110206.c: New test.
> > > >
> > > > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> > > >
> > > > OK for mainline and backports?
> > >
> > > OK.
> > >
> > > I think the testcase can also run on other targets if you add
> > > dg-additional-options "-w -Wno-psabi", all generic vector ops
> > > should be lowered if not supported.
> >
> > True, but with lowered vector ops, the test would not even come close
> > to the problem. The problem is specific to generic vector ops, and can
> > be triggered only when paradoxical subregs are used to implement
> > (partial) vector modes. This is the case on x86, where partial vectors
> > are now heavily used, and even there we need the latest vector ISA
> > enabled to trip the condition.
> >
> > The above is the reason that dg-torture is used, with the hope that
> > the runtime failure will trip when the testsuite is run with specific
> > target options.
>
> I see.  I'm fine with this then though moving to gcc.target/i386
> with appropriate triggering options and a dg-require for runtime
> support would also work.

You are right. I'll add the attached testcase to gcc.target/i386 instead.

Uros.
/* PR target/110206 */
/* { dg-do run } */
/* { dg-options "-Os -mavx512bw -mavx512vl" } */
/* { dg-require-effective-target avx512bw } */
/* { dg-require-effective-target avx512vl } */

#define AVX512BW
#define AVX512VL

#include "avx512f-check.h"

typedef unsigned char __attribute__((__vector_size__ (4))) U;
typedef unsigned char __attribute__((__vector_size__ (8))) V;
typedef unsigned short u16;

V g;

void
__attribute__((noinline))
foo (U u, u16 c, V *r)
{
  if (!c)
abort ();
  V x = __builtin_shufflevector (u, (204 >> u), 7, 0, 5, 1, 3, 5, 0, 2);
  V y = __builtin_shufflevector (g, (V) { }, 7, 6, 6, 7, 2, 6, 3, 5);
  V z = __builtin_shufflevector (y, 204 * x, 3, 9, 8, 1, 4, 6, 14, 5);
  *r = z;
}

static void test_256 (void) { };

static void
test_128 (void)
{
  V r;
  foo ((U){4}, 5, &r);
  if (r[6] != 0x30)
abort();
}


Re: [PATCH] cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 14, 2023 at 10:31 AM Richard Biener  wrote:
>
> On Fri, 14 Jul 2023, Uros Bizjak wrote:
>
> > cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
> > that it equals 8 elements of HImode by setting a REG_EQUAL note:
> >
> > (insn 21 19 22 4 (set (reg:V4QI 98)
> > (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S4
> > A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
> >  (expr_list:REG_EQUAL (const_vector:V4QI [
> > (const_int -52 [0xffcc]) repeated x4
> > ])
> > (nil)))
> > (insn 22 21 23 4 (set (reg:V8HI 100)
> > (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
> > (parallel [
> > (const_int 0 [0])
> > (const_int 1 [0x1])
> > (const_int 2 [0x2])
> > (const_int 3 [0x3])
> > (const_int 4 [0x4])
> > (const_int 5 [0x5])
> > (const_int 6 [0x6])
> > (const_int 7 [0x7])
> > ] "pr110206.c":12:42 7471 
> > {sse4_1_zero_extendv8qiv8hi2}
> >  (expr_list:REG_EQUAL (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x8
> > ])
> > (expr_list:REG_DEAD (reg:V4QI 98)
> > (nil
> >
> > We rely on the "undefined" vals to have a specific value (from the earlier
> > REG_EQUAL note) but actual code generation doesn't ensure this (it doesn't
> > need to).  That said, the issue isn't the constant folding per-se but that
> > we do not actually constant fold but register an equality that doesn't hold.
> >
> > PR target/110206
> >
> > gcc/ChangeLog:
> >
> > * fwprop.cc (contains_paradoxical_subreg_p): Move to ...
> > * rtlanal.cc (contains_paradoxical_subreg_p): ... here.
> > * rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
> > * cprop.cc (try_replace_reg): Do not set REG_EQUAL note
> > when the original source contains a paradoxical subreg.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/torture/pr110206.c: New test.
> >
> > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >
> > OK for mainline and backports?
>
> OK.
>
> I think the testcase can also run on other targets if you add
> dg-additional-options "-w -Wno-psabi", all generic vector ops
> should be lowered if not supported.

True, but with lowered vector ops, the test would not even come close
to the problem. The problem is specific to generic vector ops, and can
be triggered only when paradoxical subregs are used to implement
(partial) vector modes. This is the case on x86, where partial vectors
are now heavily used, and even there we need the latest vector ISA
enabled to trip the condition.

The above is the reason that dg-torture is used, with the hope that
the runtime failure will trip when the testsuite is run with specific
target options.

Uros.


Re: [x86_64 PATCH] Improved insv of DImode/DFmode {high,low}parts into TImode.

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 13, 2023 at 6:45 PM Roger Sayle  wrote:
>
>
> This is the next piece towards a fix for (the x86_64 ABI issues affecting)
> PR 88873.  This patch generalizes the recent tweak to ix86_expand_move
> for setting the highpart of a TImode reg from a DImode source using
> *insvti_highpart_1, to handle both DImode and DFmode sources, and also
> use the recently added *insvti_lowpart_1 for setting the lowpart.
>
> Although this is another intermediate step (not yet a fix), towards
> enabling *insvti and *concat* patterns to be candidates for TImode STV
> (by using V2DI/V2DF instructions), it already improves things a little.
>
> For the test case from PR 88873
>
> typedef struct { double x, y; } s_t;
> typedef double v2df __attribute__ ((vector_size (2 * sizeof(double))));
>
> s_t foo (s_t a, s_t b, s_t c)
> {
>   return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) };
> }
>
>
> With -O2 -march=cascadelake, GCC currently generates:
>
> Before (29 instructions):
> vmovq   %xmm2, -56(%rsp)
> movq-56(%rsp), %rdx
> vmovq   %xmm4, -40(%rsp)
> movq$0, -48(%rsp)
> movq%rdx, -56(%rsp)
> movq-40(%rsp), %rdx
> vmovq   %xmm0, -24(%rsp)
> movq%rdx, -40(%rsp)
> movq-24(%rsp), %rsi
> movq-56(%rsp), %rax
> movq$0, -32(%rsp)
> vmovq   %xmm3, -48(%rsp)
> movq-48(%rsp), %rcx
> vmovq   %xmm5, -32(%rsp)
> vmovq   %rax, %xmm6
> movq-40(%rsp), %rax
> movq$0, -16(%rsp)
> movq%rsi, -24(%rsp)
> movq-32(%rsp), %rsi
> vpinsrq $1, %rcx, %xmm6, %xmm6
> vmovq   %rax, %xmm7
> vmovq   %xmm1, -16(%rsp)
> vmovapd %xmm6, %xmm3
> vpinsrq $1, %rsi, %xmm7, %xmm7
> vfmadd132pd -24(%rsp), %xmm7, %xmm3
> vmovapd %xmm3, -56(%rsp)
> vmovsd  -48(%rsp), %xmm1
> vmovsd  -56(%rsp), %xmm0
> ret
>
> After (20 instructions):
> vmovq   %xmm2, -56(%rsp)
> movq-56(%rsp), %rax
> vmovq   %xmm3, -48(%rsp)
> vmovq   %xmm4, -40(%rsp)
> movq-48(%rsp), %rcx
> vmovq   %xmm5, -32(%rsp)
> vmovq   %rax, %xmm6
> movq-40(%rsp), %rax
> movq-32(%rsp), %rsi
> vpinsrq $1, %rcx, %xmm6, %xmm6
> vmovq   %xmm0, -24(%rsp)
> vmovq   %rax, %xmm7
> vmovq   %xmm1, -16(%rsp)
> vmovapd %xmm6, %xmm2
> vpinsrq $1, %rsi, %xmm7, %xmm7
> vfmadd132pd -24(%rsp), %xmm7, %xmm2
> vmovapd %xmm2, -56(%rsp)
> vmovsd  -48(%rsp), %xmm1
> vmovsd  -56(%rsp), %xmm0
> ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  No testcase yet, as the above code will hopefully
> change dramatically with the next pieces.  Ok for mainline?
>
>
> 2023-07-13  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_move): Generalize special
> case inserting of 64-bit values into a TImode register, to handle
> both DImode and DFmode using either *insvti_lowpart_1
> or *insvti_highpart_1.

LGTM, but please watch out for fallout.

Thanks,
Uros.
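
For readers unfamiliar with the idiom, here is a minimal C sketch (my
own, not from the patch) of writing just the high 64 bits of a TImode
value, the shape that *insvti_highpart_1 is meant to match:

/* Sketch: keep the low half of x, insert hi as the new high half.  */
unsigned __int128
set_highpart (unsigned __int128 x, unsigned long long hi)
{
  x &= 0xffffffffffffffffULL;           /* keep only the low 64 bits */
  x |= (unsigned __int128) hi << 64;    /* insert the new high part */
  return x;
}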


Re: [PATCH] i386: Auto vectorize usdot_prod, udot_prod with AVXVNNIINT16 instruction.

2023-07-14 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 14, 2023 at 8:24 AM Haochen Jiang  wrote:
>
> Hi all,
>
> This patch aims to auto-vectorize usdot_prod and udot_prod with the newly
> introduced AVX-VNNI-INT16.
>
> Also I refined the redundant mode iterator in the patch.
>
> Regtested on x86_64-pc-linux-gnu. Ok for trunk after the AVX-VNNI-INT16
> patch is checked in?
>
> BRs,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (VI2_AVX2): Delete V32HI since we actually
> have the same iterator.  Also rename all its occurrences to
> VI2_AVX2_AVX512BW.
> (usdot_prod<mode>): New define_expand.
> (udot_prod<mode>): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/vnniint16-auto-vectorize-1.c: New test.
> * gcc.target/i386/vnniint16-auto-vectorize-2.c: Ditto.

OK with two changes below.

Thanks,
Uros.

> ---
>  gcc/config/i386/sse.md| 98 +--
>  .../i386/vnniint16-auto-vectorize-1.c | 28 ++
>  .../i386/vnniint16-auto-vectorize-2.c | 76 ++
>  3 files changed, 172 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vnniint16-auto-vectorize-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vnniint16-auto-vectorize-2.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 7471932b27e..98e7f9334bc 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -545,6 +545,9 @@
> V32HI (V16HI "TARGET_AVX512VL")])
>
>  (define_mode_iterator VI2_AVX2
> +  [(V16HI "TARGET_AVX2") V8HI])
> +
> +(define_mode_iterator VI2_AVX2_AVX512BW
>[(V32HI "TARGET_AVX512BW") (V16HI "TARGET_AVX2") V8HI])
>
>  (define_mode_iterator VI2_AVX512F
> @@ -637,9 +640,6 @@
> (V16HI "TARGET_AVX2") V8HI
> (V8SI "TARGET_AVX2") V4SI])
>
> -(define_mode_iterator VI2_AVX2_AVX512BW
> -  [(V32HI "TARGET_AVX512BW") (V16HI "TARGET_AVX2") V8HI])
> -
>  (define_mode_iterator VI248_AVX512VL
>[V32HI V16SI V8DI
> (V16HI "TARGET_AVX512VL") (V8SI "TARGET_AVX512VL")
> @@ -15298,16 +15298,16 @@
>  })
>
>  (define_expand "mul3"
> -  [(set (match_operand:VI2_AVX2 0 "register_operand")
> -   (mult:VI2_AVX2 (match_operand:VI2_AVX2 1 "vector_operand")
> -  (match_operand:VI2_AVX2 2 "vector_operand")))]
> +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand")
> +   (mult:VI2_AVX2_AVX512BW (match_operand:VI2_AVX2_AVX512BW 1 
> "vector_operand")
> +  (match_operand:VI2_AVX2_AVX512BW 2 "vector_operand")))]
>"TARGET_SSE2 &&  && "
>"ix86_fixup_binary_operands_no_copy (MULT, mode, operands);")
>
>  (define_insn "*mul3"
> -  [(set (match_operand:VI2_AVX2 0 "register_operand" "=x,")
> -   (mult:VI2_AVX2 (match_operand:VI2_AVX2 1 "vector_operand" "%0,")
> -  (match_operand:VI2_AVX2 2 "vector_operand" 
> "xBm,m")))]
> +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand" "=x,")
> +   (mult:VI2_AVX2_AVX512BW (match_operand:VI2_AVX2_AVX512BW 1 
> "vector_operand" "%0,")
> +  (match_operand:VI2_AVX2_AVX512BW 2 "vector_operand" 
> "xBm,m")))]
>"TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))
> &&  && "
>"@
> @@ -15320,28 +15320,28 @@
> (set_attr "mode" "")])
>
>  (define_expand "mul3_highpart"
> -  [(set (match_operand:VI2_AVX2 0 "register_operand")
> -   (truncate:VI2_AVX2
> +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand")
> +   (truncate:VI2_AVX2_AVX512BW
>   (lshiftrt:
> (mult:
>   (any_extend:
> -   (match_operand:VI2_AVX2 1 "vector_operand"))
> +   (match_operand:VI2_AVX2_AVX512BW 1 "vector_operand"))
>   (any_extend:
> -   (match_operand:VI2_AVX2 2 "vector_operand")))
> +   (match_operand:VI2_AVX2_AVX512BW 2 "vector_operand")))
> (const_int 16]
>"TARGET_SSE2
> &&  && "
>"ix86_fixup_binary_operands_no_copy (MULT, mode, operands);")
>
>  (define_insn "*mul3_highpart"
> -  [(set (match_operand:VI2_AVX2 0 "register_operand" "=x,")
> -   (truncate:VI2_AVX2
> +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand" "=x,")
> +   (truncate:VI2_AVX2_AVX512BW
>   (lshiftrt:
> (mult:
>   (any_extend:
> -   (match_operand:VI2_AVX2 1 "vector_operand" "%0,"))
> +   (match_operand:VI2_AVX2_AVX512BW 1 "vector_operand" 
> "%0,"))
>   (any_extend:
> -   (match_operand:VI2_AVX2 2 "vector_operand" "xBm,m")))
> +   (match_operand:VI2_AVX2_AVX512BW 2 "vector_operand" 
> "xBm,m")))
> (const_int 16]
>"TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))
> &&  && "
> @@ -15591,8 +15591,8 @@
>  (define_insn "avx512bw_pmaddwd512"
>[(set (match_operand: 0 "register_operand" "=v")
>(unspec:
> -[(match_operand:VI2_AVX2 1 "register_operand" "v")
> - (match_operand:VI2_AVX2 2 
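
To make the intended transformation concrete, here is a minimal sketch
(mine, not from the patch) of a loop the new udot_prod expander lets
GCC auto-vectorize with AVX-VNNI-INT16 (vpdpwuud); the -mavxvnniint16
option name is an assumption based on the ISA name, and the sum is
assumed to stay within int range:

/* u16 x u16 products accumulated into a 32-bit sum -- the shape that
   maps onto the udot_prod pattern.  */
int
udot (unsigned short *a, unsigned short *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}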

[PATCH] cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

2023-07-14 Thread Uros Bizjak via Gcc-patches
cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
that it equals 8 elements of HImode by setting a REG_EQUAL note:

(insn 21 19 22 4 (set (reg:V4QI 98)
(mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S4
A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
 (expr_list:REG_EQUAL (const_vector:V4QI [
(const_int -52 [0xffcc]) repeated x4
])
(nil)))
(insn 22 21 23 4 (set (reg:V8HI 100)
(zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 6 [0x6])
(const_int 7 [0x7])
] "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
 (expr_list:REG_EQUAL (const_vector:V8HI [
(const_int 204 [0xcc]) repeated x8
])
(expr_list:REG_DEAD (reg:V4QI 98)
(nil

We rely on the "undefined" vals to have a specific value (from the earlier
REG_EQUAL note) but actual code generation doesn't ensure this (it doesn't
need to).  That said, the issue isn't the constant folding per-se but that
we do not actually constant fold but register an equality that doesn't hold.

PR target/110206

gcc/ChangeLog:

* fwprop.cc (contains_paradoxical_subreg_p): Move to ...
* rtlanal.cc (contains_paradoxical_subreg_p): ... here.
* rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
* cprop.cc (try_replace_reg): Do not set REG_EQUAL note
when the original source contains a paradoxical subreg.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/pr110206.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

OK for mainline and backports?

Uros.
diff --git a/gcc/cprop.cc b/gcc/cprop.cc
index b7400c9a421..cf6facaa8c4 100644
--- a/gcc/cprop.cc
+++ b/gcc/cprop.cc
@@ -22,6 +22,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "backend.h"
 #include "rtl.h"
+#include "rtlanal.h"
 #include "cfghooks.h"
 #include "df.h"
 #include "insn-config.h"
@@ -795,7 +796,8 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
   /* If we've failed perform the replacement, have a single SET to
 a REG destination and don't yet have a note, add a REG_EQUAL note
 to not lose information.  */
-  if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set)))
+  if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set))
+ && !contains_paradoxical_subreg_p (SET_SRC (set)))
note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (src));
 }
 
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index ae342f59407..0707a234726 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -25,6 +25,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "backend.h"
 #include "rtl.h"
+#include "rtlanal.h"
 #include "df.h"
 #include "rtl-ssa.h"
 
@@ -353,21 +354,6 @@ reg_single_def_p (rtx x)
   return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
 }
 
-/* Return true if X contains a paradoxical subreg.  */
-
-static bool
-contains_paradoxical_subreg_p (rtx x)
-{
-  subrtx_var_iterator::array_type array;
-  FOR_EACH_SUBRTX_VAR (iter, array, x, NONCONST)
-{
-  x = *iter;
-  if (SUBREG_P (x) && paradoxical_subreg_p (x))
-   return true;
-}
-  return false;
-}
-
 /* Try to substitute (set DEST SRC), which defines DEF, into note NOTE of
USE_INSN.  Return the number of substitutions on success, otherwise return
-1 and leave USE_INSN unchanged.
diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 31707f3b90a..8b48fc243a1 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -6970,3 +6970,18 @@ vec_series_lowpart_p (machine_mode result_mode, 
machine_mode op_mode, rtx sel)
 }
   return false;
 }
+
+/* Return true if X contains a paradoxical subreg.  */
+
+bool
+contains_paradoxical_subreg_p (rtx x)
+{
+  subrtx_var_iterator::array_type array;
+  FOR_EACH_SUBRTX_VAR (iter, array, x, NONCONST)
+{
+  x = *iter;
+  if (SUBREG_P (x) && paradoxical_subreg_p (x))
+   return true;
+}
+  return false;
+}
diff --git a/gcc/rtlanal.h b/gcc/rtlanal.h
index 9013e75c04b..4f0dea8e99f 100644
--- a/gcc/rtlanal.h
+++ b/gcc/rtlanal.h
@@ -338,4 +338,6 @@ vec_series_highpart_p (machine_mode result_mode, 
machine_mode op_mode,
 bool
 vec_series_lowpart_p (machine_mode result_mode, machine_mode op_mode, rtx sel);
 
+bool
+contains_paradoxical_subreg_p (rtx x);
 #endif
diff --git a/gcc/testsuite/gcc.dg/torture/pr110206.c 
b/gcc/testsuite/gcc.dg/torture/pr110206.c
new file mode 100644
index 000..3a4f221ef47
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr110206.c
@@ -0,0 +1,30 

Re: [x86 PATCH] PR target/110588: Add *bt<mode>_setncqi_2 to generate btl

2023-07-13 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 13, 2023 at 7:10 PM Roger Sayle  wrote:
>
>
> This patch resolves PR target/110588 to catch another case in combine
> where the i386 backend should be generating a btl instruction.  This adds
> another define_insn_and_split to recognize the RTL representation for this
> case.
>
> I also noticed that two related define_insn_and_split weren't using the
> preferred string style for single statement preparation-statements, so
> I've reformatted these to be consistent in style with the new one.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-07-13  Roger Sayle  
>
> gcc/ChangeLog
> PR target/110588
> * config/i386/i386.md (*bt<mode>_setcqi): Prefer string form
> preparation statement over braces for a single statement.
> (*bt<mode>_setncqi): Likewise.
> (*bt<mode>_setncqi_2): New define_insn_and_split.
>
> gcc/testsuite/ChangeLog
> PR target/110588
> * gcc.target/i386/pr110588.c: New test case.

+;; Help combine recognize bt followed by setnc (PR target/110588)
+(define_insn_and_split "*bt_setncqi_2"
+  [(set (match_operand:QI 0 "register_operand")
+ (eq:QI
+  (zero_extract:SWI48
+(match_operand:SWI48 1 "register_operand")
+(const_int 1)
+(zero_extend:SI (match_operand:QI 2 "register_operand")))
+  (const_int 0)))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_USE_BT && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCC FLAGS_REG)
+(compare:CCC
+ (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2))
+ (const_int 0)))
+   (set (match_dup 0)
+(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))]
+  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")

I don't think the above transformation is 100% correct, mainly due to
the use of paradoxical subreg.

The combined instruction is operating with a zero_extended QImode
register, so all bits of the register are well defined. You are
splitting using paradoxical subreg, so you don't know what garbage is
there in the highpart of the count register. However, BTL/BTQ uses
modulo 64 (or 32) of this register, so even with a slightly invalid
RTX, everything checks out.

+  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")

You probably need <MODE>mode instead of SImode here.

Otherwise OK.

Thanks,
Uros.
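
For concreteness, here is my guess at the kind of source PR110588
covers (the committed pr110588.c test is not quoted in this thread):

/* Sketch: a bit test whose complement feeds the result; expected to
   combine into btq + setnc on tunings where TARGET_USE_BT holds,
   assuming bit is already reduced below the operand width.  */
unsigned char
bit_clear_p (unsigned long x, unsigned char bit)
{
  return !((x >> bit) & 1);
}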


[committed] alpha: Fix computation mode in alpha_emit_set_long_const [PR106966]

2023-07-13 Thread Uros Bizjak via Gcc-patches
PR target/106966

gcc/ChangeLog:

* config/alpha/alpha.cc (alpha_emit_set_long_const):
Always use DImode when constructing long const.

gcc/testsuite/ChangeLog:

* gcc.target/alpha/pr106966.c: New test.

Bootstrapped and regression tested by Matthias on alpha-linux-gnu.

Uros.
diff --git a/gcc/config/alpha/alpha.cc b/gcc/config/alpha/alpha.cc
index 360b50e20d4..beeab06a1aa 100644
--- a/gcc/config/alpha/alpha.cc
+++ b/gcc/config/alpha/alpha.cc
@@ -2070,6 +2070,8 @@ static rtx
 alpha_emit_set_long_const (rtx target, HOST_WIDE_INT c1)
 {
   HOST_WIDE_INT d1, d2, d3, d4;
+  machine_mode mode = GET_MODE (target);
+  rtx orig_target = target;
 
   /* Decompose the entire word */
 
@@ -2082,6 +2084,9 @@ alpha_emit_set_long_const (rtx target, HOST_WIDE_INT c1)
   d4 = ((c1 & 0x) ^ 0x8000) - 0x8000;
   gcc_assert (c1 == d4);
 
+  if (mode != DImode)
+target = gen_lowpart (DImode, target);
+
   /* Construct the high word */
   if (d4)
 {
@@ -2101,7 +2106,7 @@ alpha_emit_set_long_const (rtx target, HOST_WIDE_INT c1)
   if (d1)
 emit_move_insn (target, gen_rtx_PLUS (DImode, target, GEN_INT (d1)));
 
-  return target;
+  return orig_target;
 }
 
 /* Given an integral CONST_INT or CONST_VECTOR, return the low 64 bits.  */
diff --git a/gcc/testsuite/gcc.target/alpha/pr106966.c 
b/gcc/testsuite/gcc.target/alpha/pr106966.c
new file mode 100644
index 000..7145c2096c6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/alpha/pr106966.c
@@ -0,0 +1,13 @@
+/* PR target/106966 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mbuild-constants" } */
+
+void
+do_console (unsigned short *vga)
+{
+  vga[0] = 'H';
+  vga[1] = 'e';
+  vga[2] = 'l';
+  vga[3] = 'l';
+  vga[4] = 'o';
+}


[committed] IRA+LRA: Change return type of predicate functions from int to bool

2023-07-12 Thread Uros Bizjak via Gcc-patches
gcc/ChangeLog:

* ira.cc (equiv_init_varies_p): Change return type from int to bool
and adjust function body accordingly.
(equiv_init_movable_p): Ditto.
(memref_used_between_p): Ditto.
* lra-constraints.cc (valid_address_p): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/ira.cc b/gcc/ira.cc
index 02dea5d49ee..a1860105c60 100644
--- a/gcc/ira.cc
+++ b/gcc/ira.cc
@@ -3075,7 +3075,7 @@ validate_equiv_mem_from_store (rtx dest, const_rtx set 
ATTRIBUTE_UNUSED,
 info->equiv_mem_modified = true;
 }
 
-static int equiv_init_varies_p (rtx x);
+static bool equiv_init_varies_p (rtx x);
 
 enum valid_equiv { valid_none, valid_combine, valid_reload };
 
@@ -3145,8 +3145,8 @@ validate_equiv_mem (rtx_insn *start, rtx reg, rtx memref)
   return valid_none;
 }
 
-/* Returns zero if X is known to be invariant.  */
-static int
+/* Returns false if X is known to be invariant.  */
+static bool
 equiv_init_varies_p (rtx x)
 {
   RTX_CODE code = GET_CODE (x);
@@ -3162,14 +3162,14 @@ equiv_init_varies_p (rtx x)
 CASE_CONST_ANY:
 case SYMBOL_REF:
 case LABEL_REF:
-  return 0;
+  return false;
 
 case REG:
   return reg_equiv[REGNO (x)].replace == 0 && rtx_varies_p (x, 0);
 
 case ASM_OPERANDS:
   if (MEM_VOLATILE_P (x))
-   return 1;
+   return true;
 
   /* Fall through.  */
 
@@ -3182,24 +3182,24 @@ equiv_init_varies_p (rtx x)
 if (fmt[i] == 'e')
   {
if (equiv_init_varies_p (XEXP (x, i)))
- return 1;
+ return true;
   }
 else if (fmt[i] == 'E')
   {
int j;
for (j = 0; j < XVECLEN (x, i); j++)
  if (equiv_init_varies_p (XVECEXP (x, i, j)))
-   return 1;
+   return true;
   }
 
-  return 0;
+  return false;
 }
 
-/* Returns nonzero if X (used to initialize register REGNO) is movable.
+/* Returns true if X (used to initialize register REGNO) is movable.
X is only movable if the registers it uses have equivalent initializations
which appear to be within the same loop (or in an inner loop) and movable
or if they are not candidates for local_alloc and don't vary.  */
-static int
+static bool
 equiv_init_movable_p (rtx x, int regno)
 {
   int i, j;
@@ -3212,7 +3212,7 @@ equiv_init_movable_p (rtx x, int regno)
   return equiv_init_movable_p (SET_SRC (x), regno);
 
 case CLOBBER:
-  return 0;
+  return false;
 
 case PRE_INC:
 case PRE_DEC:
@@ -3220,7 +3220,7 @@ equiv_init_movable_p (rtx x, int regno)
 case POST_DEC:
 case PRE_MODIFY:
 case POST_MODIFY:
-  return 0;
+  return false;
 
 case REG:
   return ((reg_equiv[REGNO (x)].loop_depth >= reg_equiv[regno].loop_depth
@@ -3229,11 +3229,11 @@ equiv_init_movable_p (rtx x, int regno)
  && ! rtx_varies_p (x, 0)));
 
 case UNSPEC_VOLATILE:
-  return 0;
+  return false;
 
 case ASM_OPERANDS:
   if (MEM_VOLATILE_P (x))
-   return 0;
+   return false;
 
   /* Fall through.  */
 
@@ -3247,16 +3247,16 @@ equiv_init_movable_p (rtx x, int regno)
   {
   case 'e':
if (! equiv_init_movable_p (XEXP (x, i), regno))
- return 0;
+ return false;
break;
   case 'E':
for (j = XVECLEN (x, i) - 1; j >= 0; j--)
  if (! equiv_init_movable_p (XVECEXP (x, i, j), regno))
-   return 0;
+   return false;
break;
   }
 
-  return 1;
+  return true;
 }
 
 static bool memref_referenced_p (rtx memref, rtx x, bool read_p);
@@ -3370,7 +3370,7 @@ memref_referenced_p (rtx memref, rtx x, bool read_p)
Callers should not call this routine if START is after END in the
RTL chain.  */
 
-static int
+static bool
 memref_used_between_p (rtx memref, rtx_insn *start, rtx_insn *end)
 {
   rtx_insn *insn;
@@ -3383,15 +3383,15 @@ memref_used_between_p (rtx memref, rtx_insn *start, 
rtx_insn *end)
continue;
 
   if (memref_referenced_p (memref, PATTERN (insn), false))
-   return 1;
+   return true;
 
   /* Nonconst functions may access memory.  */
   if (CALL_P (insn) && (! RTL_CONST_CALL_P (insn)))
-   return 1;
+   return true;
 }
 
   gcc_assert (insn == NEXT_INSN (end));
-  return 0;
+  return false;
 }
 
 /* Mark REG as having no known equivalence.
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 123ff662cbc..9bfc88149ff 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -329,20 +329,20 @@ in_mem_p (int regno)
   return get_reg_class (regno) == NO_REGS;
 }
 
-/* Return 1 if ADDR is a valid memory address for mode MODE in address
+/* Return true if ADDR is a valid memory address for mode MODE in address
space AS, and check that each pseudo has the proper kind of hard
reg. */
-static int
+static bool
 valid_address_p (machine_mode mode ATTRIBUTE_UNUSED,
 rtx addr, addr_space_t as)
 {
 #ifdef GO_IF_LEGITIMATE_ADDRESS
   

[committed] ifcvt: Change return type of predicate functions from int to bool

2023-07-12 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* ifcvt.cc (cond_exec_changed_p): Change variable to bool.
(last_active_insn): Change "skip_use_p" function argument to bool.
(noce_operand_ok): Change return type from int to bool.
(find_cond_trap): Ditto.
(block_jumps_and_fallthru_p): Change "fallthru_p" and
"jump_p" variables to bool.
(noce_find_if_block): Change return type from int to bool.
(cond_exec_find_if_block): Ditto.
(find_if_case_1): Ditto.
(find_if_case_2): Ditto.
(dead_or_predicable): Ditto. Change "reversep" function arg to bool.
(block_jumps_and_fallthru): Rename from block_jumps_and_fallthru_p.
(cond_exec_process_insns): Change return type from int to bool.
Change "mod_ok" function arg to bool.
(cond_exec_process_if_block): Change return type from int to bool.
Change "do_multiple_p" function arg to bool.  Change "then_mod_ok"
variable to bool.
(noce_emit_store_flag): Change return type from int to bool.
Change "reversep" function arg to bool.  Change "cond_complex"
variable to bool.
(noce_try_move): Change return type from int to bool.
(noce_try_ifelse_collapse): Ditto.
(noce_try_store_flag): Ditto. Change "reversep" variable to bool.
(noce_try_addcc): Change return type from int to bool.  Change
"subtract" variable to bool.
(noce_try_store_flag_constants): Change return type from int to bool.
(noce_try_store_flag_mask): Ditto.  Change "reversep" variable to bool.
(noce_try_cmove): Change return type from int to bool.
(noce_try_cmove_arith): Ditto. Change "is_mem" variable to bool.
(noce_try_minmax): Change return type from int to bool.  Change
"unsignedp" variable to bool.
(noce_try_abs): Change return type from int to bool.  Change
"negate" variable to bool.
(noce_try_sign_mask): Change return type from int to bool.
(noce_try_move): Ditto.
(noce_try_store_flag_constants): Ditto.
(noce_try_cmove): Ditto.
(noce_try_cmove_arith): Ditto.
(noce_try_minmax): Ditto.  Change "unsignedp" variable to bool.
(noce_try_bitop): Change return type from int to bool.
(noce_operand_ok): Ditto.
(noce_convert_multiple_sets): Ditto.
(noce_convert_multiple_sets_1): Ditto.
(noce_process_if_block): Ditto.
(check_cond_move_block): Ditto.
(cond_move_process_if_block): Ditto. Change "success_p"
variable to bool.
(rest_of_handle_if_conversion): Change return type to void.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
index 0b180b4568f..a0af553b9ff 100644
--- a/gcc/ifcvt.cc
+++ b/gcc/ifcvt.cc
@@ -73,29 +73,29 @@ static int num_updated_if_blocks;
 static int num_true_changes;
 
 /* Whether conditional execution changes were made.  */
-static int cond_exec_changed_p;
+static bool cond_exec_changed_p;
 
 /* Forward references.  */
 static int count_bb_insns (const_basic_block);
 static bool cheap_bb_rtx_cost_p (const_basic_block, profile_probability, int);
 static rtx_insn *first_active_insn (basic_block);
-static rtx_insn *last_active_insn (basic_block, int);
+static rtx_insn *last_active_insn (basic_block, bool);
 static rtx_insn *find_active_insn_before (basic_block, rtx_insn *);
 static rtx_insn *find_active_insn_after (basic_block, rtx_insn *);
 static basic_block block_fallthru (basic_block);
 static rtx cond_exec_get_condition (rtx_insn *, bool);
 static rtx noce_get_condition (rtx_insn *, rtx_insn **, bool);
-static int noce_operand_ok (const_rtx);
+static bool noce_operand_ok (const_rtx);
 static void merge_if_block (ce_if_block *);
-static int find_cond_trap (basic_block, edge, edge);
+static bool find_cond_trap (basic_block, edge, edge);
 static basic_block find_if_header (basic_block, int);
-static int block_jumps_and_fallthru_p (basic_block, basic_block);
-static int noce_find_if_block (basic_block, edge, edge, int);
-static int cond_exec_find_if_block (ce_if_block *);
-static int find_if_case_1 (basic_block, edge, edge);
-static int find_if_case_2 (basic_block, edge, edge);
-static int dead_or_predicable (basic_block, basic_block, basic_block,
-  edge, int);
+static int block_jumps_and_fallthru (basic_block, basic_block);
+static bool noce_find_if_block (basic_block, edge, edge, int);
+static bool cond_exec_find_if_block (ce_if_block *);
+static bool find_if_case_1 (basic_block, edge, edge);
+static bool find_if_case_2 (basic_block, edge, edge);
+static bool dead_or_predicable (basic_block, basic_block, basic_block,
+   edge, bool);
 static void noce_emit_move_insn (rtx, rtx);
 static rtx_insn *block_has_only_trap (basic_block);
 static void need_cmov_or_rewire (basic_block, hash_set *,
@@ -234,7 +234,7 @@ first_active_insn (basic_block bb)
 /* Return the last non-jump active (non-jump) insn in the basic block.  */
 
 static rtx_insn *

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-12 Thread Uros Bizjak via Gcc-patches
On Wed, Jul 12, 2023 at 12:58 PM Uros Bizjak  wrote:
>
> On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
>  wrote:
> >
> > Richard Biener via Gcc-patches  writes:
> > > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak  wrote:
> > >>
> > >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> > >>  wrote:
> > >> >
> > >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
> > >> > >
> > >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > >> > >  wrote:
> > >> > > >
> > >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > >> > > >  wrote:
> > >> > > > >
> > >> > > > > As shown in the PR, simplify_gen_subreg call in 
> > >> > > > > simplify_replace_fn_rtx:
> > >> > > > >
> > >> > > > > (gdb) list
> > >> > > > > 469   if (code == SUBREG)
> > >> > > > > 470 {
> > >> > > > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > >> > > > > old_rtx, fn, data);
> > >> > > > > 472   if (op0 == SUBREG_REG (x))
> > >> > > > > 473 return x;
> > >> > > > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > >> > > > > 475  GET_MODE 
> > >> > > > > (SUBREG_REG (x)),
> > >> > > > > 476  SUBREG_BYTE (x));
> > >> > > > > 477   return op0 ? op0 : x;
> > >> > > > > 478 }
> > >> > > > >
> > >> > > > > simplifies with following arguments:
> > >> > > > >
> > >> > > > > (gdb) p debug_rtx (op0)
> > >> > > > > (const_vector:V4QI [
> > >> > > > > (const_int -52 [0xffcc]) repeated x4
> > >> > > > > ])
> > >> > > > > (gdb) p debug_rtx (x)
> > >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> > >> > > > >
> > >> > > > > to:
> > >> > > > >
> > >> > > > > (gdb) p debug_rtx (op0)
> > >> > > > > (const_vector:V16QI [
> > >> > > > > (const_int -52 [0xffcc]) repeated x16
> > >> > > > > ])
> > >> > > > >
> > >> > > > > This simplification is invalid, it is not possible to get 
> > >> > > > > V16QImode vector
> > >> > > > > from V4QImode vector, even when all elements are duplicates.
> > >> >
> > >> > ^^^
> > >> >
> > >> > I think this simplification is valid.  A simplification to
> > >> >
> > >> > (const_vector:V16QI [
> > >> >  (const_int -52 [0xffcc]) repeated x4
> > >> >  (const_int 0 [0]) repeated x12
> > >> >  ])
> > >> >
> > >> > would be valid as well.
> > >> >
> > >> > > > > The simplification happens in simplify_context::simplify_subreg:
> > >> > > > >
> > >> > > > > (gdb) list
> > >> > > > > 7558  if (VECTOR_MODE_P (outermode)
> > >> > > > > 7559  && GET_MODE_INNER (outermode) == 
> > >> > > > > GET_MODE_INNER (innermode)
> > >> > > > > 7560  && vec_duplicate_p (op, ))
> > >> > > > > 7561return gen_vec_duplicate (outermode, elt);
> > >> > > > >
> > >> > > > > but the above simplification is valid only for non-paradoxical 
> > >> > > > > registers,
> > >> > > > > where outermode <= innermode.  We should not assume that 
> > >> > > > > elements outside
> > >> > > > > the original register are valid, let alone all duplicates.
> > >> > > >
> > >> > > > Hmm, but looking at the audit trail the x86 backend expects them 
> > >> > > > to be zero?

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-12 Thread Uros Bizjak via Gcc-patches
On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
 wrote:
>
> Richard Biener via Gcc-patches  writes:
> > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak  wrote:
> >>
> >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> >>  wrote:
> >> >
> >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
> >> > >
> >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> >> > >  wrote:
> >> > > >
> >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> >> > > >  wrote:
> >> > > > >
> >> > > > > As shown in the PR, simplify_gen_subreg call in 
> >> > > > > simplify_replace_fn_rtx:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 469   if (code == SUBREG)
> >> > > > > 470 {
> >> > > > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> >> > > > > old_rtx, fn, data);
> >> > > > > 472   if (op0 == SUBREG_REG (x))
> >> > > > > 473 return x;
> >> > > > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> >> > > > > 475  GET_MODE (SUBREG_REG 
> >> > > > > (x)),
> >> > > > > 476  SUBREG_BYTE (x));
> >> > > > > 477   return op0 ? op0 : x;
> >> > > > > 478 }
> >> > > > >
> >> > > > > simplifies with following arguments:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V4QI [
> >> > > > > (const_int -52 [0xffcc]) repeated x4
> >> > > > > ])
> >> > > > > (gdb) p debug_rtx (x)
> >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> >> > > > >
> >> > > > > to:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V16QI [
> >> > > > > (const_int -52 [0xffcc]) repeated x16
> >> > > > > ])
> >> > > > >
> >> > > > > This simplification is invalid, it is not possible to get 
> >> > > > > V16QImode vector
> >> > > > > from V4QImode vector, even when all elements are duplicates.
> >> >
> >> > ^^^
> >> >
> >> > I think this simplification is valid.  A simplification to
> >> >
> >> > (const_vector:V16QI [
> >> >  (const_int -52 [0xffcc]) repeated x4
> >> >  (const_int 0 [0]) repeated x12
> >> >  ])
> >> >
> >> > would be valid as well.
> >> >
> >> > > > > The simplification happens in simplify_context::simplify_subreg:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 7558  if (VECTOR_MODE_P (outermode)
> >> > > > > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> >> > > > > (innermode)
> >> > > > > 7560  && vec_duplicate_p (op, ))
> >> > > > > 7561return gen_vec_duplicate (outermode, elt);
> >> > > > >
> >> > > > > but the above simplification is valid only for non-paradoxical 
> >> > > > > registers,
> >> > > > > where outermode <= innermode.  We should not assume that elements 
> >> > > > > outside
> >> > > > > the original register are valid, let alone all duplicates.
> >> > > >
> >> > > > Hmm, but looking at the audit trail the x86 backend expects them to 
> >> > > > be zero?
> >> > > > Isn't that wrong as well?
> >> > >
> >> > > If you mean Comment #10, it is just an observation that
> >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >> > >
> >> > > (gdb) p debug_rtx (src)
> >> > > (const_vector:V8HI [
> >> > > (const_int 204 [0xcc]) repeated x4
> >> > > (const_int 0 [0

Re: [x86 PATCH] Fix FAIL of gcc.target/i386/pr91681-1.c

2023-07-12 Thread Uros Bizjak via Gcc-patches
On Tue, Jul 11, 2023 at 10:07 PM Roger Sayle  wrote:
>
>
> The recent change in TImode parameter passing on x86_64 results in the
> FAIL of pr91681-1.c.  The issue is that with the extra flexibility,
> the combine pass is now spoilt for choice between using either the
> *add<dwi>3_doubleword_concat or the *add<dwi>3_doubleword_zext
> patterns, when one operand is a *concat and the other is a zero_extend.
> The solution proposed below is to provide an *add<dwi>3_doubleword_concat_zext
> define_insn_and_split that can benefit both from the register allocation
> of *concat, and still avoid the xor normally required by zero extension.
>
> I'm investigating a follow-up refinement to improve register allocation
> further by avoiding the early clobber in the =&r, and handling (custom)
> reloads explicitly, but this piece resolves the testcase failure.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-07-11  Roger Sayle  
>
> gcc/ChangeLog
> PR target/91681
> * config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): New
> define_insn_and_split derived from *add<dwi>3_doubleword_concat
> and *add<dwi>3_doubleword_zext.

OK.

Thanks,
Uros.

>
>
> Thanks,
> Roger
> --
>
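
Since the patch adds no testcase of its own, here is a hypothetical
example of the shape the new pattern handles -- one addend built by
concatenating two 64-bit halves, the other zero-extended from 64 bits
(my illustration; pr91681-1.c itself is not shown here):

/* Sketch: the zeroing that zero extension of c would normally need
   can be dropped, because the concatenation fully defines the other
   TImode operand.  */
unsigned __int128
add_concat_zext (unsigned long long hi, unsigned long long lo,
                 unsigned long long c)
{
  unsigned __int128 x = ((unsigned __int128) hi << 64) | lo;
  return x + c;
}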


Re: [x86 PATCH] PR target/110598: Fix rega = 0; rega ^= rega regression.

2023-07-12 Thread Uros Bizjak via Gcc-patches
On Tue, Jul 11, 2023 at 9:07 PM Roger Sayle  wrote:
>
>
> This patch fixes the regression PR target/110598 caused by my recent
> addition of a peephole2.  The intention of that optimization was to
> simplify zeroing a register, followed by an IOR, XOR or PLUS operation
> on it into a move, or as described in the comment:
> ;; Peephole2 rega = 0; rega op= regb into rega = regb.
>
> The issue is that I'd failed to consider the (rare and unusual) case
> where regb is rega, in which the transformation leads to the incorrect
> "rega = rega", when it should be "rega = 0".  The minimal fix is to
> add a !reg_mentioned_p check to the recent peephole2.
>
> In addition to resolving the regression, I've added a second peephole2
> to optimize the problematic case above, which contains a false
> dependency and is therefore tricky to optimize elsewhere.  This is an
> improvement over GCC 13, which, for example, generates the redundant:
>
> xorl    %edx, %edx
> xorq    %rdx, %rdx
>
>
> 2023-07-11  Roger Sayle  
>
> gcc/ChangeLog
> PR target/110598
> * config/i386/i386.md (peephole2): Check !reg_mentioned_p when
> optimizing rega = 0; rega op= regb for op in [XOR,IOR,PLUS].
> (peephole2): Simplify rega = 0; rega op= rega cases.
>
> gcc/testsuite/ChangeLog
> PR target/110598
> * gcc.target/i386/pr110598.c: New test case.

OK.

Thanks,
Uros.

>
>
> Thanks in advance (and apologies for any inconvenience),
> Roger
> --
>


[committed] cfg+gcse: Change return type of predicate functions from int to bool

2023-07-11 Thread Uros Bizjak via Gcc-patches
Also change some internal variables from int to bool.

gcc/ChangeLog:

* cfghooks.cc (verify_flow_info): Change "err" variable to bool.
* cfghooks.h (struct cfg_hooks): Change return type of
verify_flow_info from integer to bool.
* cfgrtl.cc (can_delete_note_p): Change return type from int to bool.
(can_delete_label_p): Ditto.
(rtl_verify_flow_info): Change return type from int to bool
and adjust function body accordingly.  Change "err" variable to bool.
(rtl_verify_flow_info_1): Ditto.
(free_bb_for_insn): Change return type to void.
(rtl_merge_blocks): Change "b_empty" variable to bool.
(try_redirect_by_replacing_jump): Change "fallthru" variable to bool.
(verify_hot_cold_block_grouping): Change return type from int to bool.
Change "err" variable to bool.
(rtl_verify_edges): Ditto.
(rtl_verify_bb_insns): Ditto.
(rtl_verify_bb_pointers): Ditto.
(rtl_verify_bb_insn_chain): Ditto.
(rtl_verify_fallthru): Ditto.
(rtl_verify_bb_layout): Ditto.
(purge_all_dead_edges): Change "purged" variable to bool.
* cfgrtl.h (free_bb_for_insn): Change return type from int to void.
* postreload-gcse.cc (expr_hasher::equal): Change "equiv_p" to bool.
(load_killed_in_block_p): Change return type from int to bool
and adjust function body accordingly.
(oprs_unchanged_p): Return true/false.
(rest_of_handle_gcse2): Change return type to void.
* tree-cfg.cc (gimple_verify_flow_info): Change return type from
int to bool.  Change "err" variable to bool.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/cfghooks.cc b/gcc/cfghooks.cc
index 54564035415..37e0dbc1a13 100644
--- a/gcc/cfghooks.cc
+++ b/gcc/cfghooks.cc
@@ -102,7 +102,7 @@ DEBUG_FUNCTION void
 verify_flow_info (void)
 {
   size_t *edge_checksum;
-  int err = 0;
+  bool err = false;
   basic_block bb, last_bb_seen;
   basic_block *last_visited;
 
@@ -118,14 +118,14 @@ verify_flow_info (void)
  && bb != BASIC_BLOCK_FOR_FN (cfun, bb->index))
{
  error ("bb %d on wrong place", bb->index);
- err = 1;
+ err = true;
}
 
   if (bb->prev_bb != last_bb_seen)
{
  error ("prev_bb of %d should be %d, not %d",
 bb->index, last_bb_seen->index, bb->prev_bb->index);
- err = 1;
+ err = true;
}
 
   last_bb_seen = bb;
@@ -142,18 +142,18 @@ verify_flow_info (void)
{
  error ("verify_flow_info: Block %i has loop_father, but there are no 
loops",
 bb->index);
- err = 1;
+ err = true;
}
   if (bb->loop_father == NULL && current_loops != NULL)
{
  error ("verify_flow_info: Block %i lacks loop_father", bb->index);
- err = 1;
+ err = true;
}
 
   if (!bb->count.verify ())
{
  error ("verify_flow_info: Wrong count of block %i", bb->index);
- err = 1;
+ err = true;
}
   /* FIXME: Graphite and SLJL and target code still tends to produce
 edges with no probability.  */
@@ -161,13 +161,13 @@ verify_flow_info (void)
   && !bb->count.initialized_p () && !flag_graphite && 0)
{
  error ("verify_flow_info: Missing count of block %i", bb->index);
- err = 1;
+ err = true;
}
 
   if (bb->flags & ~cfun->cfg->bb_flags_allocated)
{
  error ("verify_flow_info: unallocated flag set on BB %d", bb->index);
- err = 1;
+ err = true;
}
 
   FOR_EACH_EDGE (e, ei, bb->succs)
@@ -176,7 +176,7 @@ verify_flow_info (void)
{
  error ("verify_flow_info: Duplicate edge %i->%i",
 e->src->index, e->dest->index);
- err = 1;
+ err = true;
}
  /* FIXME: Graphite and SLJL and target code still tends to produce
 edges with no probability.  */
@@ -185,13 +185,13 @@ verify_flow_info (void)
{
  error ("Uninitialized probability of edge %i->%i", e->src->index,
 e->dest->index);
- err = 1;
+ err = true;
}
  if (!e->probability.verify ())
{
  error ("verify_flow_info: Wrong probability of edge %i->%i",
 e->src->index, e->dest->index);
- err = 1;
+ err = true;
}
 
  last_visited [e->dest->index] = bb;
@@ -208,14 +208,14 @@ verify_flow_info (void)
  fprintf (stderr, "\nSuccessor: ");
  dump_edge_info (stderr, e, TDF_DETAILS, 1);
  fprintf (stderr, "\n");
- err = 1;
+ err = true;
}
 
  if (e->flags & ~cfun->cfg->edge_flags_allocated)
{
  error ("verify_flow_info: unallocated edge flag set on %d -> %d",
 e->src->index, e->dest->index);
- 

[committed] reorg: Change return type of predicate functions from int to bool

2023-07-10 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* reorg.cc (stop_search_p): Change return type from int to bool
and adjust function body accordingly.
(resource_conflicts_p): Ditto.
(insn_references_resource_p): Change return type from int to bool.
(insn_sets_resource_p): Ditto.
(redirect_with_delay_slots_safe_p): Ditto.
(condition_dominates_p): Change return type from int to bool
and adjust function body accordingly.
(redirect_with_delay_list_safe_p): Ditto.
(check_annul_list_true_false): Ditto.  Change "annul_true_p"
function argument to bool.
(steal_delay_list_from_target): Change "pannul_p" function
argument to bool pointer.  Change "must_annul" and "used_annul"
variables from int to bool.
(steal_delay_list_from_fallthrough): Ditto.
(own_thread_p): Change return type from int to bool and adjust
function body accordingly.  Change "allow_fallthrough" function
argument to bool.
(reorg_redirect_jump): Change return type from int to bool.
(fill_simple_delay_slots): Change "non_jumps_p" function
argument from int to bool.  Change "maybe_never" varible to bool.
(fill_slots_from_thread): Change "likely", "thread_if_true" and
"own_thread" function arguments to bool.  Change "lose" and
"must_annul" variables to bool.
(delete_from_delay_slot): Change "had_barrier" variable to bool.
(try_merge_delay_insns): Change "annul_p" variable to bool.
(fill_eager_delay_slots): Change "own_target" and "own_fallthrouhg"
variables to bool.
(rest_of_handle_delay_slots): Change return type from int to void
and adjust function body accordingly.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/reorg.cc b/gcc/reorg.cc
index ed32c91c3fa..81290463833 100644
--- a/gcc/reorg.cc
+++ b/gcc/reorg.cc
@@ -174,10 +174,10 @@ static int *uid_to_ruid;
 /* Highest valid index in `uid_to_ruid'.  */
 static int max_uid;
 
-static int stop_search_p (rtx_insn *, int);
-static int resource_conflicts_p (struct resources *, struct resources *);
-static int insn_references_resource_p (rtx, struct resources *, bool);
-static int insn_sets_resource_p (rtx, struct resources *, bool);
+static bool stop_search_p (rtx_insn *, bool);
+static bool resource_conflicts_p (struct resources *, struct resources *);
+static bool insn_references_resource_p (rtx, struct resources *, bool);
+static bool insn_sets_resource_p (rtx, struct resources *, bool);
 static rtx_code_label *find_end_label (rtx);
 static rtx_insn *emit_delay_sequence (rtx_insn *, const vec &,
  int);
@@ -188,35 +188,35 @@ static void note_delay_statistics (int, int);
 static int get_jump_flags (const rtx_insn *, rtx);
 static int mostly_true_jump (rtx);
 static rtx get_branch_condition (const rtx_insn *, rtx);
-static int condition_dominates_p (rtx, const rtx_insn *);
-static int redirect_with_delay_slots_safe_p (rtx_insn *, rtx, rtx);
-static int redirect_with_delay_list_safe_p (rtx_insn *, rtx,
-   const vec &);
-static int check_annul_list_true_false (int, const vec &);
+static bool condition_dominates_p (rtx, const rtx_insn *);
+static bool redirect_with_delay_slots_safe_p (rtx_insn *, rtx, rtx);
+static bool redirect_with_delay_list_safe_p (rtx_insn *, rtx,
+const vec &);
+static bool check_annul_list_true_false (bool, const vec &);
 static void steal_delay_list_from_target (rtx_insn *, rtx, rtx_sequence *,
  vec *,
  struct resources *,
  struct resources *,
  struct resources *,
- int, int *, int *,
+ int, int *, bool *,
  rtx *);
 static void steal_delay_list_from_fallthrough (rtx_insn *, rtx, rtx_sequence *,
   vec *,
   struct resources *,
   struct resources *,
   struct resources *,
-  int, int *, int *);
+  int, int *, bool *);
 static void try_merge_delay_insns (rtx_insn *, rtx_insn *);
 static rtx_insn *redundant_insn (rtx, rtx_insn *, const vec &);
-static int own_thread_p (rtx, rtx, int);
+static bool own_thread_p (rtx, rtx, bool);
 static void update_block (rtx_insn *, rtx_insn *);
-static int reorg_redirect_jump (rtx_jump_insn *, rtx);
+static bool reorg_redirect_jump (rtx_jump_insn *, rtx);
 static void update_reg_dead_notes (rtx_insn *, rtx_insn *);
 static void fix_reg_dead_note (rtx_insn *, rtx);
 static void 

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
 wrote:
>
> On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
> >
> > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> >  wrote:
> > >
> > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > >  wrote:
> > > >
> > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > > >
> > > > (gdb) list
> > > > 469   if (code == SUBREG)
> > > > 470 {
> > > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > > old_rtx, fn, data);
> > > > 472   if (op0 == SUBREG_REG (x))
> > > > 473 return x;
> > > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > > 475  GET_MODE (SUBREG_REG (x)),
> > > > 476  SUBREG_BYTE (x));
> > > > 477   return op0 ? op0 : x;
> > > > 478 }
> > > >
> > > > simplifies with following arguments:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V4QI [
> > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > > ])
> > > > (gdb) p debug_rtx (x)
> > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > >
> > > > to:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V16QI [
> > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > > > ])
> > > >
> > > > This simplification is invalid, it is not possible to get V16QImode 
> > > > vector
> > > > from V4QImode vector, even when all elements are duplicates.
>
> ^^^
>
> I think this simplification is valid.  A simplification to
>
> (const_vector:V16QI [
>  (const_int -52 [0xffffffffffffffcc]) repeated x4
>  (const_int 0 [0]) repeated x12
>  ])
>
> would be valid as well.
>
> > > > The simplification happens in simplify_context::simplify_subreg:
> > > >
> > > > (gdb) list
> > > > 7558  if (VECTOR_MODE_P (outermode)
> > > > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > > > (innermode)
> > > > 7560  && vec_duplicate_p (op, &elt))
> > > > 7561return gen_vec_duplicate (outermode, elt);
> > > >
> > > > but the above simplification is valid only for non-paradoxical 
> > > > registers,
> > > > where outermode <= innermode.  We should not assume that elements 
> > > > outside
> > > > the original register are valid, let alone all duplicates.
> > >
> > > Hmm, but looking at the audit trail the x86 backend expects them to be 
> > > zero?
> > > Isn't that wrong as well?
> >
> > If you mean Comment #10, it is just an observation that
> > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x4
> > (const_int 0 [0]) repeated x4
> > ])
> >
> > instead of:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x8
> > ])
> >
> > which is in line with the statement below.
> > >
> > > That is, I think putting any random value into the upper lanes when
> > > constant folding
> > > a paradoxical subreg sounds OK to me, no?
> >
> > The compiler is putting zero there as can be seen from the above new RTX.
> >
> > > Of course we might choose to not do such constant propagation for
> > > efficiency reason - at least
> > > when the resulting CONST_* would require a larger constant pool entry
> > > or more costly
> > > construction.
> >
> > This is probably a follow-up improvement, where this patch tries to
> > fix a specific invalid simplification of simplify_replace_rtx that is
> > invalid universally.
>
> How so?  What specifies the values of the paradoxical subreg for the
> bytes not covered by the subreg operand?

I don't know why 0 is generated here (and if it is valid) for
paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
the wrong constant and triggers unwanted propagation later on.

Uros.


Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
 wrote:
>
> On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> >
> > (gdb) list
> > 469   if (code == SUBREG)
> > 470 {
> > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > old_rtx, fn, data);
> > 472   if (op0 == SUBREG_REG (x))
> > 473 return x;
> > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > 475  GET_MODE (SUBREG_REG (x)),
> > 476  SUBREG_BYTE (x));
> > 477   return op0 ? op0 : x;
> > 478 }
> >
> > simplifies with following arguments:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V4QI [
> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > ])
> > (gdb) p debug_rtx (x)
> > (subreg:V16QI (reg:V4QI 98) 0)
> >
> > to:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V16QI [
> > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > ])
> >
> > This simplification is invalid, it is not possible to get V16QImode vector
> > from V4QImode vector, even when all elements are duplicates.
> >
> > The simplification happens in simplify_context::simplify_subreg:
> >
> > (gdb) list
> > 7558  if (VECTOR_MODE_P (outermode)
> > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > (innermode)
> > 7560  && vec_duplicate_p (op, &elt))
> > 7561return gen_vec_duplicate (outermode, elt);
> >
> > but the above simplification is valid only for non-paradoxical registers,
> > where outermode <= innermode.  We should not assume that elements outside
> > the original register are valid, let alone all duplicates.
>
> Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> Isn't that wrong as well?

If you mean Comment #10, it is just an observation that
simplify_replace_rtx simplifies arguments from Comment #9 to:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x4
(const_int 0 [0]) repeated x4
])

instead of:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x8
])

which is in line with the statement below.
>
> That is, I think putting any random value into the upper lanes when
> constant folding
> a paradoxical subreg sounds OK to me, no?

The compiler is putting zero there as can be seen from the above new RTX.

> Of course we might choose to not do such constant propagation for
> efficiency reason - at least
> when the resulting CONST_* would require a larger constant pool entry
> or more costly
> construction.

This is probably a follow-up improvement, where this patch tries to
fix a specific invalid simplification of simplify_replace_rtx that is
invalid universally.

Uros.


Re: [X86 PATCH] Add new insvti_lowpart_1 and insvdi_lowpart_1 patterns.

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Sun, Jul 9, 2023 at 11:30 PM Roger Sayle  wrote:
>
>
> This patch implements another of Uros' suggestions, to investigate a
> insvti_lowpart_1 pattern to improve TImode parameter passing on x86_64.
> In PR 88873, the RTL the middle-end expands for passing V2DF in TImode
> is subtly different from what it does for V2DI in TImode, sufficiently so
> that my explanations for why insvti_lowpart_1 isn't required don't apply
> in this case.
>
> This patch adds an insvti_lowpart_1 pattern, complementing the existing
> insvti_highpart_1 pattern, and also a 32-bit variant, insvdi_lowpart_1.
> Because the middle-end represents 128-bit constants using CONST_WIDE_INT
> and 64-bit constants using CONST_INT, it's easiest to treat these as
> different patterns, rather than attempt  parameterization.
>
> This patch also includes a peephole2 (actually a pair) to transform
> xchg instructions into mov instructions, when one of the destinations
> is unused.  This optimization is required to produce the optimal code
> sequences below.
>
> For the 64-bit case:
>
> __int128 foo(__int128 x, unsigned long long y)
> {
>   __int128 m = ~((__int128)~0ull);
>   __int128 t = x & m;
>   __int128 r = t | y;
>   return r;
> }
>
> Before:
>         xchgq   %rdi, %rsi
>         movq    %rdx, %rax
>         xorl    %esi, %esi
>         xorl    %edx, %edx
>         orq     %rsi, %rax
>         orq     %rdi, %rdx
>         ret
>
> After:
>         movq    %rdx, %rax
>         movq    %rsi, %rdx
>         ret
>
> For the 32-bit case:
>
> long long bar(long long x, int y)
> {
>   long long mask = ~0ull << 32;
>   long long t = x & mask;
>   long long r = t | (unsigned int)y;
>   return r;
> }
>
> Before:
>         pushl   %ebx
>         movl    12(%esp), %edx
>         xorl    %ebx, %ebx
>         xorl    %eax, %eax
>         movl    16(%esp), %ecx
>         orl     %ebx, %edx
>         popl    %ebx
>         orl     %ecx, %eax
>         ret
>
> After:
>         movl    12(%esp), %eax
>         movl    8(%esp), %edx
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-07-09  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (peephole2): Transform xchg insn with a
> REG_UNUSED note to a (simple) move.
> (*insvti_lowpart_1): New define_insn_and_split.
> (*insvdi_lowpart_1): Likewise.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/insvdi_lowpart-1.c: New test case.
> * gcc.target/i386/insvti_lowpart-1.c: Likewise.

OK.

Thanks,
Uros.

>
>
> Cheers,
> Roger
> --
>


Re: [x86 PATCH] Add AVX512 support for STV of SI/DImode rotation by constant.

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Sun, Jul 9, 2023 at 10:35 PM Roger Sayle  wrote:
>
>
> Following Uros' suggestion, this patch adds support for AVX512VL's
> vpro[lr][dq] instructions to the recently added scalar-to-vector (STV)
> enhancements to handle DImode and SImode rotations by a constant.
>
> For the test cases:
>
> unsigned long long rot1(unsigned long long x) {
>   return (x>>1) | (x<<63);
> }
>
> void mem1(unsigned long long *p) {
>   *p = rot1(*p);
> }
>
> with -m32 -O2 -mavx512vl, we currently generate:
>
> rot1:   movl    4(%esp), %eax
>         movl    8(%esp), %edx
>         movl    %eax, %ecx
>         shrdl   $1, %edx, %eax
>         shrdl   $1, %ecx, %edx
>         ret
>
> mem1:   movl    4(%esp), %eax
>         vmovq   (%eax), %xmm0
>         vpshufd $20, %xmm0, %xmm0
>         vpsrlq  $1, %xmm0, %xmm0
>         vpshufd $136, %xmm0, %xmm0
>         vmovq   %xmm0, (%eax)
>         ret
>
> with this patch, we now generate:
>
> rot1:   vmovq   4(%esp), %xmm0
>         vprorq  $1, %xmm0, %xmm0
>         vmovd   %xmm0, %eax
>         vpextrd $1, %xmm0, %edx
>         ret
>
> mem1:   movl    4(%esp), %eax
>         vmovq   (%eax), %xmm0
>         vprorq  $1, %xmm0, %xmm0
>         vmovq   %xmm0, (%eax)
>         ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu (cascadelake which has
> avx512) with make bootstrap and make -k check, both with and without
> --target_board=unix{-m32} with no new failures.  Ok for mainline?
>
>
> 2023-07-09  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (compute_convert_gain): Tweak
> gains/costs for ROTATE/ROTATERT by integer constant on AVX512VL.
> (general_scalar_chain::convert_rotate): On TARGET_AVX512F generate
> avx512vl_rolv2di or avx412vl_rolv4si when appropriate.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/avx512vl-stv-rotatedi-1.c: New test case.

OK.

Thanks,
Uros.

>
>
> Cheers,
> Roger
> --
>


[PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-09 Thread Uros Bizjak via Gcc-patches
As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:

(gdb) list
469   if (code == SUBREG)
470 {
471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
old_rtx, fn, data);
472   if (op0 == SUBREG_REG (x))
473 return x;
474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
475  GET_MODE (SUBREG_REG (x)),
476  SUBREG_BYTE (x));
477   return op0 ? op0 : x;
478 }

simplifies with following arguments:

(gdb) p debug_rtx (op0)
(const_vector:V4QI [
(const_int -52 [0xffffffffffffffcc]) repeated x4
])
(gdb) p debug_rtx (x)
(subreg:V16QI (reg:V4QI 98) 0)

to:

(gdb) p debug_rtx (op0)
(const_vector:V16QI [
(const_int -52 [0xffffffffffffffcc]) repeated x16
])

This simplification is invalid, it is not possible to get V16QImode vector
from V4QImode vector, even when all elements are duplicates.

The simplification happens in simplify_context::simplify_subreg:

(gdb) list
7558  if (VECTOR_MODE_P (outermode)
7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
7560  && vec_duplicate_p (op, &elt))
7561return gen_vec_duplicate (outermode, elt);

but the above simplification is valid only for non-paradoxical registers,
where outermode <= innermode.  We should not assume that elements outside
the original register are valid, let alone all duplicates.
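
To make the failure mode concrete, here is a rough C sketch of the same
situation (illustrative only, not code from the PR):

  unsigned char inner[4] = { 0xcc, 0xcc, 0xcc, 0xcc };  /* (reg:V4QI 98) */
  unsigned char outer[16];             /* (subreg:V16QI (reg:V4QI 98) 0) */
  __builtin_memcpy (outer, inner, 4);  /* only bytes 0-3 are defined */
  /* Bytes 4-15 of outer were never provided by the V4QI register;
     folding the subreg to sixteen 0xcc lanes asserts a value for
     exactly those undefined bytes.  */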

PR target/110206

gcc/ChangeLog:

* simplify-rtx.cc (simplify_context::simplify_subreg):
Avoid returning a vector with duplicated value
outside the original register.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/pr110206.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

OK for master and release branches?

Uros.
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d7315d82aa3..87ca25086dc 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7557,6 +7557,7 @@ simplify_context::simplify_subreg (machine_mode 
outermode, rtx op,
 
   if (VECTOR_MODE_P (outermode)
  && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
+ && !paradoxical_subreg_p (outermode, innermode)
  && vec_duplicate_p (op, &elt))
return gen_vec_duplicate (outermode, elt);
 
diff --git a/gcc/testsuite/gcc.dg/torture/pr110206.c 
b/gcc/testsuite/gcc.dg/torture/pr110206.c
new file mode 100644
index 000..3a4f221ef47
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr110206.c
@@ -0,0 +1,30 @@
+/* PR target/110206 */
+/* { dg-do run { target x86_64-*-* i?86-*-* } } */
+
+typedef unsigned char __attribute__((__vector_size__ (4))) U;
+typedef unsigned char __attribute__((__vector_size__ (8))) V;
+typedef unsigned short u16;
+
+V g;
+
+void
+__attribute__((noinline))
+foo (U u, u16 c, V *r)
+{
+  if (!c)
+__builtin_abort ();
+  V x = __builtin_shufflevector (u, (204 >> u), 7, 0, 5, 1, 3, 5, 0, 2);
+  V y = __builtin_shufflevector (g, (V) { }, 7, 6, 6, 7, 2, 6, 3, 5);
+  V z = __builtin_shufflevector (y, 204 * x, 3, 9, 8, 1, 4, 6, 14, 5);
+  *r = z;
+}
+
+int
+main (void)
+{
+  V r;
+  foo ((U){4}, 5, &r);
+  if (r[6] != 0x30)
+__builtin_abort();
+  return 0;
+}


[committed] cprop: Change return type of predicate functions from int to bool

2023-07-08 Thread Uros Bizjak via Gcc-patches
Also change some internal variables from int to bool.

gcc/ChangeLog:

* cprop.cc (reg_available_p): Change return type from int to bool.
(reg_not_set_p): Ditto.
(try_replace_reg): Ditto.  Change "success" variable to bool.
(cprop_jump): Change return type from int to bool
and adjust function body accordingly.
(constprop_register): Ditto.
(cprop_insn): Ditto.  Change "changed" variable to bool.
(local_cprop_pass): Change return type from int to bool
and adjust function body accordingly.
(bypass_block): Ditto.  Change "change", "may_be_loop_header"
and "removed_p" variables to bool.
(bypass_conditional_jumps): Change return type from int to bool
and adjust function body accordingly.  Change "changed"
variable to bool.
(one_cprop_pass): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/cprop.cc b/gcc/cprop.cc
index 6ec0bda4a24..b7400c9a421 100644
--- a/gcc/cprop.cc
+++ b/gcc/cprop.cc
@@ -142,10 +142,10 @@ cprop_alloc (unsigned long size)
   return obstack_alloc (&cprop_obstack, size);
 }
 
-/* Return nonzero if register X is unchanged from INSN to the end
+/* Return true if register X is unchanged from INSN to the end
of INSN's basic block.  */
 
-static int
+static bool
 reg_available_p (const_rtx x, const rtx_insn *insn ATTRIBUTE_UNUSED)
 {
   return ! REGNO_REG_SET_P (reg_set_bitmap, REGNO (x));
@@ -517,10 +517,10 @@ reset_opr_set_tables (void)
   CLEAR_REG_SET (reg_set_bitmap);
 }
 
-/* Return nonzero if the register X has not been set yet [since the
+/* Return true if the register X has not been set yet [since the
start of the basic block containing INSN].  */
 
-static int
+static bool
 reg_not_set_p (const_rtx x, const rtx_insn *insn ATTRIBUTE_UNUSED)
 {
   return ! REGNO_REG_SET_P (reg_set_bitmap, REGNO (x));
@@ -722,14 +722,14 @@ find_used_regs (rtx *xptr, void *data ATTRIBUTE_UNUSED)
 }
 
 /* Try to replace all uses of FROM in INSN with TO.
-   Return nonzero if successful.  */
+   Return true if successful.  */
 
-static int
+static bool
 try_replace_reg (rtx from, rtx to, rtx_insn *insn)
 {
   rtx note = find_reg_equal_equiv_note (insn);
   rtx src = 0;
-  int success = 0;
+  bool success = false;
   rtx set = single_set (insn);
 
   bool check_rtx_costs = true;
@@ -765,7 +765,7 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
 
 
   if (num_changes_pending () && apply_change_group ())
-success = 1;
+success = true;
 
   /* Try to simplify SET_SRC if we have substituted a constant.  */
   if (success && set && CONSTANT_P (to))
@@ -790,7 +790,7 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
 
   if (!rtx_equal_p (src, SET_SRC (set))
   && validate_change (insn, &SET_SRC (set), src, 0))
-   success = 1;
+   success = true;
 
   /* If we've failed perform the replacement, have a single SET to
 a REG destination and don't yet have a note, add a REG_EQUAL note
@@ -808,7 +808,7 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
 
   if (!rtx_equal_p (dest, SET_DEST (set))
    && validate_change (insn, &SET_DEST (set), dest, 0))
-success = 1;
+   success = true;
 }
 
   /* REG_EQUAL may get simplified into register.
@@ -889,10 +889,10 @@ find_avail_set (int regno, rtx_insn *insn, struct 
cprop_expr *set_ret[2])
JUMP_INSNS.  JUMP must be a conditional jump.  If SETCC is non-NULL
it is the instruction that immediately precedes JUMP, and must be a
single SET of a register.  FROM is what we will try to replace,
-   SRC is the constant we will try to substitute for it.  Return nonzero
+   SRC is the constant we will try to substitute for it.  Return true
if a change was made.  */
 
-static int
+static bool
 cprop_jump (basic_block bb, rtx_insn *setcc, rtx_insn *jump, rtx from, rtx src)
 {
   rtx new_rtx, set_src, note_src;
@@ -931,7 +931,7 @@ cprop_jump (basic_block bb, rtx_insn *setcc, rtx_insn 
*jump, rtx from, rtx src)
 
   /* If no simplification can be made, then try the next register.  */
   if (rtx_equal_p (new_rtx, SET_SRC (set)))
-return 0;
+return false;
 
   /* If this is now a no-op delete it, otherwise this must be a valid insn.  */
   if (new_rtx == pc_rtx)
@@ -941,7 +941,7 @@ cprop_jump (basic_block bb, rtx_insn *setcc, rtx_insn 
*jump, rtx from, rtx src)
   /* Ensure the value computed inside the jump insn to be equivalent
  to one computed by setcc.  */
   if (setcc && modified_in_p (new_rtx, setcc))
-   return 0;
+   return false;
   if (! validate_unshare_change (jump, &SET_SRC (set), new_rtx, 0))
{
  /* When (some) constants are not valid in a comparison, and there
@@ -955,7 +955,7 @@ cprop_jump (basic_block bb, rtx_insn *setcc, rtx_insn 
*jump, rtx from, rtx src)
 
  if (!rtx_equal_p (new_rtx, note_src))
set_unique_reg_note (jump, REG_EQUAL, copy_rtx (new_rtx));
- return 0;
+ return false;
}
 
   

[committed] gcse: Change return type of predicate functions from int to bool

2023-07-08 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* gcse.cc (expr_equiv_p): Change return type from int to bool.
(oprs_unchanged_p): Change return type from int to bool
and adjust function body accordingly.
(oprs_anticipatable_p): Ditto.
(oprs_available_p): Ditto.
(insert_expr_in_table): Change "antic_p" and "avail_p"
arguments to bool.  Change "found" variable to bool.
(load_killed_in_block_p): Change return type from int to bool and
adjust function body accordingly.  Change "avail_p" argument to bool.
(pre_expr_reaches_here_p): Change return type from int to bool
and adjust function body accordingly.
(pre_delete): Ditto.  Change "changed" variable to bool.
(pre_gcse): Change return type from int to bool
and adjust function body accordingly.  Change "did_insert" and
"changed" variables to bool.
(one_pre_gcse_pass): Change return type from int to bool
and adjust function body accordingly.  Change "changed" variable
to bool.
(should_hoist_expr_to_dom): Change return type from int to bool
and adjust function body accordingly.  Change
"visited_allocated_locally" variable to bool.
(hoist_code): Change return type from int to bool and adjust
function body accordingly.  Change "changed" variable to bool.
(one_code_hoisting_pass): Ditto.
(pre_edge_insert): Change return type from int to bool and adjust
function body accordingly.  Change "did_insert" variable to bool.
(pre_expr_reaches_here_p_work): Change return type from int to bool
and adjust function body accordingly.
(simple_mem): Ditto.
(want_to_gcse_p): Change return type from int to bool
and adjust function body accordingly.
(can_assign_to_reg_without_clobbers_p): Update function body
for bool return type.
(hash_scan_set): Change "antic_p" and "avail_p" variables to bool.
(pre_insert_copies): Change "added_copy" variable to bool.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index 72832736572..8413c9a18f3 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -371,7 +371,7 @@ pre_ldst_expr_hasher::hash (const ls_expr *x)
 hash_rtx (x->pattern, GET_MODE (x->pattern), &do_not_record_p, NULL, 
false);
 }
 
-static int expr_equiv_p (const_rtx, const_rtx);
+static bool expr_equiv_p (const_rtx, const_rtx);
 
 inline bool
 pre_ldst_expr_hasher::equal (const ls_expr *ptr1,
@@ -454,10 +454,10 @@ static void hash_scan_insn (rtx_insn *, struct 
gcse_hash_table_d *);
 static void hash_scan_set (rtx, rtx_insn *, struct gcse_hash_table_d *);
 static void hash_scan_clobber (rtx, rtx_insn *, struct gcse_hash_table_d *);
 static void hash_scan_call (rtx, rtx_insn *, struct gcse_hash_table_d *);
-static int oprs_unchanged_p (const_rtx, const rtx_insn *, int);
-static int oprs_anticipatable_p (const_rtx, const rtx_insn *);
-static int oprs_available_p (const_rtx, const rtx_insn *);
-static void insert_expr_in_table (rtx, machine_mode, rtx_insn *, int, int,
+static bool oprs_unchanged_p (const_rtx, const rtx_insn *, bool);
+static bool oprs_anticipatable_p (const_rtx, const rtx_insn *);
+static bool oprs_available_p (const_rtx, const rtx_insn *);
+static void insert_expr_in_table (rtx, machine_mode, rtx_insn *, bool, bool,
  HOST_WIDE_INT, struct gcse_hash_table_d *);
 static unsigned int hash_expr (const_rtx, machine_mode, int *, int);
 static void record_last_reg_set_info (rtx_insn *, int);
@@ -471,42 +471,42 @@ static void dump_hash_table (FILE *, const char *, struct 
gcse_hash_table_d *);
 static void compute_local_properties (sbitmap *, sbitmap *, sbitmap *,
  struct gcse_hash_table_d *);
 static void mems_conflict_for_gcse_p (rtx, const_rtx, void *);
-static int load_killed_in_block_p (const_basic_block, int, const_rtx, int);
+static bool load_killed_in_block_p (const_basic_block, int, const_rtx, bool);
 static void alloc_pre_mem (int, int);
 static void free_pre_mem (void);
 static struct edge_list *compute_pre_data (void);
-static int pre_expr_reaches_here_p (basic_block, struct gcse_expr *,
-   basic_block);
+static bool pre_expr_reaches_here_p (basic_block, struct gcse_expr *,
+basic_block);
 static void insert_insn_end_basic_block (struct gcse_expr *, basic_block);
 static void pre_insert_copy_insn (struct gcse_expr *, rtx_insn *);
 static void pre_insert_copies (void);
-static int pre_delete (void);
-static int pre_gcse (struct edge_list *);
-static int one_pre_gcse_pass (void);
+static bool pre_delete (void);
+static bool pre_gcse (struct edge_list *);
+static bool one_pre_gcse_pass (void);
 static void add_label_notes (rtx, rtx_insn *);
 static void alloc_code_hoist_mem (int, int);
 static void free_code_hoist_mem (void);
 static void compute_code_hoist_vbeinout (void);
 static void 

Re: [PATCH V2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-07 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 7, 2023 at 7:31 AM liuhongt  wrote:
>
> > Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX
> > and the other emitting UNSPEC_IEEE_MIN.
> Splitted.
>
> > The test involves blendv instruction, which is SSE4.1, so it is
> > pointless to test it without -msse4.1. Please add -msse4.1 instead of
> > -march=x86_64 and use sse4_runtime target selector, as is the case
> > with gcc.target/i386/pr90358.c.
> Changed.
>
> > Please also use -msse4.1 instead of -march here. With -mfpmath=sse,
> > the test is valid also for 32bit targets, you should use -msseregparm
> > additional options for ia32 (please see gcc.target/i386/pr43546.c
> > testcase) in the same way as -mregparm to pass SSE arguments in
> > registers.
> The 32-bit target still fails to do condition elimination for DFmode,
> due to the code below in rtx_cost:
>
>   /* A size N times larger than UNITS_PER_WORD likely needs N times as
>      many insns, taking N times as long.  */
>   factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>
> It looks like a separate issue for DFmode operations on 32-bit targets.
>
> I've enabled 32-bit for the testcase, but currently only scan for
> minss/maxss.
>
> Here's updated patch.
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> We have ix86_expand_sse_fp_minmax to detect min/max semantics, but
> it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false; for
> the testcase in the PR, there's an extra move from cmp_op0 to if_true,
> and it fails ix86_expand_sse_fp_minmax.
>
> This patch adds a pre_reload splitter to detect the min/max pattern.
>
> Operand order in MINSS matters for signed zeros and NaNs, since the
> instruction always returns the second operand when either operand is
> a NaN or both operands are zero.
>
> gcc/ChangeLog:
>
> PR target/110170
> * config/i386/i386.md (*ieee_max<mode>3_1): New pre_reload
> splitter to detect fp max pattern.
> (*ieee_min<mode>3_1): Ditto, but for fp min pattern.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/pr110170.C: New test.
> * gcc.target/i386/pr110170.c: New test.

OK with a testcase fix below.

Uros.

> ---
>  gcc/config/i386/i386.md  | 43 +
>  gcc/testsuite/g++.target/i386/pr110170.C | 78 
>  gcc/testsuite/gcc.target/i386/pr110170.c | 21 +++
>  3 files changed, 142 insertions(+)
>  create mode 100644 gcc/testsuite/g++.target/i386/pr110170.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170.c
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index a82cc353cfd..6f415f899ae 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -23163,6 +23163,49 @@ (define_insn "*ieee_s<ieee_maxmin><mode>3"
> (set_attr "type" "sseadd")
> (set_attr "mode" "<MODE>")])
>
> +;; Operands order in min/max instruction matters for signed zero and NANs.
> +(define_insn_and_split "*ieee_max<mode>3_1"
> +  [(set (match_operand:MODEF 0 "register_operand")
> +   (unspec:MODEF
> + [(match_operand:MODEF 1 "register_operand")
> +  (match_operand:MODEF 2 "register_operand")
> +  (lt:MODEF
> +(match_operand:MODEF 3 "register_operand")
> +(match_operand:MODEF 4 "register_operand"))]
> + UNSPEC_BLENDV))]
> +  "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH
> +  && (rtx_equal_p (operands[1], operands[3])
> +  && rtx_equal_p (operands[2], operands[4]))
> +  && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +   (unspec:MODEF
> + [(match_dup 2)
> +  (match_dup 1)]
> +UNSPEC_IEEE_MAX))])
> +
> +(define_insn_and_split "*ieee_min<mode>3_1"
> +  [(set (match_operand:MODEF 0 "register_operand")
> +   (unspec:MODEF
> + [(match_operand:MODEF 1 "register_operand")
> +  (match_operand:MODEF 2 "register_operand")
> +  (lt:MODEF
> +(match_operand:MODEF 3 "register_operand")
> +(match_operand:MODEF 4 "register_operand"))]
> + UNSPEC_BLENDV))]
> +  "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH
> +  && (rtx_equal_p (operands[1], operands[4])
> +  && rtx_equal_p (operands[2], operands[3]))
> +  && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +   (unspec:MODEF
> + [(match_dup 2)
> +  (match_dup 1)]
> +UNSPEC_IEEE_MIN))])
> +
>  ;; Make two stack loads independent:
>  ;;   fld aa  fld aa
>  ;;   fld %st(0) ->   fld bb
> diff --git a/gcc/testsuite/g++.target/i386/pr110170.C 
> b/gcc/testsuite/g++.target/i386/pr110170.C
> new file mode 100644
> index 000..5d6842270d0
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/i386/pr110170.C
> @@ -0,0 +1,78 @@
> +/* { dg-do run } */
> +/* { dg-options " -O2 -msse4.1 -mfpmath=sse -std=gnu++20" } */

Please either change the first line to:

{ dg-do run { target sse4_runtime } }

or add

{ dg-require-effective-target sse4_runtime }

to the runtime test.

> 

Re: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).

2023-07-06 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle  wrote:
>
> > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle 
> > wrote:
> > >
> > >
> > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > > result in surprising code.  Consider the example below (from PR 43644):
> > >
> > > __uint128 foo(__uint128 x, unsigned long long y) {
> > >   return x+y;
> > > }
> > >
> > > which currently results in 6 consecutive movq instructions:
> > >
> > > foo:    movq    %rsi, %rax
> > >         movq    %rdi, %rsi
> > >         movq    %rdx, %rcx
> > >         movq    %rax, %rdi
> > >         movq    %rsi, %rax
> > >         movq    %rdi, %rdx
> > >         addq    %rcx, %rax
> > >         adcq    $0, %rdx
> > >         ret
> > >
> > > The underlying issue is that during RTL expansion, we generate the
> > > following initial RTL for the x argument:
> > >
> > > (insn 4 3 5 2 (set (reg:TI 85)
> > > (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > >  (nil))
> > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > (reg:DI 87)) "pr43644-2.c":5:1 -1
> > >  (nil))
> > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > (reg:TI 85)) "pr43644-2.c":5:1 -1
> > >  (nil))
> > >
> > > which by combine/reload becomes
> > >
> > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > >  (nil))
> > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >  (expr_list:REG_DEAD (reg:DI 93)
> > > (nil)))
> > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >  (expr_list:REG_DEAD (reg:DI 94)
> > > (nil)))
> > >
> > > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > > combine and register allocation.
> > >
> > > The improvement proposed here is to avoid these problematic SUBREGs by
> > > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > > sets a TImode destination from a paradoxical SUBREG, to assign the
> > > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > > added in July 2022), and for insn 5, which sets the highpart of a
> > > TImode register we can use the *insvti_highpart_1 instruction (that
> > > was added in May 2023, after being approved for stage1 in January).
> > > This allows combine to work its magic, merging these insns into a
> > > *concatditi3 and from there into other optimized forms.
> >
> > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in 
> > the
> > hope that combine is smart enough to also combine these two instructions? 
> > IMO,
> > faking insert to lowpart of the register with zero_extend is a bit 
> > overkill, and could
> > hinder some other optimization opportunities (as perhaps hinted by failing
> > testcases).
>
> The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
> and of informing the RTL passes that the highpart is dead.  Notice in the 
> original
> RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs 
> pass, clearing
> the entirety of the TImode register (like a clobber), and preventing TI:84 
> from
> occupying the same registers as DI:93 and DI:94.
>
> If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
> then our hands would be tied, but a paradoxical SUBREG allows us the freedom
> to set the highpart bits to a defined value (we could have used sign 
> extension if
> that was cheap), which then simplifies data-flow and liveness analysis.  
> Allowing the
> highpart to contain undefined or untouched data is exactly the sort of 
> security
> side-channel leakage that the clear regs pass attempts to address.
>
> I can investigate an *insvti_lowpart_1, but I don't think it will help with 
> this
> issue, i.e. it won't prevent init-regs from clobbering/clearing TImode 
> parameters.

Thanks for the explanation, the patch is OK then.

Thanks,
Uros.

>
> > > So for the test case above, we now generate only a single movq:
> > >
> > > foo:    movq    %rdx, %rax
> > >         xorl    %edx, %edx
> > >         addq    %rdi, %rax
> > >         adcq    %rsi, %rdx
> > >         ret
> > >
> > > But there is a little bad news.  This patch causes two (minor) missed
> > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > > no longer generating adcq $0, but instead using xorl.  For the other
> > > FAIL, register allocation now has more freedom and is (arbitrarily)
> > > choosing a register assignment that doesn't match what the test is
> > > expecting.  These issues are easier to explain and fix once this patch
> > > is in the tree.
> > >
> > > The good news is that this approach fixes a number of long standing
> > > issues, that need to checked in bugzilla, including PR target/110533
> > > which was just 

Re: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).

2023-07-06 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle  wrote:
>
>
> Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> result in surprising code.  Consider the example below (from PR 43644):
>
> __uint128 foo(__uint128 x, unsigned long long y) {
>   return x+y;
> }
>
> which currently results in 6 consecutive movq instructions:
>
> foo:    movq    %rsi, %rax
>         movq    %rdi, %rsi
>         movq    %rdx, %rcx
>         movq    %rax, %rdi
>         movq    %rsi, %rax
>         movq    %rdi, %rdx
>         addq    %rcx, %rax
>         adcq    $0, %rdx
>         ret
>
> The underlying issue is that during RTL expansion, we generate the
> following initial RTL for the x argument:
>
> (insn 4 3 5 2 (set (reg:TI 85)
> (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
>  (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> (reg:DI 87)) "pr43644-2.c":5:1 -1
>  (nil))
> (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> (reg:TI 85)) "pr43644-2.c":5:1 -1
>  (nil))
>
> which by combine/reload becomes
>
> (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> (const_int 0 [0])) "pr43644-2.c":5:1 -1
>  (nil))
> (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
>  (expr_list:REG_DEAD (reg:DI 93)
> (nil)))
> (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
>  (expr_list:REG_DEAD (reg:DI 94)
> (nil)))
>
> where the heavy use of SUBREG SET_DESTs creates challenges for both
> combine and register allocation.
>
> The improvement proposed here is to avoid these problematic SUBREGs
> by adding (two) special cases to ix86_expand_move.  For insn 4, which
> sets a TImode destination from a paradoxical SUBREG, to assign the
> lowpart, we can use an explicit zero extension (zero_extendditi2 was
> added in July 2022), and for insn 5, which sets the highpart of a
> TImode register we can use the *insvti_highpart_1 instruction (that
> was added in May 2023, after being approved for stage1 in January).
> This allows combine to work its magic, merging these insns into a
> *concatditi3 and from there into other optimized forms.

How about we introduce *insvti_lowpart_1, similar to
*insvti_highpart_1, in the hope that combine is smart enough to also
combine these two instructions? IMO, faking insert to lowpart of the
register with zero_extend is a bit overkill, and could hinder some
other optimization opportunities (as perhaps hinted by failing
testcases).
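
In C terms, the suggested lowpart insert would compute roughly the
following -- a sketch with an illustrative name, mirroring the 64-bit
test elsewhere in this thread:

  __int128 insv_lowpart (__int128 r, unsigned long long x)
  {
    /* Keep bits 64..127 of r, replace bits 0..63 with x.  */
    return (r & ~(__int128) ~0ull) | x;
  }

so combine could merge the low- and highpart inserts directly, without
going through a zero_extend of the lowpart.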

Uros.

> So for the test case above, we now generate only a single movq:
>
> foo:    movq    %rdx, %rax
>         xorl    %edx, %edx
>         addq    %rdi, %rax
>         adcq    %rsi, %rdx
>         ret
>
> But there is a little bad news.  This patch causes two (minor) missed
> optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> no longer generating adcq $0, but instead using xorl.  For the other
> FAIL, register allocation now has more freedom and is (arbitrarily)
> choosing a register assignment that doesn't match what the test is
> expecting.  These issues are easier to explain and fix once this patch
> is in the tree.
>
> The good news is that this approach fixes a number of long standing
> issues, that need to checked in bugzilla, including PR target/110533
> which was just opened/reported earlier this week.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with only the two new FAILs described above.  Ok for mainline?
>
> 2023-07-06  Roger Sayle  
>
> gcc/ChangeLog
> PR target/43644
> PR target/110533
> * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> TImode destinations from paradoxical SUBREGs (setting the lowpart)
> into explicit zero extensions.  Use *insvti_highpart_1 instruction
> to set the highpart of a TImode destination.
>
> gcc/testsuite/ChangeLog
> PR target/43644
> PR target/110533
> * gcc.target/i386/pr110533.c: New test case.
> * gcc.target/i386/pr43644-2.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] i386: Update document for inlining rules

2023-07-06 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 8:39 AM Hongyu Wang  wrote:
>
> Hi,
>
> This is a follow-up patch for
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623525.html
> that updates document about x86 inlining rules.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * doc/extend.texi: Move x86 inlining rule to a new subsubsection
> and add a description of the inlining of functions with arch and tune
> attributes.

LGTM.

Thanks,
Uros.

> ---
>  gcc/doc/extend.texi | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index d1b018ee6d6..d701b4d1d41 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -7243,11 +7243,6 @@ Prefer 256-bit vector width for instructions.
>  Prefer 512-bit vector width for instructions.
>  @end table
>
> -On the x86, the inliner does not inline a
> -function that has different target options than the caller, unless the
> -callee has a subset of the target options of the caller.  For example
> -a function declared with @code{target("sse3")} can inline a function
> -with @code{target("sse2")}, since @code{-msse3} implies @code{-msse2}.
>  @end table
>
>  @cindex @code{indirect_branch} function attribute, x86
> @@ -7361,6 +7356,20 @@ counterpart to option 
> @option{-mno-direct-extern-access}.
>
>  @end table
>
> +@subsubsection Inlining rules
> +On the x86, the inliner does not inline a
> +function that has different target options than the caller, unless the
> +callee has a subset of the target options of the caller.  For example
> +a function declared with @code{target("sse3")} can inline a function
> +with @code{target("sse2")}, since @code{-msse3} implies @code{-msse2}.
> +
> +Besides the basic rule, when a function specifies
> +@code{target("arch=@var{ARCH}")} or @code{target("tune=@var{TUNE}")}
> +attribute, the inlining rule will be different. It allows inlining of
> +a function with default @option{-march=x86-64} and
> +@option{-mtune=generic} specified, or a function that has a subset
> +of ISA features and marked with always_inline.
> +
>  @node Xstormy16 Function Attributes
>  @subsection Xstormy16 Function Attributes
>
> --
> 2.31.1
>
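
As a concrete sketch of the subset rule described above (illustrative,
not part of the patch): target("sse3") implies -msse2, so the callee's
options below are a subset of the caller's and the inliner may inline it.

  __attribute__((target("sse2")))
  static inline int callee (int x) { return x + 1; }

  __attribute__((target("sse3")))
  int caller (int x) { return callee (x); }  /* callee can be inlined */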


Re: [PATCH 1/2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-06 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 3:20 AM liuhongt  wrote:
>
> We have ix86_expand_sse_fp_minmax to detect min/max semantics, but
> it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false; for
> the testcase in the PR, there's an extra move from cmp_op0 to if_true,
> and it fails ix86_expand_sse_fp_minmax.
>
> This patch adds a pre_reload splitter to detect the min/max pattern.
>
> Operand order in MINSS matters for signed zeros and NaNs, since the
> instruction always returns the second operand when either operand is
> a NaN or both operands are zero.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/110170
> * config/i386/i386.md (*ieee_minmax<mode>3_1): New pre_reload
> splitter to detect fp min/max pattern.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/pr110170.C: New test.
> * gcc.target/i386/pr110170.c: New test.
> ---
>  gcc/config/i386/i386.md  | 30 +
>  gcc/testsuite/g++.target/i386/pr110170.C | 78 
>  gcc/testsuite/gcc.target/i386/pr110170.c | 18 ++
>  3 files changed, 126 insertions(+)
>  create mode 100644 gcc/testsuite/g++.target/i386/pr110170.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170.c
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index e6ebc461e52..353bb21993d 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -22483,6 +22483,36 @@ (define_insn "*ieee_s<ieee_maxmin><mode>3"
> (set_attr "type" "sseadd")
> (set_attr "mode" "<MODE>")])
>
> +;; Operands order in min/max instruction matters for signed zero and NANs.
> +(define_insn_and_split "*ieee_minmax<mode>3_1"
> +  [(set (match_operand:MODEF 0 "register_operand")
> +   (unspec:MODEF
> + [(match_operand:MODEF 1 "register_operand")
> +  (match_operand:MODEF 2 "register_operand")
> +  (lt:MODEF
> +(match_operand:MODEF 3 "register_operand")
> +(match_operand:MODEF 4 "register_operand"))]
> + UNSPEC_BLENDV))]
> +  "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH
> +  && ((rtx_equal_p (operands[1], operands[3])
> +   && rtx_equal_p (operands[2], operands[4]))
> +  || (rtx_equal_p (operands[1], operands[4])
> + && rtx_equal_p (operands[2], operands[3])))
> +  && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  int u = (rtx_equal_p (operands[1], operands[3])
> +  && rtx_equal_p (operands[2], operands[4]))
> +  ? UNSPEC_IEEE_MAX : UNSPEC_IEEE_MIN;
> +  emit_move_insn (operands[0],
> + gen_rtx_UNSPEC (<MODE>mode,
> + gen_rtvec (2, operands[2], operands[1]),
> + u));
> +  DONE;
> +})

Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX
and the other emitting UNSPEC_IEEE_MIN.
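
For reference, the selection rule that makes the operand order matter:
MINSS/MAXSS return the second source operand whenever either input is a
NaN or when both inputs are zero.  In C this is roughly the following
sketch (illustrative only):

  double minsd_sketch (double a, double b)
  {
    /* a < b is false for NaN inputs and for 0.0 vs. -0.0, so b is
       returned in those cases, matching the hardware rule.  */
    return a < b ? a : b;
  }

so each split must keep track of which operand ends up as the second
operand of the unspec.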

> +
>  ;; Make two stack loads independent:
>  ;;   fld aa  fld aa
>  ;;   fld %st(0) ->   fld bb
> diff --git a/gcc/testsuite/g++.target/i386/pr110170.C 
> b/gcc/testsuite/g++.target/i386/pr110170.C
> new file mode 100644
> index 000..1e9a781ca74
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/i386/pr110170.C
> @@ -0,0 +1,78 @@
> +/* { dg-do run } */
> +/* { dg-options " -O2 -march=x86-64 -mfpmath=sse -std=gnu++20" } */

The test involves blendv instruction, which is SSE4.1, so it is
pointless to test it without -msse4.1. Please add -msse4.1 instead of
-march=x86_64 and use sse4_runtime target selector, as is the case
with gcc.target/i386/pr90358.c.

> +#include <math.h>
> +
> +void
> +__attribute__((noinline))
> +__cond_swap(double* __x, double* __y) {
> +  bool __r = (*__x < *__y);
> +  auto __tmp = __r ? *__x : *__y;
> +  *__y = __r ? *__y : *__x;
> +  *__x = __tmp;
> +}
> +
> +auto test1() {
> +double nan = -0.0;
> +double x = 0.0;
> +__cond_swap(&nan, &x);
> +return x == -0.0 && nan == 0.0;
> +}
> +
> +auto test1r() {
> +double nan = NAN;
> +double x = 1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 0 && nan == 1.0;
> +}
> +
> +auto test2() {
> +double nan = NAN;
> +double x = -1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 0 && nan == -1.0;
> +}
> +
> +auto test2r() {
> +double nan = NAN;
> +double x = -1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 0 && nan == -1.0;
> +}
> +
> +auto test3() {
> +double nan = -NAN;
> +double x = 1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 1 && nan == 1.0;
> +}
> +
> +auto test3r() {
> +double nan = -NAN;
> +double x = 1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 1 && nan == 1.0;
> +}
> +
> +auto test4() {
> +double nan = -NAN;
> +double x = -1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 1 && nan == -1.0;
> +}
> +
> +auto test4r() {
> +double nan = -NAN;
> +double x = -1.0;
> +__cond_swap(&nan, &x);
> +return isnan(x) && signbit(x) == 1 

Re: [PATCH 2/2] Adjust rtx_cost for DF/SFmode AND/IOR/XOR/ANDN operations.

2023-07-05 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 3:20 AM liuhongt  wrote:
>
> They should have the same cost as the vector modes, since both generate
> pand/pandn/pxor/por instructions.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for
> DF/SFmode AND/IOR/XOR/ANDN operations.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110170-2.c: New test.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.cc|  6 --
>  gcc/testsuite/gcc.target/i386/pr110170-2.c | 16 
>  2 files changed, 20 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..fe31acd7646 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -21153,7 +21153,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> outer_code_i, int opno,
>
>  case IOR:
>  case XOR:
> -  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
> +  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> + || SSE_FLOAT_MODE_P (mode))
> *total = ix86_vec_cost (mode, cost->sse_op);
>else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
> *total = cost->add * 2;
> @@ -21167,7 +21168,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> outer_code_i, int opno,
>   *total = cost->lea;
>   return true;
> }
> -  else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
> +  else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> +  || SSE_FLOAT_MODE_P (mode))
> {
>   /* pandn is a single instruction.  */
>   if (GET_CODE (XEXP (x, 0)) == NOT)
> diff --git a/gcc/testsuite/gcc.target/i386/pr110170-2.c 
> b/gcc/testsuite/gcc.target/i386/pr110170-2.c
> new file mode 100644
> index 000..d43e322fc49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110170-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-msse2 -O2 -mfpmath=sse" } */
> +/* { dg-final { scan-assembler-not "comi" } }  */
> +
> +double
> +foo (double* a, double* b, double c, double d)
> +{
> +  return *a < *b ? c : d;
> +}
> +
> +float
> +foo1 (float* a, float* b, float c, float d)
> +{
> +  return *a < *b ? c : d;
> +}
> +
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [PATCH] Disparage slightly for the alternative which move DFmode between SSE_REGS and GENERAL_REGS.

2023-07-05 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 6, 2023 at 3:14 AM liuhongt  wrote:
>
> For testcase
>
> void __cond_swap(double* __x, double* __y) {
>   bool __r = (*__x < *__y);
>   auto __tmp = __r ? *__x : *__y;
>   *__y = __r ? *__y : *__x;
>   *__x = __tmp;
> }
>
> GCC-14 with -O2 and -march=x86-64 options generates the following code:
>
> __cond_swap(double*, double*):
>         movsd   xmm1, QWORD PTR [rdi]
>         movsd   xmm0, QWORD PTR [rsi]
>         comisd  xmm0, xmm1
>         jbe     .L2
>         movq    rax, xmm1
>         movapd  xmm1, xmm0
>         movq    xmm0, rax
> .L2:
>         movsd   QWORD PTR [rsi], xmm1
>         movsd   QWORD PTR [rdi], xmm0
>         ret
>
> rax is used to save and restore DFmode value. In RA both GENERAL_REGS
> and SSE_REGS cost zero since we didn't disparage the
> alternative in movdf_internal pattern, according to register
> allocation order, GENERAL_REGS is allocated. The patch add ? for
> alternative (r,v) and (v,r) just like we did for movsf/hf/bf_internal
> pattern, after that we get optimal RA.
>
> __cond_swap:
> .LFB0:
>         .cfi_startproc
>         movsd   (%rdi), %xmm1
>         movsd   (%rsi), %xmm0
>         comisd  %xmm1, %xmm0
>         jbe     .L2
>         movapd  %xmm1, %xmm2
>         movapd  %xmm0, %xmm1
>         movapd  %xmm2, %xmm0
> .L2:
>         movsd   %xmm1, (%rsi)
>         movsd   %xmm0, (%rdi)
>         ret
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> Ok for trunk?
>
>
> gcc/ChangeLog:
>
> PR target/110170
> * config/i386/i386.md (movdf_internal): Disparage slightly for
> 2 alternatives (r,v) and (v,r) by adding constraint modifier
> '?'.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110170-3.c: New test.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.md|  4 ++--
>  gcc/testsuite/gcc.target/i386/pr110170-3.c | 11 +++
>  2 files changed, 13 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170-3.c
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index a82cc353cfd..e47ced1bb70 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -3915,9 +3915,9 @@ (define_split
>  ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 
> 7.
>  (define_insn "*movdf_internal"
>[(set (match_operand:DF 0 "nonimmediate_operand"
> -"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r 
> ,v,r  ,o ,r  ,m")
> +"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m 
> ,?r,?v,r  ,o ,r  ,m")
> (match_operand:DF 1 "general_operand"
> -"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r 
> ,roF,rF,rmF,rC"))]
> +"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x, v, 
> r,roF,rF,rmF,rC"))]
>"!(MEM_P (operands[0]) && MEM_P (operands[1]))
> && (lra_in_progress || reload_completed
> || !CONST_DOUBLE_P (operands[1])
> diff --git a/gcc/testsuite/gcc.target/i386/pr110170-3.c 
> b/gcc/testsuite/gcc.target/i386/pr110170-3.c
> new file mode 100644
> index 000..70daa89e9aa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110170-3.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -fno-if-conversion -fno-if-conversion2" } */
> +/* { dg-final { scan-assembler-not {(?n)movq.*r} } } */
> +
> +void __cond_swap(double* __x, double* __y) {
> +  _Bool __r = (*__x < *__y);
> +  double __tmp = __r ? *__x : *__y;
> +  *__y = __r ? *__y : *__x;
> +  *__x = __tmp;
> +}
> +
> --
> 2.39.1.388.g2fc9e9ca3c
>


[committed] sched: Change return type of predicate functions from int to bool

2023-07-05 Thread Uros Bizjak via Gcc-patches
Also change some internal variables to bool.

gcc/ChangeLog:

* sched-int.h (struct haifa_sched_info): Change can_schedule_ready_p,
schedule_more_p and contributes_to_priority indirect function
type from int to bool.
(no_real_insns_p): Change return type from int to bool.
(contributes_to_priority): Ditto.
* haifa-sched.cc (no_real_insns_p): Change return type from
int to bool and adjust function body accordingly.
* modulo-sched.cc (try_scheduling_node_in_cycle): Change "success"
variable type from int to bool.
(ps_insn_advance_column): Change return type from int to bool.
(ps_has_conflicts): Ditto. Change "has_conflicts"
variable type from int to bool.
* sched-deps.cc (deps_may_trap_p): Change return type from int to bool.
(conditions_mutex_p): Ditto.
* sched-ebb.cc (schedule_more_p): Ditto.
(ebb_contributes_to_priority): Change return type from
int to bool and adjust function body accordingly.
* sched-rgn.cc (is_cfg_nonregular): Ditto.
(check_live_1): Ditto.
(is_pfree): Ditto.
(find_conditional_protection): Ditto.
(is_conditionally_protected): Ditto.
(is_prisky): Ditto.
(is_exception_free): Ditto.
(haifa_find_rgns): Change "unreachable" and "too_large_failure"
variables from int to bool.
(extend_rgns): Change "rescan" variable from int to bool.
(check_live): Change return type from
int to bool and adjust function body accordingly.
(can_schedule_ready_p): Ditto.
(schedule_more_p): Ditto.
(contributes_to_priority): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 2c881ede0ec..01a2a80d982 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -5033,18 +5033,18 @@ get_ebb_head_tail (basic_block beg, basic_block end,
   *tailp = end_tail;
 }
 
-/* Return nonzero if there are no real insns in the range [ HEAD, TAIL ].  */
+/* Return true if there are no real insns in the range [ HEAD, TAIL ].  */
 
-int
+bool
 no_real_insns_p (const rtx_insn *head, const rtx_insn *tail)
 {
   while (head != NEXT_INSN (tail))
 {
   if (!NOTE_P (head) && !LABEL_P (head))
-   return 0;
+   return false;
   head = NEXT_INSN (head);
 }
-  return 1;
+  return true;
 }
 
 /* Restore-other-notes: NOTE_LIST is the end of a chain of notes
diff --git a/gcc/modulo-sched.cc b/gcc/modulo-sched.cc
index 26752213d19..c5a392dd511 100644
--- a/gcc/modulo-sched.cc
+++ b/gcc/modulo-sched.cc
@@ -2119,7 +2119,7 @@ try_scheduling_node_in_cycle (partial_schedule_ptr ps,
  sbitmap must_follow)
 {
   ps_insn_ptr psi;
-  bool success = 0;
+  bool success = false;
 
   verify_partial_schedule (ps, sched_nodes);
   psi = ps_add_node_check_conflicts (ps, u, cycle, must_precede, must_follow);
@@ -2127,7 +2127,7 @@ try_scheduling_node_in_cycle (partial_schedule_ptr ps,
 {
   SCHED_TIME (u) = cycle;
   bitmap_set_bit (sched_nodes, u);
-  success = 1;
+  success = true;
   *num_splits = 0;
   if (dump_file)
fprintf (dump_file, "Scheduled w/o split in %d\n", cycle);
@@ -3067,7 +3067,7 @@ ps_insn_find_column (partial_schedule_ptr ps, ps_insn_ptr 
ps_i,
in failure and true in success.  Bit N is set in MUST_FOLLOW if
the node with cuid N must be come after the node pointed to by
PS_I when scheduled in the same cycle.  */
-static int
+static bool
 ps_insn_advance_column (partial_schedule_ptr ps, ps_insn_ptr ps_i,
sbitmap must_follow)
 {
@@ -3158,7 +3158,7 @@ advance_one_cycle (void)
 /* Checks if PS has resource conflicts according to DFA, starting from
FROM cycle to TO cycle; returns true if there are conflicts and false
if there are no conflicts.  Assumes DFA is being used.  */
-static int
+static bool
 ps_has_conflicts (partial_schedule_ptr ps, int from, int to)
 {
   int cycle;
@@ -3214,7 +3214,8 @@ ps_add_node_check_conflicts (partial_schedule_ptr ps, int 
n,
 int c, sbitmap must_precede,
 sbitmap must_follow)
 {
-  int i, first, amount, has_conflicts = 0;
+  int i, first, amount;
+  bool has_conflicts = false;
   ps_insn_ptr ps_i;
 
   /* First add the node to the PS, if this succeeds check for
diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index 998fe930804..c23218890f3 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -472,7 +472,7 @@ static int cache_size;
 /* True if we should mark added dependencies as a non-register deps.  */
 static bool mark_as_hard;
 
-static int deps_may_trap_p (const_rtx);
+static bool deps_may_trap_p (const_rtx);
 static void add_dependence_1 (rtx_insn *, rtx_insn *, enum reg_note);
 static void add_dependence_list (rtx_insn *, rtx_insn_list *, int,
 enum reg_note, bool);
@@ -488,7 +488,7 @@ static void sched_analyze_2 (class deps_desc *, rtx, 
rtx_insn *);
 static void 

Re: [PATCH V2] i386: Inline function with default arch/tune to caller

2023-07-04 Thread Uros Bizjak via Gcc-patches
On Tue, Jul 4, 2023 at 10:32 AM Hongyu Wang  wrote:
>
> > In a follow-up patch, can you please document inlining rules involving
> > -march and -mtune to "x86 Function Attributes" section? Currently, the
> > inlining rules at the end of "target function attribute" section does
> > not even mention -march and -mtune. Maybe a subsubsection "Inlining
> > rules" should be added (like AArch64 has) to mention that only default
> > arch and tune are inlined by default (but inline can be forced with
> > always_inline for different mtune flags).
>
> The document has below at the end of 'target (OPTIONS)' section
>
> On the x86, the inliner does not inline a function that has
> different target options than the caller, unless the callee
> has a subset of the target options of the caller.  For example
> a function declared with 'target("sse3")' can inline a
> function with 'target("sse2")', since '-msse3' implies
> '-msse2'.
>
> Do we need to move this part to a new section and combine with -march and
> -mtune rule description to the new subsubsection?
>
> > Looking at the above, perhaps inlining of different arches can also be
> > forced with always_inline? This would allow developers some control of
> > inlining, and would not be surprising.
>
> If so, I'd like to add the always_inline change on arch to current
> patch and leave the
> document change alone in the next patch.

Yes, this is OK.

Thanks,
Uros.
>
> > On Tue, Jul 4, 2023 at 2:19 PM Uros Bizjak via Gcc-patches wrote:
> >
> > On Tue, Jul 4, 2023 at 5:12 AM Hongyu Wang  wrote:
> > >
> > > Hi,
> > >
> > > For functions with different target attributes, the current logic
> > > refuses to inline the callee when any arch or tune is mismatched.
> > > Relax the condition to allow a callee with the default arch/tune to
> > > be inlined.
> > >
> > > Bootstrapped/regtested on x86-64-linux-gnu{-m32,}.
> > >
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386.cc (ix86_can_inline_p): If callee has
> > > default arch=x86-64 and tune=generic, do not block the
> > > inlining to its caller.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/inline_target_clones.c: New test.
> >
> > OK.
> >
> > In a follow-up patch, can you please document inlining rules involving
> > -march and -mtune to "x86 Function Attributes" section? Currently, the
> > inlining rules at the end of "target function attribute" section does
> > not even mention -march and -mtune. Maybe a subsubsection "Inlining
> > rules" should be added (like AArch64 has) to mention that only default
> > arch and tune are inlined by default (but inline can be forced with
> > always_inline for different mtune flags).
> >
> > Looking at the above, perhaps inlining of different arches can also be
> > forced with always_inline? This would allow developers some control of
> > inlining, and would not be surprising.
> >
> > Thanks,
> > Uros.
> >
> > > ---
> > >  gcc/config/i386/i386.cc   | 22 +++--
> > >  .../gcc.target/i386/inline_target_clones.c| 24 +++
> > >  2 files changed, 39 insertions(+), 7 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/inline_target_clones.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 8989985700a..4741c9b5364 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -605,13 +605,6 @@ ix86_can_inline_p (tree caller, tree callee)
> > >!= (callee_opts->x_target_flags & 
> > > ~always_inline_safe_mask))
> > >  ret = false;
> > >
> > > -  /* See if arch, tune, etc. are the same.  */
> > > -  else if (caller_opts->arch != callee_opts->arch)
> > > -ret = false;
> > > -
> > > -  else if (!always_inline && caller_opts->tune != callee_opts->tune)
> > > -ret = false;
> > > -
> > >else if (caller_opts->x_ix86_fpmath != callee_opts->x_ix86_fpmath
> > >/* If the calle doesn't use FP expressions differences in
> > >   ix86_fpmath can be ignored.  We are called from FEs
> > > @@ -622,6 +615,21 @@ ix86_can_inline_p (tree caller, tree callee)
> > >|| ipa_fn_summaries->ge

Re: [PATCH V2] i386: Inline function with default arch/tune to caller

2023-07-04 Thread Uros Bizjak via Gcc-patches
On Tue, Jul 4, 2023 at 5:12 AM Hongyu Wang  wrote:
>
> Hi,
>
> For functions with different target attributes, the current logic refuses to
> inline the callee when any arch or tune is mismatched. Relax the
> condition to allow a callee with the default arch/tune to be inlined.
>
> Bootstrapped/regtested on x86-64-linux-gnu{-m32,}.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_can_inline_p): If callee has
> default arch=x86-64 and tune=generic, do not block the
> inlining to its caller.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/inline_target_clones.c: New test.

OK.

In a follow-up patch, can you please document inlining rules involving
-march and -mtune to "x86 Function Attributes" section? Currently, the
inlining rules at the end of "target function attribute" section does
not even mention -march and -mtune. Maybe a subsubsection "Inlining
rules" should be added (like AArch64 has) to mention that only default
arch and tune are inlined by default (but inline can be forced with
always_inline for different mtune flags).

Looking at the above, perhaps inlining of different arches can also be
forced with always_inline? This would allow developers some control of
inlining, and would not be surprising.

Thanks,
Uros.
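
As a purely illustrative aside (not part of the patch; the function
names are made up), the two rules discussed above look like this in
practice:

/* Subset rule: -msse3 implies -msse2, so the sse3 caller may inline
   the sse2 callee.  */
__attribute__((target("sse2")))
static inline int add_sse2 (int a, int b) { return a + b; }

__attribute__((target("sse3")))
int use_sse3 (int a, int b)
{
  return add_sse2 (a, b);
}

/* A bare -mtune mismatch blocks inlining unless the callee is marked
   always_inline.  */
__attribute__((always_inline, target("tune=skylake")))
static inline int twice_tuned (int x) { return 2 * x; }

int use_default_tune (int x)
{
  return twice_tuned (x);
}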

> ---
>  gcc/config/i386/i386.cc   | 22 +++--
>  .../gcc.target/i386/inline_target_clones.c| 24 +++
>  2 files changed, 39 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/inline_target_clones.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 8989985700a..4741c9b5364 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -605,13 +605,6 @@ ix86_can_inline_p (tree caller, tree callee)
>!= (callee_opts->x_target_flags & ~always_inline_safe_mask))
>  ret = false;
>
> -  /* See if arch, tune, etc. are the same.  */
> -  else if (caller_opts->arch != callee_opts->arch)
> -ret = false;
> -
> -  else if (!always_inline && caller_opts->tune != callee_opts->tune)
> -ret = false;
> -
>else if (caller_opts->x_ix86_fpmath != callee_opts->x_ix86_fpmath
>/* If the calle doesn't use FP expressions differences in
>   ix86_fpmath can be ignored.  We are called from FEs
> @@ -622,6 +615,21 @@ ix86_can_inline_p (tree caller, tree callee)
>|| ipa_fn_summaries->get (callee_node)->fp_expressions))
>  ret = false;
>
> +  /* At this point we cannot identify whether arch or tune setting
> + comes from target attribute or not. So the most conservative way
> + is to allow the callee that uses default arch and tune string to
> + be inlined.  */
> +  else if (!strcmp (callee_opts->x_ix86_arch_string, "x86-64")
> +  && !strcmp (callee_opts->x_ix86_tune_string, "generic"))
> +ret = true;
> +
> +  /* See if arch, tune, etc. are the same.  */
> +  else if (caller_opts->arch != callee_opts->arch)
> +ret = false;
> +
> +  else if (!always_inline && caller_opts->tune != callee_opts->tune)
> +ret = false;
> +
>else if (!always_inline
>&& caller_opts->branch_cost != callee_opts->branch_cost)
>  ret = false;
> diff --git a/gcc/testsuite/gcc.target/i386/inline_target_clones.c 
> b/gcc/testsuite/gcc.target/i386/inline_target_clones.c
> new file mode 100644
> index 000..53db1600ce5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/inline_target_clones.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-require-ifunc "" } */
> +/* { dg-options "-O3 -march=x86-64" } */
> +/* { dg-final { scan-assembler-not "call\[ \t\]+callee" } } */
> +
> +float callee (float a, float b, float c, float d,
> + float e, float f, float g, float h)
> +{
> +  return a * b + c * d + e * f + g + h + a * c + b * c
> ++ a * d + b * e + a * f + c * h +
> +b * (a - 0.4f) * (c + h) * (b + e * d) - a / f * h;
> +}
> +
> +__attribute__((target_clones("default","arch=icelake-server")))
> +void caller (int n, float *a,
> +float c1, float c2, float c3,
> +float c4, float c5, float c6,
> +float c7)
> +{
> +  for (int i = 0; i < n; i++)
> +{
> +  a[i] = callee (a[i], c1, c2, c3, c4, c5, c6, c7);
> +}
> +}
> --
> 2.31.1
>


[committed] tree+ggc: Change return type of predicate functions from int to bool

2023-07-03 Thread Uros Bizjak via Gcc-patches
Also change an internal variable from int to bool.

gcc/ChangeLog:

* tree.h (tree_int_cst_equal): Change return type from int to bool.
(operand_equal_for_phi_arg_p): Ditto.
(tree_map_base_marked_p): Ditto.
* tree.cc (contains_placeholder_p): Update function body
for bool return type.
(type_cache_hasher::equal): Ditto.
(tree_map_base_hash): Change return type
from int to bool and adjust function body accordingly.
(tree_int_cst_equal): Ditto.
(operand_equal_for_phi_arg_p): Ditto.
(get_narrower): Change "first" variable to bool.
(cl_option_hasher::equal): Update function body for bool return type.
* ggc.h (ggc_set_mark): Change return type from int to bool.
(ggc_marked_p): Ditto.
* ggc-page.cc (ggc_set_mark): Change return type
from int to bool and adjust function body accordingly.
(ggc_marked_p): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
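
As background for the ggc-page.cc hunks below, here is a self-contained
sketch (not GCC's exact code; the names are illustrative) of the
word/mask bitmap test that ggc_set_mark performs:

#include <stdbool.h>
#include <limits.h>
#include <stddef.h>

static bool
set_mark_bit (unsigned long *in_use, size_t object_index)
{
  size_t bits = sizeof (unsigned long) * CHAR_BIT;
  size_t word = object_index / bits;
  unsigned long mask = 1UL << (object_index % bits);

  if (in_use[word] & mask)
    return true;   /* was already marked */

  in_use[word] |= mask;   /* mark it now */
  return false;
}
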
diff --git a/gcc/ggc-page.cc b/gcc/ggc-page.cc
index c25218d7415..2f0b72e1b22 100644
--- a/gcc/ggc-page.cc
+++ b/gcc/ggc-page.cc
@@ -1538,7 +1538,7 @@ gt_ggc_mx (unsigned char& x ATTRIBUTE_UNUSED)
P must have been allocated by the GC allocator; it mustn't point to
static objects, stack variables, or memory allocated with malloc.  */
 
-int
+bool
 ggc_set_mark (const void *p)
 {
   page_entry *entry;
@@ -1558,7 +1558,7 @@ ggc_set_mark (const void *p)
 
   /* If the bit was previously set, skip it.  */
   if (entry->in_use_p[word] & mask)
-return 1;
+return true;
 
   /* Otherwise set it, and decrement the free object count.  */
   entry->in_use_p[word] |= mask;
@@ -1567,14 +1567,14 @@ ggc_set_mark (const void *p)
   if (GGC_DEBUG_LEVEL >= 4)
 fprintf (G.debug_file, "Marking %p\n", p);
 
-  return 0;
+  return false;
 }
 
-/* Return 1 if P has been marked, zero otherwise.
+/* Return true if P has been marked, zero otherwise.
P must have been allocated by the GC allocator; it mustn't point to
static objects, stack variables, or memory allocated with malloc.  */
 
-int
+bool
 ggc_marked_p (const void *p)
 {
   page_entry *entry;
diff --git a/gcc/ggc.h b/gcc/ggc.h
index 78eab7eaba6..34108e2f006 100644
--- a/gcc/ggc.h
+++ b/gcc/ggc.h
@@ -90,15 +90,15 @@ extern const struct ggc_root_tab * const 
gt_pch_scalar_rtab[];
 
 /* Actually set the mark on a particular region of memory, but don't
follow pointers.  This function is called by ggc_mark_*.  It
-   returns zero if the object was not previously marked; nonzero if
+   returns false if the object was not previously marked; true if
the object was already marked, or if, for any other reason,
pointers in this data structure should not be traversed.  */
-extern int ggc_set_mark(const void *);
+extern bool ggc_set_mark (const void *);
 
-/* Return 1 if P has been marked, zero otherwise.
+/* Return true if P has been marked, zero otherwise.
P must have been allocated by the GC allocator; it mustn't point to
static objects, stack variables, or memory allocated with malloc.  */
-extern int ggc_marked_p(const void *);
+extern bool ggc_marked_p (const void *);
 
 /* PCH and GGC handling for strings, mostly trivial.  */
 extern void gt_pch_n_S (const void *);
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 58288efa2e2..bd500ec72a5 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -2839,7 +2839,7 @@ grow_tree_vec (tree v, int len MEM_STAT_DECL)
   return v;
 }
 
-/* Return 1 if EXPR is the constant zero, whether it is integral, float or
+/* Return true if EXPR is the constant zero, whether it is integral, float or
fixed, and scalar, complex or vector.  */
 
 bool
@@ -2850,7 +2850,7 @@ zerop (const_tree expr)
  || fixed_zerop (expr));
 }
 
-/* Return 1 if EXPR is the integer constant zero or a complex constant
+/* Return true if EXPR is the integer constant zero or a complex constant
of zero, or a location wrapper for such a constant.  */
 
 bool
@@ -2874,7 +2874,7 @@ integer_zerop (const_tree expr)
 }
 }
 
-/* Return 1 if EXPR is the integer constant one or the corresponding
+/* Return true if EXPR is the integer constant one or the corresponding
complex constant, or a location wrapper for such a constant.  */
 
 bool
@@ -2898,9 +2898,9 @@ integer_onep (const_tree expr)
 }
 }
 
-/* Return 1 if EXPR is the integer constant one.  For complex and vector,
-   return 1 if every piece is the integer constant one.
-   Also return 1 for location wrappers for such a constant.  */
+/* Return true if EXPR is the integer constant one.  For complex and vector,
+   return true if every piece is the integer constant one.
+   Also return true for location wrappers for such a constant.  */
 
 bool
 integer_each_onep (const_tree expr)
@@ -2914,8 +2914,8 @@ integer_each_onep (const_tree expr)
 return integer_onep (expr);
 }
 
-/* Return 1 if EXPR is an integer containing all 1's in as much precision as
-   it contains, or a complex or vector whose subparts are such integers,

[committed] fold-const+optabs: Change return type of predicate functions from int to bool

2023-06-30 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* fold-const.h (multiple_of_p): Change return type from int to bool.
* fold-const.cc (split_tree): Change negl_p, neg_litp_p,
neg_conp_p and neg_var_p variables to bool.
(const_binop): Change sat_p variable to bool.
(merge_ranges): Change no_overlap variable to bool.
(extract_muldiv_1): Change same_p variable to bool.
(tree_swap_operands_p): Update function body for bool return type.
(fold_truth_andor): Change commutative variable to bool.
(multiple_of_p): Change return type
from int to bool and adjust function body accordingly.
* optabs.h (expand_twoval_unop): Change return type from int to bool.
(expand_twoval_binop): Ditto.
(can_compare_p): Ditto.
(have_add2_insn): Ditto.
(have_addptr3_insn): Ditto.
(have_sub2_insn): Ditto.
(have_insn_for): Ditto.
* optabs.cc (add_equal_note): Ditto.
(widen_operand): Change no_extend argument from int to bool.
(expand_binop): Ditto.
(expand_twoval_unop): Change return type
from int to bool and adjust function body accordingly.
(expand_twoval_binop): Ditto.
(can_compare_p): Ditto.
(have_add2_insn): Ditto.
(have_addptr3_insn): Ditto.
(have_sub2_insn): Ditto.
(have_insn_for): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index ac90a594fcc..a02ede79fed 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -922,8 +922,8 @@ split_tree (tree in, tree type, enum tree_code code,
 {
   tree op0 = TREE_OPERAND (in, 0);
   tree op1 = TREE_OPERAND (in, 1);
-  int neg1_p = TREE_CODE (in) == MINUS_EXPR;
-  int neg_litp_p = 0, neg_conp_p = 0, neg_var_p = 0;
+  bool neg1_p = TREE_CODE (in) == MINUS_EXPR;
+  bool neg_litp_p = false, neg_conp_p = false, neg_var_p = false;
 
   /* First see if either of the operands is a literal, then a constant.  */
   if (TREE_CODE (op0) == INTEGER_CST || TREE_CODE (op0) == REAL_CST
@@ -1450,7 +1450,7 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
   FIXED_VALUE_TYPE f2;
   FIXED_VALUE_TYPE result;
   tree t, type;
-  int sat_p;
+  bool sat_p;
   bool overflow_p;
 
   /* The following codes are handled by fixed_arithmetic.  */
@@ -5680,7 +5680,7 @@ bool
 merge_ranges (int *pin_p, tree *plow, tree *phigh, int in0_p, tree low0,
  tree high0, int in1_p, tree low1, tree high1)
 {
-  int no_overlap;
+  bool no_overlap;
   int subset;
   int temp;
   tree tem;
@@ -6855,7 +6855,7 @@ extract_muldiv_1 (tree t, tree c, enum tree_code code, 
tree wide_type,
> GET_MODE_SIZE (SCALAR_INT_TYPE_MODE (type)))
? wide_type : type);
   tree t1, t2;
-  int same_p = tcode == code;
+  bool same_p = tcode == code;
   tree op0 = NULL_TREE, op1 = NULL_TREE;
   bool sub_strict_overflow_p;
 
@@ -7467,17 +7467,17 @@ bool
 tree_swap_operands_p (const_tree arg0, const_tree arg1)
 {
   if (CONSTANT_CLASS_P (arg1))
-return 0;
+return false;
   if (CONSTANT_CLASS_P (arg0))
-return 1;
+return true;
 
   STRIP_NOPS (arg0);
   STRIP_NOPS (arg1);
 
   if (TREE_CONSTANT (arg1))
-return 0;
+return false;
   if (TREE_CONSTANT (arg0))
-return 1;
+return true;
 
   /* It is preferable to swap two SSA_NAME to ensure a canonical form
  for commutative and comparison operators.  Ensuring a canonical
@@ -7486,21 +7486,21 @@ tree_swap_operands_p (const_tree arg0, const_tree arg1)
   if (TREE_CODE (arg0) == SSA_NAME
   && TREE_CODE (arg1) == SSA_NAME
   && SSA_NAME_VERSION (arg0) > SSA_NAME_VERSION (arg1))
-return 1;
+return true;
 
   /* Put SSA_NAMEs last.  */
   if (TREE_CODE (arg1) == SSA_NAME)
-return 0;
+return false;
   if (TREE_CODE (arg0) == SSA_NAME)
-return 1;
+return true;
 
   /* Put variables last.  */
   if (DECL_P (arg1))
-return 0;
+return false;
   if (DECL_P (arg0))
-return 1;
+return true;
 
-  return 0;
+  return false;
 }
 
 
@@ -9693,10 +9693,10 @@ fold_truth_andor (location_t loc, enum tree_code code, 
tree type,
   tree a01 = TREE_OPERAND (arg0, 1);
   tree a10 = TREE_OPERAND (arg1, 0);
   tree a11 = TREE_OPERAND (arg1, 1);
-  int commutative = ((TREE_CODE (arg0) == TRUTH_OR_EXPR
- || TREE_CODE (arg0) == TRUTH_AND_EXPR)
-&& (code == TRUTH_AND_EXPR
-|| code == TRUTH_OR_EXPR));
+  bool commutative = ((TREE_CODE (arg0) == TRUTH_OR_EXPR
+  || TREE_CODE (arg0) == TRUTH_AND_EXPR)
+ && (code == TRUTH_AND_EXPR
+ || code == TRUTH_OR_EXPR));
 
   if (operand_equal_p (a00, a10, 0))
return fold_build2_loc (loc, TREE_CODE (arg0), type, a00,
@@ -14012,8 +14012,8 @@ fold_binary_initializer_loc (location_t loc, tree_code 

Re: [x86 PATCH] Add STV support for DImode and SImode rotations by constant.

2023-06-30 Thread Uros Bizjak via Gcc-patches
On Fri, Jun 30, 2023 at 9:29 AM Roger Sayle  wrote:
>
>
> This patch implements scalar-to-vector (STV) support for DImode and SImode
> rotations by constant bit counts.  Scalar rotations are almost always
> optimal on x86, requiring only one or two instructions, but it is also
> possible to implement these efficiently with SSE2, requiring only one
> or two instructions for SImode rotations and at most 3 instructions for
> DImode rotations.  This allows GCC to STV rotations with a small or no
> penalty if there are other (net) benefits to converting a chain.  An
> example of the benefits is shown below, which is based upon the BLAKE2
> cryptographic hash function:
>
> unsigned long long a,b,c,d;
>
> unsigned long long rot(unsigned long long x, int y)
> {
>   return (x<<y) | (x>>(64-y));
> }
>
> void foo()
> {
>   d = rot(d ^ a,32);
>   c = c + d;
>   b = rot(b ^ c,24);
>   a = a + b;
>   d = rot(d ^ a,16);
>   c = c + d;
>   b = rot(b ^ c,63);
> }
>
> where with -m32 -O2 -msse2
>
> Before (59 insns, 247 bytes):
> foo:pushl   %edi
> xorl%edx, %edx
> pushl   %esi
> pushl   %ebx
> subl$16, %esp
> movqa, %xmm1
> movqd, %xmm0
> movqb, %xmm2
> pxor%xmm1, %xmm0
> psrlq   $32, %xmm0
> movd%xmm0, %eax
> movd%edx, %xmm0
> movd%eax, %xmm3
> punpckldq   %xmm0, %xmm3
> movqc, %xmm0
> paddq   %xmm3, %xmm0
> pxor%xmm0, %xmm2
> movd%xmm2, %ecx
> psrlq   $32, %xmm2
> movd%xmm2, %ebx
> movl%ecx, %eax
> shldl   $24, %ebx, %ecx
> shldl   $24, %eax, %ebx
> movd%ebx, %xmm4
> movd%ecx, %xmm2
> punpckldq   %xmm4, %xmm2
> movdqa  .LC0, %xmm4
> pand%xmm4, %xmm2
> paddq   %xmm2, %xmm1
> movq%xmm1, a
> pxor%xmm3, %xmm1
> movd%xmm1, %esi
> psrlq   $32, %xmm1
> movd%xmm1, %edi
> movl%esi, %eax
> shldl   $16, %edi, %esi
> shldl   $16, %eax, %edi
> movd%esi, %xmm1
> movd%edi, %xmm3
> punpckldq   %xmm3, %xmm1
> pand%xmm4, %xmm1
> movq%xmm1, d
> paddq   %xmm1, %xmm0
> movq%xmm0, c
> pxor%xmm2, %xmm0
> movd%xmm0, 8(%esp)
> psrlq   $32, %xmm0
> movl8(%esp), %eax
> movd%xmm0, 12(%esp)
> movl12(%esp), %edx
> shrdl   $1, %edx, %eax
> xorl%edx, %edx
> movl%eax, b
> movl%edx, b+4
> addl$16, %esp
> popl%ebx
> popl%esi
> popl%edi
> ret
>
> After (32 insns, 165 bytes):
> movqa, %xmm1
> xorl%edx, %edx
> movqd, %xmm0
> movqb, %xmm2
> movdqa  .LC0, %xmm4
> pxor%xmm1, %xmm0
> psrlq   $32, %xmm0
> movd%xmm0, %eax
> movd%edx, %xmm0
> movd%eax, %xmm3
> punpckldq   %xmm0, %xmm3
> movqc, %xmm0
> paddq   %xmm3, %xmm0
> pxor%xmm0, %xmm2
> pshufd  $68, %xmm2, %xmm2
> psrldq  $5, %xmm2
> pand%xmm4, %xmm2
> paddq   %xmm2, %xmm1
> movq%xmm1, a
> pxor%xmm3, %xmm1
> pshuflw $147, %xmm1, %xmm1
> pand%xmm4, %xmm1
> movq%xmm1, d
> paddq   %xmm1, %xmm0
> movq%xmm0, c
> pxor%xmm2, %xmm0
> pshufd  $20, %xmm0, %xmm0
> psrlq   $1, %xmm0
> pshufd  $136, %xmm0, %xmm0
> pand%xmm4, %xmm0
> movq%xmm0, b
> ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-06-30  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (compute_convert_gain): Provide
> gains/costs for ROTATE and ROTATERT (by an integer constant).
> (general_scalar_chain::convert_rotate): New helper function to
> convert a DImode or SImode rotation by an integer constant into
> SSE vector form.
> (general_scalar_chain::convert_insn): Call the new convert_rotate
> for ROTATE and ROTATERT.
> (general_scalar_to_vector_candidate_p): Consider ROTATE and
> ROTATERT to be candidates if the second operand is an integer
> constant, valid for a rotation (or shift) in the given mode.
> * config/i386/i386-features.h (general_scalar_chain): Add new
> helper method convert_rotate.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/rotate-6.c: New test case.
> * gcc.target/i386/sse2-stv-1.c: Likewise.

LGTM.

Please note that AVX512VL provides VPROLD/VPROLQ and VPRORD/VPRORQ
native rotate instructions that can come handy here.
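
As a hedged illustration of that remark (assumes the AVX-512VL
intrinsics from immintrin.h; not part of this patch):

#include <immintrin.h>

/* With -mavx512vl, a per-lane 64-bit rotate maps to a single vprolq
   instruction instead of a shift/shift/or sequence.  */
__m128i
rot64_by_32 (__m128i x)
{
  return _mm_rol_epi64 (x, 32);
}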

[committed] cselib+expr+bitmap: Change return type of predicate functions from int to bool

2023-06-29 Thread Uros Bizjak via Gcc-patches
gcc/ChangeLog:

* cselib.h (rtx_equal_for_cselib_1):
Change return type from int to bool.
(references_value_p): Ditto.
(rtx_equal_for_cselib_p): Ditto.
* expr.h (can_store_by_pieces): Ditto.
(try_casesi): Ditto.
(try_tablejump): Ditto.
(safe_from_p): Ditto.
* sbitmap.h (bitmap_equal_p): Ditto.
* cselib.cc (references_value_p): Change return type
from int to bool and adjust function body accordingly.
(rtx_equal_for_cselib_1): Ditto.
* expr.cc (is_aligning_offset): Ditto.
(can_store_by_pieces): Ditto.
(mostly_zeros_p): Ditto.
(all_zeros_p): Ditto.
(safe_from_p): Ditto.
(try_casesi): Ditto.
(try_tablejump): Ditto.
(store_constructor): Change "need_to_clear" and
"const_bounds_p" variables to bool.
* sbitmap.cc (bitmap_equal_p): Change return type from int to bool.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
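
As orientation for the diff below, references_value_p uses the standard
rtl-walking idiom sketched here (schematic only: it assumes GCC's rtl.h
environment, and walk is a placeholder callback):

static void walk (const_rtx);

static void
walk_subrtxes (const_rtx x)
{
  const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
  for (int i = GET_RTX_LENGTH (GET_CODE (x)) - 1; i >= 0; i--)
    {
      if (fmt[i] == 'e')   /* one sub-expression */
	walk (XEXP (x, i));
      else if (fmt[i] == 'E')   /* a vector of sub-expressions */
	for (int j = 0; j < XVECLEN (x, i); j++)
	  walk (XVECEXP (x, i, j));
    }
}
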
diff --git a/gcc/cselib.cc b/gcc/cselib.cc
index 065867b4a84..5b9843a5942 100644
--- a/gcc/cselib.cc
+++ b/gcc/cselib.cc
@@ -636,7 +636,7 @@ cselib_find_slot (machine_mode mode, rtx x, hashval_t hash,
element has been set to zero, which implies the cselib_val will be
removed.  */
 
-int
+bool
 references_value_p (const_rtx x, int only_useless)
 {
   const enum rtx_code code = GET_CODE (x);
@@ -646,19 +646,19 @@ references_value_p (const_rtx x, int only_useless)
   if (GET_CODE (x) == VALUE
   && (! only_useless
  || (CSELIB_VAL_PTR (x)->locs == 0 && !PRESERVED_VALUE_P (x
-return 1;
+return true;
 
   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
 {
   if (fmt[i] == 'e' && references_value_p (XEXP (x, i), only_useless))
-   return 1;
+   return true;
   else if (fmt[i] == 'E')
for (j = 0; j < XVECLEN (x, i); j++)
  if (references_value_p (XVECEXP (x, i, j), only_useless))
-   return 1;
+   return true;
 }
 
-  return 0;
+  return false;
 }
 
 /* Return true if V is a useless VALUE and can be discarded as such.  */
@@ -926,13 +926,13 @@ autoinc_split (rtx x, rtx *off, machine_mode memmode)
   return x;
 }
 
-/* Return nonzero if we can prove that X and Y contain the same value,
+/* Return true if we can prove that X and Y contain the same value,
taking our gathered information into account.  MEMMODE holds the
mode of the enclosing MEM, if any, as required to deal with autoinc
addressing modes.  If X and Y are not (known to be) part of
addresses, MEMMODE should be VOIDmode.  */
 
-int
+bool
 rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode memmode, int depth)
 {
   enum rtx_code code;
@@ -956,7 +956,7 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode memmode, 
int depth)
 }
 
   if (x == y)
-return 1;
+return true;
 
   if (GET_CODE (x) == VALUE)
 {
@@ -973,11 +973,11 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode 
memmode, int depth)
  rtx yoff = NULL;
	  rtx yr = autoinc_split (y, &yoff, memmode);
  if ((yr == x || yr == e->val_rtx) && yoff == NULL_RTX)
-   return 1;
+   return true;
}
 
   if (depth == 128)
-   return 0;
+   return false;
 
   for (l = e->locs; l; l = l->next)
{
@@ -989,10 +989,10 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode 
memmode, int depth)
  if (REG_P (t) || MEM_P (t) || GET_CODE (t) == VALUE)
continue;
  else if (rtx_equal_for_cselib_1 (t, y, memmode, depth + 1))
-   return 1;
+   return true;
}
 
-  return 0;
+  return false;
 }
   else if (GET_CODE (y) == VALUE)
 {
@@ -1006,11 +1006,11 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode 
memmode, int depth)
  rtx xoff = NULL;
	  rtx xr = autoinc_split (x, &xoff, memmode);
  if ((xr == y || xr == e->val_rtx) && xoff == NULL_RTX)
-   return 1;
+   return true;
}
 
   if (depth == 128)
-   return 0;
+   return false;
 
   for (l = e->locs; l; l = l->next)
{
@@ -1019,14 +1019,14 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode 
memmode, int depth)
  if (REG_P (t) || MEM_P (t) || GET_CODE (t) == VALUE)
continue;
  else if (rtx_equal_for_cselib_1 (x, t, memmode, depth + 1))
-   return 1;
+   return true;
}
 
-  return 0;
+  return false;
 }
 
   if (GET_MODE (x) != GET_MODE (y))
-return 0;
+return false;
 
   if (GET_CODE (x) != GET_CODE (y)
   || (GET_CODE (x) == PLUS
@@ -1044,16 +1044,16 @@ rtx_equal_for_cselib_1 (rtx x, rtx y, machine_mode 
memmode, int depth)
   if (x != xorig || y != yorig)
{
  if (!xoff != !yoff)
-   return 0;
+   return false;
 
  if (xoff && !rtx_equal_for_cselib_1 (xoff, yoff, memmode, depth))
-   return 0;
+   return false;
 
  return rtx_equal_for_cselib_1 (x, y, memmode, 

[committed] final+varasm: Change return type of predicate functions from int to bool

2023-06-28 Thread Uros Bizjak via Gcc-patches
Also change some internal variables to bool and change return type of
compute_alignments to void.

gcc/ChangeLog:

* output.h (leaf_function_p): Change return type from int to bool.
(final_forward_branch_p): Ditto.
(only_leaf_regs_used): Ditto.
(maybe_assemble_visibility): Ditto.
* varasm.h (supports_one_only): Ditto.
* rtl.h (compute_alignments): Change return type from int to void.
* final.cc (app_on): Change variable type from int to bool.
(compute_alignments): Change return type from int to void
and adjust function body accordingly.
(shorten_branches):  Change "something_changed" variable
type from int to bool.
(leaf_function_p):  Change return type from int to bool
and adjust function body accordingly.
(final_forward_branch_p): Ditto.
(only_leaf_regs_used): Ditto.
* varasm.cc (contains_pointers_p): Change return type from
int to bool and adjust function body accordingly.
(compare_constant): Ditto.
(maybe_assemble_visibility): Ditto.
(supports_one_only): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/final.cc b/gcc/final.cc
index e614491a69a..dd3e22547ac 100644
--- a/gcc/final.cc
+++ b/gcc/final.cc
@@ -163,9 +163,9 @@ static int insn_counter = 0;
 
 static int block_depth;
 
-/* Nonzero if have enabled APP processing of our assembler output.  */
+/* True if have enabled APP processing of our assembler output.  */
 
-static int app_on;
+static bool app_on;
 
 /* If we are outputting an insn sequence, this contains the sequence rtx.
Zero otherwise.  */
@@ -603,7 +603,7 @@ insn_current_reference_address (rtx_insn *branch)
 
 /* Compute branch alignments based on CFG profile.  */
 
-unsigned int
+void
 compute_alignments (void)
 {
   basic_block bb;
@@ -617,7 +617,7 @@ compute_alignments (void)
 
   /* If not optimizing or optimizing for size, don't assign any alignments.  */
   if (! optimize || optimize_function_for_size_p (cfun))
-return 0;
+return;
 
   if (dump_file)
 {
@@ -721,7 +721,6 @@ compute_alignments (void)
 
   loop_optimizer_finalize ();
   free_dominance_info (CDI_DOMINATORS);
-  return 0;
 }
 
 /* Grow the LABEL_ALIGN array after new labels are created.  */
@@ -790,7 +789,8 @@ public:
   /* opt_pass methods: */
   unsigned int execute (function *) final override
   {
-return compute_alignments ();
+compute_alignments ();
+return 0;
   }
 
 }; // class pass_compute_alignments
@@ -822,7 +822,7 @@ shorten_branches (rtx_insn *first)
   int max_uid;
   int i;
   rtx_insn *seq;
-  int something_changed = 1;
+  bool something_changed = true;
   char *varying_length;
   rtx body;
   int uid;
@@ -1103,7 +1103,7 @@ shorten_branches (rtx_insn *first)
 
   while (something_changed)
 {
-  something_changed = 0;
+  something_changed = false;
   insn_current_align = MAX_CODE_ALIGN - 1;
   for (insn_current_address = 0, insn = first;
   insn != 0;
@@ -1136,7 +1136,7 @@ shorten_branches (rtx_insn *first)
{
  log = newlog;
  LABEL_TO_ALIGNMENT (insn) = log;
- something_changed = 1;
+ something_changed = true;
}
}
}
@@ -1274,7 +1274,7 @@ shorten_branches (rtx_insn *first)
   * GET_MODE_SIZE (table->get_data_mode ()));
  insn_current_address += insn_lengths[uid];
  if (insn_lengths[uid] != old_length)
-   something_changed = 1;
+   something_changed = true;
}
 
  continue;
@@ -1332,7 +1332,7 @@ shorten_branches (rtx_insn *first)
  if (!increasing || inner_length > insn_lengths[inner_uid])
{
  insn_lengths[inner_uid] = inner_length;
- something_changed = 1;
+ something_changed = true;
}
  else
inner_length = insn_lengths[inner_uid];
@@ -1358,7 +1358,7 @@ shorten_branches (rtx_insn *first)
  && (!increasing || new_length > insn_lengths[uid]))
{
  insn_lengths[uid] = new_length;
- something_changed = 1;
+ something_changed = true;
}
  else
insn_current_address += insn_lengths[uid] - new_length;
@@ -4043,9 +4043,9 @@ asm_fprintf (FILE *file, const char *p, ...)
   va_end (argptr);
 }
 
-/* Return nonzero if this function has no function calls.  */
+/* Return true if this function has no function calls.  */
 
-int
+bool
 leaf_function_p (void)
 {
   rtx_insn *insn;
@@ -4056,29 +4056,29 @@ leaf_function_p (void)
   /* Some back-ends (e.g. s390) want leaf functions to stay leaf
  functions even if they call mcount.  */
   if (crtl->profile && 

Re: [PATCH] i386: Relax inline requirement for functions with different target attrs

2023-06-28 Thread Uros Bizjak via Gcc-patches
On Wed, Jun 28, 2023 at 10:20 AM Hongyu Wang  wrote:
>
> > If the user specified a different arch for callee than the caller,
> > then the compiler will switch on different ISAs (-march is just a
> > shortcut for different ISA packs), and the programmer is aware that
> > inlining isn't intended here (we have -mtune, which is not as strong
> > as -march, but even functions with different -mtune are not inlined
> > without always_inline attribute). This is documented as:
>
> The original issue comes from a case like
>
> float callee (float a, float b, float c, float d,
> float e, float f, float g, float h)
> {
> return a * b + c * d + e * f + g + h + a * c + b * c
> + a * d + b * e + a * f + c * h +
> b * (a - 0.4f) * (c + h) * (b + e * d) - a / f * h;
> }
>
> __attribute__((target_clones("default","arch=icelake-server")))
> void caller (int n, float *a,
> float c1, float c2, float c3,
> float c4, float c5, float c6,
> float c7)
> {
>   for (int i = 0; i < n; i++)
>   {
> a[i] = callee (a[i], c1, c2, c3, c4, c5, c6, c7);
>   }
> }
>
> For current gcc, the .icelake_server clone fails to inline the callee due
> to a target-specific option mismatch, while the .default clone
> succeeds and the loop gets vectorized. I think it is not reasonable
> that the clone with the higher arch cannot produce better code.
> So I think at least we can decide to inline callees without any
> arch/tune specified, but for now they are rejected by the strict arch=
> and tune= check.

Yes, I think it is reasonable to inline callee without an arch/tune
specified. We expect "default" callee to have properties that allow
inlining it into all callers, independent of callers arch/tune target
attribute.

Uros.

>
> Uros Bizjak  于2023年6月28日周三 14:43写道:
> >
> > On Wed, Jun 28, 2023 at 3:56 AM Hongyu Wang  wrote:
> > >
> > > > I don't think this is desirable. If we inline something with different
> > > > ISAs, we get some strange mix of ISAs when the function is inlined.
> > > > OTOH - we already inline with mismatched tune flags if the function is
> > > > marked with always_inline.
> > >
> > > Previously ix86_can_inline_p has
> > >
> > > if (((caller_opts->x_ix86_isa_flags & callee_opts->x_ix86_isa_flags)
> > >  != callee_opts->x_ix86_isa_flags)
> > > || ((caller_opts->x_ix86_isa_flags2 & callee_opts->x_ix86_isa_flags2)
> > > != callee_opts->x_ix86_isa_flags2))
> > >   ret = false;
> > >
> > > It makes sure the caller ISA is a superset of the callee's, and the
> > > inlined code should follow the caller's ISA specification.
> > >
> > > IMHO I cannot give a real example where the caller's performance is
> > > harmed after inlining. I added PVW since there might be some callees
> > > that want to limit their vector size while the caller may have a
> > > larger preferred vector size. At least with the current change we get
> > > more optimization opportunities for different target_clones.
> > >
> > > But I agree the tuning setting may be a factor that affects
> > > performance. One possible choice is that if the callee's tune is
> > > unspecified or default, just inline it into the caller with the
> > > specified arch and tune.
> >
> > If the user specified a different arch for callee than the caller,
> > then the compiler will switch on different ISAs (-march is just a
> > shortcut for different ISA packs), and the programmer is aware that
> > inlining isn't intended here (we have -mtune, which is not as strong
> > as -march, but even functions with different -mtune are not inlined
> > without always_inline attribute). This is documented as:
> >
> > --q--
> > On the x86, the inliner does not inline a function that has different
> > target options than the caller, unless the callee has a subset of the
> > target options of the caller. For example a function declared with
> > target("sse3") can inline a function with target("sse2"), since -msse3
> > implies -msse2.
> > --/q--
> >
> > I don't think arch=skylake can be considered as a subset of 
> > arch=icelake-server.
> >
> > I agree that the compiler should reject functions with different PVW.
> > This is also in accordance with the documentation.
> >
> > Uros.
> >
> > >
> > > Uros Bizjak via Gcc-patches  于2023年6月27日周二 
> > > 17:16写道:
> > >
> > >
> > >
> > &

Re: [PATCH] i386: Relax inline requirement for functions with different target attrs

2023-06-28 Thread Uros Bizjak via Gcc-patches
On Wed, Jun 28, 2023 at 3:56 AM Hongyu Wang  wrote:
>
> > I don't think this is desirable. If we inline something with different
> > ISAs, we get some strange mix of ISAs when the function is inlined.
> > OTOH - we already inline with mismatched tune flags if the function is
> > marked with always_inline.
>
> Previously ix86_can_inline_p has
>
> if (((caller_opts->x_ix86_isa_flags & callee_opts->x_ix86_isa_flags)
>  != callee_opts->x_ix86_isa_flags)
> || ((caller_opts->x_ix86_isa_flags2 & callee_opts->x_ix86_isa_flags2)
> != callee_opts->x_ix86_isa_flags2))
>   ret = false;
>
> It makes sure the caller ISA is a superset of the callee's, and the
> inlined code should follow the caller's ISA specification.
>
> IMHO I cannot give a real example where the caller's performance is
> harmed after inlining. I added PVW since there might be some callees
> that want to limit their vector size while the caller may have a
> larger preferred vector size. At least with the current change we get
> more optimization opportunities for different target_clones.
>
> But I agree the tuning setting may be a factor that affects
> performance. One possible choice is that if the callee's tune is
> unspecified or default, just inline it into the caller with the
> specified arch and tune.

If the user specified a different arch for callee than the caller,
then the compiler will switch on different ISAs (-march is just a
shortcut for different ISA packs), and the programmer is aware that
inlining isn't intended here (we have -mtune, which is not as strong
as -march, but even functions with different -mtune are not inlined
without always_inline attribute). This is documented as:

--q--
On the x86, the inliner does not inline a function that has different
target options than the caller, unless the callee has a subset of the
target options of the caller. For example a function declared with
target("sse3") can inline a function with target("sse2"), since -msse3
implies -msse2.
--/q--

I don't think arch=skylake can be considered as a subset of arch=icelake-server.

I agree that the compiler should reject functions with different PVW.
This is also in accordance with the documentation.

Uros.
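
A concrete sketch of the PVW case under discussion (illustrative code,
not from the patch; prefer-vector-width=... is a real x86 attribute):

__attribute__((target("prefer-vector-width=128")))
static int
sum_narrow (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

__attribute__((target("prefer-vector-width=512")))
int
sum_wide (const int *a, int n)
{
  /* Mismatched preferred vector widths: inlining should be rejected.  */
  return sum_narrow (a, n);
}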

>
> Uros Bizjak via Gcc-patches  于2023年6月27日周二 17:16写道:
>
>
>
> >
> > On Mon, Jun 26, 2023 at 4:36 AM Hongyu Wang  wrote:
> > >
> > > Hi,
> > >
> > > For functions with different target attributes, the current logic refuses
> > > to inline the callee when any arch or tune is mismatched. Relax the
> > > condition to honor just prefer_vector_width_type and other flags that
> > > may cause safety issues, so the caller can get more optimization
> > > opportunities.
> >
> > I don't think this is desirable. If we inline something with different
> > ISAs, we get some strange mix of ISAs when the function is inlined.
> > OTOH - we already inline with mismatched tune flags if the function is
> > marked with always_inline.
> >
> > Uros.
> >
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}
> > >
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386.cc (ix86_can_inline_p): Do not check arch or
> > > tune directly, just check prefer_vector_width_type and make sure
> > > not to inline if they mismatch.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/inline-target-attr.c: New test.
> > > ---
> > >  gcc/config/i386/i386.cc   | 11 +
> > >  .../gcc.target/i386/inline-target-attr.c  | 24 +++
> > >  2 files changed, 30 insertions(+), 5 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/inline-target-attr.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 0761965344b..1d86384ac06 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -605,11 +605,12 @@ ix86_can_inline_p (tree caller, tree callee)
> > >!= (callee_opts->x_target_flags & 
> > > ~always_inline_safe_mask))
> > >  ret = false;
> > >
> > > -  /* See if arch, tune, etc. are the same.  */
> > > -  else if (caller_opts->arch != callee_opts->arch)
> > > -ret = false;
> > > -
> > > -  else if (!always_inline && caller_opts->tune != callee_opts->tune)
> > > +  /* Do not inline when specified prefer-vector-width mismatched between
> > > + callee and caller.  */
> > > +  else if ((callee_opts->

Re: [x86 PATCH] Add cbranchti4 pattern to i386.md (for -m32 compare_by_pieces).

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 7:22 PM Roger Sayle  wrote:
>
>
> This patch fixes some very odd (unanticipated) code generation by
> compare_by_pieces with -m32 -mavx, since the recent addition of the
> cbranchoi4 pattern.  The issue is that cbranchoi4 is available with
> TARGET_AVX, but cbranchti4 is currently conditional on TARGET_64BIT
> which results in the odd behaviour (thanks to OPTAB_WIDEN) that with
> -m32 -mavx, compare_by_pieces ends up (inefficiently) widening 128-bit
> comparisons to 256-bits before performing PTEST.
>
> This patch fixes this by providing a cbranchti4 pattern that's available
> with either TARGET_64BIT or TARGET_SSE4_1.
>
> For the test case below (again from PR 104610):
>
> int foo(char *a)
> {
> static const char t[] = "0123456789012345678901234567890";
> return __builtin_memcmp(a, [0], sizeof(t)) == 0;
> }
>
> GCC with -m32 -O2 -mavx currently produces the bonkers:
>
> foo:pushl   %ebp
> movl%esp, %ebp
> andl$-32, %esp
> subl$64, %esp
> movl8(%ebp), %eax
> vmovdqa .LC0, %xmm4
> movl$0, 48(%esp)
> vmovdqu (%eax), %xmm2
> movl$0, 52(%esp)
> movl$0, 56(%esp)
> movl$0, 60(%esp)
> movl$0, 16(%esp)
> movl$0, 20(%esp)
> movl$0, 24(%esp)
> movl$0, 28(%esp)
> vmovdqa %xmm2, 32(%esp)
> vmovdqa %xmm4, (%esp)
> vmovdqa (%esp), %ymm5
> vpxor   32(%esp), %ymm5, %ymm0
> vptest  %ymm0, %ymm0
> jne .L2
> vmovdqu 16(%eax), %xmm7
> movl$0, 48(%esp)
> movl$0, 52(%esp)
> vmovdqa %xmm7, 32(%esp)
> vmovdqa .LC1, %xmm7
> movl$0, 56(%esp)
> movl$0, 60(%esp)
> movl$0, 16(%esp)
> movl$0, 20(%esp)
> movl$0, 24(%esp)
> movl$0, 28(%esp)
> vmovdqa %xmm7, (%esp)
> vmovdqa (%esp), %ymm1
> vpxor   32(%esp), %ymm1, %ymm0
> vptest  %ymm0, %ymm0
> je  .L6
> .L2:movl$1, %eax
> xorl$1, %eax
> vzeroupper
> leave
> ret
> .L6:xorl%eax, %eax
> xorl$1, %eax
> vzeroupper
> leave
> ret
>
> with this patch, we now generate the (slightly) more sensible:
>
> foo:vmovdqa .LC0, %xmm0
> movl4(%esp), %eax
> vpxor   (%eax), %xmm0, %xmm0
> vptest  %xmm0, %xmm0
> jne .L2
> vmovdqa .LC1, %xmm0
> vpxor   16(%eax), %xmm0, %xmm0
> vptest  %xmm0, %xmm0
> je  .L5
> .L2:movl$1, %eax
> xorl$1, %eax
> ret
> .L5:xorl%eax, %eax
> xorl$1, %eax
> ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-06-27  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_branch): Also use ptest
> for TImode comparisons on 32-bit architectures.
> * config/i386/i386.md (cbranch<mode>4): Change from SDWIM to
> SWIM1248x to exclude/avoid TImode being conditional on -m64.
> (cbranchti4): New define_expand for TImode on both TARGET_64BIT
> and/or with TARGET_SSE4_1.
> * config/i386/predicates.md (ix86_timode_comparison_operator):
> New predicate that depends upon TARGET_64BIT.
> (ix86_timode_comparison_operand): Likewise.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/pieces-memcmp-2.c: New test case.

OK with a small fix.

Thanks,
Uros.

+;; Return true if this is a valid second operand for a TImode comparison.
+(define_predicate "ix86_timode_comparison_operand"
+  (if_then_else (match_test "TARGET_64BIT")
+(match_operand 0 "x86_64_general_operand")
+(match_operand 0 "nonimmediate_operand")))
+
+

Please remove the duplicate blank line above.


Re: [x86 PATCH] Fix FAIL of gcc.target/i386/pr78794.c on ia32.

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 8:40 PM Roger Sayle  wrote:
>
>
> This patch fixes the FAIL of gcc.target/i386/pr78794.c on ia32, which
> is caused by minor STV rtx_cost differences with -march=silvermont.
> It turns out that generic tuning results in pandn, but the lack of
> accurate parameterization for COMPARE in compute_convert_gain combined
> with small differences in scalar<->SSE costs on silvermont results in
> this DImode chain not being converted.
>
> The solution is to provide more accurate costs/gains for converting
> (DImode and SImode) comparisons.
>
> I'd been holding off of doing this as I'd thought it would be possible
> to turn pandn;ptestz into ptestc (for an even bigger scalar-to-vector
> win) but I've recently realized that these optimizations (as I've
> implemented them) occur in the wrong order (stv2 occurs after
> combine), so it isn't easy for STV to convert CCZmode into CCCmode.
> Doh!  Perhaps something can be done in peephole2...
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-06-27  Roger Sayle  
>
> gcc/ChangeLog
> PR target/78794
> * config/i386/i386-features.cc (compute_convert_gain): Provide
> more accurate gains for conversion of scalar comparisons to
> PTEST.

LGTM.

Thanks,
Uros.
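
For reference, a hedged sketch of the pandn;ptestz -> ptestc idea
quoted above (SSE4.1 intrinsics; illustrative only, not from the
patch):

#include <immintrin.h>

/* _mm_testc_si128 (a, b) returns the carry flag of ptest, which is
   set iff (~a & b) == 0 -- exactly what pandn followed by a ptest
   zero-test computes.  */
int
covered_p (__m128i a, __m128i b)
{
  return _mm_testc_si128 (a, b);
}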

>
> Thanks for your patience.
> Roger
> --
>


Re: [PATCH] i386: Relax inline requirement for functions with different target attrs

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Mon, Jun 26, 2023 at 4:36 AM Hongyu Wang  wrote:
>
> Hi,
>
> For functions with different target attributes, the current logic refuses to
> inline the callee when any arch or tune is mismatched. Relax the
> condition to honor just prefer_vector_width_type and other flags that
> may cause safety issues, so the caller can get more optimization opportunities.

I don't think this is desirable. If we inline something with different
ISAs, we get some strange mix of ISAs when the function is inlined.
OTOH - we already inline with mismatched tune flags if the function is
marked with always_inline.

Uros.

> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_can_inline_p): Do not check arch or
> tune directly, just check prefer_vector_width_type and make sure
> not to inline if they mismatch.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/inline-target-attr.c: New test.
> ---
>  gcc/config/i386/i386.cc   | 11 +
>  .../gcc.target/i386/inline-target-attr.c  | 24 +++
>  2 files changed, 30 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/inline-target-attr.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 0761965344b..1d86384ac06 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -605,11 +605,12 @@ ix86_can_inline_p (tree caller, tree callee)
>!= (callee_opts->x_target_flags & ~always_inline_safe_mask))
>  ret = false;
>
> -  /* See if arch, tune, etc. are the same.  */
> -  else if (caller_opts->arch != callee_opts->arch)
> -ret = false;
> -
> -  else if (!always_inline && caller_opts->tune != callee_opts->tune)
> +  /* Do not inline when specified prefer-vector-width mismatched between
> + callee and caller.  */
> +  else if ((callee_opts->x_prefer_vector_width_type != PVW_NONE
> +  && caller_opts->x_prefer_vector_width_type != PVW_NONE)
> +  && callee_opts->x_prefer_vector_width_type
> + != caller_opts->x_prefer_vector_width_type)
>  ret = false;
>
>else if (caller_opts->x_ix86_fpmath != callee_opts->x_ix86_fpmath
> diff --git a/gcc/testsuite/gcc.target/i386/inline-target-attr.c 
> b/gcc/testsuite/gcc.target/i386/inline-target-attr.c
> new file mode 100644
> index 000..995502165f0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/inline-target-attr.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { scan-assembler-not "call\[ \t\]callee" } } */
> +
> +__attribute__((target("arch=skylake")))
> +int callee (int n)
> +{
> +  int sum = 0;
> +  for (int i = 0; i < n; i++)
> +{
> +  if (i % 2 == 0)
> +   sum +=i;
> +  else
> +   sum += (i - 1);
> +}
> +  return sum + n;
> +}
> +
> +__attribute__((target("arch=icelake-server")))
> +int caller (int n)
> +{
> +  return callee (n) + n;
> +}
> +
> --
> 2.31.1
>


Re: [PATCH 2/2] Make option mvzeroupper independent of optimization level.

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 8:09 AM Hongtao Liu  wrote:
>
> On Tue, Jun 27, 2023 at 2:05 PM Uros Bizjak  wrote:
> >
> > On Tue, Jun 27, 2023 at 7:55 AM liuhongt  wrote:
> > >
> > > pass_insert_vzeroupper is under condition
> > >
> > > TARGET_AVX && TARGET_VZEROUPPER
> > > && flag_expensive_optimizations && !optimize_size
> > >
> > > But the document of mvzeroupper doesn't mention the insertion
> > > required -O2 and above, it may confuse users when they explicitly
> > > use -Os -mvzeroupper.
> > >
> > > 
> > > mvzeroupper
> > > Target Mask(VZEROUPPER) Save
> > > Generate vzeroupper instruction before a transfer of control flow out of
> > > the function.
> > > 
> > >
> > > The patch moves flag_expensive_optimizations && !optimize_size to
> > > ix86_option_override_internal. It makes -mvzeroupper independent of
> > > optimization level, but still keeps the behavior of architecture
> > > tuning(emit_vzeroupper) unchanged.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386-features.cc (pass_insert_vzeroupper:gate):
> > > Move flag_expensive_optimizations && !optimize_size to ..
> > > * config/i386/i386-options.cc (ix86_option_override_internal):
> > > .. this, it makes -mvzeroupper independent of optimization
> > > level, but still keeps the behavior of architecture
> > > tuning(emit_vzeroupper) unchanged.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/avx-vzeroupper-29.c: New testcase.
> >
> > OK.
> I'd like to backport this patch to GCC10/GCC11/GCC12/GCC13.

Also OK.

Thanks,
Uros.
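
A hedged usage sketch of what the change (and its backports) enables;
the flags are real GCC options, the file name is illustrative:

    $ gcc -Os -mavx -mvzeroupper -S avx-vzeroupper-29.c
    $ grep -c vzeroupper avx-vzeroupper-29.s
    1

Before the patch, an explicit -mvzeroupper was silently ignored at -Os
because the insertion pass was gated on the optimization level.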

> >
> > Thanks,
> > Uros.
> >
> > > ---
> > >  gcc/config/i386/i386-features.cc  |  3 +--
> > >  gcc/config/i386/i386-options.cc   |  4 +++-
> > >  gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c | 14 ++
> > >  3 files changed, 18 insertions(+), 3 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
> > >
> > > diff --git a/gcc/config/i386/i386-features.cc 
> > > b/gcc/config/i386/i386-features.cc
> > > index 4a3b07ae045..92ae08d442e 100644
> > > --- a/gcc/config/i386/i386-features.cc
> > > +++ b/gcc/config/i386/i386-features.cc
> > > @@ -2489,8 +2489,7 @@ public:
> > >/* opt_pass methods: */
> > >bool gate (function *) final override
> > >  {
> > > -  return TARGET_AVX && TARGET_VZEROUPPER
> > > -   && flag_expensive_optimizations && !optimize_size;
> > > +  return TARGET_AVX && TARGET_VZEROUPPER;
> > >  }
> > >
> > >unsigned int execute (function *) final override
> > > diff --git a/gcc/config/i386/i386-options.cc 
> > > b/gcc/config/i386/i386-options.cc
> > > index 2cb0bddcd35..f76e7c5947b 100644
> > > --- a/gcc/config/i386/i386-options.cc
> > > +++ b/gcc/config/i386/i386-options.cc
> > > @@ -2727,7 +2727,9 @@ ix86_option_override_internal (bool main_args_p,
> > >  sorry ("%<-mcall-ms2sysv-xlogues%> isn%'t currently supported with 
> > > SEH");
> > >
> > >if (!(opts_set->x_target_flags & MASK_VZEROUPPER)
> > > -  && TARGET_EMIT_VZEROUPPER)
> > > +  && TARGET_EMIT_VZEROUPPER
> > > +  && flag_expensive_optimizations
> > > +  && !optimize_size)
> > >  opts->x_target_flags |= MASK_VZEROUPPER;
> > >if (!(opts_set->x_target_flags & MASK_STV))
> > >  opts->x_target_flags |= MASK_STV;
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c 
> > > b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
> > > new file mode 100644
> > > index 000..4af637757f7
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
> > > @@ -0,0 +1,14 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O0 -mavx -mtune=generic -mvzeroupper -dp" } */
> > > +
> > > +#include 
> > > +
> > > +extern __m256 x, y;
> > > +
> > > +void
> > > +foo ()
> > > +{
> > > +  x = y;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "avx_vzeroupper" 1 } } */
> > > --
> > > 2.39.1.388.g2fc9e9ca3c
> > >
>
>
>
> --
> BR,
> Hongtao


Re: [PATCH 1/2] Don't issue vzeroupper for vzeroupper call_insn.

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 8:08 AM Hongtao Liu  wrote:
>
> On Tue, Jun 27, 2023 at 2:05 PM Uros Bizjak  wrote:
> >
> > On Tue, Jun 27, 2023 at 7:55 AM liuhongt  wrote:
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/82735
> * config/i386/i386.cc (ix86_avx_u128_mode_needed): Don't emit
> > > vzeroupper for vzeroupper call_insn.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/avx-vzeroupper-30.c: New test.
> > > ---
> > >  gcc/config/i386/i386.cc   |  5 +++--
> > >  gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c | 15 +++
> > >  2 files changed, 18 insertions(+), 2 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 0761965344b..caca74d6dec 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -14489,8 +14489,9 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
> > >  modes wider than 256 bits.  It's only safe to issue a
> > >  vzeroupper if all SSE registers are clobbered.  */
> > >const function_abi  = insn_callee_abi (insn);
> > > -  if (!hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> > > - abi.mode_clobbers (V4DImode)))
> > > +  if (vzeroupper_pattern (PATTERN (insn), VOIDmode)
> > > + || !hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> > > +abi.mode_clobbers (V4DImode)))
> > > return AVX_U128_ANY;
> >
> > You also want to check for vzeroall_pattern here.
> This is inside
>  if (CALL_P (insn))
>
> vzeroupper is defined as special call_insn, but vzeroall is not.

Indeed. Patch is OK as it is then.

Thanks,
Uros.

> >
> > OK with the above change.
> >
> > Thanks,
> > Uros.
> >
> > >
> > >return AVX_U128_CLEAN;
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c 
> > > b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
> > > new file mode 100644
> > > index 000..c1c9baa8fc4
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx -mvzeroupper -dp" } */
> > > +
> > > +#include 
> > > +
> > > +extern __m256 x, y;
> > > +
> > > +void
> > > +foo ()
> > > +{
> > > +  x = y;
> > > +  _mm256_zeroupper ();
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "avx_vzeroupper" 1 } } */
> > > --
> > > 2.39.1.388.g2fc9e9ca3c
> > >
>
>
>
> --
> BR,
> Hongtao


Re: [PATCH 2/2] Make option mvzeroupper independent of optimization level.

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 7:55 AM liuhongt  wrote:
>
> pass_insert_vzeroupper is under condition
>
> TARGET_AVX && TARGET_VZEROUPPER
> && flag_expensive_optimizations && !optimize_size
>
> But the document of mvzeroupper doesn't mention the insertion
> required -O2 and above, it may confuse users when they explicitly
> use -Os -mvzeroupper.
>
> 
> mvzeroupper
> Target Mask(VZEROUPPER) Save
> Generate vzeroupper instruction before a transfer of control flow out of
> the function.
> 
>
> The patch moves flag_expensive_optimizations && !optimize_size to
> ix86_option_override_internal. It makes -mvzeroupper independent of
> optimization level, but still keeps the behavior of architecture
> tuning(emit_vzeroupper) unchanged.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386-features.cc (pass_insert_vzeroupper:gate):
> Move flag_expensive_optimizations && !optimize_size to ..
> * config/i386/i386-options.cc (ix86_option_override_internal):
> .. this, it makes -mvzeroupper independent of optimization
> level, but still keeps the behavior of architecture
> tuning(emit_vzeroupper) unchanged.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx-vzeroupper-29.c: New testcase.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-features.cc  |  3 +--
>  gcc/config/i386/i386-options.cc   |  4 +++-
>  gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c | 14 ++
>  3 files changed, 18 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
>
> diff --git a/gcc/config/i386/i386-features.cc 
> b/gcc/config/i386/i386-features.cc
> index 4a3b07ae045..92ae08d442e 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -2489,8 +2489,7 @@ public:
>/* opt_pass methods: */
>bool gate (function *) final override
>  {
> -  return TARGET_AVX && TARGET_VZEROUPPER
> -   && flag_expensive_optimizations && !optimize_size;
> +  return TARGET_AVX && TARGET_VZEROUPPER;
>  }
>
>unsigned int execute (function *) final override
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 2cb0bddcd35..f76e7c5947b 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -2727,7 +2727,9 @@ ix86_option_override_internal (bool main_args_p,
>  sorry ("%<-mcall-ms2sysv-xlogues%> isn%'t currently supported with SEH");
>
>if (!(opts_set->x_target_flags & MASK_VZEROUPPER)
> -  && TARGET_EMIT_VZEROUPPER)
> +  && TARGET_EMIT_VZEROUPPER
> +  && flag_expensive_optimizations
> +  && !optimize_size)
>  opts->x_target_flags |= MASK_VZEROUPPER;
>if (!(opts_set->x_target_flags & MASK_STV))
>  opts->x_target_flags |= MASK_STV;
> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c 
> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
> new file mode 100644
> index 000..4af637757f7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-29.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O0 -mavx -mtune=generic -mvzeroupper -dp" } */
> +
> +#include 
> +
> +extern __m256 x, y;
> +
> +void
> +foo ()
> +{
> +  x = y;
> +}
> +
> +/* { dg-final { scan-assembler-times "avx_vzeroupper" 1 } } */
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [PATCH 1/2] Don't issue vzeroupper for vzeroupper call_insn.

2023-06-27 Thread Uros Bizjak via Gcc-patches
On Tue, Jun 27, 2023 at 7:55 AM liuhongt  wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/82735
> * config/i386/i386.cc (ix86_avx_u128_mode_needed): Don't emit
> vzeroupper for vzeroupper call_insn.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx-vzeroupper-30.c: New test.
> ---
>  gcc/config/i386/i386.cc   |  5 +++--
>  gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c | 15 +++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 0761965344b..caca74d6dec 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -14489,8 +14489,9 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
>  modes wider than 256 bits.  It's only safe to issue a
>  vzeroupper if all SSE registers are clobbered.  */
>const function_abi  = insn_callee_abi (insn);
> -  if (!hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> - abi.mode_clobbers (V4DImode)))
> +  if (vzeroupper_pattern (PATTERN (insn), VOIDmode)
> + || !hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> +abi.mode_clobbers (V4DImode)))
> return AVX_U128_ANY;

You also want to check for vzeroall_pattern here.

OK with the above change.

Thanks,
Uros.

>
>return AVX_U128_CLEAN;
> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
> new file mode 100644
> index 00000000000..c1c9baa8fc4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-30.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx -mvzeroupper -dp" } */
> +
> +#include <immintrin.h>
> +
> +extern __m256 x, y;
> +
> +void
> +foo ()
> +{
> +  x = y;
> +  _mm256_zeroupper ();
> +}
> +
> +/* { dg-final { scan-assembler-times "avx_vzeroupper" 1 } } */
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [PATCH] i386: Sync tune_string with arch_string for target attribute arch=*

2023-06-26 Thread Uros Bizjak via Gcc-patches
On Mon, Jun 26, 2023 at 4:31 AM Hongyu Wang  wrote:
>
> Hi,
>
> For a function with target attribute arch=*, the current logic sets its
> tune to the -mtune value from the command line, so all target_clones get
> the same tuning flags, which hurts the performance of the individual
> clones. Override tune with arch if tune was not explicitly specified, so
> each clone gets proper tuning flags.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}
>
> Ok for trunk and backport to active release branches?
>
> gcc/ChangeLog:
>
> * config/i386/i386-options.cc (ix86_valid_target_attribute_tree):
> Override tune_string with arch_string if tune_string is not
> explicitly specified.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/mvc17.c: New test.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-options.cc   |  6 +-
>  gcc/testsuite/gcc.target/i386/mvc17.c | 11 +++
>  2 files changed, 16 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/mvc17.c
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 2cb0bddcd35..7f593cebe76 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1400,7 +1400,11 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
>if (option_strings[IX86_FUNCTION_SPECIFIC_TUNE])
> opts->x_ix86_tune_string
>   = ggc_strdup (option_strings[IX86_FUNCTION_SPECIFIC_TUNE]);
> -  else if (orig_tune_defaulted)
> +  /* If we have an explicit arch string and no tune string was specified,
> +     set tune_string to NULL; later it will be overridden by arch_string
> +     so target clones can get proper optimization.  */
> +  else if (option_strings[IX86_FUNCTION_SPECIFIC_ARCH]
> +           || orig_tune_defaulted)
>      opts->x_ix86_tune_string = NULL;
>
>/* If fpmath= is not set, and we now have sse2 on 32-bit, use it.  */
> diff --git a/gcc/testsuite/gcc.target/i386/mvc17.c b/gcc/testsuite/gcc.target/i386/mvc17.c
> new file mode 100644
> index 000..2c7cc2fdace
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/mvc17.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-require-ifunc "" } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { scan-assembler-times "rep mov" 1 } } */
> +
> +__attribute__((target_clones("default","arch=icelake-server")))
> +void
> +foo (char *a, char *b, int size)
> +{
> +  __builtin_memcpy (a, b, size & 0x7F);
> +}
> --
> 2.31.1
>
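
As a side note, the effect can be illustrated outside the testsuite with a
hypothetical clone pair like the one below (illustration only, not part of
the patch; the exact codegen depends on the active cost tables). Built
with plain -O2, the arch=icelake-server clone is now compiled as if
-mtune=icelake-server were also given, so its inline memcpy expansion can
differ from the default clone's; before the fix, both clones inherited the
command-line tuning.

/* Hypothetical example, not from the patch.  */
__attribute__((target_clones("default","arch=icelake-server")))
void
copy_small (char *dst, const char *src, int n)
{
  __builtin_memcpy (dst, src, n & 0x7F);
}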


Re: [x86_PATCH] New *ashl_doubleword_highpart define_insn_and_split.

2023-06-25 Thread Uros Bizjak via Gcc-patches
On Sat, Jun 24, 2023 at 8:04 PM Roger Sayle  wrote:
>
>
> This patch contains a pair of (related) optimizations in i386.md that
> allow us to generate better code for the example below (this is a step
> towards fixing a bugzilla PR, but I've forgotten the number).
>
> __int128 foo64(__int128 x, long long y)
> {
>   __int128 t = (__int128)y << 64;
>   return x ^ t;
> }
>
> The hidden issue is that the RTL currently seen by reload contains
> the sign extension of y from DImode to TImode, even though this is
> dead (not required) for left shifts by WORD_SIZE bits or more.
>
> (insn 11 8 12 2 (parallel [
>             (set (reg:TI 0 ax [orig:91 y ] [91])
>                 (sign_extend:TI (reg:DI 1 dx [97])))
>             (clobber (reg:CC 17 flags))
>             (clobber (scratch:DI))
>         ]) {extendditi2}
>
> What makes this particularly undesirable is that the sign-extension
> pattern above requires an additional DImode scratch register, indicated
> by the clobber, which unnecessarily increases register pressure.
>
> The proposed solution is to add a define_insn_and_split for such
> left shifts (of sign or zero extensions) that only have a non-zero
> highpart, in which the redundant extension is eliminated and which can
> be split after reload without scratch registers or early clobbers.
>
> This (late split) exposes a second optimization opportunity where
> setting the lowpart to zero can sometimes be combined/simplified with
> the following instruction during peephole2.
>
> For the test case above, we previously generated with -O2:
>
> foo64:  xorl    %eax, %eax
>         xorq    %rsi, %rdx
>         xorq    %rdi, %rax
>         ret
>
> with this patch, we now generate:
>
> foo64:  movq    %rdi, %rax
>         xorq    %rsi, %rdx
>         ret
>
> Likewise for the related -m32 test case, we go from:
>
> foo32:  movl    12(%esp), %eax
>         movl    %eax, %edx
>         xorl    %eax, %eax
>         xorl    8(%esp), %edx
>         xorl    4(%esp), %eax
>         ret
>
> to the improved:
>
> foo32:  movl    12(%esp), %edx
>         movl    4(%esp), %eax
>         xorl    8(%esp), %edx
>         ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-06-24  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (peephole2): Simplify zeroing a register
> followed by an IOR, XOR or PLUS operation on it, into a move.
> (*ashl<dwi>3_doubleword_highpart): New define_insn_and_split to
> eliminate (and hide from reload) unnecessary word to doubleword
> extensions that are followed by left shifts by sufficiently large
> (but valid) bit counts.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/ashldi3-1.c: New 32-bit test case.
> * gcc.target/i386/ashlti3-2.c: New 64-bit test case.

OK.

Thanks,
Uros.
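
As a quick sanity check of the "dead extension" observation in the patch
description -- a standalone illustration, not part of the patch -- sign-
and zero-extending y produce the same result once the shift count reaches
the word size:

/* Illustration only: for a left shift by 64, just the original 64 bits
   of y reach the (high half of the) result, so how y was extended to
   128 bits cannot matter.  */
#include <stdio.h>

int
main (void)
{
  long long y = -2;  /* sign bit set, so the two extensions differ */
  unsigned __int128 s = (unsigned __int128) (__int128) y << 64;
  unsigned __int128 z = (unsigned __int128) (unsigned long long) y << 64;
  printf ("%d\n", s == z);  /* prints 1: the extension is dead */
  return 0;
}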

>
>
> Thanks again,
> Roger
> --
>


[committed] function: Change return type of predicate function from int to bool

2023-06-21 Thread Uros Bizjak via Gcc-patches
Also change some internal variables to bool and some functions to void.

gcc/ChangeLog:

* function.h (emit_initial_value_sets):
Change return type from int to void.
(aggregate_value_p): Change return type from int to bool.
(prologue_contains): Ditto.
(epilogue_contains): Ditto.
(prologue_epilogue_contains): Ditto.
* function.cc (temp_slot): Make "in_use" variable bool.
(make_slot_available): Update for changed "in_use" variable.
(assign_stack_temp_for_type): Ditto.
(emit_initial_value_sets): Change return type from int to void
and update function body accordingly.
(instantiate_virtual_regs): Ditto.
(rest_of_handle_thread_prologue_and_epilogue): Ditto.
(safe_insn_predicate): Change return type from int to bool.
(aggregate_value_p): Change return type from int to bool
and update function body accordingly.
(prologue_contains): Change return type from int to bool.
(prologue_epilogue_contains): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/function.cc b/gcc/function.cc
index 82102ed78d7..6a79a8290f6 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -578,8 +578,8 @@ public:
   tree type;
   /* The alignment (in bits) of the slot.  */
   unsigned int align;
-  /* Nonzero if this temporary is currently in use.  */
-  char in_use;
+  /* True if this temporary is currently in use.  */
+  bool in_use;
   /* Nesting level at which this slot is being used.  */
   int level;
   /* The offset of the slot from the frame_pointer, including extra space
@@ -674,7 +674,7 @@ make_slot_available (class temp_slot *temp)
 {
   cut_slot_from_list (temp, temp_slots_at_level (temp->level));
   insert_slot_to_list (temp, &avail_temp_slots);
-  temp->in_use = 0;
+  temp->in_use = false;
   temp->level = -1;
   n_temp_slots_in_use--;
 }
@@ -848,7 +848,7 @@ assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
  if (known_ge (best_p->size - rounded_size, alignment))
{
  p = ggc_alloc<temp_slot> ();
- p->in_use = 0;
+ p->in_use = false;
  p->size = best_p->size - rounded_size;
  p->base_offset = best_p->base_offset + rounded_size;
  p->full_size = best_p->full_size - rounded_size;
@@ -918,7 +918,7 @@ assign_stack_temp_for_type (machine_mode mode, poly_int64 size, tree type)
 }
 
   p = selected;
-  p->in_use = 1;
+  p->in_use = true;
   p->type = type;
   p->level = temp_slot_level;
   n_temp_slots_in_use++;
@@ -1340,7 +1340,7 @@ has_hard_reg_initial_val (machine_mode mode, unsigned int regno)
   return NULL_RTX;
 }
 
-unsigned int
+void
 emit_initial_value_sets (void)
 {
   struct initial_value_struct *ivs = crtl->hard_reg_initial_vals;
@@ -1348,7 +1348,7 @@ emit_initial_value_sets (void)
   rtx_insn *seq;
 
   if (ivs == 0)
-return 0;
+return;
 
   start_sequence ();
   for (i = 0; i < ivs->num_entries; i++)
@@ -1357,7 +1357,6 @@ emit_initial_value_sets (void)
   end_sequence ();
 
   emit_insn_at_entry (seq);
-  return 0;
 }
 
 /* Return the hardreg-pseudoreg initial values pair entry I and
@@ -1535,7 +1534,7 @@ instantiate_virtual_regs_in_rtx (rtx *loc)
 /* A subroutine of instantiate_virtual_regs_in_insn.  Return true if X
matches the predicate for insn CODE operand OPERAND.  */
 
-static int
+static bool
 safe_insn_predicate (int code, int operand, rtx x)
 {
   return code < 0 || insn_operand_matches ((enum insn_code) code, operand, x);
@@ -1947,7 +1946,7 @@ instantiate_decls (tree fndecl)
 /* Pass through the INSNS of function FNDECL and convert virtual register
references to hard register references.  */
 
-static unsigned int
+static void
 instantiate_virtual_regs (void)
 {
   rtx_insn *insn;
@@ -2001,8 +2000,6 @@ instantiate_virtual_regs (void)
   /* Indicate that, from now on, assign_stack_local should use
  frame_pointer_rtx.  */
   virtuals_instantiated = 1;
-
-  return 0;
 }
 
 namespace {
@@ -2030,7 +2027,8 @@ public:
   /* opt_pass methods: */
   unsigned int execute (function *) final override
 {
-  return instantiate_virtual_regs ();
+  instantiate_virtual_regs ();
+  return 0;
 }
 
 }; // class pass_instantiate_virtual_regs
@@ -2044,12 +2042,12 @@ make_pass_instantiate_virtual_regs (gcc::context *ctxt)
 }
 
 
-/* Return 1 if EXP is an aggregate type (or a value with aggregate type).
+/* Return true if EXP is an aggregate type (or a value with aggregate type).
This means a type for which function calls must pass an address to the
function or get an address back from the function.
EXP may be a type node or an expression (whose type is tested).  */
 
-int
+bool
 aggregate_value_p (const_tree exp, const_tree fntype)
 {
   const_tree type = (TYPE_P (exp)) ? exp : TREE_TYPE (exp);
@@ -2069,7 +2067,7 @@ aggregate_value_p (const_tree exp, const_tree fntype)
  else
/* For internal functions, assume nothing needs to be
