Go patch committed: Be strict about escape analysis of builtin functions

2021-08-04 Thread Ian Lance Taylor via Gcc-patches
This Go frontend patch by Cherry Mui makes the escape analysis pass
stricter about builtin functions.  In the places where we handle
builtin functions, list all supported ones and fail if an unexpected
one is seen, so that if a new builtin function is added in the future
we can detect it instead of silently treating it as nonescaping.
Bootstrapped and ran Go testsuite on x86_64-pc-linux-gnu.  Committed
to mainline.

Ian
5b720746d8456986f4bb6b53d30b462f93ff58c4
diff --git a/gcc/go/gofrontend/MERGE b/gcc/go/gofrontend/MERGE
index be1a90f7aa1..394530c1cbc 100644
--- a/gcc/go/gofrontend/MERGE
+++ b/gcc/go/gofrontend/MERGE
@@ -1,4 +1,4 @@
-616ee658a6238e7de53592ebda5997f6de6a00de
+b47bcf942daa9a0c252db9b57b8f138adbfcdaa2
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
diff --git a/gcc/go/gofrontend/escape.cc b/gcc/go/gofrontend/escape.cc
index 347ac2534c9..c8978ac9239 100644
--- a/gcc/go/gofrontend/escape.cc
+++ b/gcc/go/gofrontend/escape.cc
@@ -1608,8 +1608,33 @@ Escape_analysis_assign::expression(Expression** pexpr)
 }
 break;
 
-  default:
+  case Builtin_call_expression::BUILTIN_CLOSE:
+  case Builtin_call_expression::BUILTIN_DELETE:
+  case Builtin_call_expression::BUILTIN_PRINT:
+  case Builtin_call_expression::BUILTIN_PRINTLN:
+  case Builtin_call_expression::BUILTIN_LEN:
+  case Builtin_call_expression::BUILTIN_CAP:
+  case Builtin_call_expression::BUILTIN_COMPLEX:
+  case Builtin_call_expression::BUILTIN_REAL:
+  case Builtin_call_expression::BUILTIN_IMAG:
+  case Builtin_call_expression::BUILTIN_RECOVER:
+  case Builtin_call_expression::BUILTIN_ALIGNOF:
+  case Builtin_call_expression::BUILTIN_OFFSETOF:
+  case Builtin_call_expression::BUILTIN_SIZEOF:
+// these do not escape.
+break;
+
+  case Builtin_call_expression::BUILTIN_ADD:
+  case Builtin_call_expression::BUILTIN_SLICE:
+// handled in ::assign.
 break;
+
+  case Builtin_call_expression::BUILTIN_MAKE:
+  case Builtin_call_expression::BUILTIN_NEW:
+// should have been lowered to runtime calls at this point.
+// fallthrough
+  default:
+go_unreachable();
   }
 break;
   }
@@ -2372,8 +2397,35 @@ Escape_analysis_assign::assign(Node* dst, Node* src)
 }
 break;
 
-  default:
+  case Builtin_call_expression::BUILTIN_LEN:
+  case Builtin_call_expression::BUILTIN_CAP:
+  case Builtin_call_expression::BUILTIN_COMPLEX:
+  case Builtin_call_expression::BUILTIN_REAL:
+  case Builtin_call_expression::BUILTIN_IMAG:
+  case Builtin_call_expression::BUILTIN_RECOVER:
+  case Builtin_call_expression::BUILTIN_ALIGNOF:
+  case Builtin_call_expression::BUILTIN_OFFSETOF:
+  case Builtin_call_expression::BUILTIN_SIZEOF:
+// these do not escape.
+break;
+
+  case Builtin_call_expression::BUILTIN_COPY:
+// handled in ::expression.
 break;
+
+  case Builtin_call_expression::BUILTIN_CLOSE:
+  case Builtin_call_expression::BUILTIN_DELETE:
+  case Builtin_call_expression::BUILTIN_PRINT:
+  case Builtin_call_expression::BUILTIN_PRINTLN:
+  case Builtin_call_expression::BUILTIN_PANIC:
+// these do not have result.
+// fallthrough
+  case Builtin_call_expression::BUILTIN_MAKE:
+  case Builtin_call_expression::BUILTIN_NEW:
+// should have been lowered to runtime calls at this point.
+// fallthrough
+  default:
+go_unreachable();
   }
 break;
   }


[PATCH] rs6000: Add vec_unpacku_{hi,lo}_v4si

2021-08-04 Thread Kewen.Lin via Gcc-patches
Hi,

The existing vec_unpacku_{hi,lo} expanders support emulated unsigned
unpacking for short and char but lack support for int.  This patch
adds support for vec_unpacku_{hi,lo}_v4si.

Meanwhile, the current implementation uses a vector permutation
approach, which requires one extra customized constant vector as the
permutation control vector (PCV).  It is better to use vector merge
high/low with a zero constant vector, which saves space in the
constant area as well as the cost of initializing the PCV in the
prologue.  This patch updates the expanders to use vector merging and
simplifies them with mode iterators.
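
For context, a loop of the following shape (a hypothetical reduction,
not one of the committed testcases) is the kind of code that needs
vec_unpacku_{hi,lo}_v4si in order to be vectorized:

/* Hypothetical sketch: unsigned int -> unsigned long long widening in
   a loop.  Vectorizing it requires unpacking the high and low halves
   of a V4SI vector into two V2DI vectors with zero extension.  */
void
widen_u32_to_u64 (unsigned long long *restrict dst,
                  unsigned int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}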

Bootstrapped & regtested on powerpc64le-linux-gnu P9 and
powerpc64-linux-gnu P8.

By the way, the loop in unpack-vectorize-2.c doesn't get vectorized
without this patch; unpack-vectorize-[13]* verify that the vector
merging and simplification work as expected.

Is it ok for trunk?

BR,
Kewen
-
gcc/ChangeLog:

* config/rs6000/altivec.md (vec_unpacku_hi_v16qi): Remove.
(vec_unpacku_hi_v8hi): Likewise.
(vec_unpacku_lo_v16qi): Likewise.
(vec_unpacku_lo_v8hi): Likewise.
(vec_unpacku_hi_): New define_expand.
(vec_unpacku_lo_): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/unpack-vectorize-1.c: New test.
* gcc.target/powerpc/unpack-vectorize-1.h: New test.
* gcc.target/powerpc/unpack-vectorize-2.c: New test.
* gcc.target/powerpc/unpack-vectorize-2.h: New test.
* gcc.target/powerpc/unpack-vectorize-3.c: New test.
* gcc.target/powerpc/unpack-vectorize-3.h: New test.
* gcc.target/powerpc/unpack-vectorize-run-1.c: New test.
* gcc.target/powerpc/unpack-vectorize-run-2.c: New test.
* gcc.target/powerpc/unpack-vectorize-run-3.c: New test.
* gcc.target/powerpc/unpack-vectorize.h: New test.
---
 gcc/config/rs6000/altivec.md  | 158 --
 .../gcc.target/powerpc/unpack-vectorize-1.c   |  18 ++
 .../gcc.target/powerpc/unpack-vectorize-1.h   |  14 ++
 .../gcc.target/powerpc/unpack-vectorize-2.c   |  12 ++
 .../gcc.target/powerpc/unpack-vectorize-2.h   |   7 +
 .../gcc.target/powerpc/unpack-vectorize-3.c   |  11 ++
 .../gcc.target/powerpc/unpack-vectorize-3.h   |   7 +
 .../powerpc/unpack-vectorize-run-1.c  |  24 +++
 .../powerpc/unpack-vectorize-run-2.c  |  16 ++
 .../powerpc/unpack-vectorize-run-3.c  |  16 ++
 .../gcc.target/powerpc/unpack-vectorize.h |  42 +
 11 files changed, 196 insertions(+), 129 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-2.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize-run-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/unpack-vectorize.h

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index d70c17e6bc2..0e8b66cd6a5 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -134,10 +134,8 @@ (define_c_enum "unspec"
UNSPEC_VMULWLUH
UNSPEC_VMULWHSH
UNSPEC_VMULWLSH
-   UNSPEC_VUPKHUB
-   UNSPEC_VUPKHUH
-   UNSPEC_VUPKLUB
-   UNSPEC_VUPKLUH
+   UNSPEC_VUPKHUBHW
+   UNSPEC_VUPKLUBHW
UNSPEC_VPERMSI
UNSPEC_VPERMHI
UNSPEC_INTERHI
@@ -3885,143 +3883,45 @@ (define_insn "xxeval"
[(set_attr "type" "vecsimple")
 (set_attr "prefixed" "yes")])
 
-(define_expand "vec_unpacku_hi_v16qi"
-  [(set (match_operand:V8HI 0 "register_operand" "=v")
-(unspec:V8HI [(match_operand:V16QI 1 "register_operand" "v")]
- UNSPEC_VUPKHUB))]
-  "TARGET_ALTIVEC"  
-{  
-  rtx vzero = gen_reg_rtx (V8HImode);
-  rtx mask = gen_reg_rtx (V16QImode);
-  rtvec v = rtvec_alloc (16);
-  bool be = BYTES_BIG_ENDIAN;
-   
-  emit_insn (gen_altivec_vspltish (vzero, const0_rtx));
-   
-  RTVEC_ELT (v,  0) = gen_rtx_CONST_INT (QImode, be ? 16 :  7);
-  RTVEC_ELT (v,  1) = gen_rtx_CONST_INT (QImode, be ?  0 : 16);
-  RTVEC_ELT (v,  2) = gen_rtx_CONST_INT (QImode, be ? 16 :  6);
-  RTVEC_ELT (v,  3) = gen_rtx_CONST_INT (QImode, be ?  1 : 16);
-  RTVEC_ELT (v,  4) = gen_rtx_CONST_INT (QImode, be ? 16 :  5);
-  RTVEC_ELT (v,  5) = gen_rtx_CONST_INT (QImode, be ?  2 : 16);
-  RTVEC_ELT (v,  6) = gen_rtx_CONST_INT (QImode, be ? 16 :  4);
-  RTVEC_ELT (v,  7) = gen_rtx_CONST_INT (QImode, be ?  3 : 16);
-  RTVEC_ELT (v,  8) = gen_rtx_CONST_INT (QImode, be ? 16 :  3);
-  RTVEC_ELT (v,  9) = gen_rtx_CONST_INT (QImode, be ?  4 : 16);
-  RTVEC_ELT (v, 10) = gen_rtx_CONST_INT (QImode, be ? 16 :  2);
-  RTVEC_ELT 

[PATCH,V2 1/3] bpf: Add new -mcore option for BPF CO-RE

2021-08-04 Thread Indu Bhagat via Gcc-patches
-mcore in the BPF backend enables code generation for the CO-RE usecase. LTO is
disabled for CO-RE compilations.

gcc/ChangeLog:

* config/bpf/bpf.c (bpf_option_override): For BPF backend, disable LTO
support when compiling for CO-RE.
* config/bpf/bpf.opt: Add new command line option -mcore.

gcc/testsuite/ChangeLog:

* gcc.target/bpf/core-lto-1.c: New test.
---
 gcc/config/bpf/bpf.c  | 15 +++
 gcc/config/bpf/bpf.opt|  4 
 gcc/testsuite/gcc.target/bpf/core-lto-1.c |  9 +
 3 files changed, 28 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-lto-1.c

diff --git a/gcc/config/bpf/bpf.c b/gcc/config/bpf/bpf.c
index e635f9e..028013e 100644
--- a/gcc/config/bpf/bpf.c
+++ b/gcc/config/bpf/bpf.c
@@ -158,6 +158,21 @@ bpf_option_override (void)
 {
   /* Set the initializer for the per-function status structure.  */
   init_machine_status = bpf_init_machine_status;
+
+  /* To support the portability needs of BPF CO-RE approach, BTF debug
+ information includes the BPF CO-RE relocations.  The information
+ necessary for these relocations is added to the CTF container by the
+ BPF backend.  Enabling LTO poses challenges in the generation of the BPF
+ CO-RE relocations because if LTO is in effect, they need to be
+ generated late in the LTO link phase.  This in turn means the compiler
+ needs to provide means to combine the early and late BTF debug info,
+ similar to DWARF debug info.
+
+ In any case, in absence of linker support for BTF sections at this time,
+ it is acceptable to simply disallow LTO for BPF CO-RE compilations.  */
+
+  if (flag_lto && TARGET_BPF_CORE)
+error ("BPF CO-RE does not support LTO");
 }
 
 #undef TARGET_OPTION_OVERRIDE
diff --git a/gcc/config/bpf/bpf.opt b/gcc/config/bpf/bpf.opt
index 916b53c..e8926f5 100644
--- a/gcc/config/bpf/bpf.opt
+++ b/gcc/config/bpf/bpf.opt
@@ -127,3 +127,7 @@ Generate little-endian eBPF.
 mframe-limit=
 Target Joined RejectNegative UInteger IntegerRange(0, 32767) 
Var(bpf_frame_limit) Init(512)
 Set a hard limit for the size of each stack frame, in bytes.
+
+mcore
+Target Mask(BPF_CORE)
+Generate all necessary information for BPF Compile Once - Run Everywhere.
diff --git a/gcc/testsuite/gcc.target/bpf/core-lto-1.c 
b/gcc/testsuite/gcc.target/bpf/core-lto-1.c
new file mode 100644
index 000..a90dc5b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/core-lto-1.c
@@ -0,0 +1,9 @@
+/* Test -mcore with -flto.
+  
+   -mcore is used to generate information for BPF CO-RE usecase. To support
+   the generation of the .BTF and .BTF.ext sections in GCC, -flto is disabled
+   with -mcore.  */
+
+/* { dg-do compile } */
+/* { dg-error "BPF CO-RE does not support LTO" "" { target bpf-*-* } 0 } */
+/* { dg-options "-gbtf -mcore -flto" } */
-- 
1.8.3.1



[PATCH, V2 3/3] dwarf2out: Emit BTF in dwarf2out_finish for BPF CO-RE usecase

2021-08-04 Thread Indu Bhagat via Gcc-patches
DWARF generation is split between early and late phases when LTO is in effect.
This poses challenges for CTF/BTF generation, especially if late debug info
generation is desirable, as it turns out to be the case for BPF CO-RE.

In the case of BPF CO-RE, the BPF backend adds information about CO-RE
relocations to the CTF container. This information is what needs to be emitted
as a separate .BTF.ext section when -mcore is in effect. Further, each CO-RE
relocation record holds an offset to a string specifying the access to the
structure's field. This means that the .BTF string table needs to be modified
"late" in the compilation process. In other words, the BTF sections cannot be
finalized in dwarf2out_early_finish when -mcore is in effect for the BPF
backend.

Now, the emission of CTF/BTF debug info cannot be moved unconditionally to
dwarf2out_finish because dwarf2out_finish is not invoked at all for the LTO
compile phase for slim LTO objects, thus breaking CTF/BTF generation for other
targets when used with LTO.

The approach taken in this patch is:

1. LTO is disabled for BPF CO-RE
The reason to disable LTO for BPF CO-RE is that if LTO is in effect, BPF CO-RE
relocations need to be generated in the LTO link phase _after_ the optimizations
are done.  This means we would need to devise a way to combine early and late
BTF.  At this time, in the absence of linker support for BTF sections, it makes
sense to steer clear of LTO for BPF CO-RE and bypass the issue.

2. Use a target hook to allow the BPF backend to cleanly convey the case when
late finalization of the CTF container is desirable.

So, in other words,

dwarf2out_early_finish
  - Always emit CTF here.
  - if (BTF && ctfc_debuginfo_early_finish_p), emit BTF now.

dwarf2out_finish
  - if (BTF && !ctfc_debuginfo_early_finish_p && !in_lto_p) emit BTF now.
  - Use of in_lto_p to make sure LTO link phase does not affect BTF sections
for other targets.

gcc/ChangeLog:

* dwarf2ctf.c (ctf_debug_finalize): Make it static.
(ctf_debug_early_finish): New definition.
(ctf_debug_finish): Likewise.
* dwarf2ctf.h (ctf_debug_finalize): Remove declaration.
(ctf_debug_early_finish): New declaration.
(ctf_debug_finish): Likewise.
* dwarf2out.c (dwarf2out_finish): Invoke ctf_debug_finish.
(dwarf2out_early_finish): Invoke ctf_debug_early_finish.
---
 gcc/dwarf2ctf.c | 55 +++
 gcc/dwarf2ctf.h |  4 +++-
 gcc/dwarf2out.c |  9 +++--
 3 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/gcc/dwarf2ctf.c b/gcc/dwarf2ctf.c
index 5e8a725..0fa429c 100644
--- a/gcc/dwarf2ctf.c
+++ b/gcc/dwarf2ctf.c
@@ -917,6 +917,27 @@ gen_ctf_type (ctf_container_ref ctfc, dw_die_ref die)
   return type_id;
 }
 
+/* Prepare for output and write out the CTF debug information.  */
+
+static void
+ctf_debug_finalize (const char *filename, bool btf)
+{
+  if (btf)
+{
+  btf_output (filename);
+  btf_finalize ();
+}
+
+  else
+{
+  /* Emit the collected CTF information.  */
+  ctf_output (filename);
+
+  /* Reset the CTF state.  */
+  ctf_finalize ();
+}
+}
+
 bool
 ctf_do_die (dw_die_ref die)
 {
@@ -966,25 +987,35 @@ ctf_debug_init_postprocess (bool btf)
 btf_init_postprocess ();
 }
 
-/* Prepare for output and write out the CTF debug information.  */
+/* Early finish CTF/BTF debug info.  */
 
 void
-ctf_debug_finalize (const char *filename, bool btf)
+ctf_debug_early_finish (const char * filename)
 {
-  if (btf)
+  /* Emit CTF debug info early always.  */
+  if (ctf_debug_info_level > CTFINFO_LEVEL_NONE
+  /* Emit BTF debug info early if the target does not require late
+emission.  */
+   || (btf_debuginfo_p ()
+  && targetm.ctfc_debuginfo_early_finish_p ()))
 {
-  btf_output (filename);
-  btf_finalize ();
+  /* Emit CTF/BTF debug info.  */
+  ctf_debug_finalize (filename, btf_debuginfo_p ());
 }
+}
 
-  else
-{
-  /* Emit the collected CTF information.  */
-  ctf_output (filename);
+/* Finish CTF/BTF debug info emission.  */
 
-  /* Reset the CTF state.  */
-  ctf_finalize ();
-}
+void
+ctf_debug_finish (const char * filename)
+{
+  /* Emit BTF debug info here when the target needs to update the CTF container
+ (ctfc) in the backend.  An example of this, at this time is the BPF CO-RE
+ usecase.  */
+  if (btf_debuginfo_p ()
+  && (!in_lto_p && !targetm.ctfc_debuginfo_early_finish_p ()))
+/* Emit BTF debug info.  */
+ctf_debug_finalize (filename, btf_debuginfo_p ());
 }
 
 #include "gt-dwarf2ctf.h"
diff --git a/gcc/dwarf2ctf.h b/gcc/dwarf2ctf.h
index a3cf567..9edbde0 100644
--- a/gcc/dwarf2ctf.h
+++ b/gcc/dwarf2ctf.h
@@ -24,13 +24,15 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_DWARF2CTF_H 1
 
 #include "dwarf2out.h"
+#include "flags.h"
 
 /* Debug Format Interface.  Used in dwarf2out.c.  */
 
 extern void ctf_debug_init 

[PATCH, V2 2/3] targhooks: New target hook for CTF/BTF debug info emission

2021-08-04 Thread Indu Bhagat via Gcc-patches
This patch adds a new target hook to detect whether the CTF container can allow
the emission of CTF/BTF debug info at DWARF debug info early finish time.  Some
backends, e.g., BPF when generating code for the CO-RE use case, may need to
emit the CTF/BTF debug info sections around the time when late DWARF debug info
is finalized (dwarf2out_finish).

gcc/ChangeLog:

* config/bpf/bpf.c (ctfc_debuginfo_early_finish_p): New definition.
(TARGET_CTFC_DEBUGINFO_EARLY_FINISH_P): Undefine and override.
* doc/tm.texi: Regenerated.
* doc/tm.texi.in: Document the new hook.
* target.def: Add a new hook.
* targhooks.c (default_ctfc_debuginfo_early_finish_p): Likewise.
* targhooks.h (default_ctfc_debuginfo_early_finish_p): Likewise.
---
 gcc/config/bpf/bpf.c | 14 ++
 gcc/doc/tm.texi  |  6 ++
 gcc/doc/tm.texi.in   |  2 ++
 gcc/target.def   | 10 ++
 gcc/targhooks.c  |  6 ++
 gcc/targhooks.h  |  2 ++
 6 files changed, 40 insertions(+)

diff --git a/gcc/config/bpf/bpf.c b/gcc/config/bpf/bpf.c
index 028013e..85f6b76 100644
--- a/gcc/config/bpf/bpf.c
+++ b/gcc/config/bpf/bpf.c
@@ -178,6 +178,20 @@ bpf_option_override (void)
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE bpf_option_override
 
+/* Return FALSE iff -mcore has been specified.  */
+
+static bool
+ctfc_debuginfo_early_finish_p (void)
+{
+  if (TARGET_BPF_CORE)
+return false;
+  else
+return true;
+}
+
+#undef TARGET_CTFC_DEBUGINFO_EARLY_FINISH_P
+#define TARGET_CTFC_DEBUGINFO_EARLY_FINISH_P ctfc_debuginfo_early_finish_p
+
 /* Define target-specific CPP macros.  This function in used in the
definition of TARGET_CPU_CPP_BUILTINS in bpf.h */
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index cb01528..2d5ff05 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -10400,6 +10400,12 @@ Define this macro if GCC should produce debugging 
output in BTF debug
 format in response to the @option{-gbtf} option.
 @end defmac
 
+@deftypefn {Target Hook} bool TARGET_CTFC_DEBUGINFO_EARLY_FINISH_P (void)
+This target hook returns nonzero if the CTF Container can allow the
+ emission of the CTF/BTF debug info at the DWARF debuginfo early finish
+ time.
+@end deftypefn
+
 @node Floating Point
 @section Cross Compilation and Floating Point
 @cindex cross compilation and floating point
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4a522ae..05b3c2c 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7020,6 +7020,8 @@ Define this macro if GCC should produce debugging output 
in BTF debug
 format in response to the @option{-gbtf} option.
 @end defmac
 
+@hook TARGET_CTFC_DEBUGINFO_EARLY_FINISH_P
+
 @node Floating Point
 @section Cross Compilation and Floating Point
 @cindex cross compilation and floating point
diff --git a/gcc/target.def b/gcc/target.def
index 68a46aa..44e2251 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4016,6 +4016,16 @@ clobbered parts of a register altering the frame 
register size",
  machine_mode, (int regno),
  default_dwarf_frame_reg_mode)
 
+/* Return nonzero if CTF Container can finalize the CTF/BTF emission
+   at DWARF debuginfo early finish time.  */
+DEFHOOK
+(ctfc_debuginfo_early_finish_p,
+ "This target hook returns nonzero if the CTF Container can allow the\n\
+ emission of the CTF/BTF debug info at the DWARF debuginfo early finish\n\
+ time.",
+ bool, (void),
+ default_ctfc_debuginfo_early_finish_p)
+
 /* If expand_builtin_init_dwarf_reg_sizes needs to fill in table
entries not corresponding directly to registers below
FIRST_PSEUDO_REGISTER, this hook should generate the necessary
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index eb51909..e38566c 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -2112,6 +2112,12 @@ default_dwarf_frame_reg_mode (int regno)
   return save_mode;
 }
 
+bool
+default_ctfc_debuginfo_early_finish_p (void)
+{
+  return true;
+}
+
 /* To be used by targets where reg_raw_mode doesn't return the right
mode for registers used in apply_builtin_return and apply_builtin_arg.  */
 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index f92e102..55dc443 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -255,6 +255,8 @@ extern unsigned int default_dwarf_poly_indeterminate_value 
(unsigned int,
unsigned int *,
int *);
 extern machine_mode default_dwarf_frame_reg_mode (int);
+extern bool default_ctfc_debuginfo_early_finish_p (void);
+
 extern fixed_size_mode default_get_reg_raw_mode (int);
 extern bool default_keep_leaf_when_profiled ();
 
-- 
1.8.3.1



[PATCH,V2 0/3] Allow means for late BTF generation for BPF CO-RE

2021-08-04 Thread Indu Bhagat via Gcc-patches
[Changes from V1]
- [1/3] bpf: Add new -mcore option for BPF CO-RE
  Moved the testcase from gcc.dg/debug/btf/ to gcc.target/bpf/. Adjusted the
  testcase a bit.
- targhooks: New target hook for CTF/BTF debug info emission
  (Same as V1)
- dwarf2out: Emit BTF in dwarf2out_finish for BPF CO-RE usecase
  Moved the call to ctf_debug_finish (in dwarf2out_finish) before the point of
  early exit taken when dwarf_debuginfo_p () is false.
[End of Changes from V1]


Hello,

This patch series puts the framework in place for late BTF generation (in
dwarf2out_finish). This is needed for the landing of BPF CO-RE support in GCC,
patches for which were posted recently
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576719.html.

BPF's Compile Once - Run Everywhere (CO-RE) feature is used to make a compiled
BPF program portable across kernel versions, all without the need to recompile
the BPF program. A key part of the BPF CO-RE capability is the BTF debug info
generated for the program.

A traditional (non-CO-RE) BPF program will have a .BTF section which contains
the type information in the BTF debug format. In the case of CO-RE, however, an
additional .BTF.ext section is generated. The .BTF.ext section contains the
CO-RE relocations. A BPF loader will use the .BTF.ext section along with the
associated .BTF section to adjust some references in the instructions of the
program to ensure it is compatible with the required kernel version / headers.

Roughly, each CO-RE relocation record will contain the following info
 - offset of BPF instruction to be patched
 - the BTF ID of the data structure being accessed by the instruction, and 
 - an offset to the BTF string which encodes a series of field accesses to
   retrieve the field of interest in the instruction.
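
A rough sketch in C of such a record is shown below; the field names are
illustrative only and make no claim to match the actual .BTF.ext layout:

/* Illustrative only -- not the actual .BTF.ext record layout.  */
struct core_reloc_record_sketch
{
  unsigned int insn_off;       /* Offset of the BPF instruction to be patched.  */
  unsigned int type_id;        /* BTF ID of the data structure being accessed.  */
  unsigned int access_str_off; /* Offset into the BTF string table of the string
                                  encoding the series of field accesses.  */
};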

High-level design
-
- The CTF container is populated with the compiler-internal representation for
the "type information" at dwarf2out_early_finish time.
- In case of CO-RE compilation, the information needed to generate .BTF.ext
section is added by the BPF backend to the CTF container (CTFC) at XXX time.
This introduces challenges in having LTO support for CO-RE - CO-RE relocations
can only be generated late, much like late DWARF. 
- Combining late and early BTF is not done, as the patch set disallows using LTO
together with CO-RE for the BPF target.
- A new target hook is added for the CTFC (CTF Container) to know whether early
emission of CTF/BTF is allowed for the target.

Testing Notes

- Bootstrapped and reg tested on x86_64
- make all-gcc for --target=bpf-unknown-none; tested ctf.exp, btf.exp and 
bpf.exp

Thanks,
Indu Bhagat (3):
  bpf: Add new -mcore option for BPF CO-RE
  targhooks: New target hook for CTF/BTF debug info emission
  dwarf2out: Emit BTF in dwarf2out_finish for BPF CO-RE usecase

 gcc/config/bpf/bpf.c  | 29 
 gcc/config/bpf/bpf.opt|  4 +++
 gcc/doc/tm.texi   |  6 
 gcc/doc/tm.texi.in|  2 ++
 gcc/dwarf2ctf.c   | 55 ---
 gcc/dwarf2ctf.h   |  4 ++-
 gcc/dwarf2out.c   |  9 +++--
 gcc/target.def| 10 ++
 gcc/targhooks.c   |  6 
 gcc/targhooks.h   |  2 ++
 gcc/testsuite/gcc.target/bpf/core-lto-1.c |  9 +
 11 files changed, 121 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-lto-1.c

-- 
1.8.3.1



Re: [PATCH 02/34] rs6000: Add gengtype handling to the build machinery

2021-08-04 Thread Segher Boessenkool
On Thu, Jul 29, 2021 at 08:30:49AM -0500, Bill Schmidt wrote:
>   * config.gcc (target_gtfiles): Add ./rs6000-builtins.h.
>   * config/rs6000/t-rs6000 (EXTRA_GTYPE_DEPS): Set.

> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -22,6 +22,7 @@ TM_H += $(srcdir)/config/rs6000/rs6000-builtin.def
>  TM_H += $(srcdir)/config/rs6000/rs6000-cpus.def
>  TM_H += $(srcdir)/config/rs6000/rs6000-modes.h
>  PASSES_EXTRA += $(srcdir)/config/rs6000/rs6000-passes.def
> +EXTRA_GTYPE_DEPS += $(srcdir)/config/rs6000/rs6000-builtin-new.def
>  
>  rs6000-pcrel-opt.o: $(srcdir)/config/rs6000/rs6000-pcrel-opt.c
>   $(COMPILE) $<

Surprisingly I couldn't find docs or examples for EXTRA_GTYPE_DEPS.
But it looks like it will work.  Okay for trunk, thanks!


Segher


Re: [PATCH 01/34] rs6000: Incorporate new builtins code into the build machinery

2021-08-04 Thread Segher Boessenkool
Hi!

On Thu, Jul 29, 2021 at 08:30:48AM -0500, Bill Schmidt wrote:
>   * config/rs6000/rs6000-gen-builtins.c (main): Close init_file
>   last.

That easily fits on one line?

> +rs6000-gen-builtins: rs6000-gen-builtins.o rbtree.o
> + $(LINKER_FOR_BUILD) $(BUILD_LINKERFLAGS) $(BUILD_LDFLAGS) -o $@ \
> + $(filter-out $(BUILD_LIBDEPS), $^) $(BUILD_LIBS)

I wonder what the difference is between BUILD_LINKERFLAGS and
BUILD_LDFLAGS?  Do you have any idea?

Okay for trunk.  Thanks!


Segher


[committed] analyzer: initial implementation of asm support [PR101570]

2021-08-04 Thread David Malcolm via Gcc-patches
Successfully bootstrapped & regtested on x86_64-pc-linux-gnu.
Pushed to trunk as r12-2749-gded2c2c068f6f2825474758cb03a05070a5837e8.
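
As a rough illustration (not one of the committed tests, which cover x86
patterns such as cpuid and rdmsr), the analyzer can now track values flowing
through asm outputs in code like the following hypothetical sketch:

/* Hypothetical sketch: the output operand's value is modeled as an
   asm_output_svalue instead of being treated as entirely unknown.
   (x86 AT&T syntax: copies input operand 1 into output operand 0.)  */
static int
copy_via_asm (int x)
{
  int y;
  asm ("mov %1, %0" : "=r" (y) : "r" (x));
  return y;
}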

gcc/ChangeLog:
PR analyzer/101570
* Makefile.in (ANALYZER_OBJS): Add analyzer/region-model-asm.o.

gcc/analyzer/ChangeLog:
PR analyzer/101570
* analyzer.cc (maybe_reconstruct_from_def_stmt): Add GIMPLE_ASM
case.
* analyzer.h (class asm_output_svalue): New forward decl.
(class reachable_regions): New forward decl.
* complexity.cc (complexity::from_vec_svalue): New.
* complexity.h (complexity::from_vec_svalue): New decl.
* engine.cc (feasibility_state::maybe_update_for_edge): Handle
asm stmts by calling on_asm_stmt.
* region-model-asm.cc: New file.
* region-model-manager.cc
(region_model_manager::maybe_fold_asm_output_svalue): New.
(region_model_manager::get_or_create_asm_output_svalue): New.
(region_model_manager::log_stats): Log m_asm_output_values_map.
* region-model.cc (region_model::on_stmt_pre): Handle GIMPLE_ASM.
* region-model.h (visitor::visit_asm_output_svalue): New.
(region_model_manager::get_or_create_asm_output_svalue): New decl.
(region_model_manager::maybe_fold_asm_output_svalue): New decl.
(region_model_manager::asm_output_values_map_t): New typedef.
(region_model_manager::m_asm_output_values_map): New field.
(region_model::on_asm_stmt): New.
* store.cc (binding_cluster::on_asm): New.
* store.h (binding_cluster::on_asm): New decl.
* svalue.cc (svalue::cmp_ptr): Handle SK_ASM_OUTPUT.
(asm_output_svalue::dump_to_pp): New.
(asm_output_svalue::dump_input): New.
(asm_output_svalue::input_idx_to_asm_idx): New.
(asm_output_svalue::accept): New.
* svalue.h (enum svalue_kind): Add SK_ASM_OUTPUT.
(svalue::dyn_cast_asm_output_svalue): New.
(class asm_output_svalue): New.
(is_a_helper ::test): New.
(struct default_hash_traits): New.

gcc/testsuite/ChangeLog:
PR analyzer/101570
* gcc.dg/analyzer/asm-x86-1.c: New test.
* gcc.dg/analyzer/asm-x86-lp64-1.c: New test.
* gcc.dg/analyzer/asm-x86-lp64-2.c: New test.
* gcc.dg/analyzer/pr101570.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-array_index_mask_nospec.c:
New test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid-paravirt-1.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid-paravirt-2.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-rdmsr-paravirt.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-rdmsr.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-wfx_get_ps_timeout-full.c:
New test.
* gcc.dg/analyzer/torture/asm-x86-linux-wfx_get_ps_timeout-reduced.c:
New test.

Signed-off-by: David Malcolm 
---
 gcc/Makefile.in   |   1 +
 gcc/analyzer/analyzer.cc  |   1 +
 gcc/analyzer/analyzer.h   |   2 +
 gcc/analyzer/complexity.cc|  16 +
 gcc/analyzer/complexity.h |   1 +
 gcc/analyzer/engine.cc|   2 +
 gcc/analyzer/region-model-asm.cc  | 303 +
 gcc/analyzer/region-model-manager.cc  |  48 +++
 gcc/analyzer/region-model.cc  |   5 +-
 gcc/analyzer/region-model.h   |  13 +
 gcc/analyzer/store.cc |  17 +
 gcc/analyzer/store.h  |   1 +
 gcc/analyzer/svalue.cc|  89 +
 gcc/analyzer/svalue.h | 145 +++-
 gcc/testsuite/gcc.dg/analyzer/asm-x86-1.c |  69 
 .../gcc.dg/analyzer/asm-x86-lp64-1.c  | 131 +++
 .../gcc.dg/analyzer/asm-x86-lp64-2.c  |  34 ++
 gcc/testsuite/gcc.dg/analyzer/pr101570.c  |   5 +
 .../asm-x86-linux-array_index_mask_nospec.c   |  74 
 .../torture/asm-x86-linux-cpuid-paravirt-1.c  |  81 +
 .../torture/asm-x86-linux-cpuid-paravirt-2.c  | 135 
 .../analyzer/torture/asm-x86-linux-cpuid.c|  46 +++
 .../torture/asm-x86-linux-rdmsr-paravirt.c| 210 
 .../analyzer/torture/asm-x86-linux-rdmsr.c|  33 ++
 .../asm-x86-linux-wfx_get_ps_timeout-full.c   | 319 ++
 ...asm-x86-linux-wfx_get_ps_timeout-reduced.c |  77 +
 26 files changed, 1855 insertions(+), 3 deletions(-)
 create mode 100644 gcc/analyzer/region-model-asm.cc
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/asm-x86-1.c
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/asm-x86-lp64-1.c
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/asm-x86-lp64-2.c
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/pr101570.c
 create mode 100644 

PING^1 [PATCH v3 1/2] Add -f[no-]direct-extern-access

2021-08-04 Thread H.J. Lu via Gcc-patches
On Mon, Jul 12, 2021 at 5:13 AM H.J. Lu  wrote:
>
> On Sun, Jul 11, 2021 at 11:13 PM Richard Biener
>  wrote:
> >
> > On Fri, Jul 9, 2021 at 4:50 PM H.J. Lu  wrote:
> > >
> > > -fdirect-extern-access is the default.  With -fno-direct-extern-access:
> > >
> > > 1. Always use GOT to access undefined data and function symbols,
> > >including in PIE and non-PIE.  These will avoid copy relocations
> > >in executables.  This is compatible with existing executables and
> > >shared libraries.
> > > 2. In executable and shared library, bind symbols with the STV_PROTECTED
> > >visibility locally:
> > >a. The address of data symbol is the address of data body.
> > >b. For systems without function descriptor, the function pointer is
> > >   the address of function body.
> > >c. The resulting shared libraries may not be incompatible with
> > >   executables which have copy relocations on protected symbols or
> > >   use executable PLT entries as function addresses for protected
> > >   functions in shared libraries.
> > > 3. Update asm_preferred_eh_data_format to select PC relative EH encoding
> > > format with -fno-direct-extern-access to avoid copy relocation.
> > > 4. Add ix86_reloc_rw_mask for TARGET_ASM_RELOC_RW_MASK to avoid copy
> > > relocation with -fno-direct-extern-access.
> >
> > Did you check how relocations in .debug_info behave?  I don't remember 
> > whether
>
> Yes, I did.   I added ix86_reloc_rw_mask and use PC-relative format for
> EH pointer encodings to avoid copy relocation for -fno-direct-extern-access
> in read-only sections.

PING:

https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574846.html
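
For illustration of point 1 above (a hypothetical example, not taken from the
patch), a reference to an undefined global goes through the GOT under
-fno-direct-extern-access, so the executable needs no copy relocation:

/* Hypothetical sketch: with -fno-direct-extern-access, the load of
   'extern_var' is done via the GOT even in non-PIE code, avoiding a
   copy relocation in the executable.  */
extern int extern_var;

int
read_extern_var (void)
{
  return extern_var;
}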

> > we're doing anything special there or if we just copy how we emit
> > relocs in .text
> >
> > Richard.

-- 
H.J.


Re: [PATCH, rs6000] Add store fusion support for Power10

2021-08-04 Thread Segher Boessenkool
Hi!

On Wed, Aug 04, 2021 at 04:16:45PM -0500, Pat Haugen wrote:
> On 8/4/21 9:23 AM, Bill Schmidt wrote:
> >> +  /* GPR stores can be ascending or descending offsets, FPR/VSR stores
> > VSR?  I don't see how that applies here.

Almost all scalar FP insns have a VR alternative, but unfortunately not
(always) with a unified or consistent syntax.

> > Can you think of any test cases we can use to demonstrate store fusion?
> 
> Yeah, should be able to come up with something to verify two adjacent stores.

The testcase should also stop working when the fusion does.  This might
mean you need more than one testcase.  The number one goal of our
testsuite is to warn of regressions, so it really should be more than
just a single two-liner scanning for one assembler insn.  Maybe do a
handful of cases?  :-)


Segher


Re: [PATCH, rs6000] Add store fusion support for Power10

2021-08-04 Thread Pat Haugen via Gcc-patches
On 8/4/21 9:23 AM, Bill Schmidt wrote:
> Hi Pat,
> 
> Good stuff!  Comments below.
> 
> On 8/2/21 3:19 PM, Pat Haugen via Gcc-patches wrote:
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 279f00cc648..1460a0d7c5c 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -4490,6 +4490,10 @@ rs6000_option_override_internal (bool global_init_p)
>>     && (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2ADD) == 0)
>>   rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2ADD;
>>
>> +  if (TARGET_POWER10
>> +  && (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2STORE) == 0)
>> +    rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2STORE;
>> +
>>     /* Turn off vector pair/mma options on non-power10 systems.  */
>>     else if (!TARGET_POWER10 && TARGET_MMA)
>>   {
>> @@ -18357,7 +18361,7 @@ is_load_insn1 (rtx pat, rtx *load_mem)
>>     if (!pat || pat == NULL_RTX)
>>   return false;
>>
>> -  if (GET_CODE (pat) == SET)
>> +  if (GET_CODE (pat) == SET && REG_P (SET_DEST (pat)))
>>   return find_mem_ref (SET_SRC (pat), load_mem);
> Looks like this is just an optimization to quickly discard stores, right?

Additional verification check to make sure destination is REG, yes. This will 
become a separate patch.

>>     if (GET_CODE (pat) == PARALLEL)
>> @@ -18394,7 +18398,8 @@ is_store_insn1 (rtx pat, rtx *str_mem)
>>     if (!pat || pat == NULL_RTX)
>>   return false;
>>
>> -  if (GET_CODE (pat) == SET)
>> +  if (GET_CODE (pat) == SET
>> +  && (REG_P (SET_SRC (pat)) || SUBREG_P (SET_SRC (pat))))
>>   return find_mem_ref (SET_DEST (pat), str_mem);
> 
> 
> Similar question.

Similar answer. :)
> 
>>
>>     if (GET_CODE (pat) == PARALLEL)
>> @@ -18859,6 +18864,96 @@ power9_sched_reorder2 (rtx_insn **ready, int 
>> lastpos)
>>     return cached_can_issue_more;
>>   }
>>
>> +/* Determine if INSN is a store to memory that can be fused with a similar
>> +   adjacent store.  */
>> +
>> +static bool
>> +is_fusable_store (rtx_insn *insn, rtx *str_mem)
>> +{
>> +  /* Exit early if not doing store fusion.  */
>> +  if (!(TARGET_P10_FUSION && TARGET_P10_FUSION_2STORE))
>> +    return false;
>> +
>> +  /* Insn must be a non-prefixed base+disp form store.  */
>> +  if (is_store_insn (insn, str_mem)
>> +  && get_attr_prefixed (insn) == PREFIXED_NO
>> +  && get_attr_update (insn) == UPDATE_NO
>> +  && get_attr_indexed (insn) == INDEXED_NO)
>> +    {
>> +  /* Further restictions by mode and size.  */
>> +  machine_mode mode = GET_MODE (*str_mem);
>> +  HOST_WIDE_INT size;
>> +  if MEM_SIZE_KNOWN_P (*str_mem)
>> +    size = MEM_SIZE (*str_mem);
>> +  else
>> +    return false;
>> +
>> +  if INTEGRAL_MODE_P (mode)
>> +    {
>> +  /* Must be word or dword size.  */
>> +  return (size == 4 || size == 8);
>> +    }
>> +  else if FLOAT_MODE_P (mode)
>> +    {
>> +  /* Must be dword size.  */
>> +  return (size == 8);
>> +    }
>> +    }
>> +
>> +  return false;
>> +}
>> +
>> +/* Do Power10 specific reordering of the ready list.  */
>> +
>> +static int
>> +power10_sched_reorder (rtx_insn **ready, int lastpos)
>> +{
>> +  int pos;
>> +  rtx mem1, mem2;
>> +
>> +  /* Do store fusion during sched2 only.  */
>> +  if (!reload_completed)
>> +    return cached_can_issue_more;
>> +
>> +  /* If the prior insn finished off a store fusion pair then simply
>> + reset the counter and return, nothing more to do.  */
> 
> 
> Good comments throughout, thanks!
> 
>> +  if (load_store_pendulum != 0)
>> +    {
>> +  load_store_pendulum = 0;
>> +  return cached_can_issue_more;
>> +    }
>> +
>> +  /* Try to pair certain store insns to adjacent memory locations
>> + so that the hardware will fuse them to a single operation.  */
>> +  if (is_fusable_store (last_scheduled_insn, &mem1))
>> +    {
>> +  /* A fusable store was just scheduled.  Scan the ready list for 
>> another
>> + store that it can fuse with.  */
>> +  pos = lastpos;
>> +  while (pos >= 0)
>> +    {
>> +  /* GPR stores can be ascending or descending offsets, FPR/VSR stores
> VSR?  I don't see how that applies here.

Scalar floating point store from VSX reg (i.e. stxsd).

>> + must be ascending only.  */
>> +  if (is_fusable_store (ready[pos], &mem2)
>> +  && ((INTEGRAL_MODE_P (GET_MODE (mem1))
>> +   && adjacent_mem_locations (mem1, mem2))
>> +  || (FLOAT_MODE_P (GET_MODE (mem1))
>> +   && (adjacent_mem_locations (mem1, mem2) == mem1))))
>> +    {
>> +  /* Found a fusable store.  Move it to the end of the ready list
>> + so it is scheduled next.  */
>> +  move_to_end_of_ready (ready, pos, lastpos);
>> +
>> +  load_store_pendulum = -1;
>> +  break;
>> +    }
>> +  pos--;
>> +    }
>> +    }
>> +
>> +  return cached_can_issue_more;
>> +}
>> +
>>   /* We are about to begin issuing insns for this clock cycle. */
>>
>>   static int
>> 

[PATCH, part2] PR fortran/98411 [10/11/12 Regression] Pointless: Array larger than ‘-fmax-stack-var-size=’, ...

2021-08-04 Thread Harald Anlauf via Gcc-patches
Dear all,

here's the second part that should fix this regression for good.
The patch also adjusts the warning message to make it easier to
understand, using the suggestion by Tobias (see PR).

Since F2018 in principle makes RECURSIVE the default, which might
conflict with the purpose of the testcase, I chose to change the
options to include -std=f2008, and to verify that implicit SAVE
works the same as explicit SAVE.

Regtested on x86_64-pc-linux-gnu.  OK for affected branches?

Thanks,
Harald


Fortran: fix pointless warning for static variables

gcc/fortran/ChangeLog:

PR fortran/98411
* trans-decl.c (gfc_finish_var_decl): Adjust check to handle
implicit SAVE as well as variables in the main program.  Improve
warning message text.

gcc/testsuite/ChangeLog:

PR fortran/98411
* gfortran.dg/pr98411.f90: Adjust testcase options to restrict to
F2008, and verify case of implicit SAVE.

diff --git a/gcc/fortran/trans-decl.c b/gcc/fortran/trans-decl.c
index 784f7b61ce1..bed61e2325d 100644
--- a/gcc/fortran/trans-decl.c
+++ b/gcc/fortran/trans-decl.c
@@ -743,8 +743,10 @@ gfc_finish_var_decl (tree decl, gfc_symbol * sym)

   /* Keep variables larger than max-stack-var-size off stack.  */
   if (!(sym->ns->proc_name && sym->ns->proc_name->attr.recursive)
+  && !(sym->ns->proc_name && sym->ns->proc_name->attr.is_main_program)
   && !sym->attr.automatic
   && sym->attr.save != SAVE_EXPLICIT
+  && sym->attr.save != SAVE_IMPLICIT
   && INTEGER_CST_P (DECL_SIZE_UNIT (decl))
   && !gfc_can_put_var_on_stack (DECL_SIZE_UNIT (decl))
 	 /* Put variable length auto array pointers always into stack.  */
@@ -757,13 +759,17 @@ gfc_finish_var_decl (tree decl, gfc_symbol * sym)
 {
   if (flag_max_stack_var_size > 0)
 	gfc_warning (OPT_Wsurprising,
-		 "Array %qs at %L is larger than limit set by"
-		 " %<-fmax-stack-var-size=%>, moved from stack to static"
-		 " storage. This makes the procedure unsafe when called"
-		 " recursively, or concurrently from multiple threads."
-		 " Consider using %<-frecursive%>, or increase the"
-		 " %<-fmax-stack-var-size=%> limit, or change the code to"
-		 " use an ALLOCATABLE array.",
+		 "Array %qs at %L is larger than limit set by "
+		 "%<-fmax-stack-var-size=%>, moved from stack to static "
+		 "storage. This makes the procedure unsafe when called "
+		 "recursively, or concurrently from multiple threads. "
+		 "Consider increasing the %<-fmax-stack-var-size=%> "
+		 "limit (or use %<-frecursive%>, which implies "
+		 "unlimited %<-fmax-stack-var-size%>) - or change the "
+		 "code to use an ALLOCATABLE array. If the variable is "
+		 "never accessed concurrently, this warning can be "
+		 "ignored, and the variable could also be declared with "
+		 "the SAVE attribute.",
		 sym->name, &sym->declared_at);

   TREE_STATIC (decl) = 1;
diff --git a/gcc/testsuite/gfortran.dg/pr98411.f90 b/gcc/testsuite/gfortran.dg/pr98411.f90
index 249afaea419..7c906a96f60 100644
--- a/gcc/testsuite/gfortran.dg/pr98411.f90
+++ b/gcc/testsuite/gfortran.dg/pr98411.f90
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-Wall -fautomatic -fmax-stack-var-size=100" }
+! { dg-options "-std=f2008 -Wall -fautomatic -fmax-stack-var-size=100" }
 ! PR fortran/98411 - Pointless warning for static variables

 module try
@@ -9,8 +9,10 @@ contains
   subroutine initmodule
 real, save :: b(1000)
 logical:: c(1000) ! { dg-warning "moved from stack to static storage" }
+integer:: e(1000) = 1
 a(1) = 42
 b(2) = 3.14
 c(3) = .true.
+e(5) = -1
   end subroutine initmodule
 end module try


[RFC, Fortran] Fix c_float128 and c_float128_complex on targets with 128-bit long double.

2021-08-04 Thread Sandra Loosemore
I was trying last week to run my not-yet-committed TS29113 testsuite on 
a powerpc64le-linux-gnu target and ran into some problems with the kind 
constants c_float128 and c_float128_complex from the ISO_C_BINDING 
module; per the gfortran manual they are supposed to represent the kind 
of the gcc extension type __float128 and the corresponding complex type. 
 They were being set to -4 (e.g., not supported) instead of 16, 
although this target does define __float128 and real(16) is accepted as 
a supported type by the Fortran front end.


Anyway, the root of the problem is that the definition of these 
constants only looked at gfc_float128_type_node, which only gets set if 
TFmode is not the same type as long_double_type_node.  I experimented 
with setting gfc_float128_type_node = long_double_type_node but that 
caused various Bad Things to happen elsewhere in code that expected them 
to be distinct types, so I ended up with this minimally intrusive patch 
that only tweaks the definitions of the c_float128 and 
c_float128_complex constants.


I'm not sure this is completely correct, though.  I see PowerPC
supports 2 different 128-bit encodings and it looks like TFmode/long 
double is mapped onto the one selected by the ABI and/or command-line 
options; that's the only one the Fortran front end knows about.  All of 
TFmode, IFmode, and KFmode would map onto kind 16 anyway (in spite of 
having different TYPE_PRECISION values) so Fortran wouldn't be able to 
distinguish them.  The thing that confuses me is how/when the rs6000 
backend defines __float128; it looks like the documentation in the GCC 
manual doesn't agree with the code, and I'm not sure what the intended 
behavior really is.  Is it possible that __float128 could end up defined 
but specifying a different type than TFmode, and if so is there a 
target-independent way to identify that situation?  Can the PowerPC 
experts help straighten me out?


-Sandra
commit 158c2f6b1a4134bbdbe59034d38ce12faa8167a8
Author: Sandra Loosemore 
Date:   Tue Aug 3 16:21:16 2021 -0700

Fix c_float128 and c_float128_complex on targets with 128-bit long double.

gfc_float128_type_node is only non-NULL on targets where float128 is
supported and is a distinct type from long double.  So, check
long_double_type_node as well when computing the value of the kind
constants c_float128 and c_float128_complex from the ISO_C_BINDING
intrinsic module.

2021-08-03  Sandra Loosemore  

gcc/fortran/
	* iso-c-binding.def (c_float128, c_float128_complex): Also
	check long_double_type_node.  Add comments to explain why.

diff --git a/gcc/fortran/iso-c-binding.def b/gcc/fortran/iso-c-binding.def
index 8bf69ef..a05e324 100644
--- a/gcc/fortran/iso-c-binding.def
+++ b/gcc/fortran/iso-c-binding.def
@@ -114,9 +114,25 @@ NAMED_REALCST (ISOCBINDING_DOUBLE, "c_double", \
get_real_kind_from_node (double_type_node), GFC_STD_F2003)
 NAMED_REALCST (ISOCBINDING_LONG_DOUBLE, "c_long_double", \
get_real_kind_from_node (long_double_type_node), GFC_STD_F2003)
+
+/* GNU Extension.  gfc_float128_type_node is only non-null if the target
+   supports a 128-bit type distinct from the long double type.  Otherwise
+   if long double has kind 16, that's also the float128 type and we can
+   use kind 16 for that too.
+
+   Specifically, on x86_64, long double is the 80-bit encoding with kind
+   10; it has a storage size of 128 bits due to alignment requirements,
+   but if a true 128-bit float is supported it will have kind 16 and
+   gfc_float128_type_node will point to it.  PowerPC has 3 different
+   128-bit encodings that are distinguished by having different
+   TYPE_PRECISION values (not necessarily 128).  They all map onto
+   Fortran kind 16, which corresponds to C long double.  The default
+   encoding is determined by the ABI.  */
 NAMED_REALCST (ISOCBINDING_FLOAT128, "c_float128", \
-	   gfc_float128_type_node == NULL_TREE \
-		  ? -4 : get_real_kind_from_node (gfc_float128_type_node), \
+	   (gfc_float128_type_node == NULL_TREE \
+		  ? (get_real_kind_from_node (long_double_type_node) == 16 \
+		 ? 16 : -4) \
+		  : get_real_kind_from_node (gfc_float128_type_node)), \
 	   GFC_STD_GNU)
 NAMED_CMPXCST (ISOCBINDING_FLOAT_COMPLEX, "c_float_complex", \
get_real_kind_from_node (float_type_node), GFC_STD_F2003)
@@ -124,9 +140,13 @@ NAMED_CMPXCST (ISOCBINDING_DOUBLE_COMPLEX, "c_double_complex", \
get_real_kind_from_node (double_type_node), GFC_STD_F2003)
 NAMED_CMPXCST (ISOCBINDING_LONG_DOUBLE_COMPLEX, "c_long_double_complex", \
get_real_kind_from_node (long_double_type_node), GFC_STD_F2003)
+
+/* GNU Extension.  Similar issues to c_float128 above.  */
 NAMED_CMPXCST (ISOCBINDING_FLOAT128_COMPLEX, "c_float128_complex", \
-	   gfc_float128_type_node == NULL_TREE \
-		  ? -4 : get_real_kind_from_node (gfc_float128_type_node), \
+	   (gfc_float128_type_node 

[PATCH v3] x86: Update STORE_MAX_PIECES

2021-08-04 Thread H.J. Lu via Gcc-patches
On Wed, Aug 4, 2021 at 11:46 AM Uros Bizjak  wrote:
>
> On Wed, Aug 4, 2021 at 3:34 PM H.J. Lu  wrote:
> >
> > On Tue, Aug 3, 2021 at 6:56 AM H.J. Lu  wrote:
> > >
> > > 1. Update x86 STORE_MAX_PIECES to use OImode and XImode only if inter-unit
> > > move is enabled since x86 uses vec_duplicate, which is enabled only when
> > > inter-unit move is enabled, to implement store_by_pieces.
> > > 2. Update op_by_pieces_d::op_by_pieces_d to set m_max_size to
> > > STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES for
> > > compare_by_pieces.
> > >
> > > gcc/
> > >
> > > PR target/101742
> > > * expr.c (op_by_pieces_d::op_by_pieces_d): Set m_max_size to
> > > STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES
> > > for compare_by_pieces.
> > > * config/i386/i386.h (STORE_MAX_PIECES): Use OImode and XImode
> > > only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/101742
> > > * gcc.target/i386/pr101742a.c: New test.
> > > * gcc.target/i386/pr101742b.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.h| 20 +++-
> > >  gcc/expr.c|  6 +-
> > >  gcc/testsuite/gcc.target/i386/pr101742a.c | 16 
> > >  gcc/testsuite/gcc.target/i386/pr101742b.c |  4 
> > >  4 files changed, 36 insertions(+), 10 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742a.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742b.c
> > >
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index bed9cd9da18..9b416abd5f4 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -1783,15 +1783,17 @@ typedef struct ix86_args {
> > >  /* STORE_MAX_PIECES is the number of bytes at a time that we can
> > > store efficiently.  */
> > >  #define STORE_MAX_PIECES \
> > > -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > -   ? 64 \
> > > -   : ((TARGET_AVX \
> > > -   && !TARGET_PREFER_AVX128 \
> > > -   && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> > > -  ? 32 \
> > > -  : ((TARGET_SSE2 \
> > > - && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > > -? 16 : UNITS_PER_WORD)))
> > > +  (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > > +   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > +  ? 64 \
> > > +  : ((TARGET_AVX \
> > > + && !TARGET_PREFER_AVX128 \
> > > + && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> > > + ? 32 \
> > > + : ((TARGET_SSE2 \
> > > + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > > + ? 16 : UNITS_PER_WORD))) \
> > > +   : UNITS_PER_WORD)
> > >
> > >  /* If a memory-to-memory move would take MOVE_RATIO or more simple
> > > move-instruction pairs, we will do a cpymem or libcall instead.
> >
> > expr.c has been fixed.   Here is the v2 patch for x86 backend.
> > OK for master?
>
> OK, but please add the comment about vec_duplicate before the define
> to explain the situation with TARGET_INTER_UNIT_MOVES_TO_VEC.

This is what I am checking in with

/* STORE_MAX_PIECES is the number of bytes at a time that we can store
   efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled
   since vec_duplicate enabled by inter-unit move is used to implement
   store_by_pieces of 16/32/64 bytes.  */
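
For reference, store_by_pieces comes into play for code like the following
hypothetical example (the committed tests are pr101742a.c and pr101742b.c);
with inter-unit moves enabled, a 32-byte clear like this can use vector stores
built via vec_duplicate:

/* Hypothetical sketch: a fixed-size 32-byte clear that store_by_pieces
   can expand with 16/32-byte vector stores when inter-unit moves are
   enabled.  */
void
clear32 (char *p)
{
  __builtin_memset (p, 0, 32);
}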

> Thanks,
> Uros.

Thanks.

-- 
H.J.
From 9487c165afb5b6083a3fc09a2e8b7bcabfe28765 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Tue, 3 Aug 2021 06:17:22 -0700
Subject: [PATCH v3] x86: Update STORE_MAX_PIECES

Update STORE_MAX_PIECES to allow 16/32/64 bytes only if inter-unit move
is enabled since vec_duplicate enabled by inter-unit move is used to
implement store_by_pieces of 16/32/64 bytes.

gcc/

	PR target/101742
	* config/i386/i386.h (STORE_MAX_PIECES): Allow 16/32/64 bytes
	only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.

gcc/testsuite/

	PR target/101742
	* gcc.target/i386/pr101742a.c: New test.
	* gcc.target/i386/pr101742b.c: Likewise.
---
 gcc/config/i386/i386.h| 26 +--
 gcc/testsuite/gcc.target/i386/pr101742a.c | 16 ++
 gcc/testsuite/gcc.target/i386/pr101742b.c |  4 
 3 files changed, 35 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101742a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101742b.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bed9cd9da18..21fe51bba40 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1780,18 +1780,22 @@ typedef struct ix86_args {
 	  && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
 	 ? 16 : UNITS_PER_WORD)))
 
-/* STORE_MAX_PIECES is the number of bytes at a time that we can
-   store efficiently.  */
+/* STORE_MAX_PIECES is the number of bytes at a time that we can store
+   efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled
+   since vec_duplicate 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely 
wrote:

> On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> >
> > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> :-)
>
> > The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> Wow, that is a convincing argument for making it not inline, thanks.
>
> > IIUC, we can use the following. Right?
> >
> > __attribute__((__noinline__))
>
> Right.
>
> > I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> This is a valid C++ program:
>
> #define noinline 1
> #include <memory>
> int main() { }
>
> But if anything in <memory> uses "noinline" then this valid program
> will not compile. Which is why we must use ((__noinline__)) instead of
> ((noinline)).
>
> Thanks. Now I get it.


>
>
> >
> > How about something like this comment?
> >
> > // Noinline to avoid code size increase.
>
> Great, thanks.
>
> On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > Actually I take back what I said. Sorry. I think the logic in your patch
> is correct. I missed some of the atomic decrements.
> > But I'd be concerned about performance. If we make _M_release_last_use
> noinline then we are adding overhead to the fast path of the original logic
> (where both counts are 1).
>
> Oh, I see. So the code duplication serves a purpose. We want the
> _M_release_last_use() code to be non-inline for the new logic, because
> in the common case we never call it (we either detect that both counts
> are 1 and do the dispose & destroy without atomic decrements, or we do
> a single decrement and don't dispose or destroy anything). But for the
> old logic, we do want that code inlined into _M_release (or
> _M_release_orig as it was called in your patch). Have I got that right
> now?
>
> Yes. Completely right.


> What if we remove the __noinline__ from _M_release_last_use() so that
> it can be inlined, but than add a noinline wrapper that can be called
> when we don't want to inline it?
>
> So:
>   // Called by _M_release() when the use count reaches zero.
>   void
>   _M_release_last_use() noexcept
>   {
> // unchanged from previous patch, but without the attribute.
> // ...
>   }
>
>   // As above, but 'noinline' to reduce code size on the cold path.
>   __attribute__((__noinline__))
>   void
>   _M_release_last_use_cold() noexcept
>   { _M_release_last_use(); }
>
>
> And then:
>
>   template<>
> inline void
> _Sp_counted_base<_S_atomic>::_M_release() noexcept
> {
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> #if ! _GLIBCXX_TSAN
>   constexpr bool __lock_free
> = __atomic_always_lock_free(sizeof(long long), 0)
> && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>   constexpr bool __double_word
> = sizeof(long long) == 2 * sizeof(_Atomic_word);
>   // The ref-count members follow the vptr, so are aligned to
>   // alignof(void*).
>   constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> {
>   constexpr long long __unique_ref
> = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>   auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
>
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>   if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) ==
> __unique_ref)
> {
>   // Both counts are 1, so there are no weak references and
>   // we are releasing the last strong reference. No other
>   // threads can observe the effects of this _M_release()
>   // call (e.g. calling use_count()) without a data race.
>   *(long long*)(&_M_use_count) = 0;
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>   _M_dispose();
>   _M_destroy();
>   return;
> }
>   if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) ==
> 1)
> {
>   _M_release_last_use_cold();
>   return;
> }
> }
>   else
> #endif
>   if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> {
>   _M_release_last_use();
> }
> }
>
>
> So we use the noinline version for the else branch in the new 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
>
> Sorry. I totally missed the rest of your message and the patch. My fuzzy 
> eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" 
> on a line by itself for "Sincerely" :-)

:-)

> The noinlining was based on looking at generated code. That was for clang. It 
> was inlining the _M_last_use function for every instance of _M_release (e.g., 
> ~shared_ptr). This optimization with the noinline for _M_release_last_use 
> ended up reducing massive binary text sizes by 1.5% (several megabytes)., 
> which was an extra benefit. Without the noinline we saw code size increase.

Wow, that is a convincing argument for making it not inline, thanks.

> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))

Right.

> I didn't understand the part about programmers doing #define noinline 1. I 
> don't see code in the patch that uses noinline.

This is a valid C++ program:

#define noinline 1
#include <memory>
int main() { }

But if anything in <memory> uses "noinline" then this valid program
will not compile. Which is why we must use ((__noinline__)) instead of
((noinline)).



>
> How about something like this comment?
>
> // Noinline to avoid code size increase.

Great, thanks.

On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> Actually I take back what I said. Sorry. I think the logic in your patch is 
> correct. I missed some of the atomic decrements.
> But I'd be concerned about performance. If we make _M_release_last_use 
> noinline then we are adding overhead to the fast path of the original logic 
> (where both counts are 1).

Oh, I see. So the code duplication serves a purpose. We want the
_M_release_last_use() code to be non-inline for the new logic, because
in the common case we never call it (we either detect that both counts
are 1 and do the dispose & destroy without atomic decrements, or we do
a single decrement and don't dispose or destroy anything). But for the
old logic, we do want that code inlined into _M_release (or
_M_release_orig as it was called in your patch). Have I got that right
now?

What if we remove the __noinline__ from _M_release_last_use() so that
it can be inlined, but then add a noinline wrapper that can be called
when we don't want to inline it?

So:
  // Called by _M_release() when the use count reaches zero.
  void
  _M_release_last_use() noexcept
  {
// unchanged from previous patch, but without the attribute.
// ...
  }

  // As above, but 'noinline' to reduce code size on the cold path.
  __attribute__((__noinline__))
  void
  _M_release_last_use_cold() noexcept
  { _M_release_last_use(); }


And then:

  template<>
inline void
_Sp_counted_base<_S_atomic>::_M_release() noexcept
{
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
  constexpr bool __lock_free
= __atomic_always_lock_free(sizeof(long long), 0)
&& __atomic_always_lock_free(sizeof(_Atomic_word), 0);
  constexpr bool __double_word
= sizeof(long long) == 2 * sizeof(_Atomic_word);
  // The ref-count members follow the vptr, so are aligned to
  // alignof(void*).
  constexpr bool __aligned = __alignof(long long) <= alignof(void*);
  if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
{
  constexpr long long __unique_ref
= 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
  auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
  if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
{
  // Both counts are 1, so there are no weak references and
  // we are releasing the last strong reference. No other
  // threads can observe the effects of this _M_release()
  // call (e.g. calling use_count()) without a data race.
  *(long long*)(&_M_use_count) = 0;
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
  _M_dispose();
  _M_destroy();
  return;
}
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
  _M_release_last_use_cold();
  return;
}
}
  else
#endif
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
  _M_release_last_use();
}
}


So we use the noinline version for the else branch in the new logic,
but the can-inline version for the old logic. Would that work?

We could also consider adding __attribute__((__cold__)) to the
_M_release_last_use_cold() function, and/or add [[__unlikely__]] to
the 'if' that calls it, but maybe that's overkill.
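
For reference, a rough standalone sketch (simplified, made-up names; not
part of the proposed patch) of what those two extra hints could look like.
The [[unlikely]] spelling needs -std=c++20, and libstdc++ itself would
write it as [[__unlikely__]]:

__attribute__((__noinline__, __cold__))
static void
release_last_use_cold() noexcept
{
  // stand-in for the _M_dispose()/_M_destroy() work on the cold path
}

void
release(int& use_count) noexcept
{
  if (--use_count == 0) [[unlikely]]
    {
      release_last_use_cold();
    }
}

int main()
{
  int n = 2;
  release(n);   // hot path: just the decrement
  release(n);   // cold path: last use
  return n;
}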

It seems like this will slightly pessimize the case where the last use

Re: [PATCH v2] x86: Update STORE_MAX_PIECES

2021-08-04 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 4, 2021 at 3:34 PM H.J. Lu  wrote:
>
> On Tue, Aug 3, 2021 at 6:56 AM H.J. Lu  wrote:
> >
> > 1. Update x86 STORE_MAX_PIECES to use OImode and XImode only if inter-unit
> > move is enabled since x86 uses vec_duplicate, which is enabled only when
> > inter-unit move is enabled, to implement store_by_pieces.
> > 2. Update op_by_pieces_d::op_by_pieces_d to set m_max_size to
> > STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES for
> > compare_by_pieces.
> >
> > gcc/
> >
> > PR target/101742
> > * expr.c (op_by_pieces_d::op_by_pieces_d): Set m_max_size to
> > STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES
> > for compare_by_pieces.
> > * config/i386/i386.h (STORE_MAX_PIECES): Use OImode and XImode
> > only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.
> >
> > gcc/testsuite/
> >
> > PR target/101742
> > * gcc.target/i386/pr101742a.c: New test.
> > * gcc.target/i386/pr101742b.c: Likewise.
> > ---
> >  gcc/config/i386/i386.h| 20 +++-
> >  gcc/expr.c|  6 +-
> >  gcc/testsuite/gcc.target/i386/pr101742a.c | 16 
> >  gcc/testsuite/gcc.target/i386/pr101742b.c |  4 
> >  4 files changed, 36 insertions(+), 10 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742b.c
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index bed9cd9da18..9b416abd5f4 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -1783,15 +1783,17 @@ typedef struct ix86_args {
> >  /* STORE_MAX_PIECES is the number of bytes at a time that we can
> > store efficiently.  */
> >  #define STORE_MAX_PIECES \
> > -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > -   ? 64 \
> > -   : ((TARGET_AVX \
> > -   && !TARGET_PREFER_AVX128 \
> > -   && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> > -  ? 32 \
> > -  : ((TARGET_SSE2 \
> > - && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > -? 16 : UNITS_PER_WORD)))
> > +  (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > +   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > +  ? 64 \
> > +  : ((TARGET_AVX \
> > + && !TARGET_PREFER_AVX128 \
> > + && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> > + ? 32 \
> > + : ((TARGET_SSE2 \
> > + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > + ? 16 : UNITS_PER_WORD))) \
> > +   : UNITS_PER_WORD)
> >
> >  /* If a memory-to-memory move would take MOVE_RATIO or more simple
> > move-instruction pairs, we will do a cpymem or libcall instead.
>
> expr.c has been fixed.   Here is the v2 patch for x86 backend.
> OK for master?

OK, but please add the comment about vec_duplicate before the define
to explain the situation with TARGET_INTER_UNIT_MOVES_TO_VEC.

Thanks,
Uros.


Re: [PATCH] x86: Avoid stack realignment when copying data with SSE register

2021-08-04 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 4, 2021 at 3:20 PM H.J. Lu  wrote:
>
> To avoid stack realignment, call ix86_gen_scratch_sse_rtx to get a
> scratch SSE register to copy data with an SSE register from one
> memory location to another.
>
> gcc/
>
> PR target/101772
> * config/i386/i386-expand.c (ix86_expand_vector_move): Call
> ix86_gen_scratch_sse_rtx to get a scratch SSE register to copy
> data with SSE register from one memory location to another.
>
> gcc/testsuite/
>
> PR target/101772
> * gcc.target/i386/eh_return-2.c: New test.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.c   |  6 +-
>  gcc/testsuite/gcc.target/i386/eh_return-2.c | 16 
>  2 files changed, 21 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/eh_return-2.c
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 1d469bf7221..bd21efa9530 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -613,7 +613,11 @@ ix86_expand_vector_move (machine_mode mode, rtx 
> operands[])
>  arguments in memory.  */
>if (!register_operand (op0, mode)
>   && !register_operand (op1, mode))
> -   op1 = force_reg (mode, op1);
> +   {
> + rtx scratch = ix86_gen_scratch_sse_rtx (mode);
> + emit_move_insn (scratch, op1);
> + op1 = scratch;
> +   }
>
>tmp[0] = op0; tmp[1] = op1;
>ix86_expand_vector_move_misalign (mode, tmp);
> diff --git a/gcc/testsuite/gcc.target/i386/eh_return-2.c 
> b/gcc/testsuite/gcc.target/i386/eh_return-2.c
> new file mode 100644
> index 000..f23f4492dac
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/eh_return-2.c
> @@ -0,0 +1,16 @@
> +/* PR target/101772  */
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O0 -march=x86-64 -mstackrealign" } */
> +
> +struct _Unwind_Context _Unwind_Resume_or_Rethrow_this_context;
> +
> +void offset (int);
> +
> +struct _Unwind_Context {
> +  void *reg[7];
> +} _Unwind_Resume_or_Rethrow() {
> +  struct _Unwind_Context cur_contextcur_context =
> +  _Unwind_Resume_or_Rethrow_this_context;
> +  offset(0);
> +  __builtin_eh_return ((long) offset, 0);
> +}
> --
> 2.31.1
>


Re: [PATCH 1/7] fortran: new abstract class gfc_dummy_arg

2021-08-04 Thread Mikael Morin

On 04/08/2021 at 09:05, Thomas Koenig wrote:


So far, we have refrained from adding too many explicit C++-isms into
the code, and if we do, my participation at least will have to be
reduced sharply (I don't speak much C++, and I don't intend to learn).

So, is this a path we want to go down?

I’m not a C++ fanboy, but I think that avoiding it at any price would be 
a mistake.
Even Fortran has support for type-bound procedures.  It’s not an obscure 
feature.
Of course my (lack of) recent activity makes my voice very weak for any 
decision regarding the future of the project.


Now regarding these patches, I can propose dropping patches 1-5 
completely.  I don’t want to rewrite it with unions and the like.
Patch 7 would need some adjustments, but I promised to do it for 
backport anyway.

Does that work?
Mikael


[PATCH] Factor out `find_a_program` helper around `find_a_file`

2021-08-04 Thread John Ericson
The helper is for `--print-prog-name` and similar things. Since all
executable finding goes through it, we can move the default overrides
into that path too. This also ensures that if someone is looking for a
*non*-program called `as`, `ld`, etc., weird things don't happen.
---
 gcc/gcc.c | 59 ---
 1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/gcc/gcc.c b/gcc/gcc.c
index 3e98bc7973e..1a74bf92f7a 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -367,6 +367,7 @@ static void putenv_from_prefixes (const struct path_prefix 
*, const char *,
  bool);
 static int access_check (const char *, int);
 static char *find_a_file (const struct path_prefix *, const char *, int, bool);
+static char *find_a_program (const char *);
 static void add_prefix (struct path_prefix *, const char *, const char *,
int, int, int);
 static void add_sysrooted_prefix (struct path_prefix *, const char *,
@@ -3052,22 +3053,7 @@ find_a_file (const struct path_prefix *pprefix, const 
char *name, int mode,
 {
   struct file_at_path_info info;
 
-#ifdef DEFAULT_ASSEMBLER
-  if (! strcmp (name, "as") && access (DEFAULT_ASSEMBLER, mode) == 0)
-return xstrdup (DEFAULT_ASSEMBLER);
-#endif
-
-#ifdef DEFAULT_LINKER
-  if (! strcmp (name, "ld") && access (DEFAULT_LINKER, mode) == 0)
-return xstrdup (DEFAULT_LINKER);
-#endif
-
-#ifdef DEFAULT_DSYMUTIL
-  if (! strcmp (name, "dsymutil") && access (DEFAULT_DSYMUTIL, mode) == 0)
-return xstrdup (DEFAULT_DSYMUTIL);
-#endif
-
-  /* Determine the filename to execute (special case for absolute paths).  */
+  /* Find the filename in question (special case for absolute paths).  */
 
   if (IS_ABSOLUTE_PATH (name))
 {
@@ -3088,6 +3074,32 @@ find_a_file (const struct path_prefix *pprefix, const 
char *name, int mode,
file_at_path, );
 }
 
+/* Specialization of find_a_file for programs that also takes into account
+   configure-specified default programs. */
+
+static char*
+find_a_program (const char *name)
+{
+  /* Do not search if default matches query. */
+
+#ifdef DEFAULT_ASSEMBLER
+  if (! strcmp (name, "as") && access (DEFAULT_ASSEMBLER, mode) == 0)
+return xstrdup (DEFAULT_ASSEMBLER);
+#endif
+
+#ifdef DEFAULT_LINKER
+  if (! strcmp (name, "ld") && access (DEFAULT_LINKER, mode) == 0)
+return xstrdup (DEFAULT_LINKER);
+#endif
+
+#ifdef DEFAULT_DSYMUTIL
+  if (! strcmp (name, "dsymutil") && access (DEFAULT_DSYMUTIL, mode) == 0)
+return xstrdup (DEFAULT_DSYMUTIL);
+#endif
+
+  return find_a_file (_prefixes, name, X_OK, false);
+}
+
 /* Ranking of prefixes in the sort list. -B prefixes are put before
all others.  */
 
@@ -3243,8 +3255,7 @@ execute (void)
 
   if (wrapper_string)
 {
-  string = find_a_file (_prefixes,
-   argbuf[0], X_OK, false);
+  string = find_a_program (argbuf[0]);
   if (string)
argbuf[0] = string;
   insert_wrapper (wrapper_string);
@@ -3269,7 +3280,7 @@ execute (void)
 
   if (!wrapper_string)
 {
-  string = find_a_file (_prefixes, commands[0].prog, X_OK, false);
+  string = find_a_program(commands[0].prog);
   if (string)
commands[0].argv[0] = string;
 }
@@ -3284,8 +3295,7 @@ execute (void)
commands[n_commands].prog = argbuf[i + 1];
commands[n_commands].argv
  = &(argbuf.address ())[i + 1];
-   string = find_a_file (_prefixes, commands[n_commands].prog,
- X_OK, false);
+   string = find_a_program(commands[n_commands].prog);
if (string)
  commands[n_commands].argv[0] = string;
n_commands++;
@@ -8556,8 +8566,7 @@ driver::maybe_putenv_COLLECT_LTO_WRAPPER () const
   if (have_c)
 lto_wrapper_file = NULL;
   else
-lto_wrapper_file = find_a_file (_prefixes, "lto-wrapper",
-   X_OK, false);
+lto_wrapper_file = find_a_program ("lto-wrapper");
   if (lto_wrapper_file)
 {
   lto_wrapper_file = convert_white_space (lto_wrapper_file);
@@ -8671,7 +8680,7 @@ driver::maybe_print_and_exit () const
 #endif
  print_prog_name = concat (print_prog_name, use_ld, NULL);
}
-  char *newname = find_a_file (_prefixes, print_prog_name, X_OK, 0);
+  char *newname = find_a_program (print_prog_name);
   printf ("%s\n", (newname ? newname : print_prog_name));
   return (0);
 }
@@ -9070,7 +9079,7 @@ driver::maybe_run_linker (const char *argv0) const
  /* We'll use ld if we can't find collect2.  */
  if (! strcmp (linker_name_spec, "collect2"))
{
- char *s = find_a_file (_prefixes, "collect2", X_OK, false);
+ char *s = find_a_program ("collect2");
  if (s == NULL)
set_static_spec_shared (_name_spec, "ld");
}
-- 
2.31.1



Re: [PATCH 2/2] Ada: Remove debug line number for DECL_IGNORED_P functions

2021-08-04 Thread Bernd Edlinger
On 8/4/21 4:33 PM, Eric Botcazou wrote:
>> The location of these ignored Ada decls looks completely sane to me.
>> However, it was an unintentional side effect of the patch to allow
>> minimal debugging of ignored decls.  This means we can now step into
>> those functions or set line breakpoints there, while previously that
>> was not possible.  And I guess it could be considered an improvement.
>>
>> So it's your choice, how you want these functions to be debugged.
> 
> The requirement on the GDB side is that these functions *cannot* be stepped 
> into, i.e. that they be completely transparent for the GDB user.  But we 
> still 
> want to have location information in the compiler itself to debug it.
> 

Well, I see.

But it is okay that we can set a breakpoint on defs__struct1IP
in the test case of PR 101598, and that the debugger will only show
assembler there.  Right?

Do you have an example where this location information is used in the
compiler itself for debugging?

Of course we could do something like

diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index b91a9b5..c0ff4c6 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -28546,6 +28546,9 @@ dwarf2out_set_ignored_loc (unsigned int line, unsigned i
 {
   dw_fde_ref fde = cfun->fde;
 
+  if (is_ada ())
+return;
+
   fde->ignored_debug = false;
   set_cur_line_info_table (function_section (fde->decl));
 

But it would regress the attached test case (the Ada-equivalent
of PR 97937):

$ gnatmake -O2 main.adb -g -save-temps -f
produces line info for Test2:

test__test2:
.LFB8:
.cfi_startproc
.loc 1 8 4 view .LVU3
movl%edi, %eax
ret
.cfi_endproc

while with the above patch we would get something like

test__test2:
.LFB8:
.cfi_startproc
movl%edi, %eax
ret
.cfi_endproc

and, indeed it is impossible to step into test2 or get the source
line if we insert a breakpoint at the label test__test2.

I assume you would agree that having the location for Test2 is better
than no debug info at all?

So maybe something like the following might work for you?

diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index b91a9b5..c0ff4c6 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -28546,6 +28546,9 @@ dwarf2out_set_ignored_loc (unsigned int line, unsigned i
 {
   dw_fde_ref fde = cfun->fde;
 
+  if (is_ada () && DECL_ARTIFICIAL (cfun->decl))
+return;
+
   fde->ignored_debug = false;
   set_cur_line_info_table (function_section (fde->decl));
 

This would remove the location info in the test case of PR 101598,
and still have location info in the Ada variant of PR 97937.


What do you think?


Thanks
Bernd.
package test is

   type Func_Ptr is access function (X : Integer) return Integer;

   function Test1 (X : Integer) return Integer;
   function Test2 (X : Integer) return Integer;
   function DoIt (X : Integer; Func : Func_Ptr) return Integer;

end test;
package body test is

   function Test1 (X : Integer) return Integer is
   begin
  return X;
   end Test1;

   function Test2 (X : Integer) return Integer is
   begin
  return X;
   end Test2;

   function DoIt (X : Integer; Func : Func_Ptr) return Integer is
   begin
  return Func (X);
   end DoIt;

end test;
with Ada.Text_IO; use Ada.Text_IO;
with test;

procedure Main is

   -- Declare a pointer type, pointing to a function that takes
   -- two Integer variables as input and returns a Integer


   X : Integer := 7;
   Y : Integer := test.DoIt (X, test.Test1'Access);
   Z : Integer := test.DoIt (X, test.Test2'Access);

begin
   Put_Line (X'Img & " " & Y'Img & " " & Z'Img);
end Main;



[PATCH 5/7] bpf: BPF CO-RE support

2021-08-04 Thread David Faust via Gcc-patches
This commit introduces support for BPF Compile Once - Run
Everywhere (CO-RE) in GCC.

gcc/ChangeLog:

* config/bpf/bpf.c: Adjust includes.
(bpf_handle_preserve_access_index_attribute): New function.
(bpf_attribute_table): Use it here.
(bpf_builtins): Add BPF_BUILTIN_PRESERVE_ACCESS_INDEX.
(bpf_option_override): Handle "-mcore" option.
(bpf_asm_init_sections): New.
(TARGET_ASM_INIT_SECTIONS): Redefine.
(bpf_file_end): New.
(TARGET_ASM_FILE_END): Redefine.
(bpf_init_builtins): Add "__builtin_preserve_access_index".
(bpf_core_compute, bpf_core_get_index): New.
(is_attr_preserve_access): New.
(bpf_expand_builtin): Handle new builtins.
(bpf_core_newdecl, bpf_core_is_maybe_aggregate_access): New.
(bpf_core_walk): New.
(bpf_resolve_overloaded_builtin): New.
(TARGET_RESOLVE_OVERLOADED_BUILTIN): Redefine.
(handle_attr): New.
(pass_bpf_core_attr): New RTL pass.
* config/bpf/bpf-passes.def: New file.
* config/bpf/bpf-protos.h (make_pass_bpf_core_attr): New.
* config/bpf/coreout.c: New file.
* config/bpf/coreout.h: Likewise.
* config/bpf/t-bpf (TM_H): Add $(srcdir)/config/bpf/coreout.h.
(coreout.o): New rule.
(PASSES_EXTRA): Add $(srcdir)/config/bpf/bpf-passes.def.
* config.gcc (bpf): Add coreout.h to extra_headers.
Add coreout.o to extra_objs.
Add $(srcdir)/config/bpf/coreout.c to target_gtfiles.
---
 gcc/config.gcc|   3 +
 gcc/config/bpf/bpf-passes.def |  20 ++
 gcc/config/bpf/bpf-protos.h   |   2 +
 gcc/config/bpf/bpf.c  | 579 ++
 gcc/config/bpf/coreout.c  | 356 +
 gcc/config/bpf/coreout.h  | 114 +++
 gcc/config/bpf/t-bpf  |   8 +
 7 files changed, 1082 insertions(+)
 create mode 100644 gcc/config/bpf/bpf-passes.def
 create mode 100644 gcc/config/bpf/coreout.c
 create mode 100644 gcc/config/bpf/coreout.h

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 93e2b3219b9..6c790ce1b35 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -1515,6 +1515,9 @@ bpf-*-*)
 use_collect2=no
 extra_headers="bpf-helpers.h"
 use_gcc_stdint=provide
+extra_headers="coreout.h"
+extra_objs="coreout.o"
+target_gtfiles="$target_gtfiles \$(srcdir)/config/bpf/coreout.c"
 ;;
 cr16-*-elf)
 tm_file="elfos.h ${tm_file} newlib-stdint.h"
diff --git a/gcc/config/bpf/bpf-passes.def b/gcc/config/bpf/bpf-passes.def
new file mode 100644
index 000..3e961659411
--- /dev/null
+++ b/gcc/config/bpf/bpf-passes.def
@@ -0,0 +1,20 @@
+/* Declaration of target-specific passes for eBPF.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   .  */
+
+INSERT_PASS_AFTER (pass_df_initialize_opt, 1, pass_bpf_core_attr);
diff --git a/gcc/config/bpf/bpf-protos.h b/gcc/config/bpf/bpf-protos.h
index aeb512665ed..7ce3386ffda 100644
--- a/gcc/config/bpf/bpf-protos.h
+++ b/gcc/config/bpf/bpf-protos.h
@@ -30,4 +30,6 @@ extern void bpf_print_operand_address (FILE *, rtx);
 extern void bpf_expand_prologue (void);
 extern void bpf_expand_epilogue (void);
 
+rtl_opt_pass * make_pass_bpf_core_attr (gcc::context *);
+
 #endif /* ! GCC_BPF_PROTOS_H */
diff --git a/gcc/config/bpf/bpf.c b/gcc/config/bpf/bpf.c
index 85f6b76a11f..5edc8cc715a 100644
--- a/gcc/config/bpf/bpf.c
+++ b/gcc/config/bpf/bpf.c
@@ -54,6 +54,25 @@ along with GCC; see the file COPYING3.  If not see
 #include "builtins.h"
 #include "predict.h"
 #include "langhooks.h"
+#include "flags.h"
+
+#include "cfg.h" /* needed for struct control_flow_graph used in BB macros */
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "gimple-walk.h"
+#include "tree-pass.h"
+#include "tree-iterator.h"
+
+#include "context.h"
+#include "pass_manager.h"
+
+#include "gimplify.h"
+#include "gimplify-me.h"
+
+#include "ctfc.h"
+#include "btf.h"
+
+#include "coreout.h"
 
 /* Per-function machine data.  */
 struct GTY(()) machine_function
@@ -104,6 +123,27 @@ bpf_handle_fndecl_attribute (tree *node, tree name,
   return NULL_TREE;
 }
 
+/* Handle preserve_access_index attribute, which can be applied to structs,
+   unions and classes. Actually adding the attribute to the TYPE_DECL 

[PATCH 6/7] bpf testsuite: Add BPF CO-RE tests

2021-08-04 Thread David Faust via Gcc-patches
This commit adds several tests for the new BPF CO-RE functionality to
the BPF target testsuite.

gcc/testsuite/ChangeLog:

* gcc.target/bpf/core-attr-1.c: New test.
* gcc.target/bpf/core-attr-2.c: Likewise.
* gcc.target/bpf/core-attr-3.c: Likewise.
* gcc.target/bpf/core-attr-4.c: Likewise
* gcc.target/bpf/core-builtin-1.c: Likewise
* gcc.target/bpf/core-builtin-2.c: Likewise.
* gcc.target/bpf/core-builtin-3.c: Likewise.
* gcc.target/bpf/core-section-1.c: Likewise.
---
 gcc/testsuite/gcc.target/bpf/core-attr-1.c| 23 +++
 gcc/testsuite/gcc.target/bpf/core-attr-2.c| 21 ++
 gcc/testsuite/gcc.target/bpf/core-attr-3.c| 41 
 gcc/testsuite/gcc.target/bpf/core-attr-4.c| 35 ++
 gcc/testsuite/gcc.target/bpf/core-builtin-1.c | 64 +++
 gcc/testsuite/gcc.target/bpf/core-builtin-2.c | 26 
 gcc/testsuite/gcc.target/bpf/core-builtin-3.c | 26 
 gcc/testsuite/gcc.target/bpf/core-section-1.c | 38 +++
 8 files changed, 274 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-attr-1.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-attr-2.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-attr-3.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-attr-4.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-builtin-1.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-builtin-2.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-builtin-3.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/core-section-1.c

diff --git a/gcc/testsuite/gcc.target/bpf/core-attr-1.c 
b/gcc/testsuite/gcc.target/bpf/core-attr-1.c
new file mode 100644
index 000..7f0d0e50dd6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/core-attr-1.c
@@ -0,0 +1,23 @@
+/* Basic test for struct __attribute__((preserve_access_index))
+   for BPF CO-RE support.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O0 -dA -gbtf -mcore" } */
+
+struct S {
+  int a;
+  int b;
+  int c;
+} __attribute__((preserve_access_index));
+
+void
+func (struct S * s)
+{
+  /* 0:2 */
+  int *x = &(s->c);
+
+  *x = 4;
+}
+
+/* { dg-final { scan-assembler-times "ascii \"0:2.0\"\[\t 
\]+\[^\n\]*btf_aux_string" 1 } } */
+/* { dg-final { scan-assembler-times "bpfcr_type" 1 } } */
diff --git a/gcc/testsuite/gcc.target/bpf/core-attr-2.c 
b/gcc/testsuite/gcc.target/bpf/core-attr-2.c
new file mode 100644
index 000..508e1e4c4b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/core-attr-2.c
@@ -0,0 +1,21 @@
+/* Basic test for union __attribute__((preserve_access_index))
+   for BPF CO-RE support.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O0 -dA -gbtf -mcore" } */
+
+union U {
+  int a;
+  char c;
+} __attribute__((preserve_access_index));
+
+void
+func (union U *u)
+{
+  /* 0:1 */
+  char *c = &(u->c);
+  *c = 'c';
+}
+
+/* { dg-final { scan-assembler-times "ascii \"0:1.0\"\[\t 
\]+\[^\n\]*btf_aux_string" 1 } } */
+/* { dg-final { scan-assembler-times "bpfcr_type" 1 } } */
diff --git a/gcc/testsuite/gcc.target/bpf/core-attr-3.c 
b/gcc/testsuite/gcc.target/bpf/core-attr-3.c
new file mode 100644
index 000..1813fd07a2f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/core-attr-3.c
@@ -0,0 +1,41 @@
+/* Test for __attribute__((preserve_access_index)) for BPF CO-RE support
+   for nested structure.
+
+   Note that even though struct O lacks the attribute, when accessed as a
+   member of another attributed type, CO-RE relocations should still be
+   generated.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O0 -dA -gbtf -mcore" } */
+
+struct O {
+  int e;
+  int f;
+};
+
+struct S {
+  int a;
+  struct {
+int b;
+int c;
+  } inner;
+  struct O other;
+} __attribute__((preserve_access_index));
+
+void
+func (struct S *foo)
+{
+  /* 0:1:1 */
+  int *x = &(foo->inner.c);
+
+  /* 0:2:0 */
+  int *y = &(foo->other.e);
+
+  *x = 4;
+  *y = 5;
+}
+
+/* { dg-final { scan-assembler-times "ascii \"0:1:1.0\"\[\t 
\]+\[^\n\]*btf_aux_string" 1 } } */
+/* { dg-final { scan-assembler-times "ascii \"0:2:0.0\"\[\t 
\]+\[^\n\]*btf_aux_string" 1 } } */
+
+/* { dg-final { scan-assembler-times "bpfcr_type" 2 } } */
diff --git a/gcc/testsuite/gcc.target/bpf/core-attr-4.c 
b/gcc/testsuite/gcc.target/bpf/core-attr-4.c
new file mode 100644
index 000..30d859a1c57
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/core-attr-4.c
@@ -0,0 +1,35 @@
+/* Test for BPF CO-RE __attribute__((preserve_access_index)) with accesses on
+   LHS and both LHS and RHS of assignment.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O0 -dA -gbtf -mcore" } */
+
+struct T {
+  int a;
+  int b;
+  struct U {
+int c;
+struct V {
+  int d;
+  int e[4];
+  int f;
+} v;
+  } u;
+} __attribute__((preserve_access_index));
+
+
+void
+func (struct T *t)
+{
+  /* 0:2:1:1:3 */
+  t->u.v.e[3] = 0xa1;
+
+  /* 0:2:0, 0:0, 0:1 */
+  t->u.c = t->a + t->b;
+}
+
+/* { dg-final { scan-assembler-times "ascii 

[PATCH 7/7] doc: BPF CO-RE documentation

2021-08-04 Thread David Faust via Gcc-patches
Document the new command line options (-mcore and -mno-core), the new
BPF target builtin (__builtin_preserve_access_index), and the new BPF
target attribute (preserve_access_index) introduced with BPF CO-RE.

gcc/ChangeLog:

* doc/extend.texi (BPF Type Attributes): New node.
Document new preserve_access_index attribute.
Document new __builtin_preserve_access_index builtin.
* doc/invoke.texi (eBPF Options): Document new -mcore and
-mno-core options.
---
 gcc/doc/extend.texi | 16 
 gcc/doc/invoke.texi | 13 -
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index b83cd4919bb..bb5fc921907 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8194,6 +8194,7 @@ attributes.
 * Common Type Attributes::
 * ARC Type Attributes::
 * ARM Type Attributes::
+* BPF Type Attributes::
 * MeP Type Attributes::
 * PowerPC Type Attributes::
 * x86 Type Attributes::
@@ -8757,6 +8758,17 @@ virtual table for @code{C} is not exported.  (You can use
 @code{__attribute__} instead of @code{__declspec} if you prefer, but
 most Symbian OS code uses @code{__declspec}.)
 
+@node BPF Type Attributes
+@subsection BPF Type Attributes
+
+@cindex @code{preserve_access_index} type attribute, BPF
+BPF Compile Once - Run Everywhere (CO-RE) support. When attached to a
+@code{struct} or @code{union} type definition, indicates that CO-RE
+relocation information should be generated for any access to a variable
+of that type. The behavior is equivalent to the programmer manually
+wrapping every such access with @code{__builtin_preserve_access_index}.
+
+
 @node MeP Type Attributes
 @subsection MeP Type Attributes
 
@@ -15388,6 +15400,10 @@ Load 16-bits from the @code{struct sk_buff} packet 
data pointed by the register
 Load 32-bits from the @code{struct sk_buff} packet data pointed by the 
register @code{%r6} and return it.
 @end deftypefn
 
+@deftypefn {Built-in Function} void * __builtin_preserve_access_index 
(@var{expr})
+BPF Compile Once-Run Everywhere (CO-RE) support. Instruct GCC to generate 
CO-RE relocation records for any accesses to aggregate data structures (struct, 
union, array types) in @var{expr}. This builtin is otherwise transparent, the 
return value is whatever @var{expr} evaluates to. It is also overloaded: 
@var{expr} may be of any type (not necessarily a pointer), the return type is 
the same. Has no effect if @code{-mcore} is not in effect (either specified or 
implied).
+@end deftypefn
+
 @node FR-V Built-in Functions
 @subsection FR-V Built-in Functions
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 32697e6117c..915bbc4ee65 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -903,7 +903,7 @@ Objective-C and Objective-C++ Dialects}.
 
 @emph{eBPF Options}
 @gccoptlist{-mbig-endian -mlittle-endian -mkernel=@var{version}
--mframe-limit=@var{bytes} -mxbpf}
+-mframe-limit=@var{bytes} -mxbpf -mcore -mno-core}
 
 @emph{FR30 Options}
 @gccoptlist{-msmall-model  -mno-lsim}
@@ -22520,6 +22520,17 @@ Generate code for a big-endian target.
 @opindex mlittle-endian
 Generate code for a little-endian target.  This is the default.
 
+@item -mcore
+@opindex mcore
+Enable BPF Compile Once - Run Everywhere (CO-RE) support. Requires and
+is implied by @option{-gbtf}.
+
+@item -mno-core
+@opindex mno-core
+Disable BPF Compile Once - Run Everywhere (CO-RE) support. BPF CO-RE
+support is enabled by default when generating BTF debug information for
+the BPF target.
+
 @item -mxbpf
 Generate code for an expanded version of BPF, which relaxes some of
 the restrictions imposed by the BPF architecture:
-- 
2.32.0



[PATCH 2/7] ctfc: externalize ctf_dtd_lookup

2021-08-04 Thread David Faust via Gcc-patches
Expose the function ctf_dtd_lookup, so that it can be used by the BPF
CO-RE machinery. The function is no longer static, and an extern
prototype is added in ctfc.h.

gcc/ChangeLog:

* ctfc.c (ctf_dtd_lookup): Function is no longer static.
* ctfc.h: Analogous change.
---
 gcc/ctfc.c | 2 +-
 gcc/ctfc.h | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/ctfc.c b/gcc/ctfc.c
index 1a6ddb80829..db6ba030301 100644
--- a/gcc/ctfc.c
+++ b/gcc/ctfc.c
@@ -132,7 +132,7 @@ ctf_dtd_insert (ctf_container_ref ctfc, ctf_dtdef_ref dtd)
 
 /* Lookup CTF type given a DWARF die for the type.  */
 
-static ctf_dtdef_ref
+ctf_dtdef_ref
 ctf_dtd_lookup (const ctf_container_ref ctfc, const dw_die_ref type)
 {
   ctf_dtdef_t entry;
diff --git a/gcc/ctfc.h b/gcc/ctfc.h
index 39c527074b5..825570d807e 100644
--- a/gcc/ctfc.h
+++ b/gcc/ctfc.h
@@ -388,7 +388,10 @@ extern bool ctf_type_exists (ctf_container_ref, 
dw_die_ref, ctf_id_t *);
 
 extern void ctf_add_cuname (ctf_container_ref, const char *);
 
-extern ctf_dvdef_ref ctf_dvd_lookup (const ctf_container_ref, dw_die_ref);
+extern ctf_dtdef_ref ctf_dtd_lookup (const ctf_container_ref ctfc,
+dw_die_ref die);
+extern ctf_dvdef_ref ctf_dvd_lookup (const ctf_container_ref ctfc,
+dw_die_ref die);
 
 extern const char * ctf_add_string (ctf_container_ref, const char *,
uint32_t *, int);
-- 
2.32.0



[PATCH 0/7] BPF CO-RE Support

2021-08-04 Thread David Faust via Gcc-patches
[ These patches depend on the series "Allow means for late BTF generation
  for BPF CO-RE" by Indu Bhagat, here:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576446.html ]

Hello,

This patch series adds support for the BPF Compile Once - Run Everywhere
(BPF CO-RE) mechanism in GCC.

A BPF program is some user code which is injected (via a verifier and loader)
into a running kernel, and executed in kernel context. To do useful work, a BPF
program generally must interact with kernel data structures in some way.
Therefore, BPF programs written in C usually include kernel headers.

This introduces two major portability issues when compiling BPF programs:

   1. Kernel data structures regularly change, with fields added, moved or
  deleted between versions. An eBPF program cannot in general be expected
  to run on any system which does not share an identical kernel version to
  the system on which it was compiled.

   2. Included kernel headers (and used data structures) may be internal, not
  exposed in an userspace API, and therefore target-specific. An eBPF
  program compiled on an x86_64 machine will include x86_64 kernel headers.
  The resulting program may not run well (or at all) in machines of
  another architecture.

BPF CO-RE is designed to solve the first issue by leveraging the BPF loader to
adjust references to kernel data structures made by the program as-needed
according to versions of structures actually present on the host kernel.

To achieve this, additional information is placed in a ".BTF.ext" section.  This
information tells the loader which references will require adjusting, and how to
perform each necessary adjustment.

For any access to a data structure which may require load-time adjustment,
the following information is recorded (making up a CO-RE relocation record):
- The BTF type ID of the outermost structure which is accessed.
- An access string encoding the accessed member via a series of member and
  array indexes. These indexes are used to look up detailed BTF information
  about the member.
- The offset of the appropriate instruction to patch in the BPF program.
- An integer specifying what kind of relocation to perform.

A CO-RE-capable BPF loader reads this information together with the BTF
information of the program, compares it against BTF information of the host
kernel, and determines the appropriate way to patch the specified instruction.
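
Purely as an illustration of the four pieces of information listed above
(the field names here are invented; the actual encoding lives in coreout.h,
added later in this series), one such record can be pictured as:

/* Illustration only, not the real layout from coreout.h.  */
struct core_reloc_sketch
{
  unsigned int type_id;     /* BTF type ID of the outermost structure.  */
  unsigned int access_off;  /* Offset of the "0:2:1"-style access string.  */
  unsigned int insn_off;    /* Offset of the instruction to be patched.  */
  unsigned int kind;        /* Which kind of relocation to perform.  */
};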

Once all CO-RE relocations are resolved, the program is loaded and verified as
usual. The process can be summarized with the following diagram:

           +------------+
           | C compiler |
           +-----+------+
                 | BPF + BTF + CO-RE relocations
                 v
           +------------+
      +--->| BPF loader |
      |    +-----+------+
      |          | BPF (adapted)
  BTF |          v
      |    +------------+
      +----+   Kernel   |
           +------------+

Note that a single ELF object may contain multiple eBPF programs. As a result, a
single .BTF.ext section can contain CO-RE relocations for multiple programs in
distinct sections.

Many data structure accesses (e.g., those described in the program itself) do
not need to be patched. So, GCC only generates CO-RE information for accesses
marked as being "of interest." To be compatible with LLVM a new BPF target
builtin, __builtin_preserve_access_index, is implemented. Any accesses to
aggregate data structures (structs, unions, arrays) in the argument will have
appropriate CO-RE information generated and output. This builtin is otherwise
transparent - it does not alter the program's functionality in any way.

In addition, a new BPF target attribute preserve_access_index is added.  This
attribute may annotate struct and union type definitions. Any access to a type
with this attribute is automatically "of interest," and will have CO-RE
information generated accordingly.
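
As a purely illustrative sketch of these two mechanisms (the type and field
names below are made up, and the unit would be compiled for the bpf target
with -gbtf, which as noted below implies -mcore):

struct plain { int a; int b; };

int
read_b (struct plain *p)
{
  /* Explicitly request a CO-RE relocation for this one access.  */
  int *fp = __builtin_preserve_access_index (&p->b);
  return *fp;
}

struct marked {
  int x;
  int y;
} __attribute__((preserve_access_index));

int
read_y (struct marked *m)
{
  return m->y;   /* Relocation recorded automatically via the type attribute.  */
}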

Finally, generation of BPF CO-RE information is gated behind a new BPF option,
-mcore (and its negative, -mno-core). Because CO-RE support is intimately tied
to BTF debug information, -gbtf for BPF target implies -mcore, and -mcore
requires BTF generation. For cases where BTF information is desired but CO-RE
is not important, it can be disabled with -mno-core.

David Faust (7):
  dwarf: externalize lookup_type_die
  ctfc: externalize ctf_dtd_lookup
  ctfc: add function to lookup CTF ID of a TREE type
  btf: expose get_btf_id
  bpf: BPF CO-RE support
  bpf testsuite: Add BPF CO-RE tests
  doc: BPF CO-RE documentation

 gcc/btfout.c  |   2 +-
 gcc/config.gcc|   3 +
 gcc/config/bpf/bpf-passes.def |  20 +
 gcc/config/bpf/bpf-protos.h   |   2 +
 gcc/config/bpf/bpf.c  | 579 ++
 gcc/config/bpf/coreout.c  | 356 +++
 gcc/config/bpf/coreout.h   

[PATCH 4/7] btf: expose get_btf_id

2021-08-04 Thread David Faust via Gcc-patches
Expose the function get_btf_id, so that it may be used by the BPF
backend. This enables the BPF CO-RE machinery in the BPF backend to
lookup BTF type IDs, in order to create CO-RE relocation records.

A prototype is added in ctfc.h

gcc/ChangeLog:

* btfout.c (get_btf_id): Function is no longer static.
* ctfc.h: Expose it here.
---
 gcc/btfout.c | 2 +-
 gcc/ctfc.h   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/btfout.c b/gcc/btfout.c
index 8cdd9905fb6..cdc6c6378c0 100644
--- a/gcc/btfout.c
+++ b/gcc/btfout.c
@@ -156,7 +156,7 @@ init_btf_id_map (size_t len)
 /* Return the BTF type ID of CTF type ID KEY, or BTF_INVALID_TYPEID if the CTF
type with ID KEY does not map to a BTF type.  */
 
-static inline ctf_id_t
+ctf_id_t
 get_btf_id (ctf_id_t key)
 {
   return btf_id_map[key];
diff --git a/gcc/ctfc.h b/gcc/ctfc.h
index 14180c1e5de..a0b7e4105a8 100644
--- a/gcc/ctfc.h
+++ b/gcc/ctfc.h
@@ -431,6 +431,7 @@ extern int ctf_add_variable (ctf_container_ref, const char 
*, ctf_id_t,
 dw_die_ref, unsigned int);
 
 extern ctf_id_t ctf_lookup_tree_type (ctf_container_ref, const tree);
+extern ctf_id_t get_btf_id (ctf_id_t);
 
 /* CTF section does not emit location information; at this time, location
information is needed for BTF CO-RE use-cases.  */
-- 
2.32.0



[PATCH 3/7] ctfc: add function to lookup CTF ID of a TREE type

2021-08-04 Thread David Faust via Gcc-patches
Add a new function, ctf_lookup_tree_type, to return the CTF type ID
associated with a type via its TREE node. The function is exposed via
a prototype in ctfc.h.

gcc/ChangeLog:

* ctfc.c (ctf_lookup_tree_type): New function.
* ctfc.h: Likewise.
---
 gcc/ctfc.c | 16 
 gcc/ctfc.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/gcc/ctfc.c b/gcc/ctfc.c
index db6ba030301..73c118e3d49 100644
--- a/gcc/ctfc.c
+++ b/gcc/ctfc.c
@@ -791,6 +791,22 @@ ctf_add_sou (ctf_container_ref ctfc, uint32_t flag, const 
char * name,
   return type;
 }
 
+/* Given a TREE_TYPE node, return the CTF type ID for that type.  */
+
+ctf_id_t
+ctf_lookup_tree_type (ctf_container_ref ctfc, const tree type)
+{
+  dw_die_ref die = lookup_type_die (type);
+  if (die == NULL)
+return CTF_NULL_TYPEID;
+
+  ctf_dtdef_ref dtd = ctf_dtd_lookup (ctfc, die);
+  if (dtd == NULL)
+return CTF_NULL_TYPEID;
+
+  return dtd->dtd_type;
+}
+
 /* Check if CTF for TYPE has already been generated.  Mainstay for
de-duplication.  If CTF type already exists, returns TRUE and updates
the TYPE_ID for the caller.  */
diff --git a/gcc/ctfc.h b/gcc/ctfc.h
index 825570d807e..14180c1e5de 100644
--- a/gcc/ctfc.h
+++ b/gcc/ctfc.h
@@ -430,6 +430,8 @@ extern int ctf_add_function_arg (ctf_container_ref, 
dw_die_ref,
 extern int ctf_add_variable (ctf_container_ref, const char *, ctf_id_t,
 dw_die_ref, unsigned int);
 
+extern ctf_id_t ctf_lookup_tree_type (ctf_container_ref, const tree);
+
 /* CTF section does not emit location information; at this time, location
information is needed for BTF CO-RE use-cases.  */
 
-- 
2.32.0



[PATCH 1/7] dwarf: externalize lookup_type_die

2021-08-04 Thread David Faust via Gcc-patches
Expose the function lookup_type_die in dwarf2out, so that it can be used
by CTF/BTF when adding BPF CO-RE information. The function is now
non-static, and an extern prototype is added in dwarf2out.h.

gcc/ChangeLog:

* dwarf2out.c (lookup_type_die): Function is no longer static.
* dwarf2out.h: Expose it here.
---
 gcc/dwarf2out.c | 3 +--
 gcc/dwarf2out.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index 1022fb75315..f32084c3eaf 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -3740,7 +3740,6 @@ static bool remove_AT (dw_die_ref, enum dwarf_attribute);
 static void remove_child_TAG (dw_die_ref, enum dwarf_tag);
 static void add_child_die (dw_die_ref, dw_die_ref);
 static dw_die_ref new_die (enum dwarf_tag, dw_die_ref, tree);
-static dw_die_ref lookup_type_die (tree);
 static dw_die_ref strip_naming_typedef (tree, dw_die_ref);
 static dw_die_ref lookup_type_die_strip_naming_typedef (tree);
 static void equate_type_number_to_die (tree, dw_die_ref);
@@ -5838,7 +5837,7 @@ new_die (enum dwarf_tag tag_value, dw_die_ref parent_die, 
tree t)
 
 /* Return the DIE associated with the given type specifier.  */
 
-static inline dw_die_ref
+dw_die_ref
 lookup_type_die (tree type)
 {
   dw_die_ref die = TYPE_SYMTAB_DIE (type);
diff --git a/gcc/dwarf2out.h b/gcc/dwarf2out.h
index b2152a53bf9..312a9909784 100644
--- a/gcc/dwarf2out.h
+++ b/gcc/dwarf2out.h
@@ -417,6 +417,7 @@ extern dw_die_ref new_die_raw (enum dwarf_tag);
 extern dw_die_ref base_type_die (tree, bool);
 
 extern dw_die_ref lookup_decl_die (tree);
+extern dw_die_ref lookup_type_die (tree);
 
 extern dw_die_ref dw_get_die_child (dw_die_ref);
 extern dw_die_ref dw_get_die_sib (dw_die_ref);
-- 
2.32.0



Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
On Wed, Aug 4, 2021 at 1:19 PM Maged Michael 
wrote:

> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
> wrote:
>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
>> wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's
>> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
>> double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function
>> _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +constexpr bool __double_word
>> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +// The ref-count members follow the vptr, so are aligned to
>> +// alignof(void*).
>> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
>> Secondly:
>>
>> +  void
>> +  __attribute__ ((noinline))
>>
>> This needs to be __noinline__ because noinline is not a reserved word,
>> so users can do:
>> #define noinline 1
>> #include <memory>
>>
>> Was it performance profiling, or code-size measurements (or something
>> else?) that showed this should be non-inline?
>> I'd like to add a comment explaining why we want it to be noinline.
>>
>> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))
>
>
> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> How about something like this comment?
>
> // Noinline to avoid code size increase.
>
>
>
> In that function ...
>>
>> +  _M_release_last_use() noexcept
>> +  {
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> +_M_dispose();
>> +if (_Mutex_base<_Lp>::_S_need_barriers)
>> +  {
>> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
>> +  }
>>
>> I think we can remove this fence. The _S_need_barriers constant is
>> only true for the _S_mutex policy, and that just calls
>> _M_release_orig(), so never uses this function. I'll remove it and add
>> a comment noting that the lack of barrier is intentional.
>>
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> +   -1) == 1)
>> +  {
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> +_M_destroy();
>> +  }
>> +  }
>>
>> Alternatively, we could keep the fence in _M_release_last_use() and
>> refactor the other _M_release functions, so that we have explicit
>> specializations for each of:
>> _Sp_counted_base<_S_single>::_M_release() (already present)
>> _Sp_counted_base<_S_mutex>::_M_release()
>> _Sp_counted_base<_S_atomic>::_M_release()
>>
>> The second and third would be new, as currently they both 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
Sorry. I totally missed the rest of your message and the patch. My fuzzy
eyesight, which usually guesses correctly 90% of the time, mistook
"Secondly" on a line by itself for "Sincerely" :-)

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>
> I'm no longer confident in the alignof suggestion I gave you.
>
> +constexpr bool __double_word
> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +// The ref-count members follow the vptr, so are aligned to
> +// alignof(void*).
> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> Secondly:
>
> +  void
> +  __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
> The noinlining was based on looking at generated code. That was for clang.
It was inlining the _M_last_use function for every instance of _M_release
(e.g., ~shared_ptr). This optimization with the noinline for
_M_release_last_use ended up reducing massive binary text sizes by 1.5%
(several megabytes), which was an extra benefit. Without the noinline we
saw code size increase.

IIUC, we can use the following. Right?

__attribute__((__noinline__))


I didn't understand the part about programmers doing #define noinline 1. I
don't see code in the patch that uses noinline.

How about something like this comment?

// Noinline to avoid code size increase.



In that function ...
>
> +  _M_release_last_use() noexcept
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +_M_dispose();
> +if (_Mutex_base<_Lp>::_S_need_barriers)
> +  {
> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
> +  }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +   -1) == 1)
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +_M_destroy();
> +  }
> +  }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The 

Re: [PATCH, rs6000] Add store fusion support for Power10

2021-08-04 Thread Segher Boessenkool
On Wed, Aug 04, 2021 at 09:23:13AM -0500, Bill Schmidt wrote:
> On 8/2/21 3:19 PM, Pat Haugen via Gcc-patches wrote:

(I reviewed this elsewhere instead of on the list...  Not good, since
the patch was on the list already.  Sorry.)

> >@@ -18885,6 +18980,10 @@ rs6000_sched_reorder (FILE *dump 
> >ATTRIBUTE_UNUSED, int sched_verbose,
> >if (rs6000_tune == PROCESSOR_POWER6)
> >  load_store_pendulum = 0;
> >
> >+  /* Do Power10 dependent reordering.  */
> >+  if (rs6000_tune == PROCESSOR_POWER10 && last_scheduled_insn)
> >+power10_sched_reorder (ready, *pn_ready - 1);
> >+
> 
> I happened to notice that pn_ready is marked as ATTRIBUTE_UNUSED.  This 
> predates your patch, but maybe you could clean that up too.

*All* ATTRIBUTE_UNUSED instances should go away (or almost all, only
those that really mean "maybe unused" can stay).  The preferred way to
in C++ say some argument is unused is by not naming it (or only in a
comment).  So instead of

void f (int a ATTRIBUTE_UNUSED, int b ATTRIBUTE_UNUSED) { ... }

you can say

void f (int, int) { ... }

or

void f (int/*a*/, int/*b*/) { ... }

> Can you think of any test cases we can use to demonstrate store fusion?

It may be possible to make that reliable by doing a *really* simple
testcase?  But then it isn't testing very much :-(
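
As a hypothetical sketch only (not from this thread), the sort of trivial
testcase being discussed would be two back-to-back stores to adjacent
doublewords, the pattern the new reordering tries to keep paired; whether
the resulting schedule is stable enough to scan for is exactly the open
question:

void
store_pair (long *p, long a, long b)
{
  p[0] = a;
  p[1] = b;
}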


Segher


[committed] aarch64: Fix a typo

2021-08-04 Thread Richard Sandiford via Gcc-patches
Tested on aarch64-linux-gnu and pushed.

Richard


gcc/
* config/aarch64/aarch64.c: Fix a typo.
---
 gcc/config/aarch64/aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f80de2ca897..81c002ba0b0 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15032,7 +15032,7 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
  scalar operation.
 
- If VEC_FLAGS & VEC_ADVSIMD, return the loop carry latency of the
- the Advanced SIMD implementation.
+ Advanced SIMD implementation.
 
- If VEC_FLAGS & VEC_ANY_SVE, return the loop carry latency of the
  SVE implementation.


[PATCH V2] aarch64: Don't include vec_select high-half in SIMD add cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

V2 of this patch uses the same approach as that just implemented
for the multiply high-half cost patch.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan 

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vaddX_high_cost.c: New test.

From: Jonathan Wright
Sent: 29 July 2021 10:22
To: gcc-patches@gcc.gnu.org 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] aarch64: Don't include vec_select high-half in SIMD add cost 
 
Hi,

The Neon add-long/add-widen instructions can select the top or bottom
half of the operand registers. This selection does not change the
cost of the underlying instruction and this should be reflected by
the RTL cost function.

This patch adds RTL tree traversal in the Neon add cost function to
match vec_select high-half of its operands. This traversal prevents
the cost of the vec_select from being added into the cost of the
add - meaning that these instructions can now be emitted in the
combine pass as they are no longer deemed prohibitively expensive.
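
As a purely illustrative example (not one of the new tests): with the cost
fix, combine should be able to merge the vget_high (a vec_select of the
high half) into the widening add, giving a single saddl2 rather than a
dup/saddl pair:

#include <arm_neon.h>

int32x4_t
add_high (int16x8_t a, int16x8_t b)
{
  return vaddl_s16 (vget_high_s16 (a), vget_high_s16 (b));
}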

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

    * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
    of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/vaddX_high_cost.c: New test.

rb14710.patch
Description: rb14710.patch


Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Wed, 4 Aug 2021 at 16:47, Maged Michael  wrote:
>
> Thanks, Jonathan!
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely  wrote:
>>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael  
>> > > wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's 
>> > >> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, 
>> > >> double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function 
>> > >> _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>
>
> Yes definitely _M_release_combined makes sense. Will do.

See the patch I attached to my previous mail, which has a refactored
version that gets rid of that function entirely.

>
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +constexpr bool __double_word
>> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +// The ref-count members follow the vptr, so are aligned to
>> +// alignof(void*).
>> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
> I'd rather avoid any runtime checks because they may negate the speed 
> rationale for doing this optimization.
> I'd be OK with not doing this optimization for any 32-bit architecture.
>
> Is it OK to change the __aligned condition to the following?
> constexpr bool __aligned =
>   (alignof(long long) <= alignof(void*))
>   && (sizeof(long long) == sizeof(void*));

Yes, that will work fine.


Re: [PATCH] c: Fix ICE caused by get_parm_array_spec [PR101702]

2021-08-04 Thread Martin Sebor via Gcc-patches

On 8/3/21 1:17 AM, Jakub Jelinek wrote:

Hi!

The following testcase ICEs, because nelts is NOP_EXPR around INTEGER_CST
- it is a VLA whose extent folds into a constant - and get_parm_array_spec
has specific INTEGER_CST handling and otherwise strips nops from nelts
and stores it into a TREE_LIST that is later asserted to be a DECL_P
or EXPR_P, where the INTEGER_CST is neither of those.

So, either we can strip nops earlier (needs moving the integral type
check first as STRIP_NOPS can alter that e.g. to pointer or from
pointer to integer) and thus handle as INTEGER_CST even the case
of INTEGER_CST wrapped into casts as this patch does, or we need
to change handle_argspec_attribute's assertion to allow INTEGER_CSTs
as well there.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Or do you prefer to change handle_argspec_attribute ?


I think the bug is that the bound is treated as variable but its value
is then determined to be constant.  I expect the following to be
diagnosed again, the same way as in GCC 11:

  double foo (double x[!__builtin_copysignf (~2, 3)]);
  double foo (double x[]);

Your fix prevents it.  So I think the right fix would treat
the bound as constant in this case.

Martin



2021-08-03  Jakub Jelinek  

PR c/101702
* c-decl.c (get_parm_array_spec): Check for non-integral
type first, then STRIP_NOPS and only afterwards check for
INTEGER_CST.

* gcc.dg/pr101702.c: New test.

--- gcc/c/c-decl.c.jj   2021-07-15 18:50:52.0 +0200
+++ gcc/c/c-decl.c  2021-08-02 18:56:35.532045128 +0200
@@ -5842,6 +5842,11 @@ get_parm_array_spec (const struct c_parm
if (pd->u.array.static_p)
spec += 's';
  
+  if (!INTEGRAL_TYPE_P (TREE_TYPE (nelts)))

+   /* Avoid invalid NELTS.  */
+   return attrs;
+
+  STRIP_NOPS (nelts);
if (TREE_CODE (nelts) == INTEGER_CST)
{
  /* Skip all constant bounds except the most significant one.
@@ -5859,13 +5864,9 @@ get_parm_array_spec (const struct c_parm
  spec += buf;
  break;
}
-  else if (!INTEGRAL_TYPE_P (TREE_TYPE (nelts)))
-   /* Avoid invalid NELTS.  */
-   return attrs;
  
/* Each variable VLA bound is represented by a dollar sign.  */

spec += "$";
-  STRIP_NOPS (nelts);
vbchain = tree_cons (NULL_TREE, nelts, vbchain);
  }
  
--- gcc/testsuite/gcc.dg/pr101702.c.jj	2021-08-02 18:58:24.614534975 +0200

+++ gcc/testsuite/gcc.dg/pr101702.c 2021-08-02 18:57:52.611978024 +0200
@@ -0,0 +1,11 @@
+/* PR c/101702 */
+/* { dg-do compile } */
+/* { dg-options "" } */
+
+double foo (double x[!__builtin_copysignf (~2, 3)]);
+
+double
+bar (double *x)
+{
+  return foo (x);
+}

Jakub





Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
Thanks, Jonathan!

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>

Yes definitely _M_release_combined makes sense. Will do.


> I'm no longer confident in the alignof suggestion I gave you.
>
> +constexpr bool __double_word
> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +// The ref-count members follow the vptr, so are aligned to
> +// alignof(void*).
> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> I'd rather avoid any runtime checks because they may negate the speed
rationale for doing this optimization.
I'd be OK with not doing this optimization for any 32-bit architecture.

Is it OK to change the __aligned condition to the following?
constexpr bool __aligned =
  (alignof(long long) <= alignof(void*))
  && (sizeof(long long) == sizeof(void*));

Thanks,
Maged



> Secondly:
>
> +  void
> +  __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
> In that function ...
>
> +  _M_release_last_use() noexcept
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +_M_dispose();
> +if (_Mutex_base<_Lp>::_S_need_barriers)
> +  {
> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
> +  }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +   -1) == 1)
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +_M_destroy();
> +  }
> +  }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The _S_atomic one
> would have your new optimization. See the attached patch showing what
> I mean. I find this version a bit simpler to understand, as we just
> have _M_release and _M_release_last_use, without
> _M_release_double_width_cas and _M_release_orig. What do you think of
> this version? Does it lose any important properties of your version
> which I've failed to notice?
>


Re: [PATCH V2] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-08-04 Thread Richard Sandiford via Gcc-patches
Jonathan Wright  writes:
> Hi,
>
> Changes suggested here and those discussed off-list have been
> implemented in V2 of the patch.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
> * config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
> Define.
> (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
> vec_select high-half from being added into Neon multiply
> cost.
> * rtlanal.c (vec_series_highpart_p): Define.
> * rtlanal.h (vec_series_highpart_p): Declare.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/vmul_high_cost.c: New test.

OK, thanks.

Richard

>
> From: Richard Sandiford 
> Sent: 04 August 2021 10:05
> To: Jonathan Wright via Gcc-patches 
> Cc: Jonathan Wright 
> Subject: Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD 
> multiply cost
>
> Jonathan Wright via Gcc-patches  writes:
>> Hi,
>>
>> The Neon multiply/multiply-accumulate/multiply-subtract instructions
>> can select the top or bottom half of the operand registers. This
>> selection does not change the cost of the underlying instruction and
>> this should be reflected by the RTL cost function.
>>
>> This patch adds RTL tree traversal in the Neon multiply cost function
>> to match vec_select high-half of its operands. This traversal
>> prevents the cost of the vec_select from being added into the cost of
>> the multiply - meaning that these instructions can now be emitted in
>> the combine pass as they are no longer deemed prohibitively
>> expensive.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>
> Like you say, the instructions can handle both the low and high halves.
> Shouldn't we also check for the low part (as a SIGN/ZERO_EXTEND of
> a subreg)?
>
>> Ok for master?
>>
>> Thanks,
>> Jonathan
>>
>> ---
>>
>> gcc/ChangeLog:
>>
>> 2021-07-19  Jonathan Wright  
>>
>>* config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
>>Define.
>>(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
>>vec_select high-half from being added into Neon multiply
>>cost.
>>* rtlanal.c (vec_series_highpart_p): Define.
>>* rtlanal.h (vec_series_highpart_p): Declare.
>>
>> gcc/testsuite/ChangeLog:
>>
>>* gcc.target/aarch64/vmul_high_cost.c: New test.
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 
>> 5809887997305317c5a81421089db431685e2927..a49672afe785e3517250d324468edacceab5c9d3
>>  100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -76,6 +76,7 @@
>>  #include "function-abi.h"
>>  #include "gimple-pretty-print.h"
>>  #include "tree-ssa-loop-niter.h"
>> +#include "rtlanal.h"
>>
>>  /* This file should be included last.  */
>>  #include "target-def.h"
>> @@ -11970,6 +11971,19 @@ aarch64_cheap_mult_shift_p (rtx x)
>>return false;
>>  }
>>
>> +/* Return true iff X is an operand of a select-high-half vector
>> +   instruction.  */
>> +
>> +static bool
>> +aarch64_vec_select_high_operand_p (rtx x)
>> +{
>> +  return ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
>> +   && GET_CODE (XEXP (x, 0)) == VEC_SELECT
>> +   && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
>> + GET_MODE (XEXP (XEXP (x, 0), 0)),
>> + XEXP (XEXP (x, 0), 1)));
>> +}
>> +
>>  /* Helper function for rtx cost calculation.  Calculate the cost of
>> a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
>> Return the calculated cost of the expression, recursing manually in to
>> @@ -11995,6 +12009,13 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
>> int outer, bool speed)
>>unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>>if (vec_flags & VEC_ADVSIMD)
>>{
>> +   /* The select-operand-high-half versions of the instruction have the
>> +  same cost as the three vector version - don't add the costs of the
>> +  select into the costs of the multiply.  */
>> +   if (aarch64_vec_select_high_operand_p (op0))
>> + op0 = XEXP (XEXP (op0, 0), 0);
>> +   if (aarch64_vec_select_high_operand_p (op1))
>> + op1 = XEXP (XEXP (op1, 0), 0);
>
> For consistency with aarch64_strip_duplicate_vec_elt, I think this
> should be something like aarch64_strip_vec_extension, returning
> the inner rtx on success and the original one on failure.
>
> Thanks,
> Richard
>
>>  /* The by-element versions of the instruction have the same costs as
>> the normal 3-vector version.  So don't add the costs of the
>> duplicate or subsequent select into the costs of the multiply.  
>> We
>> diff --git a/gcc/rtlanal.h b/gcc/rtlanal.h
>> index 
>> 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael  
> > wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's 
> >> suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom, 
> >> double-width and alignment.
> >> - Move the code in the optimized path to a separate function 
> >> _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.


It occurs to me now that _M_release_double_width_cas is the wrong
name, because we're not doing a DWCAS, just a double-width load. What
do you think about renaming it to _M_release_combined instead? Since
it does a combined load of the two counts.

I'm no longer confident in the alignof suggestion I gave you.

+constexpr bool __double_word
+  = sizeof(long long) == 2 * sizeof(_Atomic_word);
+// The ref-count members follow the vptr, so are aligned to
+// alignof(void*).
+constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) this constant will be true, because
alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and
_M_weak_count members could be in different cache lines, and the
atomic load will not be atomic. I think we want to use __alignof(long
long) here to get the "natural" alignment, not the smaller 4B
alignment allowed by SysV ABI. That means that we won't do this
optimization on IA32, but I think that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check something like (uintptr_t)&_M_use_count / 64 ==
(uintptr_t)&_M_weak_count / 64 to check they're on the same cache
line. I think for now we should just use __alignof and live without
the optimization on IA32.
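
To make the alignof/__alignof difference concrete, here is a minimal
standalone sketch (not the libstdc++ patch itself; it assumes
_Atomic_word is int, as on common glibc targets):

#include <cstdio>

typedef int atomic_word;   // stand-in for _Atomic_word

int main()
{
  constexpr bool double_word = sizeof(long long) == 2 * sizeof(atomic_word);
  // ABI alignment: 4 on IA32, 8 on x86_64 -- this check passes on both.
  constexpr bool abi_aligned = alignof(long long) <= alignof(void*);
  // Natural alignment via the GNU __alignof extension: 8 on both, so
  // this is false on IA32 and the combined-load path would be skipped.
  constexpr bool natural_aligned = __alignof(long long) <= alignof(void*);
  std::printf("double_word=%d abi_aligned=%d natural_aligned=%d\n",
              double_word, abi_aligned, natural_aligned);
}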

Secondly:

+  void
+  __attribute__ ((noinline))

This needs to be __noinline__ because noinline is not a reserved word,
so users can do:
#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?) that showed this should be non-inline?
I'd like to add a comment explaining why we want it to be noinline.
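
A tiny demonstration of the reserved-spelling point (hypothetical user
code, not part of the patch):

#define noinline 1  // legal: 'noinline' is not reserved for the implementation
#include <memory>   // must still compile, so the headers need __noinline__

__attribute__ ((__noinline__))   // reserved spelling, unaffected by the macro
int f() { return 42; }

// __attribute__ ((noinline)) would expand to __attribute__ ((1)) here.

int main() { return f() == 42 ? 0 : 1; }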

In that function ...

+  _M_release_last_use() noexcept
+  {
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+_M_dispose();
+if (_Mutex_base<_Lp>::_S_need_barriers)
+  {
+__atomic_thread_fence (__ATOMIC_ACQ_REL);
+  }

I think we can remove this fence. The _S_need_barriers constant is
only true for the _S_mutex policy, and that just calls
_M_release_orig(), so never uses this function. I'll remove it and add
a comment noting that the lack of barrier is intentional.

+_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+   -1) == 1)
+  {
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+_M_destroy();
+  }
+  }

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:
_Sp_counted_base<_S_single>::_M_release() (already present)
_Sp_counted_base<_S_mutex>::_M_release()
_Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new, as currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count then call _M_release_last_use() (which still
has the barrier needed for the _S_mutex policy). The _S_atomic one
would have your new optimization. See the attached patch showing what
I mean. I find this version a bit simpler to understand, as we just
have _M_release and _M_release_last_use, without
_M_release_double_width_cas and _M_release_orig. What do you think of
this version? Does it lose any important properties of your version
which I've failed to notice?
diff --git a/libstdc++-v3/include/bits/c++config 
b/libstdc++-v3/include/bits/c++config
index 32b8957f814..07465f7ecd5 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -143,6 +143,15 @@
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h 
b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..b2397c8fddb 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ 

[PATCH] gcov: check return code of a fclose

2021-08-04 Thread Martin Liška

We miss a place where an I/O write error can occur.

Pushed to master as obvious.

Martin

gcc/ChangeLog:

PR gcov-profile/101773
* gcov-io.c (gcov_close): Check return code of a fclose.
---
 gcc/gcov-io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/gcov-io.c b/gcc/gcov-io.c
index 4b1e11d4530..7819593234a 100644
--- a/gcc/gcov-io.c
+++ b/gcc/gcov-io.c
@@ -199,7 +199,9 @@ gcov_close (void)
 {
   if (gcov_var.file)
 {
-  fclose (gcov_var.file);
+  if (fclose (gcov_var.file))
+   gcov_var.error = 1;
+
   gcov_var.file = 0;
 }
   gcov_var.mode = 0;
--
2.32.0



[PATCH V2] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

Changes suggested here and those discussed off-list have been
implemented in V2 of the patch.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
vec_select high-half from being added into Neon multiply
cost.
* rtlanal.c (vec_series_highpart_p): Define.
* rtlanal.h (vec_series_highpart_p): Declare.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_high_cost.c: New test.

From: Richard Sandiford 
Sent: 04 August 2021 10:05
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD 
multiply cost 
 
Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can select the top or bottom half of the operand registers. This
> selection does not change the cost of the underlying instruction and
> this should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match vec_select high-half of its operands. This traversal
> prevents the cost of the vec_select from being added into the cost of
> the multiply - meaning that these instructions can now be emitted in
> the combine pass as they are no longer deemed prohibitively
> expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.

Like you say, the instructions can handle both the low and high halves.
Shouldn't we also check for the low part (as a SIGN/ZERO_EXTEND of
a subreg)?

> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
>    * config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
>    Define.
>    (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
>    vec_select high-half from being added into Neon multiply
>    cost.
>    * rtlanal.c (vec_series_highpart_p): Define.
>    * rtlanal.h (vec_series_highpart_p): Declare.
>
> gcc/testsuite/ChangeLog:
>
>    * gcc.target/aarch64/vmul_high_cost.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> 5809887997305317c5a81421089db431685e2927..a49672afe785e3517250d324468edacceab5c9d3
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -76,6 +76,7 @@
>  #include "function-abi.h"
>  #include "gimple-pretty-print.h"
>  #include "tree-ssa-loop-niter.h"
> +#include "rtlanal.h"
>  
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -11970,6 +11971,19 @@ aarch64_cheap_mult_shift_p (rtx x)
>    return false;
>  }
>  
> +/* Return true iff X is an operand of a select-high-half vector
> +   instruction.  */
> +
> +static bool
> +aarch64_vec_select_high_operand_p (rtx x)
> +{
> +  return ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +   && GET_CODE (XEXP (x, 0)) == VEC_SELECT
> +   && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
> + GET_MODE (XEXP (XEXP (x, 0), 0)),
> + XEXP (XEXP (x, 0), 1)));
> +}
> +
>  /* Helper function for rtx cost calculation.  Calculate the cost of
> a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
> Return the calculated cost of the expression, recursing manually in to
> @@ -11995,6 +12009,13 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> int outer, bool speed)
>    unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>    if (vec_flags & VEC_ADVSIMD)
>    {
> +   /* The select-operand-high-half versions of the instruction have the
> +  same cost as the three vector version - don't add the costs of the
> +  select into the costs of the multiply.  */
> +   if (aarch64_vec_select_high_operand_p (op0))
> + op0 = XEXP (XEXP (op0, 0), 0);
> +   if (aarch64_vec_select_high_operand_p (op1))
> + op1 = XEXP (XEXP (op1, 0), 0);

For consistency with aarch64_strip_duplicate_vec_elt, I think this
should be something like aarch64_strip_vec_extension, returning
the inner rtx on success and the original one on failure.

Thanks,
Richard

>  /* The by-element versions of the instruction have the same costs as
> the normal 3-vector version.  So don't add the costs of the
> duplicate or subsequent select into the costs of the multiply.  We
> diff --git a/gcc/rtlanal.h b/gcc/rtlanal.h
> index 
> e1642424db89736675ac3e0d505aeaa59dca8bad..542dc7898bead27d3da89e5138c49563ba226eae
>  100644
> --- a/gcc/rtlanal.h
> +++ b/gcc/rtlanal.h
> @@ -331,6 +331,10 @@ inline vec_rtx_properties_base::~vec_rtx_properties_base 
> ()
> collecting the references a second 

[OG11, committed] libgomp amdgcn: Fix issues with dynamic OpenMP thread scaling

2021-08-04 Thread Andrew Stubbs
This patch fixes a bug in which testcases using a thread_limit larger than 
the number of physical threads would crash with a memory fault. This was 
exacerbated in testcases with a lot of register pressure because the 
autoscaling reduces the number of physical threads to compensate for the 
increased resource usage.


Committed to devel/omp/gcc-11.

@ Thomas, this should probably be folded into another patch when 
upstreaming OG11 to mainline.


Andrew
libgomp amdgcn: Fix issues with dynamic OpenMP thread scaling

libgomp/ChangeLog:

* config/gcn/bar.h (gomp_barrier_init): Limit thread count to the
actual physical number.
* config/gcn/team.c (gomp_team_start): Don't attempt to set up
threads that do not exist.

diff --git a/libgomp/config/gcn/bar.h b/libgomp/config/gcn/bar.h
index bbd3141837f..63e803bd72b 100644
--- a/libgomp/config/gcn/bar.h
+++ b/libgomp/config/gcn/bar.h
@@ -55,6 +55,9 @@ typedef unsigned int gomp_barrier_state_t;
 
 static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
 {
+  unsigned actual_thread_count = __builtin_gcn_dim_size (1);
+  if (count > actual_thread_count)
+count = actual_thread_count;
   bar->total = count;
   bar->awaited = count;
   bar->awaited_final = count;
diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c
index 627210ea407..6aa74744315 100644
--- a/libgomp/config/gcn/team.c
+++ b/libgomp/config/gcn/team.c
@@ -187,6 +187,10 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned 
nthreads,
   if (nthreads == 1)
 return;
 
+  unsigned actual_thread_count = __builtin_gcn_dim_size (1);
+  if (nthreads > actual_thread_count)
+nthreads = actual_thread_count;
+
   /* Release existing idle threads.  */
   for (unsigned i = 1; i < nthreads; ++i)
 {


Re: [PATCH 2/2] Ada: Remove debug line number for DECL_IGNORED_P functions

2021-08-04 Thread Eric Botcazou
> The location of these ignored Ada decls looks completely sane to me.
> However, it was an unintentional side effect of the patch to allow
> minimal debugging of ignored decls.  This means we can now step into
> those functions or set line breakpoints there, while previously that
> was not possible.  And I guess it could be considered an improvement.
> 
> So it's your choice, how you want these functions to be debugged.

The requirement on the GDB side is that these functions *cannot* be stepped 
into, i.e. that they be completely transparent for the GDB user.  But we still 
want to have location information in the compiler itself to debug it.

-- 
Eric Botcazou




Re: [PATCH, rs6000] Add store fusion support for Power10

2021-08-04 Thread Bill Schmidt via Gcc-patches

Hi Pat,

Good stuff!  Comments below.

On 8/2/21 3:19 PM, Pat Haugen via Gcc-patches wrote:

Enable store fusion on Power10.

Use the SCHED_REORDER hook to implement Power10 specific ready list reordering.
As of now, pairing stores for store fusion is the only function being
performed.

Bootstrap/regtest on powerpc64le(Power10) with no new regressions. Ok for
master?

-Pat


2021-08-02  Pat Haugen  

gcc/ChangeLog:

* config/rs6000/rs6000-cpus.def (ISA_3_1_MASKS_SERVER): Add new flag.
(POWERPC_MASKS): Likewise.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Enable
store fusion for Power10.
(is_load_insn1): Verify destination is a register.
(is_store_insn1): Verify source is a register.
(is_fusable_store): New.
(power10_sched_reorder): Likewise.
(rs6000_sched_reorder): Do Power10 specific reordering.
(rs6000_sched_reorder2): Likewise.
* config/rs6000/rs6000.opt: Add new option.



diff --git a/gcc/config/rs6000/rs6000-cpus.def 
b/gcc/config/rs6000/rs6000-cpus.def
index 6758296c0fd..f5812da0184 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -90,7 +90,8 @@
 | OPTION_MASK_P10_FUSION_2LOGICAL  \
 | OPTION_MASK_P10_FUSION_LOGADD\
 | OPTION_MASK_P10_FUSION_ADDLOG\
-| OPTION_MASK_P10_FUSION_2ADD)
+| OPTION_MASK_P10_FUSION_2ADD  \
+| OPTION_MASK_P10_FUSION_2STORE)



This is all fine for now; as we've discussed elsewhere, we probably want 
to eventually consolidate all these fusion flags into one.




  /* Flags that need to be turned off if -mno-power9-vector.  */
  #define OTHER_P9_VECTOR_MASKS (OPTION_MASK_FLOAT128_HW\
@@ -143,6 +144,7 @@
 | OPTION_MASK_P10_FUSION_LOGADD\
 | OPTION_MASK_P10_FUSION_ADDLOG\
 | OPTION_MASK_P10_FUSION_2ADD  \
+| OPTION_MASK_P10_FUSION_2STORE\
 | OPTION_MASK_HTM  \
 | OPTION_MASK_ISEL \
 | OPTION_MASK_MFCRF\
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 279f00cc648..1460a0d7c5c 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4490,6 +4490,10 @@ rs6000_option_override_internal (bool global_init_p)
&& (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2ADD) == 0)
  rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2ADD;

+  if (TARGET_POWER10
+  && (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2STORE) == 0)
+rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2STORE;
+
/* Turn off vector pair/mma options on non-power10 systems.  */
else if (!TARGET_POWER10 && TARGET_MMA)
  {
@@ -18357,7 +18361,7 @@ is_load_insn1 (rtx pat, rtx *load_mem)
if (!pat || pat == NULL_RTX)
  return false;

-  if (GET_CODE (pat) == SET)
+  if (GET_CODE (pat) == SET && REG_P (SET_DEST (pat)))
  return find_mem_ref (SET_SRC (pat), load_mem);

Looks like this is just an optimization to quickly discard stores, right?

if (GET_CODE (pat) == PARALLEL)
@@ -18394,7 +18398,8 @@ is_store_insn1 (rtx pat, rtx *str_mem)
if (!pat || pat == NULL_RTX)
  return false;

-  if (GET_CODE (pat) == SET)
+  if (GET_CODE (pat) == SET
+  && (REG_P (SET_SRC (pat)) || SUBREG_P (SET_SRC (pat))))
  return find_mem_ref (SET_DEST (pat), str_mem);



Similar question.



if (GET_CODE (pat) == PARALLEL)
@@ -18859,6 +18864,96 @@ power9_sched_reorder2 (rtx_insn **ready, int lastpos)
return cached_can_issue_more;
  }

+/* Determine if INSN is a store to memory that can be fused with a similar
+   adjacent store.  */
+
+static bool
+is_fusable_store (rtx_insn *insn, rtx *str_mem)
+{
+  /* Exit early if not doing store fusion.  */
+  if (!(TARGET_P10_FUSION && TARGET_P10_FUSION_2STORE))
+return false;
+
+  /* Insn must be a non-prefixed base+disp form store.  */
+  if (is_store_insn (insn, str_mem)
+  && get_attr_prefixed (insn) == PREFIXED_NO
+  && get_attr_update (insn) == UPDATE_NO
+  && get_attr_indexed (insn) == INDEXED_NO)
+{
+  /* Further restictions by mode and size.  */
+  machine_mode mode = GET_MODE (*str_mem);
+  HOST_WIDE_INT size;
+  if MEM_SIZE_KNOWN_P (*str_mem)
+   size = MEM_SIZE (*str_mem);
+  else
+   return false;
+
+  if INTEGRAL_MODE_P (mode)
+   {
+ /* Must be word or dword size.  */
+ return (size == 4 || size == 8);
+   }
+  else if FLOAT_MODE_P (mode)
+   {
+ /* Must be dword size.  */
+ return (size == 8);
+   }
+

Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support

2021-08-04 Thread Thomas Schwinge
Hi!

On 2021-03-02T04:20:11-0800, Julian Brown  wrote:
> This patch implements worker-partitioning support in the middle end,
> by rewriting gimple.  [...]

Yay!


> This version of the patch [...]
> avoids moving SESE-region finding code out
> of the NVPTX backend

So that's 'struct bb_sese' and following functions.

> since that code isn't used by the middle-end worker
> partitioning neutering/broadcasting implementation yet.

Do I understand correctly that "isn't used [...] yet" means (a) "this
isn't implemented yet" (on og11 etc.), and not (b) "is missing from this
patch submission"?  ... and thus, from (a), it follows that we may later
also drop these changes from the og11 branch?


Relatedly, a nontrivial amount of data structures/logic/code did get
duplicated from the nvptx back end, and modified slightly or
not-so-slightly (RTL vs. GIMPLE plus certain implementation "details").

We should at least cross reference the two instances, to make sure that
any changes to one are also propagated to the other.  (I'll take care.)

And then, do you (or anyone else, of course) have any clever idea for
avoiding the duplication and combining the RTL and GIMPLE
implementations?  Given that we may nowadays use C++: do you think it
feasible to have an abstract base class capturing the common data
structures, logic, and code, with an RTL-specialized and a
GIMPLE-specialized class inheriting from it?
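
Purely as a sketch of the idea (all names hypothetical; this is not the
actual nvptx or OMP code), the shared bookkeeping could live in a base
class with the IR-specific markers behind virtual hooks:

#include <memory>
#include <vector>

struct parallel_base
{
  unsigned mask;
  std::vector<int> blocks;                      // basic-block indices
  std::unique_ptr<parallel_base> inner, next;

  explicit parallel_base (unsigned m) : mask (m) {}
  virtual ~parallel_base () = default;

  // IR-specific hooks: the RTL subclass records rtx_insn markers,
  // the GIMPLE subclass records gimple statements.
  virtual void note_fork (void *marker) = 0;
  virtual void note_join (void *marker) = 0;
};

struct rtl_parallel final : parallel_base
{
  void *forked_insn = nullptr, *join_insn = nullptr;  // rtx_insn * in reality
  using parallel_base::parallel_base;
  void note_fork (void *m) override { forked_insn = m; }
  void note_join (void *m) override { join_insn = m; }
};

struct gimple_parallel final : parallel_base
{
  void *fork_stmt = nullptr, *join_stmt = nullptr;    // gimple * in reality
  using parallel_base::parallel_base;
  void note_fork (void *m) override { fork_stmt = m; }
  void note_join (void *m) override { join_stmt = m; }
};

int main ()
{
  gimple_parallel par (0);
  par.note_fork (nullptr);
}

(The word-diff below shows how close the two existing copies already are.)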

For example:

$ sed -e s%parallel_g%parallel%g < gcc/oacc-neuter-bcast.c > 
gcc/oacc-neuter-bcast.c_
$ git diff --no-index --word-diff -b --patience gcc/config/nvptx/nvptx.c 
gcc/oacc-neuter-bcast.c_
[...]
/* Loop structure of the function.  The entire function is described as
   a NULL loop.  */
@@ -3229,17 +80,21 @@ struct parallel
  basic_block forked_block;
  basic_block join_block;

  [-rtx_insn *forked_insn;-]
[-  rtx_insn *join_insn;-]{+gimple *forked_stmt;+}
{+  gimple *join_stmt;+}

  [-rtx_insn *fork_insn;-]
[-  rtx_insn *joining_insn;-]{+gimple *fork_stmt;+}
{+  gimple *joining_stmt;+}

  /* Basic blocks in this parallel, but not in child parallels.  The
 FORKED and JOINING blocks are in the partition.  The FORK and JOIN
 blocks are not.  */
  auto_vec blocks;

  {+tree record_type;+}
{+  tree sender_decl;+}
{+  tree receiver_decl;+}

public:
  parallel (parallel *parent, unsigned mode);
  ~parallel ();
@@ -3252,8 +107,12 @@ parallel::parallel (parallel *parent_, unsigned mask_)
  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
{
  forked_block = join_block = 0;
  [-forked_insn-]{+forked_stmt+} = [-join_insn-]{+join_stmt+} = [-0;-]
[-  fork_insn-]{+NULL;+}
{+  fork_stmt+} = [-joining_insn-]{+joining_stmt+} = [-0;-]{+NULL;+}

{+  record_type = NULL_TREE;+}
{+  sender_decl = NULL_TREE;+}
{+  receiver_decl = NULL_TREE;+}

  if (parent)
{
@@ -3268,12 +127,54 @@ parallel::~parallel ()
  delete next;
}
[...]
/* Split basic blocks such that each forked and join unspecs are at
   the start of their basic blocks.  Thus afterwards each block will
@@ -3284,111 +185,168 @@ typedef auto_vec insn_bb_vec_t;
   used when finding partitions.  */

static void
[-nvptx_split_blocks (bb_insn_map_t-]{+omp_sese_split_blocks 
(bb_stmt_map_t+} *map)
{
  [-insn_bb_vec_t-]{+auto_vec+} worklist;
  basic_block block;
[-  rtx_insn *insn;-]

  /* Locate all the reorg instructions of interest.  */
  FOR_ALL_BB_FN (block, cfun)
{
[-  bool seen_insn = false;-]

  /* Clear visited flag, for use by parallel locator  */
  block->flags &= ~BB_VISITED;

  [-FOR_BB_INSNS (block, insn)-]{+for (gimple_stmt_iterator gsi = 
gsi_start_bb (block);+}
{+ !gsi_end_p (gsi);+}
{+ gsi_next ())+}
{
[...]
/* Dump this parallel and all its inner parallels.  */

static void
[-nvptx_dump_pars-]{+omp_sese_dump_pars+} (parallel *par, unsigned depth)
{
  fprintf (dump_file, "%u: mask %d {+(%s)+} head=%d, tail=%d\n",
   depth, par->mask, {+mask_name (par->mask),+}
   par->forked_block ? par->forked_block->index : -1,
   par->join_block ? par->join_block->index : -1);

@@ -3399,10 +357,10 @@ nvptx_dump_pars (parallel *par, unsigned depth)
fprintf (dump_file, " %d", block->index);
  fprintf (dump_file, "\n");
  if (par->inner)
[-nvptx_dump_pars-]{+omp_sese_dump_pars+} (par->inner, depth + 1);

  if (par->next)
[-nvptx_dump_pars-]{+omp_sese_dump_pars+} (par->next, depth);
}

/* If BLOCK contains a fork/join marker, process it to create or
@@ -3410,60 +368,84 @@ nvptx_dump_pars (parallel *par, unsigned depth)
   and then walk successor blocks.   */

static parallel *
[-nvptx_find_par (bb_insn_map_t-]{+omp_sese_find_par 

[PATCH v2] x86: Update STORE_MAX_PIECES

2021-08-04 Thread H.J. Lu via Gcc-patches
On Tue, Aug 3, 2021 at 6:56 AM H.J. Lu  wrote:
>
> 1. Update x86 STORE_MAX_PIECES to use OImode and XImode only if inter-unit
> move is enabled since x86 uses vec_duplicate, which is enabled only when
> inter-unit move is enabled, to implement store_by_pieces.
> 2. Update op_by_pieces_d::op_by_pieces_d to set m_max_size to
> STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES for
> compare_by_pieces.
>
> gcc/
>
> PR target/101742
> * expr.c (op_by_pieces_d::op_by_pieces_d): Set m_max_size to
> STORE_MAX_PIECES for store_by_pieces and to COMPARE_MAX_PIECES
> for compare_by_pieces.
> * config/i386/i386.h (STORE_MAX_PIECES): Use OImode and XImode
> only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.
>
> gcc/testsuite/
>
> PR target/101742
> * gcc.target/i386/pr101742a.c: New test.
> * gcc.target/i386/pr101742b.c: Likewise.
> ---
>  gcc/config/i386/i386.h| 20 +++-
>  gcc/expr.c|  6 +-
>  gcc/testsuite/gcc.target/i386/pr101742a.c | 16 
>  gcc/testsuite/gcc.target/i386/pr101742b.c |  4 
>  4 files changed, 36 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101742b.c
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index bed9cd9da18..9b416abd5f4 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -1783,15 +1783,17 @@ typedef struct ix86_args {
>  /* STORE_MAX_PIECES is the number of bytes at a time that we can
> store efficiently.  */
>  #define STORE_MAX_PIECES \
> -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> -   ? 64 \
> -   : ((TARGET_AVX \
> -   && !TARGET_PREFER_AVX128 \
> -   && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> -  ? 32 \
> -  : ((TARGET_SSE2 \
> - && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> -? 16 : UNITS_PER_WORD)))
> +  (TARGET_INTER_UNIT_MOVES_TO_VEC \
> +   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> +  ? 64 \
> +  : ((TARGET_AVX \
> + && !TARGET_PREFER_AVX128 \
> + && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
> + ? 32 \
> + : ((TARGET_SSE2 \
> + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> + ? 16 : UNITS_PER_WORD))) \
> +   : UNITS_PER_WORD)
>
>  /* If a memory-to-memory move would take MOVE_RATIO or more simple
> move-instruction pairs, we will do a cpymem or libcall instead.

expr.c has been fixed.   Here is the v2 patch for x86 backend.
OK for master?

Thanks.

-- 
H.J.
From 0f8d9c643eb5e74bfc4951bf7d608f40f5f64275 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Tue, 3 Aug 2021 06:17:22 -0700
Subject: [PATCH v2] x86: Update STORE_MAX_PIECES

Update STORE_MAX_PIECES to use OImode and XImode only if inter-unit
move is enabled since vec_duplicate enabled by inter-unit move is
used to implement store_by_pieces.

gcc/

	PR target/101742
	* config/i386/i386.h (STORE_MAX_PIECES): Use OImode and XImode
	only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.

gcc/testsuite/

	PR target/101742
	* gcc.target/i386/pr101742a.c: New test.
	* gcc.target/i386/pr101742b.c: Likewise.
---
 gcc/config/i386/i386.h| 20 +++-
 gcc/testsuite/gcc.target/i386/pr101742a.c | 16 
 gcc/testsuite/gcc.target/i386/pr101742b.c |  4 
 3 files changed, 31 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101742a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101742b.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bed9cd9da18..9b416abd5f4 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1783,15 +1783,17 @@ typedef struct ix86_args {
 /* STORE_MAX_PIECES is the number of bytes at a time that we can
store efficiently.  */
 #define STORE_MAX_PIECES \
-  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
-   ? 64 \
-   : ((TARGET_AVX \
-   && !TARGET_PREFER_AVX128 \
-   && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
-  ? 32 \
-  : ((TARGET_SSE2 \
-	  && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
-	 ? 16 : UNITS_PER_WORD)))
+  (TARGET_INTER_UNIT_MOVES_TO_VEC \
+   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+  ? 64 \
+  : ((TARGET_AVX \
+	  && !TARGET_PREFER_AVX128 \
+	  && !TARGET_AVX256_SPLIT_UNALIGNED_STORE) \
+	  ? 32 \
+	  : ((TARGET_SSE2 \
+	  && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
+	  ? 16 : UNITS_PER_WORD))) \
+   : UNITS_PER_WORD)
 
 /* If a memory-to-memory move would take MOVE_RATIO or more simple
move-instruction pairs, we will do a cpymem or libcall instead.
diff --git a/gcc/testsuite/gcc.target/i386/pr101742a.c b/gcc/testsuite/gcc.target/i386/pr101742a.c
new file mode 100644
index 000..67ea40587dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101742a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 

[PATCH] x86: Avoid stack realignment when copying data with SSE register

2021-08-04 Thread H.J. Lu via Gcc-patches
To avoid stack realignment, call ix86_gen_scratch_sse_rtx to get a
scratch SSE register and use it to copy data from one memory location
to another.

gcc/

PR target/101772
* config/i386/i386-expand.c (ix86_expand_vector_move): Call
ix86_gen_scratch_sse_rtx to get a scratch SSE register to copy
data with an SSE register from one memory location to another.

gcc/testsuite/

PR target/101772
* gcc.target/i386/eh_return-2.c: New test.
---
 gcc/config/i386/i386-expand.c   |  6 +-
 gcc/testsuite/gcc.target/i386/eh_return-2.c | 16 
 2 files changed, 21 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/eh_return-2.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 1d469bf7221..bd21efa9530 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -613,7 +613,11 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
 arguments in memory.  */
   if (!register_operand (op0, mode)
  && !register_operand (op1, mode))
-   op1 = force_reg (mode, op1);
+   {
+ rtx scratch = ix86_gen_scratch_sse_rtx (mode);
+ emit_move_insn (scratch, op1);
+ op1 = scratch;
+   }
 
   tmp[0] = op0; tmp[1] = op1;
   ix86_expand_vector_move_misalign (mode, tmp);
diff --git a/gcc/testsuite/gcc.target/i386/eh_return-2.c 
b/gcc/testsuite/gcc.target/i386/eh_return-2.c
new file mode 100644
index 000..f23f4492dac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/eh_return-2.c
@@ -0,0 +1,16 @@
+/* PR target/101772  */
+/* { dg-do compile } */
+/* { dg-additional-options "-O0 -march=x86-64 -mstackrealign" } */
+
+struct _Unwind_Context _Unwind_Resume_or_Rethrow_this_context;
+
+void offset (int);
+
+struct _Unwind_Context {
+  void *reg[7];
+} _Unwind_Resume_or_Rethrow() {
+  struct _Unwind_Context cur_contextcur_context =
+  _Unwind_Resume_or_Rethrow_this_context;
+  offset(0);
+  __builtin_eh_return ((long) offset, 0);
+}
-- 
2.31.1



Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support

2021-08-04 Thread Thomas Schwinge
Hi!

On 2021-03-02T04:20:11-0800, Julian Brown  wrote:
> This patch implements worker-partitioning support in the middle end,
> by rewriting gimple.  [...]

Yay!

Given:

> --- /dev/null
> +++ b/gcc/oacc-neuter-bcast.c

> +/* A map from SSA names or var decls to record fields.  */
> +
> +typedef hash_map<tree, tree> field_map_t;
> +
> +/* For each propagation record type, this is a map from SSA names or var 
> decls
> +   to propagate, to the field in the record type that should be used for
> +   transmission and reception.  */
> +
> +typedef hash_map<tree, field_map_t *> record_field_map_t;
> +
> +static GTY(()) record_field_map_t *field_map;

Per 'gcc/doc/gty.texi': "Whenever you [...] create a new source file
containing 'GTY' markers, [...] add the filename to the 'GTFILES'
variable in 'Makefile.in'.  [...] The generated header file should be
included after everything else in the source file."  Thus:

--- gcc/Makefile.in
+++ gcc/Makefile.in
@@ -2720,2 +2720,3 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h 
$(srcdir)/coretypes.h \
   $(srcdir)/omp-general.h \
+  $(srcdir)/oacc-neuter-bcast.c \
   @all_gtfiles@
--- gcc/oacc-neuter-bcast.c
+++ gcc/oacc-neuter-bcast.c
@@ -1514 +1514,4 @@ make_pass_oacc_gimple_workers (gcc::context *ctxt)
 }
+
+
+#include "gt-oacc-neuter-bcast.h"

That however results in:

[...]
build/gengtype  \
-r gtype.state
warning: structure `field_map_t' used but not defined
gengtype: Internal error: abort in error_at_line, at gengtype.c:111
make[2]: *** [Makefile:2796: s-gtype] Error 1
[...]

I shall try to figure out the right GC annotations to make the 'typedef's
known to the GC machinery (unless somebody can tell me offhand) -- but
is it actually necessary to allocate this as GC memory?

> +void
> +oacc_do_neutering (void)
> +{
> +  [...]
> +  field_map = record_field_map_t::create_ggc (40);
> +  [...]
> +  FOR_ALL_BB_FN (bb, cfun)
> +{
> +  propagation_set *ws_prop = prop_set[bb->index];
> +  if (ws_prop)
> + {
> +   tree record_type = lang_hooks.types.make_type (RECORD_TYPE);
> +   [...]
> +   field_map->put (record_type, field_map_t::create_ggc (17));
> +   [...]
> +}
> +  [...]
> +}

'oacc_do_neutering' is the 'execute' function of the pass, so that means
every time this executes, a fresh 'field_map' is set up, no state
persists across runs (assuming I'm understanding that correctly).  Why
don't we simply use standard (non-GC) memory management for that?  "For
convenience" shall be fine as an answer ;-) -- but maybe instead of
figuring out the right GC annotations, changing the memory management
will be easier?  (Or, of course, maybe I completely misunderstood that?)


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


[PATCH v2] by_pieces: Pass MAX_PIECES to op_by_pieces_d

2021-08-04 Thread H.J. Lu via Gcc-patches
On Wed, Aug 4, 2021 at 12:27 AM Richard Sandiford
 wrote:
>
> "H.J. Lu via Gcc-patches"  writes:
> > @@ -1122,8 +1122,8 @@ class op_by_pieces_d
> > and its associated FROM_CFN_DATA can be used to replace loads with
> > constant values.  LEN describes the length of the operation.  */
> >
> > -op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
> > - rtx from, bool from_load,
> > +op_by_pieces_d::op_by_pieces_d (unsigned int max_pieces, rtx to,
> > + bool to_load, rtx from, bool from_load,
> >   by_pieces_constfn from_cfn,
> >   void *from_cfn_data,
> >   unsigned HOST_WIDE_INT len,
>
> The comment above the function needs to describe the new parameter.
>
> OK with that change, thanks.
>

This is the patch I am checking in.

Thanks.

---
H.J.
From 27343601ab064553eac695ed58e741c7b2f6059d Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Tue, 3 Aug 2021 06:17:22 -0700
Subject: [PATCH v2] by_pieces: Pass MAX_PIECES to op_by_pieces_d

Pass MAX_PIECES to op_by_pieces_d::op_by_pieces_d for move, store and
compare.

	PR target/101742
	* expr.c (op_by_pieces_d::op_by_pieces_d): Add a max_pieces
	argument to set m_max_size.
	(move_by_pieces_d): Pass MOVE_MAX_PIECES to op_by_pieces_d.
	(store_by_pieces_d): Pass STORE_MAX_PIECES to op_by_pieces_d.
	(compare_by_pieces_d): Pass COMPARE_MAX_PIECES to op_by_pieces_d.

diff --git a/gcc/expr.c b/gcc/expr.c
index b65cfcfdcd1..096c0315ecc 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -1110,8 +1110,8 @@ class op_by_pieces_d
   }
 
  public:
-  op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
-		  unsigned HOST_WIDE_INT, unsigned int, bool,
+  op_by_pieces_d (unsigned int, rtx, bool, rtx, bool, by_pieces_constfn,
+		  void *, unsigned HOST_WIDE_INT, unsigned int, bool,
 		  bool = false);
   void run ();
 };
@@ -1120,10 +1120,12 @@ class op_by_pieces_d
objects named TO and FROM, which are identified as loads or stores
by TO_LOAD and FROM_LOAD.  If FROM is a load, the optional FROM_CFN
and its associated FROM_CFN_DATA can be used to replace loads with
-   constant values.  LEN describes the length of the operation.  */
+   constant values.  MAX_PIECES describes the maximum number of bytes
+   at a time which can be moved efficiently.  LEN describes the length
+   of the operation.  */
 
-op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
-rtx from, bool from_load,
+op_by_pieces_d::op_by_pieces_d (unsigned int max_pieces, rtx to,
+bool to_load, rtx from, bool from_load,
 by_pieces_constfn from_cfn,
 void *from_cfn_data,
 unsigned HOST_WIDE_INT len,
@@ -1131,7 +1133,7 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
 bool qi_vector_mode)
   : m_to (to, to_load, NULL, NULL),
 m_from (from, from_load, from_cfn, from_cfn_data),
-m_len (len), m_max_size (MOVE_MAX_PIECES + 1),
+m_len (len), m_max_size (max_pieces + 1),
 m_push (push), m_qi_vector_mode (qi_vector_mode)
 {
   int toi = m_to.get_addr_inc ();
@@ -1324,8 +1326,8 @@ class move_by_pieces_d : public op_by_pieces_d
  public:
   move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len,
 		unsigned int align)
-: op_by_pieces_d (to, false, from, true, NULL, NULL, len, align,
-		  PUSHG_P (to))
+: op_by_pieces_d (MOVE_MAX_PIECES, to, false, from, true, NULL,
+		  NULL, len, align, PUSHG_P (to))
   {
   }
   rtx finish_retmode (memop_ret);
@@ -1421,8 +1423,8 @@ class store_by_pieces_d : public op_by_pieces_d
   store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
 		 unsigned HOST_WIDE_INT len, unsigned int align,
 		 bool qi_vector_mode)
-: op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len,
-		  align, false, qi_vector_mode)
+: op_by_pieces_d (STORE_MAX_PIECES, to, false, NULL_RTX, true, cfn,
+		  cfn_data, len, align, false, qi_vector_mode)
   {
   }
   rtx finish_retmode (memop_ret);
@@ -1618,8 +1620,8 @@ class compare_by_pieces_d : public op_by_pieces_d
   compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn,
 		   void *op1_cfn_data, HOST_WIDE_INT len, int align,
 		   rtx_code_label *fail_label)
-: op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len,
-		  align, false)
+: op_by_pieces_d (COMPARE_MAX_PIECES, op0, true, op1, true, op1_cfn,
+		  op1_cfn_data, len, align, false)
   {
 m_fail_label = fail_label;
   }


Re: [PATCH] omp-low.c split

2021-08-04 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 04, 2021 at 02:40:27PM +0200, Thomas Schwinge wrote:
> Small fix-up for r243673 (Git commit 629b3d75c8c5a244d891a9c292bca6912d4b0dd9)
> "Split omp-low into multiple files".
> 
>   gcc/
>   * Makefile.in (GTFILES): Remove '$(srcdir)/omp-offload.c'.

Ok, thanks.
> ---
>  gcc/Makefile.in | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index a9c9b506034..a3d9ee797df 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -2693,7 +2693,6 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h 
> $(srcdir)/coretypes.h \
>$(srcdir)/tree-ssa-operands.h \
>$(srcdir)/tree-profile.c $(srcdir)/tree-nested.c \
>$(srcdir)/omp-offload.h \
> -  $(srcdir)/omp-offload.c \
>$(srcdir)/omp-general.c \
>$(srcdir)/omp-low.c \
>$(srcdir)/targhooks.c $(out_file) $(srcdir)/passes.c \
> -- 
> 2.30.2
> 


Jakub



Re: [PATCH 0/3] [i386] Support cond_{smax, smin, umax, umin, xor, ior, and} for vector modes under AVX512

2021-08-04 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 4, 2021 at 8:39 PM liuhongt  wrote:
>
> Hi:
>   Together with the previous 3 patches, all cond_op expanders of vector
> modes are supported (if they have a corresponding avx512 mask instruction).
Oh, after double-checking, I realize there are still shift instructions
left; I will support them in another patch:
OPTAB_D (cond_ashl_optab, "cond_ashl$a")
OPTAB_D (cond_ashr_optab, "cond_ashr$a")
OPTAB_D (cond_lshr_optab, "cond_lshr$a")
>
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>
> liuhongt (3):
>   [i386] Support cond_{smax,smin,umax,umin} for vector integer modes
> under AVX512.
>   [i386] Support cond_{smax,smin} for vector float/double modes under
> AVX512.
>   [i386] Support cond_{xor,ior,and} for vector integer mode under
> AVX512.
>
>  gcc/config/i386/sse.md| 54 +
>  .../gcc.target/i386/cond_op_anylogic_d-1.c| 38 +
>  .../gcc.target/i386/cond_op_anylogic_d-2.c| 78 +++
>  .../gcc.target/i386/cond_op_anylogic_q-1.c| 10 +++
>  .../gcc.target/i386/cond_op_anylogic_q-2.c|  5 ++
>  .../gcc.target/i386/cond_op_maxmin_b-1.c  |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_b-2.c  |  6 ++
>  .../gcc.target/i386/cond_op_maxmin_d-1.c  | 41 ++
>  .../gcc.target/i386/cond_op_maxmin_d-2.c  | 67 
>  .../gcc.target/i386/cond_op_maxmin_double-1.c | 39 ++
>  .../gcc.target/i386/cond_op_maxmin_double-2.c | 67 
>  .../gcc.target/i386/cond_op_maxmin_float-1.c  |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_float-2.c  |  5 ++
>  .../gcc.target/i386/cond_op_maxmin_q-1.c  |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_q-2.c  |  5 ++
>  .../gcc.target/i386/cond_op_maxmin_ub-1.c |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_ub-2.c |  6 ++
>  .../gcc.target/i386/cond_op_maxmin_ud-1.c |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_ud-2.c |  5 ++
>  .../gcc.target/i386/cond_op_maxmin_uq-1.c |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_uq-2.c |  5 ++
>  .../gcc.target/i386/cond_op_maxmin_uw-1.c |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_uw-2.c |  6 ++
>  .../gcc.target/i386/cond_op_maxmin_w-1.c  |  8 ++
>  .../gcc.target/i386/cond_op_maxmin_w-2.c  |  6 ++
>  25 files changed, 507 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-2.c
>
> --
> 2.18.1
>


-- 
BR,
Hongtao


Re: [PATCH] omp-low.c split

2021-08-04 Thread Thomas Schwinge
Hi!

On 2016-12-09T14:08:21+0100, Martin Jambor  wrote:
> this is the promised attempt at splitting omp-low.c [...]

> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in

> @@ -2479,8 +2483,10 @@ GTFILES = $(CPP_ID_DATA_H) $(srcdir)/input.h 
> $(srcdir)/coretypes.h \
>$(srcdir)/tree-scalar-evolution.c \
>$(srcdir)/tree-ssa-operands.h \
>$(srcdir)/tree-profile.c $(srcdir)/tree-nested.c \
> +  $(srcdir)/omp-device.h \
> +  $(srcdir)/omp-device.c \
> +  $(srcdir)/omp-expand.c \
>$(srcdir)/omp-low.c \
> -  $(srcdir)/omp-low.h \
>$(srcdir)/targhooks.c $(out_file) $(srcdir)/passes.c 
> $(srcdir)/cgraphunit.c \
>$(srcdir)/cgraphclones.c \
>$(srcdir)/tree-phinodes.c \

'gcc/omp-device.*' eventually got renamed to 'gcc/omp-offload.*'.

OK to push the attached "Remove 'gcc/omp-offload.c' from 'GTFILES'"?

| Given that it doesn't contain any 'GTY' markers, no 'gcc/gt-omp-offload.h' 
file
| gets generated (and '#include'd anywhere).


Grüße
 Thomas


-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955
>From 1af32cf74a008a48328e82a6730b984f602b9979 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Wed, 4 Aug 2021 13:41:22 +0200
Subject: [PATCH] Remove 'gcc/omp-offload.c' from 'GTFILES'

Given that it doesn't contain any 'GTY' markers, no 'gcc/gt-omp-offload.h' file
gets generated (and '#include'd anywhere).

Small fix-up for r243673 (Git commit 629b3d75c8c5a244d891a9c292bca6912d4b0dd9)
"Split omp-low into multiple files".

	gcc/
	* Makefile.in (GTFILES): Remove '$(srcdir)/omp-offload.c'.
---
 gcc/Makefile.in | 1 -
 1 file changed, 1 deletion(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a9c9b506034..a3d9ee797df 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -2693,7 +2693,6 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h $(srcdir)/coretypes.h \
   $(srcdir)/tree-ssa-operands.h \
   $(srcdir)/tree-profile.c $(srcdir)/tree-nested.c \
   $(srcdir)/omp-offload.h \
-  $(srcdir)/omp-offload.c \
   $(srcdir)/omp-general.c \
   $(srcdir)/omp-low.c \
   $(srcdir)/targhooks.c $(out_file) $(srcdir)/passes.c \
-- 
2.30.2



[PATCH 2/3] [i386] Support cond_{smax, smin} for vector float/double modes under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

* config/i386/sse.md (cond_): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_double-1.c: New test.
* gcc.target/i386/cond_op_maxmin_double-2.c: New test.
* gcc.target/i386/cond_op_maxmin_float-1.c: New test.
* gcc.target/i386/cond_op_maxmin_float-2.c: New test.
---
 gcc/config/i386/sse.md| 18 +
 .../gcc.target/i386/cond_op_maxmin_double-1.c | 39 +++
 .../gcc.target/i386/cond_op_maxmin_double-2.c | 67 +++
 .../gcc.target/i386/cond_op_maxmin_float-1.c  |  8 +++
 .../gcc.target/i386/cond_op_maxmin_float-2.c  |  5 ++
 5 files changed, 137 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-2.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6035411ea75..51733a3849d 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2376,6 +2376,24 @@ (define_insn "*sse_vmrsqrtv4sf2"
(set_attr "prefix" "orig,vex")
(set_attr "mode" "SF")])
 
+(define_expand "cond_"
+  [(set (match_operand:VF 0 "register_operand")
+   (vec_merge:VF
+ (smaxmin:VF
+   (match_operand:VF 2 "vector_operand")
+   (match_operand:VF 3 "vector_operand"))
+ (match_operand:VF 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  " == 64 || TARGET_AVX512VL"
+{
+  emit_insn (gen_3_mask (operands[0],
+operands[2],
+operands[3],
+operands[4],
+operands[1]));
+  DONE;
+})
+
 (define_expand "3"
   [(set (match_operand:VF 0 "register_operand")
(smaxmin:VF
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c 
b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c
new file mode 100644
index 000..eda8e1974b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
+/* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
+/* { dg-final { scan-assembler-times "vmaxpd"  1 } } */
+/* { dg-final { scan-assembler-times "vminpd"  1 } } */
+
+#include
+#ifndef NUM
+#define NUM 800
+#endif
+#ifndef TYPE
+#define TYPE double
+#endif
+#ifndef FN_MAX
+#define FN_MAX fmax
+#endif
+#ifndef FN_MIN
+#define FN_MIN fmin
+#endif
+
+TYPE a[NUM], b[NUM], c[NUM], d[NUM], e[NUM], j[NUM];
+#define MAX FN_MAX
+#define MIN FN_MIN
+
+#define BIN(OPNAME, OP)\
+  void \
+  __attribute__ ((noipa,optimize ("Ofast")))   \
+  foo_##OPNAME ()  \
+  {\
+for (int i = 0; i != NUM; i++) \
+  if (b[i] < c[i]) \
+   a[i] = (OP (d[i], e[i]));   \
+  else \
+   a[i] = d[i] - e[i]; \
+  }
+
+BIN (max, MAX);
+BIN (min, MIN);
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c 
b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c
new file mode 100644
index 000..c50a831000a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c
@@ -0,0 +1,67 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512vl -mprefer-vector-width=256 -ffast-math" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#ifndef CHECK
+#define CHECK "avx512f-helper.h"
+#endif
+
+#include CHECK
+
+#include "cond_op_maxmin_double-1.c"
+#define BINO2(OPNAME, OP)  \
+  void \
+  __attribute__ ((noipa))  \
+  foo_o2_##OPNAME ()   \
+  {\
+for (int i = 0; i != NUM; i++) \
+  if (b[i] < c[i]) \
+   j[i] = OP(d[i], e[i]);  \
+  else \
+   j[i] = d[i] - e[i]; \
+  }
+
+BINO2 (max, MAX);
+BINO2 (min, MIN);
+
+static void
+test_256 (void)
+{
+  int sign = -1;
+  for (int i = 0; i != NUM; i++)
+{
+  a[i] = 0;
+  d[i] = i * 2;
+  e[i] = i * i * 3 - i * 9 + 153;
+  b[i] = i * 83;
+  c[i] = b[i] + sign;
+  sign *= -1;
+  j[i] = 1;
+}
+  foo_max ();
+  foo_o2_max ();
+  for (int i = 0; i != NUM; i++)
+{
+  if (a[i] != j[i])
+   abort ();
+  a[i] = 0;
+  b[i] = 1;
+}
+
+  foo_min ();
+  foo_o2_min ();
+  for (int i = 0; i != NUM; 

[PATCH 3/3] [i386] Support cond_{xor, ior, and} for vector integer mode under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

* config/i386/sse.md (cond_): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_anylogic_d-1.c: New test.
* gcc.target/i386/cond_op_anylogic_d-2.c: New test.
* gcc.target/i386/cond_op_anylogic_q-1.c: New test.
* gcc.target/i386/cond_op_anylogic_q-2.c: New test.
---
 gcc/config/i386/sse.md| 18 +
 .../gcc.target/i386/cond_op_anylogic_d-1.c| 38 +
 .../gcc.target/i386/cond_op_anylogic_d-2.c| 78 +++
 .../gcc.target/i386/cond_op_anylogic_q-1.c| 10 +++
 .../gcc.target/i386/cond_op_anylogic_q-2.c|  5 ++
 5 files changed, 149 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-2.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 51733a3849d..a46a2373547 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -14063,6 +14063,24 @@ (define_expand "3"
   DONE;
 })
 
+(define_expand "cond_"
+  [(set (match_operand:VI48_AVX512VL 0 "register_operand")
+   (vec_merge:VI48_AVX512VL
+ (any_logic:VI48_AVX512VL
+   (match_operand:VI48_AVX512VL 2 "vector_operand")
+   (match_operand:VI48_AVX512VL 3 "vector_operand"))
+ (match_operand:VI48_AVX512VL 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  "TARGET_AVX512F"
+{
+  emit_insn (gen_3_mask (operands[0],
+operands[2],
+operands[3],
+operands[4],
+operands[1]));
+  DONE;
+})
+
 (define_insn "3"
   [(set (match_operand:VI48_AVX_AVX512F 0 "register_operand" "=x,x,v")
(any_logic:VI48_AVX_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c 
b/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c
new file mode 100644
index 000..8951f4a3a27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump ".COND_AND" "optimized" } } */
+/* { dg-final { scan-tree-dump ".COND_XOR" "optimized" } } */
+/* { dg-final { scan-tree-dump ".COND_IOR" "optimized" } } */
+/* { dg-final { scan-assembler-times "vpxord"  1 } } */
+/* { dg-final { scan-assembler-times "vpord"  1 } } */
+/* { dg-final { scan-assembler-times "vpandd"  1 } } */
+
+typedef int int32;
+typedef unsigned int uint32;
+typedef long long int64;
+typedef unsigned long long uint64;
+
+#ifndef NUM
+#define NUM 800
+#endif
+#ifndef TYPE
+#define TYPE int
+#endif
+
+TYPE a[NUM], b[NUM], c[NUM], d[NUM], e[NUM], j[NUM];
+
+#define BIN(OPNAME, OP)\
+  void \
+  __attribute__ ((noipa,optimize ("O3")))  \
+  foo_##OPNAME ()  \
+  {\
+for (int i = 0; i != NUM; i++) \
+  if (b[i] < c[i]) \
+   a[i] = d[i] OP e[i];\
+  else \
+   a[i] = d[i] - e[i]; \
+  }
+
+BIN (and, &);
+BIN (ior, |);
+BIN (xor, ^);
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c 
b/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c
new file mode 100644
index 000..23ca4120cf2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c
@@ -0,0 +1,78 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512vl -mprefer-vector-width=256" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#ifndef CHECK
+#define CHECK "avx512f-helper.h"
+#endif
+
+#include CHECK
+
+#include "cond_op_anylogic_d-1.c"
+#define BINO2(OPNAME, OP)  \
+  void \
+  __attribute__ ((noipa,optimize ("O2")))  \
+  foo_o2_##OPNAME ()   \
+  {\
+for (int i = 0; i != NUM; i++) \
+  if (b[i] < c[i]) \
+   j[i] = d[i] OP e[i];\
+  else \
+   j[i] = d[i] - e[i]; \
+  }
+
+BINO2 (and, &);
+BINO2 (ior, |);
+BINO2 (xor, ^);
+
+static void
+test_256 (void)
+{
+  int sign = -1;
+  for (int i = 0; i != NUM; i++)
+{
+  a[i] = 0;
+  d[i] = i * 2;
+  e[i] = i * i * 3 - i * 9 + 153;
+  b[i] = i * 83;
+  c[i] = b[i] + sign;
+  sign *= -1;
+  j[i] = 1;
+}
+  foo_and ();
+  foo_o2_and ();
+  for (int i = 0; i != NUM; i++)
+{
+  if (a[i] != j[i])
+   abort ();
+  a[i] = 0;

[PATCH 1/3] [i386] Support cond_{smax, smin, umax, umin} for vector integer modes under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

* config/i386/sse.md (cond_): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_b-1.c: New test.
* gcc.target/i386/cond_op_maxmin_b-2.c: New test.
* gcc.target/i386/cond_op_maxmin_d-1.c: New test.
* gcc.target/i386/cond_op_maxmin_d-2.c: New test.
* gcc.target/i386/cond_op_maxmin_q-1.c: New test.
* gcc.target/i386/cond_op_maxmin_q-2.c: New test.
* gcc.target/i386/cond_op_maxmin_ub-1.c: New test.
* gcc.target/i386/cond_op_maxmin_ub-2.c: New test.
* gcc.target/i386/cond_op_maxmin_ud-1.c: New test.
* gcc.target/i386/cond_op_maxmin_ud-2.c: New test.
* gcc.target/i386/cond_op_maxmin_uq-1.c: New test.
* gcc.target/i386/cond_op_maxmin_uq-2.c: New test.
* gcc.target/i386/cond_op_maxmin_uw-1.c: New test.
* gcc.target/i386/cond_op_maxmin_uw-2.c: New test.
* gcc.target/i386/cond_op_maxmin_w-1.c: New test.
* gcc.target/i386/cond_op_maxmin_w-2.c: New test.
---
 gcc/config/i386/sse.md| 18 +
 .../gcc.target/i386/cond_op_maxmin_b-1.c  |  8 +++
 .../gcc.target/i386/cond_op_maxmin_b-2.c  |  6 ++
 .../gcc.target/i386/cond_op_maxmin_d-1.c  | 41 
 .../gcc.target/i386/cond_op_maxmin_d-2.c  | 67 +++
 .../gcc.target/i386/cond_op_maxmin_q-1.c  |  8 +++
 .../gcc.target/i386/cond_op_maxmin_q-2.c  |  5 ++
 .../gcc.target/i386/cond_op_maxmin_ub-1.c |  8 +++
 .../gcc.target/i386/cond_op_maxmin_ub-2.c |  6 ++
 .../gcc.target/i386/cond_op_maxmin_ud-1.c |  8 +++
 .../gcc.target/i386/cond_op_maxmin_ud-2.c |  5 ++
 .../gcc.target/i386/cond_op_maxmin_uq-1.c |  8 +++
 .../gcc.target/i386/cond_op_maxmin_uq-2.c |  5 ++
 .../gcc.target/i386/cond_op_maxmin_uw-1.c |  8 +++
 .../gcc.target/i386/cond_op_maxmin_uw-2.c |  6 ++
 .../gcc.target/i386/cond_op_maxmin_w-1.c  |  8 +++
 .../gcc.target/i386/cond_op_maxmin_w-2.c  |  6 ++
 17 files changed, 221 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-2.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f5968e04669..6035411ea75 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -13070,6 +13070,24 @@ (define_insn "*avx2_3"
(set_attr "prefix" "vex")
(set_attr "mode" "OI")])
 
+(define_expand "cond_"
+  [(set (match_operand:VI1248_AVX512VLBW 0 "register_operand")
+   (vec_merge:VI1248_AVX512VLBW
+ (maxmin:VI1248_AVX512VLBW
+   (match_operand:VI1248_AVX512VLBW 2 "nonimmediate_operand")
+   (match_operand:VI1248_AVX512VLBW 3 "nonimmediate_operand"))
+ (match_operand:VI1248_AVX512VLBW 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  "TARGET_AVX512F"
+{
+  emit_insn (gen_3_mask (operands[0],
+operands[2],
+operands[3],
+operands[4],
+operands[1]));
+  DONE;
+})
+
 (define_expand "3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand")
(vec_merge:VI48_AVX512VL
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c 
b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
new file mode 100644
index 000..78c6600f83b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" 
} */
+/* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
+/* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
+/* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
+/* { dg-final { scan-assembler-times "vpminsb"  1 } } */
+
+#include "cond_op_maxmin_d-1.c"
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-2.c 

[PATCH 0/3] [i386] Support cond_{smax, smin, umax, umin, xor, ior, and} for vector modes under AVX512

2021-08-04 Thread liuhongt via Gcc-patches
Hi:
  Together with the previous 3 patches, all cond_op expanders of vector
modes are supported (if they have a corresponding avx512 mask instruction).

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
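
For readers new to these optabs: each cond_* pattern implements a masked
operation whose per-element semantics are roughly "mask ? op(a, b) : else".
A minimal scalar reference sketch, purely illustrative and not part of the
series:

/* Scalar model of cond_smax (IFN_COND_MAX): where the mask is set the
   maximum is taken, elsewhere the "else" operand passes through.  */
static void
cond_smax_ref (int n, const bool *mask, const double *a,
               const double *b, const double *els, double *out)
{
  for (int i = 0; i < n; i++)
    out[i] = mask[i] ? (a[i] > b[i] ? a[i] : b[i]) : els[i];
}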
  
liuhongt (3):
  [i386] Support cond_{smax,smin,umax,umin} for vector integer modes
under AVX512.
  [i386] Support cond_{smax,smin} for vector float/double modes under
AVX512.
  [i386] Support cond_{xor,ior,and} for vector integer mode under
AVX512.

 gcc/config/i386/sse.md| 54 +
 .../gcc.target/i386/cond_op_anylogic_d-1.c| 38 +
 .../gcc.target/i386/cond_op_anylogic_d-2.c| 78 +++
 .../gcc.target/i386/cond_op_anylogic_q-1.c| 10 +++
 .../gcc.target/i386/cond_op_anylogic_q-2.c|  5 ++
 .../gcc.target/i386/cond_op_maxmin_b-1.c  |  8 ++
 .../gcc.target/i386/cond_op_maxmin_b-2.c  |  6 ++
 .../gcc.target/i386/cond_op_maxmin_d-1.c  | 41 ++
 .../gcc.target/i386/cond_op_maxmin_d-2.c  | 67 
 .../gcc.target/i386/cond_op_maxmin_double-1.c | 39 ++
 .../gcc.target/i386/cond_op_maxmin_double-2.c | 67 
 .../gcc.target/i386/cond_op_maxmin_float-1.c  |  8 ++
 .../gcc.target/i386/cond_op_maxmin_float-2.c  |  5 ++
 .../gcc.target/i386/cond_op_maxmin_q-1.c  |  8 ++
 .../gcc.target/i386/cond_op_maxmin_q-2.c  |  5 ++
 .../gcc.target/i386/cond_op_maxmin_ub-1.c |  8 ++
 .../gcc.target/i386/cond_op_maxmin_ub-2.c |  6 ++
 .../gcc.target/i386/cond_op_maxmin_ud-1.c |  8 ++
 .../gcc.target/i386/cond_op_maxmin_ud-2.c |  5 ++
 .../gcc.target/i386/cond_op_maxmin_uq-1.c |  8 ++
 .../gcc.target/i386/cond_op_maxmin_uq-2.c |  5 ++
 .../gcc.target/i386/cond_op_maxmin_uw-1.c |  8 ++
 .../gcc.target/i386/cond_op_maxmin_uw-2.c |  6 ++
 .../gcc.target/i386/cond_op_maxmin_w-1.c  |  8 ++
 .../gcc.target/i386/cond_op_maxmin_w-2.c  |  6 ++
 25 files changed, 507 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_d-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_anylogic_q-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_d-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_double-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_float-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_q-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_ud-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uq-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_uw-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_maxmin_w-2.c

-- 
2.18.1



Re: [PATCH 6/8] aarch64: Tweak MLA vector costs

2021-08-04 Thread Richard Sandiford via Gcc-patches
Richard Sandiford via Gcc-patches  writes:
> Richard Biener  writes:
>> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
>>  wrote:
>>>
>>> The issue-based vector costs currently assume that a multiply-add
>>> sequence can be implemented using a single instruction.  This is
>>> generally true for scalars (which have a 4-operand instruction)
>>> and SVE (which allows the output to be tied to any input).
>>> However, for Advanced SIMD, multiplying two values and adding
>>> an invariant will end up being a move and an MLA.
>>>
>>> The only target to use the issue-based vector costs is Neoverse V1,
>>> which would generally prefer SVE in this case anyway.  I therefore
>>> don't have a self-contained testcase.  However, the distinction
>>> becomes more important with a later patch.
>>
>> But we do cost any invariants separately (for the prologue), so they
>> should be available in a register.  How doesn't that work?
>
> Yeah, that works, and the prologue part is costed correctly.  But the
> result of an Advanced SIMD FMLA is tied to the accumulator input, so if
> the accumulator input is an invariant, we need a register move (in the
> loop body) before the FMLA.
>
> E.g. for:
>
> void
> f (float *restrict x, float *restrict y, float *restrict z, float a)
> {
>   for (int i = 0; i < 100; ++i)
> x[i] = y[i] * z[i];

+ 1.0, i.e. the loop body above should read "x[i] = y[i] * z[i] + 1.0;"; not sure where that went.

> }
>
> the scalar code is:
>
> .L2:
> ldr s1, [x1, x3]
> ldr s2, [x2, x3]
> fmadd   s1, s1, s2, s0
> str s1, [x0, x3]
> add x3, x3, 4
> cmp x3, 400
> bne .L2
>
> the SVE code is:
>
> .L2:
> ld1wz1.s, p0/z, [x1, x3, lsl 2]
> ld1wz0.s, p0/z, [x2, x3, lsl 2]
> fmadz0.s, p1/m, z1.s, z2.s
> st1wz0.s, p0, [x0, x3, lsl 2]
> add x3, x3, x5
> whilelo p0.s, w3, w4
> b.any   .L2
>
> but the Advanced SIMD code is:
>
> .L2:
> mov v0.16b, v3.16b   // < boo
> ldr q2, [x2, x3]
> ldr q1, [x1, x3]
> fmlav0.4s, v2.4s, v1.4s
> str q0, [x0, x3]
> add x3, x3, 16
> cmp x3, 400
> bne .L2
>
> Thanks,
> Richard
>
>
>>
>>> gcc/
>>> * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>> parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>> certainly require a MOV.
>>> (aarch64_count_ops): Update accordingly.
>>> ---
>>>  gcc/config/aarch64/aarch64.c | 25 ++---
>>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 084f8caa0da..19045ef6944 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info 
>>> stmt_info)
>>>
>>>  /* Return true if STMT_INFO is the second part of a two-statement 
>>> multiply-add
>>> or multiply-subtract sequence that might be suitable for fusing into a
>>> -   single instruction.  */
>>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>>> +   VEC_* flags.  */
>>>  static bool
>>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +   unsigned int vec_flags)
>>>  {
>>>gassign *assign = dyn_cast (stmt_info->stmt);
>>>if (!assign)
>>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, 
>>> stmt_vec_info stmt_info)
>>>if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>> continue;
>>>
>>> +  if (vec_flags & VEC_ADVSIMD)
>>> +   {
>>> + /* Scalar and SVE code can tie the result to any FMLA input (or 
>>> none,
>>> +although that requires a MOVPRFX for SVE).  However, Advanced 
>>> SIMD
>>> +only supports MLA forms, so will require a move if the result
>>> +cannot be tied to the accumulator.  The most important case in
>>> +which this is true is when the accumulator input is invariant. 
>>>  */
>>> + rhs = gimple_op (assign, 3 - i);
>>> + if (TREE_CODE (rhs) != SSA_NAME)
>>> +   return false;
>>> + def_stmt_info = vinfo->lookup_def (rhs);
>>> + if (!def_stmt_info
>>> + || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>>> +   return false;
>>> +   }
>>> +
>>>return true;
>>>  }
>>>return false;
>>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, 
>>> aarch64_vector_costs *costs,
>>>  }
>>>
>>>/* Assume that multiply-adds will become a single operation.  */
>>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, 

Re: [PATCH 6/8] aarch64: Tweak MLA vector costs

2021-08-04 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
>  wrote:
>>
>> The issue-based vector costs currently assume that a multiply-add
>> sequence can be implemented using a single instruction.  This is
>> generally true for scalars (which have a 4-operand instruction)
>> and SVE (which allows the output to be tied to any input).
>> However, for Advanced SIMD, multiplying two values and adding
>> an invariant will end up being a move and an MLA.
>>
>> The only target to use the issue-based vector costs is Neoverse V1,
>> which would generally prefer SVE in this case anyway.  I therefore
>> don't have a self-contained testcase.  However, the distinction
>> becomes more important with a later patch.
>
> But we do cost any invariants separately (for the prologue), so they
> should be available in a register.  How doesn't that work?

Yeah, that works, and the prologue part is costed correctly.  But the
result of an Advanced SIMD FMLA is tied to the accumulator input, so if
the accumulator input is an invariant, we need a register move (in the
loop body) before the FMLA.

E.g. for:

void
f (float *restrict x, float *restrict y, float *restrict z, float a)
{
  for (int i = 0; i < 100; ++i)
x[i] = y[i] * z[i];
}

the scalar code is:

.L2:
ldr s1, [x1, x3]
ldr s2, [x2, x3]
fmadd   s1, s1, s2, s0
str s1, [x0, x3]
add x3, x3, 4
cmp x3, 400
bne .L2

the SVE code is:

.L2:
ld1wz1.s, p0/z, [x1, x3, lsl 2]
ld1wz0.s, p0/z, [x2, x3, lsl 2]
fmadz0.s, p1/m, z1.s, z2.s
st1wz0.s, p0, [x0, x3, lsl 2]
add x3, x3, x5
whilelo p0.s, w3, w4
b.any   .L2

but the Advanced SIMD code is:

.L2:
mov v0.16b, v3.16b   // < boo
ldr q2, [x2, x3]
ldr q1, [x1, x3]
fmlav0.4s, v2.4s, v1.4s
str q0, [x0, x3]
add x3, x3, 16
cmp x3, 400
bne .L2

Thanks,
Richard


>
>> gcc/
>> * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>> parameter.  Detect cases in which an Advanced SIMD MLA would almost
>> certainly require a MOV.
>> (aarch64_count_ops): Update accordingly.
>> ---
>>  gcc/config/aarch64/aarch64.c | 25 ++---
>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 084f8caa0da..19045ef6944 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info 
>> stmt_info)
>>
>>  /* Return true if STMT_INFO is the second part of a two-statement 
>> multiply-add
>> or multiply-subtract sequence that might be suitable for fusing into a
>> -   single instruction.  */
>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>> +   VEC_* flags.  */
>>  static bool
>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>> +   unsigned int vec_flags)
>>  {
>>gassign *assign = dyn_cast (stmt_info->stmt);
>>if (!assign)
>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, 
>> stmt_vec_info stmt_info)
>>if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>> continue;
>>
>> +  if (vec_flags & VEC_ADVSIMD)
>> +   {
>> + /* Scalar and SVE code can tie the result to any FMLA input (or 
>> none,
>> +although that requires a MOVPRFX for SVE).  However, Advanced 
>> SIMD
>> +only supports MLA forms, so will require a move if the result
>> +cannot be tied to the accumulator.  The most important case in
>> +which this is true is when the accumulator input is invariant.  
>> */
>> + rhs = gimple_op (assign, 3 - i);
>> + if (TREE_CODE (rhs) != SSA_NAME)
>> +   return false;
>> + def_stmt_info = vinfo->lookup_def (rhs);
>> + if (!def_stmt_info
>> + || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>> +   return false;
>> +   }
>> +
>>return true;
>>  }
>>return false;
>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, 
>> aarch64_vector_costs *costs,
>>  }
>>
>>/* Assume that multiply-adds will become a single operation.  */
>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>  return;
>>
>>/* When costing scalar statements in vector code, the count already


Re: [PATCH 1/2] Add emulated gather capability to the vectorizer

2021-08-04 Thread Richard Biener
On Wed, 4 Aug 2021, Richard Sandiford wrote:

> Richard Biener  writes:
> > This adds a gather vectorization capability to the vectorizer
> > without target support by decomposing the offset vector, doing
> > scalar loads and then building a vector from the result.  This
> > is aimed mainly at cases where vectorizing the rest of the loop
> > offsets the cost of vectorizing the gather.
> >
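
To make the decomposition concrete, here is a minimal sketch (illustrative
only, using GCC's generic vector extension and made-up names) of what the
emulated gather amounts to for one four-lane vector:

typedef double v4df __attribute__ ((vector_size (32)));

/* Emulate a gather: extract each offset, do one scalar load per lane,
   then rebuild the vector with a constructor.  */
static v4df
emulated_gather (const double *base, const int *offset)
{
  v4df r = { base[offset[0]], base[offset[1]],
             base[offset[2]], base[offset[3]] };
  return r;
}
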
> > Note it's difficult to avoid vectorizing the offset load, but in
> > some cases later passes can turn the vector load + extract into
> > scalar loads, see the followup patch.
> >
> > On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> > to 219s on a Zen2 CPU which has its native gather instructions
> > disabled (using those the runtime instead increases to 254s)
> > using -Ofast -march=znver2 [-flto].  It turns out the critical
> > loops in this benchmark all perform gather operations.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > 2021-07-30  Richard Biener  
> >
> > * tree-vect-data-refs.c (vect_check_gather_scatter):
> > Include widening conversions only when the result is
> > still handled by native gather or the current offset
> > size does not already match the data size.
> > Also succeed analysis in case there's no native support,
> > noted by an IFN_LAST ifn and a NULL decl.
> > (vect_analyze_data_refs): Always consider gathers.
> > * tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
> > Test for no IFN gather rather than decl gather.
> > * tree-vect-stmts.c (vect_model_load_cost): Pass in the
> > gather-scatter info and cost emulated gathers accordingly.
> > (vect_truncate_gather_scatter_offset): Properly test for
> > no IFN gather.
> > (vect_use_strided_gather_scatters_p): Likewise.
> > (get_load_store_type): Handle emulated gathers and its
> > restrictions.
> > (vectorizable_load): Likewise.  Emulate them by extracting
> > scalar offsets, doing scalar loads and a vector construct.
> >
> > * gcc.target/i386/vect-gather-1.c: New testcase.
> > * gfortran.dg/vect/vect-8.f90: Adjust.
> > ---
> >  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 
> >  gcc/testsuite/gfortran.dg/vect/vect-8.f90 |   2 +-
> >  gcc/tree-vect-data-refs.c |  34 --
> >  gcc/tree-vect-patterns.c  |   2 +-
> >  gcc/tree-vect-stmts.c | 100 --
> >  5 files changed, 138 insertions(+), 18 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
> > b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > new file mode 100644
> > index 000..134aef39666
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> > +
> > +#ifndef INDEXTYPE
> > +#define INDEXTYPE int
> > +#endif
> > +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
> > +   double *luval, double *dst)
> > +{
> > +  double res = 0;
> > +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
> > +res += *luval * dst[*col];
> > +  return res;
> > +}
> > +
> > +/* With gather emulation this should be profitable to vectorize
> > +   even with plain SSE2.  */
> > +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> > diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 
> > b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > index 9994805d77f..cc1aebfbd84 100644
> > --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > @@ -706,5 +706,5 @@ END SUBROUTINE kernel
> >  
> >  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { 
> > target aarch64_sve } } }
> >  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { 
> > target { aarch64*-*-* && { ! aarch64_sve } } } } }
> > -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { 
> > target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> > +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
> > target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> >  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { 
> > target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 6995efba899..3c29ff04fd8 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -4007,8 +4007,27 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, 
> > loop_vec_info loop_vinfo,
> >   continue;
> > }
> >  
> > - if (TYPE_PRECISION (TREE_TYPE (op0))
> > - < TYPE_PRECISION (TREE_TYPE (off)))
> > + /* Include the conversion if it is widening and we're using
> > +the IFN path or the target can 

Re: [PATCH v3] Make loops_list support an optional loop_p root

2021-08-04 Thread Richard Biener via Gcc-patches
On Wed, Aug 4, 2021 at 12:47 PM Kewen.Lin  wrote:
>
> on 2021/8/4 6:01 PM, Richard Biener wrote:
> > On Wed, Aug 4, 2021 at 4:36 AM Kewen.Lin  wrote:
> >>
> >> on 2021/8/3 8:08 PM, Richard Biener wrote:
> >>> On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:
> 
>  on 2021/7/29 4:01 PM, Richard Biener wrote:
> > On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
> >>
> >> on 2021/7/22 8:56 PM, Richard Biener wrote:
> >>> On Tue, Jul 20, 2021 at 4:37
> >>> PM Kewen.Lin  wrote:
> 
>  Hi,
> 
>  This v2 has addressed some review comments/suggestions:
> 
>    - Use "!=" instead of "<" in function operator!= (const Iter )
>    - Add new CTOR loops_list (struct loops *loops, unsigned flags)
>  to support loop hierarchy tree rather than just a function,
>  and adjust to use loops* accordingly.
> >>>
> >>> I actually meant struct loop *, not struct loops * ;)  At the point
> >>> we pondered to make loop invariant motion work on single
> >>> loop nests we gave up not only but also because it iterates
> >>> over the loop nest but all the iterators only ever can process
> >>> all loops, not say, all loops inside a specific 'loop' (and
> >>> including that 'loop' if LI_INCLUDE_ROOT).  So the
> >>> CTOR would take the 'root' of the loop tree as argument.
> >>>
> >>> I see that doesn't trivially fit how loops_list works, at least
> >>> not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
> >>> could be adjusted to do ONLY_INNERMOST as well?
> >>>
> >>
> >>
> >> Thanks for the clarification!  I just realized that the previous
> >> version with struct loops* is problematic, all traversal is
> >> still bounded with outer_loop == NULL.  I think what you expect
> >> is to respect the given loop_p root boundary.  Since we just
> >> record the loops' nums, I think we still need the function* fn?
> >
> > Would it simplify things if we recorded the actual loop *?
> >
> 
>  I'm afraid it's unsafe to record the loop*.  I had the same
>  question why the loop iterator uses index rather than loop* when
>  I read this for the first time.  I guess the design of processing
>  loops allows its user to update or even delete the following
>  loops to be visited.  For example, when the user does some tricks
>  on one loop, then it duplicates the loop and its children to
>  somewhere and then removes the loop and its children, when
>  iterating onto its children later, the "index" way will check its
>  validity by get_loop at that point, but the "loop *" way will
>  have some recorded pointers to become dangling, can't do the
>  validity check on itself, seems to need a side linear search to
>  ensure the validity.
> 
> > There's still the to_visit reserve which needs a bound on
> > the number of loops for efficiency reasons.
> >
> 
>  Yes, I still keep the fn in the updated version.
> 
> >> So I add one optional argument loop_p root and update the
> >> visiting codes accordingly.  Before this change, the previous
> >> visiting uses the outer_loop == NULL as the termination condition,
> >> it perfectly includes the root itself, but with this given root,
> >> we have to use it as the termination condition to avoid to iterate
> >> onto its possible existing next.
> >>
> >> For LI_ONLY_INNERMOST, I was thinking whether we can use the
> >> code like:
> >>
> >> struct loops *fn_loops = loops_for_fn (fn)->larray;
> >> for (i = 0; vec_safe_iterate (fn_loops, i, ); i++)
> >> if (aloop != NULL
> >> && aloop->inner == NULL
> >> && flow_loop_nested_p (tree_root, aloop))
> >>  this->to_visit.quick_push (aloop->num);
> >>
> >> it has the stable bound, but if the given root only has several
> >> child loops, it can be much worse if there are many loops in fn.
> >> It seems impossible to predict the given root loop hierarchy size,
> >> maybe we can still use the original linear searching for the case
> >> loops_for_fn (fn) == root?  But since this visiting seems not so
> >> performance critical, I chose to share the code originally used
> >> for FROM_INNERMOST, hope it can have better readability and
> >> maintainability.
> >
> > I was indeed looking for something that has execution/storage
> > bound on the subtree we're interested in.  If we pull the CTOR
> > out-of-line we can probably keep the linear search for
> > LI_ONLY_INNERMOST when looking at the whole loop tree.
> >
> 
>  OK, I've moved the suggested single loop tree walker out-of-line
>  to cfgloop.c, and brought the linear search back for
>  LI_ONLY_INNERMOST when looking at the whole loop tree.
> 
> > It just seemed to me 

Re: [committed 2/2] libstdc++: Add [[nodiscard]] to sequence containers

2021-08-04 Thread Jonathan Wakely via Gcc-patches

On 04/08/21 12:56 +0100, Jonathan Wakely wrote:

... and container adaptors.

This adds the [[nodiscard]] attribute to functions with no side-effects
for the sequence containers and their iterators, and the debug versions
of those containers, and the container adaptors.


I don't plan to add any more [[nodiscard]] attributes for now, but
these two commits should demonstrate how to do it for anybody who
wants to contribute similar patches.

I didn't add tests that verify we do actually warn on each of those
functions, because there are hundreds of them, and I know they're
working because I had to alter existing tests to not warn.
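
For anyone wondering how the new attributes surface in user code, here is a
small self-contained example (assuming members such as vector::empty() are
among those now marked); the cast to void is the same idiom the adjusted
tests use:

// g++ -std=c++17 -Wall nodiscard-example.cc
#include <vector>

int main()
{
  std::vector<int> v{1, 2, 3};

  v.empty();         // warning: ignoring return value, declared with
                     // attribute 'nodiscard' [-Wunused-result]
  (void) v.empty();  // cast to void suppresses the warning

  return v.empty() ? 0 : static_cast<int>(v.size());
}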




[committed 2/2] libstdc++: Add [[nodiscard]] to sequence containers

2021-08-04 Thread Jonathan Wakely via Gcc-patches

... and container adaptors.

This adds the [[nodiscard]] attribute to functions with no side-effects
for the sequence containers and their iterators, and the debug versions
of those containers, and the container adaptors.

Tested powerpc64le-linux, committed to trunk.



commit 0d04fe49239d91787850036599164788f1c87785
Author: Jonathan Wakely 
Date:   Tue Aug 3 20:50:52 2021

libstdc++: Add [[nodiscard]] to sequence containers

... and container adaptors.

This adds the [[nodiscard]] attribute to functions with no side-effects
for the sequence containers and their iterators, and the debug versions
of those containers, and the container adaptors.

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* include/bits/forward_list.h: Add [[nodiscard]] to functions
with no side-effects.
* include/bits/stl_bvector.h: Likewise.
* include/bits/stl_deque.h: Likewise.
* include/bits/stl_list.h: Likewise.
* include/bits/stl_queue.h: Likewise.
* include/bits/stl_stack.h: Likewise.
* include/bits/stl_vector.h: Likewise.
* include/debug/deque: Likewise.
* include/debug/forward_list: Likewise.
* include/debug/list: Likewise.
* include/debug/safe_iterator.h: Likewise.
* include/debug/vector: Likewise.
* include/std/array: Likewise.
* testsuite/23_containers/array/creation/3_neg.cc: Use
-Wno-unused-result.
* testsuite/23_containers/array/debug/back1_neg.cc: Cast result
to void.
* testsuite/23_containers/array/debug/back2_neg.cc: Likewise.
* testsuite/23_containers/array/debug/front1_neg.cc: Likewise.
* testsuite/23_containers/array/debug/front2_neg.cc: Likewise.
* testsuite/23_containers/array/debug/square_brackets_operator1_neg.cc:
Likewise.
* testsuite/23_containers/array/debug/square_brackets_operator2_neg.cc:
Likewise.
* testsuite/23_containers/array/tuple_interface/get_neg.cc:
Adjust dg-error line numbers.
* testsuite/23_containers/deque/cons/clear_allocator.cc: Cast
result to void.
* testsuite/23_containers/deque/debug/invalidation/4.cc:
Likewise.
* testsuite/23_containers/deque/types/1.cc: Use
-Wno-unused-result.
* testsuite/23_containers/list/types/1.cc: Cast result to void.
* testsuite/23_containers/priority_queue/members/7161.cc:
Likewise.
* testsuite/23_containers/queue/members/7157.cc: Likewise.
* testsuite/23_containers/vector/59829.cc: Likewise.
* testsuite/23_containers/vector/ext_pointer/types/1.cc:
Likewise.
* testsuite/23_containers/vector/ext_pointer/types/2.cc:
Likewise.
* testsuite/23_containers/vector/types/1.cc: Use
-Wno-unused-result.

diff --git a/libstdc++-v3/include/bits/forward_list.h b/libstdc++-v3/include/bits/forward_list.h
index e61746848f6..ab6d9389194 100644
--- a/libstdc++-v3/include/bits/forward_list.h
+++ b/libstdc++-v3/include/bits/forward_list.h
@@ -150,10 +150,12 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   _Fwd_list_iterator(_Fwd_list_node_base* __n) noexcept
   : _M_node(__n) { }
 
+  [[__nodiscard__]]
   reference
   operator*() const noexcept
   { return *static_cast<_Node*>(this->_M_node)->_M_valptr(); }
 
+  [[__nodiscard__]]
   pointer
   operator->() const noexcept
   { return static_cast<_Node*>(this->_M_node)->_M_valptr(); }
@@ -176,6 +178,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   /**
*  @brief  Forward list iterator equality comparison.
*/
+  [[__nodiscard__]]
   friend bool
   operator==(const _Self& __x, const _Self& __y) noexcept
   { return __x._M_node == __y._M_node; }
@@ -184,6 +187,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   /**
*  @brief  Forward list iterator inequality comparison.
*/
+  [[__nodiscard__]]
   friend bool
   operator!=(const _Self& __x, const _Self& __y) noexcept
   { return __x._M_node != __y._M_node; }
@@ -229,10 +233,12 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   _Fwd_list_const_iterator(const iterator& __iter) noexcept
   : _M_node(__iter._M_node) { }
 
+  [[__nodiscard__]]
   reference
   operator*() const noexcept
   { return *static_cast<_Node*>(this->_M_node)->_M_valptr(); }
 
+  [[__nodiscard__]]
   pointer
   operator->() const noexcept
   { return static_cast<_Node*>(this->_M_node)->_M_valptr(); }
@@ -255,6 +261,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   /**
*  @brief  Forward list const_iterator equality comparison.
*/
+  [[__nodiscard__]]
   friend bool
   operator==(const _Self& __x, const 

[committed 1/2] libstdc++: Add [[nodiscard]] to iterators and related utilities

2021-08-04 Thread Jonathan Wakely via Gcc-patches
This adds [[nodiscard]] throughout , as proposed by P2377R0
(with some minor corrections).

The attribute is added for all modes from C++11 up, using
[[__nodiscard__]] or _GLIBCXX_NODISCARD where C++17 [[nodiscard]] can't
be used directly.


commit 240b01b0215f9e46ecf04267c8a3faeb19d4fe3c
Author: Jonathan Wakely 
Date:   Tue Aug 3 18:06:27 2021

libstdc++: Add [[nodiscard]] to iterators and related utilities

This adds [[nodiscard]] throughout , as proposed by P2377R0
(with some minor corrections).

The attribute is added for all modes from C++11 up, using
[[__nodiscard__]] or _GLIBCXX_NODISCARD where C++17 [[nodiscard]] can't
be used directly.

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* include/bits/iterator_concepts.h (iter_move): Add
[[nodiscard]].
* include/bits/range_access.h (begin, end, cbegin, cend)
(rbegin, rend, crbegin, crend, size, data, ssize): Likewise.
* include/bits/ranges_base.h (ranges::begin, ranges::end)
(ranges::cbegin, ranges::cend, ranges::rbegin, ranges::rend)
(ranges::crbegin, ranges::crend, ranges::size, ranges::ssize)
(ranges::empty, ranges::data, ranges::cdata): Likewise.
* include/bits/stl_iterator.h (reverse_iterator, __normal_iterator)
(back_insert_iterator, front_insert_iterator, insert_iterator)
(move_iterator, move_sentinel, common_iterator)
(counted_iterator): Likewise.
* include/bits/stl_iterator_base_funcs.h (distance, next, prev):
Likewise.
* include/bits/stream_iterator.h (istream_iterator)
(ostream_iterartor): Likewise.
* include/bits/streambuf_iterator.h (istreambuf_iterator)
(ostreambuf_iterator): Likewise.
* include/std/ranges (views::single, views::iota, views::all)
(views::filter, views::transform, views::take, views::take_while)
(views::drop, views::drop_while, views::join, views::lazy_split)
(views::split, views::counted, views::common, views::reverse)
(views::elements): Likewise.
* testsuite/20_util/rel_ops.cc: Use -Wno-unused-result.
* testsuite/24_iterators/move_iterator/greedy_ops.cc: Likewise.
* testsuite/24_iterators/normal_iterator/greedy_ops.cc:
Likewise.
* testsuite/24_iterators/reverse_iterator/2.cc: Likewise.
* testsuite/24_iterators/reverse_iterator/greedy_ops.cc:
Likewise.
* testsuite/21_strings/basic_string/range_access/char/1.cc:
Cast result to void.
* testsuite/21_strings/basic_string/range_access/wchar_t/1.cc:
Likewise.
* testsuite/21_strings/basic_string_view/range_access/char/1.cc:
Likewise.
* testsuite/21_strings/basic_string_view/range_access/wchar_t/1.cc:
Likewise.
* testsuite/23_containers/array/range_access.cc: Likewise.
* testsuite/23_containers/deque/range_access.cc: Likewise.
* testsuite/23_containers/forward_list/range_access.cc:
Likewise.
* testsuite/23_containers/list/range_access.cc: Likewise.
* testsuite/23_containers/map/range_access.cc: Likewise.
* testsuite/23_containers/multimap/range_access.cc: Likewise.
* testsuite/23_containers/multiset/range_access.cc: Likewise.
* testsuite/23_containers/set/range_access.cc: Likewise.
* testsuite/23_containers/unordered_map/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_multimap/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_multiset/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_set/range_access.cc:
Likewise.
* testsuite/23_containers/vector/range_access.cc: Likewise.
* testsuite/24_iterators/customization_points/iter_move.cc:
Likewise.
* testsuite/24_iterators/istream_iterator/sentinel.cc:
Likewise.
* testsuite/24_iterators/istreambuf_iterator/sentinel.cc:
Likewise.
* testsuite/24_iterators/move_iterator/dr2061.cc: Likewise.
* testsuite/24_iterators/operations/prev_neg.cc: Likewise.
* testsuite/24_iterators/ostreambuf_iterator/2.cc: Likewise.
* testsuite/24_iterators/range_access/range_access.cc:
Likewise.
* testsuite/24_iterators/range_operations/100768.cc: Likewise.
* testsuite/26_numerics/valarray/range_access2.cc: Likewise.
* testsuite/28_regex/range_access.cc: Likewise.
* testsuite/experimental/string_view/range_access/char/1.cc:
Likewise.
* testsuite/experimental/string_view/range_access/wchar_t/1.cc:

Re: [PATCH] vect: Tweak comparisons with existing epilogue loops

2021-08-04 Thread Richard Biener via Gcc-patches
On Tue, Aug 3, 2021 at 3:52 PM Richard Sandiford via Gcc-patches
 wrote:
>
> This patch uses a more accurate scalar iteration estimate when
> comparing the epilogue of a constant-iteration loop with a candidate
> replacement epilogue.
>
> In the testcase, the patch prevents a 1-to-3-element SVE epilogue
> from seeming better than a 64-bit Advanced SIMD epilogue.
>
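
(For instance, if the main vectorized loop has a known constant 10 iterations
and a vectorization factor of 4, the epilogue only ever has to handle
10 % 4 = 2 scalar iterations; candidate epilogues are now compared against
that bound instead of the loop's general likely-max estimate.  The numbers
here are just an illustration.)
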
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

OK.

Richard.

> Richard
>
>
> gcc/
> * tree-vect-loop.c (vect_better_loop_vinfo_p): Detect cases in
> which old_loop_vinfo is an epilogue loop that handles a constant
> number of iterations.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve/cost_model_12.c: New test.
> ---
>  .../gcc.target/aarch64/sve/cost_model_12.c| 19 +++
>  gcc/tree-vect-loop.c  | 10 +-
>  2 files changed, 28 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_12.c
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 0009d0964af..0a5b65adb04 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2778,7 +2778,15 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>
>/* Limit the VFs to what is likely to be the maximum number of iterations,
>   to handle cases in which at least one loop_vinfo is fully-masked.  */
> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> +  HOST_WIDE_INT estimated_max_niter;
> +  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
> +  unsigned HOST_WIDE_INT main_vf;
> +  if (main_loop
> +  && LOOP_VINFO_NITERS_KNOWN_P (main_loop)
> +  && LOOP_VINFO_VECT_FACTOR (main_loop).is_constant (_vf))
> +estimated_max_niter = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> +  else
> +estimated_max_niter = likely_max_stmt_executions_int (loop);
>if (estimated_max_niter != -1)
>  {
>if (known_le (estimated_max_niter, new_vf))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_12.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_12.c
> new file mode 100644
> index 000..4c5226e05de
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_12.c
> @@ -0,0 +1,19 @@
> +/* { dg-options "-O3 -mtune=neoverse-512tvb" } */
> +
> +void
> +f (float x[restrict 10][1024],
> +   float y[restrict 10][1024], float z)
> +{
> +  for (int i = 0; i < 10; ++i)
> +{
> +#pragma GCC unroll 10
> +  for (int j = 0; j < 10; ++j)
> +   x[j][i] = y[j][i] * z;
> +}
> +}
> +
> +/* We should unroll the outer loop, with 2x 16-byte vectors and 1x
> +   8-byte vectors.  */
> +/* { dg-final { scan-assembler-not {\tptrue\t} } } */
> +/* { dg-final { scan-assembler {\tv[0-9]+\.4s,} } } */
> +/* { dg-final { scan-assembler {\tv[0-9]+\.2s,} } } */


Re: [PATCH 6/8] aarch64: Tweak MLA vector costs

2021-08-04 Thread Richard Biener via Gcc-patches
On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
 wrote:
>
> The issue-based vector costs currently assume that a multiply-add
> sequence can be implemented using a single instruction.  This is
> generally true for scalars (which have a 4-operand instruction)
> and SVE (which allows the output to be tied to any input).
> However, for Advanced SIMD, multiplying two values and adding
> an invariant will end up being a move and an MLA.
>
> The only target to use the issue-based vector costs is Neoverse V1,
> which would generally prefer SVE in this case anyway.  I therefore
> don't have a self-contained testcase.  However, the distinction
> becomes more important with a later patch.

But we do cost any invariants separately (for the prologue), so they
should be available in a register.  How doesn't that work?

> gcc/
> * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
> parameter.  Detect cases in which an Advanced SIMD MLA would almost
> certainly require a MOV.
> (aarch64_count_ops): Update accordingly.
> ---
>  gcc/config/aarch64/aarch64.c | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 084f8caa0da..19045ef6944 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info 
> stmt_info)
>
>  /* Return true if STMT_INFO is the second part of a two-statement 
> multiply-add
> or multiply-subtract sequence that might be suitable for fusing into a
> -   single instruction.  */
> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
> +   a scalar one, otherwise analyze it as an operation on vectors with those
> +   VEC_* flags.  */
>  static bool
> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
> +   unsigned int vec_flags)
>  {
>gassign *assign = dyn_cast (stmt_info->stmt);
>if (!assign)
> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info)
>if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
> continue;
>
> +  if (vec_flags & VEC_ADVSIMD)
> +   {
> + /* Scalar and SVE code can tie the result to any FMLA input (or 
> none,
> +although that requires a MOVPRFX for SVE).  However, Advanced 
> SIMD
> +only supports MLA forms, so will require a move if the result
> +cannot be tied to the accumulator.  The most important case in
> +which this is true is when the accumulator input is invariant.  
> */
> + rhs = gimple_op (assign, 3 - i);
> + if (TREE_CODE (rhs) != SSA_NAME)
> +   return false;
> + def_stmt_info = vinfo->lookup_def (rhs);
> + if (!def_stmt_info
> + || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
> +   return false;
> +   }
> +
>return true;
>  }
>return false;
> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, 
> aarch64_vector_costs *costs,
>  }
>
>/* Assume that multiply-adds will become a single operation.  */
> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>  return;
>
>/* When costing scalar statements in vector code, the count already


Re: [PATCH 1/2] Add emulated gather capability to the vectorizer

2021-08-04 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> This adds a gather vectorization capability to the vectorizer
> without target support by decomposing the offset vector, doing
> scalar loads and then building a vector from the result.  This
> is aimed mainly at cases where vectorizing the rest of the loop
> offsets the cost of vectorizing the gather.
>
> Note it's difficult to avoid vectorizing the offset load, but in
> some cases later passes can turn the vector load + extract into
> scalar loads, see the followup patch.
>
> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> to 219s on a Zen2 CPU which has its native gather instructions
> disabled (using those the runtime instead increases to 254s)
> using -Ofast -march=znver2 [-flto].  It turns out the critical
> loops in this benchmark all perform gather operations.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> 2021-07-30  Richard Biener  
>
>   * tree-vect-data-refs.c (vect_check_gather_scatter):
>   Include widening conversions only when the result is
>   still handled by native gather or the current offset
>   size does not already match the data size.
>   Also succeed analysis in case there's no native support,
>   noted by an IFN_LAST ifn and a NULL decl.
>   (vect_analyze_data_refs): Always consider gathers.
>   * tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
>   Test for no IFN gather rather than decl gather.
>   * tree-vect-stmts.c (vect_model_load_cost): Pass in the
>   gather-scatter info and cost emulated gathers accordingly.
>   (vect_truncate_gather_scatter_offset): Properly test for
>   no IFN gather.
>   (vect_use_strided_gather_scatters_p): Likewise.
>   (get_load_store_type): Handle emulated gathers and its
>   restrictions.
>   (vectorizable_load): Likewise.  Emulate them by extracting
> scalar offsets, doing scalar loads and a vector construct.
>
>   * gcc.target/i386/vect-gather-1.c: New testcase.
>   * gfortran.dg/vect/vect-8.f90: Adjust.
> ---
>  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 
>  gcc/testsuite/gfortran.dg/vect/vect-8.f90 |   2 +-
>  gcc/tree-vect-data-refs.c |  34 --
>  gcc/tree-vect-patterns.c  |   2 +-
>  gcc/tree-vect-stmts.c | 100 --
>  5 files changed, 138 insertions(+), 18 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
> b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> new file mode 100644
> index 000..134aef39666
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> +
> +#ifndef INDEXTYPE
> +#define INDEXTYPE int
> +#endif
> +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
> + double *luval, double *dst)
> +{
> +  double res = 0;
> +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
> +res += *luval * dst[*col];
> +  return res;
> +}
> +
> +/* With gather emulation this should be profitable to vectorize
> +   even with plain SSE2.  */
> +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 
> b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> index 9994805d77f..cc1aebfbd84 100644
> --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> @@ -706,5 +706,5 @@ END SUBROUTINE kernel
>  
>  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target 
> aarch64_sve } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { target 
> { aarch64*-*-* && { ! aarch64_sve } } } } }
> -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { 
> target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
> target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target 
> { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index 6995efba899..3c29ff04fd8 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -4007,8 +4007,27 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, 
> loop_vec_info loop_vinfo,
> continue;
>   }
>  
> -   if (TYPE_PRECISION (TREE_TYPE (op0))
> -   < TYPE_PRECISION (TREE_TYPE (off)))
> +   /* Include the conversion if it is widening and we're using
> +  the IFN path or the target can handle the converted from
> +  offset or the current size is not already the same as the
> +  data vector element size.  */
> +   if ((TYPE_PRECISION (TREE_TYPE (op0))
> +< 

Re: [PATCH 5/8] aarch64: Tweak the cost of elementwise stores

2021-08-04 Thread Richard Biener via Gcc-patches
On Tue, Aug 3, 2021 at 2:09 PM Richard Sandiford via Gcc-patches
 wrote:
>
> When the vectoriser scalarises a strided store, it counts one
> scalar_store for each element plus one vec_to_scalar extraction
> for each element.  However, extracting element 0 is free on AArch64,
> so it should have zero cost.
>
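
(As an illustration: with four elements per vector, the scalarised strided
store used to be costed as four scalar_store operations plus four
vec_to_scalar extractions; with this change the lane-0 extraction is treated
as free, so only three extractions are counted.)
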
> I don't have a testcase that requires this for existing -mtune
> options, but it becomes more important with a later patch.
>
> gcc/
> * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
> function, split out from...
> (aarch64_detect_vector_stmt_subtype): ...here.
> (aarch64_add_stmt_cost): Treat extracting element 0 as free.
> ---
>  gcc/config/aarch64/aarch64.c | 22 +++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 36f11808916..084f8caa0da 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum 
> vect_cost_for_stmt type_of_cost,
>  }
>  }
>
> +/* Return true if an operation of kind KIND for STMT_INFO represents
> +   the extraction of an element from a vector in preparation for
> +   storing the element to memory.  */
> +static bool
> +aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
> +stmt_vec_info stmt_info)
> +{
> +  return (kind == vec_to_scalar
> + && STMT_VINFO_DATA_REF (stmt_info)
> + && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
> +}

It would be nice to put functions like this in tree-vectorizer.h in some
section marked with a comment to contain helpers for the target
add_stmt_cost.

>  /* Return true if STMT_INFO represents part of a reduction.  */
>  static bool
>  aarch64_is_reduction (stmt_vec_info stmt_info)
> @@ -14959,9 +14971,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, 
> vect_cost_for_stmt kind,
>/* Detect cases in which vec_to_scalar is describing the extraction of a
>   vector element in preparation for a scalar store.  The store itself is
>   costed separately.  */
> -  if (kind == vec_to_scalar
> -  && STMT_VINFO_DATA_REF (stmt_info)
> -  && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
> +  if (aarch64_is_store_elt_extraction (kind, stmt_info))
>  return simd_costs->store_elt_extra_cost;
>
>/* Detect SVE gather loads, which are costed as a single scalar_load
> @@ -15382,6 +15392,12 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void 
> *data, int count,
>   if (vectype && aarch64_sve_only_stmt_p (stmt_info, vectype))
> costs->saw_sve_only_op = true;
>
> + /* If we scalarize a strided store, the vectorizer costs one
> +vec_to_scalar for each element.  However, we can store the first
> +element using an FP store without a separate extract step.  */
> + if (aarch64_is_store_elt_extraction (kind, stmt_info))
> +   count -= 1;
> +
>   stmt_cost = aarch64_detect_scalar_stmt_subtype
> (vinfo, kind, stmt_info, stmt_cost);
>


Re: [PATCH 2/2] Rewrite more vector loads to scalar loads

2021-08-04 Thread Richard Biener via Gcc-patches
On Mon, Aug 2, 2021 at 3:41 PM Richard Biener  wrote:
>
> This teaches forwprop to rewrite more vector loads that are only
> used in BIT_FIELD_REFs as scalar loads.  This provides the
> remaining uplift to SPEC CPU 2017 510.parest_r on Zen 2 which
> has CPU gathers disabled.
>
> In particular vector load + vec_unpack + bit-field-ref is turned
> into (extending) scalar loads which avoids costly XMM/GPR
> transitions.  To not conflict with vector load + bit-field-ref
> + vector constructor matching to vector load + shuffle the
> extended transform is only done after vector lowering.
>
> Overall the two patches provide a 22% speedup of 510.parest_r.
>
> I'm in the process of confirming speedups of 500.perlbench_r,
> 557.xz_r, 549.fotonik3d_r and 554.roms_r as well as slowdowns
> of 503.bwaves_r, 507.cactuBSSN_r and 538.imagick_r.

I have confirmed the 500.perlbench_r and 557.xz_r speedups,
the 554.roms was noise, so were the 503.bwaves and
507.cactuBSSN_r slowdowns.  The 538.imagick_r slowdown
is real but it doesn't reproduce with -flto and analyzing it
doesn't show any effect of the two patches on the code
pointed to by perf.

I've now pushed [2/2] first because that makes more sense
and thus its effect can be independently assessed.

Richard.

> 2021-07-30  Richard Biener  
>
> * tree-ssa-forwprop.c (pass_forwprop::execute): Split
> out code to decompose vector loads ...
> (optimize_vector_load): ... here.  Generalize it to
> handle intermediate widening and TARGET_MEM_REF loads
> and apply it to loads with a supported vector mode as well.
>
> * gcc.target/i386/vect-gather-1.c: Amend.
> ---
>  gcc/testsuite/gcc.target/i386/vect-gather-1.c |   4 +-
>  gcc/tree-ssa-forwprop.c   | 244 +-
>  2 files changed, 185 insertions(+), 63 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
> b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> index 134aef39666..261b66be061 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details 
> -fdump-tree-forwprop4" } */
>
>  #ifndef INDEXTYPE
>  #define INDEXTYPE int
> @@ -16,3 +16,5 @@ double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
>  /* With gather emulation this should be profitable to vectorize
> even with plain SSE2.  */
>  /* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> +/* The index vector loads and promotions should be scalar after forwprop.  */
> +/* { dg-final { scan-tree-dump-not "vec_unpack" "forwprop4" } } */
> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index db3b18b275c..bd64b8e46bc 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -2757,6 +2757,182 @@ simplify_vector_constructor (gimple_stmt_iterator 
> *gsi)
>  }
>
>
> +/* Rewrite the vector load at *GSI to component-wise loads if the load
> +   is only used in BIT_FIELD_REF extractions with eventual intermediate
> +   widening.  */
> +
> +static void
> +optimize_vector_load (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  tree lhs = gimple_assign_lhs (stmt);
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +
> +  /* Gather BIT_FIELD_REFs to rewrite, looking through
> + VEC_UNPACK_{LO,HI}_EXPR.  */
> +  use_operand_p use_p;
> +  imm_use_iterator iter;
> +  bool rewrite = true;
> +  auto_vec bf_stmts;
> +  auto_vec worklist;
> +  worklist.quick_push (lhs);
> +  do
> +{
> +  tree def = worklist.pop ();
> +  unsigned HOST_WIDE_INT def_eltsize
> +   = TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (def;
> +  FOR_EACH_IMM_USE_FAST (use_p, iter, def)
> +   {
> + gimple *use_stmt = USE_STMT (use_p);
> + if (is_gimple_debug (use_stmt))
> +   continue;
> + if (!is_gimple_assign (use_stmt))
> +   {
> + rewrite = false;
> + break;
> +   }
> + enum tree_code use_code = gimple_assign_rhs_code (use_stmt);
> + tree use_rhs = gimple_assign_rhs1 (use_stmt);
> + if (use_code == BIT_FIELD_REF
> + && TREE_OPERAND (use_rhs, 0) == def
> + /* If its on the VEC_UNPACK_{HI,LO}_EXPR
> +def need to verify it is element aligned.  */
> + && (def == lhs
> + || (known_eq (bit_field_size (use_rhs), def_eltsize)
> + && constant_multiple_p (bit_field_offset (use_rhs),
> + def_eltsize
> +   {
> + bf_stmts.safe_push (use_stmt);
> + continue;
> +   }
> + /* Walk through one level of VEC_UNPACK_{LO,HI}_EXPR.  */
> + if (def == lhs
> + && (use_code == VEC_UNPACK_HI_EXPR
> + || use_code 

Re: [PATCH 2/6] [i386] Enable _Float16 type for TARGET_SSE2 and above.

2021-08-04 Thread Richard Biener via Gcc-patches
On Wed, Aug 4, 2021 at 4:39 AM Hongtao Liu  wrote:
>
> On Mon, Aug 2, 2021 at 2:31 PM liuhongt  wrote:
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386-modes.def (FLOAT_MODE): Define ieee HFmode.
> > * config/i386/i386.c (enum x86_64_reg_class): Add
> > X86_64_SSEHF_CLASS.
> > (merge_classes): Handle X86_64_SSEHF_CLASS.
> > (examine_argument): Ditto.
> > (construct_container): Ditto.
> > (classify_argument): Ditto, and set HFmode/HCmode to
> > X86_64_SSEHF_CLASS.
> > (function_value_32): Return _Float16/Complex Float16 by
> > %xmm0.
> > (function_value_64): Return _Float16/Complex Float16 by SSE
> > register.
> > (ix86_print_operand): Handle CONST_DOUBLE HFmode.
> > (ix86_secondary_reload): Require gpr as intermediate register
> > to store _Float16 from sse register when sse4 is not
> > available.
> > (ix86_libgcc_floating_mode_supported_p): Enable _Float16 under
> > sse2.
> > (ix86_scalar_mode_supported_p): Ditto.
> > (TARGET_LIBGCC_FLOATING_MODE_SUPPORTED_P): Defined.
> > * config/i386/i386.h (VALID_SSE2_REG_MODE): Add HFmode.
> > (VALID_INT_MODE_P): Add HFmode and HCmode.
> > * config/i386/i386.md (*pushhf_rex64): New define_insn.
> > (*pushhf): Ditto.
> > (*movhf_internal): Ditto.
> > * doc/extend.texi (Half-Precision Floating Point): Document
> > _Float16 for x86.
> > * emit-rtl.c (validate_subreg): Allow (subreg:SI (reg:HF) 0)
> > which is used by extract_bit_field but not backends.
> >
[...]
>
> Ping, I'd like to ask for approval for the code below, which is
> related to the generic part.
>
> start from ..
> > diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
> > index ff3b4449b37..775ee397836 100644
> > --- a/gcc/emit-rtl.c
> > +++ b/gcc/emit-rtl.c
> > @@ -928,6 +928,11 @@ validate_subreg (machine_mode omode, machine_mode 
> > imode,
> >   fix them all.  */
> >if (omode == word_mode)
> >  ;
> > +  /* ??? Similarly to (subreg:DI (reg:SF)), also allow (subreg:SI (reg:HF))
> > + here. Though extract_bit_field is the culprit here, not the backends. 
> >  */
> > +  else if (known_gt (regsize, osize) && known_gt (osize, isize)
> > +  && FLOAT_MODE_P (imode) && INTEGRAL_MODE_P (omode))
> > +;
> >/* ??? Similarly, e.g. with (subreg:DF (reg:TI)).  Though store_bit_field
> >   is the culprit here, and not the backends.  */
> >else if (known_ge (osize, regsize) && known_ge (isize, osize))
>
> and end here.

So the main restriction otherwise in place is

  /* Subregs involving floating point modes are not allowed to
 change size.  Therefore (subreg:DI (reg:DF) 0) is fine, but
 (subreg:SI (reg:DF) 0) isn't.  */
  else if (FLOAT_MODE_P (imode) || FLOAT_MODE_P (omode))
{
  if (! (known_eq (isize, osize)
 /* LRA can use subreg to store a floating point value in
an integer mode.  Although the floating point and the
integer modes need the same number of hard registers,
the size of floating point mode can be less than the
integer mode.  LRA also uses subregs for a register
should be used in different mode in on insn.  */
 || lra_in_progress))
return false;

I'm not sure if it would be possible to do (subreg:SI (subreg:HI (reg:HF)))
to "work around" this restriction.  Alternatively one could finally do away
with all the exceptions and simply allow all such subregs giving them
semantics as to intermediate same-size subregs to integer modes
if this definition issue is why we disallow them?

That is, any float-mode source or destination subreg is interpreted as
wrapping the source operand (if float-mode) in a same size int subreg
and performing the subreg in an integer mode first if the destination
mode is a float mode?

Also I detest that validate_subreg lists things not allowed as opposed
to things allowed.  Why are float modes special, but
fractional and accumulating modes not?  The subreg documentation
also doesn't talk about cases not allowed.

Richard.


Re: [PATCH v3] Make loops_list support an optional loop_p root

2021-08-04 Thread Kewen.Lin via Gcc-patches
on 2021/8/4 6:01 PM, Richard Biener wrote:
> On Wed, Aug 4, 2021 at 4:36 AM Kewen.Lin  wrote:
>>
>> on 2021/8/3 8:08 PM, Richard Biener wrote:
>>> On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:

 on 2021/7/29 4:01 PM, Richard Biener wrote:
> On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
>>
>> on 2021/7/22 8:56 PM, Richard Biener wrote:
>>> On Tue, Jul 20, 2021 at 4:37
>>> PM Kewen.Lin  wrote:

 Hi,

 This v2 has addressed some review comments/suggestions:

   - Use "!=" instead of "<" in function operator!= (const Iter )
   - Add new CTOR loops_list (struct loops *loops, unsigned flags)
 to support loop hierarchy tree rather than just a function,
 and adjust to use loops* accordingly.
>>>
>>> I actually meant struct loop *, not struct loops * ;)  At the point
>>> we pondered to make loop invariant motion work on single
>>> loop nests we gave up not only but also because it iterates
>>> over the loop nest but all the iterators only ever can process
>>> all loops, not say, all loops inside a specific 'loop' (and
>>> including that 'loop' if LI_INCLUDE_ROOT).  So the
>>> CTOR would take the 'root' of the loop tree as argument.
>>>
>>> I see that doesn't trivially fit how loops_list works, at least
>>> not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
>>> could be adjusted to do ONLY_INNERMOST as well?
>>>
>>
>>
>> Thanks for the clarification!  I just realized that the previous
>> version with struct loops* is problematic, all traversal is
>> still bounded with outer_loop == NULL.  I think what you expect
>> is to respect the given loop_p root boundary.  Since we just
>> record the loops' nums, I think we still need the function* fn?
>
> Would it simplify things if we recorded the actual loop *?
>

 I'm afraid it's unsafe to record the loop*.  I had the same
 question why the loop iterator uses index rather than loop* when
 I read this at the first time.  I guess the design of processing
 loops allows its user to update or even delete the folllowing
 loops to be visited.  For example, when the user does some tricks
 on one loop, then it duplicates the loop and its children to
 somewhere and then removes the loop and its children, when
 iterating onto its children later, the "index" way will check its
 validity by get_loop at that point, but the "loop *" way will
 have some recorded pointers to become dangling, can't do the
 validity check on itself, seems to need a side linear search to
 ensure the validity.
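
As an illustration of the point (a sketch only, not the committed code;
"process" stands for whatever the iterator's user does with each loop),
visiting by number lets every access be re-validated through get_loop,
so loops deleted along the way are simply skipped instead of being
reached through a dangling pointer:

  for (unsigned i = 0; i < to_visit.length (); ++i)
    if (class loop *aloop = get_loop (fn, to_visit[i]))
      process (aloop);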

> There's still the to_visit reserve which needs a bound on
> the number of loops for efficiency reasons.
>

 Yes, I still keep the fn in the updated version.

>> So I add one optional argument loop_p root and update the
>> visiting codes accordingly.  Before this change, the previous
>> visiting uses the outer_loop == NULL as the termination condition,
>> it perfectly includes the root itself, but with this given root,
>> we have to use it as the termination condition to avoid to iterate
>> onto its possible existing next.
>>
>> For LI_ONLY_INNERMOST, I was thinking whether we can use the
>> code like:
>>
>> struct loops *fn_loops = loops_for_fn (fn)->larray;
>> for (i = 0; vec_safe_iterate (fn_loops, i, ); i++)
>> if (aloop != NULL
>> && aloop->inner == NULL
>> && flow_loop_nested_p (tree_root, aloop))
>>  this->to_visit.quick_push (aloop->num);
>>
>> it has the stable bound, but if the given root only has several
>> child loops, it can be much worse if there are many loops in fn.
>> It seems impossible to predict the given root loop hierarchy size,
>> maybe we can still use the original linear searching for the case
>> loops_for_fn (fn) == root?  But since this visiting seems not so
>> performance critical, I chose to share the code originally used
>> for FROM_INNERMOST, hope it can have better readability and
>> maintainability.
>
> I was indeed looking for something that has execution/storage
> bound on the subtree we're interested in.  If we pull the CTOR
> out-of-line we can probably keep the linear search for
> LI_ONLY_INNERMOST when looking at the whole loop tree.
>

 OK, I've moved the suggested single loop tree walker out-of-line
 to cfgloop.c, and brought the linear search back for
 LI_ONLY_INNERMOST when looking at the whole loop tree.

> It just seemed to me that we can eventually re-use a
> single loop tree walker for all orders, just adjusting the
> places we push.
>

 Wow, good point!  Indeed, I have further unified all orders
 handlings into a single function walk_loop_tree.
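
To make the idea concrete, a unified walker could be sketched roughly as
below.  The name, signature and flag handling are illustrative only and
need not match the committed walk_loop_tree:

  /* Visit the loop tree rooted at ROOT, pushing loop numbers onto
     TO_VISIT in the order selected by FLAGS.  Assumes TO_VISIT was
     reserved up front (e.g. to number_of_loops (fn)), so quick_push
     is safe.  */
  static void
  walk_loop_tree_sketch (class loop *root, unsigned flags, bool is_root,
                         vec<int> &to_visit)
  {
    bool visit = !is_root || (flags & LI_INCLUDE_ROOT);

    if (flags & LI_ONLY_INNERMOST)
      {
        if (visit && root->inner == NULL)
          to_visit.quick_push (root->num);        /* leaves only  */
      }
    else if (visit && !(flags & LI_FROM_INNERMOST))
      to_visit.quick_push (root->num);            /* pre-order  */

    for (class loop *l = root->inner; l; l = l->next)
      walk_loop_tree_sketch (l, flags, false, to_visit);

    if (visit
        && (flags & LI_FROM_INNERMOST)
        && !(flags & LI_ONLY_INNERMOST))
      to_visit.quick_push (root->num);            /* post-order  */
  }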


[PATCH] tree-optimization/101756 - avoid vectorizing boolean MAX reductions

2021-08-04 Thread Richard Biener
The following avoids vectorizing MIN/MAX reductions on bools which,
when ending up as vector(2)  would need to be
adjusted because of the sign change.  The fix instead avoids any
reduction vectorization where the result isn't compatible
to the original scalar type since we don't compensate for that
either.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

2021-08-04  Richard Biener  

PR tree-optimization/101756
* tree-vect-slp.c (vectorizable_bb_reduc_epilogue): Make sure
the result of the reduction epilogue is compatible to the original
scalar result.

* gcc.dg/vect/bb-slp-pr101756.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr101756.c | 15 +++
 gcc/tree-vect-slp.c |  8 +---
 2 files changed, 20 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101756.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101756.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101756.c
new file mode 100644
index 000..9420e77f64e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101756.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+
+__attribute__ ((simd)) int
+tq (long int ea, int of, int kk)
+{
+  int bc;
+
+  for (bc = 0; bc < 2; ++bc)
+{
+  ++ea;
+  of |= !!kk < !!ea;
+}
+
+  return of;
+}
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index a554c24e0fb..d169bed8e94 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -4847,15 +4847,17 @@ static bool
 vectorizable_bb_reduc_epilogue (slp_instance instance,
stmt_vector_for_cost *cost_vec)
 {
-  enum tree_code reduc_code
-= gimple_assign_rhs_code (instance->root_stmts[0]->stmt);
+  gassign *stmt = as_a  (instance->root_stmts[0]->stmt);
+  enum tree_code reduc_code = gimple_assign_rhs_code (stmt);
   if (reduc_code == MINUS_EXPR)
 reduc_code = PLUS_EXPR;
   internal_fn reduc_fn;
   tree vectype = SLP_TREE_VECTYPE (SLP_INSTANCE_TREE (instance));
   if (!reduction_fn_for_scalar_code (reduc_code, _fn)
   || reduc_fn == IFN_LAST
-  || !direct_internal_fn_supported_p (reduc_fn, vectype, 
OPTIMIZE_FOR_BOTH))
+  || !direct_internal_fn_supported_p (reduc_fn, vectype, OPTIMIZE_FOR_BOTH)
+  || !useless_type_conversion_p (TREE_TYPE (gimple_assign_lhs (stmt)),
+TREE_TYPE (vectype)))
 return false;
 
   /* There's no way to cost a horizontal vector reduction via REDUC_FN so
-- 
2.31.1


[PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

Recent refactoring of the arm_neon.h header enabled better code
generation for intrinsics that manipulate vector structures. New
tests were also added to verify the benefit of these changes. It now
transpires that the code generation improvements are observed only on
little-endian systems. This patch restricts the code generation tests
to little-endian targets (for now).
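
For reference, such a restriction is usually expressed through the target
selector of the failing scan directive; a minimal sketch (the
effective-target name is my assumption here, the attached patch may spell
it differently) would be:

  /* { dg-final { scan-assembler-not "mov\\t" { target aarch64_little_endian } } } */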

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-08-04  Jonathan Wright  

* gcc.target/aarch64/vector_structure_intrinsics.c: Restrict
tests to little-endian targets.



From: Christophe Lyon 
Sent: 03 August 2021 10:42
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Richard Sandiford 

Subject: Re: [PATCH 1/8] aarch64: Use memcpy to copy vector tables in 
vqtbl[234] intrinsics 
 


On Fri, Jul 23, 2021 at 10:22 AM Jonathan Wright via Gcc-patches 
 wrote:
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vqtbl[234] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

        * config/aarch64/arm_neon.h (vqtbl2_s8): Use __builtin_memcpy
        instead of constructing __builtin_aarch64_simd_oi one vector
        at a time.
        (vqtbl2_u8): Likewise.
        (vqtbl2_p8): Likewise.
        (vqtbl2q_s8): Likewise.
        (vqtbl2q_u8): Likewise.
        (vqtbl2q_p8): Likewise.
        (vqtbl3_s8): Use __builtin_memcpy instead of constructing
        __builtin_aarch64_simd_ci one vector at a time.
        (vqtbl3_u8): Likewise.
        (vqtbl3_p8): Likewise.
        (vqtbl3q_s8): Likewise.
        (vqtbl3q_u8): Likewise.
        (vqtbl3q_p8): Likewise.
        (vqtbl4_s8): Use __builtin_memcpy instead of constructing
        __builtin_aarch64_simd_xi one vector at a time.
        (vqtbl4_u8): Likewise.
        (vqtbl4_p8): Likewise.
        (vqtbl4q_s8): Likewise.
        (vqtbl4q_u8): Likewise.
        (vqtbl4q_p8): Likewise.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/vector_structure_intrinsics.c: New test.

Hi,

This new test fails on aarch64_be:
 FAIL: gcc.target/aarch64/vector_structure_intrinsics.c scan-assembler-not 
mov\\t

Can you check?

Thanks

Christophe


rb14749.patch
Description: rb14749.patch


[PATCH] c++: Fix up parsing of attributes for using-directive

2021-08-04 Thread Jakub Jelinek via Gcc-patches
Hi!

As I've said earlier and added xfails in gen-attrs-76.C test,
https://eel.is/c++draft/namespace.udir#nt:using-directive
has attribute-specifier-seq[opt] at the start, not at the end before ;
as gcc is expecting.
IMHO we should continue parsing the GNU attributes at the end,
because using namespace N __attribute__((strong));, while not supported
anymore, used to be accepted in the past, whereas my code searches for
using namespace N [[gnu::strong]]; didn't reveal anything at all.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-08-04  Jakub Jelinek  

* parser.c (cp_parser_block_declaration): Call
cp_parser_using_directive for C++11 attributes followed by
using namespace tokens.
(cp_parser_using_directive): Parse C++11 attributes at the start
of the directive rather than at the end, only parse GNU attributes
at the end.

* g++.dg/lookup/strong-using.C: Add test using [[gnu::strong]]
as well.
* g++.dg/lookup/strong-using2.C: Likewise.
* g++.dg/cpp0x/gen-attrs-58.C: Move alignas(int) before
using namespace.
* g++.dg/cpp0x/gen-attrs-59.C: Move alignas(X) before
using namespace, add tests for alignas before semicolon.
* g++.dg/cpp0x/gen-attrs-76.C: Remove xfails.  Add test for
C++11 attributes on using directive before semicolon.

--- gcc/cp/parser.c.jj  2021-08-03 00:44:32.890492433 +0200
+++ gcc/cp/parser.c 2021-08-03 17:38:07.541725977 +0200
@@ -14655,6 +14655,7 @@ cp_parser_block_declaration (cp_parser *
   /* Peek at the next token to figure out which kind of declaration is
  present.  */
   cp_token *token1 = cp_lexer_peek_token (parser->lexer);
+  size_t attr_idx;
 
   /* If the next keyword is `asm', we have an asm-definition.  */
   if (token1->keyword == RID_ASM)
@@ -14708,6 +14709,18 @@ cp_parser_block_declaration (cp_parser *
   /* If the next token is `static_assert' we have a static assertion.  */
   else if (token1->keyword == RID_STATIC_ASSERT)
 cp_parser_static_assert (parser, /*member_p=*/false);
+  /* If the next tokens after attributes is `using namespace', then we have
+ a using-directive.  */
+  else if ((attr_idx = cp_parser_skip_std_attribute_spec_seq (parser, 1)) != 1
+  && cp_lexer_peek_nth_token (parser->lexer,
+  attr_idx)->keyword == RID_USING
+  && cp_lexer_peek_nth_token (parser->lexer,
+  attr_idx + 1)->keyword == RID_NAMESPACE)
+{
+  if (statement_p)
+   cp_parser_commit_to_tentative_parse (parser);
+  cp_parser_using_directive (parser);
+}
   /* Anything else must be a simple-declaration.  */
   else
 cp_parser_simple_declaration (parser, !statement_p,
@@ -21394,14 +21407,21 @@ cp_parser_alias_declaration (cp_parser*
 /* Parse a using-directive.
 
using-directive:
- using namespace :: [opt] nested-name-specifier [opt]
-   namespace-name ;  */
+ attribute-specifier-seq [opt] using namespace :: [opt]
+   nested-name-specifier [opt] namespace-name ;  */
 
 static void
 cp_parser_using_directive (cp_parser* parser)
 {
   tree namespace_decl;
-  tree attribs;
+  tree attribs = cp_parser_std_attribute_spec_seq (parser);
+  if (cp_lexer_next_token_is (parser->lexer, CPP_SEMICOLON))
+{
+  /* Error during attribute parsing that resulted in skipping
+to next semicolon.  */
+  cp_parser_require (parser, CPP_SEMICOLON, RT_SEMICOLON);
+  return;
+}
 
   /* Look for the `using' keyword.  */
   cp_parser_require_keyword (parser, RID_USING, RT_USING);
@@ -21418,8 +21438,9 @@ cp_parser_using_directive (cp_parser* pa
   /* Get the namespace being used.  */
   namespace_decl = cp_parser_namespace_name (parser);
   cp_warn_deprecated_use_scopes (namespace_decl);
-  /* And any specified attributes.  */
-  attribs = cp_parser_attributes_opt (parser);
+  /* And any specified GNU attributes.  */
+  if (cp_next_tokens_can_be_gnu_attribute_p (parser))
+attribs = chainon (attribs, cp_parser_gnu_attributes_opt (parser));
 
   /* Update the symbol table.  */
   finish_using_directive (namespace_decl, attribs);
--- gcc/testsuite/g++.dg/lookup/strong-using.C.jj   2020-01-12 
11:54:37.197401580 +0100
+++ gcc/testsuite/g++.dg/lookup/strong-using.C  2021-08-03 17:12:05.872281490 
+0200
@@ -8,3 +8,12 @@ namespace A
 
   using namespace B __attribute__ ((strong)); // { dg-warning "no longer 
supported" "" }
 }
+
+namespace C
+{
+  namespace D // { dg-message "inline namespace" }
+  {
+  }
+
+  [[gnu::strong]] using namespace D; // { dg-warning "no longer supported" "" }
+}
--- gcc/testsuite/g++.dg/lookup/strong-using2.C.jj  2020-01-12 
11:54:37.197401580 +0100
+++ gcc/testsuite/g++.dg/lookup/strong-using2.C 2021-08-03 17:12:31.968921065 
+0200
@@ -9,3 +9,12 @@ namespace A
 
   using namespace B __attribute__ ((strong)); // { dg-bogus "no longer 
supported" }
 }
+
+namespace C
+{
+  

Re: [PATCH v3] Make loops_list support an optional loop_p root

2021-08-04 Thread Richard Biener via Gcc-patches
On Wed, Aug 4, 2021 at 4:36 AM Kewen.Lin  wrote:
>
> on 2021/8/3 8:08 PM, Richard Biener wrote:
> > On Fri, Jul 30, 2021 at 7:20 AM Kewen.Lin  wrote:
> >>
> >> on 2021/7/29 4:01 PM, Richard Biener wrote:
> >>> On Fri, Jul 23, 2021 at 10:41 AM Kewen.Lin  wrote:
> 
>  on 2021/7/22 8:56 PM, Richard Biener wrote:
> > On Tue, Jul 20, 2021 at 4:37
> > PM Kewen.Lin  wrote:
> >>
> >> Hi,
> >>
> >> This v2 has addressed some review comments/suggestions:
> >>
> >>   - Use "!=" instead of "<" in function operator!= (const Iter )
> >>   - Add new CTOR loops_list (struct loops *loops, unsigned flags)
> >> to support loop hierarchy tree rather than just a function,
> >> and adjust to use loops* accordingly.
> >
> > I actually meant struct loop *, not struct loops * ;)  At the point
> > we pondered to make loop invariant motion work on single
> > loop nests we gave up not only but also because it iterates
> > over the loop nest but all the iterators only ever can process
> > all loops, not say, all loops inside a specific 'loop' (and
> > including that 'loop' if LI_INCLUDE_ROOT).  So the
> > CTOR would take the 'root' of the loop tree as argument.
> >
> > I see that doesn't trivially fit how loops_list works, at least
> > not for LI_ONLY_INNERMOST.  But I guess FROM_INNERMOST
> > could be adjusted to do ONLY_INNERMOST as well?
> >
> 
> 
>  Thanks for the clarification!  I just realized that the previous
>  version with struct loops* is problematic, all traversal is
>  still bounded with outer_loop == NULL.  I think what you expect
>  is to respect the given loop_p root boundary.  Since we just
>  record the loops' nums, I think we still need the function* fn?
> >>>
> >>> Would it simplify things if we recorded the actual loop *?
> >>>
> >>
> >> I'm afraid it's unsafe to record the loop*.  I had the same
> >> question why the loop iterator uses index rather than loop* when
> >> I read this at the first time.  I guess the design of processing
> >> loops allows its user to update or even delete the folllowing
> >> loops to be visited.  For example, when the user does some tricks
> >> on one loop, then it duplicates the loop and its children to
> >> somewhere and then removes the loop and its children, when
> >> iterating onto its children later, the "index" way will check its
> >> validity by get_loop at that point, but the "loop *" way will
> >> have some recorded pointers to become dangling, can't do the
> >> validity check on itself, seems to need a side linear search to
> >> ensure the validity.
> >>
> >>> There's still the to_visit reserve which needs a bound on
> >>> the number of loops for efficiency reasons.
> >>>
> >>
> >> Yes, I still keep the fn in the updated version.
> >>
>  So I add one optional argument loop_p root and update the
>  visiting codes accordingly.  Before this change, the previous
>  visiting uses the outer_loop == NULL as the termination condition,
>  it perfectly includes the root itself, but with this given root,
>  we have to use it as the termination condition to avoid to iterate
>  onto its possible existing next.
> 
>  For LI_ONLY_INNERMOST, I was thinking whether we can use the
>  code like:
> 
>  struct loops *fn_loops = loops_for_fn (fn)->larray;
>  for (i = 0; vec_safe_iterate (fn_loops, i, ); i++)
>  if (aloop != NULL
>  && aloop->inner == NULL
>  && flow_loop_nested_p (tree_root, aloop))
>   this->to_visit.quick_push (aloop->num);
> 
>  it has the stable bound, but if the given root only has several
>  child loops, it can be much worse if there are many loops in fn.
>  It seems impossible to predict the given root loop hierarchy size,
>  maybe we can still use the original linear searching for the case
>  loops_for_fn (fn) == root?  But since this visiting seems not so
>  performance critical, I chose to share the code originally used
>  for FROM_INNERMOST, hope it can have better readability and
>  maintainability.
> >>>
> >>> I was indeed looking for something that has execution/storage
> >>> bound on the subtree we're interested in.  If we pull the CTOR
> >>> out-of-line we can probably keep the linear search for
> >>> LI_ONLY_INNERMOST when looking at the whole loop tree.
> >>>
> >>
> >> OK, I've moved the suggested single loop tree walker out-of-line
> >> to cfgloop.c, and brought the linear search back for
> >> LI_ONLY_INNERMOST when looking at the whole loop tree.
> >>
> >>> It just seemed to me that we can eventually re-use a
> >>> single loop tree walker for all orders, just adjusting the
> >>> places we push.
> >>>
> >>
> >> Wow, good point!  Indeed, I have further unified all orders
> >> handlings into a single function walk_loop_tree.
> >>
> 
>  Bootstrapped and regtested on 

[committed] c++: Fix up #pragma omp declare {simd,variant} and acc routine parsing

2021-08-04 Thread Jakub Jelinek via Gcc-patches
Hi!

When parsing default arguments, we need to temporarily clear 
parser->omp_declare_simd
and parser->oacc_routine, otherwise it can clash with further declarations
inside of e.g. lambdas inside of those default arguments.

Bootstrapped/regtested on x86_64-linux and i686-linux, committed to trunk,
will backport eventually.

2021-08-04  Jakub Jelinek  

PR c++/101759
* parser.c (cp_parser_default_argument): Temporarily override
parser->omp_declare_simd and parser->oacc_routine to NULL.

* g++.dg/gomp/pr101759.C: New test.
* g++.dg/goacc/pr101759.C: New test.

--- gcc/cp/parser.c.jj  2021-08-03 17:38:07.541725977 +0200
+++ gcc/cp/parser.c 2021-08-03 19:23:08.693843077 +0200
@@ -24509,6 +24509,8 @@ cp_parser_default_argument (cp_parser *p
  set correctly.  */
   saved_greater_than_is_operator_p = parser->greater_than_is_operator_p;
   parser->greater_than_is_operator_p = !template_parm_p;
+  auto odsd = make_temp_override (parser->omp_declare_simd, NULL);
+  auto ord = make_temp_override (parser->oacc_routine, NULL);
   /* Local variable names (and the `this' keyword) may not
  appear in a default argument.  */
   saved_local_variables_forbidden_p = parser->local_variables_forbidden_p;
--- gcc/testsuite/g++.dg/gomp/pr101759.C.jj 2021-08-03 19:32:56.091725711 
+0200
+++ gcc/testsuite/g++.dg/gomp/pr101759.C2021-08-03 19:36:03.762138412 
+0200
@@ -0,0 +1,8 @@
+// PR c++/101759
+// { dg-do compile { target c++11 } }
+
+#pragma omp declare simd
+int foo (int x = []() { extern int bar (int); return 1; }());
+int corge (int = 1);
+#pragma omp declare variant (corge) match (user={condition(true)})
+int baz (int x = []() { extern int qux (int); return 1; }());
--- gcc/testsuite/g++.dg/goacc/pr101759.C.jj2021-08-03 19:33:15.079463941 
+0200
+++ gcc/testsuite/g++.dg/goacc/pr101759.C   2021-08-03 19:35:53.148284738 
+0200
@@ -0,0 +1,5 @@
+// PR c++/101759
+// { dg-do compile { target c++11 } }
+
+#pragma acc routine
+int foo (int x = []() { extern int bar (int); return 1; }());

Jakub



Re: [PATCH 5/6] AVX512FP16: Initial support for AVX512FP16 feature and scalar _Float16 instructions.

2021-08-04 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 2, 2021 at 8:44 AM liuhongt  wrote:
>
> From: "Guo, Xuepeng" 
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect FEATURE_AVX512FP16.
> * common/config/i386/i386-common.c
> (OPTION_MASK_ISA_AVX512FP16_SET,
> OPTION_MASK_ISA_AVX512FP16_UNSET,
> OPTION_MASK_ISA2_AVX512FP16_SET,
> OPTION_MASK_ISA2_AVX512FP16_UNSET): New.
> (OPTION_MASK_ISA2_AVX512BW_UNSET,
> OPTION_MASK_ISA2_AVX512BF16_UNSET): Add AVX512FP16.
> (ix86_handle_option): Handle -mavx512fp16.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_AVX512FP16.
> * common/config/i386/i386-isas.h: Add entry for AVX512FP16.
> * config.gcc: Add avx512fp16intrin.h.
> * config/i386/avx512fp16intrin.h: New intrinsic header.
> * config/i386/cpuid.h: Add bit_AVX512FP16.
> * config/i386/i386-builtin-types.def: (FLOAT16): New primitive type.
> * config/i386/i386-builtins.c: Support _Float16 type for i386
> backend.
> (ix86_init_float16_builtins): New function.
> (ix86_float16_type_node): New.
> * config/i386/i386-c.c (ix86_target_macros_internal): Define
> __AVX512FP16__.
> * config/i386/i386-expand.c (ix86_expand_branch): Support
> HFmode.
> (ix86_prepare_fp_compare_args): Adjust TARGET_SSE_MATH &&
> SSE_FLOAT_MODE_P to SSE_FLOAT_MODE_SSEMATH_OR_HF_P.
> (ix86_expand_fp_movcc): Ditto.
> * config/i386/i386-isa.def: Add PTA define for AVX512FP16.
> * config/i386/i386-options.c (isa2_opts): Add -mavx512fp16.
> (ix86_valid_target_attribute_inner_p): Add avx512fp16 attribute.
> * config/i386/i386.c (ix86_get_ssemov): Use
> vmovdqu16/vmovw/vmovsh for HFmode/HImode scalar or vector.
> (ix86_get_excess_precision): Use
> FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when TARGET_AVX512FP16
> existed.
> (sse_store_index): Use SFmode cost for HFmode cost.
> (inline_memory_move_cost): Add HFmode, and prefer SSE cost over
> GPR cost for HFmode.
> (ix86_hard_regno_mode_ok): Allow HImode in sse register.
> (ix86_mangle_type): Add mangling for _Float16 type.
> (inline_secondary_memory_needed): No memory is needed for
> 16bit movement between gpr and sse reg under
> TARGET_AVX512FP16.
> (ix86_multiplication_cost): Adjust TARGET_SSE_MATH &&
> SSE_FLOAT_MODE_P to SSE_FLOAT_MODE_SSEMATH_OR_HF_P.
> (ix86_division_cost): Ditto.
> (ix86_rtx_costs): Ditto.
> (ix86_add_stmt_cost): Ditto.
> (ix86_optab_supported_p): Ditto.
> * config/i386/i386.h (VALID_AVX512F_SCALAR_MODE): Add HFmode.
> (SSE_FLOAT_MODE_SSEMATH_OR_HF_P): Add HFmode.
> (PTA_SAPPHIRERAPIDS): Add PTA_AVX512FP16.
> * config/i386/i386.md (mode): Add HFmode.
> (MODE_SIZE): Add HFmode.
> (isa): Add avx512fp16.
> (enabled): Handle avx512fp16.
> (ssemodesuffix): Add sh suffix for HFmode.
> (comm): Add mult, div.
> (plusminusmultdiv): New code iterator.
> (insn): Add mult, div.
> (*movhf_internal): Adjust for avx512fp16 instruction.
> (*movhi_internal): Ditto.
> (*cmpihf): New define_insn for HFmode.
> (*ieee_shf3): Likewise.
> (extendhf2): Likewise.
> (trunchf2): Likewise.
> (floathf2): Likewise.
> (*hf): Likewise.
> (cbranchhf4): New expander.
> (movhfcc): Likewise.
> (hf3): Likewise.
> (mulhf3): Likewise.
> (divhf3): Likewise.
> * config/i386/i386.opt: Add mavx512fp16.
> * config/i386/immintrin.h: Include avx512fp16intrin.h.
> * doc/invoke.texi: Add mavx512fp16.
> * doc/extend.texi: Add avx512fp16 Usage Notes.

OK with some nits (e.g. please leave some vertical space to visually
split different functionality inside the function).

> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx-1.c: Add -mavx512fp16 in dg-options.
> * gcc.target/i386/avx-2.c: Ditto.
> * gcc.target/i386/avx512-check.h: Check cpuid for AVX512FP16.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute check.
> * gcc.target/i386/sse-13.c: Add -mavx512fp16.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Ditto.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp: (check_effective_target_avx512fp16): New.
> * g++.target/i386/float16-1.C: New test.
> * g++.target/i386/float16-2.C: Ditto.
> * g++.target/i386/float16-3.C: Ditto.
> * gcc.target/i386/avx512fp16-12a.c: Ditto.
> * gcc.target/i386/avx512fp16-12b.c: Ditto.
> * gcc.target/i386/float16-3a.c: Ditto.
> * gcc.target/i386/float16-3b.c: Ditto.
> * 

[committed] testsuite: Fix duplicated content of gcc.c-torture/execute/ieee/pr29302-1.x

2021-08-04 Thread Jakub Jelinek via Gcc-patches
Hi!

After seeing the config/t-slibgcc-fuchsia issue, I ran the following dumb
and slow script to discover similar cases of files that have both halves
identical:

for f in `find . -type f`; do
  sz=`ls -l $f | awk '{print $5}'`
  sz=`expr $sz / 2`
  [ $sz = 0 ] && continue
  if [ $sz -gt 16 ]; then
    dd if=$f of=/tmp/1 bs=1 count=16 2>/dev/null
    dd if=$f of=/tmp/2 bs=1 skip=$sz count=16 2>/dev/null
    cmp -s /tmp/1 /tmp/2 || continue
  fi
  dd if=$f of=/tmp/1 bs=1 count=$sz 2>/dev/null
  dd if=$f of=/tmp/2 bs=1 skip=$sz 2>/dev/null
  cmp -s /tmp/1 /tmp/2 && echo $f
done

The script found
gcc/testsuite/gcc.c-torture/execute/ieee/pr29302-1.x
gcc/testsuite/go.test/test/fixedbugs/bug206.out
gcc/testsuite/go.test/test/fixedbugs/issue30709.out
gcc/testsuite/go.test/test/fixedbugs/issue21879.out
gcc/testsuite/go.test/test/ken/cplx0.out
libgcc/config/t-slibgcc-fuchsia
libgo/misc/cgo/life/testdata/main.out
libgo/go/compress/flate/testdata/huffman-zero.in
libstdc++-v3/testsuite/data/wostream_inserter_char-1.txt
libstdc++-v3/testsuite/data/ios_base_members_static-1.tst
libstdc++-v3/testsuite/data/istream_unformatted-1.tst
libstdc++-v3/testsuite/data/wostream_inserter_char-1.tst
libstdc++-v3/testsuite/data/ostream_inserter_char-1.tst
libstdc++-v3/testsuite/data/ostream_inserter_char-1.txt
libstdc++-v3/testsuite/data/istream_unformatted-1.txt
libstdc++-v3/testsuite/data/wistream_unformatted-1.tst
libstdc++-v3/testsuite/data/wistream_unformatted-1.txt
Of these, libgcc/config/t-slibgcc-fuchsia is already fixed,
gcc/testsuite/gcc.c-torture/execute/ieee/pr29302-1.x is also a clear bug
for which I've now committed a fix, and the rest is most probably intentional.

Committed to trunk as obvious.

2021-08-04  Jakub Jelinek  

* gcc.c-torture/execute/ieee/pr29302-1.x: Undo doubly applied patch.

--- gcc/testsuite/gcc.c-torture/execute/ieee/pr29302-1.x
+++ gcc/testsuite/gcc.c-torture/execute/ieee/pr29302-1.x
@@ -4,9 +4,3 @@ if { [istarget "tic6x-*-*"] && [check_effective_target_ti_c67x] 
} {
 return 1
 }
 return 0
-if { [istarget "tic6x-*-*"] && [check_effective_target_ti_c67x] } {
-# C6X uses -freciprocal-math by default.
-set torture_execute_xfail "tic6x-*-*"
-return 1
-}
-return 0

Jakub



Re: OMP builtins in offloading

2021-08-04 Thread Thomas Schwinge
Hi!

On 2015-01-08T16:41:50+0100, I wrote:
> Committed to trunk in r219346:

(Git commit 45f46750a3513790573791c0eec6b600b42f2042.)

> Make sure that OMP builtins are available in offloading compilers.

> --- gcc/builtins.def
> +++ gcc/builtins.def
> @@ -148,11 +148,14 @@ along with GCC; see the file COPYING3.  If not see
>
>  /* Builtin used by the implementation of GNU OpenMP.  None of these are
> actually implemented in the compiler; they're all in libgomp.  */
> +/* These builtins also need to be enabled in offloading compilers invoked 
> from
> +   mkoffload; for that purpose, we're checking the -foffload-abi flag here.  
> */
>  #undef DEF_GOMP_BUILTIN
>  #define DEF_GOMP_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
>DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE,\
> false, true, true, ATTRS, false, \
> -(flag_openmp || flag_tree_parallelize_loops))
> +(flag_openmp || flag_tree_parallelize_loops \
> + || flag_offload_abi != OFFLOAD_ABI_UNSET))

(Similar for 'DEF_GOACC_BUILTIN', later.)

Since Tom's PR64707 commit r220037 (Git commit
1506ae0e1e865fb7a42fc37a47f1799b71f21c53) "Make fopenmp an LTO option" as
well as PR64672 commit r220038 (Git commit
a0c88d0629a33161add8d5bc083f1e59f3f756f7) "Make fopenacc an LTO option",
we're now actually passing '-fopenacc'/'-fopenmp' to the 'mkoffload's,
which will pass these on to the offload compilers, so we may clean up
this change.

OK to push "Don't consider '-foffload-abi' in 'DEF_GOACC_BUILTIN',
'DEF_GOMP_BUILTIN'", see attached?


Regards
 Thomas


From bd83a68fb7ed0d746149029424f01cd857219fc0 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Mon, 2 Aug 2021 18:33:50 +0200
Subject: [PATCH] Don't consider '-foffload-abi' in 'DEF_GOACC_BUILTIN',
 'DEF_GOMP_BUILTIN'

Since Tom's PR64707 commit r220037 (Git commit
1506ae0e1e865fb7a42fc37a47f1799b71f21c53) "Make fopenmp an LTO option" as well
as PR64672 commit r220038 (Git commit a0c88d0629a33161add8d5bc083f1e59f3f756f7)
"Make fopenacc an LTO option", we're now actually passing
'-fopenacc'/'-fopenmp' to the 'mkoffload's, which will pass these on to the
offload compilers.

	gcc/
	* builtins.def (DEF_GOACC_BUILTIN, DEF_GOMP_BUILTIN): Don't
	consider '-foffload-abi'.
	* common.opt (-foffload-abi): Remove 'Var', 'Init'.
	* opts.c (common_handle_option) <-foffload-abi> [ACCEL_COMPILER]:
	Ignore.
---
 gcc/builtins.def | 8 ++--
 gcc/common.opt   | 2 +-
 gcc/opts.c   | 6 --
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/gcc/builtins.def b/gcc/builtins.def
index ec556df4f66..45a09b4d42d 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -205,14 +205,11 @@ along with GCC; see the file COPYING3.  If not see
 
 /* Builtin used by the implementation of OpenACC and OpenMP.  Few of these are
actually implemented in the compiler; most are in libgomp.  */
-/* These builtins also need to be enabled in offloading compilers invoked from
-   mkoffload; for that purpose, we're checking the -foffload-abi flag here.  */
 #undef DEF_GOACC_BUILTIN
 #define DEF_GOACC_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
   DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE,\
 	   false, true, true, ATTRS, false, \
-	   (flag_openacc \
-		|| flag_offload_abi != OFFLOAD_ABI_UNSET))
+	   flag_openacc)
 #undef DEF_GOACC_BUILTIN_COMPILER
 #define DEF_GOACC_BUILTIN_COMPILER(ENUM, NAME, TYPE, ATTRS) \
   DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE,\
@@ -227,8 +224,7 @@ along with GCC; see the file COPYING3.  If not see
false, true, true, ATTRS, false, \
 	   (flag_openacc \
 		|| flag_openmp \
-		|| flag_tree_parallelize_loops > 1 \
-		|| flag_offload_abi != OFFLOAD_ABI_UNSET))
+		|| flag_tree_parallelize_loops > 1))
 
 /* Builtin used by the implementation of GNU TM.  These
functions are mapped to the actual implementation of the STM library. */
diff --git a/gcc/common.opt b/gcc/common.opt
index d9da1131eda..ed8ab5fbe13 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2112,7 +2112,7 @@ Common Driver Joined MissingArgError(options or targets=options missing after %q
 -foffload-options==	Specify options for the offloading targets.
 
 foffload-abi=
-Common Joined RejectNegative Enum(offload_abi) Var(flag_offload_abi) Init(OFFLOAD_ABI_UNSET)
+Common Joined RejectNegative Enum(offload_abi)
 -foffload-abi=[lp64|ilp32]	Set the ABI to use in an offload compiler.
 
 Enum
diff --git a/gcc/opts.c b/gcc/opts.c
index 93366e6eb2d..1f52e1139c7 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -2737,12 +2737,14 @@ common_handle_option (struct gcc_options *opts,
   /* Deferred.  */
   break;
 
-#ifndef ACCEL_COMPILER

[committed] libgcc: Fix duplicated content of config/t-slibgcc-fuchsia

2021-08-04 Thread Jakub Jelinek via Gcc-patches
Hi!

The file has two identical halves, seems like twice applied patch.

Committed to trunk as obvious.

2021-08-04  Jakub Jelinek  

* config/t-slibgcc-fuchsia: Undo doubly applied patch.

--- libgcc/config/t-slibgcc-fuchsia.jj  2021-01-04 10:25:53.777064609 +0100
+++ libgcc/config/t-slibgcc-fuchsia 2021-08-03 16:23:18.396748464 +0200
@@ -20,25 +20,3 @@
 
 SHLIB_LDFLAGS = -Wl,--soname=$(SHLIB_SONAME) \
 $(LDFLAGS)
-# Copyright (C) 2017-2021 Free Software Foundation, Inc.
-#
-# This file is part of GCC.
-#
-# GCC is free software; you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation; either version 3, or (at your option)
-# any later version.
-#
-# GCC is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-# GNU General Public License for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with GCC; see the file COPYING3.  If not see
-# .
-
-# Fuchsia-specific shared library overrides.
-
-SHLIB_LDFLAGS = -Wl,--soname=$(SHLIB_SONAME) \
-$(LDFLAGS)

Jakub



Re: [PATCH] [i386] Refine predicate of peephole2 to general_reg_operand. [PR target/101743]

2021-08-04 Thread Uros Bizjak via Gcc-patches
On Wed, Aug 4, 2021 at 5:33 AM liuhongt  wrote:
>
> Hi:
>   The define_peephole2 which is added by r12-2640-gf7bf03cf69ccb7dc
> should only work on general registers, considering that x86 also
> supports mov instructions between gpr, sse reg, mask reg, limiting the
> peephole2 predicate to general_reg_operand.
> I failed to construct a testcase, but I believe that the PR problem
> should be solved by this patch.
>
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>   Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/101743
> * config/i386/i386.md (peephole2): Refine predicate from
> register_operand to general_reg_operand.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.md | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 0c23ddb8d1f..51e8b475bca 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -19423,11 +19423,11 @@ (define_peephole2
>  ;; Eliminate a reg-reg mov by inverting the condition of a cmov (#1).
>  ;; mov r0,r1; dec r0; mov r2,r3; cmov r0,r2 -> dec r1; mov r0,r3; cmov r0, r1
>  (define_peephole2
> - [(set (match_operand:SWI248 0 "register_operand")
> -   (match_operand:SWI248 1 "register_operand"))
> + [(set (match_operand:SWI248 0 "general_reg_operand")
> +   (match_operand:SWI248 1 "general_reg_operand"))
>(parallel [(set (reg FLAGS_REG) (match_operand 5))
>  (set (match_dup 0) (match_operand:SWI248 6))])
> -  (set (match_operand:SWI248 2 "register_operand")
> +  (set (match_operand:SWI248 2 "general_reg_operand")
> (match_operand:SWI248 3))
>(set (match_dup 0)
> (if_then_else:SWI248 (match_operator 4 "ix86_comparison_operator"
> @@ -19455,10 +19455,10 @@ (define_peephole2
>  ;; Eliminate a reg-reg mov by inverting the condition of a cmov (#2).
>  ;; mov r2,r3; mov r0,r1; dec r0; cmov r0,r2 -> dec r1; mov r0,r3; cmov r0, r1
>  (define_peephole2
> - [(set (match_operand:SWI248 2 "register_operand")
> + [(set (match_operand:SWI248 2 "general_reg_operand")
> (match_operand:SWI248 3))
> -  (set (match_operand:SWI248 0 "register_operand")
> -   (match_operand:SWI248 1 "register_operand"))
> +  (set (match_operand:SWI248 0 "general_reg_operand")
> +   (match_operand:SWI248 1 "general_reg_operand"))
>(parallel [(set (reg FLAGS_REG) (match_operand 5))
>  (set (match_dup 0) (match_operand:SWI248 6))])
>(set (match_dup 0)
> --
> 2.27.0
>


Re: [PATCH] arm: Fix multilib mapping for CDE extensions [PR100856]

2021-08-04 Thread Christophe Lyon via Gcc-patches
ping?

On Thu, 15 Jul 2021 at 15:07, Christophe LYON via Gcc-patches
 wrote:
>
> This is a followup to Srinath's recent patch: the newly added test is
> failing e.g. on arm-linux-gnueabihf without R/M profile multilibs.
>
> It is also failing on arm-eabi with R/M profile multilibs if the
> execution engine does not support v8.1-M instructions.
>
> The patch avoids this by adding check_effective_target_FUNC_multilib
> in target-supports.exp which effectively checks whether the target
> supports linking and execution, like what is already done for other
> ARM effective targets.  pr100856.c is updated to use it instead of
> arm_v8_1m_main_cde_mve_ok (which makes the testcase a bit of a
> duplicate with check_effective_target_FUNC_multilib).
>
> In addition, I noticed that requiring MVE does not seem necessary and
> this enables the test to pass even when targeting a CPU without MVE:
> since the test does not involve actual CDE instructions, it can pass
> on other architecture versions.  For instance, when requiring MVE, we
> have to use cortex-m55 under QEMU for the test to pass because the
> memset() that comes from v8.1-m.main+mve multilib uses LOB
> instructions (DLS) (memset is used during startup).  Keeping
> arm_v8_1m_main_cde_mve_ok would mean we would enable the test provided
> we have the right multilibs, causing a runtime error if the simulator
> does not support LOB instructions (e.g. when targeting cortex-m7).
>
> I do not update sourcebuild.texi since the CDE effective targets are
> already collectively documented.
>
> Finally, the patch fixes two typos in comments.
>
> 2021-07-15  Christophe Lyon  
>
>  PR target/100856
>  gcc/
>  * config/arm/arm.opt: Fix typo.
>  * config/arm/t-rmprofile: Fix typo.
>
>  gcc/testsuite/
>  * gcc.target/arm/acle/pr100856.c: Use arm_v8m_main_cde_multilib
>  and arm_v8m_main_cde.
>  * lib/target-supports.exp: Add
> check_effective_target_FUNC_multilib for ARM CDE.
>
>


Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-08-04 Thread Richard Sandiford via Gcc-patches
Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can select the top or bottom half of the operand registers. This
> selection does not change the cost of the underlying instruction and
> this should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match vec_select high-half of its operands. This traversal
> prevents the cost of the vec_select from being added into the cost of
> the multiply - meaning that these instructions can now be emitted in
> the combine pass as they are no longer deemed prohibitively
> expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.

Like you say, the instructions can handle both the low and high halves.
Shouldn't we also check for the low part (as a SIGN/ZERO_EXTEND of
a subreg)?

> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
>   * config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
>   Define.
>   (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
>   vec_select high-half from being added into Neon multiply
>   cost.
>   * rtlanal.c (vec_series_highpart_p): Define.
>   * rtlanal.h (vec_series_highpart_p): Declare.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vmul_high_cost.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> 5809887997305317c5a81421089db431685e2927..a49672afe785e3517250d324468edacceab5c9d3
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -76,6 +76,7 @@
>  #include "function-abi.h"
>  #include "gimple-pretty-print.h"
>  #include "tree-ssa-loop-niter.h"
> +#include "rtlanal.h"
>  
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -11970,6 +11971,19 @@ aarch64_cheap_mult_shift_p (rtx x)
>return false;
>  }
>  
> +/* Return true iff X is an operand of a select-high-half vector
> +   instruction.  */
> +
> +static bool
> +aarch64_vec_select_high_operand_p (rtx x)
> +{
> +  return ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +   && GET_CODE (XEXP (x, 0)) == VEC_SELECT
> +   && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
> + GET_MODE (XEXP (XEXP (x, 0), 0)),
> + XEXP (XEXP (x, 0), 1)));
> +}
> +
>  /* Helper function for rtx cost calculation.  Calculate the cost of
> a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
> Return the calculated cost of the expression, recursing manually in to
> @@ -11995,6 +12009,13 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> int outer, bool speed)
>unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>if (vec_flags & VEC_ADVSIMD)
>   {
> +   /* The select-operand-high-half versions of the instruction have the
> +  same cost as the three vector version - don't add the costs of the
> +  select into the costs of the multiply.  */
> +   if (aarch64_vec_select_high_operand_p (op0))
> + op0 = XEXP (XEXP (op0, 0), 0);
> +   if (aarch64_vec_select_high_operand_p (op1))
> + op1 = XEXP (XEXP (op1, 0), 0);

For consistency with aarch64_strip_duplicate_vec_elt, I think this
should be something like aarch64_strip_vec_extension, returning
the inner rtx on success and the original one on failure.
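
For illustration, the suggested shape could look roughly like the
following sketch, derived from the helper quoted above (names are
illustrative, not committed code):

  static rtx
  aarch64_strip_vec_extension (rtx x)
  {
    if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
        && GET_CODE (XEXP (x, 0)) == VEC_SELECT
        && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
                                  GET_MODE (XEXP (XEXP (x, 0), 0)),
                                  XEXP (XEXP (x, 0), 1)))
      /* Success: return the inner rtx, i.e. the unextended vector.  */
      return XEXP (XEXP (x, 0), 0);
    /* Failure: return the original rtx unchanged.  */
    return x;
  }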

Thanks,
Richard

> /* The by-element versions of the instruction have the same costs as
>the normal 3-vector version.  So don't add the costs of the
>duplicate or subsequent select into the costs of the multiply.  We
> diff --git a/gcc/rtlanal.h b/gcc/rtlanal.h
> index 
> e1642424db89736675ac3e0d505aeaa59dca8bad..542dc7898bead27d3da89e5138c49563ba226eae
>  100644
> --- a/gcc/rtlanal.h
> +++ b/gcc/rtlanal.h
> @@ -331,6 +331,10 @@ inline vec_rtx_properties_base::~vec_rtx_properties_base 
> ()
> collecting the references a second time.  */
>  using vec_rtx_properties = growing_rtx_properties;
>  
> +bool
> +vec_series_highpart_p (machine_mode result_mode, machine_mode op_mode,
> +rtx sel);
> +
>  bool
>  vec_series_lowpart_p (machine_mode result_mode, machine_mode op_mode, rtx 
> sel);
>  
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 
> ec7a062829cb4ead3eaedf1546956107f4ad3bb2..3db49e7a8237bef8ffd9aa4036bb2cfdb1cee6d5
>  100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -6941,6 +6941,25 @@ register_asm_p (const_rtx x)
> && DECL_REGISTER (REG_EXPR (x)));
>  }
>  
> +/* Return true if, for all OP of mode OP_MODE:
> +
> + (vec_select:RESULT_MODE OP SEL)
> +
> +   is equivalent to the highpart RESULT_MODE of OP.  */
> +
> +bool
> +vec_series_highpart_p (machine_mode result_mode, machine_mode op_mode, rtx 
> sel)
> +{
> +  int nunits;
> +  if (GET_MODE_NUNITS 

[PUSHED] Mark path_range_query::dump as override.

2021-08-04 Thread Aldy Hernandez via Gcc-patches

On 8/3/21 10:29 AM, Martin Liška wrote:

Hey.

I've just noticed that your recent change caused:

/home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-clang/build/gcc/gimple-range-path.h:44:8: 
warning: 'dump' overrides a member function but is not marked 'override' 
[-Winconsistent-missing-override]


Can you please take a look?


Absolutely.

Thanks for spotting this.

Pushed.

Aldy
From 9db0bcd9fdc2e3a659d56435cb18d553f4292edb Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Wed, 4 Aug 2021 10:55:12 +0200
Subject: [PATCH] Mark path_range_query::dump as override.

gcc/ChangeLog:

	* gimple-range-path.h (path_range_query::dump): Mark override.
---
 gcc/gimple-range-path.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/gimple-range-path.h b/gcc/gimple-range-path.h
index 43f0ec80286..0d2d2e7f75d 100644
--- a/gcc/gimple-range-path.h
+++ b/gcc/gimple-range-path.h
@@ -41,7 +41,7 @@ public:
 			  const bitmap_head *imports);
   bool range_of_expr (irange , tree name, gimple * = NULL) override;
   bool range_of_stmt (irange , gimple *, tree name = NULL) override;
-  void dump (FILE *);
+  void dump (FILE *) override;
   void debug ();
 
 private:
-- 
2.31.1



Re: [PATCH 3/3] [PR libfortran/101305] Fix ISO_Fortran_binding.h paths in gfortran testsuite

2021-08-04 Thread Andreas Schwab
On Jul 13 2021, Sandra Loosemore wrote:

> diff --git a/gcc/testsuite/gfortran.dg/ISO_Fortran_binding_1.c 
> b/gcc/testsuite/gfortran.dg/ISO_Fortran_binding_1.c
> index a571459..9da5d85 100644
> --- a/gcc/testsuite/gfortran.dg/ISO_Fortran_binding_1.c
> +++ b/gcc/testsuite/gfortran.dg/ISO_Fortran_binding_1.c
> @@ -1,6 +1,6 @@
>  /* Test F2008 18.5: ISO_Fortran_binding.h functions.  */
>  
> -#include "../../../libgfortran/ISO_Fortran_binding.h"
> +#include "ISO_Fortran_binding.h"

Shouldn't that use <ISO_Fortran_binding.h> since that is an installed
header, not one that is supposed to be picked up from the current
directory?
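
That is, the two forms are searched differently:

  /* Found via the include path only, i.e. the installed header
     (or whatever -I/-isystem the test harness adds):  */
  #include <ISO_Fortran_binding.h>

  /* Searched in the directory of the including file first, before
     falling back to the include path:  */
  #include "ISO_Fortran_binding.h"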

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH][pushed] docs: document threader-mode param

2021-08-04 Thread Martin Liška

On 8/4/21 10:54 AM, Aldy Hernandez wrote:

On 8/4/21 9:49 AM, Martin Liška wrote:

Hi.

Pushing as obvious.

Martin

gcc/ChangeLog:

 * doc/invoke.texi: Document threader-mode param.
---
  gcc/doc/invoke.texi | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 65bb9981f02..4efc8b757ec 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13421,6 +13421,9 @@ Setting to 0 disables the analysis completely.
  @item modref-max-escape-points
  Specifies the maximum number of escape points tracked by modref per SSA-name.

+@item threader-mode
+Specifies the mode the backwards threader should run in.
+


This is slated to be removed sometime in the next few weeks, which is why I didn't document it.


Oh, I see.



In the future, is there a preferred way to add internal --param's not for 
public consumption?


Params are internal by design and can be liberally modified as we want. That's
different from options, which should be treated conservatively.


I've run into the same problem with --param=threader-iterative, with folks 
adding PRs for an undocumented internal construct.

Sorry to have created work for you.


That's fine!
Martin



Aldy





Re: [PATCH][pushed] docs: document threader-mode param

2021-08-04 Thread Aldy Hernandez via Gcc-patches

On 8/4/21 9:49 AM, Martin Liška wrote:

Hi.

Pushing as obvious.

Martin

gcc/ChangeLog:

 * doc/invoke.texi: Document threader-mode param.
---
  gcc/doc/invoke.texi | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 65bb9981f02..4efc8b757ec 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13421,6 +13421,9 @@ Setting to 0 disables the analysis completely.
  @item modref-max-escape-points
  Specifies the maximum number of escape points tracked by modref per 
SSA-name.


+@item threader-mode
+Specifies the mode the backwards threader should run in.
+


This is slated to be removed sometime in the next few weeks, which is why I didn't document it.


In the future, is there a preferred way to add internal --param's not 
for public consumption?  I've run into the same problem with 
--param=threader-iterative, with folks adding PRs for an undocumented 
internal construct.


Sorry to have created work for you.

Aldy



Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost

2021-08-04 Thread Richard Sandiford via Gcc-patches
Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> V2 of the patch addresses the initial review comments, factors out
> common code (as we discussed off-list) and adds a set of unit tests
> to verify the code generation benefit.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
>   * config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
>   Define.
>   (aarch64_rtx_mult_cost): Traverse RTL tree to prevent
>   vec_select cost from being added into Neon multiply cost.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vmul_element_cost.c: New test.
>
>
>
> From: Richard Sandiford 
> Sent: 22 July 2021 18:16
> To: Jonathan Wright 
> Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov 
> 
> Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost 
>  
> Jonathan Wright  writes:
>> Hi,
>>
>> The Neon multiply/multiply-accumulate/multiply-subtract instructions
>> can take various forms - multiplying full vector registers of values
>> or multiplying one vector by a single element of another. Regardless
>> of the form used, these instructions have the same cost, and this
>> should be reflected by the RTL cost function.
>>
>> This patch adds RTL tree traversal in the Neon multiply cost function
>> to match the vec_select used by the lane-referencing forms of the
>> instructions already mentioned. This traversal prevents the cost of
>> the vec_select from being added into the cost of the multiply -
>> meaning that these instructions can now be emitted in the combine
>> pass as they are no longer deemed prohibitively expensive.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Jonathan
>>
>> ---
>>
>> gcc/ChangeLog:
>>
>> 2021-07-19  Jonathan Wright  
>>
>> * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>> RTL tree to prevent vec_select from being added into Neon
>> multiply cost.
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 
>> f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd
>>  100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
>> int outer, bool speed)
>>    op0 = XEXP (op0, 0);
>>  else if (GET_CODE (op1) == VEC_DUPLICATE)
>>    op1 = XEXP (op1, 0);
>> +   /* The same argument applies to the VEC_SELECT when using the lane-
>> +  referencing forms of the MUL/MLA/MLS instructions. Without the
>> +  traversal here, the combine pass deems these patterns too
>> +  expensive and subsequently does not emit the lane-referencing
>> +  forms of the instructions. In addition, canonical form is for the
>> +  VEC_SELECT to be the second argument of the multiply - thus only
>> +  op1 is traversed.  */
>> +   if (GET_CODE (op1) == VEC_SELECT
>> +   && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> + op1 = XEXP (op1, 0);
>> +   else if ((GET_CODE (op1) == ZERO_EXTEND
>> + || GET_CODE (op1) == SIGN_EXTEND)
>> +    && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
>> +    && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> + op1 = XEXP (XEXP (op1, 0), 0);
>
> I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
> if block, since the condition is never true otherwise.  We can probably
> skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
> it would be better to add them to the existing VEC_DUPLICATE tests rather
> than restrict them to the VEC_SELECT ones.
>
> Also, although this is in Advanced SIMD-specific code, I think it'd be
> better to use:
>
>   is_a<scalar_mode> (GET_MODE (op1))
>
> instead of:
>
>   GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1
>
> Do you have a testcase?
>
> Thanks,
> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> 3bdf19d71b54d0ade8e5648323f6e1f012bc4f8f..5809887997305317c5a81421089db431685e2927
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11908,6 +11908,26 @@ aarch64_strip_extend (rtx x, bool strip_shift)
>return x;
>  }
>  
> +
> +/* Helper function for rtx cost calculation. Strip VEC_DUPLICATE as well as
> +   any subsequent extend and VEC_SELECT from X. Returns the inner scalar
> +   operand if successful, or the original expression on failure.  */
> +static rtx
> +aarch64_strip_duplicate_vec_elt (rtx x)
> +{
> +  if (GET_CODE (x) == VEC_DUPLICATE
> +  && is_a (GET_MODE (XEXP (x, 0
> +{
> +  x = XEXP (x, 0);
> +  if (GET_CODE (x) == VEC_SELECT)
> + x = XEXP (x, 0);
> +  else if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE 

[PATCH] tree-optimization/101769 - tail recursion creates possibly infinite loop

2021-08-04 Thread Richard Biener
This makes tail recursion optimization produce a loop structure
manually rather than relying on loop fixup.  That also allows the
loop to be marked as finite (it would eventually blow the stack
if it were not).
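
For illustration, the transformation turns a self-recursive tail call
into a back edge, conceptually like this (example unrelated to the
testcase below):

  struct node { struct node *next; };
  extern void process (struct node *);

  /* Before: the recursive call is in tail position.  */
  void
  drain (struct node *n)
  {
    if (!n)
      return;
    process (n);
    drain (n->next);
  }

  /* After tail-recursion elimination the call becomes a jump back to
     the top of the function, i.e. a loop.  That loop can be marked
     finite because unbounded recursion would have exhausted the stack
     instead of looping forever.  */
  void
  drain_as_loop (struct node *n)
  {
    while (n)
      {
        process (n);
        n = n->next;
      }
  }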

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

2021-08-04  Richard Biener  

PR tree-optimization/101769
* tree-tailcall.c (eliminate_tail_call): Add the created loop
for the first recursion and return it via the new output parameter.
(optimize_tail_call): Pass through new output param.
(tree_optimize_tail_calls_1): After creating all latches,
add the created loop to the loop tree.  Do not mark loops for fixup.

* g++.dg/tree-ssa/pr101769.C: New testcase.
---
 gcc/testsuite/g++.dg/tree-ssa/pr101769.C | 56 
 gcc/tree-tailcall.c  | 34 --
 2 files changed, 77 insertions(+), 13 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr101769.C

diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr101769.C 
b/gcc/testsuite/g++.dg/tree-ssa/pr101769.C
new file mode 100644
index 000..4979c42236b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr101769.C
@@ -0,0 +1,56 @@
+// { dg-do compile }
+// { dg-require-effective-target c++11 }
+// { dg-options "-O2 -fdump-tree-optimized" }
+
+struct Node
+{
+  Node*right;
+  Node*down;
+};
+
+inline
+void free_node(Node*)
+{
+}
+
+void free_all(Node* n_)
+{
+  if (n_ == nullptr) {
+  return;
+  }
+  free_all(n_->right);
+  do {
+  Node* t = n_->down;
+  free_node(n_);
+  n_ = t;
+  } while (n_);
+}
+
+void free_all2_r(Node* n_)
+{
+  if (n_->right) {
+  free_all2_r(n_->right);
+  }
+  do {
+  Node* t = n_->down;
+  free_node(n_);
+  n_ = t;
+  } while (n_);
+}
+
+void free_all2(Node* n_)
+{
+  if (n_) {
+  free_all2_r(n_);
+  }
+}
+
+void loop(Node* n_)
+{
+  do {
+  n_ = n_->down;
+  } while (n_);
+}
+
+// All functions should be empty.
+// { dg-final { scan-tree-dump-times "header = first;
+  new_loop->finite_p = true;
+}
+  else
+gcc_assert (new_loop->header == first);
+
   /* Add phi node entries for arguments.  The ordering of the phi nodes should
  be the same as the ordering of the arguments.  */
   for (param = DECL_ARGUMENTS (current_function_decl),
@@ -1037,11 +1045,12 @@ eliminate_tail_call (struct tailcall *t)
mark the tailcalls for the sibcall optimization.  */
 
 static bool
-optimize_tail_call (struct tailcall *t, bool opt_tailcalls)
+optimize_tail_call (struct tailcall *t, bool opt_tailcalls,
+   class loop *&new_loop)
 {
   if (t->tail_recursion)
 {
-  eliminate_tail_call (t);
+  eliminate_tail_call (t, new_loop);
   return true;
 }
 
@@ -1177,12 +1186,15 @@ tree_optimize_tail_calls_1 (bool opt_tailcalls)
   opt_tailcalls = false;
 }
 
+  class loop *new_loop = NULL;
   for (; tailcalls; tailcalls = next)
 {
   next = tailcalls->next;
-  changed |= optimize_tail_call (tailcalls, opt_tailcalls);
+  changed |= optimize_tail_call (tailcalls, opt_tailcalls, new_loop);
   free (tailcalls);
 }
+  if (new_loop)
+add_loop (new_loop, loops_for_fn (cfun)->tree_root);
 
   if (a_acc || m_acc)
 {
@@ -1198,11 +1210,7 @@ tree_optimize_tail_calls_1 (bool opt_tailcalls)
 }
 
   if (changed)
-{
-  /* We may have created new loops.  Make them magically appear.  */
-  loops_state_set (LOOPS_NEED_FIXUP);
-  free_dominance_info (CDI_DOMINATORS);
-}
+free_dominance_info (CDI_DOMINATORS);
 
   /* Add phi nodes for the virtual operands defined in the function to the
  header of the loop created by tail recursion elimination.  Do so
-- 
2.31.1


Re: [PATCH 0/3] arm: fix problems when targetting extended FPUs [PR101723]

2021-08-04 Thread Christophe Lyon via Gcc-patches
On Tue, Aug 3, 2021 at 5:40 PM Richard Earnshaw <
richard.earns...@foss.arm.com> wrote:

>
>
> On 03/08/2021 16:04, Christophe Lyon via Gcc-patches wrote:
> > On Mon, Aug 2, 2021 at 4:57 PM Richard Earnshaw 
> wrote:
> >
> >> This patch series addresses an issue that has come to light due to a
> >> change in the way GAS handles .fpu directives in the assembler.  A fix
> >> to the assembler made in binutils 2.34 to clear out all features
> >> realated to the FPU when .fpu is emitted has started causing problems
> >> for GCC because of the order in which we emit .fpu and .arch_extension
> >> directives.  To fully address this we need to re-organize the way in
> >> which the compiler does this.
> >>
> >> I'll hold of pushing the patches for a couple of days.  Although I've
> >> gone through the testsuite quite carefully and run this through
> >> several configurations, it's possible that this may have some impact
> >> on the testsuite that I've missed.  Christophe, is the any chance you
> >> can run this through your test environment before I commit this?
> >>
> >>
> > Sorry for the delay, still unpacking emails after holidays.
> >
> > Yes I can run the validation for these patches. I think you mean with
> all 3
> > patches combined, not 3 validations (patch 1, patches 1+2, patches 1-3) ?
>
> Yes, the first two are trivial changes that just support the interesting
> one, which is the final patch.
>
>
Hi Richard,

There are a few regressions with these 3 patches applied, see:

https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r12-2683-g4d17ca1bc74109e5cc4ef34890b6293c4bcb1d6a-PR101723.patch/report-build-info.html

The cortex-m55-nofp-* failures reported in several configs are not actual
regressions, it seems: they were already failing, but with a different
scan-assembler string, so they are considered different tests.
I think I sent patches for these unresolved cortex-m55-nofp-*
testcases several weeks/months ago, but I'd have to check.

There are regressions when configured --with-cpu cortex-a5 --with-fpu
vfpv3-d16-fp16 as well as on armeb.

Thanks,

Christophe

R.
> >
> > Thanks,
> >
> > Christophe
> >
> >
> >> R.
> >>
> >> Richard Earnshaw (3):
> >>arm: ensure the arch_name is always set for the build target
> >>arm: Don't reconfigure globals in arm_configure_build_target
> >>arm: reorder assembler architecture directives [PR101723]
> >>
> >>   gcc/config/arm/arm-c.c|   1 +
> >>   gcc/config/arm/arm-cpus.in|   1 +
> >>   gcc/config/arm/arm.c  | 190 --
> >>   gcc/testsuite/gcc.target/arm/attr-neon.c  |   9 +-
> >>   gcc/testsuite/gcc.target/arm/attr-neon2.c |  35 +++-
> >>   gcc/testsuite/gcc.target/arm/attr-neon3.c |  43 +++-
> >>   .../arm/cortex-m55-nofp-flag-hard.c   |   2 +-
> >>   .../arm/cortex-m55-nofp-flag-softfp.c |   2 +-
> >>   .../arm/cortex-m55-nofp-nomve-flag-softfp.c   |   2 +-
> >>   .../gcc.target/arm/mve/intrinsics/mve_fpu1.c  |   5 +-
> >>   .../gcc.target/arm/mve/intrinsics/mve_fpu2.c  |   5 +-
> >>   gcc/testsuite/gcc.target/arm/pr98636.c|   3 +-
> >>   12 files changed, 153 insertions(+), 145 deletions(-)
> >>
> >> --
> >> 2.25.1
> >>
> >>
>


[PATCH][pushed] docs: document threader-mode param

2021-08-04 Thread Martin Liška

Hi.

Pushing as obvious.

Martin

gcc/ChangeLog:

* doc/invoke.texi: Document threader-mode param.
---
 gcc/doc/invoke.texi | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 65bb9981f02..4efc8b757ec 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13421,6 +13421,9 @@ Setting to 0 disables the analysis completely.
 @item modref-max-escape-points
 Specifies the maximum number of escape points tracked by modref per SSA-name.
 
+@item threader-mode

+Specifies the mode the backwards threader should run in.
+
 @item profile-func-internal-id
 A parameter to control whether to use function internal id in profile
 database lookup. If the value is 0, the compiler uses an id that
--
2.32.0



Re: [PATCH] by_pieces: Properly set m_max_size in op_by_pieces

2021-08-04 Thread Richard Sandiford via Gcc-patches
"H.J. Lu via Gcc-patches"  writes:
> @@ -1122,8 +1122,8 @@ class op_by_pieces_d
> and its associated FROM_CFN_DATA can be used to replace loads with
> constant values.  LEN describes the length of the operation.  */
> 
> -op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
> - rtx from, bool from_load,
> +op_by_pieces_d::op_by_pieces_d (unsigned int max_pieces, rtx to,
> + bool to_load, rtx from, bool from_load,
>   by_pieces_constfn from_cfn,
>   void *from_cfn_data,
>   unsigned HOST_WIDE_INT len,

The comment above the function needs to describe the new parameter.
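
Something along these lines would do (wording only a suggestion):

  /* ...  MAX_PIECES is the largest size, in bytes, that may be used for
     a single piece of the operation.  LEN describes the length of the
     operation.  */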

OK with that change, thanks.

Richard


Re: [PATCH 1/7] fortran: new abstract class gfc_dummy_arg

2021-08-04 Thread Thomas Koenig via Gcc-patches



Hi Mikael,


Introduce a new abstract class gfc_dummy_arg that provides a common
interface to both dummy arguments of user-defined procedures (which
have type gfc_formal_arglist) and dummy arguments of intrinsic procedures
(which have type gfc_intrinsic_arg).


good to see you again!

So far, we have refrained from adding too many explicit C++-isms into
the code, and if we do, my participation at least will have to be
reduced sharply (I don't speak much C++, and I don't intend to learn).

So, is this a path we want to go down?

Regards

Thomas