Re: [x86 SSE] Improve handling of ternlog instructions in i386/sse.md (v2)

2024-05-20 Thread Alexander Monakov


Hello!

I looked at ternlog a bit last year, so I'd like to offer some drive-by
comments. If you want to tackle them in a follow-up patch, or leave for
someone else to handle, please let me know.

On Fri, 17 May 2024, Roger Sayle wrote:

> This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?

Just to make sure: no new tests for the new tricks?

> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> +/* Determine the ternlog immediate index that implements 3-operand
> +   ternary logic expression OP.  This uses and modifies the 3 element
> +   array ARGS to record and check the leaves, either 3 REGs, or 2 REGs
> +   and MEM.  Returns an index between 0 and 255 for a valid ternlog,
> +   or -1 if the expression isn't suitable.  */
> +
> +int
> +ix86_ternlog_idx (rtx op, rtx *args)
> +{
> +  int idx0, idx1;
> +
> +  if (!op)
> +return -1;
> +
> +  switch (GET_CODE (op))
> +{
> +case REG:
> +  if (!args[0])
> + {
> +   args[0] = op;
> +   return 0xf0;

From a readability perspective, I wonder if it's nicer to have something like

enum {
  TERNLOG_A = 0xf0,
  TERNLOG_B = 0xcc,
  TERNLOG_C = 0xaa
};

and then use them to build the immediates.

> + }
> +  if (REGNO (op) == REGNO (args[0]))
> + return 0xf0;
> +  if (!args[1])
> + {
> +   args[1] = op;
> +   return 0xcc;
> + }
[snip]
> +
> +/* Return TRUE if OP (in mode MODE) is the leaf of a ternary logic
> +   expression, such as a register or a memory reference.  */
> + 
> +bool
> +ix86_ternlog_leaf_p (rtx op, machine_mode mode)
> +{
> +  /* We can't use memory_operand here, as it may return a different
> + value before and after reload (for volatile MEMs) which creates
> + problems splitting instructions.  */
> +  return register_operand (op, mode)
> +  || MEM_P (op)
> +  || GET_CODE (op) == CONST_VECTOR
> +  || bcst_mem_operand (op, mode);

Did your editor automatically indent this correctly for you? I think
such expressions usually have outer parentheses.

> +}
[snip]
> +/* Expand a 3-operand ternary logic expression.  Return TARGET. */
> +rtx
> +ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2, int idx,
> +  rtx target)
> +{
> +  rtx tmp0, tmp1, tmp2;
> +
> +  if (!target)
> +target = gen_reg_rtx (mode);
> +
> +  /* Canonicalize ternlog index for degenerate (duplicated) operands.  */

But this only canonicalizes the case of triplicated operands, and does nothing
if two operands are duplicates of each other, and the third is distinct.
Handling that would complicate the already large patch a lot though.

> +  if (rtx_equal_p (op0, op1) && rtx_equal_p (op0, op2))
> +switch (idx & 0x81)
> +  {
> +  case 0x00:
> + idx = 0x00;
> + break;
> +  case 0x01:
> + idx = 0x0f;
> + break;
> +  case 0x80:
> + idx = 0xf0;
> + break;
> +  case 0x81:
> + idx = 0xff;
> + break;
> +  }
> +
> +  switch (idx & 0xff)
> +{
> +case 0x00:
> +  if ((!op0 || !side_effects_p (op0))
> +  && (!op1 || !side_effects_p (op1))
> +  && (!op2 || !side_effects_p (op2)))
> +{
> +   emit_move_insn (target, CONST0_RTX (mode));
> +   return target;
> + }
> +  break;
> +
> +case 0x0a: /* ~a */

With the enum idea above, this could be 'case ~TERNLOG_A & TERNLOG_C', etc.

Alexander


[PATCH] tree-into-ssa: speed up sorting in prune_unused_phi_nodes [PR114480]

2024-05-15 Thread Alexander Monakov
In PR 114480 we are hitting a case where tree-into-ssa scales
quadratically due to prune_unused_phi_nodes doing O(N log N)
work for N basic blocks, for each variable individually.
Sorting the 'defs' array is especially costly.

It is possible to assist gcc_qsort by laying out dfs_out entries
in the reverse order in the 'defs' array, starting from its tail.
This is not always a win (in fact it flips most of 7-element qsorts
in this testcase from 9 comparisons (best case) to 15 (worst case)),
but overall it helps on the testcase and on libstdc++ build.
On the testcase we go from 1.28e9 comparator invocations to 1.05e9,
on libstdc++ from 2.91e6 to 2.84e6.

gcc/ChangeLog:

* tree-into-ssa.cc (prune_unused_phi_nodes): Add dfs_out entries
to the 'defs' array in the reverse order.
---

I expect it's possible to avoid the quadratic behavior in the first place,
but that needs looking at the wider picture of SSA construction. Meanwhile,
might as well pick up this low-hanging fruit.

Richi kindly preapproved the patch on Bugzilla, I'll hold off committing
for a day or two in case there are comments.

 gcc/tree-into-ssa.cc | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-into-ssa.cc b/gcc/tree-into-ssa.cc
index 3732c269ca..5b367c3581 100644
--- a/gcc/tree-into-ssa.cc
+++ b/gcc/tree-into-ssa.cc
@@ -805,21 +805,22 @@ prune_unused_phi_nodes (bitmap phis, bitmap kills, bitmap uses)
  locate the nearest dominating def in logarithmic time by binary search.*/
   bitmap_ior (to_remove, kills, phis);
   n_defs = bitmap_count_bits (to_remove);
-  defs = XNEWVEC (struct dom_dfsnum, 2 * n_defs + 1);
+  adef = 2 * n_defs + 1;
+  defs = XNEWVEC (struct dom_dfsnum, adef);
   defs[0].bb_index = 1;
   defs[0].dfs_num = 0;
-  adef = 1;
+  struct dom_dfsnum *head = defs + 1, *tail = defs + adef;
   EXECUTE_IF_SET_IN_BITMAP (to_remove, 0, i, bi)
 {
   def_bb = BASIC_BLOCK_FOR_FN (cfun, i);
-  defs[adef].bb_index = i;
-  defs[adef].dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
-  defs[adef + 1].bb_index = i;
-  defs[adef + 1].dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
-  adef += 2;
+  head->bb_index = i;
+  head->dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
+  head++, tail--;
+  tail->bb_index = i;
+  tail->dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
 }
+  gcc_checking_assert (head == tail);
   BITMAP_FREE (to_remove);
-  gcc_assert (adef == 2 * n_defs + 1);
   qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);
   gcc_assert (defs[0].bb_index == 1);
 
-- 
2.44.0



Re: [PATCH v2] object lifetime instrumentation for Valgrind [PR66487]

2024-05-15 Thread Alexander Monakov


Hello,

I'd like to ask if anyone has any new thoughts on this patch.

Let me also point out that valgrind/memcheck.h is permissively
licensed (BSD-style, rest of Valgrind is GPLv2), with the intention
to allow importing into projects that are interested in using
client requests without build-time dependency on installed headers.
So maybe we have that as an option too.

Alexander

On Fri, 22 Dec 2023, Alexander Monakov wrote:

> From: Daniil Frolov 
> 
> PR 66487 is asking to provide sanitizer-like detection for C++ object
> lifetime violations that are worked around with -fno-lifetime-dse or
> -flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).
> 
> The discussion in the PR was centered around extending MSan, but MSan
> was not ported to GCC (and requires rebuilding everything with
> instrumentation).
> 
> Instead, allow Valgrind to see lifetime boundaries by emitting client
> requests along *this = { CLOBBER }.  The client request marks the
> "clobbered" memory as undefined for Valgrind; clobbering assignments
> mark the beginning of ctor and end of dtor execution for C++ objects.
> Hence, attempts to read object storage after the destructor, or
> "pre-initialize" its fields prior to the constructor will be caught.
> 
> Valgrind client requests are offered as macros that emit inline asm.
> For use in code generation, let's wrap them as libgcc builtins.
> 
> gcc/ChangeLog:
> 
>   * Makefile.in (OBJS): Add gimple-valgrind-interop.o.
>   * builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
>   * common.opt (-fvalgrind-annotations): New option.
>   * doc/install.texi (--enable-valgrind-interop): Document.
>   * doc/invoke.texi (-fvalgrind-annotations): Document.
>   * passes.def (pass_instrument_valgrind): Add.
>   * tree-pass.h (make_pass_instrument_valgrind): Declare.
>   * gimple-valgrind-interop.cc: New file.
> 
> libgcc/ChangeLog:
> 
>   * Makefile.in (LIB2ADD_ST): Add valgrind-interop.c.
>   * config.in: Regenerate.
>   * configure: Regenerate.
>   * configure.ac (--enable-valgrind-interop): New flag.
>   * libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
>   * valgrind-interop.c: New file.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/valgrind-annotations-1.C: New test.
>   * g++.dg/valgrind-annotations-2.C: New test.
> 
> Co-authored-by: Alexander Monakov 
> ---
> Changes in v2:
> 
> * Take new clobber kinds into account.
> * Do not link valgrind-interop.o into libgcc_s.so.
> 
>  gcc/Makefile.in   |   1 +
>  gcc/builtins.def  |   3 +
>  gcc/common.opt|   4 +
>  gcc/doc/install.texi  |   5 +
>  gcc/doc/invoke.texi   |  27 
>  gcc/gimple-valgrind-interop.cc| 125 ++
>  gcc/passes.def|   1 +
>  gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 +++
>  gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
>  gcc/tree-pass.h   |   1 +
>  libgcc/Makefile.in|   3 +
>  libgcc/config.in  |   6 +
>  libgcc/configure  |  22 ++-
>  libgcc/configure.ac   |  15 ++-
>  libgcc/libgcc2.h  |   2 +
>  libgcc/valgrind-interop.c |  40 ++
>  16 files changed, 287 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/gimple-valgrind-interop.cc
>  create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
>  create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
>  create mode 100644 libgcc/valgrind-interop.c
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 9373800018..d027548203 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1507,6 +1507,7 @@ OBJS = \
>   gimple-ssa-warn-restrict.o \
>   gimple-streamer-in.o \
>   gimple-streamer-out.o \
> + gimple-valgrind-interop.o \
>   gimple-walk.o \
>   gimple-warn-recursion.o \
>   gimplify.o \
> diff --git a/gcc/builtins.def b/gcc/builtins.def
> index f03df32f98..b05e20e062 100644
> --- a/gcc/builtins.def
> +++ b/gcc/builtins.def
> @@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, ATTR_NOTHROW_LEAF_LIST)
>  /* Control Flow Redundancy hardening out-of-line checker.  */
>  DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
>  
> +/* Wrappers for Valgrind client requests.  */
> +DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, 
> "__gcc_vgmc_make_m

[PATCH v2] object lifetime instrumentation for Valgrind [PR66487]

2023-12-22 Thread Alexander Monakov
From: Daniil Frolov 

PR 66487 is asking to provide sanitizer-like detection for C++ object
lifetime violations that are worked around with -fno-lifetime-dse or
-flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).

The discussion in the PR was centered around extending MSan, but MSan
was not ported to GCC (and requires rebuilding everything with
instrumentation).

Instead, allow Valgrind to see lifetime boundaries by emitting client
requests along *this = { CLOBBER }.  The client request marks the
"clobbered" memory as undefined for Valgrind; clobbering assignments
mark the beginning of ctor and end of dtor execution for C++ objects.
Hence, attempts to read object storage after the destructor, or
"pre-initialize" its fields prior to the constructor will be caught.

Valgrind client requests are offered as macros that emit inline asm.
For use in code generation, let's wrap them as libgcc builtins.

gcc/ChangeLog:

* Makefile.in (OBJS): Add gimple-valgrind-interop.o.
* builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
* common.opt (-fvalgrind-annotations): New option.
* doc/install.texi (--enable-valgrind-interop): Document.
* doc/invoke.texi (-fvalgrind-annotations): Document.
* passes.def (pass_instrument_valgrind): Add.
* tree-pass.h (make_pass_instrument_valgrind): Declare.
* gimple-valgrind-interop.cc: New file.

libgcc/ChangeLog:

* Makefile.in (LIB2ADD_ST): Add valgrind-interop.c.
* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (--enable-valgrind-interop): New flag.
* libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
* valgrind-interop.c: New file.

gcc/testsuite/ChangeLog:

* g++.dg/valgrind-annotations-1.C: New test.
* g++.dg/valgrind-annotations-2.C: New test.

Co-authored-by: Alexander Monakov 
---
Changes in v2:

* Take new clobber kinds into account.
* Do not link valgrind-interop.o into libgcc_s.so.

 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 
 gcc/gimple-valgrind-interop.cc| 125 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 +++
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 ++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 ++
 16 files changed, 287 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 9373800018..d027548203 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1507,6 +1507,7 @@ OBJS = \
gimple-ssa-warn-restrict.o \
gimple-streamer-in.o \
gimple-streamer-out.o \
+   gimple-valgrind-interop.o \
gimple-walk.o \
gimple-warn-recursion.o \
gimplify.o \
diff --git a/gcc/builtins.def b/gcc/builtins.def
index f03df32f98..b05e20e062 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, ATTR_NOTHROW_LEAF_LIST)
 /* Control Flow Redundancy hardening out-of-line checker.  */
 DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
 
+/* Wrappers for Valgrind client requests.  */
+DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, "__gcc_vgmc_make_mem_undefined", BT_FN_VOID_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
+
 /* Synchronization Primitives.  */
 #include "sync-builtins.def"
 
diff --git a/gcc/common.opt b/gcc/common.opt
index d263a959df..2be5b8d0a6 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3377,6 +3377,10 @@ Enum(auto_init_type) String(pattern) Value(AUTO_INIT_PATTERN)
 EnumValue
 Enum(auto_init_type) String(zero) Value(AUTO_INIT_ZERO)
 
+fvalgrind-annotations
+Common Var(flag_valgrind_annotations) Optimization
+Annotate lifetime boundaries with Valgrind client requests.
+
 ; -fverbose-asm causes extra commentary information to be produced in
 ; the generated assembly code (to make it more readable).  This option
 ; is generally only of use to those who actually need to read the
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index d20b43a5b2..d6e5e5fdaf 100644
--- a/gcc/doc

Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Alexander Monakov

On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka  wrote:
> >
> > Hi,
> > this patch disables use of FMA in matrix multiplication loop for generic 
> > (for
> > x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >578,500,241  cycles:u #3.645 GHz 
> > ( +-  0.12% )
> >753,318,477  instructions:u   #1.30  insn 
> > per cycle  ( +-  0.00% )
> >125,417,701  branches:u   #  790.227 M/sec   
> > ( +-  0.00% )
> >   0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
> >
> >
> > No FMA:
> >
> >577,573,960  cycles:u #3.514 GHz 
> > ( +-  0.15% )
> >878,318,479  instructions:u   #1.52  insn 
> > per cycle  ( +-  0.00% )
> >125,417,702  branches:u   #  763.035 M/sec   
> > ( +-  0.00% )
> >   0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes same time 
> > as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> >  484875179  cycles:u #3.599 GHz 
> >  ( +-  0.05% )  (82.11%)
> >  752031517  instructions:u   #1.55  insn 
> > per cycle
> >  125106525  branches:u   #  928.712 M/sec   
> >  ( +-  0.03% )  (85.09%)
> > 128356  branch-misses:u  #0.10% of all 
> > branches  ( +-  0.06% )  (83.58%)
> >
> > No FMA:
> >  375875209  cycles:u #3.592 GHz 
> >  ( +-  0.08% )  (80.74%)
> >  875725341  instructions:u   #2.33  insn 
> > per cycle
> >  124903825  branches:u   #1.194 G/sec   
> >  ( +-  0.04% )  (84.59%)
> >   0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
> >
> > The difference is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.

> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.

The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple, on Zen4 FMA has latency 4 while add has latency 3, and you
clearly see it in the quoted numbers: zen-with-fma has slightly below 4
cycles per branch, zen-without-fma has exactly 3 cycles per branch.

Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, no x86 core can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
apparently being the sole exception (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle.  So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute?  That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other,
so the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four.  So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency 
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems noticeable win on zen and not loss on Core it seems like 
> > 

Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-12 Thread Alexander Monakov



On Tue, 12 Dec 2023, Jakub Jelinek wrote:

> On Mon, Dec 11, 2023 at 05:00:50PM -0500, Jason Merrill wrote:
> > In discussion of PR71093 it came up that more clobber_kind options would be
> > useful within the C++ front-end.
> > 
> > gcc/ChangeLog:
> > 
> > * tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
> > CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
> > CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
> > * gimple-lower-bitint.cc
> > * gimple-ssa-warn-access.cc
> > * gimplify.cc
> > * tree-inline.cc
> > * tree-ssa-ccp.cc: Adjust for rename.

Doesn't build_clobber_this in the C++ front-end need to be adjusted too?
I think it is used to place clobbers at start of the ctor (should be
CLOBBER_OBJECT_BEGIN in the new nomenclature) and end of the dtor (i.e.
CLOBBER_OBJECT_END).

Alexander


Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-10 Thread Alexander Monakov

On Sun, 10 Dec 2023, Richard Biener wrote:

> > It seems wrong to me: CLOBBER_EOL is documented to mean that the storage is
> > expiring at that point as well, which a (pseudo-)destructor does not imply;
> > it's perfectly valid to destroy an object and then create another in the
> > same storage.
> > 
> > We probably do want another clobber kind for end of object lifetime. And/or
> > one for beginning of object lifetime.
> 
> There’s not much semantically different between UNDEF and end of object but
> not storage lifetime?  At least for what middle-end optimizations do.
> 
> EOL is used by stack slot sharing and that operates on the underlying storage,
> not individual objects live in it.

I thought EOL implies that ASan may poison underlying memory. In the respin
of the Valgrind interop patch we instrument CLOBBER_UNDEF, but not CLOBBER_EOL.

Alexander

Re: [PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov



On Fri, 8 Dec 2023, Jakub Jelinek wrote:

> On Fri, Dec 08, 2023 at 06:43:19PM +0300, Alexander Monakov wrote:
> > On Fri, 8 Dec 2023, Jakub Jelinek wrote:
> > > In your version, did the new function go just to libgcc.a or to
> > > libgcc_s.so.1?
> > 
> > It participates in libgcc_s link, but it's not listed in the version script,
> > so it's not exported from libgcc_s (and -gc-sections should eliminate it).
> 
> Then it at least should not participate in that link.
> There are various other objects which are libgcc.a only (e.g. all of dfp
> stuff, etc.).

Thanks, changing

LIB2ADD += $(srcdir)/valgrind-interop.c

to

LIB2ADD_ST += $(srcdir)/valgrind-interop.c

in my tree achieved that.

Alexander


Re: [PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov


On Fri, 8 Dec 2023, Jakub Jelinek wrote:

> Does the VALGRIND_MAKE_MEM_UNDEFINED macro ever change on arches once
> implemented there?

It seems Valgrind folks take binary compatibility seriously, so that sounds
unlikely.

> Wouldn't this be better done by emitting the sequence inline?
> Even if it is done in libgcc, it is part of ABI.

I'd rather keep it as simple as possible. We could drop the libgcc parts,
users can drop in the wrapper as explained in the manual.

> So, basically add a new optab, valgrind_request, where each target would
> define_insn whatever is needed (it will need to be a single pattern, it
> can't be split among multiple) and sorry on -fvalgrind-annotations if the
> optab is not defined.

There are going to be complications since the request needs a descriptor
structure (on the stack), plus it needs more effort on the GCC side than
Valgrind side (when Valgrind is ported to a new target). I'd rather not
go that way.

> Advantage would be that --enable-valgrind-interop nor building against
> valgrind headers is not needed.

Alternatively, how about synthesizing an auxiliary translation unit with
the wrapper from the driver for -fvalgrind-annotations?

> In your version, did the new function go just to libgcc.a or to
> libgcc_s.so.1?

It participates in libgcc_s link, but it's not listed in the version script,
so it's not exported from libgcc_s (and -gc-sections should eliminate it).

Alexander


[PATCH 1/1] object lifetime instrumentation for Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov
From: Daniil Frolov 

PR 66487 is asking to provide sanitizer-like detection for C++ object
lifetime violations that are worked around with -fno-lifetime-dse or
-flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).

The discussion in the PR was centered around extending MSan, but MSan
was not ported to GCC (and requires rebuilding everything with
instrumentation).

Instead, allow Valgrind to see lifetime boundaries by emitting client
requests along *this = { CLOBBER }.  The client request marks the
"clobbered" memory as undefined for Valgrind; clobbering assignments
mark the beginning of ctor and end of dtor execution for C++ objects.
Hence, attempts to read object storage after the destructor, or
"pre-initialize" its fields prior to the constructor will be caught.

Valgrind client requests are offered as macros that emit inline asm.
For use in code generation, let's wrap them as libgcc builtins.

gcc/ChangeLog:

* Makefile.in (OBJS): Add gimple-valgrind-interop.o.
* builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
* common.opt (-fvalgrind-annotations): New option.
* doc/install.texi (--enable-valgrind-interop): Document.
* doc/invoke.texi (-fvalgrind-annotations): Document.
* passes.def (pass_instrument_valgrind): Add.
* tree-pass.h (make_pass_instrument_valgrind): Declare.
* gimple-valgrind-interop.cc: New file.

libgcc/ChangeLog:

* Makefile.in (LIB2ADD): Add valgrind-interop.c.
* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (--enable-valgrind-interop): New flag.
* libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
* valgrind-interop.c: New file.

gcc/testsuite/ChangeLog:

* g++.dg/valgrind-annotations-1.C: New test.
* g++.dg/valgrind-annotations-2.C: New test.

Co-authored-by: Alexander Monakov 
---
 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 +
 gcc/gimple-valgrind-interop.cc| 112 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 +++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 +++
 16 files changed, 274 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 68410a86af..4db18387c1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1506,6 +1506,7 @@ OBJS = \
gimple-ssa-warn-restrict.o \
gimple-streamer-in.o \
gimple-streamer-out.o \
+   gimple-valgrind-interop.o \
gimple-walk.o \
gimple-warn-recursion.o \
gimplify.o \
diff --git a/gcc/builtins.def b/gcc/builtins.def
index f03df32f98..b05e20e062 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, ATTR_NOTHROW_LEAF_LIST)
 /* Control Flow Redundancy hardening out-of-line checker.  */
 DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
 
+/* Wrappers for Valgrind client requests.  */
+DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, "__gcc_vgmc_make_mem_undefined", BT_FN_VOID_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
+
 /* Synchronization Primitives.  */
 #include "sync-builtins.def"
 
diff --git a/gcc/common.opt b/gcc/common.opt
index f070aff8cb..b53565fc1a 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3372,6 +3372,10 @@ Enum(auto_init_type) String(pattern) Value(AUTO_INIT_PATTERN)
 EnumValue
 Enum(auto_init_type) String(zero) Value(AUTO_INIT_ZERO)
 
+fvalgrind-annotations
+Common Var(flag_valgrind_annotations) Optimization
+Annotate lifetime boundaries with Valgrind client requests.
+
 ; -fverbose-asm causes extra commentary information to be produced in
 ; the generated assembly code (to make it more readable).  This option
 ; is generally only of use to those who actually need to read the
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index c1128d9274..aaf0213bbf 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -1567,6 +1567,11 @@ Disable TM clone registry in libgcc. It is enab

[PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov
I would like to propose Valgrind integration previously sent as RFC for trunk.

Arsen and Sam, since you commented on the RFC I wonder if you can have
a look at the proposed configure and documentation changes and let me
know if they look fine for you? For reference, gccinstall.info will say:

‘--enable-valgrind-interop’
 Provide wrappers for Valgrind client requests in libgcc, which are
 used for ‘-fvalgrind-annotations’.  Requires Valgrind header files
 for the target (in the build-time sysroot if building a
 cross-compiler).

and GCC manual will document the new option as:

 -fvalgrind-annotations
 Emit Valgrind client requests annotating object lifetime
 boundaries.  This makes it possible to detect attempts to access
 fields of a C++ object after its destructor has completed (but the
 storage has not been deallocated yet), or to initialize its fields
 in advance from "operator new" rather than the constructor.

 This instrumentation relies on the presence of the
 "__gcc_vgmc_make_mem_undefined" function that wraps the
 corresponding Valgrind client request. It is provided by libgcc
 when it is configured with --enable-valgrind-interop.  Otherwise,
 you can implement it like this:

 #include 

 void
 __gcc_vgmc_make_mem_undefined (void *addr, size_t size)
 {
   VALGRIND_MAKE_MEM_UNDEFINED (addr, size);
 }

Changes since the RFC:

* Add documentation and tests.

* Drop 'emit-' from -fvalgrind-emit-annotations.

* Use --enable-valgrind-interop instead of overloading
  --enable-valgrind-annotations.

* Do not build the wrapper unless --enable-valgrind-interop is given and
  Valgrind headers are present.

* Clean up libgcc configure changes.
* Reword comments.

Daniil Frolov (1):
  object lifetime instrumentation for Valgrind [PR66487]

 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 +
 gcc/gimple-valgrind-interop.cc| 112 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 +++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 +++
 16 files changed, 274 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

-- 
2.39.2



[committed] sort.cc: fix mentions of sorting networks in comments

2023-11-26 Thread Alexander Monakov
Avoid using 'network sort' (a misnomer) in sort.cc; the correct term is
'sorting networks'.

gcc/ChangeLog:

* sort.cc: Use 'sorting networks' in comments.
---
 gcc/sort.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/sort.cc b/gcc/sort.cc
index 9a0113fb62f..feef345830c 100644
--- a/gcc/sort.cc
+++ b/gcc/sort.cc
@@ -25,7 +25,7 @@ along with GCC; see the file COPYING3.  If not see
- deterministic (but not necessarily stable)
- fast, especially for common cases (0-5 elements of size 8 or 4)
 
-   The implementation uses a network sort for up to 5 elements and
+   The implementation uses sorting networks for up to 5 elements and
a merge sort on top of that.  Neither stage has branches depending on
comparator result, trading extra arithmetic for branch mispredictions.  */
 
@@ -53,7 +53,7 @@ struct sort_ctx
   char   *out; // output buffer
   size_t n;// number of elements
   size_t size; // element size
-  size_t nlim; // limit for network sort
+  size_t nlim; // limit for using sorting networks
 };
 
 /* Like sort_ctx, but for use with qsort_r-style comparators.  Several
@@ -151,7 +151,7 @@ cmp1 (char *e0, char *e1, sort_ctx *c)
   return x & (c->cmp (e0, e1) >> 31);
 }
 
-/* Execute network sort on 2 to 5 elements from IN, placing them into C->OUT.
+/* Apply a sorting network to 2 to 5 elements from IN, placing them into C->OUT.
IN may be equal to C->OUT, in which case elements are sorted in place.  */
template<typename sort_ctx>
 static void
-- 
2.33.0



[PATCH 0/2] Clean up Valgrind configury

2023-11-23 Thread Alexander Monakov
We have an RFC patch [1] that adds a libgcc wrapper for a Valgrind client
request.  GCC has autoconf detection for Valgrind in the compiler proper
as well as libcpp (where it is actually unnecessary).

It's grown rusty, let's clean it up.

[1] https://inbox.sourceware.org/gcc-patches/20231024141124.210708-1-exactl...@ispras.ru/

Alexander Monakov (2):
  libcpp: configure: drop unused Valgrind detection
  gcc: configure: drop Valgrind 3.1 compatibility

 gcc/config.in   | 12 ---
 gcc/configure   | 80 +++--
 gcc/configure.ac| 49 +++
 gcc/system.h| 23 ++---
 libcpp/config.in| 15 ++---
 libcpp/configure| 70 +--
 libcpp/configure.ac | 51 ++---
 libcpp/lex.cc   |  4 +--
 8 files changed, 29 insertions(+), 275 deletions(-)

-- 
2.39.2



[PATCH 1/2] libcpp: configure: drop unused Valgrind detection

2023-11-23 Thread Alexander Monakov
When top-level configure has either --enable-checking=valgrind or
--enable-valgrind-annotations, we want to activate a couple of workarounds
in libcpp. They do not use anything from the Valgrind API, so just
delete all detection.

libcpp/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (ENABLE_VALGRIND_CHECKING): Delete.
(ENABLE_VALGRIND_ANNOTATIONS): Rename to
ENABLE_VALGRIND_WORKAROUNDS.  Delete Valgrind header checks.
* lex.cc (new_buff): Adjust for renaming.
(_cpp_free_buff): Ditto.
---
 libcpp/config.in| 15 ++
 libcpp/configure| 70 +
 libcpp/configure.ac | 51 ++---
 libcpp/lex.cc   |  4 +--
 4 files changed, 9 insertions(+), 131 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index df4fd44c9e..253ef03a3d 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -24,12 +24,9 @@
language is requested. */
 #undef ENABLE_NLS
 
-/* Define to get calls to the valgrind runtime enabled. */
-#undef ENABLE_VALGRIND_ANNOTATIONS
-
-/* Define if you want to workaround valgrind (a memory checker) warnings about
-   possible memory leaks because of libcpp use of interior pointers. */
-#undef ENABLE_VALGRIND_CHECKING
+/* Define if you want to workaround Valgrind warnings about possible memory
+   leaks because of libcpp use of interior pointers. */
+#undef ENABLE_VALGRIND_WORKAROUNDS
 
 /* Define to 1 if you have `alloca', as a function or macro. */
 #undef HAVE_ALLOCA
@@ -201,9 +198,6 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_LOCALE_H
 
-/* Define if valgrind's memcheck.h header is installed. */
-#undef HAVE_MEMCHECK_H
-
 /* Define to 1 if you have the  header file. */
 #undef HAVE_MEMORY_H
 
@@ -252,9 +246,6 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_UNISTD_H
 
-/* Define if valgrind's valgrind/memcheck.h header is installed. */
-#undef HAVE_VALGRIND_MEMCHECK_H
-
 /* Define as const if the declaration of iconv() needs const. */
 #undef ICONV_CONST
 
diff --git a/libcpp/configure b/libcpp/configure
index 452e4c1f6c..8a38c0546e 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9116,12 +9116,6 @@ $as_echo "#define ENABLE_ASSERT_CHECKING 1" >>confdefs.h
 
 fi
 
-if test x$ac_valgrind_checking != x ; then
-
-$as_echo "#define ENABLE_VALGRIND_CHECKING 1" >>confdefs.h
-
-fi
-
 # Check whether --enable-canonical-system-headers was given.
 if test "${enable_canonical_system_headers+set}" = set; then :
   enableval=$enable_canonical_system_headers;
@@ -9405,62 +9399,6 @@ case x$enable_languages in
 esac
 
 
-ac_fn_c_check_header_mongrel "$LINENO" "valgrind.h" "ac_cv_header_valgrind_h" "$ac_includes_default"
-if test "x$ac_cv_header_valgrind_h" = xyes; then :
-  have_valgrind_h=yes
-else
-  have_valgrind_h=no
-fi
-
-
-
-# It is certainly possible that there's valgrind but no valgrind.h.
-# GCC relies on making annotations so we must have both.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in <valgrind/memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <valgrind/memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <valgrind/memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_c_try_cpp "$LINENO"; then :
-  gcc_cv_header_valgrind_memcheck_h=yes
-else
-  gcc_cv_header_valgrind_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_valgrind_memcheck_h" >&5
-$as_echo "$gcc_cv_header_valgrind_memcheck_h" >&6; }
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in <memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_c_try_cpp "$LINENO"; then :
-  gcc_cv_header_memcheck_h=yes
-else
-  gcc_cv_header_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_memcheck_h" >&5
-$as_echo "$gcc_cv_header_memcheck_h" >&6; }
-if test $gcc_cv_header_valgrind_memcheck_h = yes; then
-
-$as_echo "#define HAVE_VALGRIND_MEMCHECK_H 1" >>confdefs.h
-
-fi
-if test $gcc_cv_header_memcheck_h = yes; then
-
-$as_echo "#define HAVE_MEMCHECK_H 1" >>confdefs.h
-
-fi
-
 # Check whether --enable-valgrind-annotations was given.
 if test "${enable_valgrind_annotations+set}" = set; then :
   enableval=$enable_valgrind_annotations;
@@ -9470,14 +9408,8 @@ fi
 
 if test x$enable_valgrind_annotations != xno \
 || test x$ac_valgrind_checking != x; then
-  if (test $have_valgrind_h = no \
-  && test $gcc_cv_header_memcheck_h = no \
-  && test $gcc_cv_header_valgrind_memcheck_h = no); then
-as_fn_error $? "*** valgrind annotations requested, but" "$LINENO" 5
-as_fn_error $? "*** Can't find 

[PATCH 2/2] gcc: configure: drop Valgrind 3.1 compatibility

2023-11-23 Thread Alexander Monakov
Our system.h and configure.ac try to accommodate valgrind-3.1, but it is
more than 15 years old at this point. As Valgrind-based checking is a
developer-oriented feature, drop the compatibility stuff and streamline
the detection.

gcc/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac: Delete manual checks for old Valgrind headers.
* system.h (VALGRIND_MAKE_MEM_NOACCESS): Delete.
(VALGRIND_MAKE_MEM_DEFINED): Delete.
(VALGRIND_MAKE_MEM_UNDEFINED): Delete.
(VALGRIND_MALLOCLIKE_BLOCK): Delete.
(VALGRIND_FREELIKE_BLOCK): Delete.
---
 gcc/config.in| 12 
 gcc/configure| 80 
 gcc/configure.ac | 49 +++--
 gcc/system.h | 23 ++
 4 files changed, 20 insertions(+), 144 deletions(-)

diff --git a/gcc/config.in b/gcc/config.in
index e100c20dcd..3dfc65b844 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -1868,12 +1868,6 @@
 #endif
 
 
-/* Define if valgrind's memcheck.h header is installed. */
-#ifndef USED_FOR_TARGET
-#undef HAVE_MEMCHECK_H
-#endif
-
-
 /* Define to 1 if you have the  header file. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_MEMORY_H
@@ -2136,12 +2130,6 @@
 #endif
 
 
-/* Define if valgrind's valgrind/memcheck.h header is installed. */
-#ifndef USED_FOR_TARGET
-#undef HAVE_VALGRIND_MEMCHECK_H
-#endif
-
-
 /* Define to 1 if you have the `vfork' function. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_VFORK
diff --git a/gcc/configure b/gcc/configure
index cc0c3aad67..5be4592ba0 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -7679,63 +7679,6 @@ $as_echo "#define ENABLE_FOLD_CHECKING 1" >>confdefs.h
 fi
 valgrind_path_defines=
 valgrind_command=
-
-ac_fn_cxx_check_header_mongrel "$LINENO" "valgrind.h" "ac_cv_header_valgrind_h" "$ac_includes_default"
-if test "x$ac_cv_header_valgrind_h" = xyes; then :
-  have_valgrind_h=yes
-else
-  have_valgrind_h=no
-fi
-
-
-
-# It is certainly possible that there's valgrind but no valgrind.h.
-# GCC relies on making annotations so we must have both.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in <valgrind/memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <valgrind/memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <valgrind/memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_cxx_try_cpp "$LINENO"; then :
-  gcc_cv_header_valgrind_memcheck_h=yes
-else
-  gcc_cv_header_valgrind_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_valgrind_memcheck_h" >&5
-$as_echo "$gcc_cv_header_valgrind_memcheck_h" >&6; }
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in <memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_cxx_try_cpp "$LINENO"; then :
-  gcc_cv_header_memcheck_h=yes
-else
-  gcc_cv_header_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_memcheck_h" >&5
-$as_echo "$gcc_cv_header_memcheck_h" >&6; }
-if test $gcc_cv_header_valgrind_memcheck_h = yes; then
-
-$as_echo "#define HAVE_VALGRIND_MEMCHECK_H 1" >>confdefs.h
-
-fi
-if test $gcc_cv_header_memcheck_h = yes; then
-
-$as_echo "#define HAVE_MEMCHECK_H 1" >>confdefs.h
-
-fi
-
 if test x$ac_valgrind_checking != x ; then
 
 # Prepare PATH_SEPARATOR.
@@ -7804,11 +7747,8 @@ else
 $as_echo "no" >&6; }
 fi
 
-  if test "x$valgrind_path" = "x" \
-|| (test $have_valgrind_h = no \
-   && test $gcc_cv_header_memcheck_h = no \
-   && test $gcc_cv_header_valgrind_memcheck_h = no); then
-   as_fn_error $? "*** Can't find both valgrind and valgrind/memcheck.h, memcheck.h or valgrind.h" "$LINENO" 5
+  if test "x$valgrind_path" = "x"; then
+as_fn_error $? "*** Cannot find valgrind" "$LINENO" 5
   fi
   valgrind_path_defines=-DVALGRIND_PATH='\"'$valgrind_path'\"'
   valgrind_command="$valgrind_path -q"
@@ -7864,12 +7804,16 @@ else
   enable_valgrind_annotations=no
 fi
 
+ac_fn_cxx_check_header_mongrel "$LINENO" "valgrind/memcheck.h" "ac_cv_header_valgrind_memcheck_h" "$ac_includes_default"
+if test "x$ac_cv_header_valgrind_memcheck_h" = xyes; then :
+
+fi
+
+
 if test x$enable_valgrind_annotations != xno \
 || test x$ac_valgrind_checking != x; then
-  if (test $have_valgrind_h = no \
-  && test $gcc_cv_header_memcheck_h = no \
-  && test $gcc_cv_header_valgrind_memcheck_h = no); then
-as_fn_error $? "*** Can't find valgrind/memcheck.h, memcheck.h or valgrind.h" "$LINENO" 5
+  if test $ac_cv_header_valgrind_memcheck_h = no; then
+as_fn_error $? "*** Cannot find valgrind/memcheck.h" "$LINENO" 5
   fi
 
 $as_echo "#define ENABLE_VALGRIND_ANNOTATIONS 1" >>confdefs.h
@@ 

Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Maxim Kuvyrkov wrote:

>  This patch avoids sched-deps.cc:find_inc() creating exponential number
>  of dependencies, which become memory and compilation time hogs.
>  Consider example (simplified from PR96388) ...
>  ===
>  sp=sp-4 // sp_insnA
>  mem_insnA1[sp+A1]
>  ...
>  mem_insnAN[sp+AN]
>  sp=sp-4 // sp_insnB
>  mem_insnB1[sp+B1]
>  ...
>  mem_insnBM[sp+BM]
>  ===
>  ... in this example find_modifiable_mems() will arrange for mem_insnA*
>  to be able to pass sp_insnA, and, while doing this, will create
>  dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>  is a consumer of sp_insnA.  After this sp_insnB will have N new
>  backward dependencies.
>  Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>  N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>  dependencies.
> >> 
> >> [For avoidance of doubt, below discussion is about the general 
> >> implementation
> >> of find_modifiable_mems() and not about the patch.]
> > 
> > I was saying the commit message is hard to read (unless it's just me).
> > 
> >>> It's a bit hard to read this without knowing which value of 'backwards'
> >>> is assumed.
> 
> Oh, sorry, I misunderstood your comment.
> 
> In the above example I want to describe outcome that current code generates,
> without going into details about exactly how it does it.  I'm not sure how to
> make it more readable, and would appreciate suggestions.

I think it would be easier to follow if you could fix a specific value of
'backwards' up front, and then ensure all following statements are consistent
with that, like I did in my explanation. Please feel free to pick up my text
into the commit message, if it helps.

> >>> Say 'backwards' is true and we are inspecting producer sp_insnB of 
> >>> mem_insnB1.
> >>> This is a true dependency. We know we can break it by adjusting B1 by -4, 
> >>> but
> >>> we need to be careful not to move such modified mem_insnB1 above 
> >>> sp_insnA, so
> >>> we need to iterate over *incoming true dependencies* of sp_insnB and add 
> >>> them.
> >>> 
> >>> But the code seems to be iterating over *all incoming dependencies*, so it
> >>> will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> >>> dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if 
> >>> my
> >>> understanding is correct.
> >> 
> >> Yeap, your understanding is correct.  However, this is what
> >> find_modifiable_mems() has to do to avoid complicated analysis of 
> >> second-level
> >> dependencies.
> > 
> > What is the reason it cannot simply skip anti-dependencies in the
> > 'if (backwards)' loop, and true dependencies in the 'else' loop?
> 
> I /think/, this should be possible.  However, rather than improving current
> implementation my preference is to rework it by integrating with the main
> dependency analysis.  This should provide both faster and more precise
> dependency analysis, which would generate breakable addr/mem dependencies.

I see, thank you.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Richard Biener wrote:

> and this, too, btw. - the DSE actually happens, the latter transform not.
> We specifically "opt out" of doing that for QOI to not make undefined
> behavior worse.  The more correct transform would be to replace the
> load with a __builtin_trap () during path isolation (or wire in path isolation
> to value-numbering where we actually figure out there's no valid definition
> to reach for the load).
> 
> So yes, if you want to avoid these kind of transforms earlier instrumentation
> is better.

And then attempting to schedule it immediately after pass_ccp in the early-opts
pipeline is already too late, right?

Thanks!
Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Alexander Monakov wrote:

> I am concerned that if GCC ever learns to leave out the following access
> to 'this->foo', leaving tmp uninitialized, we will end up with:
> 
>   this->foo = 42;

Sorry, this store will be DSE'd out, of course, but my question stands.

Alexander

>   *this = { CLOBBER };
>   __valgrind_make_mem_undefined(this, sizeof *this);
>   int tmp(D);
>   return tmp(D); // uninitialized
> 
> and Valgrind will not report anything since the invalid load is optimized out.
> 
> With early instrumentation such optimization is not going to happen, since the
> builtin may modify *this.
> 
> Is my concern reasonable?
> 
> Thanks.
> Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-20 Thread Alexander Monakov

On Mon, 13 Nov 2023, Richard Biener wrote:

> > Ideally we'd position it such that more locals are put in SSA form,
> > but not too late to miss some UB, right? Perhaps after first pass_ccp?
> 
> I guess it’s worth experimenting.  Even doing it right before RTL expansion
> might work.  Note if you pick ccp you have to use a separate place for -O0

While Daniil is experimenting with this, I want to raise my concern about
attempting this instrumentation too late. Consider the main thing we are
trying to catch:

// inlined operator new:
this->foo = 42;
// inlined constructor:
*this = { CLOBBER };
// caller:
int tmp = this->foo;
return tmp;

Our instrumentation adds

__valgrind_make_mem_undefined(this, sizeof *this);

immediately after the clobber.

I am concerned that if GCC ever learns to leave out the following access
to 'this->foo', leaving tmp uninitialized, we will end up with:

this->foo = 42;
*this = { CLOBBER };
__valgrind_make_mem_undefined(this, sizeof *this);
int tmp(D);
return tmp(D); // uninitialized

and Valgrind will not report anything since the invalid load is optimized out.

With early instrumentation such optimization is not going to happen, since the
builtin may modify *this.

Is my concern reasonable?

Thanks.
Alexander

Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-20 Thread Alexander Monakov


On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> > On Nov 20, 2023, at 17:52, Alexander Monakov  wrote:
> > 
> > 
> > On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
> > 
> >> This patch avoids sched-deps.cc:find_inc() creating exponential number
> >> of dependencies, which become memory and compilation time hogs.
> >> Consider example (simplified from PR96388) ...
> >> ===
> >> sp=sp-4 // sp_insnA
> >> mem_insnA1[sp+A1]
> >> ...
> >> mem_insnAN[sp+AN]
> >> sp=sp-4 // sp_insnB
> >> mem_insnB1[sp+B1]
> >> ...
> >> mem_insnBM[sp+BM]
> >> ===
> >> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> >> to be able to pass sp_insnA, and, while doing this, will create
> >> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> >> is a consumer of sp_insnA.  After this sp_insnB will have N new
> >> backward dependencies.
> >> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> >> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> >> dependencies.
> 
> [For avoidance of doubt, below discussion is about the general implementation
> of find_modifiable_mems() and not about the patch.]

I was saying the commit message is hard to read (unless it's just me).

> > It's a bit hard to read this without knowing which value of 'backwards'
> > is assumed.
> > 
> > Say 'backwards' is true and we are inspecting producer sp_insnB of 
> > mem_insnB1.
> > This is a true dependency. We know we can break it by adjusting B1 by -4, 
> > but
> > we need to be careful not to move such modified mem_insnB1 above sp_insnA, 
> > so
> > we need to iterate over *incoming true dependencies* of sp_insnB and add 
> > them.
> > 
> > But the code seems to be iterating over *all incoming dependencies*, so it
> > will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> > dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
> > understanding is correct.
> 
> Yeap, your understanding is correct.  However, this is what
> find_modifiable_mems() has to do to avoid complicated analysis of second-level
> dependencies.

What is the reason it cannot simply skip anti-dependencies in the
'if (backwards)' loop, and true dependencies in the 'else' loop?

Alexander


Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-20 Thread Alexander Monakov


On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> This patch avoids sched-deps.cc:find_inc() creating exponential number
> of dependencies, which become memory and compilation time hogs.
> Consider example (simplified from PR96388) ...
> ===
> sp=sp-4 // sp_insnA
> mem_insnA1[sp+A1]
> ...
> mem_insnAN[sp+AN]
> sp=sp-4 // sp_insnB
> mem_insnB1[sp+B1]
> ...
> mem_insnBM[sp+BM]
> ===
> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> to be able to pass sp_insnA, and, while doing this, will create
> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> is a consumer of sp_insnA.  After this sp_insnB will have N new
> backward dependencies.
> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> dependencies.

It's a bit hard to read this without knowing which value of 'backwards'
is assumed.

Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
This is a true dependency. We know we can break it by adjusting B1 by -4, but
we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
we need to iterate over *incoming true dependencies* of sp_insnB and add them.

But the code seems to be iterating over *all incoming dependencies*, so it
will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
understanding is correct.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-17 Thread Alexander Monakov


On Fri, 17 Nov 2023, Kewen.Lin wrote:
> > I don't think you can run cleanup_cfg after sched_init. I would suggest
> > to put it early in schedule_insns.
> 
> Thanks for the suggestion, I placed it at the beginning of haifa_sched_init
> instead, since schedule_insns invokes haifa_sched_init, although the
> calls rgn_setup_common_sched_info and rgn_setup_sched_infos are executed
> ahead but they are all "setup" functions, shouldn't affect or be affected
> by this placement.

I was worried because sched_init invokes df_analyze, and I'm not sure if
cfg_cleanup can invalidate it.

> > I suspect this may be caused by invoking cleanup_cfg too late.
> 
> By looking into some failures, I found that although cleanup_cfg is executed
> there would be still some empty blocks left, by analyzing a few failures there
> are at least such cases:
>   1. empty function body
>   2. block holding a label for return.
>   3. block without any successor.
>   4. block which becomes empty after scheduling some other block.
>   5. block which looks mergeable with its always successor but left.
>   ...
> 
> For 1,2, there is one single successor EXIT block, I think they don't affect
> state transition, for 3, it's the same.  For 4, it depends on if we can have
> the assumption this kind of empty block doesn't have the chance to have debug
> insn (like associated debug insn should be moved along), I'm not sure.  For 5,
> a reduced test case is:

Oh, I should have thought of cases like these, really sorry about the slip
of attention, and thanks for showing a testcase for item 5. As Richard was
saying in his response, cfg_cleanup cannot be a fix here. The thing to check
would be changing no_real_insns_p to always return false, and see if the
situation looks recoverable (if it breaks bootstrap, regtest statistics of
a non-bootstrapped compiler are still informative).

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-15 Thread Alexander Monakov


On Wed, 15 Nov 2023, Kewen.Lin wrote:

> >> And I suppose it would be OK to do that.  Empty BBs are usually removed by
> >> CFG cleanup so the situation should only happen in rare corner cases where
> >> the fix would be to actually run CFG cleanup ...
> > 
> > Yeah, sel-sched invokes 'cfg_cleanup (0)' up front, and I suppose that
> > may be a preferable compromise for sched-rgn as well.
> 
> Inspired by this discussion, I tested the attached patch 1 which is to run
> cleanup_cfg (0) first in haifa_sched_init, it's bootstrapped and
> regress-tested on x86_64-redhat-linux and powerpc64{,le}-linux-gnu.

I don't think you can run cleanup_cfg after sched_init. I would suggest
to put it early in schedule_insns.

> Then I assumed some of the current uses of no_real_insns_p won't encounter
> empty blocks any more, so made a patch 2 with some explicit assertions, but
> unfortunately I got ICEs during bootstrapping happens in function
> compute_priorities.  I'm going to investigate it further and post more
> findings, but just heads-up to ensure if this is on the right track.

I suspect this may be caused by invoking cleanup_cfg too late.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-13 Thread Alexander Monakov


On Mon, 13 Nov 2023, Richard Biener wrote:

> Another generic comment - placing a built-in call probably pessimizes code
> generation unless we handle it specially during alias analysis (or in
> builtin_fnspec).

But considering the resulting code is intended to be run under Valgrind,
isn't a bit worse quality acceptable? Note that we don't want loads
following the built-in to be optimized out, they are necessary as they
will be flagged by Valgrind as attempts to read uninitialized memory.

I suspect positioning the pass immediately after build_ssa as we do now
is quite imperfect because we will then instrument 'x' in 

  void f()
  {
    int x, *p;
    p = &x;
  }

Ideally we'd position it such that more locals are put in SSA form,
but not too late to miss some UB, right? Perhaps after first pass_ccp?

> I also don't like having another pass for this - did you
> investigate to do the instrumentation at the point the CLOBBERs are
> introduced?

I don't see a better approach, some CLOBBERs are emitted by the C++
front-end via build_clobber_this, some by omp-expand, some during
gimplification. I'm not a fan of useless IR rescans either, but
this pass is supposed to run very rarely, not by default.

> Another possibility would be to make this more generic
> and emit the instrumentation when we lower GIMPLE_BIND during
> the GIMPLE lowering pass, you wouldn't then rely on the CLOBBERs
> some of which only appear when -fstack-reuse=none is not used.

The CLOBBERs that trigger on Firefox and LLVM are emitted not during
gimplification, but via build_clobber_this in the front-end.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-11-12 Thread Alexander Monakov


On Sat, 11 Nov 2023, Sam James wrote:

> > Valgrind client requests are offered as macros that emit inline asm.  For 
> > use
> > in code generation, we need to wrap it in a built-in.  We know that 
> > implementing
> > such a built-in in libgcc is undesirable, [...].
> 
> Perhaps less objectionable than you think, at least given the new CFR
> stuff from oliva from the other week that landed.

Yeah; we haven't found any better solution anyway.

> This is a really neat idea (it also makes me wonder if there's any other
> opportunities for Valgrind integration like this?).

To (attempt to) answer the parenthetical question, note that the patch is not
limited to instrumenting C++ cdtors, it annotates all lifetime CLOBBER marks,
so Valgrind should see lifetime boundaries of various on-stack arrays too.

(I hope positioning the new pass after build_ssa is sufficient to avoid
annotating too much, like non-address-taken local scalars)

> LLVM was the most recent example but it wasn't the first, and this has
> come up with LLVM in a few other places too (same root cause, wasn't
> obvious at all).

I'm very curious what you mean by "this has come up with LLVM [] too": ttbomk,
LLVM doesn't do such lifetime-based optimization yet, which is why compiling
LLVM with LLVM doesn't break it. Can you share some examples? Or do you mean
instances when libLLVM-miscompiled-with-GCC was linked elsewhere, and that
program crashed mysteriously as a result?

Indeed this work is inspired by the LLVM incident in PR 106943. Unfortunately
we don't see many other instances with -flifetime-dse workarounds in public.
Grepping Gentoo Portage reveals only openjade. Arch applies the workaround to
a jvm package too, and we know that Firefox and LLVM apply it on their own.

This patch finds the issue in LLVM and openjade; testing it on Spidermonkey
is TODO. Suggestions for other interesting tests would be welcome.

> > --- a/libgcc/configure.ac
> > +++ b/libgcc/configure.ac
> > @@ -269,6 +269,54 @@ GCC_CHECK_SJLJ_EXCEPTIONS
> >  GCC_CET_FLAGS(CET_FLAGS)
> >  AC_SUBST(CET_FLAGS)
> >  
> > +AC_CHECK_HEADER(valgrind.h, have_valgrind_h=yes, have_valgrind_h=no)
> 
> Consider using PKG_CHECK_MODULES and falling back to a manual search.

Thanks. autotools bits in this patch are one-to-one copy of the pre-existing
Valgrind detection in the 'gcc' subdirectory where it's necessary for
running the compiler under Valgrind without false positives.

I guess the right solution is to move Valgrind detection into the top-level
'config' directory (and apply the cleanups you mention), but as we are not
familiar with autotools we just made the copy-paste for this RFC.

With the patch, --enable-valgrind-annotations becomes "overloaded" to
simultaneously instrument the compiler and enhance libgcc to support
-fvalgrind-emit-annotations, but those are independent and in practice
people may need the latter without the former.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-12 Thread Alexander Monakov

On Sat, 11 Nov 2023, Arsen Arsenović wrote:

> > +#else
> > +# define VALGRIND_MAKE_MEM_UNDEFINED(ptr, sz) __builtin_trap ()
> > +#endif
> > +
> > +void __valgrind_make_mem_undefined (void *ptr, unsigned long sz)
> > +{
> > +  VALGRIND_MAKE_MEM_UNDEFINED (ptr, sz);
> > +}
> 
> Would it be preferable to have a link-time error here if missing?

Indeed, thank you for the suggestion, will keep that in mind for resending.
That will allow to notice the problem earlier, and the user will be able
to drop in this snippet in their project to resolve the issue.

Alexander

Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov

On Fri, 10 Nov 2023, Richard Biener wrote:

> On Fri, Nov 10, 2023 at 3:18 PM Alexander Monakov  wrote:
> >
> >
> > On Fri, 10 Nov 2023, Richard Biener wrote:
> >
> > > > I'm afraid ignoring debug-only BBs goes contrary to overall 
> > > > var-tracking design:
> > > > DEBUG_INSNs participate in dependency graph so that schedulers can 
> > > > remove or
> > > > mutate them as needed when moving real insns across them.
> > >
> > > Note that debug-only BBs do not exist - the BB would be there even 
> > > without debug
> > > insns!
> >
> > Yep, sorry, I misspoke when I earlier said
> >
> > >> and cause divergence when passing through a debug-only BB which would 
> > >> not be
> > >> present at all without -g.
> >
> > They are present in the region, but skipped via no_real_insns_p.
> >
> > > So instead you have to handle BBs with just debug insns the same you
> > > handle a completely empty BB.
> >
> > Yeah. There would be no problem if the scheduler never used no_real_insns_p
> > and handled empty and non-empty BBs the same way.
> 
> And I suppose it would be OK to do that.  Empty BBs are usually removed by
> CFG cleanup so the situation should only happen in rare corner cases where
> the fix would be to actually run CFG cleanup ...

Yeah, sel-sched invokes 'cfg_cleanup (0)' up front, and I suppose that
may be a preferable compromise for sched-rgn as well.

I'm afraid one does not simply remove all uses of no_real_insns_p from
sched-rgn, but would be happy to be wrong about that.

Alexander

Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov


On Fri, 10 Nov 2023, Richard Biener wrote:

> > I'm afraid ignoring debug-only BBs goes contrary to overall var-tracking 
> > design:
> > DEBUG_INSNs participate in dependency graph so that schedulers can remove or
> > mutate them as needed when moving real insns across them.
> 
> Note that debug-only BBs do not exist - the BB would be there even without 
> debug
> insns!

Yep, sorry, I misspoke when I earlier said

>> and cause divergence when passing through a debug-only BB which would not be
>> present at all without -g.

They are present in the region, but skipped via no_real_insns_p.

> So instead you have to handle BBs with just debug insns the same you
> handle a completely empty BB.

Yeah. There would be no problem if the scheduler never used no_real_insns_p
and handled empty and non-empty BBs the same way.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov


On Thu, 9 Nov 2023, Jeff Law wrote:

> > Yeah, I noticed that the scheduler takes care of DEBUG_INSNs as normal
> > operations.  When I started to work on this issue, initially I wanted to try
> > something similar to your idea #2, but when checking the APIs, I realized
> > why not just skip the basic block with NOTEs and LABELs, DEBUG_INSNs as
> > well.  IMHO there is no value to try to schedule this kind of BB (to be
> > scheduled range), skipping it can save some resource allocation (like block
> > dependencies) and make it more efficient (not enter function schedule_block
> > etc.), from this perspective it seems an enhancement.  Does it sound
> > reasonable to you?
> It sounds reasonable, but only if doing so doesn't add significant
> implementation complexity.  ie, the gains from doing less work here are likely
> to be very marginal, so I'm more interested in clean, easy to maintain code.

I'm afraid ignoring debug-only BBs goes contrary to overall var-tracking design:
DEBUG_INSNs participate in dependency graph so that schedulers can remove or
mutate them as needed when moving real insns across them.

Cc'ing Alexandre Oliva who can correct me on that if necessary.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-09 Thread Alexander Monakov


On Thu, 9 Nov 2023, Maxim Kuvyrkov wrote:

> Hi Kewen,
> 
> Below are my comments.  I don't want to override Alexander's review, and if
> the patch looks good to him, it's fine to ignore my concerns.
> 
> My main concern is that this adds a new entity -- forceful skipping of
> DEBUG_INSN-only basic blocks -- to the scheduler for a somewhat minor change
> in behavior.  Unlike NOTEs and LABELs, DEBUG_INSNs are INSNS, and there is
> already quite a bit of logic in the scheduler to skip them _as part of normal
> operation_.

I agree with the concern. I hoped that solving the problem by skipping the BB
(as the bit-rotted sched_block debug counter needs to) would be minor surgery.
As things look now, it may be better to remove the non-working sched_block
debug counter entirely and implement a good solution for the problem at hand.

> 
> Would you please consider 2 ideas below.
> 
> #1:
> After a brief look, I'm guessing this part is causing the problem:
> haifa-sched.cc :schedule_block():
> === [1]
>   /* Loop until all the insns in BB are scheduled.  */
>   while ((*current_sched_info->schedule_more_p) ())
> {
>   perform_replacements_new_cycle ();
>   do
>   {
> start_clock_var = clock_var;
> 
> clock_var++;
> 
> advance_one_cycle ();

As I understand, we have spurious calls to advance_one_cycle on basic block
boundaries, which don't model the hardware (the CPU doesn't see BB boundaries)
and cause divergence when passing through a debug-only BB which would not be
present at all without -g.

Since EBBs and regions may not have jump targets in the middle, advancing
a cycle on BB boundaries does not seem well motivated. Can we remove it?

Can we teach haifa-sched to emit RTX NOTEs with hashes of DFA states on BB
boundaries when -fcompare-debug is enabled? It should make the problem
readily detectable by -fcompare-debug even when scheduling did not diverge.

Alexander


Re: [PATCH 1/3] tree-ssa-sink: do not sink to in front of setjmp

2023-11-08 Thread Alexander Monakov

On Wed, 8 Nov 2023, Richard Biener wrote:

> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/setjmp-7.c
> >> @@ -0,0 +1,13 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2 -fno-guess-branch-probability -w" } */
> >> +/* { dg-require-effective-target indirect_jumps } */
> >> +
> >> +struct __jmp_buf_tag { };
> >> +typedef struct __jmp_buf_tag jmp_buf[1];
> >> +struct globals { jmp_buf listingbuf; };
> >> +extern struct globals *const ptr_to_globals;
> >> +void foo()
> >> +{
> >> +if ( _setjmp ( ((*ptr_to_globals).listingbuf )))
> >> +;
> >> +}
> > 
> > Is the implicit declaration of _setjmp important to this test?
> > Could we declare it explicitly instead?
> 
> It shouldn’t be important.

Yes, it's an artifact from testcase minimization, sorry about that.

Florian, I see you've sent a patch to fix this up — thank you!

Alexander

Re: [PATCH v2] Add a GCC Security policy

2023-10-04 Thread Alexander Monakov


On Thu, 28 Sep 2023, Siddhesh Poyarekar wrote:

> Define a security process and exclusions to security issues for GCC and
> all components it ships.

Some typos and wording suggestions below.

> --- /dev/null
> +++ b/SECURITY.txt
> @@ -0,0 +1,205 @@
> +What is a GCC security bug?
> +===
> +
> +A security bug is one that threatens the security of a system or
> +network, or might compromise the security of data stored on it.
> +In the context of GCC there are multiple ways in which this might
> +happen and some common scenarios are detailed below.
> +
> +If you're reporting a security issue and feel like it does not fit
> +into any of the descriptions below, you're encouraged to reach out
> +through the GCC bugzilla or if needed, privately by following the
> +instructions in the last two sections of this document.
> +
> +Compiler drivers, programs, libgccjit and support libraries
> +---
> +
> +The compiler driver processes source code, invokes other programs
> +such as the assembler and linker and generates the output result,
> +which may be assembly code or machine code.  Compiling untrusted
> +sources can result in arbitrary code execution and unconstrained
> +resource consumption in the compiler. As a result, compilation of
> +such code should be done inside a sandboxed environment to ensure
> +that it does not compromise the development environment.

"... the host environment" seems more appropriate.

> +
> +The libgccjit library can, despite the name, be used both for
> +ahead-of-time compilation and for just-in-compilation.  In both
> +cases it can be used to translate input representations (such as
> +source code) in the application context; in the latter case the
> +generated code is also run in the application context.
> +
> +Limitations that apply to the compiler driver, apply here too in
> +terms of sanitizing inputs and it is recommended that both the

s/sanitizing inputs/trusting inputs/ (I suggested it earlier, just unsure
if you don't agree or it simply fell through the cracks)

> +compilation *and* execution context of the code are appropriately
> +sandboxed to contain the effects of any bugs in libgccjit, the
> +application code using it, or its generated code to the sandboxed
> +environment.
> +
> +Libraries such as libiberty, libcc1 and libcpp are not distributed
> +for runtime support and have similar challenges to compiler drivers.
> +While they are expected to be robust against arbitrary input, they
> +should only be used with trusted inputs when linked into the
> +compiler.
> +
> +Libraries such as zlib that bundled into GCC to build it will be

'are bundled with' (missing 'are', s/into/with/)

> +treated the same as the compiler drivers and programs as far as
> +security coverage is concerned.  However if you find an issue in
> +these libraries independent of their use in GCC, you should reach
> +out to their upstream projects to report them.
> +
> +As a result, the only case for a potential security issue in the
> +compiler is when it generates vulnerable application code for
> +trusted input source code that is conforming to the relevant
> +programming standard or extensions documented as supported by GCC
> +and the algorithm expressed in the source code does not have the
> +vulnerability.  The output application code could be considered
> +vulnerable if it produces an actual vulnerability in the target
> +application, specifically in the following cases:

It seems ambiguous if the list that follows is meant to be an exhaustive
enumeration. I think it is meant to give examples without covering all
possibilities; if that's the case, I would suggest

s/specifically in the following cases/for example/

If I misunderstood and the list is really meant to be exhaustive,
it would be nice to make that clear and perhaps refer the reader
to the second paragraph when their scenario does not fit.

> +
> +- The application dereferences an invalid memory location despite
> +  the application sources being valid.
> +- The application reads from or writes to a valid but incorrect
> +  memory location, resulting in an information integrity issue or an
> +  information leak.
> +- The application ends up running in an infinite loop or with
> +  severe degradation in performance despite the input sources having
> +  no such issue, resulting in a Denial of Service.  Note that
> +  correct but non-performant code is not a security issue candidate,
> +  this only applies to incorrect code that may result in performance
> +  degradation severe enough to amount to a denial of service.
> +- The application crashes due to the generated incorrect code,
> +  resulting in a Denial of Service.
> +

Re: RISC-V: Added support for CRC.

2023-09-26 Thread Alexander Monakov


On Tue, 26 Sep 2023, Jeff Law wrote:

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including 4
> implementations (IIRC) in the kernel itself.

The kernel employs bitwise CRC only in look-up table generators, which run
at build time. You are quite literally slowing down the compiler in order
to speed up generators that don't account for even one millisecond of
kernel build time, and have no relation to its run-time performance.

(incidentally you can't detect the actual CRC impls using those tables)

> And as I've stated before, the latency of clmuls is dropping.   I wouldn't be
> terribly surprised to see single cycle clmul implmementations showing up
> within the next 18-24 months.  It's really just a matter of gate budget vs
> expected value.

In a commercial implementation? I'll take that bet. You spend gates budget
like that after better avenues for raising ILP are exhausted (like adding
more ALUs that can do clmul at a reasonable 3c-4c latency).

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.

Which code? Table generators in the kernel and xz-utils? 

> While the original target was Coremark, we've found similar bitwise
> implementations of CRCs all over the place. There's no good reason that code
> should have to change.

But did you look at them? There's no point to optimize table generators either.

Alexander


Re: RISC-V: Added support for CRC.

2023-09-24 Thread Alexander Monakov


On Sun, 24 Sep 2023, Joern Rennecke wrote:

> It is a stated goal of coremark to test performance for CRC.

I would expect a good CRC benchmark to print CRC throughput in
bytes per cycle or megabytes per second.

I don't see where Coremark states that goal. In the readme at
https://github.com/eembc/coremark/blob/main/README.md
it enumerates the three subcategories (linked list, matrix ops,
state machine) and indicates that CRC is used for validation.

Even if it claims that elsewhere, the way its code employs CRC does not
correspond to real-world use patterns, such as protocol and filesystem
checksumming in the Linux kernel, or decompression libraries.

> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen.  That algorithm is, for the
> vast majority of processors, not representative of the targets
> performance potential in calculating CRCs,

It is, however, representative of the target CPU's ability to run
those basic bitwise ops with good overlap with the rest of computation,
which is far more relevant for the real-world performance of the CPU.

> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.

Are you seriously saying that if a customer chooses CPU A over CPU B
based on Coremark scores, and then discovers that actual performance
in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
entirely fair and the benchmark scores they saw were not misleading?
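For reference, slice-by-N restructures the computation so that each
iteration consumes several bytes with independent table lookups instead
of one bit at a time. A minimal slice-by-4 sketch for the reflected
CRC-32 polynomial (not zlib's actual code, which differs in details):

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t tab[4][256];

/* Derive the slicing tables from the bitwise kernel (reflected
   CRC-32 polynomial 0xEDB88320).  */
static void crc32_init(void)
{
    for (int i = 0; i < 256; i++) {
        uint32_t c = (uint32_t) i;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
        tab[0][i] = c;
    }
    for (int i = 0; i < 256; i++)
        for (int j = 1; j < 4; j++)
            tab[j][i] = (tab[j-1][i] >> 8) ^ tab[0][tab[j-1][i] & 0xff];
}

/* Four bytes per iteration; the four lookups are independent of each
   other, unlike the steps of the bit-serial loop.  */
static uint32_t crc32_slice4(const unsigned char *p, size_t n)
{
    uint32_t c = 0xFFFFFFFFu;
    for (; n >= 4; p += 4, n -= 4) {
        c ^= (uint32_t) p[0] | (uint32_t) p[1] << 8
           | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
        c = tab[3][c & 0xff] ^ tab[2][(c >> 8) & 0xff]
          ^ tab[1][(c >> 16) & 0xff] ^ tab[0][c >> 24];
    }
    while (n--)                          /* trailing bytes, one at a time */
        c = (c >> 8) ^ tab[0][(c ^ *p++) & 0xff];
    return c ^ 0xFFFFFFFFu;
}
```

On the standard test vector "123456789" this yields the well-known
check value 0xCBF43926.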

> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> 
> Some processors have specialized instructions for CRC computations.

Only for one or two fixed polynomials. For that matter, some processors
have instructions for AES and SHA, but that doesn't change that clmul is
a more fundamental and flexible primitive than "CRC".

> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first.  We have to learn to walk before we can run.

If only the "walk before you run" logic was applied in favor of
implementing a portable clmul builtin prior to all this.
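For reference, a portable fallback for such a builtin is only a few
lines of C (the name clmul32 is hypothetical, not an existing builtin):

```c
#include <stdint.h>

/* Carry-less (GF(2)[x]) multiplication of two 32-bit polynomials,
   returning the full 64-bit product: each set bit of b contributes a
   shifted copy of a, combined with XOR instead of addition.  A
   hardware CLMUL instruction computes the same thing in a few
   cycles.  */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t) a << i;
    return r;
}
```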

> A library can be used to implement built-ins in gcc (we still need to
> define one for block operations, one step at a time...).  However,
> someone or something needs to rewrite the existing code to use the
> library.  It is commonly accepted that an efficient way to do this is
> to make a compiler do the necessary transformations, as long as it can
> be made to churn out good enough code.

How does this apply to the real world? Among CRC implementations in the
Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
a single instance where bitwise CRC is used. It's in the table
initialization function in xz-utils. The compiler would transform that
to copying one table into another. Is that a valuable transform?

> Alexander Monakov:
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> 
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts .

How would it help when they are compiled with LLVM, or GCC version
earlier than 14?

Alexander


Re: [PATCH][RFC] middle-end/106811 - document GENERIC/GIMPLE undefined behavior

2023-09-20 Thread Alexander Monakov


On Fri, 15 Sep 2023, Richard Biener via Gcc-patches wrote:

> +@itemize @bullet
> +@item
> +When the result of negation, addition, subtraction or division of two signed
> +integers or signed integer vectors not subject to @option{-fwrapv} cannot be
> +represented in the type.

It would be a bit awkward to add 'or vectors' everywhere it applies, perhaps
say something general about elementwise vector operations up front?

> +
> +@item
> +The value of the second operand of any of the division or modulo operators
> +is zero.
> +
> +@item
> +When incrementing or decrementing a pointer not subject to
> +@option{-fwrapv-pointer} wraps around zero.
> +
> +@item
> +An expression is shifted by a negative number or by an amount greater
> +than or equal to the width of the shifted operand.
> +
> +@item
> +Pointers that do not point to the same object are compared using
> +relational operators.

This does not apply to '==' and '!='. Maybe say

  Ordered comparison operators are applied to pointers
  that do not point to the same object.

> +
> +@item
> +An object which has been modified is accessed through a restrict-qualified
> +pointer and another pointer that are not both based on the same object.
> +
> +@item
> +The @} that terminates a function is reached, and the value of the function
> +call is used by the caller.
> +
> +@item
> +When program execution reaches __builtin_unreachable.
> +
> +@item
> +When an object has its stored value accessed by an lvalue that
> +does not have one of the following types:
> +@itemize @minus
> +@item
> +a (qualified) type compatible with the effective type of the object
> +@item
> +a type that is the (qualified) signed or unsigned type corresponding to
> +the effective type of the object
> +@item
> +a character type, a ref-all qualified type or a type subject to
> +@option{-fno-strict-aliasing}
> +@item
> +a pointer to void with the same level of indirection as the accessed
> +pointer object
> +@end itemize

This list seems to miss a clause that allows aliasing between
scalar types and their vector counterparts?
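To illustrate the access rules at the C level: reinterpreting an
object's bytes via memcpy stays within the list above, whereas
dereferencing a cast pointer would not (a sketch for illustration only,
not proposed documentation text):

```c
#include <stdint.h>
#include <string.h>

/* Extract a float's bit pattern.  memcpy accesses both objects only
   through character-typed lvalues, so no access rule is violated;
   dereferencing (uint32_t *)&f instead would access a float object
   through an incompatible lvalue type.  */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```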

Thanks.
Alexander


Re: RISC-V: Added support for CRC.

2023-08-17 Thread Alexander Monakov


On Wed, 16 Aug 2023, Philipp Tomsich wrote:

> > > I fully expect that latency to drop within the next 12-18 months.  In that
> > > world, there's not going to be much benefit to using hand-coded libraries 
> > > vs
> > > just letting the compiler do it.
> 
> I would also hope that the hand-coded libraries would eventually have
> a code path for compilers that support the built-in.

You seem to be working with the false assumption that the interface of the
proposed builtin matches how high-performance CRC computation is structured.
It is not. State-of-the-art CRC keeps unreduced intermediate residual, split
over multiple temporaries to allow overlapping CLMULs in the CPU. The
intermediate residuals are reduced only once, when the final CRC value is
needed. In contrast, the proposed builtin has data dependencies between
all adjacent instructions, and cannot allow the CPU to work at IPC > 1.0.

Shame how little you apparently understand of the "mindbending math".

Alexander


Re: RISC-V: Added support for CRC.

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, Jeff Law wrote:

> Because if the compiler can optimize it automatically, then the projects have
> to do literally nothing to take advantage of it.  They just compile normally
> and their bitwise CRC gets optimized down to either a table lookup or a clmul
> variant.  That's the real goal here.

The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).

> One final note.  Elsewhere in this thread you described performance concerns.
> Right now clmuls can be implemented in 4c, fully piped.

Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

    li    a4,quotient
    li    a5,polynomial
    xor   a0,a1,a0
    clmul a0,a0,a4
    srli  a0,a0,crc_size
    clmul a0,a0,a5
    slli  a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
    srli  a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size

making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.

> I fully expect that latency to drop within the next 12-18 months.  In that
> world, there's not going to be much benefit to using hand-coded libraries vs
> just letting the compiler do it.

...

Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Wed, 16 Aug 2023, Siddhesh Poyarekar wrote:

> > Yeah, indicating scenarios that fall outside of intended guarantees should
> > be helpful. I feel the exact text quoted above will be hard to decipher
> > without knowing the discussion that led to it. Some sort of supplementary
> > section with examples might help there.
> 
> Ah, so I had started out by listing examples but dropped them before emailing.
> How about:
> 
> Similarly, GCC may transform code in a way that the correctness of
> the expressed algorithm is preserved but supplementary properties
> that are observable only outside the program or through a
> vulnerability in the program, may not be preserved.  Examples
> of such supplementary properties could be the state of memory after
> it is no longer in use, performance and timing characteristics of a
> program, state of the CPU cache, etc. Such issues are not security
> vulnerabilities in GCC and in such cases, the vulnerability that
> caused exposure of the supplementary properties must be fixed.

I would say that as follows:

Similarly, GCC may transform code in a way that the correctness of
the expressed algorithm is preserved, but supplementary properties
that are not specifically expressible in a high-level language
are not preserved. Examples of such supplementary properties
include absence of sensitive data in the program's address space
after an attempt to wipe it, or data-independent timing of code.
When the source code attempts to express such properties, failure
to preserve them in resulting machine code is not a security issue
in GCC.
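To make the wiping example concrete (a sketch of the common idioms, not
a statement of what GCC guarantees):

```c
#include <string.h>
#include <stddef.h>

/* A plain memset of a buffer that is dead afterwards may legally be
   removed by dead-store elimination: "the secret is gone from memory"
   is not a property expressible in C.  */
static void wipe_naive(char *buf, size_t len)
{
    memset(buf, 0, len);
}

/* Stores through a volatile-qualified lvalue are part of observable
   behavior and must be emitted; this is the usual workaround (compare
   memset_s and explicit_bzero).  */
static void wipe_volatile(char *buf, size_t len)
{
    volatile char *p = buf;
    while (len--)
        *p++ = 0;
}
```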

Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Wed, 16 Aug 2023, Siddhesh Poyarekar wrote:

> No I understood the distinction you're trying to make, I just wanted to point
> out that the effect isn't all that different.  The intent of the wording is
> not to prescribe a solution, but to describe what the compiler cannot do and
> hence, users must find a way to do this.  I think we have a consensus on this
> part of the wording though because we're not really responsible for the
> prescription here and I'm happy with just asking users to sandbox.

Nice!

> I suppose it's kinda like saying "don't try this at home".  You know many will
> and some will break their leg while others will come out of it feeling
> invincible.  Our job is to let them know that they will likely break their leg
> :)

Continuing this analogy, I was protesting against doing our job by telling
users "when trying this at home, make sure to wear vibranium shielding"
while knowing for sure that nobody can, in fact, obtain said shielding,
making our statement not helpful and rather tautological.

> How about this in the last section titled "Security features implemented in
> GCC", since that's where we also deal with security hardening.
> 
> Similarly, GCC may transform code in a way that the correctness of
> the expressed algorithm is preserved but supplementary properties
> that are observable only outside the program or through a
> vulnerability in the program, may not be preserved.  This is not a
> security issue in GCC and in such cases, the vulnerability that
> caused exposure of the supplementary properties must be fixed.

Yeah, indicating scenarios that fall outside of intended guarantees should
be helpful. I feel the exact text quoted above will be hard to decipher
without knowing the discussion that led to it. Some sort of supplementary
section with examples might help there.

In any case, I hope further discussion, clarification and wordsmithing
goes productively for you both here on the list and during the Cauldron.

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov
> > Unfortunately the lines that follow:
> > 
> >>   either sanitized by an external program to allow only trusted,
> >>   safe compilation and execution in the context of the application,
> > 
> > again make a reference to a purely theoretical "external program" that
> > is not going to exist in reality, and I made a fuss about that in another
> > subthread (sorry Siddhesh). We shouldn't speak as if this solution is
> > actually available to users.
> > 
> > I know this is not the main point of your email, but we came up with
> > a better wording for the compiler driver, and it would be good to align
> > this text with that.
> 
> How about:
> 
> The libgccjit library can, despite the name, be used both for
> ahead-of-time compilation and for just-in-compilation.  In both
> cases it can be used to translate input representations (such as
> source code) in the application context; in the latter case the
> generated code is also run in the application context.
> 
> Limitations that apply to the compiler driver, apply here too in
> terms of sanitizing inputs and it is recommended that both the

I'd prefer 'trusting inputs' instead of 'sanitizing inputs' above.

> compilation *and* execution context of the code are appropriately
> sandboxed to contain the effects of any bugs in libgccjit, the
> application code using it, or its generated code to the sandboxed
> environment.

*thumbs up*

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, David Malcolm via Gcc-patches wrote:

> I'd prefer to reword this, as libgccjit was a poor choice of name for
> the library (sorry!), to make it clearer it can be used for both ahead-
> of-time and just-in-time compilation, and that as used for compilation,
> the host considerations apply, not just those of the generated target
> code.
> 
> How about:
> 
>  The libgccjit library can, despite the name, be used both for
>  ahead-of-time compilation and for just-in-compilation.  In both
>  cases it can be used to translate input representations (such as
>  source code) in the application context; in the latter case the
>  generated code is also run in the application context.
>  Limitations that apply to the compiler driver, apply here too in
>  terms of sanitizing inputs, so it is recommended that inputs are

Unfortunately the lines that follow:

>  either sanitized by an external program to allow only trusted,
>  safe compilation and execution in the context of the application,

again make a reference to a purely theoretical "external program" that
is not going to exist in reality, and I made a fuss about that in another
subthread (sorry Siddhesh). We shouldn't speak as if this solution is
actually available to users.

I know this is not the main point of your email, but we came up with
a better wording for the compiler driver, and it would be good to align
this text with that.

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, Paul Koning wrote:

> Now I'm confused.  I thought the whole point of what GCC is trying to, and
> wants to document, is that it DOES preserve security properties.  If the
> source code is standards-compliant and contains algorithms free of security
> holes, then the compiler is supposed to deliver output code that is likewise
> free of holes -- in other words, the transformation performed by GCC does not
> introduce holes in a hole-free input.

Yes, we seem to broadly agree here. The text given by Siddhesh enumerates
scenarios where an incorrect transform could be considered a security bug.
My examples explore situations outside of those scenarios, picking two
popular security properties that cannot always be attained by writing
C source that vaguely appears to conform and expecting the compiler to
translate it into machine code that actually conforms.

> > Granted, it is a bit of a stretch since the notion of timing-safety is
> > not really well-defined for C source code, but I didn't come up with
> > better examples.
> 
> Is "timing-safety" a security property?  Not the way I understand that
> term.  It sounds like another way to say that the code meets real time
> constraints or requirements.

I meant in the sense of not admitting timing attacks:
https://en.wikipedia.org/wiki/Timing_attack

> No, compilers don't help with that (at least C doesn't -- Ada might be
> better here but I don't know enough).  For sufficiently strict
> requirements you'd have to examine both the generated machine code and
> understand, in gruesome detail, what the timing behaviors of the executing
> hardware are.  Good luck if it's a modern billion-transistor machine.

Yes. On the other hand, the reality in the FOSS ecosystem is that
cryptographic libraries heavily lean on the ability to express
a constant-time algorithm in C and get machine code that is actually
constant-time. There's a bit of a conflict here between what we
can promise and what people might expect of GCC, and it seems
relevant when discussing what goes into the Security Policy.
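For illustration, the usual constant-time comparison idiom such
libraries rely on (a sketch; nothing in the C standard actually
guarantees the timing property):

```c
#include <stddef.h>

/* Branch-free comparison: iteration count and memory access pattern do
   not depend on where, or whether, the buffers differ.  Returns 0 if
   equal, nonzero otherwise.  A compiler is not forbidden from
   rewriting this into an early-exit loop; the idiom merely avoids
   giving it an obvious reason to.  */
static int ct_memcmp(const unsigned char *a, const unsigned char *b,
                     size_t n)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= a[i] ^ b[i];
    return acc;
}
```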

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, David Edelsohn wrote:

> > Making users responsible for verifying that sources are "safe" is not okay
> > (we cannot teach them how to do that since there's no general method).
> > Making users responsible for sandboxing the compiler is fine (there's
> > a range of sandboxing solutions, from which they can choose according
> > to their requirements and threat model). Sorry about the ambiguity.
> >
> 
> Alex.
> 
> The compiler should faithfully implement the algorithms described by the
> programmer.  The compiler is responsible if it generates incorrect code for
> a well-defined, language-conforming program.  The compiler cannot be
> responsible for security issues inherent in the user code, whether that
> causes the compiler to function in a manner that deteriorates adversely
> affects the system or generates code that behaves in a manner that
> adversely affects the system.
> 
> If "safe" is the wrong word. What word would you suggest?

I think "safe" is the right word here. We also used "trusted" in a similar
sense. I believe we were on the same page about that.

> > For both 1) and 2), GCC is not engineered to respect such properties
> > during optimization and code generation, so it's not appropriate for such
> > tasks (a possible solution is to isolate such sensitive functions to
> > separate files, compile to assembly, inspect the assembly to check that it
> > still has the required properties, and use the inspected asm in subsequent
> > builds instead of the original high-level source).
> >
> 
> At some point the system tools need to respect the programmer or operator.
> There is a difference between writing "Hello, World" and writing
> performance critical or safety critical code.  That is the responsibility
> of the programmer and the development team to choose the right software
> engineers and right tools.  And to have the development environment and
> checks in place to ensure that the results are meeting the requirements.
> 
> It is not the role of GCC or its security policy to tell people how to do
> their job or hobby.  This isn't a safety tag required to be attached to a
> new mattress.

Yes (though I'm afraid the analogy with the mattress is a bit lost on me).
Those examples were meant to illustrate the point I tried to make earlier,
not as additions proposed for the Security Policy. Specific examples
where we can tell people in advance that compiler output needs to be
verified, because the compiler is not engineered to preserve those
security-relevant properties from the source code (and we would not
accept such accidents as security bugs).

Granted, it is a bit of a stretch since the notion of timing-safety is
not really well-defined for C source code, but I didn't come up with
better examples.

Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, Siddhesh Poyarekar wrote:

> > Thanks, this is nicer (see notes below). My main concern is that we
> > shouldn't pretend there's some method of verifying that arbitrary source
> > code is "safe" to pass to an unsandboxed compiler, nor should we push
> > the responsibility of doing that on users.
> 
> But responsibility would be pushed to users, wouldn't it?

Making users responsible for verifying that sources are "safe" is not okay
(we cannot teach them how to do that since there's no general method).
Making users responsible for sandboxing the compiler is fine (there's
a range of sandboxing solutions, from which they can choose according
to their requirements and threat model). Sorry about the ambiguity.

> So:
> 
> The compiler driver processes source code, invokes other programs such as the
> assembler and linker and generates the output result, which may be assembly
> code or machine code.  Compiling untrusted sources can result in arbitrary
> code execution and unconstrained resource consumption in the compiler. As a
> result, compilation of such code should be done inside a sandboxed environment
> to ensure that it does not compromise the development environment.

I'm happy with this, thanks for bearing with me.

> >> inside a sandboxed environment to ensure that it does not compromise the
> >> development environment.  Note that this still does not guarantee safety of
> >> the produced output programs and that such programs should still either be
> >> analyzed thoroughly for safety or run only inside a sandbox or an isolated
> >> system to avoid compromising the execution environment.
> > 
> > The last statement seems to be a new addition. It is too broad and again
> > makes a reference to analysis that appears quite theoretical. It might be
> > better to drop this (and instead talk in more specific terms about any
> > guarantees that produced binary code matches security properties intended
> > by the sources; I believe Richard Sandiford raised this previously).
> 
> OK, so I actually cover this at the end of the section; Richard's point AFAICT
> was about hardening, which I added another note for to make it explicit that
> missed hardening does not constitute a CVE-worthy threat:

Thanks for the reminder. To illustrate what I was talking about, let me give
two examples:

1) safety w.r.t timing attacks: even if the source code is written in
a manner that looks timing-safe, it might be transformed in a way that
mounting a timing attack on the resulting machine code is possible;

2) safety w.r.t information leaks: even if the source code attempts
to discard sensitive data (such as passwords and keys) immediately
after use, (partial) copies of that data may be left on stack and
in registers, to be leaked later via a different vulnerability.

For both 1) and 2), GCC is not engineered to respect such properties
during optimization and code generation, so it's not appropriate for such
tasks (a possible solution is to isolate such sensitive functions to
separate files, compile to assembly, inspect the assembly to check that it
still has the required properties, and use the inspected asm in subsequent
builds instead of the original high-level source).

Cheers.
Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, Siddhesh Poyarekar wrote:

> Does this as the first paragraph address your concerns:

Thanks, this is nicer (see notes below). My main concern is that we shouldn't
pretend there's some method of verifying that arbitrary source code is "safe"
to pass to an unsandboxed compiler, nor should we push the responsibility of
doing that on users.

> The compiler driver processes source code, invokes other programs such as the
> assembler and linker and generates the output result, which may be assembly
> code or machine code.  It is necessary that all source code inputs to the
> compiler are trusted, since it is impossible for the driver to validate input
> source code for safety.

The statement begins with "It is necessary", but the next statement offers
an alternative in case the code is untrusted. This is a contradiction.
Is it necessary or not in the end?

I'd suggest to drop this statement and instead make a brief note that
compiling crafted/untrusted sources can result in arbitrary code execution
and unconstrained resource consumption in the compiler.

> For untrusted code should compilation should be done
 ^^
 typo (spurious 'should')
 
> inside a sandboxed environment to ensure that it does not compromise the
> development environment.  Note that this still does not guarantee safety of
> the produced output programs and that such programs should still either be
> analyzed thoroughly for safety or run only inside a sandbox or an isolated
> system to avoid compromising the execution environment.

The last statement seems to be a new addition. It is too broad and again
makes a reference to analysis that appears quite theoretical. It might be
better to drop this (and instead talk in more specific terms about any
guarantees that produced binary code matches security properties intended
by the sources; I believe Richard Sandiford raised this previously).

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-14 Thread Alexander Monakov


On Mon, 14 Aug 2023, Siddhesh Poyarekar wrote:

> There's no practical (programmatic) way to do such validation; it has to be a
> manual audit, which is why source code passed to the compiler has to be
> *trusted*.

No, I do not think that is a logical conclusion. What is the problem with
passing untrusted code to a sandboxed compiler?

> Right, that's what we're essentially trying to convey in the security policy
> text.  It doesn't go into mechanisms for securing execution (because that's
> really beyond the scope of the *project's* policy IMO) but it states
> unambiguously that input to the compiler must be trusted:
> 
> """
>   ... It is necessary that
> all source code inputs to the compiler are trusted, since it is
> impossible for the driver to validate input source code beyond
> conformance to a programming language standard...
> """

I see two issues with this. First, it reads as if people wishing to build
not-entirely-trusted sources need to seek some other compiler, as somehow
we seem to imply that sandboxing GCC is out of the question.

Second, I take issue with the last part of the quoted text (language
conformance): verifying standards conformance is also impossible
(consider UB that manifests only during linking or dynamic loading)
so GCC is only doing that on a best-effort basis with no guarantees.

Alexander


Re: [RFC] GCC Security policy

2023-08-14 Thread Alexander Monakov


On Mon, 14 Aug 2023, Siddhesh Poyarekar wrote:

> 1. It makes it clear to users of the project the scope in which the project
> could be used and what safety it could reasonably expect from the project.  In
> the context of GCC for example, it cannot expect the compiler to do a safety
> check of untrusted sources; the compiler will consider #include "/etc/passwd"
> just as valid code as #include  and as a result, the onus is on the
> user environment to validate the input sources for safety.

Whoa, no. We shouldn't make such statements unless we are prepared to explain
to users how such validation can be practically implemented, which I'm sure
we cannot in this case, due to future extensions such as the #embed directive,
and the ability to obfuscate filenames using the preprocessor.

I think it would be more honest to say that crafted sources can result in
arbitrary code execution with the privileges of the user invoking the compiler,
and hence the operator may want to ensure that no sensitive data is available
to that user (via measures ranging from plain UNIX permissions, to chroots,
to virtual machines, to air-gapped computers, depending on threat model).

Resource consumption is another good reason to sandbox compilers.

Alexander


Re: [PATCH] tree-optimization/110979 - fold-left reduction and partial vectors

2023-08-11 Thread Alexander Monakov


On Fri, 11 Aug 2023, Richard Biener wrote:

> > I think it converts SNan to QNan (when the partial vector has just one
> > element which is SNan), so is a test for -fsignaling-nans missing?
> 
> Hm, I guess that's a corner case that could happen when there's no
> runtime profitability check on more than one element and when the
> element accumulated is directly loaded from memory.  OTOH the
> loop vectorizer always expects an initial value for the reduction
> and thus we perform either no add (when the loop isn't entered)
> or at least a single add (when it is).  So I think this particular
> situation cannot occur?

Yes, that makes sense, thanks for the elaboration.
(it's a bit subtle so maybe worth a comment? not sure)

> > In the default -fno-rounding-math -fno-signaling-nans mode I think we
> > can do the reduction by substituting negative zero for masked-off
> > elements — maybe it's worth diagnosing that case separately (i.e.
> > as "not yet implemented", not an incorrect transform)?
> 
> Ah, that's interesting.  So the only case we can't handle is
> -frounding-math -fsigned-zeros then.  I'll see to adjust the patch
> accordingly, like the following incremental patch:

Yeah, nice!

> > (note that in avx512 it's possible to materialize negative zeroes
> > by mask with a single vpternlog instruction, which is cheap)
> 
> It ends up loading the { -0.0, ... } constant from memory, the
> { 0.0, ... } mask is handled by using a zero-masked load, so
> indeed cheaper.

I was thinking it could be easily done without a memory load,
but got confused, sorry.

Alexander


Re: [PATCH] tree-optimization/110979 - fold-left reduction and partial vectors

2023-08-11 Thread Alexander Monakov


On Fri, 11 Aug 2023, Richard Biener wrote:

> When we vectorize fold-left reductions with partial vectors but
> no target operation available we use a vector conditional to force
> excess elements to zero.  But that doesn't correctly preserve
> the sign of zero.  The following patch disables partial vector
> support in that case.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> Does this look OK?  With -frounding-math -fno-signed-zeros we are
> happily using the masking again, but that's OK, right?  An additional
> + 0.0 shouldn't do anything here.

I think it converts SNan to QNan (when the partial vector has just one
element which is SNan), so is a test for -fsignaling-nans missing?

In the default -fno-rounding-math -fno-signaling-nans mode I think we
can do the reduction by substituting negative zero for masked-off
elements — maybe it's worth diagnosing that case separately (i.e.
as "not yet implemented", not an incorrect transform)?

(note that in avx512 it's possible to materialize negative zeroes
by mask with a single vpternlog instruction, which is cheap)

Alexander


Re: [PATCH] Handle in-order reductions when SLP vectorizing non-loops

2023-08-09 Thread Alexander Monakov


On Wed, 9 Aug 2023, Richard Biener via Gcc-patches wrote:

> The following teaches the non-loop reduction vectorization code to
> handle non-associatable reductions.  Using the existing FOLD_LEFT_PLUS
> internal functions might be possible but I'd have to convince myself
> that +0.0 + x[0] is a safe extra operation in ever rounding mode
> (I also have no way to test the resulting code).

It's not. Under our default -fno-signaling-nans -fno-rounding-math
negative zero is the neutral element for addition, so '-0.0 + x[0]'
might be (but negative zero costs more to materialize).

If the reduction has at least two elements, then 

-0.0 + x[0] + x[1]

has the same behavior w.r.t SNaNs as 'x[0] + x[1]', but unfortunately
yields negative zero when x[0] = x[1] = +0.0 and rounding towards
negative infinity (unlike x[0] + x[1], which is +0.0).

Alexander


Re: RISC-V: Added support for CRC.

2023-08-09 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> If the compiler can identify a CRC and collapse it down to a table or clmul,
> that's a major win and such code does exist in the real world. That was the
> whole point behind the Fedora experiment -- to determine if these things are
> showing up in the real world or if this is just a benchmarking exercise.

Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).

For those projects that need a better CRC, why is the chosen solution
to optimize it in the compiler instead of offering them a library they
could use with any compiler?

Was there any thought given to embedded projects that use bitwise CRC
exactly because they have little space to spare for a hardcoded table?

> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> THe point is builtin_crc would always be available.  If there is no clmul,
> then the RTL backend can expand to a table lookup version.

No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.

> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked.  If you think there's room for improvement to a
> builtin API, I'd love to hear it.
> 
> But it seems you don't think this is worth the effort at all.  That's
> unfortunate, but if that's the consensus, then so be it.

I think it's a strange application of development effort. You'd get more
done coding a library.

> I'll note LLVM is likely going forward with CRC detection and optimization at
> some point in the next ~6 months (effectively moving the implementation from
> the hexagon port into the generic parts of their loop optimizer).

I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> That was my thinking at one time.  Then we started looking at the distros and
> found enough crc implementations in there to change my mind about the overall
> utility.

The ones I'm familiar with are all table-based and look impossible to
pattern-match (and are already fairly efficient compared to the bitwise
loop in Coremark).

> If we need to do something to make it more useful, we're certainly open to
> that.

So... just provide a library? Library code is easier to develop and audit,
it can be released independently, and people can use it with their compiler
of choice. Not everything needs to be in libgcc.

> > - they overlap multiple CLMUL chains to make the loop throughput-bound
> >rather than latency-bound. The typical unroll factor is about 4x-8x.
> We do have the ability to build longer chains.  We actually use that in the
> coremark benchmark where the underlying primitives are 8-bit CRCs that are
> composed into 16/32 bit CRCs.

I'm talking about factoring a long chain into multiple independent chains
for latency hiding.

> > Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
> > that actually needs high-performance CRC (the Linux kernel, compression
> > and image libraries).
> > 
> > I think this crosses the line of "cheating in benchmarks" and not something
> > we should do in GCC.
> Certainly not the intention.  The intention is to provide a useful builtin_crc

Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?

> while at the same time putting one side of the infrastructure we need for
> automatic detection of CRC loops and turning them into table lookups or
> CLMULs.
> 
> With that in mind I'm certain Mariam & I would love feedback on a builtin API
> that would be more useful.

I think offering a conventional library for CRC has substantial advantages.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Thu, 3 Aug 2023, Jeff Law wrote:

> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
> 
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
> 
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get an performant CRC
> implementation) and to get community feedback on the implementation.

Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.

Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
  modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
  rather than latency-bound. The typical unroll factor is about 4x-8x.

A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).

Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.

Alexander


RE: [PATCH] Replace invariant ternlog operands

2023-08-03 Thread Alexander Monakov


On Thu, 27 Jul 2023, Liu, Hongtao via Gcc-patches wrote:

> > +;; If the first and the second operands of ternlog are invariant and
> > +;; the third operand is memory then we should add load third operand
> > +;; from memory to register and replace first and second operands with
> > +;; this register
> > +(define_split
> > +  [(set (match_operand:V 0 "register_operand")
> > +   (unspec:V
> > + [(match_operand:V 1 "register_operand")
> > +  (match_operand:V 2 "register_operand")
> > +  (match_operand:V 3 "memory_operand")
> > +  (match_operand:SI 4 "const_0_to_255_operand")]
> > + UNSPEC_VTERNLOG))]
> > +  "ternlog_invariant_operand_mask (operands) == 3 && !reload_completed"
> Maybe better with "!reload_completed  && ternlog_invariant_operand_mask 
> (operands) == 3"

I made this change (in both places), plus some style TLC. Ok to apply?

From d24304a9efd049e8db6df5ac78de8ca2d941a3c7 Mon Sep 17 00:00:00 2001
From: Yan Simonaytes 
Date: Tue, 25 Jul 2023 20:43:19 +0300
Subject: [PATCH] Eliminate irrelevant operands of VPTERNLOG

As mentioned in PR 110202, GCC may be presented with input where control
word of the VPTERNLOG intrinsic implies that some of its operands do not
affect the result.  In that case, we can eliminate irrelevant operands
of the instruction by substituting any other operand in their place.
This removes false dependencies.

For instance, instead of (252 = 0xfc = _MM_TERNLOG_A | _MM_TERNLOG_B)

vpternlogq  $252, %zmm2, %zmm1, %zmm0

emit

vpternlogq  $252, %zmm0, %zmm1, %zmm0

When VPTERNLOG is invariant w.r.t first and second operands, and the
third operand is memory, load memory into the output operand first, i.e.
instead of (85 = 0x55 = ~_MM_TERNLOG_C)

vpternlogq  $85, (%rdi), %zmm1, %zmm0

emit

vmovdqa64   (%rdi), %zmm0
vpternlogq  $85, %zmm0, %zmm0, %zmm0

gcc/ChangeLog:

* config/i386/i386-protos.h (vpternlog_irrelevant_operand_mask):
Declare.
(substitute_vpternlog_operands): Declare.
* config/i386/i386.cc (vpternlog_irrelevant_operand_mask): New
helper.
(substitute_vpternlog_operands): New function.  Use them...
* config/i386/sse.md: ... here in new VPTERNLOG define_splits.

gcc/testsuite/ChangeLog:

* gcc.target/i386/invariant-ternlog-1.c: New test.
* gcc.target/i386/invariant-ternlog-2.c: New test.
---
 gcc/config/i386/i386-protos.h |  3 ++
 gcc/config/i386/i386.cc   | 43 +++
 gcc/config/i386/sse.md| 42 ++
 .../gcc.target/i386/invariant-ternlog-1.c | 21 +
 .../gcc.target/i386/invariant-ternlog-2.c | 12 ++
 5 files changed, 121 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-2.c

diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 27fe73ca65..12e6ff0ebc 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -70,6 +70,9 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
 extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
 extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
 
+extern int vpternlog_irrelevant_operand_mask (rtx[]);
+extern void substitute_vpternlog_operands (rtx[]);
+
 extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
 extern bool ix86_expand_set_or_cpymem (rtx, rtx, rtx, rtx, rtx, rtx,
   rtx, rtx, rtx, rtx, bool);
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 32851a514a..9a7c1135a0 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -19420,6 +19420,49 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
   return mask + 1;
 }
 
+/* Return a mask of VPTERNLOG operands that do not affect output.  */
+
+int
+vpternlog_irrelevant_operand_mask (rtx *operands)
+{
+  int mask = 0;
+  int imm8 = XINT (operands[4], 0);
+
+  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
+    mask |= 1;
+  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
+    mask |= 2;
+  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
+    mask |= 4;
+
+  return mask;
+}
+
+/* Eliminate false dependencies on operands that do not affect output
+   by substituting other operands of a VPTERNLOG.  */
+
+void
+substitute_vpternlog_operands (rtx *operands)
+{
+  int mask = vpternlog_irrelevant_operand_mask (operands);
+
+  if (mask & 1) /* The first operand is irrelevant.  */
+    operands[1] = operands[2];
+
+  if (mask & 2) /* The second operand is irrelevant.  */
+    operands[2] = operands[1];
+
+  if (mask & 4) /* The third operand is irrelevant.  */
+    operands[3] = operands[1];
+  else if (REG_P (operands[3]))
+    {
+      if (mask & 1)
+        operands[1] = operands[3];
+      if (mask & 2)
+        operands[2] = operands[3];
+    }
+}
+
 /* Return a register priority for hard reg 

Re: [PATCH] Reduce floating-point difficulties in timevar.cc

2023-07-21 Thread Alexander Monakov


On Fri, 21 Jul 2023, Xi Ruoyao wrote:

> > See also PR 99903 for an earlier known issue which appears due to x87
> > excess precision and so tweaking -ffp-contract wouldn't help:
> > 
> >   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903
> 
> Does it affect AArch64 too?

Well, not literally (AArch64 doesn't have excess precision), but absence
of intermediate rounding in FMA is similar to excess precision.

I'm saying it's the same issue manifesting via different pathways on x86
and aarch64. Sorry if I misunderstood your question.

Alexander


Re: [PATCH] Reduce floating-point difficulties in timevar.cc

2023-07-21 Thread Alexander Monakov


On Fri, 21 Jul 2023, Xi Ruoyao via Gcc-patches wrote:

> Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
> are building GCC 14 snapshot).  The default is "fast" (if no -std=
> option is used), which allows some contractions disallowed by the
> standard.

Not fully, see below.

> But GCC is in C++ and I'm not sure if the C++ standard has the same
> definition for allowed contractions as C.

It doesn't, but in GCC we should aim to provide the same semantics in C++
as in C.

> > (Or is the severity of lack of support sufficiently different in the two 
> > cases that this is fine -- i.e. not compile vs may trigger floating 
> > point rounding inaccuracies?)
> 
> It's possible that the test itself is flaky.  Can you provide some
> detail about how it fails?

See also PR 99903 for an earlier known issue which appears due to x87
excess precision and so tweaking -ffp-contract wouldn't help:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Now that multiple platforms are hitting this, can we _please_ get rid
of the questionable attempt to compute time in a floating-point variable
and just use an uint64_t storing nanoseconds?

Alexander


Re: [PATCH] Fix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.

2023-07-17 Thread Alexander Monakov


On Mon, 17 Jul 2023, Richard Biener wrote:

> > > > > OK.   Btw, while I didn't spot this during review I would appreciate
> > > > > if the code could use vec.[q]sort, this should work with a lambda as
> > > > > well I think.
> > > >
> > > > That was my first use, but that hits
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99469
> > >
> > > That is not hitting PR 99469 but rather it means your comparison is not
> > > correct for an (unstable) sort.
> > > That is qsort comparator should have this relationship `f(a,b) == !f(b, 
> > > a)` and
> > > `f(a,a)` should also return false.
> >
> > I'm using the standard std::pair comparator which indicates that f(a,a) is 
> > true,
> > https://en.cppreference.com/w/cpp/utility/pair/operator_cmp
> >
> > > If you are running into this for qsort here, you will most likely run 
> > > into issues
> > > with std::sort later on too.
> >
> > Don't see why or how. It needs to have a consistent relationship which 
> > std::pair
> > maintains.  So why would using the standard tuple comparator with a standard
> > std::sort cause problem?
> 
> At least for
> 
>  return left.second < right.second;
> 
> f(a,a) doesn't hold.  Note qsort can end up comparing an element to
> itself (not sure if GCCs implementation now can).

(it cannot but that is not important here)

Tamar, while std::sort receives a "less-than" comparison predicate, qsort
needs a tri-state comparator that returns a negative value for "less-than"
relation, positive for "more-than", and zero when operands are "equal".

Passing output of std::pair::operator< straight to qsort is not correct,
and qsort_chk catches that mistake at runtime.

std::sort is not a stable sort and therefore can cause code generation
differences by swapping around elements that are not bitwise-identical
but "equal" according to the comparator. This is the main reason for
preferring our internal qsort, which yields same results on all platforms.

Let me also note that #include  is pretty heavy-weight, and so
I'd suggest to avoid it to avoid needlessly increasing bootstrap times.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches



On Tue, 11 Jul 2023, Michael Matz wrote:

> Hey,
> 
> On Tue, 11 Jul 2023, Alexander Monakov via Gcc-patches wrote:
> 
> > > > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> > > > 
> > > > This is the weak/active form; I'd suggest "preserve_high_sse".
> > > 
> > > But it preserves only the low parts :-)  You swapped the two in your 
> > > mind when writing the reply?
> > 
> > Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not
> > the higher halves of (unspecified subset of) SSE regs.
> 
> Ah, gotcha :-)  It just shows that all these names are confusing.  Maybe 
> I'll just go with "attribute1" and "attribute2" and rely on docu.  (SCNR)

Heh, that reminds me that decimal digits are allowed in attribute names.
Let me offer "preserve_xmm_8_15" and "only_xmm_0_7" then.

One more thing to keep in mind is interaction with SSE-AVX transition.
If the function with a new attribute is using classic non-VEX-encoded SSE,
but its caller is using 256-bit ymm0-15, it will incur a substantial penalty
on Intel CPUs. There's no penalty on AMD (afaik) and no penalty for zmm16-31,
since those are inaccessible in non-EVEX code.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches


On Tue, 11 Jul 2023, Michael Matz wrote:

> > > To that end I introduce actually two related attributes (for naming
> > > see below):
> > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> > 
> > This is the weak/active form; I'd suggest "preserve_high_sse".
> 
> But it preserves only the low parts :-)  You swapped the two in your 
> mind when writing the reply?

Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not
the higher halves of (unspecified subset of) SSE regs.

If you look from AVX viewpoint, yes, it preserves lower 128 bits of the
high-numbered vector registers.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches


On Tue, 11 Jul 2023, Richard Biener wrote:

> > > If a function contains calls then GCC can't know which
> > > parts of the XMM regset is clobbered by that, it may be parts
> > > which don't even exist yet (say until avx2048 comes out), so we must
> > > restrict ourself to only save/restore the SSE2 parts and then of course
> > > can only claim to not clobber those parts.
> >
> > Hm, I guess this is kinda the reason a "weak" form is needed. But this
> > highlights the difference between the two: the "weak" form will actively
> > preserve some state (so it cannot preserve future extensions), while
> > the "strong" form may just passively not touch any state, preserving
> > any state it doesn't know about.
> >
> > > To that end I introduce actually two related attributes (for naming
> > > see below):
> > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> >
> > This is the weak/active form; I'd suggest "preserve_high_sse".
> 
> Isn't it the opposite?  "preserves_low_sse", unless you suggest
> the name applies to the caller which has to preserve high parts
> when calling nosseclobber.

This is the form where the function annotated with this attribute
consumes 128 bytes on the stack to "blindly" save/restore xmm8-15
if it calls anything with a vanilla ABI.

(actually thinking about it more, I'd like to suggest shelving this part
and only implement the zero-cost variant, noanysseclobber)

> > > * noanysseclobber: claims (and ensures) that nothing of any of the
> > >   registers overlapping xmm8-15 is clobbered (not even future, as of
> > >   yet unknown, parts)
> >
> > This is the strong/passive form; I'd suggest "only_low_sse".
> 
> Likewise.

Sorry if I managed to sow confusion here. In my mind, this is the form where
only xmm0-xmm7 can be written in the function annotated with the attribute,
including its callees. I was thinking that writing to zmm16-31 would be
disallowed too. The initial example was memcpy, where eight vector registers
are sufficient for the job.

> As for mask registers I understand we'd have to split the 8 register
> set into two halves to make the same approach work, otherwise
> we'd have no registers left to allocate from.

I'd suggest looking at how many mask registers OpenMP SIMD AVX-512 clones
can receive as implicit arguments, as one data point.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches
On Mon, 10 Jul 2023, Alexander Monakov wrote:

> > I chose to make it possible to write function definitions with that
> > attribute with GCC adding the necessary callee save/restore code in
> > the xlogue itself.
> 
> But you can't trivially restore if the callee is sibcalling — what
> happens then (a testcase might be nice)?

Sorry, when the caller is doing the sibcall, not the callee.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, Michael Matz via Gcc-patches wrote:

> Hello,
> 
> the ELF psABI for x86-64 doesn't have any callee-saved SSE
> registers (there were actual reasons for that, but those don't
> matter anymore).  This starts to hurt some uses, as it means that
> as soon as you have a call (say to memmove/memcpy, even if
> implicit as libcall) in a loop that manipulates floating point
> or vector data you get saves/restores around those calls.
> 
> But in reality many functions can be written such that they only need
> to clobber a subset of the 16 XMM registers (or do the save/restore
> themselves in the code paths that need them, hello memcpy again).
> So we want to introduce a way to specify this, via an ABI attribute
> that basically says "doesn't clobber the high XMM regs".

I think the main question is why you're going with this (weak) form
instead of the (strong) form "may only clobber the low XMM regs":
as Richi noted, surely for libcalls we'd like to know they preserve
AVX-512 mask registers as well?

(I realize this is partially answered later)

Note this interacts with anything that interposes between the caller
and the callee, like the Glibc lazy binding stub (which used to
zero out high halves of 512-bit arguments in ZMM registers).
Not an immediate problem for the patch, just something to mind perhaps.

> I've opted to do only the obvious: do something special only for
> xmm8 to xmm15, without a way to specify the clobber set in more detail.
> I think such a half/half split is reasonable, and as I don't want to
> change the argument passing (whose regs are always clobbered)
> there isn't much wiggle room anyway.
> 
> I chose to make it possible to write function definitions with that
> attribute with GCC adding the necessary callee save/restore code in
> the xlogue itself.

But you can't trivially restore if the callee is sibcalling — what
happens then (a testcase might be nice)?

> Carefully note that this is only possible for
> the SSE2 registers, as other parts of them would need instructions
> that are only optional.

What is supposed to happen on 32-bit x86 with -msse -mno-sse2?

> When a function doesn't contain calls to
> unknown functions we can be a bit more lenient: we can make it so that
> GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> necessary.

What if the source code has a local register variable bound to xmm15,
i.e. register double x asm("xmm15"); asm("..." : "+x"(x)); ?
> Probably "don't do that", i.e. disallow that in the documentation?

> If a function contains calls then GCC can't know which
> parts of the XMM regset are clobbered by that; it may be parts
> which don't even exist yet (say until avx2048 comes out), so we must
> restrict ourself to only save/restore the SSE2 parts and then of course
> can only claim to not clobber those parts.

Hm, I guess this is kinda the reason a "weak" form is needed. But this
highlights the difference between the two: the "weak" form will actively
preserve some state (so it cannot preserve future extensions), while
the "strong" form may just passively not touch any state, preserving
any state it doesn't know about.

> To that end I introduce actually two related attributes (for naming
> see below):
> * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered

This is the weak/active form; I'd suggest "preserve_high_sse".

> * noanysseclobber: claims (and ensures) that nothing of any of the
>   registers overlapping xmm8-15 is clobbered (not even future, as of
>   yet unknown, parts)

This is the strong/passive form; I'd suggest "only_low_sse".

> Ensuring the first is simple: potentially add saves/restore in xlogue
> (e.g. when xmm8 is either used explicitly or implicitly by a call).
> Ensuring the second comes with more: we must also ensure that no
> functions are called that don't guarantee the same thing (in addition
> to just removing all xmm8-15 parts altogether from the available
> registers).
> 
> See also the added testcases for what I intended to support.
> 
> I chose to use the new target-independent function-abi facility for
> this.  I need some adjustments in generic code:
> * the "default_abi" is actually more like a "current" abi: it happily
>   changes its contents according to conditional_register_usage,
>   and other code assumes that such changes do propagate.
>   But if that conditional_reg_usage is actually done because the current
>   function is of a different ABI, then we must not change default_abi.
> * in insn_callee_abi we do look at a potential fndecl for a call
> insn (only set when -fipa-ra), but that doesn't work for calls through
> pointers and (as said) is optional.  So, also always look at the
> called function's type (it's always recorded in the MEM_EXPR for
>   non-libcalls), before asking the target.
>   (The function-abi accessors working on trees were already doing that,
>   it's just the RTL accessor that missed this)
> 
> Accordingly I also implement some 

Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote:

> A false dependency happens when the destination is only updated by
> pternlog. There is no false dependency when the destination is also
> used as a source. So either a pxor should be inserted, or the input
> operand should be given constraint '0'.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready to push to trunk.

Shouldn't this patch also remove uses of vpternlog in
standard_sse_constant_opcode?

A couple more questions below:

> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
> ]
> (symbol_ref "true")))])
>  
> +; False dependency happens on destination register which is not really
> +; used when moving all ones to vector register
> +(define_split
> +  [(set (match_operand:VMOVE 0 "register_operand")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> +  "TARGET_AVX512F && reload_completed
> +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
> +  && optimize_function_for_speed_p (cfun)"

Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate.
Doesn't it work here as well?

> +  [(set (match_dup 0) (match_dup 2))
> +   (parallel
> + [(set (match_dup 0) (match_dup 1))
> +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[2] = CONST0_RTX (mode);")
> +
> +(define_insn "*vmov_constm1_pternlog_false_dep"
> +  [(set (match_operand:VMOVE 0 "register_operand" "=v")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
> ""))
> +   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +   "TARGET_AVX512VL ||  == 64"
> +   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
> +  [(set_attr "type" "sselog1")
> +   (set_attr "prefix" "evex")])
> +
>  ;; If mem_addr points to a memory region with less than whole vector size 
> bytes
>  ;; of accessible memory and k is a mask that would prevent reading the 
> inaccessible
>  ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be transformed 
> to vpblendd
> @@ -9336,7 +9359,7 @@ (define_expand "_cvtmask2"
>  operands[3] = CONST0_RTX (mode);
>}")
>  
> -(define_insn "*_cvtmask2"
> +(define_insn_and_split "*_cvtmask2"
>[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
>   (vec_merge:VI48_AVX512VL
> (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> @@ -9346,11 +9369,35 @@ (define_insn "*_cvtmask2"
>"@
> vpmovm2\t{%1, %0|%0, %1}
> vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  "&& !TARGET_AVX512DQ && reload_completed
> +   && optimize_function_for_speed_p (cfun)"
> +  [(set (match_dup 0) (match_dup 4))
> +   (parallel
> +[(set (match_dup 0)
> +   (vec_merge:VI48_AVX512VL
> + (match_dup 2)
> + (match_dup 3)
> + (match_dup 1)))
> + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[4] = CONST0_RTX (mode);"
>[(set_attr "isa" "avx512dq,*")
> (set_attr "length_immediate" "0,1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>  
> +(define_insn "*_cvtmask2_pternlog_false_dep"
> +  [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
> + (vec_merge:VI48_AVX512VL
> +   (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> +   (match_operand:VI48_AVX512VL 3 "const0_operand")
> +   (match_operand: 1 "register_operand" "Yk")))
> +   (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +  "TARGET_AVX512F && !TARGET_AVX512DQ"
> +  "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  [(set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "")])
> +
>  (define_expand "extendv2sfv2df2"
>[(set (match_operand:V2DF 0 "register_operand")
>   (float_extend:V2DF
> @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2"
>  operands[2] = force_reg (mode, operands[2]);
>  })
>  
> -(define_insn "one_cmpl2"
> -  [(set (match_operand:VI 0 "register_operand" "=v,v")
> - (xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m")
> - (match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))]
> +(define_insn_and_split "one_cmpl2"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v,v")
> + (xor:VI (match_operand:VI 1 "bcst_vector_operand" " 0, m,Br")
> + (match_operand:VI 2 "vector_all_ones_operand" "BC,BC,BC")))]
>"TARGET_AVX512F
> && (!
> || mode == SImode
> || mode == DImode)"
>  {
> +  if (! && which_alternative
> +  && optimize_function_for_speed_p (cfun))
> +return "#";
> +
>if (TARGET_AVX512VL)
>  return "vpternlog\t{$0x55, %1, %0, 
> %0|%0, %0, %1, 0x55}";
>else
>  return "vpternlog\t{$0x55, %g1, %g0, 
> %g0|%g0, %g0, %g1, 0x55}";
>  }
> +  "&& reload_completed && !REG_P (operands[1]) && !
> +   && optimize_function_for_speed_p (cfun)"
> 

Re: [PATCH] c-family: implement -ffp-contract=on

2023-06-19 Thread Alexander Monakov via Gcc-patches


Ping. OK for trunk?

On Mon, 5 Jun 2023, Alexander Monakov wrote:

> Ping for the front-end maintainers' input.
> 
> On Mon, 22 May 2023, Richard Biener wrote:
> 
> > On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
> >  wrote:
> > >
> > > Implement -ffp-contract=on for C and C++ without changing default
> > > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> > 
> > The documentation changes mention the defaults are changed for
> > standard modes, I suppose you want to remove that hunk.
> > 
> > > gcc/c-family/ChangeLog:
> > >
> > > * c-gimplify.cc (fma_supported_p): New helper.
> > > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
> > > contraction.
> > >
> > > gcc/ChangeLog:
> > >
> > > * common.opt (fp_contract_mode) [on]: Remove fallback.
> > > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
> > > * doc/invoke.texi (-ffp-contract): Update.
> > > * trans-mem.cc (diagnose_tm_1): Skip internal function calls.
> > > ---
> > >  gcc/c-family/c-gimplify.cc | 78 ++
> > >  gcc/common.opt |  3 +-
> > >  gcc/config/sh/sh.md|  2 +-
> > >  gcc/doc/invoke.texi|  8 ++--
> > >  gcc/trans-mem.cc   |  3 ++
> > >  5 files changed, 88 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
> > > index ef5c7d919f..f7635d3b0c 100644
> > > --- a/gcc/c-family/c-gimplify.cc
> > > +++ b/gcc/c-family/c-gimplify.cc
> > > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "c-ubsan.h"
> > >  #include "tree-nested.h"
> > >  #include "context.h"
> > > +#include "tree-pass.h"
> > > +#include "internal-fn.h"
> > >
> > >  /*  The gimplification pass converts the language-dependent trees
> > >  (ld-trees) emitted by the parser into language-independent trees
> > > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree 
> > > body)
> > >return bind;
> > >  }
> > >
> > > +/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
> > > +
> > > +static bool
> > > +fma_supported_p (enum internal_fn fn, tree type)
> > > +{
> > > +  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
> > > +}
> > > +
> > >  /* Gimplification of expression trees.  */
> > >
> > >  /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
> > > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
> > > ATTRIBUTE_UNUSED,
> > > break;
> > >}
> > >
> > > +case PLUS_EXPR:
> > > +case MINUS_EXPR:
> > > +  {
> > > +   tree type = TREE_TYPE (*expr_p);
> > > +   /* For -ffp-contract=on we need to attempt FMA contraction only
> > > +  during initial gimplification.  Late contraction across 
> > > statement
> > > +  boundaries would violate language semantics.  */
> > > +   if (SCALAR_FLOAT_TYPE_P (type)
> > > +   && flag_fp_contract_mode == FP_CONTRACT_ON
> > > +   && cfun && !(cfun->curr_properties & PROP_gimple_any)
> > > +   && fma_supported_p (IFN_FMA, type))
> > > + {
> > > +   bool neg_mul = false, neg_add = code == MINUS_EXPR;
> > > +
> > > +   tree *op0_p = &TREE_OPERAND (*expr_p, 0);
> > > +   tree *op1_p = &TREE_OPERAND (*expr_p, 1);
> > > +
> > > +   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
> > > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR
> > > +   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
> > > + /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
> > > +   else if (TREE_CODE (*op0_p) != MULT_EXPR)
> > > + {
> > > +   std::swap (op0_p, op1_p);
> > > +   std::swap (neg_mul, neg_add);
> > > + }
> > > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
> > > + {
> > > +   op0_p = &TREE_OPERAND (*op0_p, 0);
> > > +   neg_mul 

Re: [PATCH] c-family: implement -ffp-contract=on

2023-06-05 Thread Alexander Monakov via Gcc-patches
Ping for the front-end maintainers' input.

On Mon, 22 May 2023, Richard Biener wrote:

> On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Implement -ffp-contract=on for C and C++ without changing default
> > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> 
> The documentation changes mention the defaults are changed for
> standard modes, I suppose you want to remove that hunk.
> 
> > gcc/c-family/ChangeLog:
> >
> > * c-gimplify.cc (fma_supported_p): New helper.
> > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
> > contraction.
> >
> > gcc/ChangeLog:
> >
> > * common.opt (fp_contract_mode) [on]: Remove fallback.
> > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
> > * doc/invoke.texi (-ffp-contract): Update.
> > * trans-mem.cc (diagnose_tm_1): Skip internal function calls.
> > ---
> >  gcc/c-family/c-gimplify.cc | 78 ++
> >  gcc/common.opt |  3 +-
> >  gcc/config/sh/sh.md|  2 +-
> >  gcc/doc/invoke.texi|  8 ++--
> >  gcc/trans-mem.cc   |  3 ++
> >  5 files changed, 88 insertions(+), 6 deletions(-)
> >
> > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
> > index ef5c7d919f..f7635d3b0c 100644
> > --- a/gcc/c-family/c-gimplify.cc
> > +++ b/gcc/c-family/c-gimplify.cc
> > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "c-ubsan.h"
> >  #include "tree-nested.h"
> >  #include "context.h"
> > +#include "tree-pass.h"
> > +#include "internal-fn.h"
> >
> >  /*  The gimplification pass converts the language-dependent trees
> >  (ld-trees) emitted by the parser into language-independent trees
> > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree 
> > body)
> >return bind;
> >  }
> >
> > +/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
> > +
> > +static bool
> > +fma_supported_p (enum internal_fn fn, tree type)
> > +{
> > +  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
> > +}
> > +
> >  /* Gimplification of expression trees.  */
> >
> >  /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
> > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
> > ATTRIBUTE_UNUSED,
> > break;
> >}
> >
> > +case PLUS_EXPR:
> > +case MINUS_EXPR:
> > +  {
> > +   tree type = TREE_TYPE (*expr_p);
> > +   /* For -ffp-contract=on we need to attempt FMA contraction only
> > +  during initial gimplification.  Late contraction across statement
> > +  boundaries would violate language semantics.  */
> > +   if (SCALAR_FLOAT_TYPE_P (type)
> > +   && flag_fp_contract_mode == FP_CONTRACT_ON
> > +   && cfun && !(cfun->curr_properties & PROP_gimple_any)
> > +   && fma_supported_p (IFN_FMA, type))
> > + {
> > +   bool neg_mul = false, neg_add = code == MINUS_EXPR;
> > +
> > +   tree *op0_p = &TREE_OPERAND (*expr_p, 0);
> > +   tree *op1_p = &TREE_OPERAND (*expr_p, 1);
> > +
> > +   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
> > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR
> > +   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
> > + /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
> > +   else if (TREE_CODE (*op0_p) != MULT_EXPR)
> > + {
> > +   std::swap (op0_p, op1_p);
> > +   std::swap (neg_mul, neg_add);
> > + }
> > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
> > + {
> > +   op0_p = &TREE_OPERAND (*op0_p, 0);
> > +   neg_mul = !neg_mul;
> > + }
> > +   if (TREE_CODE (*op0_p) != MULT_EXPR)
> > + break;
> > +   auto_vec ops (3);
> > +   ops.quick_push (TREE_OPERAND (*op0_p, 0));
> > +   ops.quick_push (TREE_OPERAND (*op0_p, 1));
> > +   ops.quick_push (*op1_p);
> > +
> > +   enum internal_fn ifn = IFN_FMA;
> > +   if (neg_mul)
> > + {
> > +   if (fma_supported_p (IFN_FNMA, type))
> > + ifn = IFN_F

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Alexander Monakov via Gcc-patches


On Fri, 2 Jun 2023, Matthias Kretz wrote:

> > Okay, I see opinions will vary here. I was thinking about our immintrin.h
> > which is partially implemented in terms of generic vectors. Imagine we
> > extend UBSan to trap on signed overflow for vector types. I expect that
> > will blow up on existing code that uses Intel intrinsics.
> 
> _mm_add_epi32 is already implemented via __v4su addition (i.e. unsigned). So 
> the intrinsic would continue to wrap on signed overflow.

Ah, if our intrinsics take care of it, that alleviates my concern.

> > I'm not sure what you consider a breaking change here. Is that the implied
> > threat to use undefinedness for range deduction and other optimizations?
> 
> Consider the stdx::simd implementation. It currently follows semantics of the 
> builtin types. So simd can be shifted by 30 without UB. The 
> implementation of the shift operator depends on the current behavior, even if 
> it is target-dependent. For PPC the simd implementation adds extra code to 
> avoid the "UB". With shifts > sizeof(T) nailed down as UB, this extra
> code now needs to be added for all targets.

What does stdx::simd do on LLVM, where that has always been UB even on x86?

Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Alexander Monakov via Gcc-patches


On Fri, 2 Jun 2023, Matthias Kretz wrote:

> On Thursday, 1 June 2023 20:25:14 CEST Alexander Monakov wrote:
> > On Wed, 31 May 2023, Richard Biener wrote:
> > > So yes, we probably should clarify the semantics to match the
> > > implementation (since we have two targets doing things differently
> > > since forever we can only document it as UB) and also note the
> > > difference from OpenCL (in case OpenCL is still relevant these
> > > days we might want to offer a -fopencl-vectors to emit the required
> > > AND).
> > 
> > It doesn't have to be UB, in principle we could say that shift amount
> > is taken modulo some power of two depending on the target without UB.
> > But since LLVM already treats that as UB, we might as well follow.
> 
> I prefer UB (as your patch states). If a user requires the AND, let them 
> state it explicitly. Don't let everybody pay in performance.

What I suggested does not imply a performance cost. All targets take some
lower bits of the shift amount anyway. It's only OpenCL's exact masking
that would imply a performance cost (and I agree it's inappropriate for
GCC's generic vectors).

> > I think for addition/multiplication of signed vectors everybody
> > expects them to have wrapping semantics without UB on overflow though?
> 
>   simd x = ...;
>   bool t = all_of(x < x + 1); // unconditionally true or not?
> 
> I'd expect t to be unconditionally true, because simd simply is a
> data-parallel version of int.

Okay, I see opinions will vary here. I was thinking about our immintrin.h
which is partially implemented in terms of generic vectors. Imagine we
extend UBSan to trap on signed overflow for vector types. I expect that
will blow up on existing code that uses Intel intrinsics. But use of
generic vectors in immintrin.h is our implementation detail, and people
might have expected intrinsics to be overflow-safe, like for aliasing
(where we use __attribute__((may_alias)) in immintrin.h). Although, we
can solve that by inventing overflow-wraps attribute for types, maybe?

> > Revised patch below.
> 
> This can be considered a breaking change. Does it need a mention in the 
> release notes?

I'm not sure what you consider a breaking change here. Is that the implied
threat to use undefinedness for range deduction and other optimizations?

Thanks.
Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-01 Thread Alexander Monakov via Gcc-patches


On Wed, 31 May 2023, Richard Biener wrote:

> On Tue, May 30, 2023 at 4:49 PM Alexander Monakov  wrote:
> >
> >
> > On Thu, 25 May 2023, Richard Biener wrote:
> >
> > > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  
> > > wrote:
> > > >
> > > >
> > > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> > > >
> > > > > I’d have to check the ISAs what they actually do here - it of course 
> > > > > depends
> > > > > on RTL semantics as well but as you say those are not strictly 
> > > > > defined here
> > > > > either.
> > > >
> > > > Plus, we can add the following executable test to the testsuite:
> > >
> > > Yeah, that's probably a good idea.  I think your documentation change
> > > with the added sentence about the truncation is OK.
> >
> > I am no longer confident in my patch, sorry.
> >
> > My claim about vector shift semantics in OpenCL was wrong. In fact it 
> > specifies
> > that RHS of a vector shift is masked to the exact bitwidth of the element 
> > type.
> >
> > So, to collect various angles:
> >
> > 1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec).
> >
> > 2. From user side we had a request to follow C integer promotion semantics
> >in https://gcc.gnu.org/PR91838 but I now doubt we can do that.
> >
> > 3. LLVM makes oversized vector shifts UB both for 'vector_size' and
> >'ext_vector_type'.
> 
> I had the impression GCC desired to do 3. as well, matching what we do
> for scalar shifts.
> 
> > 4. Vector lowering does not emit promotions, and starting from gcc-12
> >ranger treats oversized shifts according to the documentation you
> >cite below, and optimizes (e.g. with '-O2 -mno-sse')
> >
> > typedef short v8hi __attribute__((vector_size(16)));
> >
> > void f(v8hi *p)
> > {
> > *p >>= 16;
> > }
> >
> >to zeroing '*p'. If this looks unintended, I can file a bug.
> >
> > I still think we need to clarify semantics of vector shifts, but probably
> > not in the way I proposed initially. What do you think?
> 
> I think the intent at some point was to adhere to the OpenCL spec
> for the GCC vector extension (because that's a written spec while
> GCCs vector extension docs are lacking).  Originally the powerpc
> altivec 'vector' keyword spurred most of the development IIRC
> so it might be useful to see how they specify shifts.

It doesn't look like they document the semantics of '<<' and '>>'
operators for vector types.

> So yes, we probably should clarify the semantics to match the
> implementation (since we have two targets doing things differently
> since forever we can only document it as UB) and also note the
> difference from OpenCL (in case OpenCL is still relevant these
> days we might want to offer a -fopencl-vectors to emit the required
> AND).

It doesn't have to be UB, in principle we could say that shift amount
is taken modulo some power of two depending on the target without UB.
But since LLVM already treats that as UB, we might as well follow.

I think for addition/multiplication of signed vectors everybody
expects them to have wrapping semantics without UB on overflow though?

Revised patch below.

> It would be also good to amend the RTL documentation.
> 
> It would be very nice to start an internals documentation section
> around collecting what the middle-end considers undefined
> or implementation defined (aka target defined) behavior in the
> GENERIC, GIMPLE and RTL ILs and what predicates eventually
> control that (like TYPE_OVERFLOW_UNDEFINED).  Maybe spread it over
> {gimple,generic,rtl}.texi, though gimple.texi is only about the representation
> and all semantics are shared and documented in generic.texi.

Hm, noted. Thanks.

---8<---

From e4e8d9e262f2f8dbc91a94291cf7accb74d27e7c Mon Sep 17 00:00:00 2001
From: Alexander Monakov 
Date: Wed, 24 May 2023 15:48:29 +0300
Subject: [PATCH] doc: clarify semantics of vector bitwise shifts

Explicitly say that attempted shift past element bit width is UB for
vector types.  Mention that integer promotions do not happen.

gcc/ChangeLog:

* doc/extend.texi (Vector Extensions): Clarify bitwise shift
semantics.
---
 gcc/doc/extend.texi | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e426a2eb7d..3723cfe467 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -12026,7 +12026,14 @@ elements in the operand.
 It is possible to use shifting

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-30 Thread Alexander Monakov via Gcc-patches


On Thu, 25 May 2023, Richard Biener wrote:

> On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  wrote:
> >
> >
> > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> >
> > > I’d have to check the ISAs what they actually do here - it of course 
> > > depends
> > > on RTL semantics as well but as you say those are not strictly defined 
> > > here
> > > either.
> >
> > Plus, we can add the following executable test to the testsuite:
> 
> Yeah, that's probably a good idea.  I think your documentation change
> with the added sentence about the truncation is OK.

I am no longer confident in my patch, sorry.

My claim about vector shift semantics in OpenCL was wrong. In fact it specifies
that the RHS of a vector shift is masked to the exact bitwidth of the element type.

So, to collect various angles:

1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec).

2. From the user side we had a request to follow C integer promotion semantics
   in https://gcc.gnu.org/PR91838 but I now doubt we can do that.

3. LLVM makes oversized vector shifts UB both for 'vector_size' and
   'ext_vector_type'.

4. Vector lowering does not emit promotions, and starting from gcc-12
   ranger treats oversized shifts according to the documentation you
   cite below, and optimizes (e.g. with '-O2 -mno-sse')

typedef short v8hi __attribute__((vector_size(16)));

void f(v8hi *p)
{
*p >>= 16;
}

   to zeroing '*p'. If this looks unintended, I can file a bug.

I still think we need to clarify semantics of vector shifts, but probably
not in the way I proposed initially. What do you think?

Thanks.
Alexander

> Note we have
> 
> /* Shift operations for shift and rotate.
>Shift means logical shift if done on an
>unsigned type, arithmetic shift if done on a signed type.
>The second operand is the number of bits to
>shift by; it need not be the same type as the first operand and result.
>Note that the result is undefined if the second operand is larger
>than or equal to the first operand's type size.
> 
>The first operand of a shift can have either an integer or a
>(non-integer) fixed-point type.  We follow the ISO/IEC TR 18037:2004
>semantics for the latter.
> 
>Rotates are defined for integer types only.  */
> DEFTREECODE (LSHIFT_EXPR, "lshift_expr", tcc_binary, 2)
> 
> in tree.def which implies short << 24 is undefined behavior (similar
> wording in generic.texi).  The rtl docs say nothing about behavior
> but I think the semantics should carry over.  That works for x86
> even for scalar instructions working on GPRs (masking is applied
> but fixed to 5 or 6 bits even for QImode or HImode shifts).
> 
> Note that when we make these shifts well-defined there's
> also arithmetic on signed types smaller than int (which again
> doesn't exist in C) where overflow invokes undefined behavior
> in the middle-end.  Unless we want to change that as well
> this is somewhat inconsistent then.
> 
> There's also the issue that C 'int' is defined by INT_TYPE_SIZE
> and thus target dependent which makes what is undefined and
> what not target dependent.
> 
> Richard.
> 
> > #include <stdint.h>
> >
> > #define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \
> > { \
> > typedef TYPE vec __attribute__((vector_size(WIDTH))); \
> >   \
> > static volatile vec zero; \
> > vec tmp = (zero-2) OP (COUNT);\
> > vec ref = INVERT zero;\
> > if (__builtin_memcmp(&tmp, &ref, sizeof tmp)) \
> > __builtin_abort();\
> > }
> >
> > int main(void)
> > {
> > CHECK( uint8_t, 16, <<, 8,  )
> > CHECK( uint8_t, 16, <<, 31, )
> > CHECK( uint8_t, 16, >>, 8,  )
> > CHECK( uint8_t, 16, >>, 31, )
> > CHECK(  int8_t, 16, <<, 8,  )
> > CHECK(  int8_t, 16, <<, 31, )
> > CHECK(  int8_t, 16, >>, 8,  ~)
> > CHECK(  int8_t, 16, >>, 31, ~)
> > CHECK(uint16_t, 16, <<, 16, )
> > CHECK(uint16_t, 16, <<, 31, )
> > CHECK(uint16_t, 16, >>, 16, )
> > CHECK(uint16_t, 16, >>, 31, )
> > CHECK( int16_t, 16, <<, 16, )
> > CHECK( int16_t, 16, <<, 31, )
> > CHECK( int16_t, 16, >>, 16, ~)
> > CHECK( int16_t, 16, >>, 31, ~)
> > // Per-lane-variable shifts:
>

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:

> I’d have to check the ISAs what they actually do here - it of course depends
> on RTL semantics as well but as you say those are not strictly defined here
> either.

Plus, we can add the following executable test to the testsuite:

#include <stdint.h>

#define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \
{ \
typedef TYPE vec __attribute__((vector_size(WIDTH))); \
  \
static volatile vec zero; \
vec tmp = (zero-2) OP (COUNT);\
vec ref = INVERT zero;\
if (__builtin_memcmp(&tmp, &ref, sizeof tmp)) \
__builtin_abort();\
}

int main(void)
{
CHECK( uint8_t, 16, <<, 8,  )
CHECK( uint8_t, 16, <<, 31, )
CHECK( uint8_t, 16, >>, 8,  )
CHECK( uint8_t, 16, >>, 31, )
CHECK(  int8_t, 16, <<, 8,  )
CHECK(  int8_t, 16, <<, 31, )
CHECK(  int8_t, 16, >>, 8,  ~)
CHECK(  int8_t, 16, >>, 31, ~)
CHECK(uint16_t, 16, <<, 16, )
CHECK(uint16_t, 16, <<, 31, )
CHECK(uint16_t, 16, >>, 16, )
CHECK(uint16_t, 16, >>, 31, )
CHECK( int16_t, 16, <<, 16, )
CHECK( int16_t, 16, <<, 31, )
CHECK( int16_t, 16, >>, 16, ~)
CHECK( int16_t, 16, >>, 31, ~)
// Per-lane-variable shifts:
CHECK( uint8_t, 16, <<, zero+8,  )
CHECK( uint8_t, 16, <<, zero+31, )
CHECK( uint8_t, 16, >>, zero+8,  )
CHECK( uint8_t, 16, >>, zero+31, )
CHECK(  int8_t, 16, <<, zero+8,  )
CHECK(  int8_t, 16, <<, zero+31, )
CHECK(  int8_t, 16, >>, zero+8,  ~)
CHECK(  int8_t, 16, >>, zero+31, ~)
CHECK(uint16_t, 16, <<, zero+16, )
CHECK(uint16_t, 16, <<, zero+31, )
CHECK(uint16_t, 16, >>, zero+16, )
CHECK(uint16_t, 16, >>, zero+31, )
CHECK( int16_t, 16, <<, zero+16, )
CHECK( int16_t, 16, <<, zero+31, )
CHECK( int16_t, 16, >>, zero+16, ~)
CHECK( int16_t, 16, >>, zero+31, ~)

// Repeat for WIDTH=32 and WIDTH=64
}

Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener wrote:

> On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Explicitly say that bitwise shifts for narrow types work similar to
> > element-wise C shifts with integer promotions, which coincides with
> > OpenCL semantics.
> 
> Do we need to clarify that v << w with v being a vector of shorts
> still yields a vector of shorts and not a vector of ints?

I don't think so, but if necessary we could add "and the result was
truncated back to the base type":

When the base type is narrower than @code{int}, element-wise shifts
are performed as if operands underwent C integer promotions, and
the result was truncated back to the base type, like in OpenCL. 

> Btw, I don't see this promotion reflected in the IL.  For
> 
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi foo (v8hi a, v8hi b)
> {
>   return a << b;
> }
> 
> I get no masking of 'b' and vector lowering if the target doesn't handle it
> yields
> 
>   short int _5;
>   short int _6;
> 
>   _5 = BIT_FIELD_REF ;
>   _6 = BIT_FIELD_REF ;
>   _7 = _5 << _6;
> 
> which we could derive ranges from for _6 (apparently we don't yet).

Here it depends on how we define the GIMPLE-level semantics of bit-shift
operators for narrow types. To avoid changing lowering we could say that
shifting by up to 31 bits is well-defined for narrow types.

RTL-level semantics are also undocumented, unfortunately.

> Even
> 
> typedef int v8hi __attribute__((vector_size(16)));
> 
> v8hi x;
> int foo (v8hi a, v8hi b)
> {
>   x = a << b;
>   return (b[0] > 33);
> }
> 
> isn't optimized currently (but could - note I've used 'int' elements here).

Yeah. But let's constrain the optimizations first.

> So, I don't see us making sure the hardware does the right thing for
> out-of bound values.

I think in practice it worked out even if GCC did not pay attention to it,
because SIMD instructions had to facilitate autovectorization for C with
corresponding shift semantics.

Alexander

> 
> Richard.
> 
> > gcc/ChangeLog:
> >
> > * doc/extend.texi (Vector Extensions): Clarify bitwise shift
> > semantics.
> > ---
> >  gcc/doc/extend.texi | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> > index e426a2eb7d..6b4e94b6a1 100644
> > --- a/gcc/doc/extend.texi
> > +++ b/gcc/doc/extend.texi
> > @@ -12026,7 +12026,12 @@ elements in the operand.
> >  It is possible to use shifting operators @code{<<}, @code{>>} on
> >  integer-type vectors. The operation is defined as following: @code{@{a0,
> >  a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
> > -@dots{}, an >> bn@}}@. Vector operands must have the same number of
> > +@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
> > +element-wise shifts are performed as if operands underwent C integer
> > +promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
> > +well-defined for vectors with @code{char} and @code{short} base types.
> > +
> > +Operands of binary vector operations must have the same number of
> >  elements.
> >
> >  For convenience, it is allowed to use a binary vector operation
> > --
> > 2.39.2
> >
> 


[PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches
Explicitly say that bitwise shifts for narrow types work similar to
element-wise C shifts with integer promotions, which coincides with
OpenCL semantics.

gcc/ChangeLog:

* doc/extend.texi (Vector Extensions): Clarify bitwise shift
semantics.
---
 gcc/doc/extend.texi | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e426a2eb7d..6b4e94b6a1 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -12026,7 +12026,12 @@ elements in the operand.
 It is possible to use shifting operators @code{<<}, @code{>>} on
 integer-type vectors. The operation is defined as following: @code{@{a0,
 a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
-@dots{}, an >> bn@}}@. Vector operands must have the same number of
+@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
+element-wise shifts are performed as if operands underwent C integer
+promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
+well-defined for vectors with @code{char} and @code{short} base types.
+
+Operands of binary vector operations must have the same number of
 elements. 
 
 For convenience, it is allowed to use a binary vector operation
-- 
2.39.2



Re: [PATCH] c-family: implement -ffp-contract=on

2023-05-23 Thread Alexander Monakov via Gcc-patches


On Tue, 23 May 2023, Richard Biener wrote:
> > Ah, no, I deliberately decided against that, because that way we would go
> > via gimplify_arg, which would emit all side effects in *pre_p. That seems
> > wrong if arguments had side-effects that should go in *post_p.
> 
> Ah, true - that warrants a comment though.

Incrementally fixed up in my tree like this:

diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
index f7635d3b0c..17b0610a89 100644
--- a/gcc/c-family/c-gimplify.cc
+++ b/gcc/c-family/c-gimplify.cc
@@ -803,6 +803,7 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
ATTRIBUTE_UNUSED,
else
  ops[2] = build1 (NEGATE_EXPR, type, ops[2]);
  }
+   /* Avoid gimplify_arg: it emits all side effects into *PRE_P.  */
for (auto &op : ops)
  if (gimplify_expr (&op, pre_p, post_p, is_gimple_val, fb_rvalue)
  == GS_ERROR)

Alexander


Re: [PATCH] c-family: implement -ffp-contract=on

2023-05-22 Thread Alexander Monakov via Gcc-patches


On Mon, 22 May 2023, Richard Biener wrote:

> On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Implement -ffp-contract=on for C and C++ without changing default
> > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> 
> The documentation changes mention the defaults are changed for
> standard modes, I suppose you want to remove that hunk.

No, the current documentation is incomplete, and that hunk extends it
to match the current GCC behavior. Should I break it out to a separate
patch? I see this drive-by fix could look confusing — sorry about that.

> it would be possible to do
> 
>   *expr_p = build_call_expr_internal (ifn, type, ops[0], ops[1], ops[2]);
>   return GS_OK;
> 
> and not worry about temporary creation and gimplifying of the operands.
> That would in theory also leave the possibility to do this during
> genericization instead (and avoid the guard against late invocation of
> the hook).

Ah, no, I deliberately decided against that, because that way we would go
via gimplify_arg, which would emit all side effects in *pre_p. That seems
wrong if arguments had side-effects that should go in *post_p.

Thanks.
Alexander

> Otherwise it looks OK, but I'll let frontend maintainers have a chance to look
> as well.
> 
> Thanks for tackling this long-standing issue.
> Richard.


[PATCH] c-family: implement -ffp-contract=on

2023-05-18 Thread Alexander Monakov via Gcc-patches
Implement -ffp-contract=on for C and C++ without changing default
behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
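
A minimal sketch of the intended semantics (my own illustration, assuming an FMA-capable target): under =on, contraction is confined to a single source expression.

```c
/* Under -ffp-contract=on, f1 may be emitted as one fused multiply-add,
   because the multiply and the add come from a single expression.  In
   f2 the multiply is a separate statement whose result is rounded, so
   contraction across the statement boundary is not allowed.  */
double f1(double a, double b, double c)
{
    return a * b + c;      /* may contract to fma (a, b, c) */
}

double f2(double a, double b, double c)
{
    double t = a * b;      /* rounded intermediate */
    return t + c;          /* must not fuse with the multiply */
}
```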

gcc/c-family/ChangeLog:

* c-gimplify.cc (fma_supported_p): New helper.
(c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
contraction.

gcc/ChangeLog:

* common.opt (fp_contract_mode) [on]: Remove fallback.
* config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
* doc/invoke.texi (-ffp-contract): Update.
* trans-mem.cc (diagnose_tm_1): Skip internal function calls.
---
 gcc/c-family/c-gimplify.cc | 78 ++
 gcc/common.opt |  3 +-
 gcc/config/sh/sh.md|  2 +-
 gcc/doc/invoke.texi|  8 ++--
 gcc/trans-mem.cc   |  3 ++
 5 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
index ef5c7d919f..f7635d3b0c 100644
--- a/gcc/c-family/c-gimplify.cc
+++ b/gcc/c-family/c-gimplify.cc
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "c-ubsan.h"
 #include "tree-nested.h"
 #include "context.h"
+#include "tree-pass.h"
+#include "internal-fn.h"
 
 /*  The gimplification pass converts the language-dependent trees
 (ld-trees) emitted by the parser into language-independent trees
@@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree body)
   return bind;
 }
 
+/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
+
+static bool
+fma_supported_p (enum internal_fn fn, tree type)
+{
+  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
+}
+
 /* Gimplification of expression trees.  */
 
 /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
@@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
ATTRIBUTE_UNUSED,
break;
   }
 
+case PLUS_EXPR:
+case MINUS_EXPR:
+  {
+   tree type = TREE_TYPE (*expr_p);
+   /* For -ffp-contract=on we need to attempt FMA contraction only
+  during initial gimplification.  Late contraction across statement
+  boundaries would violate language semantics.  */
+   if (SCALAR_FLOAT_TYPE_P (type)
+   && flag_fp_contract_mode == FP_CONTRACT_ON
+   && cfun && !(cfun->curr_properties & PROP_gimple_any)
+   && fma_supported_p (IFN_FMA, type))
+ {
+   bool neg_mul = false, neg_add = code == MINUS_EXPR;
+
+   tree *op0_p = &TREE_OPERAND (*expr_p, 0);
+   tree *op1_p = &TREE_OPERAND (*expr_p, 1);
+
+   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
+   if (TREE_CODE (*op0_p) == NEGATE_EXPR
+   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
+ /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
+   else if (TREE_CODE (*op0_p) != MULT_EXPR)
+ {
+   std::swap (op0_p, op1_p);
+   std::swap (neg_mul, neg_add);
+ }
+   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
+ {
+   op0_p = &TREE_OPERAND (*op0_p, 0);
+   neg_mul = !neg_mul;
+ }
+   if (TREE_CODE (*op0_p) != MULT_EXPR)
+ break;
+   auto_vec<tree> ops (3);
+   ops.quick_push (TREE_OPERAND (*op0_p, 0));
+   ops.quick_push (TREE_OPERAND (*op0_p, 1));
+   ops.quick_push (*op1_p);
+
+   enum internal_fn ifn = IFN_FMA;
+   if (neg_mul)
+ {
+   if (fma_supported_p (IFN_FNMA, type))
+ ifn = IFN_FNMA;
+   else
+ ops[0] = build1 (NEGATE_EXPR, type, ops[0]);
+ }
+   if (neg_add)
+ {
+   enum internal_fn ifn2 = ifn == IFN_FMA ? IFN_FMS : IFN_FNMS;
+   if (fma_supported_p (ifn2, type))
+ ifn = ifn2;
+   else
+ ops[2] = build1 (NEGATE_EXPR, type, ops[2]);
+ }
+   for (auto &op : ops)
+ if (gimplify_expr (&op, pre_p, post_p, is_gimple_val, fb_rvalue)
+ == GS_ERROR)
+   return GS_ERROR;
+
+   gcall *call = gimple_build_call_internal_vec (ifn, ops);
+   gimple_seq_add_stmt_without_update (pre_p, call);
+   *expr_p = create_tmp_var (type);
+   gimple_call_set_lhs (call, *expr_p);
+   return GS_ALL_DONE;
+ }
+   break;
+  }
+
 default:;
 }
 
diff --git a/gcc/common.opt b/gcc/common.opt
index a28ca13385..3daec85aef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1662,9 +1662,8 @@ Name(fp_contract_mode) Type(enum fp_contract_mode) 
UnknownError(unknown floating
 EnumValue
 Enum(fp_contract_mode) String(off) Value(FP_CONTRACT_OFF)
 
-; Not implemented, fall back to conservative FP_CONTRACT_OFF.
 EnumValue
-Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_OFF)
+Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_ON)
 
 EnumValue
 

[committed] tree-ssa-math-opts: correct -ffp-contract= check

2023-05-17 Thread Alexander Monakov via Gcc-patches
Since tree-ssa-math-opts may freely contract across statement boundaries
we should enable it only for -ffp-contract=fast instead of disabling it
for -ffp-contract=off.

No functional change, since -ffp-contract=on is not exposed yet.

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma): Enable only for
FP_CONTRACT_FAST (no functional change).
---

Preapproved in PR 106092, pushed to trunk.

 gcc/tree-ssa-math-opts.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index b58a2ac9e6..d71c51dc0e 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3320,7 +3320,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
   imm_use_iterator imm_iter;
 
   if (FLOAT_TYPE_P (type)
-  && flag_fp_contract_mode == FP_CONTRACT_OFF)
+  && flag_fp_contract_mode != FP_CONTRACT_FAST)
 return false;
 
   /* We don't want to do bitfield reduction ops.  */
-- 
2.39.2



Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sun, 14 May 2023, Andrew Pinski wrote:

> It is NOT a dummy iterator. SIGNBIT is a operator list that expands to
> "BUILT_IN_SIGNBITF  BUILT_IN_SIGNBIT BUILT_IN_SIGNBITL IFN_SIGNBIT".

Ah, it's in cfn-operators.pd in the build tree, not the source tree.

> > On the other hand, the following clauses both use SIGNBIT directly, and
> > it would be nice to be consistent.
> 
> You cannot use the operator list directly if you have a for loop
> expansion too. So it is internally consistent already.

I see. Wasn't aware of the limitation.

Thanks.
Alexander


Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sun, 14 May 2023, Alexander Monakov wrote:

> On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote:
> 
> > +/* signbit(x) != 0 ? -x : x -> abs(x)
> > +   signbit(x) == 0 ? -x : x -> -abs(x) */
> > +(for sign (SIGNBIT)
> 
> Surprised to see a dummy iterator here. Was this meant to include
> float and long double versions of the builtin too (SIGNBITF and SIGNBITL)?

On the other hand, the following clauses both use SIGNBIT directly, and
it would be nice to be consistent.

> > + (for neeq (ne eq)
> > +  (simplify
> > +   (cond (neeq (sign @0) integer_zerop) (negate @0) @0)
> > +(if (neeq == NE_EXPR)
> > + (abs @0)
> > + (negate (abs @0))
> > +
> >  (simplify
> >   /* signbit(x) -> 0 if x is nonnegative.  */
> >   (SIGNBIT tree_expr_nonnegative_p@0)
> 
> Thanks.
> Alexander
> 


Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote:

> +/* signbit(x) != 0 ? -x : x -> abs(x)
> +   signbit(x) == 0 ? -x : x -> -abs(x) */
> +(for sign (SIGNBIT)

Surprised to see a dummy iterator here. Was this meant to include
float and long double versions of the builtin too (SIGNBITF and SIGNBITL)?

> + (for neeq (ne eq)
> +  (simplify
> +   (cond (neeq (sign @0) integer_zerop) (negate @0) @0)
> +(if (neeq == NE_EXPR)
> + (abs @0)
> + (negate (abs @0))
> +
>  (simplify
>   /* signbit(x) -> 0 if x is nonnegative.  */
>   (SIGNBIT tree_expr_nonnegative_p@0)

Thanks.
Alexander


[PATCH 1/3] genmatch: clean up emit_func

2023-05-08 Thread Alexander Monakov via Gcc-patches
Eliminate boolean parameters of emit_func. The first ('open') just
prints 'extern' to generated header, which is unnecessary. Introduce a
separate function to use when finishing a declaration in place of the
second ('close').

Rename emit_func to 'fp_decl' (matching 'fprintf' in length) to unbreak
indentation in several places.

Reshuffle emitted line breaks in a few places to make generated
declarations less ugly.

gcc/ChangeLog:

* genmatch.cc (header_file): Make static.
(emit_func): Rename to...
(fp_decl): ... this.  Adjust all uses.
(fp_decl_done): New function.  Use it...
(decision_tree::gen): ... here and...
(write_predicate): ... here.
(main): Adjust.
---
 gcc/genmatch.cc | 97 ++---
 1 file changed, 52 insertions(+), 45 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index c593814871..d5e56e2d68 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -183,31 +183,37 @@ fprintf_indent (FILE *f, unsigned int indent, const char 
*format, ...)
   va_end (ap);
 }
 
-/* Like fprintf, but print to two files, one header one C implementation.  */
-FILE *header_file = NULL;
+/* Secondary stream for fp_decl.  */
+static FILE *header_file;
 
+/* Start or continue emitting a declaration in fprintf-like manner,
+   printing both to F and global header_file, if non-null.  */
 static void
 #if GCC_VERSION >= 4001
-__attribute__((format (printf, 4, 5)))
+__attribute__((format (printf, 2, 3)))
 #endif
-emit_func (FILE *f, bool open, bool close, const char *format, ...)
+fp_decl (FILE *f, const char *format, ...)
 {
-  va_list ap1, ap2;
-  if (header_file != NULL)
-{
-  if (open)
-   fprintf (header_file, "extern ");
-  va_start (ap2, format);
-  vfprintf (header_file, format, ap2);
-  va_end (ap2);
-  if (close)
-   fprintf (header_file, ";\n");
-}
+  va_list ap;
+  va_start (ap, format);
+  vfprintf (f, format, ap);
+  va_end (ap);
 
-  va_start (ap1, format);
-  vfprintf (f, format, ap1);
-  va_end (ap1);
-  fputc ('\n', f);
+  if (!header_file)
+return;
+
+  va_start (ap, format);
+  vfprintf (header_file, format, ap);
+  va_end (ap);
+}
+
+/* Finish a declaration being emitted by fp_decl.  */
+static void
+fp_decl_done (FILE *f, const char *trailer)
+{
+  fprintf (f, "%s\n", trailer);
+  if (header_file)
+fprintf (header_file, "%s;", trailer);
 }
 
 static void
@@ -3924,35 +3930,35 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
   s->fname = xasprintf ("%s_simplify_%u", gimple ? "gimple" : "generic",
fcnt++);
   if (gimple)
-   emit_func (f, true, false, "\nbool\n"
+   fp_decl (f, "\nbool\n"
 "%s (gimple_match_op *res_op, gimple_seq *seq,\n"
 " tree (*valueize)(tree) ATTRIBUTE_UNUSED,\n"
 " const tree ARG_UNUSED (type), tree 
*ARG_UNUSED "
-"(captures)\n",
+"(captures)",
 s->fname);
   else
{
- emit_func (f, true, false, "\ntree\n"
+ fp_decl (f, "\ntree\n"
   "%s (location_t ARG_UNUSED (loc), const tree ARG_UNUSED 
(type),\n",
   (*iter).second->fname);
  for (unsigned i = 0;
   i < as_a (s->s->s->match)->ops.length (); ++i)
-   emit_func (f, false, false, " tree ARG_UNUSED (_p%d),", i);
- emit_func (f, false, false, " tree *captures\n");
+   fp_decl (f, " tree ARG_UNUSED (_p%d),", i);
+ fp_decl (f, " tree *captures");
}
   for (unsigned i = 0; i < s->s->s->for_subst_vec.length (); ++i)
{
  if (! s->s->s->for_subst_vec[i].first->used)
continue;
  if (is_a  (s->s->s->for_subst_vec[i].second))
-   emit_func (f, false, false, ", const enum tree_code ARG_UNUSED 
(%s)",
+   fp_decl (f, ",\n const enum tree_code ARG_UNUSED (%s)",
 s->s->s->for_subst_vec[i].first->id);
  else if (is_a  (s->s->s->for_subst_vec[i].second))
-   emit_func (f, false, false, ", const combined_fn ARG_UNUSED (%s)",
+   fp_decl (f, ",\n const combined_fn ARG_UNUSED (%s)",
 s->s->s->for_subst_vec[i].first->id);
}
 
-  emit_func (f, false, true, ")");
+  fp_decl_done (f, ")");
   fprintf (f, "{\n");
   fprintf_indent (f, 2, "const bool debug_dump = "
"dump_file && (dump_flags & TDF_FOLDING);\n");
@@ -3988,22 +3994,22 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
  FILE *f = get_out_file (files);
 
  if (gimple)
-   emit_func (f, true, false,"\nbool\n"
+   fp_decl (f, "\nbool\n"
 "gimple_simplify_%s (gimple_match_op *res_op,"
 " gimple_seq *seq,\n"
 " tree (*valueize)(tree) "
 "ATTRIBUTE_UNUSED,\n"
 "  

[PATCH 3/3] genmatch: fixup get_out_file

2023-05-08 Thread Alexander Monakov via Gcc-patches
get_out_file did not follow the coding conventions (mixing three-space
and two-space indentation, missing linebreak before function name).

Take that as an excuse to reimplement it in a more terse manner and
rename as 'choose_output', which is hopefully more descriptive.

gcc/ChangeLog:

* genmatch.cc (get_out_file): Make static and rename to ...
(choose_output): ... this. Reimplement. Update all uses ...
(decision_tree::gen): ... here and ...
(main): ... here.
---
 gcc/genmatch.cc | 41 +
 1 file changed, 17 insertions(+), 24 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index baf93855a6..177c13d87c 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -255,28 +255,21 @@ output_line_directive (FILE *f, location_t location,
 
 #define SIZED_BASED_CHUNKS 1
 
-int current_file = 0;
-FILE *get_out_file (vec <FILE *> &parts)
+static FILE *
+choose_output (const vec<FILE *> &parts)
 {
 #ifdef SIZED_BASED_CHUNKS
-   if (parts.length () == 1)
- return parts[0];
-
-   FILE *f = NULL;
-   long min = 0;
-   /* We've started writing all the files at pos 0, so ftell is equivalent
-  to the size and should be much faster.  */
-   for (unsigned i = 0; i < parts.length (); i++)
- {
-   long res = ftell (parts[i]);
-   if (!f || res < min)
- {
-   min = res;
-   f = parts[i];
- }
- }
-  return f;
+  FILE *shortest = NULL;
+  long min = 0;
+  for (FILE *part : parts)
+{
+  long len = ftell (part);
+  if (!shortest || min > len)
+   shortest = part, min = len;
+}
+  return shortest;
 #else
+  static int current_file;
   return parts[current_file++ % parts.length ()];
 #endif
 }
@@ -3924,7 +3917,7 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
}
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (files);
+  FILE *f = choose_output (files);
 
   /* Generate a split out function with the leaf transform code.  */
   s->fname = xasprintf ("%s_simplify_%u", gimple ? "gimple" : "generic",
@@ -3991,7 +3984,7 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
 
 
  /* Cycle the file buffers.  */
- FILE *f = get_out_file (files);
+ FILE *f = choose_output (files);
 
  if (gimple)
fp_decl (f, "\nbool\n"
@@ -4028,7 +4021,7 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
{
 
  /* Cycle the file buffers.  */
- FILE *f = get_out_file (files);
+ FILE *f = choose_output (files);
 
  if (gimple)
fp_decl (f, "\nbool\n"
@@ -4053,7 +4046,7 @@ decision_tree::gen (vec <FILE *> &files, bool gimple)
 
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (files);
+  FILE *f = choose_output (files);
 
   /* Then generate the main entry with the outermost switch and
  tail-calls to the split-out functions.  */
@@ -5461,7 +5454,7 @@ main (int argc, char **argv)
dt.print (stderr);
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (parts);
+  FILE *f = choose_output (parts);
 
   write_predicate (f, pred, dt, gimple);
 }
-- 
2.39.2



[PATCH 2/3] genmatch: clean up showUsage

2023-05-08 Thread Alexander Monakov via Gcc-patches
Display usage more consistently and get rid of camelCase.

gcc/ChangeLog:

* genmatch.cc (showUsage): Reimplement as ...
(usage): ...this.  Adjust all uses.
(main): Print usage when no arguments.  Add missing 'return 1'.
---
 gcc/genmatch.cc | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index d5e56e2d68..baf93855a6 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -5301,13 +5301,12 @@ round_alloc_size (size_t s)
 /* Construct and display the help menu.  */
 
 static void
-showUsage ()
+usage ()
 {
-  fprintf (stderr, "Usage: genmatch [--gimple] [--generic] "
-  "[--header=] [--include=] [-v[v]] input "
-  "[...]\n");
-  fprintf (stderr, "\nWhen more then one outputfile is specified --header "
-  "is required.\n");
+  const char *usage = "Usage:\n"
+" %s [--gimple|--generic] [-v[v]] \n"
+" %s [options] [--include=FILE] --header=FILE  ...\n";
+  fprintf (stderr, usage, progname, progname);
 }
 
 /* Write out the correct include to the match-head fle containing the helper
@@ -5332,9 +5331,6 @@ main (int argc, char **argv)
 
   progname = "genmatch";
 
-  if (argc < 2)
-return 1;
-
   bool gimple = true;
   char *s_header_file = NULL;
   char *s_include_file = NULL;
@@ -5359,14 +5355,17 @@ main (int argc, char **argv)
files.safe_push (argv[i]);
   else
{
- showUsage ();
+ usage ();
  return 1;
}
 }
 
   /* Validate if the combinations are valid.  */
   if ((files.length () > 1 && !s_header_file) || files.is_empty ())
-showUsage ();
+{
+  usage ();
+  return 1;
+}
 
   if (!s_include_file)
 s_include_file = s_header_file;
-- 
2.39.2



[PATCH 0/3] Trivial cleanups for genmatch

2023-05-08 Thread Alexander Monakov via Gcc-patches
I'm trying to study match.pd/genmatch with the eventual goal of
improving match-and-simplify code generation. Here's some trivial
cleanups for the recent refactoring in the meantime.

Alexander Monakov (3):
  genmatch: clean up emit_func
  genmatch: clean up showUsage
  genmatch: fixup get_out_file

 gcc/genmatch.cc | 159 
 1 file changed, 79 insertions(+), 80 deletions(-)

-- 
2.39.2



RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-08 Thread Alexander Monakov via Gcc-patches
On Fri, 5 May 2023, Alexander Monakov wrote:

> > > gimple-head-export.cc does not exist.
> > > 
> > > gimple-match-exports.cc is not a generated file. It's under source 
> > > control and
> > > edited independently from genmatch.cc. It is compiled separately, 
> > > producing
> > > gimple-match-exports.o.
> > > 
> > > gimple-match-head.cc is also not a generated file, also under source 
> > > control.
> > > It is transitively included into gimple-match-N.o files. If it changes, 
> > > they will be
> > > rebuilt. This is not changed by my patch.
> > > 
> > > gimple-match-auto.h is a generated file. It depends on s-gimple-match 
> > > stamp
> > > file, which in turn depends on genmatch and match.pd. If either changes, 
> > > the
> > > rule for the stamp file triggers. gimple-match-N.o files also depend on 
> > > the
> > > stamp file, so they will be rebuilt as well.
> > 
> > s-gimple-match does not depend on gimple-match-head.cc. if it changes the 
> > stamp
> > is not invalidated. 
> 
> Right, this is correct: there's no need to rerun the recipe for the stamp,
> because contents of gimple-match-head.cc do not affect it.
> 
> > This happens to work because gimple-match-N.cc does depend on 
> > gimple-match-head.cc,
> > but if the gimple-match-N.cc already exists then nothing changes.
> 
> No, if gimple-match-N.cc already exist, make notices they are out-of-date via
> 
> $(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true
> 
> and this triggers rebuilding gimple-match-N.o.
> 
> I tested this. After 'touch gimple-match-head.cc' all ten gimple-match-N.o 
> files
> are rebuilt.

My explanation was incomplete here. The gcc/Makefile.in rule quoted above
applies to .cc files and does not trigger rebuilds of .o files on its own.
The reason .o files get rebuilt is implicit dependency tracking: initial
build records header dependencies in gcc/.deps/*.Po files, and incremental
rebuild sees that gimple-match-1.o depends on gimple-match-head.cc.

Alexander


RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches


On Fri, 5 May 2023, Tamar Christina wrote:

> > -Original Message-
> > From: Alexander Monakov 
> > Sent: Friday, May 5, 2023 6:59 PM
> > To: Tamar Christina 
> > Cc: Richard Biener ; gcc-patches@gcc.gnu.org
> > Subject: RE: [PATCH] Makefile.in: clean up match.pd-related dependencies
> > 
> > 
> > On Fri, 5 May 2023, Tamar Christina wrote:
> > 
> > > > > Am 05.05.2023 um 19:03 schrieb Alexander Monakov via Gcc-patches
> > > > >  > > > patc...@gcc.gnu.org>:
> > > > >
> > > > > Clean up confusing changes from the recent refactoring for
> > > > > parallel match.pd build.
> > > > >
> > > > > gimple-match-head.o is not built. Remove related flags adjustment.
> > > > >
> > > > > Autogenerated gimple-match-N.o files do not depend on
> > > > > gimple-match-exports.cc.
> > > > >
> > > > > {gimple,generic)-match-auto.h only depend on the prerequisites of
> > > > > the corresponding s-{gimple,generic}-match stamp file, not any .cc 
> > > > > file.
> > > >
> > > > LGTM
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >* Makefile.in: (gimple-match-head.o-warn): Remove.
> > > > >(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
> > > > >gimple-match-exports.cc.
> > > > >(gimple-match-auto.h): Only depend on s-gimple-match.
> > > > >(generic-match-auto.h): Likewise.
> > > > > ---
> > > > >
> > > > > Tamar, do I understand correctly that you do not have more plans
> > > > > for match.pd and I won't collide with you if I attempt more
> > > > > cleanups in this
> > > > area? Thanks!
> > >
> > > No, but I'm also not sure why this change.
> > > The idea here was that if gimple-head-export.cc changes you must have
> > > changed genmatch.cc and so you need to regenerate the gimple-match-*
> > which could change the header.
> > 
> > gimple-head-export.cc does not exist.
> > 
> > gimple-match-exports.cc is not a generated file. It's under source control 
> > and
> > edited independently from genmatch.cc. It is compiled separately, producing
> > gimple-match-exports.o.
> > 
> > gimple-match-head.cc is also not a generated file, also under source 
> > control.
> > It is transitively included into gimple-match-N.o files. If it changes, 
> > they will be
> > rebuilt. This is not changed by my patch.
> > 
> > gimple-match-auto.h is a generated file. It depends on s-gimple-match stamp
> > file, which in turn depends on genmatch and match.pd. If either changes, the
> > rule for the stamp file triggers. gimple-match-N.o files also depend on the
> > stamp file, so they will be rebuilt as well.
> 
> s-gimple-match does not depend on gimple-match-head.cc. if it changes the 
> stamp
> is not invalidated. 

Right, this is correct: there's no need to rerun the recipe for the stamp,
because contents of gimple-match-head.cc do not affect it.

> This happens to work because gimple-match-N.cc does depend on 
> gimple-match-head.cc,
> but if the gimple-match-N.cc already exists then nothing changes.

No, if gimple-match-N.cc already exist, make notices they are out-of-date via

$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true

and this triggers rebuilding gimple-match-N.o.

I tested this. After 'touch gimple-match-head.cc' all ten gimple-match-N.o files
are rebuilt.

> So I don't think this changes anything. If anything I would say the stamp 
> file needs to
> depend on gimple-match-head.cc. 

Is my explanation above satisfactory?

Thanks.
Alexander

> 
> Thanks,
> Tamar
> 
> > 
> > Is there some problem I'm not seeing?
> > 
> > Thanks.
> > Alexander
> > 
> > > So not sure I agree with this.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > > >
> > > > > gcc/Makefile.in | 9 +++--
> > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in index
> > > > > 7e7ac078c5..0cc13c37d0 100644
> > > > > --- a/gcc/Makefile.in
> > > > > +++ b/gcc/Makefile.in
> > > > > @@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
> > > > > libgcov-util.o-warn = -Wno-error libgcov-driver-tool.o-warn =
> > > > > -Wno-error libgco

RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches


On Fri, 5 May 2023, Tamar Christina wrote:

> > > Am 05.05.2023 um 19:03 schrieb Alexander Monakov via Gcc-patches  > patc...@gcc.gnu.org>:
> > >
> > > Clean up confusing changes from the recent refactoring for parallel
> > > match.pd build.
> > >
> > > gimple-match-head.o is not built. Remove related flags adjustment.
> > >
> > > Autogenerated gimple-match-N.o files do not depend on
> > > gimple-match-exports.cc.
> > >
> > > {gimple,generic)-match-auto.h only depend on the prerequisites of the
> > > corresponding s-{gimple,generic}-match stamp file, not any .cc file.
> > 
> > LGTM
> > 
> > > gcc/ChangeLog:
> > >
> > >* Makefile.in: (gimple-match-head.o-warn): Remove.
> > >(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
> > >gimple-match-exports.cc.
> > >(gimple-match-auto.h): Only depend on s-gimple-match.
> > >(generic-match-auto.h): Likewise.
> > > ---
> > >
> > > Tamar, do I understand correctly that you do not have more plans for
> > > match.pd and I won't collide with you if I attempt more cleanups in this
> > area? Thanks!
> 
> No, but I'm also not sure why this change.
> The idea here was that if gimple-head-export.cc changes you must have changed
> genmatch.cc and so you need to regenerate the gimple-match-* which could 
> change the header.

gimple-head-export.cc does not exist.

gimple-match-exports.cc is not a generated file. It's under source control and
edited independently from genmatch.cc. It is compiled separately, producing
gimple-match-exports.o.

gimple-match-head.cc is also not a generated file, also under source control.
It is transitively included into gimple-match-N.o files. If it changes, they
will be rebuilt. This is not changed by my patch.

gimple-match-auto.h is a generated file. It depends on s-gimple-match stamp
file, which in turn depends on genmatch and match.pd. If either changes, the
rule for the stamp file triggers. gimple-match-N.o files also depend on the
stamp file, so they will be rebuilt as well.
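The dependency chain described here follows GCC's usual stamp-file pattern. A simplified sketch of that pattern (recipe details and paths are abbreviated, not the exact Makefile.in rules):

```make
# Simplified sketch of the stamp-file pattern: the generated header
# depends only on the stamp, and the stamp rule regenerates the output,
# installs it only if it changed, then updates the stamp.
gimple-match-auto.h: s-gimple-match; @true

s-gimple-match: build/genmatch$(build_exeext) match.pd cfn-operators.pd
	build/genmatch$(build_exeext) --gimple match.pd > tmp-gimple-match-auto.h
	$(SHELL) move-if-change tmp-gimple-match-auto.h gimple-match-auto.h
	$(STAMP) s-gimple-match
```

Because the header's recipe is `@true`, its timestamp is controlled entirely by move-if-change, avoiding needless rebuilds of the ten gimple-match-N.o files when the generated content is unchanged.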

Is there some problem I'm not seeing?

Thanks.
Alexander

> So not sure I agree with this.
> 
> Thanks,
> Tamar
> 
> > >
> > > gcc/Makefile.in | 9 +++--
> > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in index
> > > 7e7ac078c5..0cc13c37d0 100644
> > > --- a/gcc/Makefile.in
> > > +++ b/gcc/Makefile.in
> > > @@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
> > > libgcov-util.o-warn = -Wno-error libgcov-driver-tool.o-warn =
> > > -Wno-error libgcov-merge-tool.o-warn = -Wno-error
> > > -gimple-match-head.o-warn = -Wno-unused gimple-match-exports.o-warn
> > =
> > > -Wno-unused dfp.o-warn = -Wno-strict-aliasing
> > >
> > > @@ -2674,12 +2673,10 @@ s-tm-texi: build/genhooks$(build_exeext)
> > $(srcdir)/doc/tm.texi.in
> > >  false; \
> > >fi
> > >
> > > -$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc \
> > > -gimple-match-exports.cc; @true
> > > -gimple-match-auto.h: s-gimple-match gimple-match-head.cc \
> > > -gimple-match-exports.cc; @true
> > > +$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc;
> > > +@true
> > > +gimple-match-auto.h: s-gimple-match; @true
> > > $(GENERIC_MATCH_PD_SEQ_SRC): s-generic-match generic-match-head.cc;
> > > @true
> > > -generic-match-auto.h: s-generic-match generic-match-head.cc; @true
> > > +generic-match-auto.h: s-generic-match; @true
> > >
> > > s-gimple-match: build/genmatch$(build_exeext) \
> > >$(srcdir)/match.pd cfn-operators.pd
> > > --
> > > 2.39.2
> > >
> 


[PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches
Clean up confusing changes from the recent refactoring for
parallel match.pd build.

gimple-match-head.o is not built. Remove related flags adjustment.

Autogenerated gimple-match-N.o files do not depend on
gimple-match-exports.cc.

{gimple,generic}-match-auto.h only depend on the prerequisites of the
corresponding s-{gimple,generic}-match stamp file, not any .cc file.

gcc/ChangeLog:

* Makefile.in: (gimple-match-head.o-warn): Remove.
(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
gimple-match-exports.cc.
(gimple-match-auto.h): Only depend on s-gimple-match.
(generic-match-auto.h): Likewise.
---

Tamar, do I understand correctly that you do not have more plans for match.pd
and I won't collide with you if I attempt more cleanups in this area? Thanks!

 gcc/Makefile.in | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 7e7ac078c5..0cc13c37d0 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
 libgcov-util.o-warn = -Wno-error
 libgcov-driver-tool.o-warn = -Wno-error
 libgcov-merge-tool.o-warn = -Wno-error
-gimple-match-head.o-warn = -Wno-unused
 gimple-match-exports.o-warn = -Wno-unused
 dfp.o-warn = -Wno-strict-aliasing
 
@@ -2674,12 +2673,10 @@ s-tm-texi: build/genhooks$(build_exeext) 
$(srcdir)/doc/tm.texi.in
  false; \
fi
 
-$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc \
-   gimple-match-exports.cc; @true
-gimple-match-auto.h: s-gimple-match gimple-match-head.cc \
-   gimple-match-exports.cc; @true
+$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true
+gimple-match-auto.h: s-gimple-match; @true
 $(GENERIC_MATCH_PD_SEQ_SRC): s-generic-match generic-match-head.cc; @true
-generic-match-auto.h: s-generic-match generic-match-head.cc; @true
+generic-match-auto.h: s-generic-match; @true
 
 s-gimple-match: build/genmatch$(build_exeext) \
$(srcdir)/match.pd cfn-operators.pd
-- 
2.39.2



[PATCH] do not tailcall __sanitizer_cov_trace_pc [PR90746]

2023-05-02 Thread Alexander Monakov via Gcc-patches
When instrumentation is requested via -fsanitize-coverage=trace-pc, GCC
emits calls to __sanitizer_cov_trace_pc callback into each basic block.
This callback is supposed to be implemented by the user, and should be
able to identify the containing basic block by inspecting its return
address. Tailcalling the callback prevents that, so disallow it.
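For illustration, here is a minimal sketch (not part of the patch) of what such a user-supplied callback can look like; everything except the `__sanitizer_cov_trace_pc` name itself is hypothetical. If GCC tail-called the callback, `__builtin_return_address (0)` would no longer point into the instrumented basic block:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical user-side callback for -fsanitize-coverage=trace-pc.
// The compiler emits a call to it in every basic block; the return
// address identifies the block.  A tail call would replace that return
// address with the block's own caller, losing the block identity.
static const void *last_pc;
static unsigned long n_hits;

extern "C" void __sanitizer_cov_trace_pc (void)
{
  last_pc = __builtin_return_address (0);  // PC inside the calling block
  ++n_hits;
}

// Stand-in for one instrumented basic block (normally the compiler
// inserts this call; the user never writes it).
static unsigned long run_block (void)
{
  __sanitizer_cov_trace_pc ();
  return n_hits;
}
```

A real callback would typically log `last_pc` to a coverage buffer rather than a single variable.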

gcc/ChangeLog:

PR sanitizer/90746
* calls.cc (can_implement_as_sibling_call_p): Reject calls
to __sanitizer_cov_trace_pc.

gcc/testsuite/ChangeLog:

PR sanitizer/90746
* gcc.dg/sancov/basic0.c: Verify absence of tailcall.
---
 gcc/calls.cc | 10 ++
 gcc/testsuite/gcc.dg/sancov/basic0.c |  4 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/calls.cc b/gcc/calls.cc
index 4d7f6c3d2..c6ed2f189 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -2541,6 +2541,16 @@ can_implement_as_sibling_call_p (tree exp,
   return false;
 }
 
+  /* __sanitizer_cov_trace_pc is supposed to inspect its return address
+ to identify the caller, and therefore should not be tailcalled.  */
+  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL
+  && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_SANITIZER_COV_TRACE_PC)
+{
+  /* No need for maybe_complain_about_tail_call here: the call
+ is synthesized by the compiler.  */
+  return false;
+}
+
   /* If the called function is nested in the current one, it might access
  some of the caller's arguments, but could clobber them beforehand if
  the argument areas are shared.  */
diff --git a/gcc/testsuite/gcc.dg/sancov/basic0.c 
b/gcc/testsuite/gcc.dg/sancov/basic0.c
index af69b2d12..dfdaea848 100644
--- a/gcc/testsuite/gcc.dg/sancov/basic0.c
+++ b/gcc/testsuite/gcc.dg/sancov/basic0.c
@@ -1,9 +1,11 @@
 /* Basic test on number of inserted callbacks.  */
 /* { dg-do compile } */
-/* { dg-options "-fsanitize-coverage=trace-pc -fdump-tree-optimized" } */
+/* { dg-options "-fsanitize-coverage=trace-pc -fdump-tree-optimized 
-fdump-rtl-expand" } */
 
 void foo(void)
 {
 }
 
 /* { dg-final { scan-tree-dump-times "__builtin___sanitizer_cov_trace_pc 
\\(\\)" 1 "optimized" } } */
+/* The built-in should not be tail-called: */
+/* { dg-final { scan-rtl-dump-not "call_insn/j" "expand" } } */
-- 
2.39.2



[PATCH] haifa-sched: fix autopref_rank_for_schedule comparator [PR109187]

2023-03-28 Thread Alexander Monakov via Gcc-patches
Do not attempt to use a plain subtraction for generating a three-way
comparison result in autopref_rank_for_schedule qsort comparator, as
offsets are not restricted and subtraction may overflow.  Open-code
a safe three-way comparison instead.
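The difference can be seen in a standalone sketch (illustrative, not the patch itself): with offsets of type `int`, `a - b` can overflow, while the open-coded form always yields a correct -1/0/1.

```cpp
#include <cassert>
#include <climits>

// Safe three-way comparison: (a > b) - (a < b) yields -1, 0 or 1 and
// cannot overflow.  The "return a - b;" idiom is undefined behavior
// for e.g. a = INT_MIN, b = 1, and even with wrapping arithmetic the
// sign of the wrapped difference can be wrong.
static int cmp3 (int a, int b)
{
  return (a > b) - (a < b);
}
```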

gcc/ChangeLog:

PR rtl-optimization/109187
* haifa-sched.cc (autopref_rank_for_schedule): Avoid use of overflowing
subtraction in three-way comparison.

gcc/testsuite/ChangeLog:

PR rtl-optimization/109187
* gcc.dg/pr109187.c: New test.
---

I think I can commit this as obvious if no comment in a day, but explicit ack
is always appreciated.

Alexander

 gcc/haifa-sched.cc  | 2 +-
 gcc/testsuite/gcc.dg/pr109187.c | 8 
 2 files changed, 9 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr109187.c

diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 4efaa9445..e11cc5c35 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -5686,7 +5686,7 @@ autopref_rank_for_schedule (const rtx_insn *insn1, const 
rtx_insn *insn2)
 
   if (!irrel1 && !irrel2)
/* Sort memory references from lowest offset to the largest.  */
-   r = data1->offset - data2->offset;
+   r = (data1->offset > data2->offset) - (data1->offset < data2->offset);
   else if (write)
/* Schedule "irrelevant" insns before memory stores to resolve
   as many producer dependencies of stores as possible.  */
diff --git a/gcc/testsuite/gcc.dg/pr109187.c b/gcc/testsuite/gcc.dg/pr109187.c
new file mode 100644
index 0..1ef14a73d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr109187.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param sched-autopref-queue-depth=1" } */
+
+void f(int *a)
+{
+  for (;;)
+asm("" :: "r"(a[-0x1000]), "r"(a[0x1000]), "r"(a[0]) : "memory");
+}
-- 
2.39.1



Re: Should -ffp-contract=off the default on GCC?

2023-03-22 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Mar 2023, Jakub Jelinek via Gcc-patches wrote:

> On Mon, Mar 20, 2023 at 10:05:57PM +, Qing Zhao via Gcc-patches wrote:
> > My question: is the above section the place in C standard “explicitly 
> > allows contractions”? If not, where it is in C standard?
> 
> http://port70.net/%7Ensz/c/c99/n1256.html#6.5p8
> http://port70.net/%7Ensz/c/c99/n1256.html#note78
> http://port70.net/%7Ensz/c/c99/n1256.html#F.6

C only allows contractions within expressions, not across statements (i.e.
either -ffp-contract=on or -ffp-contract=off would be compliant, but not
our default -ffp-contract=fast).

Unrestricted contraction across statements together with other optimizations
gives rise to difficult-to-debug issues such as PR 106902.
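The rounding step that contraction removes can be shown in a few lines (an illustrative sketch; `volatile` stands in for -ffp-contract=off so the example is independent of compiler flags):

```cpp
#include <cassert>

// With a = 1 + 2^-27 the exact product a*a = 1 + 2^-26 + 2^-54 needs 55
// significand bits, one more than double provides, so rounding the
// product separately drops the 2^-54 term.  A contracted (fused)
// a*a - b would instead yield 0x1p-54: an observable difference of the
// kind behind PR 106902.
static double residual_after_rounding (void)
{
  volatile double a = 1.0 + 0x1p-27;
  double b = 1.0 + 0x1p-26;   // a*a correctly rounded to double
  volatile double p = a * a;  // intermediate rounding happens here
  return p - b;               // 0.0 once the product has been rounded
}
```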

Alexander


Re: Should -ffp-contract=off the default on GCC?

2023-03-22 Thread Alexander Monakov via Gcc-patches


On Wed, 22 Mar 2023, Richard Biener wrote:

> I think it's even less realistic to expect users to know the details of
> floating-point math.  So I doubt any such sentence will be helpful
> besides spreading some FUD?

I think it's closer to "fundamental notions" rather than "details". For
users who bother to read the GCC manual there's a decent chance it wouldn't
be for naught.

For documentation, I was thinking

  Together with -fexcess-precision=standard, -ffp-contract=off
  is necessary to ensure that rounding of intermediate results to precision
  implied by the source code and the FLT_EVAL_METHOD macro is not
  omitted by the compiler.

Alexander


Re: Should -ffp-contract=off the default on GCC?

2023-03-21 Thread Alexander Monakov via Gcc-patches


On Tue, 21 Mar 2023, Jeff Law via Gcc-patches wrote:

> On 3/21/23 11:00, Qing Zhao via Gcc-patches wrote:
> > 
> >> On Mar 21, 2023, at 12:56 PM, Paul Koning  wrote:
> >>
> >>> On Mar 21, 2023, at 11:01 AM, Qing Zhao via Gcc-patches
> >>>  wrote:
> >>>
> >>> ...
> >>> Most of the compiler users are not familiar with language standards, or no
> >>> access to language standards. Without clearly documenting such warnings
> >>> along with the option explicitly, the users have no way to know such
> >>> potential impact.
> >>
> >> With modern highly optimized languages, not knowing the standard is going
> >> to get you in trouble.  There was a wonderful paper from MIT a few years
> >> ago describing all the many ways C can bite you if you don't know the
> >> rules.
> > 
> > Yes, it’s better to know the details of languages standard. -:)
> > However, I don’t think that this is a realistic expectation to the compiler
> > users:  to know all the details of a language standard.
> Umm, they really do need to know that stuff.
> 
> If the developer fails to understand the language standard, then they're
> likely going to write code that is ultimately undefined or doesn't behave the
> way they expect.  How is the compiler supposed to guess what the developer
> originally intended?  How should the compiler handle the case when two
> developers have different understandings of how a particular piece of code
> should work?  In the end it's the language standard that defines how all this
> stuff should work.
> 
> Failure to understand the language is a common problem and we do try to emit
> various diagnostics to help developers avoid writing non-conformant code.  But
> ultimately if a developer fails to understand the language standard, then
> they're going to be surprised by the behavior of their code.

W h a t.

This subthread concerns documenting the option better ("Without clearly
documenting such warnings ...").

Are you arguing against adding a brief notice to the documentation blurb for
the -ffp-contract= option?

Perplexed,
Alexander


Re: [RFC/PATCH] sched: Consider debug insn in no_real_insns_p [PR108273]

2023-03-20 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Mar 2023, Kewen.Lin wrote:

> Hi,

Hi. Thank you for the thorough analysis. Since I analyzed
PR108519, I'd like to offer my comments.

> As PR108273 shows, when there is one block which only has
> NOTE_P and LABEL_P insns at non-debug mode while has some
> extra DEBUG_INSN_P insns at debug mode, after scheduling
> it, the DFA states would be different between debug mode
> and non-debug mode.  Since at non-debug mode, the block
> meets no_real_insns_p, it gets skipped; while at debug
> mode, it gets scheduled, even it only has NOTE_P, LABEL_P
> and DEBUG_INSN_P, the call of function advance_one_cycle
> will change the DFA state.  PR108519 also shows this issue
> issue can be exposed by some scheduler changes.

(yes, so an alternative is to avoid extraneous advance_one_cycle
calls, but I think adjusting no_real_insns_p is preferable)

> This patch is to take debug insn into account in function
> no_real_insns_p, which make us not try to schedule for the
> block having only NOTE_P, LABEL_P and DEBUG_INSN_P insns,
> resulting in consistent DFA states between non-debug and
> debug mode.  Changing no_real_insns_p caused ICE when doing
> free_block_dependencies, the root cause is that we create
> dependencies for debug insns, those dependencies are
> expected to be resolved during scheduling insns which gets
> skipped after the change in no_real_insns_p.  By checking
> the code, it looks it's reasonable to skip to compute block
> dependencies for no_real_insns_p blocks.  It can be
> bootstrapped and regtested but it hit one ICE when built
> SPEC2017 bmks at option -O2 -g.  The root cause is that
> initially there are no no_real_insns_p blocks in a region,
> but in the later scheduling one block has one insn scheduled
> speculatively then becomes no_real_insns_p, so we compute
> dependencies and rgn_n_insns for this special block before
> scheduling, later it gets skipped so not scheduled, the
> following counts would mismatch:
> 
> /* Sanity check: verify that all region insns were scheduled.  */
>   gcc_assert (sched_rgn_n_insns == rgn_n_insns);
> 
> , and we miss to release the allocated dependencies.

Hm, but it is quite normal for BBs to become empty via speculative
scheduling in non-debug mode as well. So I don't think it's the
right way to frame the problem.

I think the main issue here is that debug_insns are "not real insns"
except we add them together with normal insns in the dependency graph,
and then we verify that the graph was exhausted by the scheduler.

We already handle a situation when dbg_cnt is telling the scheduler
to skip blocks. I guess the dbg_cnt handling is broken for a similar
reason?

Can we fix this issue together with the debug_cnt issue by adjusting
dbg_cnt handling in schedule_region, i.e. if no_real_insns_p || !dbg_cnt
then adjust sched_rgn_n_insns and manually resolve+free dependencies?

> To avoid the unexpected mis-matchings, this patch adds one
> bitmap to track this kind of special block which isn't
> no_real_insns_p but becomes no_real_insns_p later, then we
> can adjust the count and free deps for it.

Per above, I hope a simpler solution is possible.

(some comments on the patch below)

> This patch can be bootstrapped and regress-tested on
> x86_64-redhat-linux, aarch64-linux-gnu and
> powerpc64{,le}-linux-gnu.
> 
> I also verified this patch can pass SPEC2017 both intrate
> and fprate bmks building at -g -O2/-O3.
> 
> This is for next stage 1, but since I know little on the
> scheduler, I'd like to post it early for more comments.
> 
> Is it on the right track?  Any thoughts?
> 
> BR,
> Kewen
> -
>   PR rtl-optimization/108273
> 
> gcc/ChangeLog:
> 
>   * haifa-sched.cc (no_real_insns_p): Consider DEBUG_INSN_P insn.
>   * sched-rgn.cc (no_real_insns): New static bitmap variable.
>   (compute_block_dependences): Skip for no_real_insns_p.
>   (free_deps_for_bb_no_real_insns_p): New function.
>   (free_block_dependencies): Call free_deps_for_bb_no_real_insns_p for
>   no_real_insns_p bb.
>   (schedule_region): Fix up sched_rgn_n_insns for some block for which
>   rgn_n_insns is computed before, and move sched_rgn_local_finish after
>   free_block_dependencies loop.
>   (sched_rgn_local_init): Allocate and compute no_real_insns.
>   (sched_rgn_local_free): Free no_real_insns.
> ---
>  gcc/haifa-sched.cc |  8 -
>  gcc/sched-rgn.cc   | 84 +++---
>  2 files changed, 87 insertions(+), 5 deletions(-)
> 
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 48b53776fa9..378f3b34cc0 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -5040,7 +5040,13 @@ no_real_insns_p (const rtx_insn *head, const rtx_insn 
> *tail)
>  {
>while (head != NEXT_INSN (tail))
>  {
> -  if (!NOTE_P (head) && !LABEL_P (head))
> +  /* Take debug insn into account here, otherwise we can have different
> +  DFA states after scheduling a block which 

Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz

2023-03-07 Thread Alexander Monakov via Gcc-patches


On Tue, 7 Mar 2023, Jonathan Wakely wrote:

> > Shouldn't this use the idiom suggested in ansidecl.h, i.e.
> >
> >   private:
> > DISABLE_COPY_AND_ASSIGN (auto_mpfr);
> 
> 
> Why? A macro like that (or a base class like boost::noncopyable) has
> some value in a code base that wants to work for both C++03 and C++11
> (or later). But in GCC we know we have C++11 now, so we can just
> delete members. I don't see what the macro adds.

Evidently it's possible to forget to delete one of the members, as
showcased in this very thread.

The idiom is also slightly easier to read.
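For comparison, a sketch of the idiom (the macro body here is modeled on ansidecl.h, not quoted from it, and the wrapper class is hypothetical):

```cpp
#include <cassert>

// One macro deletes both special members, so neither can be forgotten.
// Modeled on the DISABLE_COPY_AND_ASSIGN idiom from ansidecl.h.
#define DISABLE_COPY_AND_ASSIGN(TYPE)     \
  TYPE (const TYPE &) = delete;           \
  TYPE &operator= (const TYPE &) = delete

class auto_res   // hypothetical RAII wrapper, analogous to auto_mpfr
{
public:
  auto_res () : m_open (true) {}
  ~auto_res () { m_open = false; }
  bool open () const { return m_open; }

private:
  DISABLE_COPY_AND_ASSIGN (auto_res);
  bool m_open;
};

// Construct a wrapper and report its state; copying it would not compile.
static bool make_and_check (void)
{
  auto_res r;
  return r.open ();
}
```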

Alexander


Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz

2023-03-07 Thread Alexander Monakov via Gcc-patches
Hi,

On Mon, 6 Mar 2023, Richard Biener via Gcc-patches wrote:

> --- a/gcc/realmpfr.h
> +++ b/gcc/realmpfr.h
> @@ -24,6 +24,26 @@
#include <mpfr.h>
#include <mpc.h>
>  
> +class auto_mpfr
> +{
> +public:
> +  auto_mpfr () { mpfr_init (m_mpfr); }
> +  explicit auto_mpfr (mpfr_prec_t prec) { mpfr_init2 (m_mpfr, prec); }
> +  ~auto_mpfr () { mpfr_clear (m_mpfr); }
> +
> +  operator mpfr_t& () { return m_mpfr; }
> +
> +  auto_mpfr (const auto_mpfr &) = delete;
> +  auto_mpfr &operator= (const auto_mpfr &) = delete;

Shouldn't this use the idiom suggested in ansidecl.h, i.e.

  private:
DISABLE_COPY_AND_ASSIGN (auto_mpfr);

Alexander


Re: RISC-V: Add divmod instruction support

2023-02-20 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Feb 2023, Richard Biener via Gcc-patches wrote:

> On Sun, Feb 19, 2023 at 2:15 AM Maciej W. Rozycki  wrote:
> >
> > > The problem is you don't see it as a divmod in expand_divmod unless you 
> > > expose
> > > a divmod optab.  See tree-ssa-mathopts.cc's divmod handling.
> >
> >  That's the kind of stuff I'd expect to happen at the tree level though,
> > before expand.
> 
> The GIMPLE pass forming divmod could indeed choose to emit the
> div + mul/sub sequence instead if an actual divmod pattern isn't available.
> It could even generate some fake mul/sub/mod RTXen to cost the two
> variants against each other but I seriously doubt any uarch that implements
> division/modulo has a slower mul/sub.

Making a correct decision requires knowing to which degree the divider is
pipelined, and costs won't properly reflect that. If the divider accepts
a new div/mod instruction every couple of cycles, it's faster to just issue
a div followed by a mod with the same operands.

Therefore I think in this case it's fair for GIMPLE level to just check if
the divmod pattern is available, and let the target do the fine tuning via
the divmod expander.

It would make sense for tree-ssa-mathopts to emit div + mul/sub when neither
'divmod' nor 'mod' patterns are available, because RTL expansion will do the
same, just later, and we'll rely on RTL CSE to clean up the redundant div.
But RISC-V has both 'div' and 'mod', so as I tried to explain in the first
paragraph we should let the target decide.

Alexander


Re: Fix wrong code issues with ipa-sra

2023-01-21 Thread Alexander Monakov
Hello,

Coverity flagged a real issue in this patch:

On Mon, 16 Jan 2023, Jan Hubicka via Gcc-patches wrote:
> --- a/gcc/ipa-utils.cc
> +++ b/gcc/ipa-utils.cc
[...]
> +bitmap
> +find_always_executed_bbs (function *fun, bool assume_return_or_eh)
> +{
> +  auto_vec stack;
> +  auto_vec terminating_bbs;
> +  hash_set visited;
> +  edge e;
> +  edge_iterator ei;
> +
> +  /* First walk all BBs reachable from entry stopping on statements that may
> + terminate execution.  Everything past this statement is not going to be 
> executed
> + each invocation.  */
> +  stack.safe_push (ENTRY_BLOCK_PTR_FOR_FN (fun));
> +  while (!stack.is_empty ())
> +{
> +  basic_block bb = stack.pop ();
> +  bool found = false, found_exit = false;
> +  if (!assume_return_or_eh
> +   && (EDGE_COUNT (bb->succs) == 0 || (bb->flags & BB_IRREDUCIBLE_LOOP)))
> + found = true;
> +  FOR_EACH_EDGE (e, ei, bb->succs)
> + {
> +   if (e->dest == EXIT_BLOCK_PTR_FOR_FN (fun))
> + {
> +   found_exit = true;
> +   break;
> + }
> +   /* Watch for infinite loops.  */
> +   if (!found && (assume_return_or_eh & EDGE_DFS_BACK)
 ^^^
This bitwise 'and' always evaluates to zero, making the entire clause 
always-false.

Alexander


[PATCH] sched-deps: do not schedule pseudos across calls [PR108117]

2023-01-13 Thread Alexander Monakov


On Fri, 23 Dec 2022, Jose E. Marchesi wrote:

> > +1 for trying this FWIW.  There's still plenty of time to try an
> > alternative solution if there are unexpected performance problems.
> 
> Let me see if Alexander's patch fixes the issue at hand (it must) and
> will also do some regression testing.

Hi, I'm not sure whose court the ball is in, but in the interest of moving
things forward, here's the complete patch with the testcase. OK to apply?

---8<---

From: Alexander Monakov 
Date: Fri, 13 Jan 2023 21:04:02 +0300
Subject: [PATCH] sched-deps: do not schedule pseudos across calls [PR108117]

Scheduling across calls in the pre-RA scheduler is problematic: we do
not take liveness info into account, and are thus prone to extending
lifetime of a pseudo over the loop, requiring a callee-saved hardreg
or causing a spill.

If current function called a setjmp, lifting an assignment over a call
may be incorrect if a longjmp would happen before the assignment.

Thanks to Jose Marchesi for testing on AArch64.

gcc/ChangeLog:

PR rtl-optimization/108117
PR rtl-optimization/108132
* sched-deps.cc (deps_analyze_insn): Do not schedule across
calls before reload.

gcc/testsuite/ChangeLog:

PR rtl-optimization/108117
PR rtl-optimization/108132
* gcc.dg/pr108117.c: New test.
---
 gcc/sched-deps.cc   |  9 -
 gcc/testsuite/gcc.dg/pr108117.c | 30 ++
 2 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr108117.c

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index 948aa0c3b..5dc4fa4cd 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -3688,7 +3688,14 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)
 
   CANT_MOVE (insn) = 1;
 
-  if (find_reg_note (insn, REG_SETJMP, NULL))
+  if (!reload_completed)
+   {
+ /* Scheduling across calls may increase register pressure by extending
+live ranges of pseudos over the call.  Worse, in presence of setjmp
+it may incorrectly move up an assignment over a longjmp.  */
+ reg_pending_barrier = MOVE_BARRIER;
+   }
+  else if (find_reg_note (insn, REG_SETJMP, NULL))
 {
   /* This is setjmp.  Assume that all registers, not just
  hard registers, may be clobbered by this call.  */
diff --git a/gcc/testsuite/gcc.dg/pr108117.c b/gcc/testsuite/gcc.dg/pr108117.c
new file mode 100644
index 0..ae151693e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr108117.c
@@ -0,0 +1,30 @@
+/* { dg-do run } */
+/* { dg-require-effective-target nonlocal_goto } */
+/* { dg-options "-O2 -fschedule-insns" } */
+
+#include <setjmp.h>
+#include <stdio.h>
+
+jmp_buf ex_buf;
+
+__attribute__((noipa))
+void fn_throw(int x)
+{
+   if (x)
+  longjmp(ex_buf, 1);
+}
+
+int main(void)
+{
+int vb = 0; // NB: not volatile, not modified after setjmp
+
+if (!setjmp(ex_buf)) {
+fn_throw(1);
+vb = 1; // not reached in the abstract machine
+}
+
+if (vb) {
+printf("Failed, vb = %d!\n", vb);
+return 1;
+}
+}
-- 
2.37.2



Re: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2023-01-03 Thread Alexander Monakov via Gcc-patches


On Tue, 3 Jan 2023, Jan Hubicka wrote:

> > * gcc/common/config/i386/i386-common.cc (processor_alias_table):
> > Use CPU_ZNVER4 for znver4.
> > * config/i386/i386.md: Add znver4.md.
> > * config/i386/znver4.md: New.
> OK,
> thanks!

Honza, I'm curious what are your further plans for this, you mentioned
merging znver4.md back in znver.md if I recall correctly?

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-24 Thread Alexander Monakov via Gcc-patches


On Sat, 24 Dec 2022, Jose E. Marchesi wrote:

> However, there is something I don't understand: wouldn't sched2
> introduce the same problem when -fsched2-use-superblocks is specified?

Superblocks are irrelevant, a call instruction does not end a basic block
and the problematic motion happens within a BB on your testcase. Didn't you
ask about this already?

> In that case, the option a) would need to be expanded to disable sched2
> as well, and b) wouldn't have effect (!after_reload)?

See my response to Qing Zhao, I think due to special-casing of pseudos
that are live at setjmp during register allocation, sched2 will not move
them in such manner (they should be assigned to memory and I don't expect
sched2 will move such MEMs across calls). But of course there may be holes
in this theory.

On some targets disabling sched2 is not so easy because it's responsible
for VLIW packing (bundling on ia64).

Alexander

