Compare loop bounds in ipa-icf

2024-05-29 Thread Jan Hubicka
Hi,
this testcase shows another problem with missing comparators for metadata
in ICF. With value ranges available to loop optimizations during early
opts, we can estimate the number of iterations based on a guarding condition
that can later be split away by the fnsplit pass. This patch disables ICF
when the number-of-iterations estimates do not match.

Bootstrapped/regtested x86_64-linux, will commit it shortly

gcc/ChangeLog:

PR ipa/115277
* ipa-icf-gimple.cc (func_checker::compare_loops): Compare upper
bounds on numbers of iterations.

gcc/testsuite/ChangeLog:

* gcc.c-torture/compile/pr115277.c: New test.

diff --git a/gcc/ipa-icf-gimple.cc b/gcc/ipa-icf-gimple.cc
index c25eb24710f..4c3174b68b6 100644
--- a/gcc/ipa-icf-gimple.cc
+++ b/gcc/ipa-icf-gimple.cc
@@ -543,6 +543,10 @@ func_checker::compare_loops (basic_block bb1, basic_block bb2)
 return return_false_with_msg ("unroll");
   if (!compare_variable_decl (l1->simduid, l2->simduid))
 return return_false_with_msg ("simduid");
+  if ((l1->any_upper_bound != l2->any_upper_bound)
+  || (l1->any_upper_bound
+ && (l1->nb_iterations_upper_bound != l2->nb_iterations_upper_bound)))
+return return_false_with_msg ("nb_iterations_upper_bound");
 
   return true;
 }
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115277.c b/gcc/testsuite/gcc.c-torture/compile/pr115277.c
new file mode 100644
index 000..27449eb254f
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr115277.c
@@ -0,0 +1,28 @@
+int array[1000];
+void
+test (int a)
+{
+  if (__builtin_expect (a > 3, 1))
+    return;
+  for (int i = 0; i < a; i++)
+    array[i] = i;
+}
+void
+test2 (int a)
+{
+  if (__builtin_expect (a > 10, 1))
+    return;
+  for (int i = 0; i < a; i++)
+    array[i] = i;
+}
+int
+main ()
+{
+  test (1);
+  test (2);
+  test (3);
+  test2 (10);
+  if (array[9] != 9)
+    __builtin_abort ();
+  return 0;
+}


Fix points_to_local_or_readonly_memory_p wrt TARGET_MEM_REF

2024-05-16 Thread Jan Hubicka
Hi,
TARGET_MEM_REF can be used to offset a constant base into a memory object (to
produce a lea instruction).  This confuses points_to_local_or_readonly_memory_p,
which treats the constant address as the base of the access.
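For illustration, roughly the kind of address involved (a hand-written
sketch, not taken from the PR):

  /* An ADDR_EXPR of a TARGET_MEM_REF whose base (operand 0) is an
     INTEGER_CST used purely as an offset origin, e.g. to form one lea:

       _1 = &TARGET_MEM_REF [(int *) 0B, index: i_2, step: 4];

     Treating the constant 0B as the accessed object would wrongly
     classify the access as pointing to local/readonly memory.  */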

Bootstrapped/regtested x86_64-linux, committed.
Honza

gcc/ChangeLog:

PR ipa/113787
* ipa-fnsummary.cc (points_to_local_or_readonly_memory_p): Do not
look into TARGET_MEM_REFs with constant operand 0.

gcc/testsuite/ChangeLog:

* gcc.c-torture/execute/pr113787.c: New test.

diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
index 07a853f78e3..2faf2389297 100644
--- a/gcc/ipa-fnsummary.cc
+++ b/gcc/ipa-fnsummary.cc
@@ -2648,7 +2648,9 @@ points_to_local_or_readonly_memory_p (tree t)
return true;
   return !ptr_deref_may_alias_global_p (t, false);
 }
-  if (TREE_CODE (t) == ADDR_EXPR)
+  if (TREE_CODE (t) == ADDR_EXPR
+  && (TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF
+ || TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) != INTEGER_CST))
 return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
   return false;
 }
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr113787.c b/gcc/testsuite/gcc.c-torture/execute/pr113787.c
new file mode 100644
index 000..702b6c35fc6
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr113787.c
@@ -0,0 +1,38 @@
+void foo(int x, int y, int z, int d, int *buf)
+{
+  for(int i = z; i < y-z; ++i)
+for(int j = 0; j < d; ++j)
+  /* buf[x(i+1) + j] = buf[x(i+1)-j-1] */
+  buf[i*x+(x-z+j)] = buf[i*x+(x-z-1-j)];
+}
+
+void bar(int x, int y, int z, int d, int *buf)
+{
+  for(int i = 0; i < d; ++i)
+for(int j = z; j < x-z; ++j)
+  /* buf[j+(y+i)*x] = buf[j+(y-1-i)*x] */
+  buf[j+(y-z+i)*x] = buf[j+(y-z-1-i)*x];
+}
+
+__attribute__((noipa))
+void baz(int x, int y, int d, int *buf)
+{
+  foo(x, y, 0, d, buf);
+  bar(x, y, 0, d, buf);
+}
+
+int main(void)
+{
+  int a[] = { 1, 2, 3 };
+  baz (1, 2, 1, a);
+  /* foo does:
+ buf[1] = buf[0];
+ buf[2] = buf[1];
+
+ bar does:
+ buf[2] = buf[1]; (no-op)
+ so we should have { 1, 1, 1 }.  */
+  for (int i = 0; i < 3; i++)
+if (a[i] != 1)
+  __builtin_abort ();
+}


Re: [PATCH 6/7] lto: squash order of symbols in partitions

2024-05-14 Thread Jan Hubicka
> This patch squashes order of symbols in individual partitions, so that
> their relative order is conserved, but is not influenced by symbols in
> other partitions.
> Order of cloned symbols is set to 0. This should be fine because order
> specifies order of symbols in input files, which cloned symbols are not
> part of.

The current use of order is somewhat broken (and has been since the
conversion of cgraph to C++, which is a while ago).
The original code set order at the time a function was finalized,
which made functions be output in the same order as their bodies appear
in the source code (at least with a -fno-toplevel-reorder build).

With this logic the clones should have the same order as the originals,
so they appear next to them.

Later, initialization of order was moved to register_symbol, which is
kind of wrong since frontends are allowed to produce symbols early.
So it would be nice to fix this problem and make sure that the order of
clones is sane.

I guess this is a bit independent of the rest of the caching, so maybe
we can first get the other patches in and then worry about order?
> 
> This is important for incremental LTO because if there is a new symbol,
> it otherwise shifts order of all symbols with higher order, which would
> diverge them all.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-cgraph.cc (lto_output_node): Add and use order_remap.
>   (lto_output_varpool_node): Likewise.
>   (output_symtab): Likewise.
>   * lto-streamer-out.cc (produce_asm): Likewise.
>   (output_function): Likewise.
>   (output_constructor): Likewise.
>   (copy_function_or_variable): Likewise.
>   (cmp_int): New.
>   (lto_output): Generate order_remap.
>   * lto-streamer.h (produce_asm): Add order_remap.
>   (output_symtab): Likewise.
> ---
>  gcc/lto-cgraph.cc   | 20 
>  gcc/lto-streamer-out.cc | 71 +
>  gcc/lto-streamer.h  |  5 +--
>  3 files changed, 73 insertions(+), 23 deletions(-)
> 
> diff --git a/gcc/lto-cgraph.cc b/gcc/lto-cgraph.cc
> index 32c0f5ac6db..a7530290fba 100644
> --- a/gcc/lto-cgraph.cc
> +++ b/gcc/lto-cgraph.cc
> @@ -381,7 +381,8 @@ reachable_from_this_partition_p (struct cgraph_node 
> *node, lto_symtab_encoder_t
>  
>  static void
>  lto_output_node (struct lto_simple_output_block *ob, struct cgraph_node 
> *node,
> -  lto_symtab_encoder_t encoder)
> +  lto_symtab_encoder_t encoder,
> +  hash_map, int>* order_remap)
>  {
>unsigned int tag;
>struct bitpack_d bp;
> @@ -405,7 +406,9 @@ lto_output_node (struct lto_simple_output_block *ob, 
> struct cgraph_node *node,
>  
>streamer_write_enum (ob->main_stream, LTO_symtab_tags, LTO_symtab_last_tag,
>  tag);
> -  streamer_write_hwi_stream (ob->main_stream, node->order);
> +
> +  int order = flag_wpa ? *order_remap->get (node->order) : node->order;
> +  streamer_write_hwi_stream (ob->main_stream, order);
>  
>/* In WPA mode, we only output part of the call-graph.  Also, we
>   fake cgraph node attributes.  There are two cases that we care.
> @@ -585,7 +588,8 @@ lto_output_node (struct lto_simple_output_block *ob, 
> struct cgraph_node *node,
>  
>  static void
>  lto_output_varpool_node (struct lto_simple_output_block *ob, varpool_node 
> *node,
> -  lto_symtab_encoder_t encoder)
> +  lto_symtab_encoder_t encoder,
> +  hash_map, int>* order_remap)
>  {
>bool boundary_p = !lto_symtab_encoder_in_partition_p (encoder, node);
>bool encode_initializer_p
> @@ -602,7 +606,8 @@ lto_output_varpool_node (struct lto_simple_output_block 
> *ob, varpool_node *node,
>  
>streamer_write_enum (ob->main_stream, LTO_symtab_tags, LTO_symtab_last_tag,
>  LTO_symtab_variable);
> -  streamer_write_hwi_stream (ob->main_stream, node->order);
> +  int order = flag_wpa ? *order_remap->get (node->order) : node->order;
> +  streamer_write_hwi_stream (ob->main_stream, order);
>lto_output_var_decl_ref (ob->decl_state, ob->main_stream, node->decl);
>bp = bitpack_create (ob->main_stream);
> >bp_pack_value (&bp, node->externally_visible, 1);
> @@ -967,7 +972,7 @@ compute_ltrans_boundary (lto_symtab_encoder_t in_encoder)
>  /* Output the part of the symtab in SET and VSET.  */
>  
>  void
> -output_symtab (void)
> +output_symtab (hash_map, int>* order_remap)
>  {
>struct cgraph_node *node;
>struct lto_simple_output_block *ob;
> @@ -994,9 +999,10 @@ output_symtab (void)
>  {
>symtab_node *node = lto_symtab_encoder_deref (encoder, i);
> >if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> -lto_output_node (ob, cnode, encoder);
> + lto_output_node (ob, cnode, encoder, order_remap);
>else
> > - lto_output_varpool_node (ob, dyn_cast <varpool_node *> (node), encoder);
> > + lto_output_varpool_node (ob, dyn_cast <varpool_node *> (node), encoder,
> > +  order_remap);
>  }
>  
>

Re: [PATCH 7/7] lto: partition specific lto_clone_numbers

2024-05-14 Thread Jan Hubicka
> Replaces "lto_priv.$clone_number" by
> "lto_priv.$partition_hash.$partition_specific_clone_number".
> To reduce divergence for incremental LTO.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
OK,
thanks!
Honza
> 
> gcc/lto/ChangeLog:
> 
>   * lto-partition.cc (set_clone_partition_name_checksum): New.
>   (CHECKSUM_STRING): New.
>   (privatize_symbol_name_1): Use partition hash for lto_priv.
>   (lto_promote_cross_file_statics): Use set_clone_partition_name_checksum.
>   (lto_promote_statics_nonwpa): Changed clone_map type.
> ---
>  gcc/lto/lto-partition.cc | 49 +++-
>  1 file changed, 43 insertions(+), 6 deletions(-)
> 
> diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
> index eb31ecba0d3..a2ce24eea23 100644
> --- a/gcc/lto/lto-partition.cc
> +++ b/gcc/lto/lto-partition.cc
> @@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "ipa-fnsummary.h"
>  #include "lto-partition.h"
>  #include "sreal.h"
> +#include "md5.h"
>  
>  #include 
>  #include 
> @@ -1516,8 +1517,36 @@ validize_symbol_for_target (symtab_node *node)
>  }
>  }
>  
> -/* Maps symbol names to unique lto clone counters.  */
> -static hash_map<const char *, unsigned> *lto_clone_numbers;
> +/* Maps symbol names with partition checksum to unique lto clone
> +   counters.  */
> +using clone_map = hash_map<pair_hash<nofree_string_hash,
> +  int_hash_base<uint64_t>>, unsigned>;
> +static clone_map *lto_clone_numbers;
> +uint64_t current_partition_checksum = 0;
> +
> +/* Computes a quick checksum to distinguish partitions of clone numbers.  */
> +void
> +set_clone_partition_name_checksum (ltrans_partition part)
> +{
> +#define CHECKSUM_STRING(FOO) md5_process_bytes ((FOO), strlen (FOO), &ctx)
> +  struct md5_ctx ctx;
> +  md5_init_ctx (&ctx);
> +
> +  CHECKSUM_STRING (part->name);
> +
> +  lto_symtab_encoder_iterator lsei;
> +  lto_symtab_encoder_t encoder = part->encoder;
> +
> +  for (lsei = lsei_start (encoder); !lsei_end_p (lsei); lsei_next ())
> +{
> +  symtab_node *node = lsei_node (lsei);
> +  CHECKSUM_STRING (node->name ());
> +}
> +
> +  uint64_t checksum[2];
> +  md5_finish_ctx (&ctx, checksum);
> +  current_partition_checksum = checksum[0];
> +#undef CHECKSUM_STRING
> +}
>  
>  /* Helper for privatize_symbol_name.  Mangle NODE symbol name
> represented by DECL.  */
> @@ -1531,10 +1560,16 @@ privatize_symbol_name_1 (symtab_node *node, tree decl)
>  return false;
>  
>const char *name = maybe_rewrite_identifier (name0);
> -  unsigned &clone_number = lto_clone_numbers->get_or_insert (name);
> +
> +  unsigned &clone_number = lto_clone_numbers->get_or_insert
> +    (std::pair {name, current_partition_checksum});
> +
> +  char lto_priv[32];
> +  sprintf (lto_priv, "lto_priv.%lu", current_partition_checksum);
> +
>symtab->change_decl_assembler_name (decl,
> clone_function_name (
> -   name, "lto_priv", clone_number));
> +   name, lto_priv, clone_number));
>clone_number++;
>  
>if (node->lto_file_data)
> @@ -1735,11 +1770,13 @@ lto_promote_cross_file_statics (void)
>part->encoder = compute_ltrans_boundary (part->encoder);
>  }
>  
> -  lto_clone_numbers = new hash_map<const char *, unsigned>;
> +  lto_clone_numbers = new clone_map;
>  
>/* Look at boundaries and promote symbols as needed.  */
>for (i = 0; i < n_sets; i++)
>  {
> +  set_clone_partition_name_checksum (ltrans_partitions[i]);
> +
>lto_symtab_encoder_iterator lsei;
>lto_symtab_encoder_t encoder = ltrans_partitions[i]->encoder;
>  
> @@ -1778,7 +1815,7 @@ lto_promote_statics_nonwpa (void)
>  {
>symtab_node *node;
>  
> -  lto_clone_numbers = new hash_map<const char *, unsigned>;
> +  lto_clone_numbers = new clone_map;
>FOR_EACH_SYMBOL (node)
>  {
>rename_statics (NULL, node);
> -- 
> 2.42.1
> 


Re: [PATCH 5/7] lto: Implement cache partitioning

2024-05-14 Thread Jan Hubicka
> gcc/ChangeLog:
> 
>   * common.opt: Add cache partitioning.
>   * flag-types.h (enum lto_partition_model): Likewise.
> 
> gcc/lto/ChangeLog:
> 
>   * lto-partition.cc (new_partition): Use new_partition_no_push.
>   (new_partition_no_push): New.
>   (free_ltrans_partition): New.
>   (free_ltrans_partitions): Use free_ltrans_partition.
>   (join_partitions): New.
>   (split_partition_into_nodes): New.
>   (is_partition_reorder): New.
>   (class partition_set): New.
>   (distribute_n_partitions): New.
>   (partition_over_target_split): New.
>   (partition_binary_split): New.
>   (partition_fixed_split): New.
>   (class partitioner_base): New.
>   (class partitioner_default): New.
>   (lto_cache_map): New.
>   * lto-partition.h (lto_cache_map): New.
>   * lto.cc (do_whole_program_analysis): Use lto_cache_map.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/completion-2.c: Add -flto-partition=cache.
> +/* Free memory used by ltrans partition.
> +   Encoder can be kept to be freed after streaming.  */
> +static void
> +free_ltrans_partition (ltrans_partition part, bool delete_encoder)
> +  {
No two spaces here (shift everything two columns to the left).
> +if (part->initializers_visited)
> +  delete part->initializers_visited;
> +if (delete_encoder)
> +  lto_symtab_encoder_delete (part->encoder);
> +free (part);
It would make sense to turn this into C++ and use destructors
(incrementally).
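A minimal sketch of that direction (hypothetical; fields reduced to the
two resources freed above):

  struct ltrans_partition_def
  {
    hash_set<symtab_node *> *initializers_visited = nullptr;
    lto_symtab_encoder_t encoder = nullptr;
    /* Set when the encoder must outlive the partition for streaming.  */
    bool keep_encoder = false;

    ~ltrans_partition_def ()
    {
      delete initializers_visited;
      if (!keep_encoder && encoder)
        lto_symtab_encoder_delete (encoder);
    }
  };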

OK,
Honza


Re: [PATCH 4/7] lto: Implement ltrans cache

2024-05-14 Thread Jan Hubicka
> This patch implements Incremental LTO as ltrans cache.
> 
> The cache is active when directory $GCC_LTRANS_CACHE is specified and exists.
> Stored are pairs of ltrans input/output files and input file hash.
> File locking is used to allow multiple GCC instances to use to same cache.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * Makefile.in: Add lto-ltrans-cache.o.
>   * lto-wrapper.cc: Use ltrans cache.
>   * lto-ltrans-cache.cc: New file.
>   * lto-ltrans-cache.h: New file.
> diff --git a/gcc/lto-ltrans-cache.cc b/gcc/lto-ltrans-cache.cc
> new file mode 100644
> index 000..0d43e548fb3
> --- /dev/null
> +++ b/gcc/lto-ltrans-cache.cc
> @@ -0,0 +1,407 @@
> +/* File caching.
> +   Copyright (C) 2009-2023 Free Software Foundation, Inc.

Probably copyright should be 2023-2024
> +const md5_checksum_t INVALID_CHECKSUM = {
Maybe static here? Officially there should be a comment before the
declaration.
> +  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> +};
> +
> +/* Computes checksum for given file, returns INVALID_CHECKSUM if not 
> possible.
> + */
The comment would look more regular if the line break were made before
"possible" :)
> +
> +/* Checks identity of two files byte by byte.  */
> +static bool
> +files_identical (char const *first_filename, char const *second_filename)
> +{
> +  FILE *f_first = fopen (first_filename, "rb");
> +  if (!f_first)
> +return false;
> +
> +  FILE *f_second = fopen (second_filename, "rb");
> +  if (!f_second)
> +{
> +  fclose (f_first);
> +  return false;
> +}
> +
> +  bool ret = true;
> +
> +  for (;;)
> +{
> +  int c1, c2;
> +  c1 = fgetc (f_first);
> +  c2 = fgetc (f_second);

I guess reading by fgetc may get quite inefficient here.  Comparing
bigger blocks is probably going to be faster.  We could also
(incrementally) use mmap where supported.
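A block-based sketch along those lines (buffer size is an arbitrary
choice; assumes <cstdio>/<cstring>):

  static bool
  files_identical (char const *first_filename, char const *second_filename)
  {
    FILE *f_first = fopen (first_filename, "rb");
    if (!f_first)
      return false;
    FILE *f_second = fopen (second_filename, "rb");
    if (!f_second)
      {
        fclose (f_first);
        return false;
      }
    bool ret = true;
    char buf1[16384], buf2[16384];
    for (;;)
      {
        size_t n1 = fread (buf1, 1, sizeof buf1, f_first);
        size_t n2 = fread (buf2, 1, sizeof buf2, f_second);
        /* A length mismatch or differing block means the files differ.  */
        if (n1 != n2 || memcmp (buf1, buf2, n1) != 0)
          {
            ret = false;
            break;
          }
        /* Short read on both streams: EOF reached, files identical.  */
        if (n1 < sizeof buf1)
          break;
      }
    fclose (f_first);
    fclose (f_second);
    return ret;
  }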
> +
> +/* Constructor of cache item.  */
> +ltrans_file_cache::item::item (std::string input, std::string output,
> +  md5_checksum_t input_checksum, uint32_t last_used):
There should be enough whitespace here so that md5_checksum_t appears
just after the ( in the line above:
                               md5_checksum_t input_checksum,
                               uint32_t last_used):
> +  input (std::move (input)), output (std::move (output)),
> +  input_checksum (input_checksum), last_used (last_used)
> +{
> +  lock = lockfile (this->input + ".lock");
> +}
> +/* Destructor of cache item.  */
> +ltrans_file_cache::item::~item ()
> +{
> +  lock.unlock ();
> +}
> +
> +/* Reads next cache item from cachedata file.
> +   Adds `dir/` prefix to filenames.  */
> +static ltrans_file_cache::item*
> +read_cache_item (FILE* f, const char* dir)
> +{
> +  md5_checksum_t checksum;
> +  uint32_t last_used;
> +
> +  if (fread (&checksum, 1, checksum.size (), f) != checksum.size ())
> +return NULL;
> +  if (fread (&last_used, sizeof (last_used), 1, f) != 1)
> +return NULL;
> +
> +  std::vector<char> input (strlen (dir));
> +  memcpy (&input[0], dir, input.size ());
> +  input.push_back ('/');
Why is this not std::string?
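Presumably something like (hypothetical sketch):

  std::string input (dir);
  input += '/';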
> +  /* Loads data about previously cached items from cachedata file.
> +
> + Must be called with creation_lock or deletion_lock held to
> + prevent data race.  */
> +  void
> +  load_cache ();
There should be no newline between type and name in a declaration.  It is
there only when defining a function (so that it is easy to use old-school
grep to find where a function is defined).
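I.e. (illustrating the convention):

  void load_cache ();   /* declaration: type and name on one line.  */

  void
  load_cache ()         /* definition: name starts the line, so
                           "grep ^load_cache" finds it.  */
  {
  }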

Looks good to me otherwise.
Honza


Re: Fwd: [PATCH 2/7 v2] lto: Remove random_seed from section name.

2024-05-14 Thread Jan Hubicka
> This patch removes suffixes from section names during LTO linking.
> 
> These suffixes were originally added for ld -r to work (PR lto/44992).
> They were added to all LTO object files, but are only useful before WPA.
> After that they waste space, and if kept random, make LTO caching
> impossible.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-streamer.cc (lto_get_section_name): Remove suffixes after WPA.
> 
> gcc/lto/ChangeLog:
> 
* lto-common.cc (lto_section_with_id): Don't load suffix during LTRANS.
OK,
thanks
Honza


Avoid TYPE_MAIN_VARIANT compares in TBAA

2024-05-14 Thread Jan Hubicka
Hi,
while building more testcases for ipa-icf I noticed that there are two places
in the aliasing code where we still compare TYPE_MAIN_VARIANT for pointer
equality.  This is not a good idea for LTO, since type merging may not happen,
for example when in one unit the pointed-to type is forward-declared while in
another it is fully defined.  We have same_type_for_tbaa for that.
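A minimal example of the situation (hand-written, not from a PR testcase):

  /* tu1.c: the pointed-to type is only forward-declared here.  */
  struct S;
  struct S *get (struct S **p) { return *p; }

  /* tu2.c: the same type, fully defined.  */
  struct S { int x; };
  int read_x (struct S **p) { return (*p)->x; }

After LTO streaming the two struct S trees need not be merged, so accesses
from the two units can have different TYPE_MAIN_VARIANTs even though they
are the same type for TBAA purposes.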

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

* alias.cc (reference_alias_ptr_type_1): Use view_converted_memref_p.
* alias.h (view_converted_memref_p): Declare.
* tree-ssa-alias.cc (view_converted_memref_p): Export.
(ao_compare::compare_ao_refs): Use same_type_for_tbaa.

diff --git a/gcc/alias.cc b/gcc/alias.cc
index 808e2095d9b..853e84d7439 100644
--- a/gcc/alias.cc
+++ b/gcc/alias.cc
@@ -770,10 +770,7 @@ reference_alias_ptr_type_1 (tree *t)
   /* If the innermost reference is a MEM_REF that has a
  conversion embedded treat it like a VIEW_CONVERT_EXPR above,
  using the memory access type for determining the alias-set.  */
-  if (TREE_CODE (inner) == MEM_REF
-  && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
- != TYPE_MAIN_VARIANT
-  (TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1))
+  if (view_converted_memref_p (inner))
 {
   tree alias_ptrtype = TREE_TYPE (TREE_OPERAND (inner, 1));
   /* Unless we have the (aggregate) effective type of the access
diff --git a/gcc/alias.h b/gcc/alias.h
index f8d93e8b5f4..36095f0bf73 100644
--- a/gcc/alias.h
+++ b/gcc/alias.h
@@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
 int compare_base_decls (tree, tree);
 bool refs_same_for_tbaa_p (tree, tree);
 bool mems_same_for_tbaa_p (rtx, rtx);
+bool view_converted_memref_p (tree);
 
 /* This alias set can be used to force a memory to conflict with all
other memories, creating a barrier across which no memory reference
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index e7c1c1aa624..632cf78028b 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -2044,7 +2044,7 @@ decl_refs_may_alias_p (tree ref1, tree base1,
which is done by ao_ref_base and thus one extra walk
of handled components is needed.  */
 
-static bool
+bool
 view_converted_memref_p (tree base)
 {
   if (TREE_CODE (base) != MEM_REF && TREE_CODE (base) != TARGET_MEM_REF)
@@ -4325,8 +4325,8 @@ ao_compare::compare_ao_refs (ao_ref *ref1, ao_ref *ref2,
   else if ((end_struct_ref1 != NULL) != (end_struct_ref2 != NULL))
 return flags | ACCESS_PATH;
   if (end_struct_ref1
-  && TYPE_MAIN_VARIANT (TREE_TYPE (end_struct_ref1))
-!= TYPE_MAIN_VARIANT (TREE_TYPE (end_struct_ref2)))
+  && same_type_for_tbaa (TREE_TYPE (end_struct_ref1),
+TREE_TYPE (end_struct_ref2)) != 1)
 return flags | ACCESS_PATH;
 
   /* Now compare all handled components of the access path.


Re: [PATCH] [RFC] Add function filtering to gcov

2024-05-08 Thread Jan Hubicka
> > 
> > For JSON output I suppose there's a way to "grep" without the line oriented
> > issue?  I suppose we could make the JSON more hierarchical by adding
> > an outer function object?
> 
> Absolutely, yes, this is much less useful for JSON. The filtering works,
> which may be occasionally handy for very large files, but jq and other query
> engines already filter quite well. For me, the problem is actually
> working with JSON and mapping it back to the source.
> 
> > 
> > That said, I think this is a useful feature and thus OK for trunk if there 
> > are
> > no other comments in about a week if you also update the gcov documentation.
> 
> Thanks! I was actually surprised at how much I liked this feature once I
> started playing with it, and it made the edit, build, run, gcov -t cycle
> quite effective at exploring the program. For path coverage, filtering also
> becomes mandatory - the number of paths grows *very* quickly, and different
> levels of verbosity is useful too. Being able to focus on the function(s)
> you care about makes a huge difference.
> 
> I won't merge the patch in its current state - I will make a pass or two
> over it, add some documentation, and resubmit the patch, but not make large
> design changes unless someone objects. I won't merge that patch until it has
> been reviewed again, of course.

I also think it is a useful feature to be able to restrict gcov to
selected functions.
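For instance (hypothetical invocation; the option names are from the patch
under review):

  # Text report for just the function of interest, minus its helpers.
  gcov -t --include '^interesting_fn$' --exclude '_helper$' foo.c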
Concerning path coverage profiling, this is something that can
potentially also be useful for optimization.  E.g. search for "Efficient
path profiling" by Ball and Larus on Scholar.  One obvious thing to do
is to perform tail duplication on code paths that show correlated
branches, but there are quite a few other things to do with such
infrastructure (there are over 900 references to it on Scholar and
some of them may be practically relevant).

Honza
> 
> Thanks,
> Jørgen
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > ---
> > >   gcc/gcov.cc| 101 +++--
> > >   gcc/testsuite/g++.dg/gcov/gcov-19.C|  35 +
> > >   gcc/testsuite/g++.dg/gcov/gcov-20.C|  38 ++
> > >   gcc/testsuite/gcc.misc-tests/gcov-24.c |  20 +
> > >   gcc/testsuite/gcc.misc-tests/gcov-25.c |  23 ++
> > >   gcc/testsuite/gcc.misc-tests/gcov-26.c |  23 ++
> > >   gcc/testsuite/gcc.misc-tests/gcov-27.c |  22 ++
> > >   gcc/testsuite/lib/gcov.exp |  53 -
> > >   8 files changed, 306 insertions(+), 9 deletions(-)
> > >   create mode 100644 gcc/testsuite/g++.dg/gcov/gcov-19.C
> > >   create mode 100644 gcc/testsuite/g++.dg/gcov/gcov-20.C
> > >   create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-24.c
> > >   create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-25.c
> > >   create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-26.c
> > >   create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-27.c
> > > 
> > > diff --git a/gcc/gcov.cc b/gcc/gcov.cc
> > > index fad704eb7c9..e53765de00a 100644
> > > --- a/gcc/gcov.cc
> > > +++ b/gcc/gcov.cc
> > > @@ -46,6 +46,7 @@ along with Gcov; see the file COPYING3.  If not see
> > >   #include "color-macros.h"
> > >   #include "pretty-print.h"
> > >   #include "json.h"
> > > +#include "xregex.h"
> > > 
> > >   #include 
> > >   #include 
> > > @@ -643,6 +644,43 @@ static int flag_counts = 0;
> > >   /* Return code of the tool invocation.  */
> > >   static int return_code = 0;
> > > 
> > > +/* "Keep policy" when adding functions to the global function table.  
> > > This will
> > > +   be set to false when --include is used, otherwise every function 
> > > should be
> > > +   added to the table.  Used for --include/exclude.  */
> > > +static bool default_keep = true;
> > > +
> > > +/* A 'function filter', a filter and action for determining if a function
> > > +   should be included in the output or not.  Used for --include/--exclude
> > > +   filtering.  */
> > > +struct fnfilter
> > > +{
> > > +  /* The (extended) compiled regex for this filter.  */
> > > +  regex_t regex;
> > > +
> > > +  /* The action when this filter (regex) matches - if true, the function 
> > > should
> > > + be kept, otherwise discarded.  */
> > > +  bool keep;
> > > +
> > > +  /* Compile the regex EXPR, or exit if pattern is malformed.  */
> > > +  void compile (const char *expr)
> > > +  {
> > > +int err = regcomp (&regex, expr, REG_NOSUB | REG_EXTENDED);
> > > +if (err)
> > > +  {
> > > +   size_t len = regerror (err, &regex, nullptr, 0);
> > > +   char *msg = XNEWVEC (char, len);
> > > +   regerror (err, &regex, msg, len);
> > > +   fprintf (stderr, "Bad regular expression: %s\n", msg);
> > > +   free (msg);
> > > +   exit (EXIT_FAILURE);
> > > +}
> > > +  }
> > > +};
> > > +
> > > +/* A collection of filter functions for including/exclude functions in 
> > > the
> > > +   output.  This is empty unless --include/--exclude is used.  */
> > > +static vector<fnfilter> filters;
> > > +
> > >   /* Forward 

[wwwdocs] Add some more stuff into GCC14 changes.html

2024-05-07 Thread Jan Hubicka
Hi,
I realize that I am late for the release (sorry for that). But here are
a few things which I think may be added to changes.html, at least for
those who will look later.
OK?
Honza

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index ca5174de..b390db51 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -158,6 +158,22 @@ You may also want to check out our
or the vectorizer must be able to determine based on auxillary 
information
that the accesses are aligned.
   
+  Significant improvements in maintenance of edge profile across
+  optimizations performing control flow changes.
+  Condition coverage instrumentation is now available using
+  <a href="https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html#index-fcondition-coverage">-fcondition-coverage</a>,
+  which can be analyzed using <code>gcov --conditions</code>.
+  Inter-procedural value range propagation can now propagate value ranges
+  of return values of function and also ranges which can be derived from:
+  
+   if (val > constant)
+ __builtin_unreachable ();
+  
+  
+  Scalar replacement for aggregates now uses escape information
+  determined by mod-ref pass.  This improves code generation in case
+  some C++ member functions are not inlined.
+  
 
 
 New Languages and Language specific improvements
@@ -582,6 +598,7 @@ You may also want to check out our
   Faster numeric conversions using std::to_string and
 std::to_wstring.
   
+  Significantly faster std::vector::push_back
   Updated parallel algorithms that are compatible with oneTBB.
   std::numeric_limits_Float32 and
 std::numeric_limits_Float64 are now defined


[wwwdocs] Add Cauldron2024

2024-05-07 Thread Jan Hubicka
Hi,
this adds Cauldron2024 to main page. OK?

diff --git a/htdocs/index.html b/htdocs/index.html
index aa4683da..de5cca7b 100644
--- a/htdocs/index.html
+++ b/htdocs/index.html
@@ -54,6 +54,9 @@ mission statement.
 
 News
 
+<a href="https://gcc.gnu.org/wiki/cauldron2024">GNU Tools Cauldron 2024</a>
+[2024-05-07]
+Prague, Czech Republic, September 14-16 2024
 
 GCC 14.1 released
 [2024-05-07]


Fix documentation of -ftree-loop-distribute-patterns

2024-04-23 Thread Jan Hubicka
Hi,
we have:

   -ftree-loop-distribute-patterns
   Perform loop distribution of patterns that can be code generated
   with calls to a library.  This flag is enabled by default at -O2
   and higher, and by -fprofile-use and -fauto-profile.

   This pass distributes the initialization loops and generates a
   call to memset zero.  For example, the loop

...

   and the initialization loop is transformed into a call to memset
   zero.  This flag is enabled by default at -O3.  It is also enabled
   by -fprofile-use and -fauto-profile.

This mentions the optimization flags twice, and the repeated mention is
out of date since we enable this option at -O2 as well.

Regtested x86_64-linux, plan to commit it shortly as obvious.

gcc/ChangeLog:

* doc/invoke.texi (-ftree-loop-distribute-patterns): Remove duplicated
sentence about optimization flags implying this.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2a35dc7ac75..27c31ab0c86 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13852,8 +13852,6 @@ DO I = 1, N
 ENDDO
 @end smallexample
 and the initialization loop is transformed into a call to memset zero.
-This flag is enabled by default at @option{-O3}.
-It is also enabled by @option{-fprofile-use} and @option{-fauto-profile}.
 
 @opindex floop-interchange
 @item -floop-interchange


Re: [PATCH] c++: Retry the aliasing of base/complete cdtor optimization at import_export_decl time [PR113208]

2024-04-17 Thread Jan Hubicka
> 
> Ah, you're right.
> If I compile (the one line modified) pr113208_0.C with
> -O -fno-early-inlining -fdisable-ipa-inline -std=c++20
> it does have just _ZN6vectorI12QualityValueEC2ERKS1_ in 
> _ZN6vectorI12QualityValueEC2ERKS1_
> comdat and no _ZN6vectorI12QualityValueEC1ERKS1_
> and pr113208_1.C with -O -fno-early-inlining -fdisable-ipa-inline -std=c++20 
> -fno-omit-frame-pointer
> and link that together with the above mentioned third *.C file, I see
> 0040112a <_ZN6vectorI12QualityValueEC2ERKS1_>:
>   40112a:   53  push   %rbx
>   40112b:   48 89 fbmov%rdi,%rbx
>   40112e:   48 89 f7mov%rsi,%rdi
>   401131:   e8 9c 00 00 00  call   4011d2 
> <_ZNK12_Vector_baseI12QualityValueE1gEv>
>   401136:   89 c2   mov%eax,%edx
>   401138:   be 01 00 00 00  mov$0x1,%esi
>   40113d:   48 89 dfmov%rbx,%rdi
>   401140:   e8 7b 00 00 00  call   4011c0 
> <_ZN12_Vector_baseI12QualityValueEC1Eii>
>   401145:   5b  pop%rbx
>   401146:   c3  ret
> i.e. the C2 prevailing from pr113208_0.s where it is the only symbol, and
> 00401196 <_ZN6vectorI12QualityValueEC1ERKS1_>:
>   401196:   55  push   %rbp
>   401197:   48 89 e5mov%rsp,%rbp
>   40119a:   53  push   %rbx
>   40119b:   48 83 ec 08 sub$0x8,%rsp
>   40119f:   48 89 fbmov%rdi,%rbx
>   4011a2:   48 89 f7mov%rsi,%rdi
>   4011a5:   e8 28 00 00 00  call   4011d2 
> <_ZNK12_Vector_baseI12QualityValueE1gEv>
>   4011aa:   89 c2   mov%eax,%edx
>   4011ac:   be 01 00 00 00  mov$0x1,%esi
>   4011b1:   48 89 dfmov%rbx,%rdi
>   4011b4:   e8 07 00 00 00  call   4011c0 
> <_ZN12_Vector_baseI12QualityValueEC1Eii>
>   4011b9:   48 8b 5d f8 mov-0x8(%rbp),%rbx
>   4011bd:   c9  leave  
>   4011be:   c3  ret
> which is the C1 alias originally aliased to C2 in C5 comdat.
> So, that would match linker behavior where it sees C1 -> C2 alias prevails,
> but a different version of C2 prevails, so let's either make C1 a non-alias
> or alias to a non-exported symbol or something like that.

Yep, I think the linker simply handles the C2 symbol as a weak symbol,
and after it decides to keep both comdat sections (C2 and C5), the weak
C2 symbol in C5 is prevailed by the one in the C2 section.

I wrote a patch doing the same with LTO - we need to handle alias
references specially and not update them according to the resolution
info, and then look for prevailed symbols which have aliases and turn
them into static symbols.

I think for most scenarios this is OK, but I am not sure about
incremental linking (both LTO and non-LTO).  What seems wrong is that we
produce a C5 comdat that is not exporting what it should, thus breaking
the invariant that in valid code all comdats of the same name are
semantically equivalent.  Perhaps it makes no difference, since this
scenario is pretty special and we know that the functions are
semantically equivalent and their addresses are never compared for
equality (at least I failed to produce a useful testcase).

Honza


Re: [PATCH] c++: Retry the aliasing of base/complete cdtor optimization at import_export_decl time [PR113208]

2024-04-17 Thread Jan Hubicka
> 
> I've tried to see what actually happens during linking without LTO, so 
> compiled
> pr113208_0.C with -O1 -fkeep-inline-functions -std=c++20 with vanilla trunk
> (so it has those 2 separate comdats, one for C2 and one for C1), though I've
> changed the
> void m(k);
> line to
> __attribute__((noipa)) void m(k) {}
> in the testcase, then compiled
> pr113208_1.C with -O2 -fkeep-inline-functions -std=c++20 
> -fno-omit-frame-pointer
> so that one can clearly differentiate from where the implementation was
> picked and finally added
> template  struct _Vector_base {
>   int g() const;
>   _Vector_base(int, int);
> };
> 
> struct QualityValue;
> template <>
> _Vector_base::_Vector_base(int, int) {}
> template <>
> int _Vector_base::g() const { return 0; }
> int main () {}
> If I link this, I see _ZN6vectorI12QualityValueEC2ERKS1_ and
> _ZN6vectorI12QualityValueEC1ERKS1_ as separate functions with the
> omitted frame pointer bodies, so clearly the pr113208_0.C versions prevailed
> in both cases.  It is unclear why that isn't the case for LTO.

I think it is because of -fkeep-inline-functions, which makes the first
object file define both symbols, while with LTO we optimize out one
of them.

So to reproduce the same behaviour without LTO we would probably need to
use -O1 and arrange for the constructor to be uninlinable instead of
using -fkeep-inline-functions.

Honza


Re: [PATCH] c++: Retry the aliasing of base/complete cdtor optimization at import_export_decl time [PR113208]

2024-04-17 Thread Jan Hubicka
> Hi!
Hello,
> The reason the optimization doesn't trigger when the constructor is
> constexpr is that expand_or_defer_fn is called in that case much earlier
> than when it is not constexpr; in the former case it is called when we
> try to constant evaluate that constructor.  But DECL_INTERFACE_KNOWN
> is false in that case and comdat_linkage hasn't been called either
> (I think it is desirable, because comdat group is stored in the cgraph
> node and am not sure it is a good idea to create cgraph nodes for
> something that might not be needed later on at all), so maybe_clone_body
> clones the bodies, but doesn't make them as aliases.

Thanks for working this out! Creating a cgraph node with no body is
harmless, as it will be removed later.
> 
> The following patch is an attempt to redo this optimization when
> import_export_decl is called at_eof time on the base/complete cdtor
> (or deleting dtor).  It will not do anything if maybe_clone_body
> hasn't been called uyet (the TREE_ASM_WRITTEN check on the
> DECL_MAYBE_IN_CHARGE_CDTOR_P), or when one or both of the base/complete
> cdtors have been lowered already, or when maybe_clone_body called
> maybe_thunk_body and it was successful.  Otherwise retries the
> can_alias_cdtor check and makes the complete cdtor alias to the
> base cdtor with adjustments to the comdat group.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> 2024-04-17  Jakub Jelinek  
> 
>   PR lto/113208
>   * cp-tree.h (maybe_optimize_cdtor): Declare.
>   * decl2.cc (import_export_decl): Call it for cloned cdtors.
>   * optimize.cc (maybe_optimize_cdtor): New function.
> 
>   * g++.dg/lto/pr113208_0.C: New test.
>   * g++.dg/lto/pr113208_1.C: New file.
>   * g++.dg/lto/pr113208.h: New file.
> 
> --- gcc/cp/cp-tree.h.jj   2024-04-16 17:18:37.286279533 +0200
> +++ gcc/cp/cp-tree.h  2024-04-16 17:45:01.685635709 +0200
> @@ -7447,6 +7447,7 @@ extern bool handle_module_option (unsign
>  /* In optimize.cc */
>  extern tree clone_attrs  (tree);
>  extern bool maybe_clone_body (tree);
> +extern void maybe_optimize_cdtor (tree);
>  
>  /* In parser.cc */
>  extern tree cp_convert_range_for (tree, tree, tree, cp_decomp *, bool,
> --- gcc/cp/decl2.cc.jj2024-04-16 17:18:37.287279519 +0200
> +++ gcc/cp/decl2.cc   2024-04-16 17:45:01.686635695 +0200
> @@ -3568,6 +3568,9 @@ import_export_decl (tree decl)
>  }
>  
>DECL_INTERFACE_KNOWN (decl) = 1;
> +
> +  if (DECL_CLONED_FUNCTION_P (decl))
> +maybe_optimize_cdtor (decl);
>  }
>  
>  /* Return an expression that performs the destruction of DECL, which
> --- gcc/cp/optimize.cc.jj 2024-04-16 17:18:37.374278327 +0200
> +++ gcc/cp/optimize.cc2024-04-16 21:35:53.597188774 +0200
> @@ -723,3 +723,58 @@ maybe_clone_body (tree fn)
>/* We don't need to process the original function any further.  */
>return 1;
>  }
> +
> +/* If maybe_clone_body is called while the cdtor is still tentative,
> +   DECL_ONE_ONLY will be false and so will be can_alias_cdtor (fn).
> +   In that case we wouldn't try to optimize using an alias and instead
> +   would emit separate base and complete cdtor.  The following function
> +   attempts to still optimize that case when we import_export_decl
> +   is called first time on one of the clones.  */
> +
> +void
> +maybe_optimize_cdtor (tree orig_decl)
> +{
> +  tree fns[3];
> +  tree fn = DECL_CLONED_FUNCTION (orig_decl);
> +  gcc_checking_assert (DECL_MAYBE_IN_CHARGE_CDTOR_P (fn));
> +  if (DECL_INTERFACE_KNOWN (fn)
> +  || !TREE_ASM_WRITTEN (fn)
> +  || !DECL_ONE_ONLY (orig_decl)
> +  || symtab->global_info_ready)
> +return;
> +
> +  populate_clone_array (fn, fns);
> +
> +  if (!fns[0] || !fns[1])
> +return;
> +  for (int i = 2 - !fns[2]; i >= 0; --i)
> +if (fns[i] != orig_decl && DECL_INTERFACE_KNOWN (fns[i]))
> +  return;
> +  DECL_INTERFACE_KNOWN (fn) = 1;
> +  comdat_linkage (fn);
> +  if (!can_alias_cdtor (fn))
> +return;
> +  /* For comdat base and complete cdtors put them into the same,
> + *[CD]5* comdat group instead of *[CD][12]*.  */
> +  auto n0 = cgraph_node::get_create (fns[0]);
> +  auto n1 = cgraph_node::get_create (fns[1]);
> +  auto n2 = fns[2] ? cgraph_node::get_create (fns[1]) : NULL;
> +  if (n0->lowered || n1->lowered || (n2 && n2->lowered))
> +return;
> +  import_export_decl (fns[0]);
> +  n1->definition = false;
> +  if (!n0->create_same_body_alias (fns[1], fns[0]))
> +return;
> +
> +  tree comdat_group = cdtor_comdat_group (fns[1], fns[0]);
> +  n1 = cgraph_node::get (fns[1]);
> +  n0->set_comdat_group (comdat_group);
> +  if (n1->same_comdat_group)
> +n1->remove_from_same_comdat_group ();
> +  n1->add_to_same_comdat_group (n0);
> +  if (fns[2])
> +n2->add_to_same_comdat_group (n0);
> +  import_export_decl (fns[1]);

So this is handling the case where the symbol was already inserted into
one comdat group and later

Re: [PATCH] lto/114655 - -flto=4 at link time doesn't override -flto=auto at compile time

2024-04-09 Thread Jan Hubicka
> The following adjusts -flto option processing in lto-wrapper to have
> link-time -flto override any compile time setting.
> 
> LTO-boostrapped on x86_64-unknown-linux-gnu, testing in progress.
> 
> OK for trunk and branches?  GCC 11 seems to be unaffected by this.
> 
> Thanks,
> Richard.
> 
>   PR lto/114655
>   * lto-wrapper.cc (merge_flto_options): Add force argument.
>   (merge_and_complain): Do not force here.
>   (run_gcc): But here to make the link-time -flto option override
>   any compile-time one.
Looks good to me.  I am actually surprised we propagate -flto settings
from compile time at all.  I guess I never tried it since I never
assumed it would work :)

Honza


Re: [PATCH 3/3] tree-optimization/114052 - niter analysis from undefined behavior

2024-04-05 Thread Jan Hubicka
> +   /* When there's a call that might not return the last iteration
> +  is possibly partial.  This matches what we check in invariant
> +  motion.
> +  ???  For the call argument evaluation it would be still OK.  */
> +   if (!may_have_exited
> +   && is_gimple_call (stmt)
> +   && gimple_has_side_effects (stmt))
> + may_have_exited = true;

I think you are missing non-call EH, volatile asms and traps here.
We have stmt_may_terminate_function_p which tests for those.
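Something along these lines (a sketch; the helper's exact signature is
assumed here):

   if (!may_have_exited
       && stmt_may_terminate_function_p (cfun, stmt, false))
     may_have_exited = true;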

Honza
> +
> +   infer_loop_bounds_from_array (loop, stmt,
> + reliable && !may_have_exited);
>  
> -   if (reliable)
> +   if (reliable && !may_have_exited)
>  {
>infer_loop_bounds_from_signedness (loop, stmt);
>infer_loop_bounds_from_pointer_arith (loop, stmt);
>  }
>   }
> -
>  }
>  }
>  
> @@ -4832,7 +4855,7 @@ estimate_numbers_of_iterations (class loop *loop)
>   diagnose those loops with -Waggressive-loop-optimizations.  */
>number_of_latch_executions (loop);
>  
> -  basic_block *body = get_loop_body (loop);
> +  basic_block *body = get_loop_body_in_rpo (cfun, loop);
>auto_vec exits = get_loop_exit_edges (loop, body);
>likely_exit = single_likely_exit (loop, exits);
>FOR_EACH_VEC_ELT (exits, i, ex)
> -- 
> 2.35.3


Re: [PATCH] ipa: Force args obtined through pass-through maps to the expected type (PR 113964)

2024-04-05 Thread Jan Hubicka
> Hi,
> 
> interactions of IPA-CP and IPA-SRA on the same data is a rather big
> source of issues, I'm afraid.  PR 113964 is a situation where IPA-CP
> propagates an unsigned short in a union parameter into a function
> which itself calls a different function which has a same union
> parameter and both these union parameters are split with IPA-SRA.  The
> leaf function however uses a signed short member of the union.
> 
> In the calling function, we get the unsigned constant as the
> replacement for the union and it is then passed in the call without
> any type compatibility checks.  Apparently on riscv64 it matters
> whether the parameter is signed or unsigned short and so the leaf
> function can see different values.
> 
> Fixed by using useless_type_conversion_p at the appropriate place and
> if it fails, use force_value_to type as elsewhere in similar
> situations.
> 
> Bootstrapped and tested on x86_64-linux, the reporter has also run the
> testsuite with this patch on riscv64 and reported in Bugzilla there were
> no issues.
> 
> OK for master and GCC 13?
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2024-04-04  Martin Jambor  
> 
>   PR ipa/113964
>   * ipa-param-manipulation.cc (ipa_param_adjustments::modify_call):
>   Force values obtined through pass-through maps to the expected
>   split type.
> 
> gcc/testsuite/ChangeLog:
> 
> 2024-04-04  Patrick O'Neill  
>   Martin Jambor  
> 
>   PR ipa/113964
>   * gcc.dg/ipa/pr114247.c: New test.
OK,
thanks!
Honza


Re: [PATCH] ipa: Avoid duplicate replacements in IPA-SRA transformation phase

2024-04-04 Thread Jan Hubicka
> > gcc/ChangeLog:
> >
> > 2024-03-15  Martin Jambor  
> >
> > PR ipa/111571
> > * ipa-param-manipulation.cc
> > (ipa_param_body_adjustments::common_initialization): Avoid creating
> > duplicate replacement entries.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2024-03-15  Martin Jambor  
> >
> > PR ipa/111571
> > * gcc.dg/ipa/pr111571.c: New test.
OK,
thanks!
Honza


Re: [PATCH v10 1/2] Add condition coverage (MC/DC)

2024-04-04 Thread Jan Hubicka
> > > diff --git a/gcc/ipa-inline.cc b/gcc/ipa-inline.cc
> > > index dc120e6da5a..9502a21c741 100644
> > > --- a/gcc/ipa-inline.cc
> > > +++ b/gcc/ipa-inline.cc
> > > @@ -682,7 +682,7 @@ can_early_inline_edge_p (struct cgraph_edge *e)
> > >   }
> > > gcc_assert (gimple_in_ssa_p (DECL_STRUCT_FUNCTION (e->caller->decl))
> > > && gimple_in_ssa_p (DECL_STRUCT_FUNCTION (callee->decl)));
> > > -  if (profile_arc_flag
> > > +  if ((profile_arc_flag || condition_coverage_flag)
> > > && ((lookup_attribute ("no_profile_instrument_function",
> > >   DECL_ATTRIBUTES (caller->decl)) == 
> > > NULL_TREE)
> > > != (lookup_attribute ("no_profile_instrument_function",
> > 
> > tree-inline should also copy the cond tags, since always_inline
> > functions are inlined even at -O0.  But I think this can be done
> > incrementally.
> > 
> > Patch is OK.
> > Thanks a lot and sorry for taking so long on this one.
> > Honza
> 
> I guess you mean that since tree-inlining happen after gimplification (does
> it?) that the basic condition -> condition expression association should
> also fold in the now-inlined conditions?

Yes, if you declare a function always_inline, the order of events is as
follows (a small illustration appears after the list):
 1) gimplification of both functions
 2) SSA construction
 3) early inlining, where the inlining happens
 4) profile instrumentation.
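A small illustration (hypothetical source): even at -O0, f below is
inlined into g in step 3, so its condition is instrumented as part of g
in step 4:

  static inline __attribute__ ((always_inline))
  int f (int a, int b) { return a && b; }

  int g (int a, int b) { return f (a, b); }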

Honza
> 
> Thanks for the review - I'll do a final test run and install the patch.


Re: [PATCH v10 1/2] Add condition coverage (MC/DC)

2024-04-04 Thread Jan Hubicka
> gcc/ChangeLog:
> 
>   * builtins.cc (expand_builtin_fork_or_exec): Check
> condition_coverage_flag.
>   * collect2.cc (main): Add -fno-condition-coverage to OBSTACK.
>   * common.opt: Add new options -fcondition-coverage and
> -Wcoverage-too-many-conditions.
>   * doc/gcov.texi: Add --conditions documentation.
>   * doc/invoke.texi: Add -fcondition-coverage documentation.
>   * function.cc (free_after_compilation): Free cond_uids.
>   * function.h (struct function): Add cond_uids.
>   * gcc.cc: Link gcov on -fcondition-coverage.
>   * gcov-counter.def (GCOV_COUNTER_CONDS): New.
>   * gcov-dump.cc (tag_conditions): New.
>   * gcov-io.h (GCOV_TAG_CONDS): New.
>   (GCOV_TAG_CONDS_LENGTH): New.
>   (GCOV_TAG_CONDS_NUM): New.
>   * gcov.cc (class condition_info): New.
>   (condition_info::condition_info): New.
>   (condition_info::popcount): New.
>   (struct coverage_info): New.
>   (add_condition_counts): New.
>   (output_conditions): New.
>   (print_usage): Add -g, --conditions.
>   (process_args): Likewise.
>   (output_intermediate_json_line): Output conditions.
>   (read_graph_file): Read condition counters.
>   (read_count_file): Likewise.
>   (file_summary): Print conditions.
>   (accumulate_line_info): Accumulate conditions.
>   (output_line_details): Print conditions.
>   * gimplify.cc (next_cond_uid): New.
>   (reset_cond_uid): New.
>   (shortcut_cond_r): Set condition discriminator.
>   (tag_shortcut_cond): New.
>   (gimple_associate_condition_with_expr): New.
>   (shortcut_cond_expr): Set condition discriminator.
>   (gimplify_cond_expr): Likewise.
>   (gimplify_function_tree): Call reset_cond_uid.
>   * ipa-inline.cc (can_early_inline_edge_p): Check
> condition_coverage_flag.
>   * ipa-split.cc (pass_split_functions::gate): Likewise.
>   * passes.cc (finish_optimization_passes): Likewise.
>   * profile.cc (struct condcov): New declaration.
>   (cov_length): Likewise.
>   (cov_blocks): Likewise.
>   (cov_masks): Likewise.
>   (cov_maps): Likewise.
>   (cov_free): Likewise.
>   (instrument_decisions): New.
>   (read_thunk_profile): Control output to file.
>   (branch_prob): Call find_conditions, instrument_decisions.
>   (init_branch_prob): Add total_num_conds.
>   (end_branch_prob): Likewise.
>   * tree-core.h (struct tree_exp): Add condition_uid.
>   * tree-profile.cc (struct conds_ctx): New.
>   (CONDITIONS_MAX_TERMS): New.
>   (EDGE_CONDITION): New.
>   (topological_cmp): New.
>   (index_of): New.
>   (single_p): New.
>   (single_edge): New.
>   (contract_edge_up): New.
>   (struct outcomes): New.
>   (conditional_succs): New.
>   (condition_index): New.
>   (condition_uid): New.
>   (masking_vectors): New.
>   (emit_assign): New.
>   (emit_bitwise_op): New.
>   (make_top_index_visit): New.
>   (make_top_index): New.
>   (paths_between): New.
>   (struct condcov): New.
>   (cov_length): New.
>   (cov_blocks): New.
>   (cov_masks): New.
>   (cov_maps): New.
>   (cov_free): New.
>   (find_conditions): New.
>   (struct counters): New.
>   (find_counters): New.
>   (resolve_counter): New.
>   (resolve_counters): New.
>   (instrument_decisions): New.
>   (tree_profiling): Check condition_coverage_flag.
>   (pass_ipa_tree_profile::gate): Likewise.
>   * tree.h (SET_EXPR_UID): New.
>   (EXPR_COND_UID): New.
> 
> libgcc/ChangeLog:
> 
>   * libgcov-merge.c (__gcov_merge_ior): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * lib/gcov.exp: Add condition coverage test function.
>   * g++.dg/gcov/gcov-18.C: New test.
>   * gcc.misc-tests/gcov-19.c: New test.
>   * gcc.misc-tests/gcov-20.c: New test.
>   * gcc.misc-tests/gcov-21.c: New test.
>   * gcc.misc-tests/gcov-22.c: New test.
>   * gcc.misc-tests/gcov-23.c: New test.
> ---
>  gcc/builtins.cc|2 +-
>  gcc/collect2.cc|7 +-
>  gcc/common.opt |9 +
>  gcc/doc/gcov.texi  |   38 +
>  gcc/doc/invoke.texi|   21 +
>  gcc/function.cc|1 +
>  gcc/function.h |4 +
>  gcc/gcc.cc |4 +-
>  gcc/gcov-counter.def   |3 +
>  gcc/gcov-dump.cc   |   24 +
>  gcc/gcov-io.h  |3 +
>  gcc/gcov.cc|  209 ++-
>  gcc/gimplify.cc|  123 +-
>  gcc/ipa-inline.cc  |2 +-
>  gcc/ipa-split.cc   |2 +-
>  gcc/passes.cc  |3 +-
>  gcc/profile.cc |   76 +-
>  gcc/testsuite/g++.dg/gcov/gcov-18.C  

Re: [PATCH v3] tree-profile: Disable indirect call profiling for IFUNC resolvers

2024-04-03 Thread Jan Hubicka
> We can't profile indirect calls to IFUNC resolvers nor their callees as
> it requires TLS which hasn't been set up yet when the dynamic linker is
> resolving IFUNC symbols.
> 
> Add an IFUNC resolver caller marker to cgraph_node and set it if the
> function is called by an IFUNC resolver.  Disable indirect call profiling
> for IFUNC resolvers and their callees.
> 
> Tested with profiledbootstrap on Fedora 39/x86-64.
> 
> gcc/ChangeLog:
> 
>   PR tree-optimization/114115
>   * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes.
>   (cgraph_node): Add called_by_ifunc_resolver.
>   * cgraphunit.cc (symbol_table::compile): Call
>   symtab_node::check_ifunc_callee_symtab_nodes.
>   * symtab.cc (check_ifunc_resolver): New.
>   (ifunc_ref_map): Likewise.
>   (is_caller_ifunc_resolver): Likewise.
>   (symtab_node::check_ifunc_callee_symtab_nodes): Likewise.
>   * tree-profile.cc (gimple_gen_ic_func_profiler): Disable indirect
>   call profiling for IFUNC resolvers and their callees.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/114115
>   * gcc.dg/pr114115.c: New test.
> +/* { dg-final { scan-tree-dump-not "__gcov_indirect_call_profiler_v" 
> "optimized" } } */
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index aed13e2b1bc..373dbd60481 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -520,7 +520,10 @@ gimple_gen_ic_func_profiler (void)
>gcall *stmt1;
>tree tree_uid, cur_func, void0;
>  
> -  if (c_node->only_called_directly_p ())
> +  /* Disable indirect call profiling for an IFUNC resolver and its
> + callees.  */
Please add a comment here referring to the PR and the need to have TLS
initialized.
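E.g. something like (suggested wording only, not the committed text):

  /* Do not instrument indirect calls in an IFUNC resolver or its callees:
     the profiler uses TLS, which has not been set up yet when the dynamic
     linker is resolving IFUNC symbols.  See PR 114115.  */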

OK with that change.
Honza


Re: PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-04-02 Thread Jan Hubicka
> > I am a bit worried about commonly used functions getting "infected" by
> > being called once from an IFUNC resolver.  I think we only use
> > thread-local storage for indirect call profiling, so we may just disable
> > indirect call profiling for these functions.
> 
> Will change it.
> 
> > Also the patch will be a no-op with -flto -flto-partition=max, so probably
> > we need to compute this flag at WPA time and stream it to partitions.
> >
> 
> Why is it a nop with -flto -flto-partition=max? I got
> 
> (gdb) bt
> #0  symtab_node::check_ifunc_callee_symtab_nodes ()
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/symtab.cc:1440
> #1  0x00e487d3 in symbol_table::compile (this=0x7fffea006000)
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/cgraphunit.cc:2320
> #2  0x00d23ecf in lto_main ()
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/lto/lto.cc:687
> #3  0x015254d2 in compile_file ()
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:449
> #4  0x015284a4 in do_compile ()
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2154
> #5  0x01528864 in toplev::main (this=0x7fffd84a, argc=16,
> argv=0x42261f0) at 
> /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2310
> #6  0x030a3fe2 in main (argc=16, argv=0x7fffd958)
> at /export/gnu/import/git/gitlab/x86-gcc/gcc/main.cc:39
> 
> Do you have a testcase to show that it is a nop?
Aha, sorry.  I thought this was run during late optimization, but it is
done early, so LTO partitioning does not mix things up.  So the current
patch, modified to disable only the instrumentation that needs TLS,
should be fine.

Honza
> 
> -- 
> H.J.


Re: PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-04-02 Thread Jan Hubicka
> On Tue, Mar 5, 2024 at 1:45 PM H.J. Lu  wrote:
> >
> > We can't instrument an IFUNC resolver nor its callees as it may require
> > TLS which hasn't been set up yet when the dynamic linker is resolving
> > IFUNC symbols.
> >
> > Add an IFUNC resolver caller marker to cgraph_node and set it if the
> > function is called by an IFUNC resolver.  Update tree_profiling to skip
> > functions called by IFUNC resolver.
> >
> > Tested with profiledbootstrap on Fedora 39/x86-64.
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/114115
> > * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes.
> > (cgraph_node): Add called_by_ifunc_resolver.
> > * cgraphunit.cc (symbol_table::compile): Call
> > symtab_node::check_ifunc_callee_symtab_nodes.
> > * symtab.cc (check_ifunc_resolver): New.
> > (ifunc_ref_map): Likewise.
> > (is_caller_ifunc_resolver): Likewise.
> > (symtab_node::check_ifunc_callee_symtab_nodes): Likewise.
> > * tree-profile.cc (tree_profiling): Do not instrument an IFUNC
> > resolver nor its callees.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/114115
> > * gcc.dg/pr114115.c: New test.
> 
> PING.

I am a bit worried about commonly used functions getting "infected" by
being called once from an IFUNC resolver.  I think we only use
thread-local storage for indirect call profiling, so we may just disable
indirect call profiling for these functions.

Also the patch will be a no-op with -flto -flto-partition=max, so probably
we need to compute this flag at WPA time and stream it to partitions.

Honza


Re: [PATCH] profile-count: Avoid overflows into uninitialized [PR112303]

2024-03-28 Thread Jan Hubicka
Hi,
so what goes wrong with the testcase is the fact that after recursive
inlining we have a large loop nest and consequently an invalid profile,
since every loop is predicted to iterate quite a lot.  rebuild_frequencies
should take care of the problem, but it doesn't, since there is:
  if (freq_max < 16)
    freq_max = 16;
Removing this check solves the testcase.  Looking at how it went in, I made
it in 2017 when dropping the original code that scaled into the range 0...1:
https://gcc.gnu.org/pipermail/gcc-patches/2017-November/488115.html

I have no recollection of inventing that check, but I suppose one can
argue that we do not want to scale most of the CFG to 0, since the branch
prediction is likely wrong and we do not know whether the code with an
unrealistic BB profile is important at all.  So perhaps it is safer to
cap rather than scale most of the function body to 0.

profile_count arithmetic is indeed supposed to be saturating; it is bad
that I managed to miss the check for such a common operation as + :(
> 
> 2024-03-27  Jakub Jelinek  
> 
>   PR tree-optimization/112303
>   * profile-count.h (profile_count::operator+): Perform
>   addition in uint64_t variable and set m_val to MIN of that
>   val and max_count.
>   (profile_count::operator+=): Likewise.
>   (profile_count::operator-=): Formatting fix.

These two changes are OK.
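For reference, a sketch of the fixed addition per the ChangeLog above
(names assumed from context):

  uint64_t sum = m_val + other.m_val;  /* cannot wrap: both fit in 61 bits */
  ret.m_val = MIN (sum, max_count);    /* saturate instead of spilling into
                                          the uninitialized encoding */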

In apply_probability
> @@ -1127,7 +1129,9 @@ public:
>if (!initialized_p ())
>   return uninitialized ();
>profile_count ret;
> -  ret.m_val = RDIV (m_val * prob, REG_BR_PROB_BASE);
> +  uint64_t tmp;
> +  safe_scale_64bit (m_val, prob, REG_BR_PROB_BASE, &tmp);
> +  ret.m_val = MIN (tmp, max_count);

This exists only for old code that still uses the REG_BR_PROB_BASE integer.
A valid prob always satisfies prob <= REG_BR_PROB_BASE :)
So we need safe_scale_64bit to watch for overflow, but the result does not
need the MIN.

>ret.m_quality = MIN (m_quality, ADJUSTED);
>return ret;
>  }
> @@ -1145,7 +1149,7 @@ public:
>uint64_t tmp;
>safe_scale_64bit (m_val, prob.m_val, 
> profile_probability::max_probability,
>   );
> -  ret.m_val = tmp;
> +  ret.m_val = MIN (tmp, max_count);

Same here, the MIN is unnecessary.

OK with this change.

Thanks for looking into this,
Honza


Re: [PATCH] profile-count: Avoid overflows into uninitialized [PR112303]

2024-03-28 Thread Jan Hubicka
> __attribute__((noipa)) void
> bar (void)
> {
>   __builtin_exit (0);
> }
> 
> __attribute__((noipa)) void
> foo (void)
> {
>   for (int i = 0; i < 1000; ++i)
>   for (int j = 0; j < 1000; ++j)
>   for (int k = 0; k < 1000; ++k)
>   for (int l = 0; l < 1000; ++l)
>   for (int m = 0; m < 1000; ++m)
>   for (int n = 0; n < 1000; ++n)
>   for (int o = 0; o < 1000; ++o)
>   for (int p = 0; p < 1000; ++p)
>   for (int q = 0; q < 1000; ++q)
>   for (int r = 0; r < 1000; ++r)
>   for (int s = 0; s < 1000; ++s)
>   for (int t = 0; t < 1000; ++t)
>   for (int u = 0; u < 1000; ++u)
>   for (int v = 0; v < 1000; ++v)
>   for (int w = 0; w < 1000; ++w)
>   for (int x = 0; x < 1000; ++x)
>   for (int y = 0; y < 1000; ++y)
>   for (int z = 0; z < 1000; ++z)
>   for (int a = 0; a < 1000; ++a)
>   for (int b = 0; b < 1000; ++b)
> bar ();
> }
> 
> int
> main ()
> {
>   foo ();
> }
> reaches the maximum count already on the 11th loop.

This should not happen - this is the reason why we estimate in sreals and
convert to profile_counts only later.  In this case we should push down
the profile_count of the entry block (to 0):

  freq_max = 0;
  FOR_EACH_BB_FN (bb, cfun)
if (freq_max < BLOCK_INFO (bb)->frequency)
  freq_max = BLOCK_INFO (bb)->frequency;

  /* Scaling frequencies up to maximal profile count may result in
 frequent overflows especially when inlining loops.
 Small scalling results in unnecesary precision loss.  Stay in
 the half of the (exponential) range.  */
  freq_max = (sreal (1) << (profile_count::n_bits / 2)) / freq_max;
  if (freq_max < 16)
freq_max = 16;

I am looking into what goes wrong here.
Honza


Re: RE: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-03-18 Thread Jan Hubicka
> Hello,
> 
> Le 22/02/2024 à 19:29, Anbazhagan, Karthiban a écrit :
> (...)
> >  gcc/config/i386/{znver4.md => zn4zn5.md}  | 858 +-
> 
> looks like the patch pushed to master lost the file rename.
> I get a bootstrap failure caused by the missing zn4zn5.md file.
> 
> Can you have a look?
Aha, sorry.
I did a reset of the git commit since there was a formatting error in the
changelog.  I will fix it shortly.

Honza


Re: [PATCH] ipa: Fix C++ member ptr indirect inlining (PR 114254, PR 108802)

2024-03-18 Thread Jan Hubicka
> gcc/ChangeLog:
> 
> 2024-03-06  Martin Jambor  
> 
>   PR ipa/108802
>   PR ipa/114254
>   * ipa-prop.cc (ipa_get_stmt_member_ptr_load_param): Fix case looking
>   at COMPONENT_REFs directly from a PARM_DECL, also recognize loads from
>   a pointer parameter.
>   (ipa_analyze_indirect_call_uses): Also recognize loads from a pointer
>   parameter, also recognize the case when pfn pointer is loaded in its
>   own BB.
> 
> gcc/testsuite/ChangeLog:
> 
> 2024-03-06  Martin Jambor  
> 
>   PR ipa/108802
>   PR ipa/114254
>   * g++.dg/ipa/iinline-4.C: New test.
>   * g++.dg/ipa/pr108802.C: Likewise.

OK,
thanks!

Honza


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-15 Thread Jan Hubicka
> > +  if (POINTER_TYPE_P (TREE_TYPE (t1)))
> > +{
> > +  if (SSA_NAME_PTR_INFO (t1))
> > +   {
> > + if (!SSA_NAME_PTR_INFO (t2)
> > + || SSA_NAME_PTR_INFO (t1)->align != SSA_NAME_PTR_INFO (t2)->align
> > + || SSA_NAME_PTR_INFO (t1)->misalign != SSA_NAME_PTR_INFO (t2)->misalign)
> 
> You want to compare SSA_NAME_PTR_INFO ()->pt.zero as well I think since
> we store pointer non-null-ness from VRP there.
> 
> You are not comparing the actual points-to info - but of course I'd
> expect that to differ as soon as local decls are involved.  Still looks
> like a hole to me.

I convinced myself that we don't need to do that, since we recompute PTA
after the IPA stage: unlike value ranges, it is thrown away and recomputed
rather than just refined from the previous solution.  But indeed, if parts
of it are persistent, we need to compare them (and should stream them for
LTO, I guess).

Honza


Re: [PATCH] Fix PR ipa/113996

2024-03-14 Thread Jan Hubicka
> > Patch is still OK, but ipa-ICF will only identify the functions if the
> > static chain is unused.  Perhaps just picking the winning candidate to be
> > the version without a static chain, and making ipa-inline not ICE when a
> > call with a static chain lands on a function with no static chain, would
> > help us optimize better.
> 
> I see, thanks for the explanation.  The attached patch appears to work.
> 
> 
>   PR ipa/113996
>   * ipa-icf.h (sem_function): Add static_chain_p member.
>   * ipa-icf.cc (sem_function::init): Initialize it.
>   (sem_item_optimizer::merge_classes): If the class is made of
>   functions, pick one without static chain as the target.
> 
> -- 
> Eric Botcazou


> @@ -3399,11 +3401,22 @@ sem_item_optimizer::merge_classes (unsigned int 
> prev_class_count,
>  
>   sem_item *source = c->members[0];
>  
> - if (DECL_NAME (source->decl)
> - && MAIN_NAME_P (DECL_NAME (source->decl)))
> -   /* If merge via wrappers, picking main as the target can be
> -  problematic.  */
> -   source = c->members[1];
> + if (source->type == FUNC)
> +   {
> + /* Pick a member without static chain, if any.  */
> + for (unsigned int j = 0; j < c->members.length (); j++)
> +   if (!static_cast (c->members[j])->static_chain_p)
> + {
> +   source = c->members[j];
> +   break;
> + }
> +
> + /* If merge via wrappers, picking main as the target can be
> +problematic.  */
> + if (DECL_NAME (source->decl)
> + && MAIN_NAME_P (DECL_NAME (source->decl)))
> +   source = c->members[source == c->members[0] ? 1 : 0];

Thanks for looking into this.
I wonder if it can happen that we ICF-merge a function with a static chain
and main?
We could fix this unlikely case by simply considering functions with a
static chain not equivalent to the MAIN_NAME_P function.

On x86_64 the static chain goes in R10, but I wonder if we support targets
where the ABI of a function taking a static chain (and not using it)
differs from the ABI of a function with no static chain?
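
(For illustration, a minimal example of a function that needs a static
chain, using the GNU C nested-function extension; sketch only:)

/* 'inner' reads 'x' from the enclosing frame, so calls to it pass a
   static chain pointer (R10 on x86_64) besides the normal arguments.  */
int
outer (int x)
{
  int inner (void) { return x + 1; }
  return inner ();
}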

Honza


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-14 Thread Jan Hubicka
> > Otherwise
> > I will add your testcase for this patch and commit this one.
> > Statistically we almost never merge functions with different value
> > ranges (three in testsuite, 0 during bootstrap, 1 during LTO bootstrap
> > and probably few in LLVM build - there are 15 cases reported, but some
> > are false positives caused by hash function conflicts).
> 
> It is mostly the fnsplit functions, sure, because we don't really drop
> the __builtin_unreachable checks before IPA and so the if (cond)
> __builtin_unreachable (); style assertions or [[assume (cond)]]; still
> result in ICF failures.

Actually, in the LLVM build they are not split functions, but functions
where value ranges were determined by early VRP based on code optimized
out by the time we reach ipa-icf (the __equal thing from libstdc++ I
wrote about in the previous email).
Honza


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-14 Thread Jan Hubicka
> > We have wrong code with LTO, too.
> 
> I know.
> 
> > The problem is that IPA passes (and
> > not only those; loop analysis too) do their analysis at compile time
> > (with the value ranges in) and stream the info separately.
> 
> And that is desirable, because otherwise it simply couldn't derive any
> ranges.
> 
> >  Removal of value ranges
> > (either by LTO or by your patch) happens between computing these
> > summaries and using them, so this can be used to trigger wrong code,
> > sadly.
> 
> Yes.  But with LTO, I don't see how the IPA ICF comparison whether
> two functions are the same or not could be done with the
> SSA_NAME_{RANGE,PTR}_INFO in, otherwise it could only ICF merge functions
> from the same TUs.  So the comparison IMHO (and the assert checks in my
> patch prove that) is done when the SSA_NAME_{RANGE,PTR}_INFO aren't in
> anymore.  So, one just needs to compare and punt or union whatever
> is or could be influenced in the IPA streamed data from the ranges etc.
> And because one has to do it for LTO, doing it for non-LTO should be
> sufficient too.

Sorry, this was a bit of a misunderstanding: I thought you still considered
the original patch to be the full fix, while I thought I should look into it
more and dig out more issues.  This is a bit of a can of worms.  Overall I
think the plan is:

This stage4
1) fix VR divergences by either comparing or dropping them
2) fix jump function differences by comparing them
   (I just constructed a testcase showing that jump functions can
   diverge for other reasons than just VR, so this is a deeper problem,
   too)
3) try to construct additional wrong-code testcases (finite_p
   divergences, trapping)
Next stage1
4) implement VR and PTR info streaming
5) make ICF compare VRs and punt otherwise
6) implement an optimize_size feature for ICF that will not punt on
   diverging TBAA or value ranges and will do merging instead.
   We need some extra infrastructure for that: being able to keep the
   maps between basic blocks and SSA names from the comparison stage to
   the merging stage.
7) measure how effective ICF becomes in optimize_size mode and implement
   heuristics on how much metadata merging we want to do with -O2/-O3 as
   well.
   Based on older data Martin collected for his thesis, ICF used to be
   much more effective on LibreOffice than it is today, so hopefully we
   can recover 10-15% binary size improvements here.
8) General ICF TLC.  There are many half-finished things that have been
   sitting in this pass for a while (such as merging with different BB or
   stmt orders).

I am attaching the compare patch, which also fixes the original wrong
code.  If you prefer your version, just go ahead and commit it.  Otherwise
I will add your testcase to this patch and commit this one.
Statistically we almost never merge functions with different value
ranges (three in the testsuite, 0 during bootstrap, 1 during LTO bootstrap
and probably a few in the LLVM build - there are 15 cases reported, but
some are false positives caused by hash function collisions).

Honza

gcc/ChangeLog:

* ipa-icf-gimple.cc (func_checker::compare_ssa_name): Compare value 
ranges.

diff --git a/gcc/ipa-icf-gimple.cc b/gcc/ipa-icf-gimple.cc
index 8c2df7a354e..37c416d777d 100644
--- a/gcc/ipa-icf-gimple.cc
+++ b/gcc/ipa-icf-gimple.cc
@@ -39,9 +39,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "attribs.h"
 #include "gimple-walk.h"
+#include "value-query.h"
+#include "value-range-storage.h"
 
 #include "tree-ssa-alias-compare.h"
 #include "ipa-icf-gimple.h"
 
 namespace ipa_icf_gimple {
 
@@ -109,6 +116,35 @@ func_checker::compare_ssa_name (const_tree t1, const_tree t2)
   else if (m_target_ssa_names[i2] != (int) i1)
 return false;
 
+  if (POINTER_TYPE_P (TREE_TYPE (t1)))
+{
+  if (SSA_NAME_PTR_INFO (t1))
+   {
+ if (!SSA_NAME_PTR_INFO (t2)
+ || SSA_NAME_PTR_INFO (t1)->align != SSA_NAME_PTR_INFO (t2)->align
+ || SSA_NAME_PTR_INFO (t1)->misalign != SSA_NAME_PTR_INFO (t2)->misalign)
+   return false;
+   }
+  else if (SSA_NAME_PTR_INFO (t2))
+   return false;
+}
+  else
+{
+  if (SSA_NAME_RANGE_INFO (t1))
+   {
+ if (!SSA_NAME_RANGE_INFO (t2))
+   return false;
+ Value_Range r1 (TREE_TYPE (t1));
+ Value_Range r2 (TREE_TYPE (t2));
+ SSA_NAME_RANGE_INFO (t1)->get_vrange (r1, TREE_TYPE (t1));
+ SSA_NAME_RANGE_INFO (t2)->get_vrange (r2, TREE_TYPE (t2));
+ if (r1 != r2)
+   return false;
+   }
+  else if (SSA_NAME_RANGE_INFO (t2))
+   return false;
+}
+
   if (SSA_NAME_IS_DEFAULT_DEF (t1))
 {
   tree b1 = SSA_NAME_VAR (t1);


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-14 Thread Jan Hubicka
> > int test (int a)
> > {
> > return a>0 ? CST1:  CST2;
> > }
> > 
> > gets same hash value no matter what CST1/CST2 is.  I added hasher and I
> > am re-running stats.
> 
> The hash should be commutative here at least.

It needs to match what the comparator is doing later, and sadly the
comparator does not try to find the right order of basic blocks.  I added
a TODO and will fix it next stage1.

With the attached patch there are no comparisons with mismatching VRP
ranges in a non-LTO bootstrap, and there are 4 instances of mismatched
jump functions in an LTO bootstrap.

I checked them and these are functions that are still different and
would not be unified anyway.  In the testsuite we have:
 gcc.c-torture/execute/builtin-prefetch-4.c
   this matches
   [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
   [irange] long unsigned int [0, 2147483646][18446744071562067968, +INF]
   in postdec_glob_idx/22new vs predec_glob_idx/20

   The functions are different, but have precisely the same statements;
   they just differ in SSA names, which we do not hash.
 c-c++-common/builtins.c
   [irange] long unsigned int [2, 2147483647] MASK 0x7ffe VALUE 0x0
   [irange] long unsigned int [8, 2147483647] MASK 0x7ff8 VALUE 0x0
   in bar.part.0/10new  foo.part.0/9

   This is a real mismatch
 23_containers/vector/cons/2.cc
   [irange] long unsigned int [1, 9223372036854775807] MASK 0x7fff VALUE 0x0
   [irange] long unsigned int [1, 9223372036854775807] MASK 0xfff8 VALUE 0x0
   static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp, _Alloc>::_S_do_relocate(pointer, pointer, pointer, _Tp_alloc_type&, std::true_type) [with _Tp = A; _Alloc = std::allocator >]/3482
   static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp, _Alloc>::_S_do_relocate(pointer, pointer, pointer, _Tp_alloc_type&, std::true_type) [with _Tp = double; _Alloc = std::allocator]/3372

   Here functions are different initially but are optimized to the same body
   while we keep info about ranges from optimized-out code.

   I tried to make an easy testcase, but

   int a;
   a = (char)a;

   is not optimized away; I wonder why, when for all valid values this
   is a no-op?

 25_algorithms/lexicographical_compare/uchar.cc
   [irange] long unsigned int [8, +INF] MASK 0xfff8 VALUE 0x0
   [irange] long unsigned int [4, +INF] MASK 0xfffc VALUE 0x0
   static constexpr bool std::__equal::equal(const _Tp*, const _Tp*, const _Tp*) [with _Tp = long int]/13669
   static constexpr bool std::__equal::equal(const _Tp*, const _Tp*, const _Tp*) [with _Tp = int]/13663
 std/ranges/conv/1.cc
   [irange] long unsigned int [8, +INF] MASK 0xfff8 VALUE 0x0
   [irange] long unsigned int [4, +INF] MASK 0xfffc VALUE 0x0
   static constexpr bool std::__equal::equal(const _Tp*, const _Tp*, const _Tp*) [with _Tp = long int]/13757
   static constexpr bool std::__equal::equal(const _Tp*, const _Tp*, const _Tp*) [with _Tp = int]/13751

   Those two also happen in the LLVM build.  We try to merge two functions
   which take pointer parameters of different types.

boolD.2783 std::__equal::equalD.426577 (const intD.9 * __first1D.433380, const intD.9 * __last1D.433381, const intD.9 * __first2D.433382)
{ 
  const intD.9 * __first1_4(D) = __first1D.433380;
  const intD.9 * __last1_3(D) = __last1D.433381;
  const intD.9 * __first2_6(D) = __first2D.433382;
  long intD.12 _1;
  boolD.2783 _2;
  long unsigned intD.16 _7;
  boolD.2783 _8;
  intD.9 _9;

  _1 = __last1_3(D) - __first1_4(D);
  if (_1 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

  # RANGE [irange] long unsigned int [4, +INF] MASK 0xfffc VALUE 0x0
  _7 = (long unsigned intD.16) _1;

  Compared to

boolD.2783 std::__equal::equalD.426685 (const long intD.12 * __first1D.433472, const long intD.12 * __last1D.433473, const long intD.12 * __first2D.433474)
{
  const long intD.12 * __first1_4(D) = __first1D.433472;
  const long intD.12 * __last1_3(D) = __last1D.433473;
  const long intD.12 * __first2_6(D) = __first2D.433474;
  long intD.12 _1;
  boolD.2783 _2;
  long unsigned intD.16 _7;
  boolD.2783 _8;
  intD.9 _9;

  _1 = __last1_3(D) - __first1_4(D);
  if (_1 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

  # RANGE [irange] long unsigned int [8, +INF] MASK 0xfff8 VALUE 0x0
  _7 = (long unsigned intD.16) _1;

   This looks like a potentially dangerous situation: if we can derive a
   range from the pointed-to type, then ICF needs to match the types too.

   However with
   #include 
   size_t diff (long *a, long *b)
   {
 long int d = (b - a) * sizeof (long);
 if (!d)
   abort ();
 return d;
   }
   size_t diff2 (int *a, int *b)
   {
 long int d = (b - a) * 

Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-13 Thread Jan Hubicka
> On Wed, Mar 13, 2024 at 10:55:07AM +0100, Jan Hubicka wrote:
> > > > So the ipa_jump_func are I think the only thing that actually can differ
> > > > on the ICF merging candidates from value range POV.
> > > 
> > > I agree.  Btw, I would have approved the original patch in this
> > > thread that wipes SSA_NAME_INFO in merged bodies to mimic what LTO
> > > effectively does right now.  That also looks most sensible to
> > > backport.
> > > 
> > > But I'll defer to Honza in the end (but also want to point out we
> > > need something suitable for backporting).
> > 
> > My main worry here is that I tried to relax matching of IL metadata in
> > the past and it triggers interesting problems.  (I implemented TBAA
> > merging and one needs to match additional things in summaries and loop
> > structures)
> 
> The point of the patch is that it emulates what happens with LTO (though,
> does that only for successful ICF merges), because LTO streaming ignores
> SSA_NAME_{RANGE,PTR}_INFO.
> So, because with LTO all you have in the IL is the m_vr in jump_tables,
> pure/const analysis results, whatever else might be derived on the side
> from the range info, you need to punt or union all that information anyway,
> otherwise it will misbehave with LTO.
> So punting on SSA_NAME_{RANGE,PTR}_INFO differences instead of throwing it
> away means that non-LTO will get fewer ICF merges than LTO unnecessarily,
> it doesn't improve anything for the code correctness at least for the LTO
> case.

We have wrong code with LTO, too.  The problem is that IPA passes (and
not only those; loop analysis too) do their analysis at compile time
(with the value ranges in) and stream the info separately.  Removal of
value ranges (either by LTO or by your patch) happens between computing
these summaries and using them, so this can be used to trigger wrong
code, sadly.

Honza
> 
>   Jakub
> 


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-13 Thread Jan Hubicka
> On Tue, 12 Mar 2024, Jakub Jelinek wrote:
> 
> > On Tue, Mar 12, 2024 at 05:21:58PM +0100, Jakub Jelinek wrote:
> > > On Tue, Mar 12, 2024 at 10:46:42AM +0100, Jan Hubicka wrote:
> > > > I am sorry for delaying this.  I made a variant that simply compares
> > > > the value ranges of functions and prevents merging if they diverge,
> > > > and wanted to gather some bigger statistics.  This made me notice some
> > > > performance problems with clang and libstdc++ RB-trees, which derailed
> > > > me from the original PR.  I will finish the statistics today.
> > > 
> > > With the posted patch, perhaps if we don't want to union jump_tables etc.,
> > > all we could punt on is differences in the jump_table VRs rather than just
> > > any SSA_NAME_RANGE_INFO differences.
> > 
> > To expand on this, I think we need to either union or punt on jump_func
> > differences in any case, because for LTO we can't really punt on
> > SSA_NAME_RANGE_INFO differences given that we don't stream that out and in.

I noticed that yesterday too (I added my jump function testcase to the
testsuite and it fails with -flto, too).  I implemented a comparator for
them as well and ran the stats.  There were over 3000 functions in the
bootstrap where we ran into differences in value ranges, and about 150k
in the LLVM build.

Inspecting random examples showed that these are usually false positives
(pairs of functions that are different but trigger a hash collision),
caused by the fact that we do not hash PHI arguments, so code like

int test (int a)
{
return a>0 ? CST1:  CST2;
}

gets the same hash value no matter what CST1/CST2 are.  I added a hasher
and am re-running the stats.
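
(For the curious, roughly the kind of hashing that was missing; a sketch
against GCC internals, assuming the ICF hashing context provides the
basic block bb and the incremental hash state hstate - not the committed
change:)

  /* Mix integer-constant PHI arguments into the function hash so the
     two variants of test () above hash differently.  */
  for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
       gsi_next (&gsi))
    {
      gphi *phi = gsi.phi ();
      for (unsigned i = 0; i < gimple_phi_num_args (phi); i++)
	{
	  tree arg = gimple_phi_arg_def (phi, i);
	  if (TREE_CODE (arg) == INTEGER_CST && tree_fits_shwi_p (arg))
	    hstate.add_hwi (tree_to_shwi (arg));
	}
    }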

> > So the ipa_jump_func are I think the only thing that actually can differ
> > on the ICF merging candidates from value range POV.
> 
> I agree.  Btw, I would have approved the original patch in this
> thread that wipes SSA_NAME_INFO in merged bodies to mimic what LTO
> effectively does right now.  That also looks most sensible to
> backport.
> 
> But I'll defer to Honza in the end (but also want to point out we
> need something suitable for backporting).

My main worry here is that I tried to relax matching of IL metadata in
the past and it triggers interesting problems.  (I implemented TBAA
merging and one needs to match additional things in summaries and loop
structures)

If a value range differs at IPA analysis time, we need to be sure that we
compare everything that possibly depends on it.  So it is always safer to
just compare more than to try to merge, which is what we do in all cases
so far.  Here, for the first time, the problem is with LTO streaming
missing the data, though.

Thinking more about it, I wonder if different value ranges can be
exploited to cause different decisions about a function being finite
(confusing pure/const) or a different outcome of alias analysis yielding
different aggregate jump functions (confusing ipa-prop).

I will try to build testcases today.
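
(A hypothetical shape of such a testcase - my sketch, not something from
this thread: if one ICF candidate carries a parity fact proving n even
while the other does not, a finiteness-based pure/const decision could
diverge between otherwise identical bodies.)

unsigned
maybe_finite (unsigned n)
{
  unsigned s = 0;
  /* The loop only terminates when n is even; proving finiteness
     therefore needs the recorded range/parity information on n.  */
  for (unsigned i = 0; i != n; i += 2)
    s += i;
  return s;
}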
Honza
> 
> Richard.


Re: [PATCH] Fix PR ipa/113996

2024-03-12 Thread Jan Hubicka
> 
> 
> On 3/11/24 4:38 PM, Eric Botcazou wrote:
> > Hi,
> > 
> > this is a regression present on all active branches: the attached Ada 
> > testcase
> > triggers an assertion failure when compiled with -O2 -gnatp -flto:
> > 
> >/* Initialize the static chain.  */
> >p = DECL_STRUCT_FUNCTION (fn)->static_chain_decl;
> >gcc_assert (fn != current_function_decl);
> >if (p)
> >  {
> >/* No static chain?  Seems like a bug in tree-nested.cc.  */
> >gcc_assert (static_chain);  <--- here
> > 
> >setup_one_parameter (id, p, static_chain, fn, bb, );
> >  }
> > 
> > The problem is that the ICF pass identifies two functions, one of which has 
> > a
> > static chain but the other does not.  The proposed fix is just to prevent 
> > this
> > identification from occurring.
> > 
> > Tested on x86-64/Linux, OK for all active branches?
> > 
> > 
> > 2024-03-11  Eric Botcazou  
> > 
> > PR ipa/113996
> > * ipa-icf.h (sem_function): Add static_chain_present member.
> > * ipa-icf.cc (sem_function::get_hash): Hash it.
> > (sem_function::equals_wpa): Compare it.
> > (sem_function::equals_private): Likewise.
> > (sem_function::init): Initialize it.
> > 
> > 
> > 2024-03-11  Eric Botcazou  
> > 
> > * gnat.dg/lto27.adb: New test.
> OK.
Patch is still OK, but ipa-ICF will only identify the functions if the
static chain is unused.  Perhaps just picking the winning candidate to be
the version without a static chain, and making ipa-inline not ICE when a
call with a static chain lands on a function with no static chain, would
help us optimize better.

Also IPA-SRA can optimize out unused static chains and ipa-prop can
propagate across them.

Honza
> jeff
> 


Re: Patch ping Re: [PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-03-12 Thread Jan Hubicka
> Hi!
Hi,
> 
> On Thu, Feb 15, 2024 at 08:29:24AM +0100, Jakub Jelinek wrote:
> > 2024-02-15  Jakub Jelinek  
> > 
> > PR middle-end/113907
> > * ipa-icf.cc (sem_item_optimizer::merge_classes): Reset
> > SSA_NAME_RANGE_INFO and SSA_NAME_PTR_INFO on successfully ICF merged
> > functions.
> > 
> > * gcc.dg/pr113907.c: New test.
> 
> I'd like to ping the
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645644.html
> patch.
> After looking at this PR again yesterday, I'm convinced we don't need
> either this patch or some other patch to deal with the jump functions,
> but both, this patch to clear (or other variant would be to union them)
> SSA_NAME_RANGE_INFO in the bodies of IPA-ICF merged functions, and another
> patch that would maybe in sem_function::merge go through all the
> callees cgraph edges and for each of them attempt to merge (for value
> range union) at least the value range/pointer alignment information from
> the corresponding cgraph edge from alias for all jump functions of all
> the arguments.
> 
> Bootstrapped/regtested again last night.

I am sorry for delaying this.  I made a variant that simply compares
the value ranges of functions and prevents merging if they diverge,
and wanted to gather some bigger statistics.  This made me notice some
performance problems with clang and libstdc++ RB-trees, which derailed
me from the original PR.  I will finish the statistics today.

For next stage1 we definitely want to move ahead with merging metadata
(not only value ranges, but hopefully also TBAA).  Currently the only
thing we merge aggressively is the edge profile (and we have a PR about
merging functions which differ only in the likely/unlikely attribute).
However, for backportability, avoiding the merging may be the safer
solution here, depending on the stats.

I was looking into ipa-vrp usability and there are several tests in
Firefox talos where ipa-prop VRP makes a difference.  It boils down to
propagating the fact that a parameter is always non-NULL.  This is
commonly tested in C++ casts.
Honza


Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-03-11 Thread Jan Hubicka
> [Public]
> 
> 
> Hi all,
> 
> 
> 
> PFA, the patch that enables support for the next generation AMD Zen5 CPU via 
> -march=znver5 with basic znver5 scheduler Model.
> 
> We may update the scheduler model going forward.
> 
> 
> 
> Good for trunk?
> 
> Thanks and Regards
> Karthiban
> 
> 
> Patch is inline here.
> From 6230938c1420604c8d0af27b0d080970d9b54ac5 Mon Sep 17 00:00:00 2001
> From: karthiban 
> karthiban.anbazha...@amd.com
> Date: Fri, 9 Feb 2024 15:03:09 +0530
> Subject: [PATCH] Add AMD znver5 processor enablement with scheduler model
> 
> gcc/ChangeLog:
> * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5.
> * common/config/i386/i386-common.cc (processor_names): Add znver5.
> (processor_alias_table): Likewise.
> * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen
> family.
> (processor_subtypes): Add znver5.
> * config.gcc (x86_64-*-* |...): Likewise.
> * config/i386/driver-i386.cc (host_detect_local_cpu): Let
> march=native detect znver5 cpu's.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5.
> * config/i386/i386-options.cc (m_ZNVER5): New definition
> (processor_cost_table): Add znver5.
> * config/i386/i386.cc (ix86_reassociation_width): Likewise.
> * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5
> (PTA_ZNVER5): New definition.
> * config/i386/i386.md (define_attr "cpu"): Add znver5.
> (Scheduling descriptions) Add znver5.md.
> * config/i386/x86-tune-costs.h (znver5_cost): New definition.
> * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5.
> (ix86_adjust_cost): Likewise.
> * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5.
> (avx512_store_by_pieces): Add m_ZNVER5.
> * doc/extend.texi: Add znver5.
> * doc/invoke.texi: Likewise.
> * config/i386/znver5.md: New.
> 
> gcc/testsuite/ChangeLog:
> * g++.target/i386/mv29.C: Handle znver5 arch.
> * gcc.target/i386/funcspec-56.inc:Likewise.

Hi,
I went through the scheduler description and found some places that can
be commonized.  Most frequently it is the vector-path instructions, which
block all execution units, so the patterns can be shared between znver3
and znver5 (blocking the new units for znver3 does not change anything,
since they are not used anyway).  The insn automaton growth is now about
5%, which I hope is acceptable.  I tried a completely separate model and
it was about 7%.

I plan to commit the patch tomorrow if there are no further ideas for
improvement.

Honza

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index a595ee537a8..017a952a5db 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -310,6 +310,22 @@ get_amd_cpu (struct __processor_model *cpu_model,
  cpu_model->__cpu_subtype = AMDFAM19H_ZNVER3;
}
   break;
+case 0x1a:
+  cpu_model->__cpu_type = AMDFAM1AH;
+  if (model <= 0x77)
+   {
+ cpu = "znver5";
+ CHECK___builtin_cpu_is ("znver5");
+ cpu_model->__cpu_subtype = AMDFAM1AH_ZNVER5;
+   }
+  else if (has_cpu_feature (cpu_model, cpu_features2,
+   FEATURE_AVX512VP2INTERSECT))
+   {
+ cpu = "znver5";
+ CHECK___builtin_cpu_is ("znver5");
+ cpu_model->__cpu_subtype = AMDFAM1AH_ZNVER5;
+   }
+  break;
 default:
   break;
 }
diff --git a/gcc/common/config/i386/i386-common.cc 
b/gcc/common/config/i386/i386-common.cc
index c35191e6925..f814df8385b 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -2166,7 +2166,8 @@ const char *const processor_names[] =
   "znver1",
   "znver2",
   "znver3",
-  "znver4"
+  "znver4",
+  "znver5"
 };
 
 /* Guarantee that the array is aligned with enum processor_type.  */
@@ -2435,6 +2436,9 @@ const pta processor_alias_table[] =
   {"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4,
 PTA_ZNVER4,
 M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F},
+  {"znver5", PROCESSOR_ZNVER5, CPU_ZNVER5,
+PTA_ZNVER5,
+M_CPU_SUBTYPE (AMDFAM1AH_ZNVER5), P_PROC_AVX512F},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC,
 PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
   | PTA_SSSE3 | PTA_SSE4A | PTA_ABM | PTA_CX16 | PTA_PRFCHW
diff --git a/gcc/common/config/i386/i386-cpuinfo.h 
b/gcc/common/config/i386/i386-cpuinfo.h
index 2ee7470c8da..73131657eab 100644
--- a/gcc/common/config/i386/i386-cpuinfo.h
+++ b/gcc/common/config/i386/i386-cpuinfo.h
@@ -63,6 +63,7 @@ enum processor_types
   INTEL_SIERRAFOREST,
   INTEL_GRANDRIDGE,
   INTEL_CLEARWATERFOREST,
+  AMDFAM1AH,
   CPU_TYPE_MAX,
   BUILTIN_CPU_TYPE_MAX = CPU_TYPE_MAX
 };
@@ -104,6 +105,7 @@ enum processor_subtypes
   INTEL_COREI7_ARROWLAKE_S,
   INTEL_COREI7_PANTHERLAKE,
   

Re: [PATCH] ipa: Avoid excessive removing of SSAs (PR 113757)

2024-03-07 Thread Jan Hubicka
> On Thu, Feb 08 2024, Martin Jambor wrote:
> > Hi,
> >
> > PR 113757 shows that the code which was meant to debug-reset and
> > remove SSAs defined by LHSs of calls redirected to
> > __builtin_unreachable can trigger also when speculative
> > devirtualization creates a call to a noreturn function (and since it
> > is noreturn, it does not bother dealing with its return value).
> >
> > What is more, it seems that the code handling this case is not really
> > necessary.  I feel slightly idiotic about this because I have a
> > feeling that I added it because of a failing test-case but I can
> > neither find the testcase nor a reason why the code in
> > cgraph_edge::redirect_call_stmt_to_callee would not be sufficient (it
> > turns the SSA name into a default-def, a bit like IPA-SRA, but any
> > code dominated by a call to a noreturn is not dangerous when it comes
> > to its side-effects).  So this patch just removes the handling.
> >
> > Bootstrapped and tested on x86_64-linux and ppc64le-linux.  I have also
> > LTO-bootstrapped and LTO-profilebootstrapped the patch on x86_64-linux.
> >
> > OK for master?
> >
> > Thanks,
> >
> > Martin
> >
> >
> > gcc/ChangeLog:
> >
> > 2024-02-07  Martin Jambor  
> >
> > PR ipa/113757
> > * tree-inline.cc (redirect_all_calls): Remove code adding SSAs to
> > id->killed_new_ssa_names.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2024-02-07  Martin Jambor  
> >
> > PR ipa/113757
> > * g++.dg/ipa/pr113757.C: New test.
OK,
thanks!
Honza
> > ---
> >  gcc/testsuite/g++.dg/ipa/pr113757.C | 14 ++
> >  gcc/tree-inline.cc  | 14 ++
> >  2 files changed, 16 insertions(+), 12 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.dg/ipa/pr113757.C
> >
> > diff --git a/gcc/testsuite/g++.dg/ipa/pr113757.C 
> > b/gcc/testsuite/g++.dg/ipa/pr113757.C
> > new file mode 100644
> > index 000..885d4010a10
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/ipa/pr113757.C
> > @@ -0,0 +1,14 @@
> > +// { dg-do compile }
> > +// { dg-options "-O2 -fPIC" }
> > +// { dg-require-effective-target fpic }
> > +
> > +long size();
> > +struct ll {  virtual int hh();  };
> > +ll  *slice_owner;
> > +int ll::hh() { __builtin_exit(0); }
> > +int nn() {
> > +  if (size())
> > +return 0;
> > +  return slice_owner->hh();
> > +}
> > +int (*a)() = nn;
> > diff --git a/gcc/tree-inline.cc b/gcc/tree-inline.cc
> > index 75c10eb7dfc..cac41b4f031 100644
> > --- a/gcc/tree-inline.cc
> > +++ b/gcc/tree-inline.cc
> > @@ -2984,23 +2984,13 @@ redirect_all_calls (copy_body_data * id, 
> > basic_block bb)
> >gimple *stmt = gsi_stmt (si);
> >if (is_gimple_call (stmt))
> > {
> > - tree old_lhs = gimple_call_lhs (stmt);
> >   struct cgraph_edge *edge = id->dst_node->get_edge (stmt);
> >   if (edge)
> > {
> >   if (!id->killed_new_ssa_names)
> > id->killed_new_ssa_names = new hash_set (16);
> > - gimple *new_stmt
> > -   = cgraph_edge::redirect_call_stmt_to_callee (edge,
> > -   id->killed_new_ssa_names);
> > - if (old_lhs
> > - && TREE_CODE (old_lhs) == SSA_NAME
> > - && !gimple_call_lhs (new_stmt))
> > -   /* In case of IPA-SRA removing the LHS, the name should have
> > -  been already added to the hash.  But in case of redirecting
> > -  to builtin_unreachable it was not and the name still should
> > -  be pruned from debug statements.  */
> > -   id->killed_new_ssa_names->add (old_lhs);
> > + cgraph_edge::redirect_call_stmt_to_callee (edge,
> > +   id->killed_new_ssa_names);
> >  
> >   if (stmt == last && id->call_stmt && maybe_clean_eh_stmt (stmt))
> > gimple_purge_dead_eh_edges (bb);
> > -- 
> > 2.43.0


Re: [PATCH] i386: Guard noreturn no-callee-saved-registers optimization with -mnoreturn-no-callee-saved-registers [PR38534]

2024-02-29 Thread Jan Hubicka
> On Thu, Feb 29, 2024 at 03:15:30PM +0100, Jan Hubicka wrote:
> > I am not wedded to the idea (it just appeared to me as an alternative to
> > disabling this optimization by default).  I still think it may make sense.
> 
> Maybe I misunderstood your idea.
> So, you are basically suggesting to go in a completely opposite direction
> from H.J.'s changes, instead of saving less in noreturn prologues, save
> more, most likely not change anything in code generation on the caller's
> side (nothing is kept alive across visible unconditional noreturn calls)
> and maybe after some time use normally caller-saved registers in say call
> frame debug info for possible use in DW_OP_entry_value (but in that case it
> is an ABI change) and improve debuggability of vars which live in normally
> caller-saved registers right before a noreturn call in that frame?
> What about registers in which function arguments are passed?

Hmm, you are right - I got it backwards. 
> 
> If it is done as a change with no changes at all on the caller side and
> just on the callee side, it could again be guarded by some option (not sure
> what the default would be) where the user could increase debuggability of
> the noreturn caller (in that case always necessarily just a single immediate
> one; while not doing the callee saved register saves improves debuggability
> in perhaps multiple routines in the call stack, depending on which registers
> wouldn't be saved and which registers are used in each of the caller frames;
> e.g. noreturn function could save/not save all of %rbx, %r1[2345], one
> caller somewhere use %r12, another %rbx and %r13, yet another one %r14 and
> %r15).

I am not sure how much practical value this would have, but in any case
it is independent of the patch under discussion.

Sorry for the confusion,
Honza
> 
>   Jakub
> 


Re: [PATCH] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-02-29 Thread Jan Hubicka
> > I am worried about a scenario where the ifunc selector calls a function
> > foo defined locally and foo is also used from other places, possibly in
> > hot loops.
> > >
> > > > So it is not a really reliable fix (though I guess it will work for a
> > > > lot of common code).  I wonder what the alternatives would be.  In
> > > > GCC-generated profiling code we use TLS only for indirect call
> > > > profiling (so there is no need to turn off the rest of the profiling).
> > > > I wonder if there is any chance to make it not segfault when it is
> > > > done before TLS is set up?
> > >
> > > IFUNC selector should make minimum external calls, none is preferred.
> >
> > Edge profiling only inserts (atomic) 64-bit increments of counters.
> > If the target supports these operations inline, no external calls will
> > be emitted.
> >
> > Indirect call profiling inserts the problematic TLS variable (to track
> > caller-callee pairs). Value profiling also inserts various additional
> > external calls to counters.
> >
> > I am perfectly fine with disabling instrumentation for ifunc selectors
> > and functions only reachable from them, but I am worried about callees
> > also used from non-ifunc paths.
> 
> Programmers need to understand not to do it.

It would help to have this documented.  Should we warn when an ifunc
resolver calls an external function, a comdat, or a function also
reachable from non-ifunc code?
> 
> > For example, a selector implemented in C++ may do some string handling to
> > match the CPU name, and the propagation will disable profiling for std::string
> 
> On x86, they should use CPUID, not string functions.
> 
> > member functions (which may not be effective if the comdat section is
> > prevailed from another translation unit).
> 
> String functions may lead to external function calls which is dangerous.
> 
> > > Any external calls may lead to issues at run-time.  It is a very bad idea
> > > to profile IFUNC selector via external function call.
> >
> > Looking at https://sourceware.org/glibc/wiki/GNU_IFUNC
> > there are other limitations on ifunc besides profiling, such as
> > -fstack-protector-all.  So perhaps your propagation can be used to
> > disable those features as well.
> 
> So, it may not be tree-profile specific.  Where should these 2 bits
> be added?

If we want to disable other transforms too, then I think having a bit in
cgraph_node for reachability from an ifunc resolver makes sense.
I would still do the cycle detection using an on-the-side hash_map, to
avoid polluting the global data structure.
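
(A minimal sketch of that side-table idea against GCC internals; the
names are invented for illustration and this is not committed code:)

  /* Propagation state kept outside symtab_node.  */
  enum ifunc_reach { IFUNC_UNKNOWN, IFUNC_WIP, IFUNC_REACHABLE,
		     IFUNC_NOT_REACHABLE };
  static hash_map<cgraph_node *, ifunc_reach> *ifunc_reach_map;

  static ifunc_reach &
  ifunc_reach_state (cgraph_node *node)
  {
    if (!ifunc_reach_map)
      ifunc_reach_map = new hash_map<cgraph_node *, ifunc_reach> (16);
    bool existed;
    ifunc_reach &state = ifunc_reach_map->get_or_insert (node, &existed);
    if (!existed)
      state = IFUNC_UNKNOWN;
    return state;
  }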

Thanks,
Honza
> 
> > "Unfortunately there are actually a lot of restrictions placed on IFUNC
> > usage which aren't entirely clear and the documentation needs to be
> > updated." makes me wonder what other transformations are potentially
> > dangerous.
> >
> > Honza
> 
> 
> -- 
> H.J.


Re: [PATCH] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-02-29 Thread Jan Hubicka
> On Thu, Feb 29, 2024 at 5:39 AM Jan Hubicka  wrote:
> >
> > > We can't instrument an IFUNC resolver nor its callees as it may require
> > > TLS which hasn't been set up yet when the dynamic linker is resolving
> > > IFUNC symbols.  Add an IFUNC resolver caller marker to symtab_node to
> > > avoid recursive checking.
> > >
> > > gcc/ChangeLog:
> > >
> > >   PR tree-optimization/114115
> > >   * cgraph.h (enum ifunc_caller): New.
> > >   (symtab_node): Add has_ifunc_caller.
> > Unless we have users outside of tree-profile, I think it is better to
> > avoid adding extra data to cgraph_node.  One can use a node->get_uid()
> > indexed hash set to save the two bits needed for propagation.
> > >   * tree-profile.cc (check_ifunc_resolver): New.
> > >   (is_caller_ifunc_resolver): Likewise.
> > >   (tree_profiling): Don't instrument an IFUNC resolver nor its
> > >   callees.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   PR tree-optimization/114115
> > >   * gcc.dg/pr114115.c: New test.
> >
> > The problem with this approach is that tracking callees of ifunc
> > resolvers will stop on the translation unit boundary and also with
> > indirect calls.  Also, while the ifunc resolver itself is called only
> > once, its callees may also be used from performance-critical code.
> 
> IFUNC selector shouldn't have any external dependencies which
> can cause issues at run-time.

I am worried about a scenario where the ifunc selector calls a function
foo defined locally and foo is also used from other places, possibly in
hot loops.
> 
> > So it is not a really reliable fix (though I guess it will work for a
> > lot of common code).  I wonder what the alternatives would be.  In
> > GCC-generated profiling code we use TLS only for indirect call
> > profiling (so there is no need to turn off the rest of the profiling).
> > I wonder if there is any chance to make it not segfault when it is
> > done before TLS is set up?
> 
> IFUNC selector should make minimum external calls, none is preferred.

Edge profiling only inserts (atomic) 64-bit increments of counters.
If the target supports these operations inline, no external calls will
be emitted.
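
(Conceptually, the whole per-edge update then amounts to something like
the following; an illustration with stand-in names, since the real pass
emits GIMPLE rather than C:)

/* Stand-in for the per-function counter section.  */
extern long long __gcov_ctr[];

static inline void
count_edge (unsigned edge_id)
{
  __atomic_fetch_add (&__gcov_ctr[edge_id], 1, __ATOMIC_RELAXED);
}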

Indirect call profiling inserts the problematic TLS variable (to track
caller-callee pairs). Value profiling also inserts various additional
external calls to counters.

I am perfectly fine with disabling instrumentation for ifunc selectors
and functions only reachable from them, but I am worried about callees
also used from non-ifunc paths.

For example, a selector implemented in C++ may do some string handling to
match the CPU name, and the propagation will disable profiling for std::string
member functions (which may not be effective if the comdat section is
prevailed from another translation unit).

> Any external calls may lead to issues at run-time.  It is a very bad idea
> to profile IFUNC selector via external function call.

Looking at https://sourceware.org/glibc/wiki/GNU_IFUNC
there are other limitations on ifunc besides profiling, such as
-fstack-protector-all.  So perhaps your propagation can be used to
disable those features as well.

"Unfortunately there are actually a lot of restrictions placed on IFUNC
usage which aren't entirely clear and the documentation needs to be
updated." makes me wonder what other transformations are potentially
dangerous.

Honza


Re: [PATCH] i386: Guard noreturn no-callee-saved-registers optimization with -mnoreturn-no-callee-saved-registers [PR38534]

2024-02-29 Thread Jan Hubicka
> On Thu, Feb 29, 2024 at 02:31:05PM +0100, Jan Hubicka wrote:
> > I agree that the debuggability of user core dumps is important here.
> >
> > I guess an ideal solution would be to change the codegen of noreturn
> > functions to callee-save all registers.  The performance of a noreturn
> > function's prologue is not too important.  Then we could stop
> > caller-saving registers and still get reasonable backtraces.
> 
> I don't think that is possible.
> While both C and C++ require that if [[noreturn]] attribute is used on
> some function declaration, it must be used on the first declaration and
> also if some function is [[noreturn]] in one TU, it must be [[noreturn]]
> in all other TUs which declare the same function.
> But, we have no such requirement for __attribute__((noreturn)), there it
> is a pure optimization, it can be declared just on the caller side as an
> optimization hint the function will not return, or just on the callee side
> where the compiler will actually verify it doesn't return, or both.
> And, the attribute is not part of function type, so even in standard C/C++,
> one can use
> extern void bar ();
> [[noreturn]] void foo ()
> {
>   for (;;) bar ();
> }
> void (*fn) () = foo;
> void baz ()
> {
>   fn ();
> }
> As you can call the noreturn function directly or indirectly, changing
> calling conventions based on noreturn vs. no-noreturn is IMHO not possible.

I am not wedded to the idea (it just appeared to me as an alternative to
disabling this optimization by default).  I still think it may make sense.

Making noreturn callees save the caller-saved registers is compatible
with the default ABI.  If noreturn is missing on the caller side, the
caller will save registers as usual.  The noreturn callee will save them
again, which is pointless, but everything should work as usual, and the
extra cost of the saving should not matter in practice.  This is also the
case for an indirect call of a noreturn function where the annotation is
missing on the caller side.
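
(A hypothetical illustration of that one-sided case; the annotation is
only on the definition, so the caller keeps its usual saves and the
callee saving them again is merely redundant:)

/* TU 1 - only the definition carries the attribute.  */
__attribute__ ((noreturn)) void fatal (const char *msg)
{
  (void) msg;
  __builtin_abort ();
}

/* TU 2 - plain prototype; codegen on the caller side is unchanged.  */
extern void fatal (const char *msg);

int
check (int x)
{
  if (x < 0)
    fatal ("negative");
  return x;
}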

If noreturn is missing on the callee side, we will lose information on
function arguments in backtraces, but the code will still work
(especially if we save the BP register to keep the code backtraceable).
This is a scenario that can probably be avoided in practice where it
matters (such as glibc's abort, whose implementation is annotated).

Noreturn already leads to some information loss in backtraces.  I tend to
be surprised from time to time to see the wrong call to abort due to tail
merging.  So it may be acceptable to lose info in a situation where the
user does a silly thing and only annotates the caller.

Since we auto-detect noreturn, we may need to be extra careful about
noreturn comdats.  Here the auto-detection of the prevailing def may have
a different outcome than the auto-detection of the prevailed defs, so we
may want to disable the optimization for auto-detected comdats.

Honza
> 
>   Jakub
> 


Re: [PATCH] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-02-29 Thread Jan Hubicka
> We can't instrument an IFUNC resolver nor its callees as it may require
> TLS which hasn't been set up yet when the dynamic linker is resolving
> IFUNC symbols.  Add an IFUNC resolver caller marker to symtab_node to
> avoid recursive checking.
> 
> gcc/ChangeLog:
> 
>   PR tree-optimization/114115
>   * cgraph.h (enum ifunc_caller): New.
>   (symtab_node): Add has_ifunc_caller.
Unless we have users outside of tree-profile, I think it is better to
avoid adding extra data to cgraph_node.  One can use a node->get_uid()
indexed hash set to save the two bits needed for propagation.
>   * tree-profile.cc (check_ifunc_resolver): New.
>   (is_caller_ifunc_resolver): Likewise.
>   (tree_profiling): Don't instrument an IFUNC resolver nor its
>   callees.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/114115
>   * gcc.dg/pr114115.c: New test.

The problem with this approach is that tracking callees of ifunc
resolvers will stop on the translation unit boundary and also with
indirect calls.  Also, while the ifunc resolver itself is called only
once, its callees may also be used from performance-critical code.

So it is not a really reliable fix (though I guess it will work for a
lot of common code).  I wonder what the alternatives would be.  In
GCC-generated profiling code we use TLS only for indirect call
profiling (so there is no need to turn off the rest of the profiling).
I wonder if there is any chance to make it not segfault when it is
done before TLS is set up?

Honza
> ---
>  gcc/cgraph.h| 18 +++
>  gcc/testsuite/gcc.dg/pr114115.c | 24 +
>  gcc/tree-profile.cc | 92 +
>  3 files changed, 134 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/pr114115.c
> 
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 47f35e8078d..ce99f4a5114 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -100,6 +100,21 @@ enum symbol_partitioning_class
> SYMBOL_DUPLICATE
>  };
>  
> +/* Classification whether a function has any IFUNC resolver caller.  */
> +enum ifunc_caller
> +{
> +  /* It is unknown if this function has any IFUNC resolver caller.  */
> +  IFUNC_CALLER_UNKNOWN,
> +  /* Work in progress to check if this function has any IFUNC resolver
> + caller.  */
> +  IFUNC_CALLER_WIP,
> +  /* This function has at least an IFUNC resolver caller, including
> + itself.  */
> +  IFUNC_CALLER_TRUE,
> +  /* This function doesn't have any IFUNC resolver caller.  */
> +  IFUNC_CALLER_FALSE
> +};
> +
>  /* Base of all entries in the symbol table.
> The symtab_node is inherited by cgraph and varpol nodes.  */
>  struct GTY((desc ("%h.type"), tag ("SYMTAB_SYMBOL"),
> @@ -121,6 +136,7 @@ public:
>used_from_other_partition (false), in_other_partition (false),
>address_taken (false), in_init_priority_hash (false),
>need_lto_streaming (false), offloadable (false), ifunc_resolver (false),
> +  has_ifunc_caller (IFUNC_CALLER_UNKNOWN),
>order (false), next_sharing_asm_name (NULL),
>previous_sharing_asm_name (NULL), same_comdat_group (NULL), ref_list 
> (),
>alias_target (NULL), lto_file_data (NULL), aux (NULL),
> @@ -595,6 +611,8 @@ public:
>/* Set when symbol is an IFUNC resolver.  */
>unsigned ifunc_resolver : 1;
>  
> +  /* Classification whether a function has any IFUNC resolver caller.  */
> +  ENUM_BITFIELD (ifunc_caller) has_ifunc_caller : 2;
>  
>/* Ordering of all symtab entries.  */
>int order;
> diff --git a/gcc/testsuite/gcc.dg/pr114115.c b/gcc/testsuite/gcc.dg/pr114115.c
> new file mode 100644
> index 000..2629f591877
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr114115.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O0 -fprofile-generate -fdump-tree-optimized" } */
> +/* { dg-require-profiling "-fprofile-generate" } */
> +/* { dg-require-ifunc "" } */
> +
> +void *foo_ifunc2() __attribute__((ifunc("foo_resolver")));
> +
> +void bar(void)
> +{
> +}
> +
> +static int f3()
> +{
> +  bar ();
> +  return 5;
> +}
> +
> +void (*foo_resolver(void))(void)
> +{
> +  f3();
> +  return bar;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "__gcov_indirect_call_profiler_v" "optimized" } } */
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index aed13e2b1bc..46478648b32 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -738,6 +738,72 @@ include_source_file_for_profile (const char *filename)
>return false;
>  }
>  
> +/* Return true and set *DATA to true if NODE is an ifunc resolver.  */
> +
> +static bool
> +check_ifunc_resolver (cgraph_node *node, void *data)
> +{
> +  if (node->ifunc_resolver)
> +{
> +  bool *is_ifunc_resolver = (bool *) data;
> +  *is_ifunc_resolver = true;
> +  return true;
> +}
> +  return false;
> +}
> +
> +/* Return true if any caller of NODE is an ifunc resolver.  */
> +
> +static bool
> +is_caller_ifunc_resolver (cgraph_node *node)
> +{
> +  if 

Re: [PATCH] i386: Guard noreturn no-callee-saved-registers optimization with -mnoreturn-no-callee-saved-registers [PR38534]

2024-02-29 Thread Jan Hubicka
> >
> > The problem is that it doesn't help in this case.
> > If some optimization makes debugging of some function harder, normally it is
> > enough to recompile the translation unit that defines it with -O0/-Og, or
> > add optimize attribute on the function.
> > While in this case, the optimization interferes with debugging of other
> > functions, not necessarily from the same translation unit, not necessarily
> > even from the same library or binary, or even from the same package.
> > As I tried to explain, supposedly glibc abort is compiled with -O2 and needs
> > a lot of registers, so say it uses all of %rbx, %rbp, %r12, %r13, %r14,
> > %r15 and this optimization is applied on it.  That means debugging of any
> > application (-O2, -Og or even -O0 compiled) to understand what went wrong
> > and why it aborted will be harder.  Including core file analysis.
> > Recompiling those apps with -O0/-Og will not help.  The only thing that
> > would help is to recompile glibc with -O0/-Og.
> > Doesn't have to be abort, doesn't have to be glibc.  Any library which
> > exports some noreturn APIs may be affected.
> > And there is not even a workaround other than to recompile with -O0/-Og the
> > noreturn functions, no way to disable this optimization.
> >
> > Given that most users just will not be aware of this, even adding the option
> > but defaulting to on would mean a problem for a lot of users.  Most of them
> > will not know the problem is that some noreturn function 10 frames deep in
> > the call stack was optimized this way.
> >
> > If people only call the noreturn functions from within the same package,
> > for some strange reason care about performance of noreturn functions (they
> > don't return, so unless you longjmp out of them or something similar
> > which is costly on its own already, they should be entered exactly once)
> > and are willing to pay the price in worse debugging in that case, let them
> > use the option.  But if they provide libraries that other packages then
> > consume, I'd say it wouldn't be a good idea.
> 
> +1
> 
> I'll definitely patch this by-default behavior out if we as upstream keep it.
> Debugging customer core dumps is more important than optimizing
> glibc abort/assert.
> 
> I do hope such patch will be at least easy, like flipping the default of an
> option.

I agree that the debuggability of user core dumps is important here.

I guess an ideal solution would be to change the codegen of noreturn
functions to callee-save all registers.  The performance of a noreturn
function's prologue is not too important.  Then we could stop
caller-saving registers and still get reasonable backtraces.

This is essentially an ABI change (though a rather conservative one,
since nothing except debugging really depends on it).  Perhaps it would
make sense to make the optimization a non-default option now and also
implement the callee-save logic.  Then we should be able to flip the
default a release or two later.  Maybe synchronization with LLVM would be
desirable here if we decide to go this route.

Honza
> 
> Richard.
> 
> > Jakub
> >


Re: Patch ping^2

2024-02-26 Thread Jan Hubicka
> Hi!
> 
> I'd like to ping 2 patches:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644580.html
>
> PR113617 P1 - Handle private COMDAT function symbol reference in readonly
> data section
>
> More details in the
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/thread.html#644121
> and
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/thread.html#644486
> threads.

I went through the thread - it looks like an issue that has been latent
for quite some time.  It belongs to cgraph/symtab maintenance, so I think
I can OK this patch.

Honza
> 
> and
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645326.html
> i386: Enable _BitInt support on ia32
> 
> all the FAILs mentioned in that mail have been fixed by now.
> 
> Thanks
> 
>   Jakub
> 


Re: [PATCH v9 2/2] Add gcov MC/DC tests for GDC

2024-02-22 Thread Jan Hubicka
> This is a mostly straight port from the gcov-19.c tests from the C test
> suite. The only notable differences from C to D are that D flips the
> true/false outcomes for loop headers, and the D front end ties loop and
> ternary conditions to slightly different locus.
> 
> The test for >64 conditions warning is disabled as it either needs
> support from the testing framework or a something similar to #pragma GCC
> diagnostic push to not cause a test failure from detecting a warning.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gdc.dg/gcov.exp: New test.
>   * gdc.dg/gcov1.d: New test.

I have never written anything in D, so I would prefer Iain to take a look,
but the translation seems direct enough, so I think the patch is OK.
Honza


Re: [PATCH v9 1/2] Add condition coverage (MC/DC)

2024-02-22 Thread Jan Hubicka
Hello,
> This patch adds support in gcc+gcov for modified condition/decision
> coverage (MC/DC) with the -fcondition-coverage flag. MC/DC is a type of
> test/code coverage and it is particularly important for safety-critical
> applicaitons in industries like aviation and automotive. Notably, MC/DC
> is required or recommended by:
> 
> * DO-178C for the most critical software (Level A) in avionics.
> * IEC 61508 for SIL 4.
> * ISO 26262-6 for ASIL D.
> 
> From the SQLite webpage:
> 
> Two methods of measuring test coverage were described above:
> "statement" and "branch" coverage. There are many other test
> coverage metrics besides these two. Another popular metric is
> "Modified Condition/Decision Coverage" or MC/DC. Wikipedia defines
> MC/DC as follows:
> 
> * Each decision tries every possible outcome.
> * Each condition in a decision takes on every possible outcome.
> * Each entry and exit point is invoked.
> * Each condition in a decision is shown to independently affect
>   the outcome of the decision.
> 
> In the C programming language where && and || are "short-circuit"
> operators, MC/DC and branch coverage are very nearly the same thing.
> The primary difference is in boolean vector tests. One can test for
> any of several bits in bit-vector and still obtain 100% branch test
> coverage even though the second element of MC/DC - the requirement
> that each condition in a decision take on every possible outcome -
> might not be satisfied.
> 
> https://sqlite.org/testing.html#mcdc
> 
> MC/DC comes in different flavours, the most important being unique
> cause MC/DC and masking MC/DC - this patch implements masking MC/DC,
> which is works well with short circuiting semantics, and according to
> John Chilenski's "An Investigation of Three Forms of the Modified
> Condition Decision Coverage (MCDC) Criterion" (2001) is as good as
> unique cause at catching bugs.
> 
> Whalen, Heimdahl, and De Silva "Efficient Test Coverage Measurement for
> MC/DC" describes an algorithm for determining masking from an AST walk,
> but my algorithm figures this out from analyzing the control flow graph.
> The CFG is considered a binary decision diagram and an evaluation
> becomes a path through the BDD, which is recorded. Certain paths will
> mask ("null out") the contribution from earlier path segments, which can
> be determined by finding short circuit endpoints. Masking is the
> short circuiting of terms in the reverse-ordered Boolean function, and
> the masked terms do not affect the decision like short-circuited
> terms do not affect the decision.
> 
> A tag/discriminator mapping from gcond->uid is created during
> gimplification and made available through the function struct. The
> values are unimportant as long as basic conditions constructed from a
> single Boolean expression are given the same identifier. This happens in
> the breaking down of ANDIF/ORIF trees, so the coverage generally works
> well for frontends that create such trees.
> 
> Like Whalen et al this implementation records coverage in fixed-size
> bitsets which gcov knows how to interpret. This takes only a few bitwise
> operations per condition and is very fast, but comes with a limit on the
> number of terms in a single boolean expression; the number of bits in a
> gcov_unsigned_type (which is usually typedef'd to uint64_t). For most
> practical purposes this is acceptable, and by default a warning will be
> issued if gcc cannot instrument the expression.  This is a practical
> limitation in the implementation, not the algorithm, so support for more
> conditions can be added by also introducing arbitrary-sized bitsets.
> 
> In action it looks pretty similar to the branch coverage. The -g short
> opt carries no significance, but was chosen because it was available,
> with the upper-case variant free too.
> 
> gcov --conditions:
> 
> 3:   17:void fn (int a, int b, int c, int d) {
> 3:   18:if ((a && (b || c)) && d)
> conditions covered 3/8
> condition  0 not covered (true false)
> condition  1 not covered (true)
> condition  2 not covered (true)
> condition  3 not covered (true)
> 1:   19:x = 1;
> -:   20:else
> 2:   21:x = 2;
> 3:   22:}
> 
> gcov --conditions --json-format:
> 
> "conditions": [
> {
> "not_covered_false": [
> 0
> ],
> "count": 8,
> "covered": 3,
> "not_covered_true": [
> 0,
> 1,
> 2,
> 3
> ]
> }
> ],
> 
> Expressions with constants may be heavily rewritten before they reach
> gimplification, so a construct like int x = a ? 0 : 1 becomes
> _x = (_a == 0). From the source you would expect coverage, but it gets
> neither branch nor condition coverage. The same applies to expressions
> like int x = 1 || a, which are simply replaced by a constant.
> 
> 
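
For reference, a hedged illustration of the masking described above (my
example, not from the patch): for the decision a && b, the three calls
below should be enough for full condition coverage under
-fcondition-coverage, because the a == 0 run masks the contribution of b.

    int decide (int a, int b) { return a && b; }

    int
    main (void)
    {
      decide (1, 1);   /* decision true; both conditions seen true.  */
      decide (0, 1);   /* a false -> decision false; b is masked.  */
      decide (1, 0);   /* b false -> decision false.  */
      return 0;
    }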

Re: [PATCH] profile-count: Don't dump through a temporary buffer [PR111960]

2024-02-22 Thread Jan Hubicka
> Hi!
> 
> The profile_count::dump (char *, struct function * = NULL) const;
> method has a single caller, the
> profile_count::dump (FILE *f, struct function *fun) const;
> method and for that going through a temporary buffer is just slower
> and opens doors for buffer overflows, which is exactly why this P1
> was filed.
> The buffer size is 64 bytes, the previous maximum
> "%" PRId64 " (%s)"
> would print up to 61 bytes in there (19 bytes for arbitrary uint64_t:61
> bitfield printed as signed, "estimated locally, globally 0 adjusted"
> i.e. 38 bytes longest %s and 4 other characters).
> Now, after the r14-2389 changes, it can be
> 19 + 38 plus 11 other characters + %.4f, which is worst case
> 309 chars before decimal point, decimal point and 4 digits after it,
> so total 382 bytes.
> 
> So, either we could bump the buffer[64] to buffer[400], or the following
> patch just drops the indirection through buffer and prints it directly to
> stream.  After all, having APIs which fill in some buffer without passing
> down the size of the buffer is just asking for buffer overflows over time.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Thanks for fixing it!
I believe the original reason why dump had the buffer argument was
pretty printing, which we do differently now.  Fully agree that the
fixed-size buffer is ugly.
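
For illustration, the endorsed pattern as a minimal sketch (illustrative
names, not GCC's actual interface): print straight to the stream, so there
is no intermediate char buf[64] whose size can be exceeded as the format
grows over time.

    #include <inttypes.h>
    #include <stdio.h>

    static void
    dump_count (FILE *f, int64_t value, const char *quality)
    {
      /* No fixed-size scratch buffer; the stream handles the length.  */
      fprintf (f, "%" PRId64 " (%s)", value, quality);
    }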

Honza


Re: [PATCH] ipa: Convert lattices from pure array to vector (PR 113476)

2024-02-20 Thread Jan Hubicka
> On Tue, Feb 13 2024, Martin Jambor wrote:
> > On Mon, Feb 12 2024, Jan Hubicka wrote:
> >>> Believe it or not, even though I have re-worked the internals of the
> >>> lattices completely, the array itself is older than my involvement with
> >>> GCC (or at least with ipa-cp.c ;-).
> >>> 
> >>> So it being an array and not a vector is historical coincidence, as far
> >>> as I am concerned :-).  But that may be the reason, or because vector
> >>> macros at that time looked scary, or perhaps the initialization by
> >>> XCNEWVEC zeroing everything out was considered attractive (I kind of
> >>> like that but constructors would probably be cleaner), I don't know.
> >>
> >> If your class is no longer a POD, then the clearing before construction
> >> is dead and GCC may optimize it out.  So fixing this may solve some
> >> surprises in the foreseeable future when we will try to compile older
> >> GCCs with newer ones.
> >>
> >
> > That's a good point.  I'll prepare a patch converting the whole thing to
> > use constructors and vectors.
> >
> 
> In PR 113476 we have discovered that ipcp_param_lattices is no longer
> a POD and should be destructed.  In a follow-up discussion it
> transpired that their initialization done by memsetting their backing
> memory to zero is also invalid because now any write there before
> construction can be considered dead.  Plus that having them in an
> array is a little bit old-school and does not get the extra checking
> offered by vector along with automatic construction and destruction
> when necessary.
> 
> So this patch converts the array to a vector.  That however means that
> ipcp_param_lattices cannot be just a forward declared type but must be
> known to all code that deal with ipa_node_params and thus to all code
> that includes ipa-prop.h.  Therefore I have moved ipcp_param_lattices
> and the type it depends on to a new header ipa-cp.h which now
> ipa-prop.h depends on.  Because we have the (IMHO not a very wise)
> rule that headers don't include what they need themselves, I had to
> add inclusions of ipa-cp.h and sreal.h (on which it depends) to very
> many files, which made the patch rather ugly.
> 
> Bootstrapped and tested on x86_64-linux.  I also had it checked by our
> script which builds more than a hundred of cross-compilers, so other
> targets are hopefully also fine.
> 
> OK for master?
> 
> Martin
> 
> 
> gcc/lto/ChangeLog:
> 
> 2024-02-16  Martin Jambor  
> 
>   * lto-common.cc: Include sreal.h and ipa-cp.h.
>   * lto-partition.cc: Include ipa-cp.h, move inclusion of sreal higher.
>   * lto.cc: Include sreal.h and ipa-cp.h.
> 
> gcc/ChangeLog:
> 
> 2024-02-16  Martin Jambor  
> 
>   * ipa-prop.h (ipa_node_params): Convert lattices to a vector, adjust
>   initializers in the constructor.
>   (ipa_node_params::~ipa_node_params): Release lattices as a vector.
>   * ipa-cp.h: New file.
>   * ipa-cp.cc: Include sreal.h and ipa-cp.h.
>   (ipcp_value_source): Move to ipa-cp.h.
>   (ipcp_value_base): Likewise.
>   (ipcp_value): Likewise.
>   (ipcp_lattice): Likewise.
>   (ipcp_agg_lattice): Likewise.
>   (ipcp_bits_lattice): Likewise.
>   (ipcp_vr_lattice): Likewise.
>   (ipcp_param_lattices): Likewise.
>   (ipa_get_parm_lattices): Remove assert that lattices are non-NULL.
>   (ipa_value_from_jfunc): Adjust a check for empty lattices.
>   (ipa_context_from_jfunc): Likewise.
>   (ipa_agg_value_from_jfunc): Likewise.
>   (merge_agg_lats_step): Do not memset new aggregate lattices to zero.
>   (ipcp_propagate_stage): Allocate lattices in a vector as opposed to
>   just in contiguous memory.
>   (ipcp_store_vr_results): Adjust a check for empty lattices.
>   * auto-profile.cc: Include sreal.h and ipa-cp.h.
>   * cgraph.cc: Likewise.
>   * cgraphclones.cc: Likewise.
>   * cgraphunit.cc: Likewise.
>   * config/aarch64/aarch64.cc: Likewise.
>   * config/i386/i386-builtins.cc: Likewise.
>   * config/i386/i386-expand.cc: Likewise.
>   * config/i386/i386-features.cc: Likewise.
>   * config/i386/i386-options.cc: Likewise.
>   * config/i386/i386.cc: Likewise.
>   * config/rs6000/rs6000.cc: Likewise.
>   * config/s390/s390.cc: Likewise.
>   * gengtype.cc (open_base_files): Added sreal.h and ipa-cp.h to the
>   files to be included in gtype-desc.cc.
>   * gimple-range-fold.cc: Include sreal.h and ipa-cp.h.
>   * ipa-devirt.cc: Likewise.
>   * ipa-fnsummary.cc: Likewise.
>   * ipa-icf.cc: Likewise.

Fix ICE in loop splitting

2024-02-14 Thread Jan Hubicka
Hi,
as demonstrated in the testcase, I forgot to check that profile is
present in tree-ssa-loop-split.
Bootstrapped and regtested x86_64-linux, committed.

PR tree-optimization/111054

gcc/ChangeLog:

* tree-ssa-loop-split.cc (split_loop): Check for profile being present.

gcc/testsuite/ChangeLog:

* gcc.c-torture/compile/pr111054.c: New test.

diff --git a/gcc/testsuite/gcc.c-torture/compile/pr111054.c 
b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
new file mode 100644
index 000..3c0d6e816b9
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
@@ -0,0 +1,11 @@
+/* { dg-additional-options "-fno-guess-branch-probability" } */
+void *p, *q;
+int i, j;
+
+void
+foo (void)
+{
+  for (i = 0; i < 20; i++)
+if (i < j)
+  p = q;
+}
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 04215fe7937..c0bb1b71d17 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -712,7 +712,8 @@ split_loop (class loop *loop1)
  ? true_edge->probability.to_sreal () : (sreal)1;
sreal scale2 = false_edge->probability.reliable_p ()
  ? false_edge->probability.to_sreal () : (sreal)1;
-   sreal div1 = loop1_prob.to_sreal ();
+   sreal div1 = loop1_prob.initialized_p ()
+? loop1_prob.to_sreal () : (sreal)1/(sreal)2;
	/* +1 to get header interations rather than latch iterations and then
	   -1 to convert back.  */
if (div1 != 0)


Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-02-14 Thread Jan Hubicka
> [Public]
> 
> Hi,
> 
> >>I assume the znver5 costs are smae as znver4 so far?
> 
> Costing table updated for the entries below.
> +  {COSTS_N_INSNS (10),               /* cost of a divide/mod for QI.  */
> +   COSTS_N_INSNS (11),               /*                          HI.  */
> +   COSTS_N_INSNS (16),               /*                          DI.  */
> +   COSTS_N_INSNS (16)},              /*                       other.  */
> +  COSTS_N_INSNS (10),                /* cost of DIVSS instruction.  */
> +  COSTS_N_INSNS (14),                /* cost of SQRTSS instruction.  */
> +  COSTS_N_INSNS (20),                /* cost of SQRTSD instruction.  */

I see, that looks good.
> 
> 
> >> we can just change znver4.md to also work for znver5?
> We will combine znver4 and znver5 scheduler descriptions into one

Thanks!

Honza
> 
> Thanks and Regards
> Karthiban
> 
> -Original Message-
> From: Jan Hubicka 
> Sent: Monday, February 12, 2024 9:30 PM
> To: Anbazhagan, Karthiban 
> Cc: gcc-patches@gcc.gnu.org; Kumar, Venkataramanan 
> ; Joshi, Tejas Sanjay 
> ; Nagarajan, Muthu kumar raj 
> ; Gopalasubramanian, Ganesh 
> 
> Subject: Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 
> CPU with znver5 scheduler Model
> 
> 
> Hi,
> > gcc/ChangeLog:
> > * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5.
> > * common/config/i386/i386-common.cc (processor_names): Add znver5.
> > (processor_alias_table): Likewise.
> > * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen
> > family.
> > (processor_subtypes): Add znver5.
> > * config.gcc (x86_64-*-* |...): Likewise.
> > * config/i386/driver-i386.cc (host_detect_local_cpu): Let
> > march=native detect znver5 cpu's.
> > * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5.
> > * config/i386/i386-options.cc (m_ZNVER5): New definition
> > (processor_cost_table): Add znver5.
> > * config/i386/i386.cc (ix86_reassociation_width): Likewise.
> > * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5
> > (PTA_ZNVER5): New definition.
> > * config/i386/i386.md (define_attr "cpu"): Add znver5.
> > (Scheduling descriptions) Add znver5.md.
> > * config/i386/x86-tune-costs.h (znver5_cost): New definition.
> > * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5.
> > (ix86_adjust_cost): Likewise.
> > * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5.
> > (avx512_store_by_pieces): Add m_ZNVER5.
> > * doc/extend.texi: Add znver5.
> > * doc/invoke.texi: Likewise.
> > * config/i386/znver5.md: New.
> >
> > gcc/testsuite/ChangeLog:
> > * g++.target/i386/mv29.C: Handle znver5 arch.
> > * gcc.target/i386/funcspec-56.inc:Likewise.
> > +/* This table currently replicates znver4_cost table. */
> > +struct processor_costs znver5_cost = {
> 
> I assume the znver5 costs are smae as znver4 so far?
> 
> > +;; AMD znver5 Scheduling
> > +;; Modeling automatons for zen decoders, integer execution pipes,
> > +;; AGU pipes, branch, floating point execution and fp store units.
> > +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv, znver5_agu, znver5_fpu, znver5_fp_store")
> > +
> > +;; Decoders unit has 4 decoders and all of them can decode fast path
> > +;; and vector type instructions.
> > +(define_cpu_unit "znver5-decode0" "znver5")
> > +(define_cpu_unit "znver5-decode1" "znver5")
> > +(define_cpu_unit "znver5-decode2" "znver5")
> > +(define_cpu_unit "znver5-decode3" "znver5")
> 
> Duplicating the znver4 description to znver5 before the scheduler
> description is tuned basically just leads to increasing compiler binary
> size (scheduler models are quite large).
> 
> Depending on changes between generations, I think we should try to share
> CPU unit DFAs where it makes sense (i.e. a shared DFA is smaller than two
> DFAs).  So perhaps, until the scheduler is tuned, we can just change
> znver4.md to also work for znver5?
> 
> Honza


Re: [PATCH] ipa: call destructors on lattices before freeing them (PR 113476)

2024-02-12 Thread Jan Hubicka
> Believe it or not, even though I have re-worked the internals of the
> lattices completely, the array itself is older than my involvement with
> GCC (or at least with ipa-cp.c ;-).
> 
> So it being an array and not a vector is historical coincidence, as far
> as I am concerned :-).  But that may be the reason, or because vector
> macros at that time looked scary, or perhaps the initialization by
> XCNEWVEC zeroing everything out was considered attractive (I kind of
> like that but constructors would probably be cleaner), I don't know.

If your class is no longer a POD, then the clearing before construction
is dead and GCC may optimize it out.  So fixing this may solve some
surprises in the foreseeable future when we will try to compile older
GCCs with newer ones.
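
A minimal sketch of the hazard (illustrative type, not the real lattice
class): once T has a constructor, placement-new begins the lifetime of a
fresh object, so the preceding zeroing is a dead store the compiler is
entitled to delete.

    #include <cstring>
    #include <new>

    struct T { int x; T () {} };   /* no longer a POD */

    T *
    make (void *backing)
    {
      /* May be optimized out: T's constructor does not read the storage
         and the object's lifetime only starts at the placement-new.  */
      std::memset (backing, 0, sizeof (T));
      return new (backing) T;
    }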

Honza
> 
> Martin


Re: [PATCH] ipa: call destructors on lattices before freeing them (PR 113476)

2024-02-12 Thread Jan Hubicka
> Hi,
> 
> In PR 113476 we have discovered that ipcp_param_lattices is no longer
> a POD and should be destructed.  This patch does that, calling
> destructor on each element of the array containing them when the
> corresponding summary of a node is freed.  An alternative would be to
> change the XCNEWVEC-and-placement-new to initializations in
> constructors of all things in ipcp_param_lattices and then simply use
> normal operators new and delete.  I am not sure, the initialization
> through XCNEWVEC may be a bit more efficient although that is probably
> not a big concern.  In the end, I opted for a simpler solution for
> stage 4.
> 
> I have verified that valgrind no longer reports lost memory blocks
> allocated within ipcp_vr_lattice::meet_with_1 on the preprocessed source
> (dwarf2out.i) attached to Bugzilla.  The patch also passes bootstrap and
> LTO bootstrap and testing on x86_64-linux.
> 
> OK for master?
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2024-02-09  Martin Jambor  
> 
>   PR tree-optimization/113476
>   * ipa-prop.h (ipa_node_params::~ipa_node_params): Moved...
>   * ipa-cp.cc (ipa_node_params::~ipa_node_params): ...here.  Added
>   destruction of lattices.

OK.
So you do not use vectors (which would also handle the destruction)
basically to save the space needed to store the vector's size, since
it is already known from the parameter count?

Honza


Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-02-12 Thread Jan Hubicka
Hi,
> gcc/ChangeLog:
> * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5.
> * common/config/i386/i386-common.cc (processor_names): Add znver5.
> (processor_alias_table): Likewise.
> * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen
> family.
> (processor_subtypes): Add znver5.
> * config.gcc (x86_64-*-* |...): Likewise.
> * config/i386/driver-i386.cc (host_detect_local_cpu): Let
> march=native detect znver5 cpu's.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5.
> * config/i386/i386-options.cc (m_ZNVER5): New definition
> (processor_cost_table): Add znver5.
> * config/i386/i386.cc (ix86_reassociation_width): Likewise.
> * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5
> (PTA_ZNVER5): New definition.
> * config/i386/i386.md (define_attr "cpu"): Add znver5.
> (Scheduling descriptions) Add znver5.md.
> * config/i386/x86-tune-costs.h (znver5_cost): New definition.
> * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5.
> (ix86_adjust_cost): Likewise.
> * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5.
> (avx512_store_by_pieces): Add m_ZNVER5.
> * doc/extend.texi: Add znver5.
> * doc/invoke.texi: Likewise.
> * config/i386/znver5.md: New.
> 
> gcc/testsuite/ChangeLog:
> * g++.target/i386/mv29.C: Handle znver5 arch.
> * gcc.target/i386/funcspec-56.inc:Likewise.
> +/* This table currently replicates znver4_cost table. */
> +struct processor_costs znver5_cost = {

I assume the znver5 costs are smae as znver4 so far?

> +;; AMD znver5 Scheduling
> +;; Modeling automatons for zen decoders, integer execution pipes,
> +;; AGU pipes, branch, floating point execution and fp store units.
> +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv, znver5_agu, znver5_fpu, znver5_fp_store")
> +
> +;; Decoders unit has 4 decoders and all of them can decode fast path
> +;; and vector type instructions.
> +(define_cpu_unit "znver5-decode0" "znver5")
> +(define_cpu_unit "znver5-decode1" "znver5")
> +(define_cpu_unit "znver5-decode2" "znver5")
> +(define_cpu_unit "znver5-decode3" "znver5")

Duplicating the znver4 description to znver5 before the scheduler
description is tuned basically just leads to increasing compiler binary
size (scheduler models are quite large).

Depending on changes between generations, I think we should try to share
CPU unit DFAs where it makes sense (i.e. a shared DFA is smaller than two
DFAs).  So perhaps, until the scheduler is tuned, we can just change
znver4.md to also work for znver5?

Honza


Re: [PATCH] ipa: Self-DCE of uses of removed call LHSs (PR 108007)

2024-01-22 Thread Jan Hubicka
> Hi,
> 
> PR 108007 is another manifestation where we rely on DCE to clean-up
> after IPA-SRA and if the user explicitely switches DCE off, IPA-SRA
> can leave behind statements which are fed uninitialized values and
> trap, even though their results are themselves never used.
> 
> I have already fixed this for unused parameters in callees, this bug
> shows that almost the same thing can happen for removed returns, on
> the side of callers.  This means that the issue has to be fixed
> elsewhere, in call redirection.  This patch adds a function which
> looks for (and through, using a work-list) uses of operations fed
> specific SSA names and removes them all.
> 
> That would have been easy if it wasn't for debug statements during
> tree-inline (from which call redirection is also invoked).  Debug
> statements are decoupled from the rest at this point and iterating
> over uses of SSAs does not bring them up.  During tree-inline they are
> handled specially at the end, I assume in order to make sure that the
> relative ordering of UIDs is the same with and without debug info.
> 
> This means that during tree-inline we need to make a hash of killed
> SSAs, that we already have in copy_body_data, available to the
> function making the purging.  So the patch duly does also that, making
> the interface slightly ugly.  Moreover, all newly unused SSA names
> need to be freed and as PR 112616 showed, it must be done in a defined
> order, which is what newly added ipa_release_ssas_in_hash does.
> 
> The only difference from the patch which has already been approved in
> September but which I later had to revert is (one function name and)
> that SSAs that are to be released are first put into an auto_vec and
> sorted according their version number to avoid issues like PR 112616.
> 
> The patch has passed bootstrap, LTO-bootstrap and profiled-LTO-bootstrap
> and testing on x86_64-linux, bootstrap, LTO-bootstrap and testing on
> ppc64le-linux and bootstrap and LTO-bootstrap on Aarch64, testsuite
> there is still running, OK if it passes?
> 
> Thanks
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2024-01-12  Martin Jambor  
> 
>   PR ipa/108007
>   PR ipa/112616
>   * cgraph.h (cgraph_edge): Add a parameter to
>   redirect_call_stmt_to_callee.
>   * ipa-param-manipulation.h (ipa_param_adjustments): Add a
>   parameter to modify_call.
>   (ipa_release_ssas_in_hash): Declare.
>   * cgraph.cc (cgraph_edge::redirect_call_stmt_to_callee): New
>   parameter killed_ssas, pass it to padjs->modify_call.
>   * ipa-param-manipulation.cc (purge_all_uses): New function.
>   (ipa_param_adjustments::modify_call): New parameter killed_ssas.
>   Instead of substituting uses, invoke purge_all_uses.  If
>   hash of killed SSAs has not been provided, create a temporary one
>   and release SSAs that have been added to it.
>   (compare_ssa_versions): New function.
>   (ipa_release_ssas_in_hash): Likewise.
>   * tree-inline.cc (redirect_all_calls): Create
>   id->killed_new_ssa_names earlier, pass it to edge redirection,
>   adjust a comment.
>   (copy_body): Release SSAs in id->killed_new_ssa_names.
> 
> gcc/testsuite/ChangeLog:
> 
> 2024-01-15  Martin Jambor  
> 
>   PR ipa/108007
>   PR ipa/112616
>   * gcc.dg/ipa/pr108007.c: New test.
>   * gcc.dg/ipa/pr112616.c: Likewise.
> +/* Remove all statements that use NAME directly or indirectly.  KILLED_SSAS
> +   contains the SSA_NAMEs that are already being or have been processed and 
> new
> +   ones need to be added to it.  The function only has to process situations
> +   handled by ssa_name_only_returned_p in ipa-sra.cc with the exception that 
> it
> +   can assume it must never reach a use in a return statement.  */
> +
> +static void
> +purge_all_uses (tree name, hash_set<tree> *killed_ssas)

I think simple_dce_from_worklist does pretty much the same but expects
bitmap instead of hash set.

It checks for some cases when statement is not removable, but these
should all pass if we declared it dead.

The patch looks OK to me, except that keeping this walk in one place
would be nice.
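
A hedged sketch of that direction (hypothetical adaptation, not part of
the patch; it assumes the bitmap-based simple_dce_from_worklist interface
and hash_set<tree> iteration):

    /* Translate the killed names into the bitmap of SSA versions that
       simple_dce_from_worklist expects, then let it do the walk.  */
    auto_bitmap dce_worklist;
    for (tree name : *killed_ssas)
      bitmap_set_bit (dce_worklist, SSA_NAME_VERSION (name));
    simple_dce_from_worklist (dce_worklist);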

Honza
> +{
> +  imm_use_iterator imm_iter;
> +  gimple *stmt;
> +  auto_vec<tree> worklist;
> +
> +  worklist.safe_push (name);
> +  while (!worklist.is_empty ())
> +{
> +  tree cur_name = worklist.pop ();
> +  FOR_EACH_IMM_USE_STMT (stmt, imm_iter, cur_name)
> + {
> +   if (gimple_debug_bind_p (stmt))
> + {
> +   /* When runing within tree-inline, we will never end up here but
> +  adding the SSAs to killed_ssas will do the trick in this case
> +  and the respective debug statements will get reset. */
> +   gimple_debug_bind_reset_value (stmt);
> +   update_stmt (stmt);
> +   continue;
> + }
> +
> +   tree lhs = NULL_TREE;
> +   if (is_gimple_assign (stmt))
> + lhs = gimple_assign_lhs (stmt);
> +   else if (gimple_code 

Re: [PATCH] ipa-cp: Fix check for exceeding param_ipa_cp_value_list_size (PR 113490)

2024-01-22 Thread Jan Hubicka
> Hi,
> 
> When the check for exceeding the param_ipa_cp_value_list_size limit was
> modified to be ignored for generating values from self-recursive
> calls, it should have been changed from "equal to" to "equal to or
> greater than".  This omission manifests itself as PR 113490.
> 
> When I examined the condition I also noticed that the parameter should
> come from the callee rather than the caller, since the value list is
> associated with the former and not the latter.  In practice the limit
> is of course very likely to be the same, but I fixed this aspect of
> the condition too.  I briefly audited all other uses of opt_for_fn in
> ipa-cp.cc and all the others looked OK.
> 
> Bootstrapped and tested on x86_64-linux.  OK for master?
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2024-01-19  Martin Jambor  
> 
>   PR ipa/113490
>   * ipa-cp.cc (ipcp_lattice::add_value): Bail out if value
>   count is equal or greater than the limit.  Use the limit from the
>   callee.
> 
> gcc/testsuite/ChangeLog:
> 
> 2024-01-19  Martin Jambor  
> 
>   PR ipa/113490
>   * gcc.dg/ipa/pr113490.c: New test.
OK,
thanks!
Honza


Re: [PATCH v2 2/2] x86: Don't save callee-saved registers in noreturn functions

2024-01-22 Thread Jan Hubicka
> I compared GCC master branch bootstrap and test times on a slow machine
> with 6.6 Linux kernels compiled with the original GCC 13 and the GCC 13
> with the backported patch.  The performance data isn't precise since the
> measurements were done on different days with different GCC sources under
> different 6.6 kernel versions.
> 
> GCC master branch build time in seconds:
> 
> before        after          improvement
> 30043.75user  30013.16user   0%
> 1274.85system 1243.72system  2.4%
> 
> GCC master branch test time in seconds (new tests added):
> 
> before        after          improvement
> 216035.90user 216547.51user  0
> 27365.51system 26658.54system 2.6%

It is interesting - does the system time difference come from the smaller
binary?  Is the difference significant?
> 
> gcc/
> 
>   PR target/38534
>   * config/i386/i386-options.cc (ix86_set_func_type): Don't
>   save and restore callee saved registers for a noreturn function
>   with nothrow or compiled with -fno-exceptions.

In general this looks like a good thing to do.  I wonder if it is not
something the middle-end should understand for all targets.
Also I wonder about asynchronous stack unwinding.  If we want to unwind
the stack from an interrupt, then we may need some registers to be saved
(like the base pointer).

Honza
> 
> gcc/testsuite/
> 
>   PR target/38534
>   * gcc.target/i386/pr38534-1.c: New file.
>   * gcc.target/i386/pr38534-2.c: Likewise.
>   * gcc.target/i386/pr38534-3.c: Likewise.
>   * gcc.target/i386/pr38534-4.c: Likewise.
>   * gcc.target/i386/stack-check-17.c: Updated.
> ---
>  gcc/config/i386/i386-options.cc   | 16 ++--
>  gcc/testsuite/gcc.target/i386/pr38534-1.c | 26 +++
>  gcc/testsuite/gcc.target/i386/pr38534-2.c | 18 +
>  gcc/testsuite/gcc.target/i386/pr38534-3.c | 19 ++
>  gcc/testsuite/gcc.target/i386/pr38534-4.c | 18 +
>  .../gcc.target/i386/stack-check-17.c  | 19 +-
>  6 files changed, 102 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr38534-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr38534-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr38534-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr38534-4.c
> 
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 0cdea30599e..f965568947c 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -3371,9 +3371,21 @@ ix86_simd_clone_adjust (struct cgraph_node *node)
>  static void
>  ix86_set_func_type (tree fndecl)
>  {
> +  /* No need to save and restore callee-saved registers for a noreturn
> + function with nothrow or compiled with -fno-exceptions.
> +
> + NB: Don't use TREE_THIS_VOLATILE to check if this is a noreturn
> + function.  The local-pure-const pass turns an interrupt function
> + into a noreturn function by setting TREE_THIS_VOLATILE.  Normally
> + the local-pure-const pass is run after ix86_set_func_type is called.
> + When the local-pure-const pass is enabled for LTO, the interrupt
> + function is marked as noreturn in the IR output, which leads the
> + incompatible attribute error in LTO1.  */
>bool has_no_callee_saved_registers
> -= lookup_attribute ("no_callee_saved_registers",
> - TYPE_ATTRIBUTES (TREE_TYPE (fndecl)));
> += (((TREE_NOTHROW (fndecl) || !flag_exceptions)
> + && lookup_attribute ("noreturn", DECL_ATTRIBUTES (fndecl)))
> +   || lookup_attribute ("no_callee_saved_registers",
> + TYPE_ATTRIBUTES (TREE_TYPE (fndecl))));
>  
>if (cfun->machine->func_type == TYPE_UNKNOWN)
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/pr38534-1.c 
> b/gcc/testsuite/gcc.target/i386/pr38534-1.c
> new file mode 100644
> index 000..9297959e759
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr38534-1.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune-ctrl=^prologue_using_move,^epilogue_using_move" } */
> +
> +#define ARRAY_SIZE 256
> +
> +extern int array[ARRAY_SIZE][ARRAY_SIZE][ARRAY_SIZE];
> +extern int value (int, int, int)
> +#ifndef __x86_64__
> +__attribute__ ((regparm(3)))
> +#endif
> +;
> +
> +void
> +__attribute__((noreturn))
> +no_return_to_caller (void)
> +{
> +  unsigned i, j, k;
> +  for (i = ARRAY_SIZE; i > 0; --i)
> +for (j = ARRAY_SIZE; j > 0; --j)
> +  for (k = ARRAY_SIZE; k > 0; --k)
> + array[i - 1][j - 1][k - 1] = value (i, j, k);
> +  while (1);
> +}
> +
> +/* { dg-final { scan-assembler-not "push" } } */
> +/* { dg-final { scan-assembler-not "pop" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr38534-2.c 
> b/gcc/testsuite/gcc.target/i386/pr38534-2.c
> new file mode 100644
> index 000..1fb01363273
> --- 

Remove accidental hack in ipa_polymorphic_call_context::set_by_invariant

2024-01-17 Thread Jan Hubicka
Hi,
I managed to commit a hack setting offset to 0 in
ipa_polymorphic_call_context::set_by_invariant.  This makes it to give up on 
multiple
inheritance, but most likely won't give bad code since the ohter base will be of
different type.  

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* ipa-polymorphic-call.cc 
(ipa_polymorphic_call_context::set_by_invariant): Remove
accidental hack reseting offset.

diff --git a/gcc/ipa-polymorphic-call.cc b/gcc/ipa-polymorphic-call.cc
index 8667059abee..81de6d7fc33 100644
--- a/gcc/ipa-polymorphic-call.cc
+++ b/gcc/ipa-polymorphic-call.cc
@@ -766,7 +766,6 @@ ipa_polymorphic_call_context::set_by_invariant (tree cst,
   tree base;
 
   invalid = false;
-  off = 0;
   clear_outer_type (otr_type);
 
   if (TREE_CODE (cst) != ADDR_EXPR)


Fix handling of X86_TUNE_AVOID_512FMA_CHAINS

2024-01-17 Thread Jan Hubicka
Hi,
I have noticed a quite bad pasto in the handling of
X86_TUNE_AVOID_512FMA_CHAINS.  At the moment it is ignored, but
X86_TUNE_AVOID_256FMA_CHAINS controls 512FMA too.
This patch fixes it; we may want to re-check how that works on AVX512 machines.

Bootstrapped/regtested x86_64-linux, will commit it shortly.

Honza

gcc/ChangeLog:

* config/i386/i386-options.cc (ix86_option_override_internal): Fix
handling of X86_TUNE_AVOID_512FMA_CHAINS.

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 3605c2c53fb..b6f634e9a32 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -3248,7 +3248,7 @@ ix86_option_override_internal (bool main_args_p,
   = (cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
 }
 
-  if (ix86_tune_features [X86_TUNE_AVOID_256FMA_CHAINS])
+  if (ix86_tune_features [X86_TUNE_AVOID_512FMA_CHAINS])
 SET_OPTION_IF_UNSET (opts, opts_set, param_avoid_fma_max_bits, 512);
   else if (ix86_tune_features [X86_TUNE_AVOID_256FMA_CHAINS])
 SET_OPTION_IF_UNSET (opts, opts_set, param_avoid_fma_max_bits, 256);


Re: Disable FMADD in chains for Zen4 and generic

2024-01-17 Thread Jan Hubicka
> Can we backport the patch (at least the generic part) to the
> GCC 11/GCC 12/GCC 13 release branches?

Yes, the periodic testers have picked up the change and, as far as I can
tell, there are no surprises.

Thanks,
Honza
> > > >
> > > >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 
> > > > 512bit or
> > > > smaller FMA chain.  */
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> 
> 
> 
> -- 
> BR,
> Hongtao


Re: Add -falign-all-functions

2024-01-17 Thread Jan Hubicka
> On Wed, 17 Jan 2024, Jan Hubicka wrote:
> 
> > > 
> > > I meant the new option might be named -fmin-function-alignment=
> > > rather than -falign-all-functions because of how it should
> > > override all other options.
> > 
> > I was also pondering about both names.  -falign-all-functions has the
> > advantage that it is similar to all the other alignment flags that are
> > all called -falign-XXX
> > 
> > but both options are fine with me.
> > > 
> > > Otherwise is there an updated patch to look at?
> > 
> > I will prepare one.  So shall I drop the max-skip support for alignment
> > and rename the flag?
> 
> Yes.
OK, here is the updated version.
Bootstrapped/regtested on x86_64-linux, OK?

gcc/ChangeLog:

* common.opt (flimit-function-alignment): Reorder so the file is
alphabetically ordered.
(fmin-function-alignment=): New flag.
* doc/invoke.texi (-fmin-function-alignment): Document.
(-falign-jumps, -falign-labels): Document that this is an optimization
bypassed in cold code.
* varasm.cc (assemble_start_function): Honor -fmin-function-alignment.

diff --git a/gcc/common.opt b/gcc/common.opt
index 5f0a101bccb..6e85853f086 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1040,9 +1040,6 @@ Align the start of functions.
 falign-functions=
 Common RejectNegative Joined Var(str_align_functions) Optimization
 
-flimit-function-alignment
-Common Var(flag_limit_function_alignment) Optimization Init(0)
-
 falign-jumps
 Common Var(flag_align_jumps) Optimization
 Align labels which are only reached by jumping.
@@ -2277,6 +2274,10 @@ fmessage-length=
 Common RejectNegative Joined UInteger
-fmessage-length=<number>  Limit diagnostics to <number> characters per line.  0 suppresses line-wrapping.
 
+fmin-function-alignment=
+Common Joined RejectNegative UInteger Var(flag_min_function_alignment) Optimization
+Align the start of every function.
+
 fmodulo-sched
 Common Var(flag_modulo_sched) Optimization
 Perform SMS based modulo scheduling before the first scheduling pass.
@@ -2601,6 +2602,9 @@ starts and when the destructor finishes.
 flifetime-dse=
 Common Joined RejectNegative UInteger Var(flag_lifetime_dse) Optimization 
IntegerRange(0, 2)
 
+flimit-function-alignment
+Common Var(flag_limit_function_alignment) Optimization Init(0)
+
 flive-patching
 Common RejectNegative Alias(flive-patching=,inline-clone) Optimization
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 43fd3c3a3cd..456374d9446 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -546,6 +546,7 @@ Objective-C and Objective-C++ Dialects}.
-falign-jumps[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
-falign-labels[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
-falign-loops[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
+-fmin-function-alignment=[@var{n}]
 -fno-allocation-dce -fallow-store-data-races
 -fassociative-math  -fauto-profile  -fauto-profile[=@var{path}]
 -fauto-inc-dec  -fbranch-probabilities
@@ -14177,6 +14178,9 @@ Align the start of functions to the next power-of-two 
greater than or
 equal to @var{n}, skipping up to @var{m}-1 bytes.  This ensures that at
 least the first @var{m} bytes of the function can be fetched by the CPU
 without crossing an @var{n}-byte alignment boundary.
+This is an optimization of code performance and alignment is ignored for
+functions considered cold.  If alignment is required for all functions,
+use @option{-fmin-function-alignment}.
 
 If @var{m} is not specified, it defaults to @var{n}.
 
@@ -14240,6 +14244,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
 Align loops to a power-of-two boundary.  If the loops are executed
 many times, this makes up for any execution of the dummy padding
 instructions.
+This is an optimization of code performance and alignment is ignored for
+loops considered cold.
 
 If @option{-falign-labels} is greater than this value, then its value
 is used instead.
@@ -14262,6 +14268,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
 Align branch targets to a power-of-two boundary, for branch targets
 where the targets can only be reached by jumping.  In this case,
 no dummy operations need be executed.
+This is an optimization of code performance and alignment is ignored for
+jumps considered cold.
 
 If @option{-falign-labels} is greater than this value, then its value
 is used instead.
@@ -14275,6 +14283,14 @@ The maximum allowed @var{n} option value is 65536.
 
 Enabled at levels @option{-O2}, @option{-O3}.
 
+@opindex fmin-function-alignment=@var{n}
+@item -fmin-function-alignment
+Specify minimal alignment of functions to the next power-of-two greater than or
+equal to @var{n}. Unlike @option{-falign-functions} this alignment is applied
+also to all functions (even those considered cold).  The alignment is also not
+affected by @option{-flimit-function-alignment}
+
+
 @opindex fno-allocation-dce
 @item -fn

Re: Add -falign-all-functions

2024-01-17 Thread Jan Hubicka
> 
> I meant the new option might be named -fmin-function-alignment=
> rather than -falign-all-functions because of how it should
> override all other options.

I was also pondering about both names.  -falign-all-functions has the
advantage that it is similar to all the other alignment flags that are
all called -falign-XXX

but both options are fine with me.
> 
> Otherwise is there an updated patch to look at?

I will prepare one.  So shall I drop the max-skip support for alignment
and rename the flag?

Honza
> 
> Richard.
> 
> > > -flimit-function-alignment should not have an effect on it
> > > and even very small functions should be aligned.
> > 
> > I write that it is not affected by limit-function-alignment
> > @opindex falign-all-functions=@var{n}
> > @item -falign-all-functions
> > Specify minimal alignment for function entry. Unlike 
> > @option{-falign-functions}
> > this alignment is applied also to all functions (even those considered 
> > cold).
> > The alignment is also not affected by @option{-flimit-function-alignment}
> > 
> > Because indeed that would break the atomicity of updates.
> 
> 
> 
> > Honza
> > > 
> > > Richard.
> > > 
> > > > +}
> > > > +
> > > >/* Handle a user-specified function alignment.
> > > >   Note that we still need to align to DECL_ALIGN, as above,
> > > >   because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at 
> > > > all.  */
> > > > 
> > > 
> > > -- 
> > > Richard Biener 
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > 
> 
> -- 
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: Fix merging of value predictors

2024-01-17 Thread Jan Hubicka
> 
> Please fill in what has changed, both for predict-18.c and predict.{cc,def}
> changes.

Sorry, I re-generated the patch after fixing some typos and forgot to
copy over the changelog.
> 
> > @@ -2613,24 +2658,40 @@ expr_expected_value_1 (tree type, tree op0, enum 
> > tree_code code,
> >   if (!nop1)
> > nop1 = op1;
> >  }
> > +  /* We already checked if folding one of arguments to constant is good
> > +enough.  Consequently failing to fold both means that we will not
> > +succeed determinging the value.  */
> 
> s/determinging/determining/
Fixed.  I am re-testing the following and will commit if it succeeds (on
x86_64-linux)

2024-01-17  Jan Hubicka  
Jakub Jelinek  

PR tree-optimization/110852

gcc/ChangeLog:

* predict.cc (expr_expected_value_1): Fix profile merging of PHI and
binary operations
(get_predictor_value): Handle PRED_COMBINED_VALUE_PREDICTIONS and
PRED_COMBINED_VALUE_PREDICTIONS_PHI
* predict.def (PRED_COMBINED_VALUE_PREDICTIONS): New predictor.
(PRED_COMBINED_VALUE_PREDICTIONS_PHI): New predictor.

gcc/testsuite/ChangeLog:

* gcc.dg/predict-18.c: Update template to expect combined value 
predictor.
* gcc.dg/predict-23.c: New test.
* gcc.dg/tree-ssa/predict-1.c: New test.
* gcc.dg/tree-ssa/predict-2.c: New test.
* gcc.dg/tree-ssa/predict-3.c: New test.

diff --git a/gcc/predict.cc b/gcc/predict.cc
index 84cbe3ffc61..c1c48bf3df1 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -2404,44 +2404,78 @@ expr_expected_value_1 (tree type, tree op0, enum 
tree_code code,
   if (!bitmap_set_bit (visited, SSA_NAME_VERSION (op0)))
return NULL;
 
-  if (gimple_code (def) == GIMPLE_PHI)
+  if (gphi *phi = dyn_cast <gphi *> (def))
{
  /* All the arguments of the PHI node must have the same constant
 length.  */
- int i, n = gimple_phi_num_args (def);
- tree val = NULL, new_val;
+ int i, n = gimple_phi_num_args (phi);
+ tree val = NULL;
+ bool has_nonzero_edge = false;
+
+ /* If we already proved that given edge is unlikely, we do not need
+to handle merging of the probabilities.  */
+ for (i = 0; i < n && !has_nonzero_edge; i++)
+   {
+ tree arg = PHI_ARG_DEF (phi, i);
+ if (arg == PHI_RESULT (phi))
+   continue;
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (!cnt.initialized_p () || cnt.nonzero_p ())
+   has_nonzero_edge = true;
+   }
 
  for (i = 0; i < n; i++)
{
- tree arg = PHI_ARG_DEF (def, i);
+ tree arg = PHI_ARG_DEF (phi, i);
  enum br_predictor predictor2;
 
- /* If this PHI has itself as an argument, we cannot
-determine the string length of this argument.  However,
-if we can find an expected constant value for the other
-PHI args then we can still be sure that this is
-likely a constant.  So be optimistic and just
-continue with the next argument.  */
- if (arg == PHI_RESULT (def))
+ /* Skip self-referring parameters, since they does not change
+expected value.  */
+ if (arg == PHI_RESULT (phi))
continue;
 
+ /* Skip edges which we already predicted as executing
+zero times.  */
+ if (has_nonzero_edge)
+   {
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (cnt.initialized_p () && !cnt.nonzero_p ())
+   continue;
+   }
  HOST_WIDE_INT probability2;
- new_val = expr_expected_value (arg, visited, ,
-);
+ tree new_val = expr_expected_value (arg, visited, ,
+ );
+ /* If we know nothing about value, give up.  */
+ if (!new_val)
+   return NULL;
 
- /* It is difficult to combine value predictors.  Simply assume
-that later predictor is weaker and take its prediction.  */
- if (*predictor < predictor2)
+ /* If this is a first edge, trust its prediction.  */
+ if (!val)
{
+ val = new_val;
  *predictor = predictor2;
  *probability = probability2;
+ continue;
}
- if (!new_val)
-   return NULL;
- if (!val)
-   val = new_val;
- else if (!operand_equal_p (val, new_val, false))
+ /* If there are two different values, g

Re: Add -falign-all-functions

2024-01-17 Thread Jan Hubicka
> > +falign-all-functions
> > +Common Var(flag_align_all_functions) Optimization
> > +Align the start of functions.
> 
> all functions
> 
> or maybe "of every function."?

Fixed, thanks!
> > +@opindex falign-all-functions=@var{n}
> > +@item -falign-all-functions
> > +Specify minimal alignment for function entry. Unlike 
> > @option{-falign-functions}
> > +this alignment is applied also to all functions (even those considered 
> > cold).
> > +The alignment is also not affected by @option{-flimit-function-alignment}
> > +
> 
> For functions with two entries (like on powerpc), which entry does this
> apply to?  I suppose the external ABI entry, not the local one?  But
> how does this then help to align the patchable entry (the common
> local entry should be aligned?).  Should we align _both_ entries?

To be honest I did not really know we actually would like to patch
alternative entry points.
The function alignment is always produced before the start of the function,
so the first entry point wins and the other entry point is not aligned.

Aligning later labels needs to go through the label alignment code, since
theoretically some targets need to do relaxation over it.

In final.cc we do not apply function alignment to those labels.
I guess this makes sense because if we align for performance, we
probably do not want the alternate entry point to be aligned, since it
appears close to the original one.  I can add that to compute_alignments:
test if a label is an alternative entry point and add alignment.
I wonder if that is the desired behaviour though, and whether this code
path is even used?

I know this was originally added to support i386 register passing
conventions and stack alignment via alternative entry points, but it was
never really used that way.  Also there was a plan to support Fortran
alternative entry points.

Looking at what rs6000 does, it seems to not use the RTL representation
of alternative entry points.  It seems that:
 1) assemble_start_function is called, which
    a) outputs the function alignment
    b) outputs the start label
    c) calls print_patchable_function_entry
 2) final_start_function is called, which calls output_function_prologue;
    in rs6000 there is a second call to
    rs6000_print_patchable_function_entry.
So there is no target-independent place where alignment can be added,
so I would say it is up to the rs6000 maintainers to decide what is right
here :)
> 
> >  @opindex falign-labels
> >  @item -falign-labels
> >  @itemx -falign-labels=@var{n}
> > @@ -14240,6 +14250,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
> >  Align loops to a power-of-two boundary.  If the loops are executed
> >  many times, this makes up for any execution of the dummy padding
> >  instructions.
> > +This is an optimization of code performance and alignment is ignored for
> > +loops considered cold.
> >  
> >  If @option{-falign-labels} is greater than this value, then its value
> >  is used instead.
> > @@ -14262,6 +14274,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
> >  Align branch targets to a power-of-two boundary, for branch targets
> >  where the targets can only be reached by jumping.  In this case,
> >  no dummy operations need be executed.
> > +This is an optimization of code performance and alignment is ignored for
> > +jumps considered cold.
> >  
> >  If @option{-falign-labels} is greater than this value, then its value
> >  is used instead.
> > @@ -14371,7 +14385,7 @@ To use the link-time optimizer, @option{-flto} and 
> > optimization
> >  options should be specified at compile time and during the final link.
> >  It is recommended that you compile all the files participating in the
> >  same link with the same options and also specify those options at
> > -link time.  
> > +link time.
> >  For example:
> >  
> >  @smallexample
> > diff --git a/gcc/flags.h b/gcc/flags.h
> > index e4bafa310d6..ecf4fb9e846 100644
> > --- a/gcc/flags.h
> > +++ b/gcc/flags.h
> > @@ -89,6 +89,7 @@ public:
> >align_flags x_align_jumps;
> >align_flags x_align_labels;
> >align_flags x_align_functions;
> > +  align_flags x_align_all_functions;
> >  };
> >  
> >  extern class target_flag_state default_target_flag_state;
> > @@ -98,10 +99,11 @@ extern class target_flag_state *this_target_flag_state;
> >  #define this_target_flag_state (_target_flag_state)
> >  #endif
> >  
> > -#define align_loops (this_target_flag_state->x_align_loops)
> > -#define align_jumps (this_target_flag_state->x_align_jumps)
> > -#define align_labels(this_target_flag_state->x_align_labels)
> > -#define align_functions (this_target_flag_state->x_align_functions)
> > +#define align_loops(this_target_flag_state->x_align_loops)
> > +#define align_jumps(this_target_flag_state->x_align_jumps)
> > +#define align_labels   (this_target_flag_state->x_align_labels)
> > +#define align_functions(this_target_flag_state->x_align_functions)
> > +#define align_all_functions 

Fix merging of value predictors

2024-01-17 Thread Jan Hubicka
Hi,
expr_expected_value is doing some guesswork when it is merging two or more
independent value predictions either in PHI node or in binary operation.
Since we do not know how the predictions interact with each other, we can
not really merge the values precisely.

The previous logic merged the predictions and picked the later predictor
(since predict.def is sorted by reliability).  This however leads to
troubles with __builtin_expect_with_probability, since it is special-cased
as a predictor with custom probabilities.  If this predictor is downgraded
to something else, we ICE, since we have a prediction given by a predictor
that is not expected to have a custom probability.

This patch fixes it by inventing new predictors
PRED_COMBINED_VALUE_PREDICTIONS and PRED_COMBINED_VALUE_PREDICTIONS_PHI
which also allow custom values but are considered less reliable than
__builtin_expect_with_probability (they are combined by Dempster-Shafer
theory rather than by first match).  This is less likely to lead to very
stupid decisions if combining does not work as expected.

I also updated the code to be a bit more careful about merging values and
not downgrade the precision when unnecessary (as tested by the new
testcases).
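
A hypothetical illustration (mine, not one of the new testcases) of the
merging situation described above: two value predictions with different
predictors meet in the PHI node for x.

    int
    f (int a, int b, int c)
    {
      int x;
      if (c)
        x = __builtin_expect (a, 42);
      else
        x = __builtin_expect_with_probability (b, 42, 0.9);
      /* The value prediction for x here must combine both predictors.  */
      return x == 42;
    }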

Bootstrapped/regtested x86_64-linux, will commit it tomorrow if there are
no complaints.

2024-01-17  Jan Hubicka 
Jakub Jelinek 

PR tree-optimization/110852

gcc/ChangeLog:

* predict.cc (expr_expected_value_1):
(get_predictor_value):
* predict.def (PRED_COMBINED_VALUE_PREDICTIONS):
(PRED_COMBINED_VALUE_PREDICTIONS_PHI):

gcc/testsuite/ChangeLog:

* gcc.dg/predict-18.c:
* gcc.dg/predict-23.c: New test.
* gcc.dg/tree-ssa/predict-1.c: New test.
* gcc.dg/tree-ssa/predict-2.c: New test.
* gcc.dg/tree-ssa/predict-3.c: New test.

diff --git a/gcc/predict.cc b/gcc/predict.cc
index 84cbe3ffc61..f9d73c5eb1a 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -2404,44 +2404,78 @@ expr_expected_value_1 (tree type, tree op0, enum 
tree_code code,
   if (!bitmap_set_bit (visited, SSA_NAME_VERSION (op0)))
return NULL;
 
-  if (gimple_code (def) == GIMPLE_PHI)
+  if (gphi *phi = dyn_cast <gphi *> (def))
{
  /* All the arguments of the PHI node must have the same constant
 length.  */
- int i, n = gimple_phi_num_args (def);
- tree val = NULL, new_val;
+ int i, n = gimple_phi_num_args (phi);
+ tree val = NULL;
+ bool has_nonzero_edge = false;
+
+ /* If we already proved that given edge is unlikely, we do not need
+to handle merging of the probabilities.  */
+ for (i = 0; i < n && !has_nonzero_edge; i++)
+   {
+ tree arg = PHI_ARG_DEF (phi, i);
+ if (arg == PHI_RESULT (phi))
+   continue;
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (!cnt.initialized_p () || cnt.nonzero_p ())
+   has_nonzero_edge = true;
+   }
 
  for (i = 0; i < n; i++)
{
- tree arg = PHI_ARG_DEF (def, i);
+ tree arg = PHI_ARG_DEF (phi, i);
  enum br_predictor predictor2;
 
- /* If this PHI has itself as an argument, we cannot
-determine the string length of this argument.  However,
-if we can find an expected constant value for the other
-PHI args then we can still be sure that this is
-likely a constant.  So be optimistic and just
-continue with the next argument.  */
- if (arg == PHI_RESULT (def))
+ /* Skip self-referring parameters, since they does not change
+expected value.  */
+ if (arg == PHI_RESULT (phi))
continue;
 
+ /* Skip edges which we already predicted as executing
+zero times.  */
+ if (has_nonzero_edge)
+   {
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (cnt.initialized_p () && !cnt.nonzero_p ())
+   continue;
+   }
  HOST_WIDE_INT probability2;
- new_val = expr_expected_value (arg, visited, ,
-);
+ tree new_val = expr_expected_value (arg, visited, ,
+ );
+ /* If we know nothing about value, give up.  */
+ if (!new_val)
+   return NULL;
 
- /* It is difficult to combine value predictors.  Simply assume
-that later predictor is weaker and take its prediction.  */
- if (*predictor < predictor2)
+ /* If this is a first edge, trust its prediction.  */
+ if (!val)
{
+ val = new_val;
  

Add -falign-all-functions

2024-01-04 Thread Jan Hubicka
Hi,
this patch adds new option -falign-all-functions which works like
-falign-functions, but applies to all functions including those in cold
regions.  As discussed in the PR log, this is needed for atomically
patching function entries in the kernel.

An option would be to make -falign-functions mandatory, but I think that is
not a good idea, since the original purpose of -falign-functions is
optimization of instruction decode and cache size.  Having
-falign-all-functions is backwards compatible.  Richi also suggested
extending the syntax of the -falign-functions parameter (which is already
non-trivial), but it seems to me that having a separate flag is more
readable.
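
As a hedged illustration (my example, not from the patch): with plain
-falign-functions, a function placed in the cold section may stay
unaligned, while -falign-all-functions aligns it too, which is what
atomic patching of function entries needs.

    __attribute__ ((cold)) void
    rarely_called (void)
    {
    }

    /* gcc -O2 -falign-functions=16 t.c     : rarely_called may be unaligned.
       gcc -O2 -falign-all-functions=16 t.c : rarely_called is aligned to 16.  */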

Bootstrapped/regtested x86_64-linux, OK for master and later
backports to release branches?

gcc/ChangeLog:

PR middle-end/88345
* common.opt: Add -falign-all-functions
* doc/invoke.texi: Add -falign-all-functions.
(-falign-functions, -falign-labels, -falign-loops): Document
that alignment is ignored in cold code.
* flags.h (align_loops): Reindent.
(align_jumps): Reindent.
(align_labels): Reindent.
(align_functions): Reindent.
(align_all_functions): New macro.
* opts.cc (common_handle_option): Handle -falign-all-functions.
* toplev.cc (parse_alignment_opts): Likewise.
* varasm.cc (assemble_start_function): Likewise.

diff --git a/gcc/common.opt b/gcc/common.opt
index d263a959df3..fea2c855fcf 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1033,6 +1033,13 @@ faggressive-loop-optimizations
 Common Var(flag_aggressive_loop_optimizations) Optimization Init(1)
 Aggressively optimize loops using language constraints.
 
+falign-all-functions
+Common Var(flag_align_all_functions) Optimization
+Align the start of functions.
+
+falign-all-functions=
+Common RejectNegative Joined Var(str_align_all_functions) Optimization
+
 falign-functions
 Common Var(flag_align_functions) Optimization
 Align the start of functions.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d272b9228dd..ad3d75d310c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -543,6 +543,7 @@ Objective-C and Objective-C++ Dialects}.
 @xref{Optimize Options,,Options that Control Optimization}.
 @gccoptlist{-faggressive-loop-optimizations
-falign-functions[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
+-falign-all-functions=[@var{n}]
-falign-jumps[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
-falign-labels[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
-falign-loops[=@var{n}[:@var{m}:[@var{n2}[:@var{m2}]]]]
@@ -14177,6 +14178,9 @@ Align the start of functions to the next power-of-two 
greater than or
 equal to @var{n}, skipping up to @var{m}-1 bytes.  This ensures that at
 least the first @var{m} bytes of the function can be fetched by the CPU
 without crossing an @var{n}-byte alignment boundary.
+This is an optimization of code performance and alignment is ignored for
+functions considered cold.  If alignment is required for all functions,
+use @option{-falign-all-functions}.
 
 If @var{m} is not specified, it defaults to @var{n}.
 
@@ -14210,6 +14214,12 @@ overaligning functions. It attempts to instruct the 
assembler to align
 by the amount specified by @option{-falign-functions}, but not to
 skip more bytes than the size of the function.
 
+@opindex falign-all-functions=@var{n}
+@item -falign-all-functions
+Specify minimal alignment for function entry. Unlike @option{-falign-functions}
+this alignment is applied also to all functions (even those considered cold).
+The alignment is also not affected by @option{-flimit-function-alignment}
+
 @opindex falign-labels
 @item -falign-labels
 @itemx -falign-labels=@var{n}
@@ -14240,6 +14250,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
 Align loops to a power-of-two boundary.  If the loops are executed
 many times, this makes up for any execution of the dummy padding
 instructions.
+This is an optimization of code performance and alignment is ignored for
+loops considered cold.
 
 If @option{-falign-labels} is greater than this value, then its value
 is used instead.
@@ -14262,6 +14274,8 @@ Enabled at levels @option{-O2}, @option{-O3}.
 Align branch targets to a power-of-two boundary, for branch targets
 where the targets can only be reached by jumping.  In this case,
 no dummy operations need be executed.
+This is an optimization of code performance and alignment is ignored for
+jumps considered cold.
 
 If @option{-falign-labels} is greater than this value, then its value
 is used instead.
@@ -14371,7 +14385,7 @@ To use the link-time optimizer, @option{-flto} and 
optimization
 options should be specified at compile time and during the final link.
 It is recommended that you compile all the files participating in the
 same link with the same options and also specify those options at
-link time.  
+link time.
 For example:
 
 @smallexample
diff --git a/gcc/flags.h b/gcc/flags.h
index e4bafa310d6..ecf4fb9e846 100644
--- 

Re: [PATCH v7] Add condition coverage (MC/DC)

2023-12-31 Thread Jan Hubicka
> > This seems good. Profile-arcs is rarely used by itself - most of time it
> > is implied by -fprofile-generate and -ftest-coverage and since
> > condition coverage is more associated to the second, I guess
> > -fcondition-coverage is better name.
> > 
> > Since -fcondition-coverage now affects gimplification (which makes me
> > less worried about its ability to map things back to conditionals), it
> > should be marked as incompatible with -fprofile-generate and
> > -fprofile-use.  It would only work with -fcondtion-coverage was used
> > both with -fprofile-generate and -fprofile-use and I do not see much use
> > of this. On the other hand I can imagine users wanting both coverage and
> > -fprofile-generate run at once, so we should output sorry message that
> > this is not implemented
> 
> I don't quite understand this - gimplification is affected only in the sense
> that slightly more information is recorded, why would it be incompatible
> with -fprofile-generate?

You are right. Originally I thought you added extra temporary boolean
variables (as described by the comment in condition coverage) but that
happens only at tree-profile time.
> > 
> > If condition statement is removed before profiling is done, you will end
> > up with a stale entry in the hash table.
> > We already keep EH regions and profile histogram in on-side hashtables,
> > so I think you need to follow gimple_.*histograms calls in
> > gimple-iterator.cc and be sure that the hashtable is updated.
> > Also if function is inlined (which may be forced at -O0 using
> > always_inline) the conditional annotations will be lost.
> > 
> > I think you should simply make condition coverage to happen before
> > pass_early_inline is run.  But that may be done incrementally.
> In the v8 revision I switched strategy to not using an auxiliary hash table,
> but rather temporarily in the gimple.uid field until the basic block is
> created and it can be stored there. What do you think about that approach?
> 
> https://patchwork.sourceware.org/project/gcc/patch/20231212114125.1998866-...@lambda.is/
> 
> I assume the stale entries is why I got ICEs from what looked like double
> frees in gc runs.
I see, I missed version 8. Sorry for that.
I am not sure about (ab)using gimple_uid - it looks like something that
may lead to surprises later.  We can check with Richi for an opinion.
It seems to me that using a strategy similar to histogram annotations may
end up being more maintainable (since at least it is symmetric to what we
already have).
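
For concreteness, the maintenance pattern I mean is roughly (my sketch; see
value-prof.h for the real declarations, the signatures here are approximate):

  /* When removing a statement, drop its on-side annotations too.  */
  gimple_remove_stmt_histograms (cfun, stmt);
  /* When replacing a statement, move them to the replacement.  */
  gimple_move_stmt_histograms (cfun, new_stmt, stmt);

gimple-iterator.cc already does this for histograms, so condition
annotations could piggy-back on the same hooks.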

I wonder if we want to have, in the longer term, a more general annotation
infrastructure like we do for the symbol table.

I will check the new version of the patch the day after tomorrow, since now
I am about to set off for new year celebrations...

Happy new year,
Honza


Re: skip vector profiles multiple exits

2023-12-29 Thread Jan Hubicka
> Hi Honza,
Hi,
> 
> I wasn't sure what to do here so I figured I'd ask.
> 
> In adding support for multiple exits to the vectorizer I didn't know how to 
> update this bit:
> 
> https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363
> 
> Essentially, if skip_vector (i.e. not enough iterations to enter the vector 
> loop) then the
> previous code would update the new probability to be the same as that of the
> exit edge.  This made sense because that's the only edge which could bring 
> you to
> the next loop preheader.
> 
> With multiple exits this is no longer the case since any exit can bring you 
> to the
> preheader node.  I figured the new counts should simply be the sum of all 
> exit
> edges.  But that gives quite large count values compared to the rest of the 
> loop.
The sum of all exit counts (not probabilities) relative to the header count
should give you the estimated probability that the loop exits at any given
iteration.  I am not sure how good an estimate this is for the loop
precondition to be true (without profile histograms it is really hard
to tell).
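
In code this would be roughly (my sketch using the usual profile API; treat
it as pseudo-code rather than the exact fix):

  profile_count sum = profile_count::zero ();
  for (edge exit : get_loop_exit_edges (loop))
    sum += exit->count ();
  /* Estimated probability that a given iteration leaves the loop.  */
  profile_probability exit_p = sum.probability_in (loop->header->count);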
> 
> I then thought I would need to scale the counts by the probability of the edge
> being taken.  The problem here is that the probabilities don't end up to 100%

So you are summing exit_edge->count ()?
I am not sure how useful summing probabilities would be since they are
conditional (relative to the probability of entering the BB they start from).
How complicated a CFG do we now handle with vectorization?

Honza
> 
> so the scaled counts also looked kinda wonky.   Any suggestions?
> 
> If you want some small examples to look at, testcases
> ./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to 
> ./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
> should be relevant here.
> 
> Thanks,
> Tamar


Re: [PATCH 3/7] Lockfile.

2023-12-29 Thread Jan Hubicka
Hi,
> This patch implements lockfile used for incremental LTO.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * Makefile.in: Add lockfile.o.
>   * lockfile.cc: New file.
>   * lockfile.h: New file.

I can't approve it, but overall it looks good to me.
We also have locking in gcov-io, but it is probably not that practical
to keep these shared, since gcov-io is also built into runtime.

You do not implement the GCOV_LINKED_WITH_LOCKING part; does locking work
with mingw? Or do we only build gcc with the cygwin emulation layer these days?
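
For context, the way I would expect this to be used from lto-wrapper (my
sketch; the constructor and the cache path are made up for illustration):

  lockfile lock ("ltrans.cache/lockfile"); /* hypothetical ctor taking a path */
  if (lock.lock_write () == 0)
    {
      /* ... read or update the cache ... */
      lock.unlock ();
    }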

Honza
> ---
>  gcc/Makefile.in |   5 +-
>  gcc/lockfile.cc | 136 
>  gcc/lockfile.h  |  85 ++
>  3 files changed, 224 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/lockfile.cc
>  create mode 100644 gcc/lockfile.h
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 7b7a4ff789a..2c527245c81 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1831,7 +1831,7 @@ ALL_HOST_BACKEND_OBJS = $(GCC_OBJS) $(OBJS) 
> $(OBJS-libcommon) \
>$(OBJS-libcommon-target) main.o c-family/cppspec.o \
>$(COLLECT2_OBJS) $(EXTRA_GCC_OBJS) $(GCOV_OBJS) $(GCOV_DUMP_OBJS) \
>$(GCOV_TOOL_OBJS) $(GENGTYPE_OBJS) gcc-ar.o gcc-nm.o gcc-ranlib.o \
> -  lto-wrapper.o collect-utils.o
> +  lto-wrapper.o collect-utils.o lockfile.o
>  
>  # for anything that is shared use the cc1plus profile data, as that
>  # is likely the most exercised during the build
> @@ -2359,7 +2359,8 @@ collect2$(exeext): $(COLLECT2_OBJS) $(LIBDEPS)
>  CFLAGS-collect2.o += -DTARGET_MACHINE=\"$(target_noncanonical)\" \
>   @TARGET_SYSTEM_ROOT_DEFINE@
>  
> -LTO_WRAPPER_OBJS = lto-wrapper.o collect-utils.o ggc-none.o
> +LTO_WRAPPER_OBJS = lto-wrapper.o collect-utils.o ggc-none.o lockfile.o
> +
>  lto-wrapper$(exeext): $(LTO_WRAPPER_OBJS) libcommon-target.a $(LIBDEPS)
>   +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o T$@ \
>  $(LTO_WRAPPER_OBJS) libcommon-target.a $(LIBS)
> diff --git a/gcc/lockfile.cc b/gcc/lockfile.cc
> new file mode 100644
> index 000..9440e8938f3
> --- /dev/null
> +++ b/gcc/lockfile.cc
> @@ -0,0 +1,136 @@
> +/* File locking.
> +   Copyright (C) 2009-2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +
> +#include "lockfile.h"
> +
> +
> +/* Unique write lock.  No other lock can be held on this lockfile.
> +   Blocking call.  */
> +int
> +lockfile::lock_write ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_WRLCK;
> +
> +  while (fcntl (fd, F_SETLKW, &s_flock) && errno == EINTR)
> +continue;
> +#endif
> +  return 0;
> +}
> +
> +/* Unique write lock.  No other lock can be held on this lockfile.
> +   Only locks if this filelock is not locked by any other process.
> +   Return whether locking was successful.  */
> +int
> +lockfile::try_lock_write ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_WRLCK;
> +
> +  if (fcntl (fd, F_SETLK, &s_flock) == -1)
> +{
> +  close (fd);
> +  fd = -1;
> +  return 1;
> +}
> +#endif
> +  return 0;
> +}
> +
> +/* Shared read lock.  Only read lock can be held concurrently.
> +   If write lock is already held by this process, it will be
> +   changed to read lock.
> +   Blocking call.  */
> +int
> +lockfile::lock_read ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_RDLCK;
> +
> +  while (fcntl (fd, F_SETLKW, &s_flock) && errno == EINTR)
> +continue;
> +#endif
> +  return 0;
> +}
> +
> +/* Unlock all previously placed locks.  */
> +void
> +lockfile::unlock ()
> +{
> +  if (fd < 

Re: [PATCH 2/7] lto: Remove random_seed from section name.

2023-12-29 Thread Jan Hubicka
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-streamer.cc (lto_get_section_name): Remove random_seed in WPA.

This is also OK. (Since the patch lacks an explanation: the random suffixes
are added for ld -r to work.  That never happens between WPA and ltrans, so
they only consume extra space and confuse the ltrans cache.)
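
(Illustratively - the payload names are made up:

  .gnu.lto_foo.0x7f3a91   <- before: random seed appended in WPA
  .gnu.lto_foo.0          <- after: stable suffix, ltrans-cache friendly)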
> ---
>  gcc/lto-streamer.cc | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/lto-streamer.cc b/gcc/lto-streamer.cc
> index 4968fd13413..53275e32618 100644
> --- a/gcc/lto-streamer.cc
> +++ b/gcc/lto-streamer.cc
> @@ -132,11 +132,17 @@ lto_get_section_name (int section_type, const char 
> *name,
>   doesn't confuse the reader with merged sections.
>  
>   For options don't add a ID, the option reader cannot deal with them
> - and merging should be ok here. */
> + and merging should be ok here.
> +
> + WPA output is sent to LTRANS directly inside of lto-wrapper, so name
> + uniqueness for external tools is not needed.
> + Randomness would inhibit incremental LTO.  */
>if (section_type == LTO_section_opts)
>  strcpy (post, "");
>else if (f != NULL) 
>  sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, f->id);
> +  else if (flag_wpa)
> +strcpy (post, ".0");
Can't post be just an empty string?
>else
>  sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, get_random_seed 
> (false)); 
>char *res = concat (section_name_prefix, sep, add, post, NULL);
> -- 
> 2.42.1
> 


Re: [PATCH 1/7] lto: Skip flag OPT_fltrans_output_list_.

2023-12-29 Thread Jan Hubicka
Hi,
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-opts.cc (lto_write_options): Skip OPT_fltrans_output_list_.
OK,
thanks,
Honza
> ---
>  gcc/lto-opts.cc | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/gcc/lto-opts.cc b/gcc/lto-opts.cc
> index c9bee9d4197..0451e290c75 100644
> --- a/gcc/lto-opts.cc
> +++ b/gcc/lto-opts.cc
> @@ -152,6 +152,7 @@ lto_write_options (void)
>   case OPT_fprofile_prefix_map_:
>   case OPT_fcanon_prefix_map:
>   case OPT_fwhole_program:
> + case OPT_fltrans_output_list_:
> continue;
>  
>   default:
> -- 
> 2.42.1
> 


Re: [PATCH v7] Add condition coverage (MC/DC)

2023-12-29 Thread Jan Hubicka
> gcc/ChangeLog:
> 
>   * builtins.cc (expand_builtin_fork_or_exec): Check
> condition_coverage_flag.
>   * collect2.cc (main): Add -fno-condition-coverage to OBSTACK.
>   * common.opt: Add new options -fcondition-coverage and
> -Wcoverage-too-many-conditions.
>   * doc/gcov.texi: Add --conditions documentation.
>   * doc/invoke.texi: Add -fcondition-coverage documentation.
>   * function.cc (free_after_compilation): Clear conditions.
>   (allocate_struct_function): Allocate conditions.
>   (basic_condition_uid): New.
>   * function.h (struct function): Add conditions.
>   (basic_condition_uid): New declaration.
>   * gcc.cc: Link gcov on -fcondition-coverage.
>   * gcov-counter.def (GCOV_COUNTER_CONDS): New.
>   * gcov-dump.cc (tag_conditions): New.
>   * gcov-io.h (GCOV_TAG_CONDS): New.
>   (GCOV_TAG_CONDS_LENGTH): New.
>   (GCOV_TAG_CONDS_NUM): New.
>   * gcov.cc (class condition_info): New.
>   (condition_info::condition_info): New.
>   (condition_info::popcount): New.
>   (struct coverage_info): New.
>   (add_condition_counts): New.
>   (output_conditions): New.
>   (print_usage): Add -g, --conditions.
>   (process_args): Likewise.
>   (output_intermediate_json_line): Output conditions.
>   (read_graph_file): Read condition counters.
>   (read_count_file): Likewise.
>   (file_summary): Print conditions.
>   (accumulate_line_info): Accumulate conditions.
>   (output_line_details): Print conditions.
>   * gimplify.cc (next_cond_uid): New.
>   (reset_cond_uid): New.
>   (shortcut_cond_r): Set condition discriminator.
>   (tag_shortcut_cond): New.
>   (shortcut_cond_expr): Set condition discriminator.
>   (gimplify_cond_expr): Likewise.
>   (gimplify_function_tree): Call reset_cond_uid.
>   * ipa-inline.cc (can_early_inline_edge_p): Check
> condition_coverage_flag.
>   * ipa-split.cc (pass_split_functions::gate): Likewise.
>   * passes.cc (finish_optimization_passes): Likewise.
>   * profile.cc (struct condcov): New declaration.
>   (cov_length): Likewise.
>   (cov_blocks): Likewise.
>   (cov_masks): Likewise.
>   (cov_maps): Likewise.
>   (cov_free): Likewise.
>   (instrument_decisions): New.
>   (read_thunk_profile): Control output to file.
>   (branch_prob): Call find_conditions, instrument_decisions.
>   (init_branch_prob): Add total_num_conds.
>   (end_branch_prob): Likewise.
>   * tree-core.h (struct tree_exp): Add condition_uid.
>   * tree-profile.cc (struct conds_ctx): New.
>   (CONDITIONS_MAX_TERMS): New.
>   (EDGE_CONDITION): New.
>   (topological_cmp): New.
>   (index_of): New.
>   (single_p): New.
>   (single_edge): New.
>   (contract_edge_up): New.
>   (struct outcomes): New.
>   (conditional_succs): New.
>   (condition_index): New.
>   (masking_vectors): New.
>   (emit_assign): New.
>   (emit_bitwise_op): New.
>   (make_top_index_visit): New.
>   (make_top_index): New.
>   (paths_between): New.
>   (struct condcov): New.
>   (cov_length): New.
>   (cov_blocks): New.
>   (cov_masks): New.
>   (cov_maps): New.
>   (cov_free): New.
>   (gimple_cond_uid): New.
>   (find_conditions): New.
>   (struct counters): New.
>   (find_counters): New.
>   (resolve_counter): New.
>   (resolve_counters): New.
>   (instrument_decisions): New.
>   (tree_profiling): Check condition_coverage_flag.
>   (pass_ipa_tree_profile::gate): Likewise.
>   * tree.h (SET_EXPR_UID): New.
>   (EXPR_COND_UID): New.
> 
> libgcc/ChangeLog:
> 
>   * libgcov-merge.c (__gcov_merge_ior): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * lib/gcov.exp: Add condition coverage test function.
>   * g++.dg/gcov/gcov-18.C: New test.
>   * gcc.misc-tests/gcov-19.c: New test.
>   * gcc.misc-tests/gcov-20.c: New test.
>   * gcc.misc-tests/gcov-21.c: New test.
>   * gcc.misc-tests/gcov-22.c: New test.
>   * gcc.misc-tests/gcov-23.c: New test.

Sorry for taking so long on this - I needed some time to actually try
the patch, since generally we will need more changes in the frontend to
preserve conditionals intact till gimple.
> This revision brings quite a few changes, some of which warrant a more
> careful review.
> 
> 1. Basic conditions are tied to the Boolean expression during
>gimplification, not through CFG analysis. The CFG analysis seemed to
>work well up until constructs like a && fn (b && c) && d where
>fn(...) seems indistinguishable from then-blocks. This wipes out much
>of the implementation in tree-profile.cc.
> 2. I changed the flag from -fprofile-conditions to -fcondition-coverage.
>-fprofile-conditions was chosen because of its symmetry with
>-fprofile-arcs, condition-coverage does feel more appropriate.

This seems good. 

Re: Disable FMADD in chains for Zen4 and generic

2023-12-13 Thread Jan Hubicka
> > The diffrerence is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores doesn't.
> >
> > Since this seems noticeable win on zen and not loss on Core it seems like 
> > good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no compplains.
> The generic part LGTM.(It's exactly what we proposed in [1])
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html

Thanks.  I wonder if we can think of other generic changes that would make
sense to do?
Concerning zen4 and FMA, it is not really a win with AVX512 enabled
(which is what I was benchmarking for znver4 tuning), but indeed it is
a win with AVX256 where the extra latency is not hidden by the parallelism
exposed by doing everything twice.

I re-benchmarked zen4 and it behaves similarly to zen3 with avx256, so
for x86-64-v3 this makes sense.

Honza
> >
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          for(k=0; k<SIZE; k++)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >clock_t s, e;
> >
> >init();
> >s=clock();
> >mult();
> >e=clock();
> >printf("mult took %10d clocks\n", (int)(e-s));
> >
> >return 0;
> >
> > }
> >
> > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, 
> > X86_TUNE_AVOID_256FMA_CHAINS):
> > Enable for znver4 and generic.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, 
> > "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > m_ZNVER2 | m_ZNVER3
> > -  | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +  | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> > m_ZNVER3
> > - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> > m_ZNVER3 | m_ZNVER4
> > + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > smaller FMA chain.  */
> 
> 
> 
> -- 
> BR,
> Hongtao


Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Jan Hubicka
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with
> multiple dependent FMAs.
> 
> I can't understand how this uarch detail can affect performance when
> as in the testcase
> the longest input latency is on the multiplication from a memory load.
> Do we actually
> understand _why_ the FMAs are slower here?

This is my understanding:
The loop is well predictable and memory address calculations + loads can
happen in parallel.  So the main dependency chain is updating the accumulator
computing c[i][j].  FMADD is 4 cycles on Zen4, while ADD is 3.  So the
loop with FMADD cannot run any faster than one iteration per 4 cycles,
while ADD can do one iteration per 3.  Which roughly matches the
speedup we see: 484875179*3/4=363656384, while the measured speed is
375875209 cycles.  The benchmark is quite short and I ran it 100 times in
perf to collect the data, so the overhead probably accounts for the smaller
than expected difference.
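
Spelling the accounting out (my arithmetic):

  FMA loop:     1 iteration per 4 cycles (fmadd latency)
  mul+add loop: 1 iteration per 3 cycles (add latency; the mul overlaps)
  expected:     484875179 * 3/4 = 363656384 cycles
  measured:     375875209 cycles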

> 
> Do we know that Cores can start the multiplication part when the add
> operand isn't
> ready yet?  I'm curious how you set up a micro benchmark to measure this.

Here is cycle counting benchmark:
#include <stdio.h>
int
main()
{ 
  float o=0;
  for (int i = 0; i < 10; i++)
  {
#ifdef ACCUMULATE
float p1 = o;
float p2 = 0;
#else
float p1 = 0;
float p2 = o;
#endif
float p3 = 0;
#ifdef FMA
asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
float t;
asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
  }
  printf ("%f\n",o);
  return 0;
}

It performs FMAs in sequence all with zeros.  If you define ACCUMULATE
you get the pattern from matrix multiplication. On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep 
cycles:
 4,001,011,489  cycles:u #4.837 GHz 
(83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
 3,000,335,064  cycles:u #4.835 GHz 
(83.08%)

So 4 cycles for FMA loop and 3 cycles for separate mul and add.
Muls execute in parallel to adds in the second case.
If the dependency chain is done over the multiplied parameter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
 4,000,118,069  cycles:u #4.836 GHz 
(83.32%)
jh@ryzen3:~> gcc -O3  l.c ; perf stat ./a.out 2>&1 | grep cycles:
 6,001,947,341  cycles:u #4.838 GHz 
(83.32%)

FMA is the same (it is still one FMA instruction per iteration) while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep 
cycles:u
 5,001,515,473  cycles:u #3.796 GHz
jh@aster:~> gcc -O3 l.c  -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
 4,000,977,739  cycles:u #3.819 GHz
jh@aster:~> gcc -O3 l.c  -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
 5,350,523,047  cycles:u #3.814 GHz
jh@aster:~> gcc -O3 l.c   ; perf stat ./a.out 2>&1 | grep cycles:u
10,251,994,240  cycles:u #3.852 GHz

So FMA seems 5 cycles if we accumulate and a bit more (within noise) if we do
the long chain.  I think some cores have a bigger difference between these
two numbers.
I am a bit surprised by the last number of 10 cycles.  I would expect 8.

I changed the matrix multiplication benchmark to repeat the multiplication
100 times.

> 
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per 
> cycle.
> So in theory we can at most do 2 FMA per cycle but with latency (FMA)
> == 4 for Zen3/4
> and latency (FADD/FMUL) == 3 we might be able to squeeze out a little bit more
> throughput when there are many FADD/FMUL ops to execute?  That works 
> independent
> on whether FMAs have a head-start on multiplication as you'd still be
> bottle-necked
> on the 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in accumulation-type loops, so it is latency 4 versus latency 3
per accumulation.  Indeed in other loops fmadd is a win.
> 
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a 
> latency
> of four.  So you should get worse results there (looking at the
> numbers above you
> do get worse results, slightly so), probably the higher number of uops is 
> hidden
> by the latency.
I think the slower non-FMA on Core was just noise (it shows in overall
time but not in cycle counts).

I changed the benchmark to run the multiplication 100 times.
On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
mult took   15146405 clocks

 Performance counter stats for './a.out':

 15,149.62 msec 

Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Jan Hubicka
Hi,
this patch disables use of FMA in the matrix multiplication loop for generic
(for x86-64-v3) and zen4.  I tested this on zen4 and a Xeon Gold 6212U.

For Intel this is neutral both on the matrix multiplication microbenchmark
(attached) and spec2k17 where the difference was within noise for Core.

On core the micro-benchmark runs as follows:

With FMA:

   578,500,241  cycles:u #3.645 GHz 
( +-  0.12% )
   753,318,477  instructions:u   #1.30  insn per 
cycle  ( +-  0.00% )
   125,417,701  branches:u   #  790.227 M/sec   
( +-  0.00% )
  0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )


No FMA:

   577,573,960  cycles:u #3.514 GHz 
( +-  0.15% )
   878,318,479  instructions:u   #1.52  insn per 
cycle  ( +-  0.00% )
   125,417,702  branches:u   #  763.035 M/sec   
( +-  0.00% )
  0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )

So the cycle count is unchanged and discrete multiply+add takes the same time as 
FMA.

While on zen:


With FMA:
 484875179  cycles:u #3.599 GHz 
 ( +-  0.05% )  (82.11%)
 752031517  instructions:u   #1.55  insn per 
cycle 
 125106525  branches:u   #  928.712 M/sec   
 ( +-  0.03% )  (85.09%)
128356  branch-misses:u  #0.10% of all 
branches  ( +-  0.06% )  (83.58%)

No FMA:
 375875209  cycles:u #3.592 GHz 
 ( +-  0.08% )  (80.74%)
 875725341  instructions:u   #2.33  insn per 
cycle
 124903825  branches:u   #1.194 G/sec   
 ( +-  0.04% )  (84.59%)
  0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )

The difference is that Cores understand the fact that fmadd does not need
all three parameters to start computation, while Zen cores don't.

Since this seems a noticeable win on Zen and not a loss on Core, it seems like
a good default for generic.

I plan to commit the patch next week if there are no complaints.

Honza

#include <stdio.h>
#include <time.h>

#define SIZE 1000

float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

void init(void)
{
   int i, j, k;
   for(i=0; i<SIZE; i++)

Re: [PATCH] strub: add note on attribute access

2023-12-12 Thread Jan Hubicka
> On Dec  7, 2023, Alexandre Oliva  wrote:
> 
> > Thanks for raising the issue.  Maybe there should be at least a comment
> > there, and perhaps some asserts to check that pointer and reference
> > types don't make to indirect_parms.
> 
> Document why attribute access doesn't need the same treatment as fn
> spec, and check that the assumption behind it holds.
> 
> Regstrapped on x86_64-linux-gnu.  Ok to install?
> 
> 
> for  gcc/ChangeLog
> 
>   * ipa-strub.cc (pass_ipa_strub::execute): Check that we don't
>   add indirection to pointer parameters, and document attribute
>   access non-interactions.
OK,
Honza


Re: [PATCH] ipa/92606 - properly handle no_icf attribute for variables

2023-12-12 Thread Jan Hubicka
> The following adds no_icf handling for variables where the attribute
> was rejected.  It also fixes the check for no_icf by checking both
> the source and the targets decl.
> 
> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> 
> This would solve the AVR issue with merging of "progmem" attributed
> and non-"progmem" attributed variables if they'd also add no_icf there.
> 
> OK?
> 
> Thanks,
> Richard.
> 
>   PR ipa/92606
> gcc/c-family/
>   * c-attribs.cc (handle_noicf_attribute): Also allow the
>   attribute on global variables.
> 
> gcc/
>   * ipa-icf.cc (sem_item_optimizer::merge_classes): Check
>   both source and alias for the no_icf attribute.
>   * doc/extend.texi (no_icf): Document variable attribute.
OK,
thanks!
Honza


Re: [PATCH v5] Introduce strub: machine-independent stack scrubbing

2023-12-06 Thread Jan Hubicka
Hi,
I am sorry for sending this late.  I think the ipa changes are generally
fine.  There are a few things which were not clear to me.
> for  gcc/ChangeLog
> 
>   * Makefile.in (OBJS): Add ipa-strub.o.
>   (GTFILES): Add ipa-strub.cc.
>   * builtins.def (BUILT_IN_STACK_ADDRESS): New.
>   (BUILT_IN___STRUB_ENTER): New.
>   (BUILT_IN___STRUB_UPDATE): New.
>   (BUILT_IN___STRUB_LEAVE): New.
>   * builtins.cc: Include ipa-strub.h.
>   (STACK_STOPS, STACK_UNSIGNED): Define.
>   (expand_builtin_stack_address): New.
>   (expand_builtin_strub_enter): New.
>   (expand_builtin_strub_update): New.
>   (expand_builtin_strub_leave): New.
>   (expand_builtin): Call them.
>   * common.opt (fstrub=*): New options.
>   * doc/extend.texi (strub): New type attribute.
>   (__builtin_stack_address): New function.
>   (Stack Scrubbing): New section.
>   * doc/invoke.texi (-fstrub=*): New options.
>   (-fdump-ipa-*): New passes.
>   * gengtype-lex.l: Ignore multi-line pp-directives.
>   * ipa-inline.cc: Include ipa-strub.h.
>   (can_inline_edge_p): Test strub_inlinable_to_p.
>   * ipa-split.cc: Include ipa-strub.h.
>   (execute_split_functions): Test strub_splittable_p.
>   * ipa-strub.cc, ipa-strub.h: New.
>   * passes.def: Add strub_mode and strub passes.
>   * tree-cfg.cc (gimple_verify_flow_info): Note on debug stmts.
>   * tree-pass.h (make_pass_ipa_strub_mode): Declare.
>   (make_pass_ipa_strub): Declare.
>   (make_pass_ipa_function_and_variable_visibility): Fix
>   formatting.
>   * tree-ssa-ccp.cc (optimize_stack_restore): Keep restores
>   before strub leave.
>   * multiple_target.cc (pass_target_clone::gate): Test seen_error.
>   * attribs.cc: Include ipa-strub.h.
>   (decl_attributes): Support applying attributes to function
>   type, rather than pointer type, at handler's request.
>   (comp_type_attributes): Combine strub_comptypes and target
>   comp_type results.
>   * doc/tm.texi.in (TARGET_STRUB_USE_DYNAMIC_ARRAY): New.
>   (TARGET_STRUB_MAY_USE_MEMSET): New.
>   * doc/tm.texi: Rebuilt.
>   * cgraph.h (symtab_node::reset): Add preserve_comdat_group
>   param, with a default.
>   * cgraphunit.cc (symtab_node::reset): Use it.
> 
> for  gcc/c-family/ChangeLog
> 
>   * c-attribs.cc: Include ipa-strub.h.
>   (handle_strub_attribute): New.
>   (c_common_attribute_table): Add strub.
> 
> for  gcc/ada/ChangeLog
> 
>   * gcc-interface/trans.cc: Include ipa-strub.h.
>   (gigi): Make internal decls for targets of compiler-generated
>   calls strub-callable too.
>   (build_raise_check): Likewise.
>   * gcc-interface/utils.cc: Include ipa-strub.h.
>   (handle_strub_attribute): New.
>   (gnat_internal_attribute_table): Add strub.
> 
> for  gcc/testsuite/ChangeLog
> 
>   * c-c++-common/strub-O0.c: New.
>   * c-c++-common/strub-O1.c: New.
>   * c-c++-common/strub-O2.c: New.
>   * c-c++-common/strub-O2fni.c: New.
>   * c-c++-common/strub-O3.c: New.
>   * c-c++-common/strub-O3fni.c: New.
>   * c-c++-common/strub-Og.c: New.
>   * c-c++-common/strub-Os.c: New.
>   * c-c++-common/strub-all1.c: New.
>   * c-c++-common/strub-all2.c: New.
>   * c-c++-common/strub-apply1.c: New.
>   * c-c++-common/strub-apply2.c: New.
>   * c-c++-common/strub-apply3.c: New.
>   * c-c++-common/strub-apply4.c: New.
>   * c-c++-common/strub-at-calls1.c: New.
>   * c-c++-common/strub-at-calls2.c: New.
>   * c-c++-common/strub-defer-O1.c: New.
>   * c-c++-common/strub-defer-O2.c: New.
>   * c-c++-common/strub-defer-O3.c: New.
>   * c-c++-common/strub-defer-Os.c: New.
>   * c-c++-common/strub-internal1.c: New.
>   * c-c++-common/strub-internal2.c: New.
>   * c-c++-common/strub-parms1.c: New.
>   * c-c++-common/strub-parms2.c: New.
>   * c-c++-common/strub-parms3.c: New.
>   * c-c++-common/strub-relaxed1.c: New.
>   * c-c++-common/strub-relaxed2.c: New.
>   * c-c++-common/strub-short-O0-exc.c: New.
>   * c-c++-common/strub-short-O0.c: New.
>   * c-c++-common/strub-short-O1.c: New.
>   * c-c++-common/strub-short-O2.c: New.
>   * c-c++-common/strub-short-O3.c: New.
>   * c-c++-common/strub-short-Os.c: New.
>   * c-c++-common/strub-strict1.c: New.
>   * c-c++-common/strub-strict2.c: New.
>   * c-c++-common/strub-tail-O1.c: New.
>   * c-c++-common/strub-tail-O2.c: New.
>   * c-c++-common/torture/strub-callable1.c: New.
>   * c-c++-common/torture/strub-callable2.c: New.
>   * c-c++-common/torture/strub-const1.c: New.
>   * c-c++-common/torture/strub-const2.c: New.
>   * c-c++-common/torture/strub-const3.c: New.
>   * c-c++-common/torture/strub-const4.c: New.
>   * c-c++-common/torture/strub-data1.c: New.
>   * c-c++-common/torture/strub-data2.c: New.
>   * 

Re: [PATCH v7] Introduce attribute sym_alias

2023-12-06 Thread Jan Hubicka
> On Nov 30, 2023, Jan Hubicka  wrote:
> 
> >> +  if (VAR_P (replaced))
> >> +  varpool_node::create_alias (sym_node->decl, replacement);
> >> +  else
> >> +  cgraph_node::create_alias (sym_node->decl, replacement);
> 
> Unfortunately, this change didn't work.  Several of the C++ tests
> regressed with it.  Going back to same-body aliases, they work.
> 
> I suspect this may have to do with the infrastructure put in to deal
> with cdtors clones.

Do you have a short testcase for this?  The main oddities with same-body
aliases come from the fact that the C++ FE creates them early during
parsing, before all declaration flags are finished.

Later we do:

  /* Ugly, but the fixup cannot happen at a time same body alias is created;
 C++ FE is confused about the COMDAT groups being right.  */
  if (symtab->cpp_implicit_aliases_done)
FOR_EACH_SYMBOL (node)
  if (node->cpp_implicit_alias)
  node->fixup_same_cpp_alias_visibility (node->get_alias_target ());

Fixup copies some flags such as inline flags, visibility and comdat
groups which can change during the parsing process.

Since you produce aliases late at finalization time, I do not see how
this could be relevant.  Perhaps unless you manage to copy wrong flags
from implicit aliases before the fixup happens, which would be a simple
ordering problem.

Honza
> 
> I've also found some discrepancies between C and C++ WRT sym_alias in
> static local variables, and failure to detect and report symbol name
> clashes between sym_aliases and unrelated declarations.  Thanks, Joseph,
> for pushing me to consider other cases I hadn't thought of before :-)
> I'm going to look into these, but for now, the patch below gets a full
> pass, with these issues XFAILed.
> 
> 
> > The IPA bits are fine.  I will take a look on your second patch.
> 
> Thanks!
> 
> 
> Introduce attribute sym_alias
> 
> This patch introduces an attribute to add extra asm names (aliases)
> for a decl when its definition is output.  The main goal is to ease
> interfacing C++ with Ada, as C++ mangled names have to be named, and
> in some cases (e.g. when using stdint.h typedefs in function
> arguments) the symbol names may vary across platforms.
> 
> The attribute is usable in C and C++, presumably in all C-family
> languages.  It can be attached to global variables and functions.  In
> C++, it can also be attached to class types, namespace-scoped
> variables and functions, static data members, member functions,
> explicit instantiations and specializations of template functions,
> members and classes.
> 
> When applied to constructors or destructors, additional sym aliases
> with _Base and _Del suffixes are defined for variants other than
> complete-object ones.  This changes the assumption that clones always
> carry the same attributes as their abstract declarations, so there is
> now a function to adjust them.
> 
> C++ also had a bug in which attributes from local extern declarations
> failed to be propagated to a preexisting corresponding
> namespace-scoped decl.  I've fixed that, and adjusted acc tests that
> distinguished between C and C++ in this regard.
> 
> Applying the attribute to class types is only valid in C++, and the
> effect is to attach the alias to the RTTI object associated with the
> class type.
> 
> for  gcc/ChangeLog
> 
>   * attribs.cc: Include cgraph.h.
>   (decl_attributes): Allow late introduction of sym_alias in
>   types.
>   (create_sym_alias_decl, create_sym_alias_decls): New.
>   * attribs.h: Declare them.
>   (FOR_EACH_SYM_ALIAS): New macro.
>   * cgraph.cc (cgraph_node::create): Create sym_alias decls.
>   * varpool.cc (varpool_node::get_create): Create sym_alias
>   decls.
>   * cgraph.h (symtab_node::remap_sym_alias_target): New.
>   * symtab.cc (symtab_node::remap_sym_alias_target): Define.
>   * cgraphunit.cc (cgraph_node::analyze): Create alias_target
>   node if needed.
>   (analyze_functions): Fixup visibility of implicit alias only
>   after its node is analyzed.
>   * doc/extend.texi (sym_alias): Document for variables,
>   functions and types.
> 
> for  gcc/ada/ChangeLog
> 
>   * doc/gnat_rm/interfacing_to_other_languages.rst: Mention
>   attribute sym_alias to give RTTI symbols mnemonic names.
>   * doc/gnat_ugn/the_gnat_compilation_model.rst: Mention
>   aliases.  Fix incorrect ref to C1 ctor variant.
> 
> for  gcc/c-family/ChangeLog
> 
>   * c-ada-spec.cc (pp_asm_name): Use first sym_alias if
>   available.
>   * c-attribs.cc (handle_sym_alias_attribute): New.
>   (c_common_attribute_table): Add sym_alias.
>   (handle_copy_at

Re: [PATCH v6] Introduce attribute sym_alias

2023-11-30 Thread Jan Hubicka
> On Nov 22, 2023, Jan Hubicka  wrote:
> 
> > I wonder why you use same body aliases, which are kind of special to C++
> > frontend (and come with fixup code working around its quirks you had to
> > disable above).
> 
> TBH, I don't recall whether I had any reason to have gone down that
> path, or I just didn't realize I could have done something simpler.
> I've worked on and off on this patch for the past 3.5y, so many details
> have faded away from memory by now.  I do recall there were some
> challenges in making the sym_alias name available as an alias target
> early enough for it to be found, and this may have been related with
> these odd choices back then.  But the good news is that calling
> create_alias works just fine.  I'm suppose that creating alias
> attributes would as well, but why bother?  This looks even clearner!
> Thanks!

> diff --git a/gcc/cgraphunit.cc b/gcc/cgraphunit.cc
> index bccd2f2abb5a3..eb2d05094e989 100644
> --- a/gcc/cgraphunit.cc
> +++ b/gcc/cgraphunit.cc
> @@ -1175,7 +1175,7 @@ analyze_functions (bool first_time)
>   C++ FE is confused about the COMDAT groups being right.  */
>if (symtab->cpp_implicit_aliases_done)
>  FOR_EACH_SYMBOL (node)
> -  if (node->cpp_implicit_alias)
> +  if (node->cpp_implicit_alias && node->analyzed)

I think you hould be able to drop this, since aliases you create now are
not same body aliases.
> +  if (VAR_P (replaced))
> + varpool_node::create_alias (sym_node->decl, replacement);
> +  else
> + cgraph_node::create_alias (sym_node->decl, replacement);
We probably should have create_alias on the symbol node directly, but that
is something I can clean up next stage1.

The IPA bits are fine.  I will take a look on your second patch.
Honza


Re: [PATCH] tree-sra: Avoid returns of references to SRA candidates

2023-11-29 Thread Jan Hubicka
> Hi,
> 
> On Tue, Nov 28 2023, Jan Hubicka wrote:
> >> On Tue, 28 Nov 2023, Martin Jambor wrote:
> >> 
> >> > On Tue, Nov 28 2023, Richard Biener wrote:
> >> > > On Mon, 27 Nov 2023, Martin Jambor wrote:
> >> > >
> >> > >> Hi,
> >> > >> 
> >> >> The enhancement to address PR 109849 contained an important thinko,
> >> > >> and that any reference that is passed to a function and does not
> >> > >> escape, must also not happen to be aliased by the return value of the
> >> > >> function.  This has quickly transpired as bugs PR 112711 and PR
> >> > >> 112721.
> >> > >> 
> >> > >> Just as IPA-modref does a good enough job to allow us to rely on the
> >> >> escaped set of variables, it seems to be doing well also on updating
> >> > >> EAF_NOT_RETURNED_DIRECTLY call argument flag which happens to address
> >> > >> exactly the situation we need to avoid.  Of course, if a call
> >> > >> statement ignores any returned value, we also do not need to check the
> >> > >> flag.
> >> > >
> >> > > But what about EAF_NOT_RETURNED_INDIRECTLY?  Don't you need to
> >> > > verify the parameter doesn't escape through the return at all?
> >> > >
> >> > 
> >> > I thought EAF_NOT_RETURNED_INDIRECTLY prohibits things like "return
> >> > param->next" but those are not a problem (whatever next points to cannot
> >> > be an SRA candidate and any ADDR_EXPR storing its address there would
> >> > trigger a disqualification or at least an assert).  But I guess I am
> >> > wrong, what is actually the exact meaning of the flag?
> >> 
> >> I thought it's return (x.ptr = param, &x);
> >> 
> >> so the parameter is reachable from the return value.
> >> 
> >> But let's Honza answer...
> > It is the same distinction as direct/indirect escape: it checks whether
> > values pointed to by the arg can possibly be returned.  Indeed maybe we
> > should think of a better name - the other interpretation did not even
> > occur to me, but it makes sense.
> >
> 
> Is my patch OK then?

Yes, given that we do not attempt to track any EAF flags for things
ever stored to memory, I believe this is safe.

Honza
> 
> (Apart from making one of the testcases x86_64-only, as Andrew pointed
> out, which I wanted to do but the line somehow got lost.  Making the
> testcase more general is fairly low on my contested TODO list and the
> testing depends on a specific instruction trapping.)
> 
> Thanks,
> 
> Martin
> 


Re: [PATCH] tree-sra: Avoid returns of references to SRA candidates

2023-11-28 Thread Jan Hubicka
> 
> 
> > On 28.11.2023 at 17:59, Jan Hubicka wrote:
> > 
> > 
> >> 
> >>> On Tue, 28 Nov 2023, Martin Jambor wrote:
> >>> 
> >>> On Tue, Nov 28 2023, Richard Biener wrote:
> >>>> On Mon, 27 Nov 2023, Martin Jambor wrote:
> >>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> The enhancement to address PR 109849 contained an important thinko,
> >>>>> and that any reference that is passed to a function and does not
> >>>>> escape, must also not happen to be aliased by the return value of the
> >>>>> function.  This has quickly transpired as bugs PR 112711 and PR
> >>>>> 112721.
> >>>>> 
> >>>>> Just as IPA-modref does a good enough job to allow us to rely on the
> >>>>> escaped set of variables, it seems to be doing well also on updating
> >>>>> EAF_NOT_RETURNED_DIRECTLY call argument flag which happens to address
> >>>>> exactly the situation we need to avoid.  Of course, if a call
> >>>>> statement ignores any returned value, we also do not need to check the
> >>>>> flag.
> >>>> 
> >>>> But what about EAF_NOT_RETURNED_INDIRECTLY?  Don't you need to
> >>>> verify the parameter doesn't escape through the return at all?
> >>>> 
> >>> 
> >>> I thought EAF_NOT_RETURNED_INDIRECTLY prohibits things like "return
> >>> param->next" but those are not a problem (whatever next points to cannot
> >>> be an SRA candidate and any ADDR_EXPR storing its address there would
> >>> trigger a disqualification or at least an assert).  But I guess I am
> >>> wrong, what is actually the exact meaning of the flag?
> >> 
> >> I thought it's return (x.ptr = param, &x);
> >> 
> >> so the parameter is reachable from the return value.
> >> 
> >> But let's Honza answer...
> > It is the same distinction as direct/indirect escape: it checks whether
> > values pointed to by the arg can possibly be returned.  Indeed maybe we
> > should think of a better name - the other interpretation did not even
> > occur to me, but it makes sense.
> 
> So does the directly returned flag cover my interpretation of indirect or is 
> there a hole?

Stores go through:

  /* Handle *lhs = name.  */
  if (assign && gimple_assign_rhs1 (assign) == name)
{ 
  if (dump_file) 
fprintf (dump_file, "%*s  ssa name saved to memory\n",
 m_depth * 4, "");
  m_lattice[index].merge (0);
}

So we give up on any flags.  So far modref does not try to track values
in memory at all. I suppose PTA info does not help me here, since the
memory the value is stored to may not escape, but later it may be read and
copied into something that does escape?
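
I.e. something like this toy example (mine, not from the testcases):

int *g;
int *
f (int *p)
{
  int *slot[1];
  slot[0] = p;    /* store: modref merges the lattice to 0, all flags lost */
  g = slot[0];    /* later the value is read back and escapes via a global */
  return g;
}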

Honza
> 
> Richard 
> 
> > Honza
> >> 
> >> Richard.


Re: [PATCH] tree-sra: Avoid returns of references to SRA candidates

2023-11-28 Thread Jan Hubicka
> On Tue, 28 Nov 2023, Martin Jambor wrote:
> 
> > On Tue, Nov 28 2023, Richard Biener wrote:
> > > On Mon, 27 Nov 2023, Martin Jambor wrote:
> > >
> > >> Hi,
> > >> 
> > >> The enhancement to address PR 109849 contained an important thinko,
> > >> and that any reference that is passed to a function and does not
> > >> escape, must also not happen to be aliased by the return value of the
> > >> function.  This has quickly transpired as bugs PR 112711 and PR
> > >> 112721.
> > >> 
> > >> Just as IPA-modref does a good enough job to allow us to rely on the
> > >> escaped set of variables, it seems to be doing well also on updating
> > >> EAF_NOT_RETURNED_DIRECTLY call argument flag which happens to address
> > >> exactly the situation we need to avoid.  Of course, if a call
> > >> statement ignores any returned value, we also do not need to check the
> > >> flag.
> > >
> > > But what about EAF_NOT_RETURNED_INDIRECTLY?  Don't you need to
> > > verify the parameter doesn't escape through the return at all?
> > >
> > 
> > I thought EAF_NOT_RETURNED_INDIRECTLY prohibits things like "return
> > param->next" but those are not a problem (whatever next points to cannot
> > be an SRA candidate and any ADDR_EXPR storing its address there would
> > trigger a disqualification or at least an assert).  But I guess I am
> > wrong, what is actually the exact meaning of the flag?
> 
> I thought it's return (x.ptr = param, &x);
> 
> so the parameter is reachable from the return value.
> 
> But let's Honza answer...
It is the same distinction as direct/indirect escape: it checks whether
values pointed to by the arg can possibly be returned.  Indeed maybe we
should think of a better name - the other interpretation did not even
occur to me, but it makes sense.
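
To write the distinction down (my examples, matching that reading):

  int *f1 (int *p)  { return p; }   /* arg returned directly */
  int *f2 (int **p) { return *p; }  /* value pointed to by the arg is
                                       returned; the indirect case */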

Honza
> 
> Richard.


Re: libstdc++: Speed up push_back

2023-11-24 Thread Jan Hubicka
> With my changes at -O3 we now inline push_back, so we could optimize the
> first loop to the second. However with 
> ~/trunk-install/bin/gcc -O3  auto.C  -S -fdump-tree-all-details 
> -fno-exceptions -fno-store-merging -fno-tree-slp-vectorize
> the first problem is right at the beginning:
> 
>[local count: 97603128]:
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_start = 0B;
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_finish = 0B;
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_end_of_storage = 0B;
>   _37 = operator new (40);
>   _22 = x_4(D)->D.26019._M_impl.D.25320._M_finish;
>   _23 = x_4(D)->D.26019._M_impl.D.25320._M_start;

Thinking about this problem, it is easy to adjust reserve to copy _M_start
and _M_finish to local variables across the call to new(), which makes
the old values visible to the compiler regardless of points-to analysis.
In fact _M_realloc_insert already has such code:

  // Make local copies of these members because the compiler thinks
  // the allocator can alter them if 'this' is globally reachable.
  pointer __old_start = this->_M_impl._M_start;
  pointer __old_finish = this->_M_impl._M_finish;

So the attached patch does that in reserve.  The downside is that if things
are not inlined we may end up pushing an extra copy to the stack, but I
believe the benefit from inlining actually pays this back.
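
A sketch of the same idiom applied to reserve (mine, not the exact patch
body):

  pointer __old_start = this->_M_impl._M_start;
  pointer __old_finish = this->_M_impl._M_finish;
  pointer __new_start = this->_M_allocate (__n);  // may call operator new
  // relocate [__old_start, __old_finish) into __new_start using the local
  // copies, so the loads need not be redone after the call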

The testcase with the loop still does not optimize it, so I simplified it:

#include <vector>

auto
f()
{
  std::vector<int> x;
  x.reserve(10);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  x.push_back(0);
  return x;
}

auto
g()
{ return std::vector<int>(10, 0); }

This now compiles to less code but it is somewhat funny:

   [local count: 1073741824]:
  MEM  [(int * *)x_3(D)] = { 0, 0 };
  MEM[(struct _Vector_impl_data *)x_3(D)]._M_end_of_storage = 0B;
  _70 = operator new (40);

   [local count: 1073741824]:
  x_3(D)->D.26024._M_impl.D.25325._M_start = _70;
  _65 = _70 + 40;
  x_3(D)->D.26024._M_impl.D.25325._M_end_of_storage = _65;
  _74 = _70 + 4;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _74;
  MEM  [(int *)_70] = 0;
  _80 = _70 + 8;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _80;
  *_80 = 0;
  _86 = _70 + 12;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _86;
  *_86 = 0;
  _92 = _70 + 16;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _92;
  *_92 = 0;
  _98 = _70 + 20;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _98;
  *_98 = 0;
  _104 = _70 + 24;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _104;
  *_104 = 0;
  _110 = _70 + 28;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _110;
  *_110 = 0;
  _116 = _70 + 32;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _116;
  *_116 = 0;
  _122 = _70 + 36;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _122;
  return x_3(D);

The setup code in BB2 is useless and due to fake escape, as discussed in
PRR112653.

However it is funny that we miss dead store elimination and repeatedly
set x_3(D)->D.26024._M_impl.D.25325._M_finish = _104;

The problem here is that until pushback.C.208t.forwprop4:std::vector we
do not optimize the reallocation code:

   [local count: 1073741824]:
  MEM  [(int * *)x_3(D)] = { 0, 0 };
  MEM[(struct _Vector_impl_data *)x_3(D)]._M_end_of_storage = 0B;
  _70 = operator new (40);

   [local count: 1073741824]:
  x_3(D)->D.26024._M_impl.D.25325._M_start = _70;
  _65 = _70 + 40;
  x_3(D)->D.26024._M_impl.D.25325._M_end_of_storage = _65;
  *_70 = 0;
  _74 = _70 + 4;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _74;
  D.26115 = 0;
  if (_65 != _74)
goto ; [82.57%]
  else
goto ; [17.43%]

   [local count: 886588624]:
  *_74 = 0;
  _80 = _70 + 8;
  x_3(D)->D.26024._M_impl.D.25325._M_finish = _80;
  goto ; [100.00%]

   [local count: 187153200]:
  std::vector<int>::_M_realloc_append (x_3(D), &D.26115);
  goto ; [100.00%]

   [count: 0]:
:
  goto ; [100.00%]

   [local count: 187153200]:
  pretmp_127 = MEM[(int * const &)x_3(D) + 8];
  pretmp_144 = MEM[(struct _Vector_base 
*)x_3(D)]._M_impl.D.25325._M_end_of_storage;

And after forwprop4 it is too late because DSE is no longer run.
I filed PR112706. For some reason FRE fails to optimize:

int *ptr;
void link_error ();
void
test 

Re: libstdc++: Speed up push_back

2023-11-23 Thread Jan Hubicka
Hi,
so if I understand it right, it should be safe to simply replace memmove
by memcpy.  I wonder if we can get rid of the count != 0 check at least
for glibc systems.  In general push_back now needs max-inline-insns-auto to
be 33 to be inlined at -O2.


jh@ryzen4:/tmp> cat ~/tt.C
#include <vector>
typedef unsigned int uint32_t;
struct pair_t {uint32_t first, second;};
struct pair_t pair;
void
test()
{
std::vector<pair_t> stack;
stack.push_back (pair);
while (!stack.empty()) {
pair_t cur = stack.back();
stack.pop_back();
if (!cur.first)
{
cur.second++;
stack.push_back (cur);
}
if (cur.second > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

jh@ryzen4:/tmp> ~/trunk-install/bin/g++ ~/tt.C -O2 --param 
max-inline-insns-auto=32 ; time ./a.out

real0m0.399s
user0m0.399s
sys 0m0.000s
jh@ryzen4:/tmp> ~/trunk-install/bin/g++ ~/tt.C -O2 --param 
max-inline-insns-auto=33 ; time ./a.out

real0m0.039s
user0m0.039s
sys 0m0.000s

The current inline limit is 15. We can save:
 - 2 insns if the inliner knows that the conditional guarding
   builtin_unreachable will die (I have a patch for this)
 - 4 insns if we work out that on 64bit hosts allocating a vector with
   2^63 elements is impossible
 - 2 insns if we allow a NULL parameter on memcpy
 - 2 insns if we allow a NULL parameter on delete
So this is 23 instructions. The inliner has hinting which could make
push_back a reasonable candidate for -O2 inlining, and then we could be
able to propagate interesting stuff across repeated calls to push_back.

libstdc++-v3/ChangeLog:

* include/bits/stl_uninitialized.h (relocate_a_1): Use memcpy instead 
of memmove.

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h 
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 1282af3bc43..a9b802774c6 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1119,14 +1119,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 #ifdef __cpp_lib_is_constant_evaluated
  if (std::is_constant_evaluated())
{
- // Can't use memmove. Wrap the pointer so that __relocate_a_1
+ // Can't use memcpy. Wrap the pointer so that __relocate_a_1
  // resolves to the non-trivial overload above.
  __gnu_cxx::__normal_iterator<_Tp*, void> __out(__result);
  __out = std::__relocate_a_1(__first, __last, __out, __alloc);
  return __out.base();
}
 #endif
- __builtin_memmove(__result, __first, __count * sizeof(_Tp));
+ __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
}
   return __result + __count;
 }


Re: libstdc++: Speed up push_back

2023-11-23 Thread Jan Hubicka
> > On Sunday, 19 November 2023 22:53:37 CET Jan Hubicka wrote:
> > > Sadly it is really hard to work out this
> > > from IPA passes, since we basically care whether the iterator points to
> > > the same place as the end pointer, which are both passed by reference.
> > > This is inter-procedural value numbering that is quite out of reach.
> > 
> > I've done a fair share of branching on __builtin_constant_p in 
> > std::experimental::simd to improve code-gen. It's powerful! But maybe we 
> > also need the other side of the story to tell the optimizer: "I know you 
> > can't const-prop everything; but this variable / expression, even if you 
> > need to put in a lot of effort, the performance difference will be worth 
> > it."
> > 
> > For std::vector, the remaining capacity could be such a value. The 
> > functions f() and g() are equivalent (their code-gen isn't https://
> > compiler-explorer.com/z/r44ejK1qz):
> > 
> > #include <vector>
> > 
> > auto
> > f()
> > {
> >   std::vector<int> x;
> >   x.reserve(10);
> >   for (int i = 0; i < 10; ++i)
> > x.push_back(0);
> >   return x;
> > }
> > auto
> > g()
> > { return std::vector<int>(10, 0); }
> 
> With my changes at -O3 we now inline push_back, so we could optimize the
> first loop to the second. However with 
> ~/trunk-install/bin/gcc -O3  auto.C  -S -fdump-tree-all-details 
> -fno-exceptions -fno-store-merging -fno-tree-slp-vectorize
> the first problem is right at the beginning:
> 
>[local count: 97603128]:
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_start = 0B;
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_finish = 0B;
>   MEM[(struct _Vector_impl_data *)x_4(D)]._M_end_of_storage = 0B;
>   _37 = operator new (40);

I also wonder if the default operator new and malloc can be handled as not
reading/modifying anything visible to the user code.  That would help
us to propagate here even if we lose track of points-to information.

We have:

  /* If the call is to a replaceable operator delete and results
 from a delete expression as opposed to a direct call to
 such operator, then we can treat it as free.  */
  if (fndecl
  && DECL_IS_OPERATOR_DELETE_P (fndecl)
  && DECL_IS_REPLACEABLE_OPERATOR (fndecl)
  && gimple_call_from_new_or_delete (stmt))
return ". o ";
  /* Similarly operator new can be treated as malloc.  */
  if (fndecl
  && DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl)
  && gimple_call_from_new_or_delete (stmt))
return "m ";
Which informs alias analysis that new returns a pointer to memory
not aliasing with anything and that free is not reading anything
from its parameter (but it is modelled as a write to make it clear
that the memory dies).
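
(Decoding those strings - my gloss of the attr-fnspec.h conventions: the
leading characters describe the return value and each argument then gets its
own entry; 'o' says the argument is only written to directly and does not
escape, while 'm' marks a malloc-like function whose result is fresh memory
aliasing nothing.)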

stmt_kills_ref_p special-cases BUILT_IN_FREE but not operator delete
to make it clear that everything pointed to by it dies.  This is needed
because 'o' only means that some data may be overwritten, but it does
not make it clear that all data dies.

Not handling operator delete seems like an omission, but maybe it is not
too critical since we emit clobbers around destructors, which are usually
right before the call to delete.  Also ipa-modref kill analysis does not
understand BUILT_IN_FREE nor delete, and it could.

I wonder if we can handle both as const except for the side effects
described.

Honza
>   _22 = x_4(D)->D.26019._M_impl.D.25320._M_finish;
>   _23 = x_4(D)->D.26019._M_impl.D.25320._M_start;
>   _24 = _22 - _23;
>   if (_24 > 0)
> goto ; [41.48%]
>   else
> goto ; [58.52%]
> 
> So the vector is first initialized with _M_start=_M_finish=0, but after
> the call to new we are no longer able to propagate this.
> 
> This is because x is returned and PTA considers it escaping.  This is
> problem discussed in 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653
> Which shows that it is likely worthwhile to fix PTA to handle this
> correctly.


Re: libstdc++: Speed up push_back

2023-11-23 Thread Jan Hubicka
> On Sunday, 19 November 2023 22:53:37 CET Jan Hubicka wrote:
> > Sadly it is really hard to work out this
> > from IPA passes, since we basically care whether the iterator points to
> > the same place as the end pointer, which are both passed by reference.
> > This is inter-procedural value numbering that is quite out of reach.
> 
> I've done a fair share of branching on __builtin_constant_p in 
> std::experimental::simd to improve code-gen. It's powerful! But maybe we 
> also need the other side of the story to tell the optimizer: "I know you 
> can't const-prop everything; but this variable / expression, even if you 
> need to put in a lot of effort, the performance difference will be worth 
> it."
> 
> For std::vector, the remaining capacity could be such a value. The 
> functions f() and g() are equivalent (their code-gen isn't https://
> compiler-explorer.com/z/r44ejK1qz):
> 
> #include <vector>
> 
> auto
> f()
> {
>   std::vector<int> x;
>   x.reserve(10);
>   for (int i = 0; i < 10; ++i)
> x.push_back(0);
>   return x;
> }
> auto
> g()
> { return std::vector<int>(10, 0); }

With my changes at -O3 we now inline push_back, so we could optimize the
first loop to the second. However with 
~/trunk-install/bin/gcc -O3  auto.C  -S -fdump-tree-all-details -fno-exceptions 
-fno-store-merging -fno-tree-slp-vectorize
the first problem is right at the beginning:

   [local count: 97603128]:
  MEM[(struct _Vector_impl_data *)x_4(D)]._M_start = 0B;
  MEM[(struct _Vector_impl_data *)x_4(D)]._M_finish = 0B;
  MEM[(struct _Vector_impl_data *)x_4(D)]._M_end_of_storage = 0B;
  _37 = operator new (40);
  _22 = x_4(D)->D.26019._M_impl.D.25320._M_finish;
  _23 = x_4(D)->D.26019._M_impl.D.25320._M_start;
  _24 = _22 - _23;
  if (_24 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

So the vector is first initialized with _M_start=_M_finish=0, but after
the call to new we are no longer able to propagate this.

This is because x is returned and PTA considers it escaping.  This is
the problem discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653
which shows that it is likely worthwhile to fix PTA to handle this
correctly.


Re: [PATCH v5] Introduce attribute sym_alias (was: Last call for bikeshedding on attribute sym/exalias/reverse_alias)

2023-11-22 Thread Jan Hubicka
Hi,
it seems that the interface to the symbol table is fairly minimal here,
reduced to...
>   (create_sym_alias_decl, create_sym_alias_decls): New.
>   * cgraphunit.cc (cgraph_node::analyze): Create alias_target
>   node if needed.
called from here...
>   (analyze_functions): Fixup visibility of implicit alias only
>   after its node is analyzed.

> +  if (VAR_P (replaced))
> + varpool_node::create_extra_name_alias (sym_node->decl, replacement);
> +  else
> + cgraph_node::create_same_body_alias (sym_node->decl, replacement);

I wonder why you use same-body aliases, which are kind of special to
the C++ frontend (and come with fixup code working around its quirks
that you had to disable above).

Why do you not produce the usual alias attribute once you know the
symbol table, so it goes through the cgraph_node::create_alias or
varpool_node::create_alias path?
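
That is, roughly the equivalent of (a minimal sketch, not from the
patch):

  extern "C" int f () { return 0; }
  extern "C" int g () __attribute__ ((alias ("f")));

which is the path ordinary alias attributes already take.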

Honza


libstdc++: Turn memmove to memcpy in vector reallocations

2023-11-21 Thread Jan Hubicka
Hi,
this patch turns memmove into memcpy where we can and also avoids the
extra guard checking that the block is non-empty.  This does not show
as a performance improvement in my push_back micro-benchmark because
vector reallocation does not happen that often.  In general, however,
we optimize memcpy better than memmove (we can inline it in some
cases).  Saving the extra 3 instructions makes push_back more likely
to be inlined though (the estimate is now 23).
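
To illustrate why the distinction matters (a minimal sketch, not from
the patch): memcpy is only valid for non-overlapping ranges, which is
guaranteed here because the destination is freshly allocated, and GCC
can expand it inline more readily than memmove:

  #include <cstring>

  void
  copy_elems (int *dst, const int *src, unsigned long n)
  {
    /* Valid only because dst is a freshly allocated block that cannot
       overlap src; memmove would also handle overlap but is harder
       for GCC to inline.  */
    std::memcpy (dst, src, n * sizeof (int));
  }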

I also filed PR112653.  I think for the default allocator we should be
able to work out from PTA that the memmove can be a memcpy.

Honestly I am not quite sure if I need to have the first
__relocate_copy_a_1 template.  It handles the case where we can't use
memmove, but with my limited C++ skills I don't see how to get rid of
it or make it a wrapper for __relocate_a_1, which is identical.

Regtested on x86_64-linux.

libstdc++-v3/ChangeLog:

* include/bits/stl_uninitialized.h (__relocate_copy_a_1): New
functions.
(__relocate_a_1): Do not check count to be non-zero
before calling memmove.
(__relocate_copy_a): New member function.
* include/bits/stl_vector.h (_S_do_relocate_copy): New member function.
* include/bits/vector.tcc (reserve, _M_realloc_append, 
_M_realloc_insert, _M_default_append):
Use _S_relocate_copy.

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h 
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 1282af3bc43..983fa315e1b 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1104,6 +1104,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__addressof(*__first), __alloc);
   return __cur;
 }
+  template <typename _InputIterator, typename _ForwardIterator, typename _Allocator>
+_GLIBCXX20_CONSTEXPR
+inline _ForwardIterator
+__relocate_copy_a_1(_InputIterator __first, _InputIterator __last,
+   _ForwardIterator __result, _Allocator& __alloc)
+noexcept(noexcept(std::__relocate_object_a(std::addressof(*__result),
+  std::addressof(*__first),
+  __alloc)))
+{
+  typedef typename iterator_traits<_InputIterator>::value_type
+   _ValueType;
+  typedef typename iterator_traits<_ForwardIterator>::value_type
+   _ValueType2;
+  static_assert(std::is_same<_ValueType, _ValueType2>::value,
+ "relocation is only possible for values of the same type");
+  _ForwardIterator __cur = __result;
+  for (; __first != __last; ++__first, (void)++__cur)
+   std::__relocate_object_a(std::__addressof(*__cur),
+std::__addressof(*__first), __alloc);
+  return __cur;
+}
 
 #if _GLIBCXX_HOSTED
  template <typename _Tp, typename _Up>
@@ -1114,20 +1136,46 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
-  if (__count > 0)
-   {
 #ifdef __cpp_lib_is_constant_evaluated
- if (std::is_constant_evaluated())
+  if (std::is_constant_evaluated())
+   {
+ // Can't use memmove. Wrap the pointer so that __relocate_a_1
+ // resolves to the non-trivial overload above.
+ if (__count > 0)
{
- // Can't use memmove. Wrap the pointer so that __relocate_a_1
- // resolves to the non-trivial overload above.
  __gnu_cxx::__normal_iterator<_Tp*, void> __out(__result);
  __out = std::__relocate_a_1(__first, __last, __out, __alloc);
  return __out.base();
}
+ return __result;
+   }
 #endif
- __builtin_memmove(__result, __first, __count * sizeof(_Tp));
+  __builtin_memmove(__result, __first, __count * sizeof(_Tp));
+  return __result + __count;
+}
+  template <typename _Tp, typename _Up>
+_GLIBCXX20_CONSTEXPR
+    inline __enable_if_t<std::__is_bitwise_relocatable<_Tp>::value, _Tp*>
+__relocate_copy_a_1(_Tp* __first, _Tp* __last,
+   _Tp* __result,
+   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
+{
+  ptrdiff_t __count = __last - __first;
+#ifdef __cpp_lib_is_constant_evaluated
+  if (std::is_constant_evaluated())
+   {
+ // Can't use memcpy. Wrap the pointer so that __relocate_copy_a_1
+ // resolves to the non-trivial overload above.
+ if (__count > 0)
+   {
+ __gnu_cxx::__normal_iterator<_Tp*, void> __out(__result);
+ __out = std::__relocate_a_1(__first, __last, __out, __alloc);
+ return __out.base();
+   }
+ return __result;
}
+#endif
+  __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
   return __result + __count;
 }
 #endif
@@ -1146,6 +1194,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__last),
 std::__niter_base(__result), __alloc);
 }
+  template <typename _InputIterator, typename _ForwardIterator, typename _Allocator>
+_GLIBCXX20_CONSTEXPR
+

Re: Propagate value ranges of return values

2023-11-21 Thread Jan Hubicka
> After this patch in addition to the problem already reported about
> vlda1.c and return-value-range-1.c, we have noticed these regressions
> on aarch64:
> Running gcc:gcc.target/aarch64/aarch64.exp ...
> FAIL: gcc.target/aarch64/movk.c scan-assembler movk\tx[0-9]+, 0x4667, lsl 16
> FAIL: gcc.target/aarch64/movk.c scan-assembler movk\tx[0-9]+, 0x7a3d, lsl 32
> 
> Running gcc:gcc.target/aarch64/simd/simd.exp ...
> FAIL: gcc.target/aarch64/simd/vmulxd_f64_2.c scan-assembler-times
> fmul[ \t]+[dD][0-9]+, ?[dD][0-9]+, ?[dD][0-9]+\n 1
> FAIL: gcc.target/aarch64/simd/vmulxd_f64_2.c scan-assembler-times
> fmulx[ \t]+[dD][0-9]+, ?[dD][0-9]+, ?[dD][0-9]+\n 4
> FAIL: gcc.target/aarch64/simd/vmulxs_f32_2.c scan-assembler-times
> fmul[ \t]+[sS][0-9]+, ?[sS][0-9]+, ?[sS][0-9]+\n 1
> FAIL: gcc.target/aarch64/simd/vmulxs_f32_2.c scan-assembler-times
> fmulx[ \t]+[sS][0-9]+, ?[sS][0-9]+, ?[sS][0-9]+\n 4

Sorry for that - I guess we will see some on various targets.
This is quite a common issue - the testcase has a
dummy_number_generator function returning a constant and prevents
inlining to avoid the constant being visible to the compiler.  This no
longer works, since we now get it from the return value range.  The
change below should fix it.

return-value-range-1.c should be fixed now, and I do not have vlda1.c
in my tree.  I will check.

diff --git a/gcc/testsuite/gcc.target/aarch64/movk.c 
b/gcc/testsuite/gcc.target/aarch64/movk.c
index e6e4e3a8961..6b1f3f8ecf5 100644
--- a/gcc/testsuite/gcc.target/aarch64/movk.c
+++ b/gcc/testsuite/gcc.target/aarch64/movk.c
@@ -1,8 +1,9 @@
 /* { dg-do run } */
-/* { dg-options "-O2 --save-temps -fno-inline" } */
+/* { dg-options "-O2 --save-temps" } */
 
 extern void abort (void);
 
+__attribute__ ((noipa))
 long long int
 dummy_number_generator ()
 {

> 
> We have already sent you a notification for the regression on arm, but
> it includes on vla-1.c and return-value-range-1.c.
> The notification email contains a pointer to the page where we record
> all the configurations that regress because of this patch:
> 
> https://linaro.atlassian.net/browse/GNU-1025
> 
> Can you have a look?
> 
> Thanks,
> 
> Christophe
> 
> 
> 
> 
> > diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
> > index e41e5ad3ae7..71dacf23ce1 100644
> > --- a/gcc/cgraph.cc
> > +++ b/gcc/cgraph.cc
> > @@ -2629,6 +2629,54 @@ cgraph_node::set_malloc_flag (bool malloc_p)
> >return changed;
> >  }
> >
> > +/* Worker to add attribute ATTR to NODE and its aliases.  */
> > +static void
> > +add_detected_attribute_1 (cgraph_node *node, const char *attr, bool 
> > *changed)
> > +{
> > +  if (!lookup_attribute (attr, DECL_ATTRIBUTES (node->decl)))
> > +{
> > +  DECL_ATTRIBUTES (node->decl) = tree_cons (get_identifier (attr),
> > +NULL_TREE, DECL_ATTRIBUTES 
> > (node->decl));
> > +  *changed = true;
> > +}
> > +
> > +  ipa_ref *ref;
> > +  FOR_EACH_ALIAS (node, ref)
> > +{
> > +  cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
> > +  if (alias->get_availability () > AVAIL_INTERPOSABLE)
> > +   add_detected_attribute_1 (alias, attr, changed);
> > +}
> > +
> > +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
> > +if (e->caller->thunk
> > +   && (e->caller->get_availability () > AVAIL_INTERPOSABLE))
> > +  add_detected_attribute_1 (e->caller, attr, changed);
> > +}
> > +
> > +/* Add attribute ATTR to NODE's decl and to NODE's aliases if any.  */
> > +
> > +bool
> > +cgraph_node::add_detected_attribute (const char *attr)
> > +{
> > +  bool changed = false;
> > +
> > +  if (get_availability () > AVAIL_INTERPOSABLE)
> > +add_detected_attribute_1 (this, attr, &changed);
> > +  else
> > +{
> > +  ipa_ref *ref;
> > +
> > +  FOR_EACH_ALIAS (this, ref)
> > +   {
> > + cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
> > + if (alias->get_availability () > AVAIL_INTERPOSABLE)
> > +   add_detected_attribute_1 (alias, attr, &changed);
> > +   }
> > +}
> > +  return changed;
> > +}
> > +
> >  /* Worker to set noreturn flag.  */
> >  static void
> >  set_noreturn_flag_1 (cgraph_node *node, bool noreturn_p, bool *changed)
> > diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > index cedaaac3a45..cfdd9f693a8 100644
> > --- a/gcc/cgraph.h
> > +++ b/gcc/cgraph.h
> > @@ -1190,6 +1190,10 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
> > public symtab_node
> >
> >bool set_pure_flag (bool pure, bool looping);
> >
> > +  /* Add attribute ATTR to cgraph_node's decl and on aliases of the node
> > + if any.  */
> > +  bool add_detected_attribute (const char *attr);
> > +
> >/* Call callback on function and aliases associated to the function.
> >   When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks 
> > are
> >   skipped. */
> > diff --git a/gcc/common.opt b/gcc/common.opt
> > index d21db5d4a20..c6599c7147b 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -781,6 +781,10 @@ Wsuggest-attribute=malloc
> >  Common 

Re: libstdc++: Speed up push_back

2023-11-21 Thread Jan Hubicka
> > +   // RAII type to destroy initialized elements.
> 
> There's only one initialized element, not "elements".
> 
> > +   struct _Guard_elts
> > +   {
> > + pointer _M_first, _M_last;  // Elements to destroy
> 
> We only need to store one pointer here, call it _M_p.
> 
> > + _Tp_alloc_type& _M_alloc;
> > +
> > + _GLIBCXX20_CONSTEXPR
> > + _Guard_elts(pointer __elt, _Tp_alloc_type& __a)
> > + : _M_first(__elt), _M_last(__elt + 1), _M_alloc(__a)
> > + { }
> > +
> > + _GLIBCXX20_CONSTEXPR
> > + ~_Guard_elts()
> > + { std::_Destroy(_M_first, _M_last, _M_alloc); }
> 
> This should be either:
> 
>   std::_Destroy(_M_p, _M_p+1, _M_alloc);
> 
> or avoid the loop that happens in that _Destroy function:
> 
>   _Alloc_traits::destroy(_M_alloc, _M_p);
> 
> > +
> > +   private:
> > + _Guard_elts(const _Guard_elts&);
> > +   };
> > +
> > +   // Guard the new element so it will be destroyed if anything 
> > throws.
> > +   _Guard_elts __guard_elts(__new_start + __elems, _M_impl);
> > +
> > +   __new_finish = std::__uninitialized_move_if_noexcept_a(
> > +__old_start, __old_finish,
> > +__new_start, _M_get_Tp_allocator());
> > +
> > +   ++__new_finish;
> > +   // Guard everything before the new element too.
> > +   __guard_elts._M_first = __new_start;
> 
> This seems redundant, we're not doing any more insertions now, and so
> this store is dead.

I am attaching a patch with this check removed.  However, I still
think _Guard_elts needs to stay, to be able to destroy the old elements
> 
> > +
> > +   // New storage has been fully initialized, destroy the old 
> > elements.
> > +   __guard_elts._M_first = __old_start;
> > +   __guard_elts._M_last = __old_finish;
... here

Does it look better?  (I am not really that confident with libstdc++.)

I was also wondering if we have some data on which libstdc++ functions
are performance critical in practice.  Given that push_back can be sped
up very noticeably, I wonder if we have problems elsewhere too.

Thank you,
Honza

diff --git a/libstdc++-v3/include/bits/stl_vector.h 
b/libstdc++-v3/include/bits/stl_vector.h
index 5e18f6eedce..973f4d7e2e9 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1288,7 +1288,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
_GLIBCXX_ASAN_ANNOTATE_GREW(1);
  }
else
- _M_realloc_insert(end(), __x);
+ _M_realloc_append(__x);
   }
 
 #if __cplusplus >= 201103L
@@ -1822,6 +1822,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 
   void
   _M_realloc_insert(iterator __position, const value_type& __x);
+
+  void
+  _M_realloc_append(const value_type& __x);
 #else
   // A value_type object constructed with _Alloc_traits::construct()
   // and destroyed with _Alloc_traits::destroy().
@@ -1871,6 +1874,11 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
void
_M_realloc_insert(iterator __position, _Args&&... __args);
 
+      template<typename... _Args>
+   _GLIBCXX20_CONSTEXPR
+   void
+   _M_realloc_append(_Args&&... __args);
+
   // Either move-construct at the end, or forward to _M_insert_aux.
   _GLIBCXX20_CONSTEXPR
   iterator
diff --git a/libstdc++-v3/include/bits/vector.tcc 
b/libstdc++-v3/include/bits/vector.tcc
index 80631d1e2a1..0ccef7911b3 100644
--- a/libstdc++-v3/include/bits/vector.tcc
+++ b/libstdc++-v3/include/bits/vector.tcc
@@ -120,7 +120,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
_GLIBCXX_ASAN_ANNOTATE_GREW(1);
  }
else
- _M_realloc_insert(end(), std::forward<_Args>(__args)...);
+ _M_realloc_append(std::forward<_Args>(__args)...);
 #if __cplusplus > 201402L
return back();
 #endif
@@ -459,6 +459,8 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 #endif
 {
   const size_type __len = _M_check_len(1u, "vector::_M_realloc_insert");
+  if (__len <= 0)
+   __builtin_unreachable ();
   pointer __old_start = this->_M_impl._M_start;
   pointer __old_finish = this->_M_impl._M_finish;
   const size_type __elems_before = __position - begin();
@@ -571,6 +573,127 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   this->_M_impl._M_end_of_storage = __new_start + __len;
 }
 
+#if __cplusplus >= 201103L
+  template<typename _Tp, typename _Alloc>
+    template<typename... _Args>
+  _GLIBCXX20_CONSTEXPR
+  void
+  vector<_Tp, _Alloc>::
+  _M_realloc_append(_Args&&... __args)
+#else
+  template<typename _Tp, typename _Alloc>
+void
+vector<_Tp, _Alloc>::
+_M_realloc_append(const _Tp& __x)
+#endif
+{
+  const size_type __len = _M_check_len(1u, "vector::_M_realloc_append");
+  if (__len <= 0)
+   __builtin_unreachable ();
+  pointer __old_start = this->_M_impl._M_start;
+  pointer __old_finish = this->_M_impl._M_finish;
+  const 

Re: Fix 'gcc.dg/tree-ssa/return-value-range-1.c' (was: Propagate value ranges of return values)

2023-11-21 Thread Jan Hubicka
> Hi!
> 
> On 2023-11-19T16:05:42+0100, Jan Hubicka  wrote:
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do ling } */
> 
> ERROR: gcc.dg/tree-ssa/return-value-range-1.c: 1: syntax error for " 
> dg-do 1 ling "
> 
> With that fixed into 'dg-do link', and...
> 
> > +/* { dg-options "-O1 -dump-tree-evrp-details" } */
> 
> ... that one fixed into '-fdump-tree-evrp-details', I then get:
> 
> FAIL: gcc.dg/tree-ssa/return-value-range-1.c (test for excess errors)
> UNRESOLVED: gcc.dg/tree-ssa/return-value-range-1.c scan-tree-dump-times 
> evrp "Recording return range" 2
> 
> /tmp/ccTEuffl.o: In function `test':
> return-value-range-1.c:(.text+0x24): undefined reference to `link_error'
> 
> This disappears when switching from '-O1' to '-O2'.  OK to push the
> attached "Fix 'gcc.dg/tree-ssa/return-value-range-1.c'"?  (..., or did
> you intend something else, here?)

Ah, sorry for that - I looked for FAIL and missed the error.  Yes, the
change is OK.  Indeed -fipa-vrp is enabled only at -O2. (I think basic
non-dataflow VRP could be doable and effective even at -O1, but we don't
do that)
Honza
> 
> 
> Grüße
>  Thomas
> 
> 
> > +__attribute__ ((__noinline__))
> > +int a(char c)
> > +{
> > + return c;
> > +}
> > +void link_error ();
> > +
> > +void
> > +test(int d)
> > +{
> > + if (a(d) > 200)
> > + link_error ();
> > +}
> > +int
> > +main(int argc, char **argv)
> > +{
> > + test(argc);
> > + return 0;
> > +}
> > +/* { dg-final { scan-tree-dump-times "Recording return range" 2 "evrp"} } 
> > */
> 
> 

> From f3a47339a9df9726da7e3c1daeadc216e1d5b365 Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge 
> Date: Tue, 21 Nov 2023 11:51:42 +0100
> Subject: [PATCH] Fix 'gcc.dg/tree-ssa/return-value-range-1.c'
> 
> ... added in recent commit 53ba8d669550d3a1f809048428b97ca607f95cf5
> "inter-procedural value range propagation".
> 
>   gcc/testsuite/
>   * gcc.dg/tree-ssa/return-value-range-1.c: Fix.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c
> index 4db52233c5d..74f1a5080bb 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/return-value-range-1.c
> @@ -1,5 +1,5 @@
> -/* { dg-do ling } */
> -/* { dg-options "-O1 -dump-tree-evrp-details" } */
> +/* { dg-do link } */
> +/* { dg-options "-O2 -fdump-tree-evrp-details" } */
>  __attribute__ ((__noinline__))
>  int a(char c)
>  {
> -- 
> 2.34.1
> 



Re: libstdc++: Speed up push_back

2023-11-20 Thread Jan Hubicka
> > +   // RAII type to destroy initialized elements.
> 
> There's only one initialized element, not "elements".
> 
> > +   struct _Guard_elts
> > +   {
> > + pointer _M_first, _M_last;  // Elements to destroy
> 
> We only need to store one pointer here, call it _M_p.
> 
> > + _Tp_alloc_type& _M_alloc;
> > +
> > + _GLIBCXX20_CONSTEXPR
> > + _Guard_elts(pointer __elt, _Tp_alloc_type& __a)
> > + : _M_first(__elt), _M_last(__elt + 1), _M_alloc(__a)
> > + { }
> > +
> > + _GLIBCXX20_CONSTEXPR
> > + ~_Guard_elts()
> > + { std::_Destroy(_M_first, _M_last, _M_alloc); }
> 
> This should be either:
> 
>   std::_Destroy(_M_p, _M_p+1, _M_alloc);
> 
> or avoid the loop that happens in that _Destroy function:
> 
>   _Alloc_traits::destroy(_M_alloc, _M_p);
> 
> > +
> > +   private:
> > + _Guard_elts(const _Guard_elts&);
> > +   };
> > +
> > +   // Guard the new element so it will be destroyed if anything 
> > throws.
> > +   _Guard_elts __guard_elts(__new_start + __elems, _M_impl);
> > +
> > +   __new_finish = std::__uninitialized_move_if_noexcept_a(
> > +__old_start, __old_finish,
> > +__new_start, _M_get_Tp_allocator());
> > +
> > +   ++__new_finish;
> > +   // Guard everything before the new element too.
> > +   __guard_elts._M_first = __new_start;
> 
> This seems redundant, we're not doing any more insertions now, and so
> this store is dead.

I removed this one.
> 
> > +
> > +   // New storage has been fully initialized, destroy the old 
> > elements.
> > +   __guard_elts._M_first = __old_start;
> > +   __guard_elts._M_last = __old_finish;

However, here I think I need __guard_elts to support destruction of
many elements, in case the vector has moved to a new location.  So I do
not quite see how to simplify the code as suggested above, except that
the constructor can be simplified to not require the first and last
arguments, since we always initialize it for one destruction but may
update it later.

Thanks,
Honza


Re: Propagate value ranges of return values

2023-11-20 Thread Jan Hubicka
> 
> On 11/18/23 20:21, Jan Hubicka wrote:
> > Hi,
> > this patch implements very basic propagation of return value ranges from VRP
> > pass.  This helps std::vector's push_back since we work out value range of
> > allocated block.  This propagates only within single translation unit.  I 
> > hoped
> > we will also do the propagation at WPA stage, but that needs more work on
> > ipa-cp side.
> > 
> > I also added code auto-detecting return_nonnull and corresponding 
> > -Wsuggest-attribute
> > 
> > Variant of this patch bootstrapped/regtested x86_64-linux, testing with
> > this version is running.  I plan to commit the patch at Monday provided
> > there are no issues.
> 
> I see no obvious issues with the ranger/vrp changes...
> 
> My only comment is that execute_ranger_vrp is called 3 times... EVRP,  VRP1
> and VRP2.. perhaps more someday.  As long as that is OK with the call to
> warn_function_returns_nonnull().
That should be OK.  Adding the nonnull attribute repeatedly is
harmless, and local_pure_const behaves the same way.

Honza
> 
> Andrew
> 
> 


libstdc++: Speed up push_back

2023-11-19 Thread Jan Hubicka
Hi,
this patch speeds up push_back at -O3 significantly by making the
reallocation inlined by default.  _M_realloc_insert is a general
insertion routine that takes an iterator pointing to the location where
the value should be inserted.  As such it contains quite a lot of code
to move other entries around.

Since appending at the end of the array is a common operation, I think
we should have specialized code for that.  Sadly it is really hard to
work this out from IPA passes, since we basically care whether the
iterator points to the same place as the end pointer, and both are
passed by reference.  This is inter-procedural value numbering that is
quite out of reach.

I also added an extra check making it clear that the new length of the
vector is non-zero.  This saves extra conditionals.  Again it is quite
a hard case, since _M_check_len seems to be able to return 0 if its
parameter is 0.  This never happens here, but we are not able to
propagate this early, nor at the IPA stage.
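
The idiom used is (a minimal sketch; grow is a hypothetical stand-in
for _M_check_len):

  unsigned long
  grow (unsigned long len)
  {
    unsigned long n = len ? 2 * len : 1;
    if (n == 0)
      __builtin_unreachable ();  /* VRP now knows n is in [1, +INF] */
    return n;
  }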

Would it be OK to duplicate code like this?  The resulting code is
still not quite optimal.

Regtested on x86_64-linux, OK?

Honza

void std::vector<pair_t>::_M_realloc_append (struct vector<pair_t> * const 
this, const struct pair_t & __args#0)
{
  struct _Guard __guard;
  struct pair_t * __new_finish;
  struct pair_t * __old_finish;
  struct pair_t * __old_start;
  struct _Vector_impl * _1;
  long unsigned int _2;
  struct pair_t * _3;
  struct pair_t * _4;
  long int _5;
  long int _6;
  long unsigned int _7;
  long unsigned int _8;
  struct pair_t * _9;
  const size_type _13;
  struct pair_t * _16;
  struct _Vector_impl * _18;
  long int _27;
  long unsigned int _34;

  <bb 2> [local count: 1073741824]:
  _13 = std::vector<pair_t>::_M_check_len (this_11(D), 1, 
"vector::_M_realloc_append");
  if (_13 == 0)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 3> [count: 0]:
  __builtin_unreachable ();

  <bb 4> [local count: 1073741824]:
  __old_start_14 = this_11(D)->D.26060._M_impl.D.25361._M_start;
  __old_finish_15 = this_11(D)->D.26060._M_impl.D.25361._M_finish;
  _27 = __old_finish_15 - __old_start_14;
  _18 = &MEM[(struct _Vector_base *)this_11(D)]._M_impl;
  _16 = std::__new_allocator<pair_t>::allocate (_18, _13, 0B);
  _1 = &this_11(D)->D.26060._M_impl;
  __guard ={v} {CLOBBER};
  __guard._M_alloc = _1;
  _2 = (long unsigned int) _27;
  _3 = _16 + _2;
  *_3 = *__args#0_17(D);
  if (_27 > 0)
    goto <bb 5>; [41.48%]
  else
    goto <bb 6>; [58.52%]

  <bb 5> [local count: 445388112]:
  __builtin_memmove (_16, __old_start_14, _2);

  <bb 6> [local count: 1073741824]:
  _34 = _2 + 8;
  __new_finish_19 = _16 + _34;
  __guard._M_storage = __old_start_14;
  _4 = this_11(D)->D.26060._M_impl.D.25361._M_end_of_storage;
  _5 = _4 - __old_start_14;
  _6 = _5 /[ex] 8;
  _7 = (long unsigned int) _6;
  __guard._M_len = _7;
  std::vector<pair_t>::_M_realloc_append(const 
pair_t&)::_Guard::~_Guard (&__guard);
  __guard ={v} {CLOBBER(eol)};
  this_11(D)->D.26060._M_impl.D.25361._M_start = _16;
  this_11(D)->D.26060._M_impl.D.25361._M_finish = __new_finish_19;
  _8 = _13 * 8;
  _9 = _16 + _8;
  this_11(D)->D.26060._M_impl.D.25361._M_end_of_storage = _9;
  return;

}

Notice that the memmove could be a memcpy, and the test whether the
block size is non-zero is useless.


libstdc++-v3/ChangeLog:

PR libstdc++/110287
* include/bits/stl_vector.h (_M_realloc_append): New member function.
(push_back): Use it.
* include/bits/vector.tcc: (emplace_back): Use it.
(_M_realloc_insert): Let compiler know that new vector size is non-zero.
(_M_realloc_append): New member function.

diff --git a/libstdc++-v3/include/bits/stl_vector.h 
b/libstdc++-v3/include/bits/stl_vector.h
index 5e18f6eedce..973f4d7e2e9 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1288,7 +1288,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
_GLIBCXX_ASAN_ANNOTATE_GREW(1);
  }
else
- _M_realloc_insert(end(), __x);
+ _M_realloc_append(__x);
   }
 
 #if __cplusplus >= 201103L
@@ -1822,6 +1822,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 
   void
   _M_realloc_insert(iterator __position, const value_type& __x);
+
+  void
+  _M_realloc_append(const value_type& __x);
 #else
   // A value_type object constructed with _Alloc_traits::construct()
   // and destroyed with _Alloc_traits::destroy().
@@ -1871,6 +1874,11 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
void
_M_realloc_insert(iterator __position, _Args&&... __args);
 
+      template<typename... _Args>
+   _GLIBCXX20_CONSTEXPR
+   void
+   _M_realloc_append(_Args&&... __args);
+
   // Either move-construct at the end, or forward to _M_insert_aux.
   _GLIBCXX20_CONSTEXPR
   iterator
diff --git a/libstdc++-v3/include/bits/vector.tcc 
b/libstdc++-v3/include/bits/vector.tcc
index 80631d1e2a1..1306676e795 100644
--- a/libstdc++-v3/include/bits/vector.tcc
+++ b/libstdc++-v3/include/bits/vector.tcc
@@ -120,7 +120,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER

Re: Propagate value ranges of return values

2023-11-19 Thread Jan Hubicka
Hi,
this is an updated version which also adds the testsuite compensation
I lost earlier while maintaining the patch in my testing tree.
There are quite a few testcases that use constant return values to hide
something from the optimizer.

Bootstrapped/regtested x86_64-linux.
gcc/ChangeLog:

* cgraph.cc (add_detected_attribute_1): New function.
(cgraph_node::add_detected_attribute): Likewise.
* cgraph.h (cgraph_node::add_detected_attribute): Declare.
* common.opt: Add -Wsuggest-attribute=returns_nonnull.
* doc/invoke.texi: Document new flag.
* gimple-range-fold.cc (fold_using_range::range_of_call):
Use known return value ranges.
* ipa-prop.cc (struct ipa_return_value_summary): New type.
(class ipa_return_value_sum_t): New type.
(ipa_return_value_sum): New summary.
(ipa_record_return_value_range): New function.
(ipa_return_value_range): New function.
* ipa-prop.h (ipa_return_value_range): Declare.
(ipa_record_return_value_range): Declare.
* ipa-pure-const.cc (warn_function_returns_nonnull): New function.
* ipa-utils.h (warn_function_returns_nonnull): Declare.
* symbol-summary.h: Fix comment.
* tree-vrp.cc (execute_ranger_vrp): Record return values.

gcc/testsuite/ChangeLog:

* g++.dg/ipa/devirt-2.C: Add noipa attribute to prevent ipa-vrp.
* g++.dg/ipa/devirt-7.C: Disable ipa-vrp.
* g++.dg/ipa/ipa-icf-2.C: Disable ipa-vrp.
* g++.dg/ipa/ipa-icf-3.C: Disable ipa-vrp.
* g++.dg/ipa/ivinline-1.C: Disable ipa-vrp.
* g++.dg/ipa/ivinline-3.C: Disable ipa-vrp.
* g++.dg/ipa/ivinline-5.C: Disable ipa-vrp.
* g++.dg/ipa/ivinline-8.C: Disable ipa-vrp.
* g++.dg/ipa/nothrow-1.C: Disable ipa-vrp.
* g++.dg/ipa/pure-const-1.C: Disable ipa-vrp.
* g++.dg/ipa/pure-const-2.C: Disable ipa-vrp.
* g++.dg/lto/inline-crossmodule-1_0.C: Disable ipa-vrp.
* gcc.c-torture/compile/pr106433.c: Add noipa attribute to prevent 
ipa-vrp.
* gcc.c-torture/execute/frame-address.c: Likewise.
* gcc.dg/ipa/fopt-info-inline-1.c: Disable ipa-vrp.
* gcc.dg/ipa/ipa-icf-25.c: Disable ipa-vrp.
* gcc.dg/ipa/ipa-icf-38.c: Disable ipa-vrp.
* gcc.dg/ipa/pure-const-1.c: Disable ipa-vrp.
* gcc.dg/ipa/remref-0.c: Add noipa attribute to prevent ipa-vrp.
* gcc.dg/tree-prof/time-profiler-1.c: Disable ipa-vrp.
* gcc.dg/tree-prof/time-profiler-2.c: Disable ipa-vrp.
* gcc.dg/tree-ssa/pr110269.c: Disable ipa-vrp.
* gcc.dg/tree-ssa/pr20701.c: Disable ipa-vrp.
* gcc.dg/tree-ssa/vrp05.c: Disable ipa-vrp.
* gcc.dg/tree-ssa/return-value-range-1.c: New test.

diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
index e41e5ad3ae7..71dacf23ce1 100644
--- a/gcc/cgraph.cc
+++ b/gcc/cgraph.cc
@@ -2629,6 +2629,54 @@ cgraph_node::set_malloc_flag (bool malloc_p)
   return changed;
 }
 
+/* Worker to add attribute ATTR to NODE and its aliases.  */
+static void
+add_detected_attribute_1 (cgraph_node *node, const char *attr, bool *changed)
+{
+  if (!lookup_attribute (attr, DECL_ATTRIBUTES (node->decl)))
+{
+  DECL_ATTRIBUTES (node->decl) = tree_cons (get_identifier (attr),
+NULL_TREE, DECL_ATTRIBUTES 
(node->decl));
+  *changed = true;
+}
+
+  ipa_ref *ref;
+  FOR_EACH_ALIAS (node, ref)
+{
+  cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+  if (alias->get_availability () > AVAIL_INTERPOSABLE)
+   add_detected_attribute_1 (alias, attr, changed);
+}
+
+  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
+if (e->caller->thunk
+   && (e->caller->get_availability () > AVAIL_INTERPOSABLE))
+  add_detected_attribute_1 (e->caller, attr, changed);
+}
+
+/* Add attribute ATTR to NODE's decl and to NODE's aliases if any.  */
+
+bool
+cgraph_node::add_detected_attribute (const char *attr)
+{
+  bool changed = false;
+
+  if (get_availability () > AVAIL_INTERPOSABLE)
+add_detected_attribute_1 (this, attr, &changed);
+  else
+{
+  ipa_ref *ref;
+
+  FOR_EACH_ALIAS (this, ref)
+   {
+ cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+ if (alias->get_availability () > AVAIL_INTERPOSABLE)
+   add_detected_attribute_1 (alias, attr, &changed);
+   }
+}
+  return changed;
+}
+
 /* Worker to set noreturn flag.  */
 static void
 set_noreturn_flag_1 (cgraph_node *node, bool noreturn_p, bool *changed)
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index cedaaac3a45..cfdd9f693a8 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1190,6 +1190,10 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
public symtab_node
 
   bool set_pure_flag (bool pure, bool looping);
 
+  /* Add attribute ATTR to cgraph_node's decl and on aliases of the node
+ if any.  */
+  bool add_detected_attribute (const char *attr);
+
   /* Call callback on function and aliases associated to the function.
  When 

Re: Propagate value ranges of return values

2023-11-19 Thread Jan Hubicka
> > +Wsuggest-attribute=returns_nonnull
> 
> - or _?
> 
> (If changing it, needs adjustment in rest of patch too.)
I was thinking about this and I am not sure which is better.
Sure, _ in a command-line option looks odd, but this is an identifier,
and it is returns_nonnull, not returns-nonnull.

I am not sure we have had a situation like this before.  I can live
with both variants and would be happy to hear opinions on this.
> 
> > +Common Var(warn_suggest_attribute_malloc) Warning
> > +Warn about functions which might be candidates for __attribute__((malloc)).
> > +
> 
> Typo: s/malloc/nonnull/?
Thanks!
Honza


Propagate value ranges of return values

2023-11-18 Thread Jan Hubicka
Hi,
this patch implements very basic propagation of return value ranges
from the VRP pass.  This helps std::vector's push_back, since we work
out the value range of the allocated block.  This propagates only
within a single translation unit.  I hoped we would also do the
propagation at WPA stage, but that needs more work on the ipa-cp side.

I also added code auto-detecting returns_nonnull and a corresponding
-Wsuggest-attribute warning.
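
For example, a function like this (a minimal sketch, not from the
testsuite) is detected as never returning NULL:

  void *
  xmalloc (unsigned long size)
  {
    void *p = __builtin_malloc (size);
    if (!p)
      __builtin_abort ();
    return p;  /* VRP records that the return range excludes 0 */
  }

and -Wsuggest-attribute=returns_nonnull then suggests annotating the
declaration.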

A variant of this patch was bootstrapped/regtested on x86_64-linux;
testing with this version is running.  I plan to commit the patch on
Monday provided there are no issues.

gcc/ChangeLog:

* cgraph.cc (add_detected_attribute_1): New function.
(cgraph_node::add_detected_attribute): New member function.
* cgraph.h (struct cgraph_node): Declare it.
* common.opt: Add Wsuggest-attribute=returns_nonnull.
* doc/invoke.texi: Document +Wsuggest-attribute=returns_nonnull.
* gimple-range-fold.cc: Include ipa-prop and dependencies.
(fold_using_range::range_of_call): Look for return value range.
* ipa-prop.cc (struct ipa_return_value_summary): New structure.
(class ipa_return_value_sum_t): New summary.
(ipa_record_return_value_range): New function.
(ipa_return_value_range): New function.
* ipa-prop.h (ipa_return_value_range): Declare.
(ipa_record_return_value_range): Declare.
* ipa-pure-const.cc (warn_function_returns_nonnull): New function.
* ipa-utils.h (warn_function_returns_nonnull): Declare.
* symbol-summary.h: Fix comment typo.
* tree-vrp.cc (execute_ranger_vrp): Record return values.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/return-value-range-1.c: New test.

diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
index e41e5ad3ae7..71dacf23ce1 100644
--- a/gcc/cgraph.cc
+++ b/gcc/cgraph.cc
@@ -2629,6 +2629,54 @@ cgraph_node::set_malloc_flag (bool malloc_p)
   return changed;
 }
 
+/* Worker to add attribute ATTR to NODE and its aliases.  */
+static void
+add_detected_attribute_1 (cgraph_node *node, const char *attr, bool *changed)
+{
+  if (!lookup_attribute (attr, DECL_ATTRIBUTES (node->decl)))
+{
+  DECL_ATTRIBUTES (node->decl) = tree_cons (get_identifier (attr),
+NULL_TREE, DECL_ATTRIBUTES 
(node->decl));
+  *changed = true;
+}
+
+  ipa_ref *ref;
+  FOR_EACH_ALIAS (node, ref)
+{
+  cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+  if (alias->get_availability () > AVAIL_INTERPOSABLE)
+   add_detected_attribute_1 (alias, attr, changed);
+}
+
+  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
+if (e->caller->thunk
+   && (e->caller->get_availability () > AVAIL_INTERPOSABLE))
+  add_detected_attribute_1 (e->caller, attr, changed);
+}
+
+/* Add attribute ATTR to NODE's decl and to NODE's aliases if any.  */
+
+bool
+cgraph_node::add_detected_attribute (const char *attr)
+{
+  bool changed = false;
+
+  if (get_availability () > AVAIL_INTERPOSABLE)
+add_detected_attribute_1 (this, attr, &changed);
+  else
+{
+  ipa_ref *ref;
+
+  FOR_EACH_ALIAS (this, ref)
+   {
+ cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+ if (alias->get_availability () > AVAIL_INTERPOSABLE)
+   add_detected_attribute_1 (alias, attr, &changed);
+   }
+}
+  return changed;
+}
+
 /* Worker to set noreturn flag.  */
 static void
 set_noreturn_flag_1 (cgraph_node *node, bool noreturn_p, bool *changed)
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index cedaaac3a45..cfdd9f693a8 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1190,6 +1190,10 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
public symtab_node
 
   bool set_pure_flag (bool pure, bool looping);
 
+  /* Add attribute ATTR to cgraph_node's decl and on aliases of the node
+ if any.  */
+  bool add_detected_attribute (const char *attr);
+
   /* Call callback on function and aliases associated to the function.
  When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
  skipped. */
diff --git a/gcc/common.opt b/gcc/common.opt
index d21db5d4a20..0be4f02677c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -781,6 +781,10 @@ Wsuggest-attribute=malloc
 Common Var(warn_suggest_attribute_malloc) Warning
 Warn about functions which might be candidates for __attribute__((malloc)).
 
+Wsuggest-attribute=returns_nonnull
+Common Var(warn_suggest_attribute_malloc) Warning
+Warn about functions which might be candidates for __attribute__((malloc)).
+
 Wsuggest-final-types
 Common Var(warn_suggest_final_types) Warning
 Warn about C++ polymorphic types where adding final keyword would improve code 
quality.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 557d613a1e6..b9e98843613 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8092,7 +8092,7 @@ if the array is referenced as a flexible array member.
 
 @opindex Wsuggest-attribute=
 @opindex Wno-suggest-attribute=
-@item 

Re: [PATCH][WIP] dwarf2out: extend to output debug section directly to object file during debug_early phase

2023-10-23 Thread Jan Hubicka
> > > +  output_data_to_object_file (1, 0);
> > > +  output_data_to_object_file (1, 0);
> >
> > So this basically renames dw2_asm_output_data to
> > output_data_to_object_file and similarly for other output functions.
> >
> > What would be the main problems of making the dw2_asm_* functions do the
> > right thing when outputting to an object file?
> > Either by conditionals or turning them to virtual functions/hooks as
> > Richi suggested?
> >
> I think it's doable via conditionals. Can you explain the second approach
> in more detail?

Basically you want output functions like dw2_asm_output_data to do the
right thing and either store the data into the LTO simple-object
section or write it to the assembly file.
So either we can add conditionals to every dw2_asm_* function needed,
of the form
  if (outputting_to_lto)
 ... new code ...
  else
 ... existing code ...

Or have a virtual table with two different dw2_asm implementations.
Older GCC code uses hooks, which is essentially a structure holding
function pointers, mostly because it was implemented before we
converted the source base to C++.  Some newer code uses virtual
functions for this.
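
Roughly like this (a hypothetical sketch; all names invented):

  #include <cstdio>

  struct dw2_output
  {
    virtual void output_data (int size, unsigned long value) = 0;
    virtual ~dw2_output () {}
  };

  struct asm_dw2_output : dw2_output
  {
    void output_data (int size, unsigned long value) override
    { std::printf ("\t.byte%d\t0x%lx\n", size, value); }  /* assembly path */
  };

  struct object_dw2_output : dw2_output
  {
    void output_data (int, unsigned long) override
    { /* append the bytes to the simple-object section instead */ }
  };

dw2_asm_output_data and friends would then dispatch through whichever
implementation is active.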
> > > +struct lto_simple_object
> > lto_simple_object is declared in lto frontend.  Why do you need to
> > duplicate it here?
> >
> > It looks like adding relocations should be abstracted by the lto API,
> > so you don't need to look inside this structure, which is local to
> > lto/lto-object.cc.
> >
> I should have taken this approach, but instead, I exposed simple objects to
> dwarf2out.
> That's the reason to duplicate the above struct. I will take care of this
> while refactoring
> and abstracting it by lto API

Yep, this should not be hard to do.

Thanks for all the work!
Honza
> 
> 
> >
> > > +/* Output one line number table into the .debug_line section.  */
> > > +
> > > +static void
> > > +output_one_line_info_table (dw_line_info_table *table)
> > It is hard to tell from the diff.  Did you just move these functions
> > earlier in the source file?
> >
> Yeah. I will refactor the dwarf2out soon to clear these confusions.
> 
> -- 
> Rishi
> 
> 
> >
> > Honza
> >

