Re: [PATCH] sra: SRA of non-escaped aggregates passed by reference to calls

2023-11-24 Thread Martin Jambor
; +  gsi_insert_after (refresh_gsi, stmt, GSI_NEW_STMT);
>>  }
>>else
>>  {
>>gassign *stmt;
>>  
>>if (access->grp_partial_lhs)
>> -repl = force_gimple_operand_gsi (gsi, repl, true, NULL_TREE,
>> - true, GSI_SAME_STMT);
>> +repl = force_gimple_operand_gsi (stmt_gsi, repl, true,
>> + NULL_TREE, true,
>> + GSI_SAME_STMT);
>>stmt = gimple_build_assign (ref, repl);
>>gimple_set_location (stmt, loc);
>> -  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
>> +  gsi_insert_before (stmt_gsi, stmt, GSI_SAME_STMT);
>>  }
>>  }
>>else
>> @@ -3899,8 +3992,8 @@ sra_modify_expr (tree *expr, gimple_stmt_iterator 
>> *gsi, bool write)
>>  {
>>gdebug *ds = gimple_build_debug_bind (get_access_replacement (access),
>>  NULL_TREE,
>> -gsi_stmt (*gsi));
>> -  gsi_insert_after (gsi, ds, GSI_NEW_STMT);
>> +gsi_stmt (*stmt_gsi));
>> +  gsi_insert_after (stmt_gsi, ds, GSI_NEW_STMT);
>>  }
>>  
>>if (access->first_child && !TREE_READONLY (access->base))
>> @@ -3918,8 +4011,58 @@ sra_modify_expr (tree *expr, gimple_stmt_iterator 
>> *gsi, bool write)
>>  start_offset = chunk_size = 0;
>>  
>>generate_subtree_copies (access->first_child, orig_expr, 
>> access->offset,
>> -   start_offset, chunk_size, gsi, write, write,
>> -   loc);
>> +   start_offset, chunk_size,
>> +   write ? refresh_gsi : stmt_gsi,
>> +   write, write, loc);
>> +}
>> +  return true;
>> +}
>> +
>> +/* If EXPR, which must be a call argument, is an ADDR_EXPR, generate writes 
>> and
>> +   reads from its base before and after the call statement given in CALL_GSI
>> +   and return true if any copying took place.  Otherwise call 
>> sra_modify_expr
>> +   on EXPR and return its value.  FLAGS is what the gimple_call_arg_flags
>> +   return for the given parameter.  */
>> +
>> +static bool
>> +sra_modify_call_arg (tree *expr, gimple_stmt_iterator *call_gsi,
>> + gimple_stmt_iterator *refresh_gsi, int flags)
>> +{
>> +  if (TREE_CODE (*expr) != ADDR_EXPR)
>> +return sra_modify_expr (expr, false, call_gsi, refresh_gsi);
>> +
>> +  if (flags & EAF_UNUSED)
>> +return false;
>
> So in this case we leave around the address of the DECL but with
> "wrong" content.  That might lead to an odd debug experience, but well.
> For optimization wouldn't it be better to substitute 'nullptr' so
> we can elide the DECL (not allocate stack space for it)?  I think
> EAF_UNUSED means we're not inspecting the pointer value either.

Using EAF_UNUSED is a bit of an after-thought.  I hope that often
IPA-SRA juste removes such parameters but when the called function is
externally visible and yet not interposable, IPA-SRA may not touch it
but IPA-modref could derive its properties and so this could take an
effect.

I'll be happy to do prepare a patch passing nullptr in these cases as a
follow-up.

>
>> +
>> +  tree base = get_base_address (TREE_OPERAND (*expr, 0));
>> +  if (!DECL_P (base))
>> +return false;
>> +  struct access *access = get_access_for_expr (base);
>> +  if (!access)
>> +return false;
>> +
>> +  gimple *stmt = gsi_stmt (*call_gsi);
>> +  location_t loc = gimple_location (stmt);
>> +  generate_subtree_copies (access, base, 0, 0, 0, call_gsi, false, false,
>> +   loc);
>> +
>> +  if ((flags & (EAF_NO_DIRECT_CLOBBER | EAF_NO_INDIRECT_CLOBBER))
>> +  == (EAF_NO_DIRECT_CLOBBER | EAF_NO_INDIRECT_CLOBBER))
>> +    return true;
>
> Why both EAF_NO_DIRECT_CLOBBER and EAF_NO_INDIRECT_CLOBBER?  I think
> EAF_NO_DIRECT_CLOBBER is enough unless you think that you didn't rule
> out the decl to point to itself by the escaping means?  Warrants a
> explaining comment.

That was a mistake, I've changed it to use just EAF_NO_DIRECT_CLOBBER.

>
>> +
>> +  if (!stmt_ends_bb_p (stmt))
>> +generate_subtree_copies (access, base, 0, 0, 0, refresh_gsi, true,
>> + true, loc);
>&

Re: [PATCH] sra: SRA of non-escaped aggregates passed by reference to calls

2023-11-16 Thread Richard Biener
On Thu, 16 Nov 2023, Martin Jambor wrote:

> Hello,
> 
> PR109849 shows that a loop that heavily pushes and pops from a stack
> implemented by a C++ std::vec results in slow code, mainly because the
> vector structure is not split by SRA and so we end up in many loads
> and stores into it.  This is because it is passed by reference
> to (re)allocation methods and so needs to live in memory, even though
> it does not escape from them and so we could SRA it if we
> re-constructed it before the call and then separated it to distinct
> replacements afterwards.
> 
> This patch does exactly that, first relaxing the selection of
> candidates to also include those which are addressable but do not
> escape and then adding code to deal with the calls.  The
> micro-benchmark that is also the (scan-dump) testcase in this patch
> runs twice as fast with it than with current trunk.  Honza measured
> its effect on the libjxl benchmark and it almost closes the
> performance gap between Clang and GCC while not requiring excessive
> inlining and thus code growth.
> 
> The patch disallows creation of replacements for such aggregates which
> are also accessed with a precision smaller than their size because I
> have observed that this led to excessive zero-extending of data
> leading to slow-downs of perlbench (on some CPUs).  Apart from this
> case I have not noticed any regressions, at least not so far.
> 
> Gimple call argument flags can tell if an argument is unused (and then
> we do not need to generate any statements for it) or if it is not
> written to and then we do not need to generate statements loading
> replacements from the original aggregate after the call statement.
> Unfortunately, we cannot symmetrically use flags that an aggregate is
> not read because to avoid re-constructing the aggregate before the
> call because flags don't tell which what parts of aggregates were not
> written to, so we load all replacements, and so all need to have the
> correct value before the call.
> 
> The patch passes bootstrap, lto-bootstrap and profiled-lto-bootstrap on
> x86_64-linux and a very similar patch has also passed bootstrap and
> testing on Aarch64-linux and ppc64le-linux (I'm re-running both on these
> two architectures but as I'm sending this).  OK for master?
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2023-11-16  Martin Jambor  
> 
>   PR middle-end/109849
>   * tree-sra.cc (passed_by_ref_in_call): New.
>   (sra_initialize): Allocate passed_by_ref_in_call.
>   (sra_deinitialize): Free passed_by_ref_in_call.
>   (create_access): Add decl pool candidates only if they are not
>   already candidates.
>   (build_access_from_expr_1): Bail out on ADDR_EXPRs.
>   (build_access_from_call_arg): New function.
>   (asm_visit_addr): Rename to scan_visit_addr, change the
>   disqualification dump message.
>   (scan_function): Check taken addresses for all non-call statements,
>   including phi nodes.  Process all call arguments, including the static
>   chain, build_access_from_call_arg.
>   (maybe_add_sra_candidate): Relax need_to_live_in_memory check to allow
>   non-escaped local variables.
>   (sort_and_splice_var_accesses): Disallow smaller-than-precision
>   replacements for aggregates passed by reference to functions.
>   (sra_modify_expr): Use a separate stmt iterator for adding satements
>   before the processed statement and after it.
>   (sra_modify_call_arg): New function.
>   (sra_modify_assign): Adjust calls to sra_modify_expr.
>   (sra_modify_function_body): Likewise, use sra_modify_call_arg to
>   process call arguments, including the static chain.
> 
> gcc/testsuite/ChangeLog:
> 
> 2023-11-03  Martin Jambor  
> 
>   PR middle-end/109849
>   * g++.dg/tree-ssa/pr109849.C: New test.
>   * gfortran.dg/pr43984.f90: Added -fno-tree-sra to dg-options.
> ---
>  gcc/testsuite/g++.dg/tree-ssa/pr109849.C |  31 +++
>  gcc/testsuite/gfortran.dg/pr43984.f90|   2 +-
>  gcc/tree-sra.cc  | 244 ++-
>  3 files changed, 231 insertions(+), 46 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr109849.C
> 
> diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr109849.C 
> b/gcc/testsuite/g++.dg/tree-ssa/pr109849.C
> new file mode 100644
> index 000..cd348c0f590
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/tree-ssa/pr109849.C
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-sra" } */
> +
> +#include 
> +typedef unsigned int uint32_t;
> +std::pair pair;
> +void
> +test()
> +{
> +std::vector > stack;
> +stack.push_back (pair);
> +while (!stack.empty()) {
> +std::pair cur = stack.back();
> +stack.pop_back();
> +if (!cur.first)
> +{
> +cur.second++;
> +stack.push_back (cur);
> +}
> +

[PATCH] sra: SRA of non-escaped aggregates passed by reference to calls

2023-11-16 Thread Martin Jambor
Hello,

PR109849 shows that a loop that heavily pushes and pops from a stack
implemented by a C++ std::vec results in slow code, mainly because the
vector structure is not split by SRA and so we end up in many loads
and stores into it.  This is because it is passed by reference
to (re)allocation methods and so needs to live in memory, even though
it does not escape from them and so we could SRA it if we
re-constructed it before the call and then separated it to distinct
replacements afterwards.

This patch does exactly that, first relaxing the selection of
candidates to also include those which are addressable but do not
escape and then adding code to deal with the calls.  The
micro-benchmark that is also the (scan-dump) testcase in this patch
runs twice as fast with it than with current trunk.  Honza measured
its effect on the libjxl benchmark and it almost closes the
performance gap between Clang and GCC while not requiring excessive
inlining and thus code growth.

The patch disallows creation of replacements for such aggregates which
are also accessed with a precision smaller than their size because I
have observed that this led to excessive zero-extending of data
leading to slow-downs of perlbench (on some CPUs).  Apart from this
case I have not noticed any regressions, at least not so far.

Gimple call argument flags can tell if an argument is unused (and then
we do not need to generate any statements for it) or if it is not
written to and then we do not need to generate statements loading
replacements from the original aggregate after the call statement.
Unfortunately, we cannot symmetrically use flags that an aggregate is
not read because to avoid re-constructing the aggregate before the
call because flags don't tell which what parts of aggregates were not
written to, so we load all replacements, and so all need to have the
correct value before the call.

The patch passes bootstrap, lto-bootstrap and profiled-lto-bootstrap on
x86_64-linux and a very similar patch has also passed bootstrap and
testing on Aarch64-linux and ppc64le-linux (I'm re-running both on these
two architectures but as I'm sending this).  OK for master?

Thanks,

Martin


gcc/ChangeLog:

2023-11-16  Martin Jambor  

PR middle-end/109849
* tree-sra.cc (passed_by_ref_in_call): New.
(sra_initialize): Allocate passed_by_ref_in_call.
(sra_deinitialize): Free passed_by_ref_in_call.
(create_access): Add decl pool candidates only if they are not
already candidates.
(build_access_from_expr_1): Bail out on ADDR_EXPRs.
(build_access_from_call_arg): New function.
(asm_visit_addr): Rename to scan_visit_addr, change the
disqualification dump message.
(scan_function): Check taken addresses for all non-call statements,
including phi nodes.  Process all call arguments, including the static
chain, build_access_from_call_arg.
(maybe_add_sra_candidate): Relax need_to_live_in_memory check to allow
non-escaped local variables.
(sort_and_splice_var_accesses): Disallow smaller-than-precision
replacements for aggregates passed by reference to functions.
(sra_modify_expr): Use a separate stmt iterator for adding satements
before the processed statement and after it.
(sra_modify_call_arg): New function.
(sra_modify_assign): Adjust calls to sra_modify_expr.
(sra_modify_function_body): Likewise, use sra_modify_call_arg to
process call arguments, including the static chain.

gcc/testsuite/ChangeLog:

2023-11-03  Martin Jambor  

PR middle-end/109849
* g++.dg/tree-ssa/pr109849.C: New test.
* gfortran.dg/pr43984.f90: Added -fno-tree-sra to dg-options.
---
 gcc/testsuite/g++.dg/tree-ssa/pr109849.C |  31 +++
 gcc/testsuite/gfortran.dg/pr43984.f90|   2 +-
 gcc/tree-sra.cc  | 244 ++-
 3 files changed, 231 insertions(+), 46 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr109849.C

diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr109849.C 
b/gcc/testsuite/g++.dg/tree-ssa/pr109849.C
new file mode 100644
index 000..cd348c0f590
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr109849.C
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sra" } */
+
+#include 
+typedef unsigned int uint32_t;
+std::pair pair;
+void
+test()
+{
+std::vector > stack;
+stack.push_back (pair);
+while (!stack.empty()) {
+std::pair cur = stack.back();
+stack.pop_back();
+if (!cur.first)
+{
+cur.second++;
+stack.push_back (cur);
+}
+if (cur.second > 1)
+break;
+}
+}
+int
+main()
+{
+for (int i = 0; i < 1; i++)
+  test();
+}
+
+/* { dg-final { scan-tree-dump "Created a replacement