On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> This patch merges adjacent memset builtin partitions if possible.  It is
>> a useful special case optimization transforming below code:
>>
>> #define M (256)
>> #define N (512)
>>
>> struct st
>> {
>>   int a[M][N];
>>   int c[M];
>>   int b[M][N];
>> };
>>
>> void
>> foo (struct st *p)
>> {
>>   for (unsigned i = 0; i < M; ++i)
>>     {
>>       p->c[i] = 0;
>>       for (unsigned j = N; j > 0; --j)
>>         {
>>           p->a[i][j - 1] = 0;
>>           p->b[i][j - 1] = 0;
>>         }
>>     }
>>
>> into a single memset function call, rather than three calls initializing
>> the structure field by field.
>>
>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>
> +      /* Insertion sort is good enough for the small sub-array.  */
> +      for (k = i + 1; k < j; ++k)
> +       {
> +         part1 = (*partitions)[k];
> +         for (l = k; l > i; --l)
> +           {
> +             part2 = (*partitions)[l - 1];
> +             if (part1->builtin->dst_base_offset
> +                   >= part2->builtin->dst_base_offset)
> +               break;
> +
> +             (*partitions)[l] = part2;
> +           }
> +         (*partitions)[l] = part1;
> +       }
>
> so we want to sort [i, j[ after dst_base_offset.  I realize you don't want
> to write a qsort helper for this but I can't wrap my head around the above
> in 5 minutes so ... please!
Hmm, I thought twice about this and now I believe stable sorting (thus
insertion sort)
is required here.  Please see below for explanation.

>
> You don't seem to check the sizes of the memsets given that they
> obviously cannot overlap(!?)
Yes, given it's quite special case transformation, I did add code
checking overlap cases.
>
> Also why handle memset and not memcpy/memmove or combinations of
> them (for sorting)?
>
>   for ()
>    {
>       p->a[i] = 0;
>       p->c[i] = q->c[i];
>       p->b[i] = 0;
>    }
>
> with a and b adjacent.  I suppose p->c could be computed by a
> non-builtin partition as well.
Yes, the two memset builtin partitions can be merged in this case, but...
> So don't we want to see if dependences allow sorting all builtin
> partitions next to each other
> as much as possible?  (maybe we do that already?)
The answer for this, above partition merging and use of qsort is no.
I think all the three are the same question here.  For now we only do
topological sort for partitions.  To maximize parallelism (either by merging
normal parallel partitions or merging builtin partitions) requires fine-tuned
sorting between partitions that doesn't dependence on each other.
In order to sort all memset/memcpy/memmove, we need check dependence
between all data references between different partitions.  For example, I
created new test ldist-36.c illustrating sorting memcpy along with memset
would generate wrong code because dependence is broken.  It's the same
for qsort.  In extreme case, if the same array is set twice with different rhs
value, the order between the two sets needs to be preserved.  Unfortunately,
qsort is unstable and could reorder different sets.  This would break output
dependence.
At the point of this function, dependence graph is destroyed, we can't do
much in addition to special case handling for memset.  Full solution would
require a customized topological sorting process.

So, this updated patch keeps insertion sort with additional comment explaining
why.  Also two test cases added showing when memset partitions should be
merged (we can't for now) and when memset partitions should not be merged.

Bootstrap and test.  Is it OK?

Thanks,
bin

2017-10-14  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (tree-ssa-loop-ivopts.h): New header file.
(struct builtin_info): New fields.
(classify_builtin_1): Compute and record base and offset parts for
memset builtin partition by calling strip_offset.
(fuse_memset_builtins): New function.
(finalize_partitions): Fuse adjacent memset partitions by calling
above function.
* tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
Expose the interface.
* tree-ssa-loop-ivopts.h (strip_offset): New declaration.

gcc/testsuite/ChangeLog
2017-10-14  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
* gcc.dg/tree-ssa/ldist-32.c: New test.
* gcc.dg/tree-ssa/ldist-35.c: New test.
* gcc.dg/tree-ssa/ldist-36.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
index 4efc0a4..b3617f6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
@@ -45,5 +45,5 @@ mad_synth_mute (struct mad_synth *synth)
   return;
 }
 
-/* { dg-final { scan-tree-dump "distributed: split to 0 loops and 4 library 
calls" "ldist" } } */
-/* { dg-final { scan-tree-dump-times "generated memset zero" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 
1 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated memset zero" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
new file mode 100644
index 0000000..477d222
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns 
-fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st *p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->a[i][j - 1] = 0;
+         p->b[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 
loops and 1 library" 1 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_memset \\(.*, 0, 1049600\\);" 
1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
new file mode 100644
index 0000000..445d23d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns 
-fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p, struct st * restrict q)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->a[i][j - 1] = q->a[i][j - 1];
+         p->b[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 
loops and 1 library" 1 "ldist" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
new file mode 100644
index 0000000..0e843f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns 
-fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->b[i][j - 1] = p->a[i][j - 1];
+         p->a[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 
loops and 3 library" 1 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 5e835be..a9026d9 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -106,6 +106,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stor-layout.h"
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop.h"
 #include "tree-into-ssa.h"
 #include "tree-ssa.h"
@@ -604,6 +605,10 @@ struct builtin_info
   tree dst_base;
   tree src_base;
   tree size;
+  /* Base and offset part of dst_base after stripping constant offset.  This
+     is only used in memset builtin distribution for now.  */
+  tree dst_base_base;
+  unsigned HOST_WIDE_INT dst_base_offset;
 };
 
 /* Partition for loop distribution.  */
@@ -1500,7 +1505,11 @@ classify_builtin_st (loop_p loop, partition *partition, 
data_reference_p dr)
   if (!compute_access_range (loop, dr, &base, &size))
     return;
 
-  partition->builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  builtin->dst_base_base = strip_offset (builtin->dst_base,
+                                        &builtin->dst_base_offset);
+  partition->builtin = builtin;
   partition->kind = PKIND_MEMSET;
 }
 
@@ -2476,6 +2485,116 @@ version_for_distribution_p (vec<struct partition *> 
*partitions,
   return (alias_ddrs->length () > 0);
 }
 
+/* Fuse adjacent memset builtin PARTITIONS if possible.  This is a special
+   case optimization transforming below code:
+
+     __builtin_memset (&obj, 0, 100);
+     _1 = &obj + 100;
+     __builtin_memset (_1, 0, 200);
+     _2 = &obj + 300;
+     __builtin_memset (_2, 0, 100);
+
+   into:
+
+     __builtin_memset (&obj, 0, 400);
+
+   Note we don't have dependence information between different partitions
+   at this point, as a result, we can't handle nonadjacent memset builtin
+   partitions since dependence might be broken.  */
+
+static void
+fuse_memset_builtins (vec<struct partition *> *partitions)
+{
+  unsigned i, j, k, l;
+  struct partition *part1, *part2;
+
+  for (i = 0; partitions->iterate (i, &part1);)
+    {
+      if (part1->kind != PKIND_MEMSET)
+       {
+         i++;
+         continue;
+       }
+
+      /* Find sub-array of memset builtins of the same base.  Index range
+        of the sub-array is [i, j) with "j > i".  */
+      for (j = i + 1; partitions->iterate (j, &part2); ++j)
+       {
+         if (part2->kind != PKIND_MEMSET
+             || !operand_equal_p (part1->builtin->dst_base_base,
+                                  part2->builtin->dst_base_base, 0))
+           break;
+       }
+
+      /* Insertion sort is necessary because stable sorting is required.
+        Also it is faster then qsort for small sub-arrays.  */
+      for (k = i + 1; k < j; ++k)
+       {
+         part1 = (*partitions)[k];
+         for (l = k; l > i; --l)
+           {
+             part2 = (*partitions)[l - 1];
+             if (part1->builtin->dst_base_offset
+                   >= part2->builtin->dst_base_offset)
+               break;
+
+             (*partitions)[l] = part2;
+           }
+         (*partitions)[l] = part1;
+       }
+      /* Continue with next partition.  */
+      i = j;
+    }
+
+  /* Merge all consecutive memset builtin partitions.  */
+  for (i = 0; i < partitions->length () - 1;)
+    {
+      part1 = (*partitions)[i];
+      if (part1->kind != PKIND_MEMSET)
+       {
+         i++;
+         continue;
+       }
+
+      part2 = (*partitions)[i + 1];
+      /* Only merge memset partitions of the same base and with constant
+        access sizes.  */
+      if (part2->kind != PKIND_MEMSET
+         || TREE_CODE (part1->builtin->size) != INTEGER_CST
+         || TREE_CODE (part2->builtin->size) != INTEGER_CST
+         || !operand_equal_p (part1->builtin->dst_base_base,
+                              part2->builtin->dst_base_base, 0))
+       {
+         i++;
+         continue;
+       }
+      tree rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
+      tree rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
+      int bytev1 = const_with_all_bytes_same (rhs1);
+      int bytev2 = const_with_all_bytes_same (rhs2);
+      /* Only merge memset partitions of the same value.  */
+      if (bytev1 != bytev2 || bytev1 == -1)
+       {
+         i++;
+         continue;
+       }
+      wide_int end1 = wi::add (part1->builtin->dst_base_offset,
+                              wi::to_wide (part1->builtin->size));
+      /* Only merge adjacent memset partitions.  */
+      if (wi::ne_p (end1, part2->builtin->dst_base_offset))
+       {
+         i++;
+         continue;
+       }
+      /* Merge partitions[i] and partitions[i+1].  */
+      part1->builtin->size = fold_build2 (PLUS_EXPR, sizetype,
+                                         part1->builtin->size,
+                                         part2->builtin->size);
+      partition_free (part2);
+      partitions->ordered_remove (i + 1);
+    }
+}
+
 /* Fuse PARTITIONS of LOOP if necessary before finalizing distribution.
    ALIAS_DDRS contains ddrs which need runtime alias check.  */
 
@@ -2519,6 +2638,10 @@ finalize_partitions (struct loop *loop, vec<struct 
partition *> *partitions,
        }
       partitions->truncate (1);
     }
+
+  /* Fuse memset builtins if possible.  */
+  if (partitions->length () > 1)
+    fuse_memset_builtins (partitions);
 }
 
 /* Distributes the code from LOOP in such a way that producer statements
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 2a71027..4eb0e73 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -1550,9 +1550,6 @@ record_invariant (struct ivopts_data *data, tree op, bool 
nonlinear_use)
   bitmap_set_bit (data->relevant, SSA_NAME_VERSION (op));
 }
 
-static tree
-strip_offset (tree expr, unsigned HOST_WIDE_INT *offset);
-
 /* Record a group of TYPE.  */
 
 static struct iv_group *
@@ -2863,7 +2860,7 @@ strip_offset_1 (tree expr, bool inside_addr, bool 
top_compref,
 
 /* Strips constant offsets from EXPR and stores them to OFFSET.  */
 
-static tree
+tree
 strip_offset (tree expr, unsigned HOST_WIDE_INT *offset)
 {
   HOST_WIDE_INT off;
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index f8f31e9..bd92051 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -28,6 +28,7 @@ extern void dump_cand (FILE *, struct iv_cand *);
 extern bool contains_abnormal_ssa_name_p (tree);
 extern struct loop *outermost_invariant_loop_for_expr (struct loop *, tree);
 extern bool expr_invariant_in_loop_p (struct loop *, tree);
+extern tree strip_offset (tree, unsigned HOST_WIDE_INT *);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
 

Reply via email to