Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Fri, Jun 23, 2017 at 12:26 PM, Bin.Chengwrote: > On Tue, Jun 20, 2017 at 12:36 PM, Richard Biener > wrote: >> On Tue, Jun 20, 2017 at 11:20 AM, Bin.Cheng wrote: >>> On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng wrote: On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener wrote: > On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng wrote: >> Hi, >> For now, loop distribution handles variables used outside of loop as >> reduction. >> This is inaccurate because all partitions contain statement defining >> induction >> vars. > > But final induction values are usually not used outside of the loop... This is in actuality for induction variable which is used outside of the loop. > > What is missing is loop distribution trying to change partition order. > In fact > we somehow assume we can move a reduction across a detected builtin > (I don't remember if we ever check for validity of that...). Hmm, I am not sure when we can't. If there is any dependence between builtin/reduction partitions, it should be captured by RDG or PG, otherwise the partitions are independent and can be freely ordered as long as reduction partition is scheduled last? > >> Ideally we should factor out scev-propagation as a standalone interface >> which can be called when necessary. Before that, this patch simply >> workarounds >> reduction issue by checking if the statement belongs to all partitions. >> If yes, >> the reduction must be computed in the last partition no matter how the >> loop is >> distributed. >> Bootstrap and test on x86_64 and AArch64. Is it OK? > > stmt_in_all_partitions is not kept up-to-date during partition merging > and if > merging makes the reduction partition(s) pass the stmt_in_all_partitions > test your simple workaround doesn't work ... I think it doesn't matter because: A) it's really workaround for induction variables. In general, induction variables are included by all partition. B) After classify partition, we immediately fuses all reduction partitions. More stmt_in_all_partitions means we are fusing non-reduction partition with reduction partition, so the newly generated (stmt_in_all_partitions) are actually not reduction statements. The workaround won't work anyway even the bitmap is maintained. > > As written it's a valid optimization but can you please note it's > limitation in > some comment please? Yeah, I will add comment explaining it. >>> Comment added in new version patch. It also computes bitmap outside >>> now, is it OK? >> >> Ok. Can you add a testcase for this as well please? I think the >> series up to this >> is now fully reviewed, I defered 1/n (the new IFN) to the last one >> containing the >> runtime versioning. Can you re-post that (you can merge with the IFN patch) >> to apply after the series has been applied up to this? > Test case added. Ok. > Thanks, > bin > 2017-06-20 Bin Cheng > > * tree-loop-distribution.c (classify_partition): New parameter and > better handle reduction statement. > (rdg_build_partitions): Revise comment. > (distribute_loop): Compute statements in all partitions and pass it > to classify_partition. > > gcc/testsuite/ChangeLog > 2017-06-20 Bin Cheng > > * gcc.dg/tree-ssa/ldist-26.c: New test.
Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Tue, Jun 20, 2017 at 12:36 PM, Richard Bienerwrote: > On Tue, Jun 20, 2017 at 11:20 AM, Bin.Cheng wrote: >> On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng wrote: >>> On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener >>> wrote: On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng wrote: > Hi, > For now, loop distribution handles variables used outside of loop as > reduction. > This is inaccurate because all partitions contain statement defining > induction > vars. But final induction values are usually not used outside of the loop... >>> This is in actuality for induction variable which is used outside of the >>> loop. What is missing is loop distribution trying to change partition order. In fact we somehow assume we can move a reduction across a detected builtin (I don't remember if we ever check for validity of that...). >>> Hmm, I am not sure when we can't. If there is any dependence between >>> builtin/reduction partitions, it should be captured by RDG or PG, >>> otherwise the partitions are independent and can be freely ordered as >>> long as reduction partition is scheduled last? > Ideally we should factor out scev-propagation as a standalone interface > which can be called when necessary. Before that, this patch simply > workarounds > reduction issue by checking if the statement belongs to all partitions. > If yes, > the reduction must be computed in the last partition no matter how the > loop is > distributed. > Bootstrap and test on x86_64 and AArch64. Is it OK? stmt_in_all_partitions is not kept up-to-date during partition merging and if merging makes the reduction partition(s) pass the stmt_in_all_partitions test your simple workaround doesn't work ... >>> I think it doesn't matter because: >>> A) it's really workaround for induction variables. In general, >>> induction variables are included by all partition. >>> B) After classify partition, we immediately fuses all reduction >>> partitions. More stmt_in_all_partitions means we are fusing >>> non-reduction partition with reduction partition, so the newly >>> generated (stmt_in_all_partitions) are actually not reduction >>> statements. The workaround won't work anyway even the bitmap is >>> maintained. As written it's a valid optimization but can you please note it's limitation in some comment please? >>> Yeah, I will add comment explaining it. >> Comment added in new version patch. It also computes bitmap outside >> now, is it OK? > > Ok. Can you add a testcase for this as well please? I think the > series up to this > is now fully reviewed, I defered 1/n (the new IFN) to the last one > containing the > runtime versioning. Can you re-post that (you can merge with the IFN patch) > to apply after the series has been applied up to this? Test case added. Thanks, bin 2017-06-20 Bin Cheng * tree-loop-distribution.c (classify_partition): New parameter and better handle reduction statement. (rdg_build_partitions): Revise comment. (distribute_loop): Compute statements in all partitions and pass it to classify_partition. gcc/testsuite/ChangeLog 2017-06-20 Bin Cheng * gcc.dg/tree-ssa/ldist-26.c: New test. From b16a4839f3211737dccc3ff92ab2c4f325907cd3 Mon Sep 17 00:00:00 2001 From: Bin Cheng Date: Thu, 22 Jun 2017 17:16:58 +0100 Subject: [PATCH 11/13] reduction-workaround-20170607.txt --- gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c | 36 ++ gcc/tree-loop-distribution.c | 43 2 files changed, 68 insertions(+), 11 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c new file mode 100644 index 000..3a69884 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-26.c @@ -0,0 +1,36 @@ +/* { dg-do run } */ +/* { dg-options "-O3 -ftree-loop-distribution -fdump-tree-ldist-details" } */ + +extern void abort (void); + +int a[130], b[128], c[128]; + +int __attribute__((noinline,noclone)) +foo (int len, int x) +{ + int i; + for (i = 1; i <= len; ++i) +{ + a[i] = a[i + 2] + 1; + b[i] = 0; + a[i + 1] = a[i] - 3; + if (i < x) + c[i] = a[i]; +} + return i; +} + +int main() +{ + int i; + for (i = 0; i < 130; ++i) +a[i] = i; + foo (127, 67); + if (a[0] != 0 || a[1] != 4 || a[127] != 130) +abort (); + return 0; +} + +/* { dg-final { scan-tree-dump "distributed: split to 2 loops and 0 library calls" "ldist" } } */ +/* { dg-final { scan-tree-dump "distributed: split to 1 loops and 1 library calls" "ldist" } } */ +/* { dg-final { scan-tree-dump
Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Tue, Jun 20, 2017 at 11:20 AM, Bin.Chengwrote: > On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng wrote: >> On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener >> wrote: >>> On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng wrote: Hi, For now, loop distribution handles variables used outside of loop as reduction. This is inaccurate because all partitions contain statement defining induction vars. >>> >>> But final induction values are usually not used outside of the loop... >> This is in actuality for induction variable which is used outside of the >> loop. >>> >>> What is missing is loop distribution trying to change partition order. In >>> fact >>> we somehow assume we can move a reduction across a detected builtin >>> (I don't remember if we ever check for validity of that...). >> Hmm, I am not sure when we can't. If there is any dependence between >> builtin/reduction partitions, it should be captured by RDG or PG, >> otherwise the partitions are independent and can be freely ordered as >> long as reduction partition is scheduled last? >>> Ideally we should factor out scev-propagation as a standalone interface which can be called when necessary. Before that, this patch simply workarounds reduction issue by checking if the statement belongs to all partitions. If yes, the reduction must be computed in the last partition no matter how the loop is distributed. Bootstrap and test on x86_64 and AArch64. Is it OK? >>> >>> stmt_in_all_partitions is not kept up-to-date during partition merging and >>> if >>> merging makes the reduction partition(s) pass the stmt_in_all_partitions >>> test your simple workaround doesn't work ... >> I think it doesn't matter because: >> A) it's really workaround for induction variables. In general, >> induction variables are included by all partition. >> B) After classify partition, we immediately fuses all reduction >> partitions. More stmt_in_all_partitions means we are fusing >> non-reduction partition with reduction partition, so the newly >> generated (stmt_in_all_partitions) are actually not reduction >> statements. The workaround won't work anyway even the bitmap is >> maintained. >>> >>> As written it's a valid optimization but can you please note it's >>> limitation in >>> some comment please? >> Yeah, I will add comment explaining it. > Comment added in new version patch. It also computes bitmap outside > now, is it OK? Ok. Can you add a testcase for this as well please? I think the series up to this is now fully reviewed, I defered 1/n (the new IFN) to the last one containing the runtime versioning. Can you re-post that (you can merge with the IFN patch) to apply after the series has been applied up to this? Thanks, Richard. > Thanks, > bin > 2017-06-07 Bin Cheng > > * tree-loop-distribution.c (classify_partition): New parameter and > better handle reduction statement. > (rdg_build_partitions): Revise comment. > (distribute_loop): Compute statements in all partitions and pass it > to classify_partition.
Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Fri, Jun 16, 2017 at 6:15 PM, Bin.Chengwrote: > On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener > wrote: >> On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng wrote: >>> Hi, >>> For now, loop distribution handles variables used outside of loop as >>> reduction. >>> This is inaccurate because all partitions contain statement defining >>> induction >>> vars. >> >> But final induction values are usually not used outside of the loop... > This is in actuality for induction variable which is used outside of the loop. >> >> What is missing is loop distribution trying to change partition order. In >> fact >> we somehow assume we can move a reduction across a detected builtin >> (I don't remember if we ever check for validity of that...). > Hmm, I am not sure when we can't. If there is any dependence between > builtin/reduction partitions, it should be captured by RDG or PG, > otherwise the partitions are independent and can be freely ordered as > long as reduction partition is scheduled last? >> >>> Ideally we should factor out scev-propagation as a standalone interface >>> which can be called when necessary. Before that, this patch simply >>> workarounds >>> reduction issue by checking if the statement belongs to all partitions. If >>> yes, >>> the reduction must be computed in the last partition no matter how the loop >>> is >>> distributed. >>> Bootstrap and test on x86_64 and AArch64. Is it OK? >> >> stmt_in_all_partitions is not kept up-to-date during partition merging and if >> merging makes the reduction partition(s) pass the stmt_in_all_partitions >> test your simple workaround doesn't work ... > I think it doesn't matter because: > A) it's really workaround for induction variables. In general, > induction variables are included by all partition. > B) After classify partition, we immediately fuses all reduction > partitions. More stmt_in_all_partitions means we are fusing > non-reduction partition with reduction partition, so the newly > generated (stmt_in_all_partitions) are actually not reduction > statements. The workaround won't work anyway even the bitmap is > maintained. >> >> As written it's a valid optimization but can you please note it's limitation >> in >> some comment please? > Yeah, I will add comment explaining it. Comment added in new version patch. It also computes bitmap outside now, is it OK? Thanks, bin 2017-06-07 Bin Cheng * tree-loop-distribution.c (classify_partition): New parameter and better handle reduction statement. (rdg_build_partitions): Revise comment. (distribute_loop): Compute statements in all partitions and pass it to classify_partition. From b3502d713309da08d93cd53e5fe8fbfdccf3557b Mon Sep 17 00:00:00 2001 From: Bin Cheng Date: Fri, 9 Jun 2017 13:21:07 +0100 Subject: [PATCH 11/13] reduction-workaround-20170607.txt --- gcc/tree-loop-distribution.c | 43 --- 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c index d741e9e..1b4e238 100644 --- a/gcc/tree-loop-distribution.c +++ b/gcc/tree-loop-distribution.c @@ -1226,17 +1226,18 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v) } /* Classifies the builtin kind we can generate for PARTITION of RDG and LOOP. - For the moment we detect only the memset zero pattern. */ + For the moment we detect memset, memcpy and memmove patterns. Bitmap + STMT_IN_ALL_PARTITIONS contains statements belonging to all partitions. */ static void -classify_partition (loop_p loop, struct graph *rdg, partition *partition) +classify_partition (loop_p loop, struct graph *rdg, partition *partition, + bitmap stmt_in_all_partitions) { bitmap_iterator bi; unsigned i; tree nb_iter; data_reference_p single_load, single_store; - bool volatiles_p = false; - bool plus_one = false; + bool volatiles_p = false, plus_one = false, has_reduction = false; partition->kind = PKIND_NORMAL; partition->main_dr = NULL; @@ -1251,16 +1252,31 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition) if (gimple_has_volatile_ops (stmt)) volatiles_p = true; - /* If the stmt has uses outside of the loop mark it as reduction. */ + /* If the stmt is not included by all partitions and there is uses + outside of the loop, then mark the partition as reduction. */ if (stmt_has_scalar_dependences_outside_loop (loop, stmt)) { - partition->reduction_p = true; - return; + /* Due to limitation in the transform phase we have to fuse all + reduction partitions. As a result, this could cancel valid + loop distribution especially for loop that induction variable + is used outside of loop. To workaround this issue, we skip + marking partition as reudction if the reduction stmt
Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Fri, Jun 16, 2017 at 11:21 AM, Richard Bienerwrote: > On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng wrote: >> Hi, >> For now, loop distribution handles variables used outside of loop as >> reduction. >> This is inaccurate because all partitions contain statement defining >> induction >> vars. > > But final induction values are usually not used outside of the loop... This is in actuality for induction variable which is used outside of the loop. > > What is missing is loop distribution trying to change partition order. In > fact > we somehow assume we can move a reduction across a detected builtin > (I don't remember if we ever check for validity of that...). Hmm, I am not sure when we can't. If there is any dependence between builtin/reduction partitions, it should be captured by RDG or PG, otherwise the partitions are independent and can be freely ordered as long as reduction partition is scheduled last? > >> Ideally we should factor out scev-propagation as a standalone interface >> which can be called when necessary. Before that, this patch simply >> workarounds >> reduction issue by checking if the statement belongs to all partitions. If >> yes, >> the reduction must be computed in the last partition no matter how the loop >> is >> distributed. >> Bootstrap and test on x86_64 and AArch64. Is it OK? > > stmt_in_all_partitions is not kept up-to-date during partition merging and if > merging makes the reduction partition(s) pass the stmt_in_all_partitions > test your simple workaround doesn't work ... I think it doesn't matter because: A) it's really workaround for induction variables. In general, induction variables are included by all partition. B) After classify partition, we immediately fuses all reduction partitions. More stmt_in_all_partitions means we are fusing non-reduction partition with reduction partition, so the newly generated (stmt_in_all_partitions) are actually not reduction statements. The workaround won't work anyway even the bitmap is maintained. > > As written it's a valid optimization but can you please note it's limitation > in > some comment please? Yeah, I will add comment explaining it. Thanks, bin > > Also... > > + bitmap_set_range (stmt_in_all_partitions, 0, rdg->n_vertices); > + rdg_build_partitions (rdg, stmts, , stmt_in_all_partitions); > > ick. Please instead do > >bitmap_copy (smtt_in_all_partitions, partitions[0]->stmts); >for (i = 1; i < ...) > bitmap_and_into (stmt_in_all_partitons, partitions[i]->stmts); > > Thanks, > Richard. > >> Thanks, >> bin >> 2017-06-07 Bin Cheng >> >> * tree-loop-distribution.c (classify_partition): New parameter and >> better handle reduction statement. >> (rdg_build_partitions): New parameter and record statements belonging >> to all partitions. >> (distribute_loop): Update use of above functions.
Re: [PATCH GCC][12/13]Workaround reduction statements for distribution
On Mon, Jun 12, 2017 at 7:03 PM, Bin Chengwrote: > Hi, > For now, loop distribution handles variables used outside of loop as > reduction. > This is inaccurate because all partitions contain statement defining induction > vars. But final induction values are usually not used outside of the loop... What is missing is loop distribution trying to change partition order. In fact we somehow assume we can move a reduction across a detected builtin (I don't remember if we ever check for validity of that...). > Ideally we should factor out scev-propagation as a standalone interface > which can be called when necessary. Before that, this patch simply > workarounds > reduction issue by checking if the statement belongs to all partitions. If > yes, > the reduction must be computed in the last partition no matter how the loop is > distributed. > Bootstrap and test on x86_64 and AArch64. Is it OK? stmt_in_all_partitions is not kept up-to-date during partition merging and if merging makes the reduction partition(s) pass the stmt_in_all_partitions test your simple workaround doesn't work ... As written it's a valid optimization but can you please note it's limitation in some comment please? Also... + bitmap_set_range (stmt_in_all_partitions, 0, rdg->n_vertices); + rdg_build_partitions (rdg, stmts, , stmt_in_all_partitions); ick. Please instead do bitmap_copy (smtt_in_all_partitions, partitions[0]->stmts); for (i = 1; i < ...) bitmap_and_into (stmt_in_all_partitons, partitions[i]->stmts); Thanks, Richard. > Thanks, > bin > 2017-06-07 Bin Cheng > > * tree-loop-distribution.c (classify_partition): New parameter and > better handle reduction statement. > (rdg_build_partitions): New parameter and record statements belonging > to all partitions. > (distribute_loop): Update use of above functions.
[PATCH GCC][12/13]Workaround reduction statements for distribution
Hi, For now, loop distribution handles variables used outside of loop as reduction. This is inaccurate because all partitions contain statement defining induction vars. Ideally we should factor out scev-propagation as a standalone interface which can be called when necessary. Before that, this patch simply workarounds reduction issue by checking if the statement belongs to all partitions. If yes, the reduction must be computed in the last partition no matter how the loop is distributed. Bootstrap and test on x86_64 and AArch64. Is it OK? Thanks, bin 2017-06-07 Bin Cheng* tree-loop-distribution.c (classify_partition): New parameter and better handle reduction statement. (rdg_build_partitions): New parameter and record statements belonging to all partitions. (distribute_loop): Update use of above functions.From 51764e6a377cf21ef13ffc36928c9f2b8932aac2 Mon Sep 17 00:00:00 2001 From: Bin Cheng Date: Fri, 9 Jun 2017 13:21:07 +0100 Subject: [PATCH 12/14] reduction-workaround-20170607.txt --- gcc/tree-loop-distribution.c | 40 +++- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c index 7e31fee8..167155e 100644 --- a/gcc/tree-loop-distribution.c +++ b/gcc/tree-loop-distribution.c @@ -1276,17 +1276,18 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v) } /* Classifies the builtin kind we can generate for PARTITION of RDG and LOOP. - For the moment we detect only the memset zero pattern. */ + For the moment we detect memset, memcpy and memmove patterns. Bitmap + STMT_IN_ALL_PARTITIONS contains statements belonging to all partitions. */ static void -classify_partition (loop_p loop, struct graph *rdg, partition *partition) +classify_partition (loop_p loop, struct graph *rdg, partition *partition, + bitmap stmt_in_all_partitions) { bitmap_iterator bi; unsigned i; tree nb_iter; data_reference_p single_load, single_store; - bool volatiles_p = false; - bool plus_one = false; + bool volatiles_p = false, plus_one = false, has_reduction = false; partition->kind = PKIND_NORMAL; partition->main_dr = NULL; @@ -1301,16 +1302,24 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition) if (gimple_has_volatile_ops (stmt)) volatiles_p = true; - /* If the stmt has uses outside of the loop mark it as reduction. */ + /* If the stmt is not included by all partitions and there is uses +outside of the loop, then mark the partition as reduction. */ if (stmt_has_scalar_dependences_outside_loop (loop, stmt)) { - partition->reduction_p = true; - return; + if (!bitmap_bit_p (stmt_in_all_partitions, i)) + { + partition->reduction_p = true; + return; + } + has_reduction = true; } } /* Perform general partition disqualification for builtins. */ if (volatiles_p + /* Simple workaround to prevent classifying the partition as builtin +if it contains any use outside of loop. */ + || has_reduction || !flag_tree_loop_distribute_patterns) return; @@ -1540,14 +1549,16 @@ share_memory_accesses (struct graph *rdg, return false; } -/* Aggregate several components into a useful partition that is - registered in the PARTITIONS vector. Partitions will be - distributed in different loops. */ +/* For each seed statement in STARTING_STMTS, this function builds + partition for it by adding depended statements according to RDG. + All partitions are recorded in PARTITIONS. Statements belongs + to all partitions are recorded in STMT_IN_ALL_PARTITIONS. */ static void rdg_build_partitions (struct graph *rdg, vec starting_stmts, - vec *partitions) + vec *partitions, + bitmap stmt_in_all_partitions) { auto_bitmap processed; int i; @@ -1568,6 +1579,7 @@ rdg_build_partitions (struct graph *rdg, partition *partition = build_rdg_partition_for_vertex (rdg, v); bitmap_ior_into (processed, partition->stmts); + bitmap_and_into (stmt_in_all_partitions, partition->stmts); if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -1814,13 +1826,15 @@ distribute_loop (struct loop *loop, vec stmts, ddrs_vec = new vec (); ddrs_table = new hash_table (389); + auto_bitmap stmt_in_all_partitions; auto_vec partitions; - rdg_build_partitions (rdg, stmts, ); + bitmap_set_range (stmt_in_all_partitions, 0, rdg->n_vertices); + rdg_build_partitions (rdg, stmts, , stmt_in_all_partitions); any_builtin = false; FOR_EACH_VEC_ELT (partitions, i, partition) { - classify_partition (loop, rdg, partition); + classify_partition (loop, rdg,