https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
>
> --- Comment #4 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> >
> >   int i;
> >   for (i = 0; i < 128; ++i)
> >     {
> >       a[i] = a[i] + 1;
> >       b[i] = d[i];
> >       c[i] = a[i] / d[i];
> >     }
> >
> > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > the other two partitions to be fused.
> >
> > Generally the cost model lacks computing the number of input/output
> > streams of a partition and a target interface to query it about limits.
> > Usually store bandwidth is not equal to load bandwidth, and store
> > streams that are not re-used can benefit from non-temporal stores being
> > used by libc.
> >
> > In your testcase I wonder whether distributing to
> >
> >   for (int j = 0; j < x; j++)
> >     {
> >       for (int i = 0; i < y; i++)
> >         {
> >           c[j][i] = b[j][i] - a[j][i];
> >         }
> >     }
> >   memcpy (a, b, ...);
> >
> > would be faster in the end (or even doing the memcpy first in this case).
> >
> > Well, for now let's be more conservative given the cost model really is
> > lacking.
>
> The testcase is reduced from CALC3 in 171.swim.  I've been seeing a 3%
> regression for Cortex-A72 after r242038, and I can fix that with
> -fno-tree-loop-distribute-patterns.
>
> In that benchmark you've got 3 instances of the above pattern, so you end
> up with 3 memcpy calls after:
>
>       DO 300 J=1,N
>       DO 300 I=1,M
>       UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
>       VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
>       POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
>       U(I,J) = UNEW(I,J)
>       V(I,J) = VNEW(I,J)
>       P(I,J) = PNEW(I,J)
>   300 CONTINUE
>
> 3 memcpy calls compared to 3 vector store instructions doesn't seem the
> right tradeoff to me.  Sorry if I reduced the testcase too far to make
> the balance clear.

Itanic seems to like it though:
http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png

For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown as
well around that time.  I guess it depends on whether the CPU is already
throttled by load/compute bandwidth here.

What should be profitable is to distribute the above to three loops
(w/o the memcpy), i.e. after the patch doing -ftree-loop-distribution.
Patch being

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c        (revision 244963)
+++ gcc/tree-loop-distribution.c        (working copy)
@@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
     for (int j = i + 1; partitions.iterate (j, &partition); ++j)
       {
-	if (!partition_builtin_p (partition)
-	    && similar_memory_accesses (rdg, into, partition))
+	if (similar_memory_accesses (rdg, into, partition))
 	  {
 	    if (dump_file && (dump_flags & TDF_DETAILS))
 	      {
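
For concreteness, here is what the three-loop distribution of the
CALC3-style kernel could look like, as a minimal C sketch.  The function
name, array names and parameters are made up for illustration, not the
actual 171.swim sources, and the sketch assumes the U, V and P variable
groups do not alias each other.  Each loop then touches three arrays
instead of nine, and each copy stays fused with its update, so it is a
plain vector store rather than a memcpy call:

  /* Hypothetical C version of the distributed kernel; n, m, alpha and
     the arrays stand in for the Fortran N, M, ALPHA, U/UOLD/UNEW etc.  */
  void
  calc3_distributed (int n, int m, double alpha,
                     double uold[n][m], double u[n][m], double unew[n][m],
                     double vold[n][m], double v[n][m], double vnew[n][m],
                     double pold[n][m], double p[n][m], double pnew[n][m])
  {
    /* Loop 1: U group.  u[j][i] is read before it is overwritten, so
       the copy is legal in the same loop body.  */
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2. * u[j][i] + uold[j][i]);
          u[j][i] = unew[j][i];
        }

    /* Loop 2: same shape for the V group.  */
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          vold[j][i] = v[j][i] + alpha * (vnew[j][i] - 2. * v[j][i] + vold[j][i]);
          v[j][i] = vnew[j][i];
        }

    /* Loop 3: same shape for the P group.  */
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          pold[j][i] = p[j][i] + alpha * (pnew[j][i] - 2. * p[j][i] + pold[j][i]);
          p[j][i] = pnew[j][i];
        }
  }

Splitting per variable group keeps the number of concurrent load/store
streams per loop low, which is exactly the quantity the cost model
discussed above does not yet compute.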