[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Richard Biener ---
The comment #4 case is something completely different. If it's really interesting to re-vectorize already vectorized code, please file a different bug.

The other testcases seem to work fine for me now.
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
Bug 83202 depends on bug 83326, which changed state.

Bug 83326 Summary: [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

--- Comment #8 from Richard Biener ---
Author: rguenth
Date: Thu Nov 30 07:53:31 2017
New Revision: 255267

URL: https://gcc.gnu.org/viewcvs?rev=255267&root=gcc&view=rev
Log:
2017-11-30  Richard Biener

        PR tree-optimization/83202
        * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Add
        allow_peel argument and guard peeling.
        (canonicalize_loop_induction_variables): Likewise.
        (canonicalize_induction_variables): Pass false.
        (tree_unroll_loops_completely_1): Pass unroll_outer to disallow
        peeling from cunrolli.

        * gcc.dg/vect/pr83202-1.c: New testcase.
        * gcc.dg/tree-ssa/pr61743-1.c: Adjust.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr83202-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c
    trunk/gcc/tree-ssa-loop-ivcanon.c
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202 --- Comment #9 from Richard Biener --- The last commit fixed the testcase in comment #1.
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

--- Comment #7 from rguenther at suse dot de ---
On Wed, 29 Nov 2017, bugzi...@poradnik-webmastera.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
>
> --- Comment #4 from Daniel Fruzynski ---
> One more case. Code has to process diagonal half of matrix and uses SSE
> intrinsics - see test1() below. When n is constant like in test2() below, gcc
> unrolls loops. However one more transform could be performed, replace pairs
> of SSE instructions with one AVX one.

GCC currently does not "vectorize" already vectorized code, so this is a much farther away "goal", apart from eventually pattern-matching some very simple cases.
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202 --- Comment #6 from Richard Biener --- There are multiple issues reflected in this bug. The last commit addressed the SLP cost model thing (not fixing any testcase on its own).
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

--- Comment #5 from Richard Biener ---
Author: rguenth
Date: Wed Nov 29 14:38:06 2017
New Revision: 255233

URL: https://gcc.gnu.org/viewcvs?rev=255233&root=gcc&view=rev
Log:
2017-11-29  Richard Biener

        PR tree-optimization/83202
        * tree-vect-slp.c (scalar_stmts_set_t): New typedef.
        (bst_fail): Use it.
        (vect_analyze_slp_cost_1): Add visited set, do not account SLP
        nodes vectorized to the same stmts multiple times.
        (vect_analyze_slp_cost): Allocate a visited set and pass it down.
        (vect_analyze_slp_instance): Adjust.
        (scalar_stmts_to_slp_tree_map_t): New typedef.
        (vect_schedule_slp_instance): Add a map recording the SLP node
        representing the vectorized stmts for a set of scalar stmts.
        Avoid code-generating redundancies.
        (vect_schedule_slp): Allocate map and pass it down.

        * gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
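
The cost-model part of this log (do not account SLP nodes vectorized to the same stmts multiple times) can be pictured with a small sketch. This is not the code from tree-vect-slp.c: the struct node type and the functions tree_cost and dag_cost are hypothetical names made up for illustration, and a per-node flag stands in for the visited set keyed on scalar stmts that the patch describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type for illustration only; this is not GCC's slp_tree. */
struct node {
    int cost;              /* cost of the vector stmts for this node */
    struct node *kids[2];  /* children; NULL when absent */
    bool visited;          /* stand-in for membership in a visited set */
};

/* Naive walk: a node reachable along two paths is charged twice. */
int tree_cost(const struct node *n)
{
    if (n == NULL)
        return 0;
    return n->cost + tree_cost(n->kids[0]) + tree_cost(n->kids[1]);
}

/* Walk that remembers what it has already charged, so a shared node
   contributes its cost exactly once. */
int dag_cost(struct node *n)
{
    if (n == NULL || n->visited)
        return 0;
    n->visited = true;
    return n->cost + dag_cost(n->kids[0]) + dag_cost(n->kids[1]);
}
```

With one child shared between two parents, tree_cost overestimates the total while dag_cost charges the shared child once, which is presumably the kind of double accounting the visited set is meant to avoid.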
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

--- Comment #4 from Daniel Fruzynski ---
One more case. The code has to process the diagonal half of a matrix and uses SSE intrinsics - see test1() below. When n is constant, as in test2() below, gcc unrolls the loops. However, one more transform could be performed: replacing pairs of SSE instructions with one AVX instruction.

#include "immintrin.h"

void test1(double data[100][100], unsigned int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < i; j += 2)
        {
            __m128d v = _mm_loadu_pd(&data[i][j]);
            v = _mm_mul_pd(v, v);
            _mm_storeu_pd(&data[i][j], v);
        }
    }
}

void test2(double data[100][100])
{
    const unsigned int n = 6;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < i; j += 2)
        {
            __m128d v = _mm_loadu_pd(&data[i][j]);
            v = _mm_mul_pd(v, v);
            _mm_storeu_pd(&data[i][j], v);
        }
    }
}
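
To make the requested transform concrete, here is a rough sketch of what joining two 2-wide operations into one 4-wide operation means. It deliberately uses the GCC/Clang vector_size extension instead of the SSE/AVX intrinsics above so that it builds without -mavx; the typedefs v2df_t and v4df_t and the function names square4_as_pairs and square4_joined are hypothetical, chosen only for this illustration.

```c
#include <string.h>

/* Generic vector types (GCC/Clang extension): 2 and 4 doubles wide. */
typedef double v2df_t __attribute__((vector_size(16)));
typedef double v4df_t __attribute__((vector_size(32)));

/* Square four consecutive doubles as two 2-wide operations, similar in
   spirit to two load/mul/store triples on 128-bit vectors. */
void square4_as_pairs(double *p)
{
    for (int k = 0; k < 4; k += 2) {
        v2df_t v;
        memcpy(&v, p + k, sizeof v);   /* unaligned load */
        v = v * v;
        memcpy(p + k, &v, sizeof v);   /* unaligned store */
    }
}

/* The same work expressed as a single 4-wide operation, which is what
   one 256-bit load/mul/store triple would do. */
void square4_joined(double *p)
{
    v4df_t v;
    memcpy(&v, p, sizeof v);
    v = v * v;
    memcpy(p, &v, sizeof v);
}
```

Both functions leave the same values in p[0..3]. The joined form is only valid when the two pairs are adjacent in memory and independent of each other, which holds for consecutive j iterations within one row of the loops above.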
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202 --- Comment #3 from Richard Biener --- For the other case the issue is I think that the SLP instance group size is not the number of scalar stmts but somehow set to the group-size. Changing that has quite some ripple-down effects though. -> GCC 9.
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2017-11-29
             Blocks|                              |53947
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #2 from Richard Biener ---
With += 4 the inner loop doesn't iterate, so it's effectively

void test(double data[4][4])
{
    for (int i = 0; i < 4; i++)
    {
        data[i][i] = data[i][i] * data[i][i];
        data[i][i+1] = data[i][i+1] * data[i][i+1];
    }
}

We fail to SLP here because we get confused by the computed group size of 5, as there's a gap of three elements between the first stores of each iteration. When later doing BB vectorization we fail to analyze dependences, likely because refs are not analyzed as thoroughly as with loops.

For your second example we fail to loop vectorize this because we completely peel the inner loop in cunrolli, leaving control flow inside the loop... I have a patch for that one.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
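
The group size of 5 and the gap of three follow from the array layout, and the arithmetic can be checked with a small sketch. The helper names flat_index, store_stride and store_gap are hypothetical and have nothing to do with the vectorizer's own data structures; the code only computes element indices and never touches the array.

```c
#include <stddef.h>

enum { N = 4 };  /* row length of double data[4][4] */

/* Flat element index of data[i][j], counted in doubles from data[0][0]. */
size_t flat_index(size_t i, size_t j)
{
    return i * N + j;
}

/* Distance in elements between the first store of iteration i and the
   first store of iteration i + 1, i.e. data[i][i] and data[i+1][i+1]. */
size_t store_stride(size_t i)
{
    return flat_index(i + 1, i + 1) - flat_index(i, i);
}

/* Elements left untouched per iteration when `stores` consecutive
   elements are written starting at data[i][i]. */
size_t store_gap(size_t i, size_t stores)
{
    return store_stride(i) - stores;
}
```

data[i][i] sits at element 4*i + i = 5*i, so consecutive iterations start 5 elements apart. Each iteration writes 2 adjacent elements, which leaves 5 - 2 = 3 elements untouched in between.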
[Bug tree-optimization/83202] Try joining operations on consecutive array elements during tree vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|c                           |tree-optimization
           Severity|normal                      |enhancement