[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson <rth at gcc dot gnu.org> changed:

           What       |Removed                |Added
           ----------------------------------------------------------------
           Status     |ASSIGNED               |NEW
           AssignedTo |rth at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #22 from Richard Henderson <rth at gcc dot gnu.org> 2012-11-29 21:17:05 UTC ---
Needs long-term work in pre-vectorization folding.
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What             |Removed |Added
           -----------------------------------
           Target Milestone |4.7.2   |4.7.3

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-09-20 10:21:07 UTC ---
GCC 4.7.2 has been released.
--- Comment #20 from Matt Hargett <matt at use dot net> 2012-08-20 23:52:31 UTC ---
Some additional information: compared to LLVM 3.1 with -O3, GCC 4.7 is twice
as slow on these benchmarks. LLVM even outperforms GCC 4.1, which previously
had the best result. We are very eager to hear about any resolution for this
major regression in 4.7 so we can deploy it. Even a return to GCC 4.1
performance levels would be fine. Thanks!
--- Comment #19 from Matt Hargett <matt at use dot net> 2012-08-14 17:25:40 UTC ---
Does this mean there will be a fix for this regression committed for 4.7.2?
If there's a patch I can test ahead of time, please let me know. Thanks!
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What             |Removed |Added
           -----------------------------------
           Target Milestone |---     |4.7.2
--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-06-15 09:03:04 UTC ---
This started with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173856
The current cost model is seriously insufficient.
--- Comment #18 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-15 21:04:49 UTC ---
See comments in http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01081.html
It's not the vectorization costing, as previously suggested.
Richard Henderson <rth at gcc dot gnu.org> changed:

           What       |Removed                       |Added
           ------------------------------------------------------------------
           CC         |                              |rth at gcc dot gnu.org
           AssignedTo |unassigned at gcc dot gnu.org |rth at gcc dot gnu.org

--- Comment #14 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 14:38:43 UTC ---
Mine, at least for a 4.8 solution.
--- Comment #15 from Matt Hargett <matt at use dot net> 2012-06-14 18:01:31 UTC ---
(In reply to comment #14)
> Mine, at least for a 4.8 solution.

What enhancement to 4.7 caused the regression? Can that change be (partially)
reverted to lessen the impact?
Richard Henderson <rth at gcc dot gnu.org> changed:

           What   |Removed |Added
           -------------------------
           Status |NEW     |ASSIGNED

--- Comment #16 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 18:38:30 UTC ---
Dunno exactly. The pre-SSE4.1 emulation of PMULLD has been there since at
least gcc 4.5. What's not present in *any* version so far are proper
rtx_costs for integer vector operations. So any questions the vectorizer
might be asking about which transformations are profitable are currently
being given bogus answers. I'm hoping just that will fix the regression,
though I also plan to address some of the other algorithmic questions raised
in this PR.
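[Editorial note: the "bogus answers" point can be made concrete with a toy model. The numbers and interface below are illustrative assumptions, not GCC's actual rtx_costs hook; they only show why a missing vector-multiply cost skews the vectorizer's profitability question.]

```c
#include <assert.h>

/* Illustrative sketch only: rough instruction counts for a 4 x 32-bit
   vector multiply.  With SSE4.1, pmulld does it in one instruction;
   without it, *sse2_mulv4si3 splits into widening multiplies plus
   lane shuffles.  The counts here are assumptions for illustration.  */

enum isa { ISA_SSE2, ISA_SSE41 };

static int v4si_mult_insn_cost (enum isa isa)
{
  if (isa == ISA_SSE41)
    return 1;   /* single pmulld */
  /* SSE2 fallback: roughly 2 widening multiplies + shifts + shuffles.  */
  return 7;
}

/* The kind of question the vectorizer effectively asks: is one vector
   multiply cheaper than the four scalar imuls it replaces?  A cost model
   that answers "1" for both ISAs gets the SSE2 case badly wrong.  */
static int vector_mult_profitable_p (enum isa isa)
{
  int scalar_cost = 4 * 1;   /* four scalar multiplies, cost ~1 each */
  return v4si_mult_insn_cost (isa) < scalar_cost;
}
```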
--- Comment #13 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-13 09:43:15 UTC ---
(In reply to comment #12)
> (In reply to comment #10)
> > But maybe allowing const_vector in (some of) the define_insn_and_split
> > would be the way to go ...
>
> Maybe. It certainly would ease some of the simplifications. At the moment
> I don't think we can go from mem -> const -> simplify -> const -> new mem.
>
> On the other hand, for this particular test case, where all of the
> vector_cst elements are the same, and a reasonably small number of bits
> set, it would be great to be able to leverage synth_mult.

I agree, though that should possibly be done earlier.

> The main complexity for sse2_mulv4si3 is due to the fact that we have to
> decompose the operation into V8HImode multiplies. Whereas if we decompose
> the multiply, we have the shifts and adds in V4SImode.

Well, for a constant multiplier one can avoid the shuffles of the multiplier
- we seem to use v2si -> v2di multiplies with sse2_mulv4si3.
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What          |Removed                   |Added
           --------------------------------------------------------------------
           Target        |                          |x86_64-*-*
           Status        |WAITING                   |NEW
           Known to work |                          |4.6.3
           Keywords      |                          |missed-optimization
           Component     |middle-end                |rtl-optimization
           CC            |                          |jakub at gcc dot gnu.org,
                         |                          |uros at gcc dot gnu.org
           Summary       |[4.7 regression] loop     |[4.7/4.8 regression]
                         |unrolling as measured by  |vectorization causes loop
                         |Adobe's C++Benchmark is   |unrolling test slowdown as
                         |twice as slow versus      |measured by Adobe's
                         |4.4-4.6                   |C++Benchmark
           Known to fail |                          |4.7.1, 4.8.0
           Severity      |major                     |normal

--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 09:54:02 UTC ---
Ok, it seems to me that this has template-metaprogramming loop unrolling.
With GCC 4.7 we unroll and vectorize all loops; for example, unroll factor 8
looks like

<bb 50>:
  # vect_var_.941_3474 = PHI <vect_var_.941_3472(50), {0, 0, 0, 0}(64)>
  # vect_var_.941_3473 = PHI <vect_var_.941_3471(50), {0, 0, 0, 0}(64)>
  # ivtmp.1325_970 = PHI <ivtmp.1325_812(50), ivtmp.1325_813(64)>
  D.9934_819 = (void *) ivtmp.1325_970;
  vect_var_.918_323 = MEM[base: D.9934_819, offset: 0B];
  vect_var_.919_325 = MEM[base: D.9934_819, offset: 16B];
  vect_var_.920_328 = vect_var_.918_323 + { 12345, 12345, 12345, 12345 };
  vect_var_.920_330 = vect_var_.919_325 + { 12345, 12345, 12345, 12345 };
  vect_var_.923_480 = vect_var_.920_328 * { 914237, 914237, 914237, 914237 };
  vect_var_.923_895 = vect_var_.920_330 * { 914237, 914237, 914237, 914237 };
  vect_var_.926_231 = vect_var_.923_480 + { 12332, 12332, 12332, 12332 };
  vect_var_.926_232 = vect_var_.923_895 + { 12332, 12332, 12332, 12332 };
  vect_var_.929_235 = vect_var_.926_231 * { 914237, 914237, 914237, 914237 };
  vect_var_.929_236 = vect_var_.926_232 * { 914237, 914237, 914237, 914237 };
  vect_var_.932_239 = vect_var_.929_235 + { 12332, 12332, 12332, 12332 };
  vect_var_.932_240 = vect_var_.929_236 + { 12332, 12332, 12332, 12332 };
  vect_var_.935_113 = vect_var_.932_239 * { 914237, 914237, 914237, 914237 };
  vect_var_.935_247 = vect_var_.932_240 * { 914237, 914237, 914237, 914237 };
  vect_var_.938_582 = vect_var_.935_113 + { -13, -13, -13, -13 };
  vect_var_.938_839 = vect_var_.935_247 + { -13, -13, -13, -13 };
  vect_var_.941_3472 = vect_var_.938_582 + vect_var_.941_3474;
  vect_var_.941_3471 = vect_var_.938_839 + vect_var_.941_3473;
  ivtmp.1325_812 = ivtmp.1325_970 + 32;
  if (ivtmp.1325_812 != D.9937_388)
    goto <bb 50>;
  else
    goto <bb 51>;

<bb 51>:
  # vect_var_.941_3468 = PHI <vect_var_.941_3472(50)>
  # vect_var_.941_3467 = PHI <vect_var_.941_3471(50)>
  vect_var_.945_3466 = vect_var_.941_3468 + vect_var_.941_3467;
  vect_var_.946_3465 = vect_var_.945_3466 v>> 64;
  vect_var_.946_3464 = vect_var_.946_3465 + vect_var_.945_3466;
  vect_var_.946_3463 = vect_var_.946_3464 v>> 32;
  vect_var_.946_3462 = vect_var_.946_3463 + vect_var_.946_3464;
  stmp_var_.944_3461 = BIT_FIELD_REF <vect_var_.946_3462, 32, 0>;
  init_value.7_795 = init_value;
  D.8606_796 = (int) init_value.7_795;
  D.8600_797 = D.8606_796 + 12345;
  D.8599_798 = D.8600_797 * 914237;
  D.8602_799 = D.8599_798 + 12332;
  D.8601_800 = D.8602_799 * 914237;
  D.8604_801 = D.8601_800 + 12332;
  D.8603_802 = D.8604_801 * 914237;
  D.8605_803 = D.8603_802 + -13;
  temp_804 = D.8605_803 * 8000;
  if (temp_804 != stmp_var_.944_3461)
    goto <bb 52>;
  else
    goto <bb 53>;

With GCC 4.6, OTOH, the above loop is not vectorized; only the (slow)
not-unrolled loop is:

<bb 49>:
  # result_622 = PHI <result_704(49), 0(63)>
  # ivtmp.852_1026 = PHI <ivtmp.852_842(49), ivtmp.852_844(63)>
  D.9283_3302 = (void *) ivtmp.852_1026;
  temp_801 = MEM[base: D.9283_3302, offset: 0B];
  D.8366_802 = temp_801 + 12345;
  D.8365_803 = D.8366_802 * 914237;
  D.8368_804 = D.8365_803 + 12332;
  D.8367_805 = D.8368_804 * 914237;
  D.8370_806 = D.8367_805 + 12332;
  D.8369_807 = D.8370_806 * 914237;
  temp_808 = D.8369_807 + -13;
  result_810 = temp_808 + result_622;
  temp_815 = MEM[base: D.9283_3302, offset: 4B];
  D.8381_816 = temp_815 + 12345;
  D.8382_817 = D.8381_816 * 914237;
  D.8378_818 = D.8382_817 + 12332;
  D.8379_819 = D.8378_818 * 914237;
  D.8376_820 = D.8379_819 + 12332;
  D.8377_821 = D.8376_820 * 914237;
  temp_822 = D.8377_821 + -13;
  result_824 = result_810 + temp_822;
  temp_788 = MEM[base: D.9283_3302, offset: 8B];
  D.8351_789 = temp_788 + 12345;
  D.8352_790 = D.8351_789 * 914237;
  D.8348_791 =
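[Editorial note: the per-element computation visible in both dumps corresponds to a scalar kernel along the following lines. This is a reconstruction from the dumps; the names are invented, and uint32_t is used so the wrapping multiplies stay well defined in C.]

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Reconstruction of the benchmark kernel seen in the dumps: each
   element goes through the +12345 / *914237 / +12332 chain, then -13,
   and everything is summed.  Unsigned arithmetic reproduces the
   wrapping behavior of the dump's int math.  */
static uint32_t kernel_one (uint32_t t)
{
  t = (t + 12345u) * 914237u;
  t = (t + 12332u) * 914237u;
  t = (t + 12332u) * 914237u;
  return t - 13u;
}

static uint32_t kernel_sum (const uint32_t *a, size_t n)
{
  uint32_t result = 0;
  for (size_t i = 0; i < n; ++i)
    result += kernel_one (a[i]);
  return result;
}
```

This is also the structure the checking code after the loop exploits: for an array holding one repeated value, the sum is just n times the per-element result (the dump's `D.8605_803 * 8000`).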
--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---
Btw, when I run the benchmark with the addition of -march=native (for me,
that's -march=corei7), then GCC 4.7 performs better than 4.6:

4.6: ./t 10

test    description                  absolute   operations   ratio with
number                               time       per second   test0

  0  "int32_t for loop unroll 1"     0.41 sec   1951.22 M      1.00
  1  "int32_t for loop unroll 2"     0.51 sec   1568.63 M      1.24
  2  "int32_t for loop unroll 3"     0.47 sec   1702.13 M      1.15
  3  "int32_t for loop unroll 4"     0.48 sec   1666.67 M      1.17
  4  "int32_t for loop unroll 5"     0.47 sec   1702.13 M      1.15
  5  "int32_t for loop unroll 6"     0.51 sec   1568.63 M      1.24
  6  "int32_t for loop unroll 7"     0.47 sec   1702.13 M      1.15
  7  "int32_t for loop unroll 8"     0.47 sec   1702.13 M      1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec

4.7: ./t 10

test    description                  absolute   operations   ratio with
number                               time       per second   test0

  0  "int32_t for loop unroll 1"     0.39 sec   2051.28 M      1.00
  1  "int32_t for loop unroll 2"     0.40 sec   2000.00 M      1.03
  2  "int32_t for loop unroll 3"     0.39 sec   2051.28 M      1.00
  3  "int32_t for loop unroll 4"     0.39 sec   2051.28 M      1.00
  4  "int32_t for loop unroll 5"     0.38 sec   2105.26 M      0.97
  5  "int32_t for loop unroll 6"     0.41 sec   1951.22 M      1.05
  6  "int32_t for loop unroll 7"     0.37 sec   2162.16 M      0.95
  7  "int32_t for loop unroll 8"     0.36 sec   2222.22 M      0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec

The loop then looks like (the expected)

.L53:
        movdqa  (%rax), %xmm4
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        cmpq    $data32+32000, %rax
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm5
        jne     .L53

Looks like pmulld is only available with SSE 4.1, and otherwise we fall back
to the define_insn_and_split *sse2_mulv4si3. But that complexity is not
reflected in the vectorizer cost model (which needs improvement ...).
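[Editorial note: what the *sse2_mulv4si3 splitter has to do can be sketched as a behavioral model in portable C. pmuludq only multiplies the even 32-bit lanes (32x32 -> 64), so the odd lanes need a separate multiply and the low halves of all four products must be shuffled back together; the extra shifts and shuffles are exactly the cost the model ignores. This is a lane-semantics model, not the actual machine-description pattern.]

```c
#include <stdint.h>
#include <assert.h>

/* Behavioral model of the SSE2 fallback for a V4SI multiply.
   In the real sequence: pmuludq handles lanes 0 and 2, a psrlq/pshufd
   exposes lanes 1 and 3 for a second pmuludq, and a final shuffle
   merges the low 32 bits of all four 64-bit products.  */
static void mulv4si_sse2_model (uint32_t dst[4],
                                const uint32_t x[4],
                                const uint32_t y[4])
{
  /* "pmuludq" on even lanes (0 and 2): widening 32x32 -> 64 multiply.  */
  uint64_t e0 = (uint64_t) x[0] * y[0];
  uint64_t e2 = (uint64_t) x[2] * y[2];
  /* shift odd lanes down, then "pmuludq" again (lanes 1 and 3).  */
  uint64_t o1 = (uint64_t) x[1] * y[1];
  uint64_t o3 = (uint64_t) x[3] * y[3];
  /* "pshufd" merge: keep only the low 32 bits of each product.  */
  dst[0] = (uint32_t) e0;
  dst[1] = (uint32_t) o1;
  dst[2] = (uint32_t) e2;
  dst[3] = (uint32_t) o3;
}
```

On SSE4.1 hardware the whole function above collapses to a single pmulld, which is why -march=corei7 flips the benchmark result.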
--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:27:15 UTC ---
Small testcase:

int a[256];
int b[256];
void foo (void)
{
  int i;
  for (i = 0; i < 256; ++i)
    b[i] = a[i] * 23;
}

You can see that we shuffle even the vector with constants around! Not
taking into account the REG_EQUAL note, which is gone at split1 time,
removed by either loop2_invariant or loop2_unswitch:

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 84)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)
--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:39:19 UTC ---
And cprop fails to propagate

(reg:V4SI 85) := (const_vector:V4SI [
        (const_int 23 [0x17])
        (const_int 23 [0x17])
        (const_int 23 [0x17])
        (const_int 23 [0x17])
    ])

but it at least re-adds the REG_EQUAL note, which DSE then drops again. From

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 85)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)

we go to

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))

Unfortunately there is no cprop pass after split1 to eventually clean things
up again (because of out-of-cfg-layout-mode ...). If I force it to run, it
cannot simplify

(insn 42 24 43 3 (set (subreg:V2DI (reg:V4SI 86) 0)
        (mult:V2DI (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))
            (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 85)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))) t.c:9 -1
     (nil))

either, though.
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What |Removed |Added
           --------------------------------------
           CC   |        |stevenb.gcc at gmail dot com

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 11:57:20 UTC ---
Changing the insn_and_split to

(define_insn_and_split "*sse2_mulv4si3"
  [(set (match_operand:V4SI 0 "register_operand")
        (mult:V4SI (match_operand:V4SI 1 "register_operand")
                   (match_operand:V4SI 2 "nonmemory_vector_operand")))]
  ...

and defining

(define_predicate "nonmemory_vector_operand"
  (ior (match_operand 0 "register_operand")
       (match_code "const_vector")))

we ICE, because when splitting

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))

we don't even try to simplify when emitting the code. But maybe allowing
const_vector in (some of) the define_insn_and_split would be the way to go ...
--- Comment #11 from Matt Hargett <matt at use dot net> 2012-06-12 18:25:25 UTC ---
Richard,

Thanks for the quick analysis! Sounds like a perfect storm of sorts :/

re: cprop failure: this may be indicated by another major regression in their
suite, on the simple constant folding tests. In GCC 4.1-4.6 those tests all
take 0.0s, but in 4.7 they take tens of seconds. Let me know if you want me
to file a separate bug/reduced test case for that, and then have that new bug
depend on this one. Otherwise, I'll wait until this one sees some resolution
and then retest.

re: multiple passes: if you think that feature has enough merit to be
revisited now, I can look into re-proposing Maxim's patches from
October/November 2011 that integrated your feedback at the time.

re: -march workaround: our deployment platform's minimum arch is nocona, and
enabling -march=nocona doesn't work around the issue. For grins, I tried
-march=amdfam10 (another deployment target, but one that would require a
separate distributable binary), but that also didn't work around the issue.

I see a small improvement when using -fno-tree-vectorize, but not nearly as
dramatic as yours. For the int32_t for and while loop unrolling, the times go
from ~107s and ~105s to ~96s and ~95s, respectively. The do and goto loop
unrolling times get slightly worse (~2%), but it might be noise.

Let me know if there's any additional testing/footwork you'd like me to do.
Again, thanks for the quick turnaround on such a deep analysis!
--- Comment #12 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-12 18:54:24 UTC ---
(In reply to comment #10)
> But maybe allowing const_vector in (some of) the define_insn_and_split
> would be the way to go ...

Maybe. It certainly would ease some of the simplifications. At the moment I
don't think we can go from mem -> const -> simplify -> const -> new mem.

On the other hand, for this particular test case, where all of the vector_cst
elements are the same, and a reasonably small number of bits set, it would be
great to be able to leverage synth_mult.

The main complexity for sse2_mulv4si3 is due to the fact that we have to
decompose the operation into V8HImode multiplies. Whereas if we decompose the
multiply, we have the shifts and adds in V4SImode.
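[Editorial note: the synth_mult idea, applied to the constant 23 from the reduced testcase: since 23 = 16 + 4 + 2 + 1, the multiply decomposes into shifts and adds, operations that stay element-wise in V4SImode. This is a scalar illustration of the decomposition, not GCC's synth_mult code, which searches for the cheapest such sequence (including subtractions, e.g. via 24 - 1).]

```c
#include <stdint.h>
#include <assert.h>

/* Shift-and-add decomposition of x * 23, the kind of sequence
   synth_mult could emit: 23 = 0b10111 = 16 + 4 + 2 + 1, so four
   cheap operations replace one expensive multiply.  */
static uint32_t mul23 (uint32_t x)
{
  return (x << 4) + (x << 2) + (x << 1) + x;
}
```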