[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-11-29 Thread rth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533



Richard Henderson rth at gcc dot gnu.org changed:



   What|Removed |Added



 Status|ASSIGNED|NEW

 AssignedTo|rth at gcc dot gnu.org  |unassigned at gcc dot

   ||gnu.org



--- Comment #22 from Richard Henderson rth at gcc dot gnu.org 2012-11-29 
21:17:05 UTC ---

Needs long-term work in pre-vectorization folding.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-09-20 Thread jakub at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533



Jakub Jelinek jakub at gcc dot gnu.org changed:



   What|Removed |Added



   Target Milestone|4.7.2   |4.7.3



--- Comment #21 from Jakub Jelinek jakub at gcc dot gnu.org 2012-09-20 
10:21:07 UTC ---

GCC 4.7.2 has been released.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-08-20 Thread matt at use dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #20 from Matt Hargett matt at use dot net 2012-08-20 23:52:31 UTC 
---
Some additional information:
Compared to LLVM 3.1 with -O3, GCC 4.7 is twice as slow on these benchmarks.
LLVM even outperforms GCC 4.1, which previously had the best result. We are
very eager to hear about any resolution for this major regression in 4.7 so we
can deploy it. Even a return to GCC 4.1 performance levels would be fine.

Thanks!


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-08-14 Thread matt at use dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #19 from Matt Hargett matt at use dot net 2012-08-14 17:25:40 UTC 
---
Does this mean there will be a fix for this regression committed for 4.7.2? If
there's a patch I can test ahead of time, please let me know. Thanks!


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-08-10 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |4.7.2


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-15 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #17 from Jakub Jelinek jakub at gcc dot gnu.org 2012-06-15 
09:03:04 UTC ---
This started with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173856
The current cost model is seriously insufficient.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-15 Thread rth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #18 from Richard Henderson rth at gcc dot gnu.org 2012-06-15 
21:04:49 UTC ---
See comments in http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01081.html

It's not the vectorization costing, as previously suggested.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-14 Thread rth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson rth at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rth at gcc dot gnu.org
 AssignedTo|unassigned at gcc dot   |rth at gcc dot gnu.org
   |gnu.org |

--- Comment #14 from Richard Henderson rth at gcc dot gnu.org 2012-06-14 
14:38:43 UTC ---
Mine, at least for a 4.8 solution.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-14 Thread matt at use dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #15 from Matt Hargett matt at use dot net 2012-06-14 18:01:31 UTC 
---
(In reply to comment #14)
> Mine, at least for a 4.8 solution.

What enhancement to 4.7 caused the regression? Can whatever the change was be
(partially) reverted to lessen the impact?


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-14 Thread rth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson rth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #16 from Richard Henderson rth at gcc dot gnu.org 2012-06-14 
18:38:30 UTC ---
Dunno exactly.  The pre-SSE4.1 emulation of PMULLD has been there since
at least gcc 4.5.

What's not present in *any* version so far are some proper rtx_costs for
integer vector operations.  So any questions the vectorizer might be
asking about what transformations are profitable are currently being
given bogus answers.

I'm hoping just that will fix the regression, though I also plan to
address some of the other algorithmic questions raised in this PR.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-13 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #13 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-13 
09:43:15 UTC ---
(In reply to comment #12)
> (In reply to comment #10)
> > But maybe allowing const_vector in (some of) the define_insn_and_split would
> > be the way to go ...
> 
> Maybe.  It certainly would ease some of the simplifications.
> At the moment I don't think we can go from
> 
>   mem -> const -> simplify -> const -> newmem
> 
> On the other hand, for this particular test case, where all
> of the vector_cst elements are the same, and a reasonably
> small number of bits set, it would be great to be able to
> leverage synth_mult.

I agree, though that should possibly be done earlier.

> The main complexity for sse2_mulv4si3 is due to the fact that
> we have to decompose the operation into V8HImode multiplies.
> Whereas if we decompose the multiply, we have the shifts and
> adds in V4SImode.

Well, for a constant multiplier one can avoid the shuffles of the
multiplier - we seem to use v2si -> v2di multiplies with sse2_mulv4si3.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Target||x86_64-*-*
 Status|WAITING |NEW
  Known to work||4.6.3
   Keywords||missed-optimization
  Component|middle-end  |rtl-optimization
 CC||jakub at gcc dot gnu.org,
   ||uros at gcc dot gnu.org
Summary|[4.7 regression] loop   |[4.7/4.8 regression]
   |unrolling as measured by|vectorization causes loop
   |Adobe's C++Benchmark is |unrolling test slowdown as
   |twice as slow versus|measured by Adobe's
   |4.4-4.6 |C++Benchmark
  Known to fail||4.7.1, 4.8.0
   Severity|major   |normal

--- Comment #6 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-12 
09:54:02 UTC ---
Ok, it seems to me that this benchmark does its loop unrolling via template
metaprogramming.  With GCC 4.7 we unroll and vectorize all loops; for example,
unroll factor 8 looks like

<bb 50>:
  # vect_var_.941_3474 = PHI <vect_var_.941_3472(50), {0, 0, 0, 0}(64)>
  # vect_var_.941_3473 = PHI <vect_var_.941_3471(50), {0, 0, 0, 0}(64)>
  # ivtmp.1325_970 = PHI <ivtmp.1325_812(50), ivtmp.1325_813(64)>
  D.9934_819 = (void *) ivtmp.1325_970;
  vect_var_.918_323 = MEM[base: D.9934_819, offset: 0B];
  vect_var_.919_325 = MEM[base: D.9934_819, offset: 16B];
  vect_var_.920_328 = vect_var_.918_323 + { 12345, 12345, 12345, 12345 };
  vect_var_.920_330 = vect_var_.919_325 + { 12345, 12345, 12345, 12345 };
  vect_var_.923_480 = vect_var_.920_328 * { 914237, 914237, 914237, 914237 };
  vect_var_.923_895 = vect_var_.920_330 * { 914237, 914237, 914237, 914237 };
  vect_var_.926_231 = vect_var_.923_480 + { 12332, 12332, 12332, 12332 };
  vect_var_.926_232 = vect_var_.923_895 + { 12332, 12332, 12332, 12332 };
  vect_var_.929_235 = vect_var_.926_231 * { 914237, 914237, 914237, 914237 };
  vect_var_.929_236 = vect_var_.926_232 * { 914237, 914237, 914237, 914237 };
  vect_var_.932_239 = vect_var_.929_235 + { 12332, 12332, 12332, 12332 };
  vect_var_.932_240 = vect_var_.929_236 + { 12332, 12332, 12332, 12332 };
  vect_var_.935_113 = vect_var_.932_239 * { 914237, 914237, 914237, 914237 };
  vect_var_.935_247 = vect_var_.932_240 * { 914237, 914237, 914237, 914237 };
  vect_var_.938_582 = vect_var_.935_113 + { -13, -13, -13, -13 };
  vect_var_.938_839 = vect_var_.935_247 + { -13, -13, -13, -13 };
  vect_var_.941_3472 = vect_var_.938_582 + vect_var_.941_3474;
  vect_var_.941_3471 = vect_var_.938_839 + vect_var_.941_3473;
  ivtmp.1325_812 = ivtmp.1325_970 + 32;
  if (ivtmp.1325_812 != D.9937_388)
    goto <bb 50>;
  else
    goto <bb 51>;

<bb 51>:
  # vect_var_.941_3468 = PHI <vect_var_.941_3472(50)>
  # vect_var_.941_3467 = PHI <vect_var_.941_3471(50)>
  vect_var_.945_3466 = vect_var_.941_3468 + vect_var_.941_3467;
  vect_var_.946_3465 = vect_var_.945_3466 v>> 64;
  vect_var_.946_3464 = vect_var_.946_3465 + vect_var_.945_3466;
  vect_var_.946_3463 = vect_var_.946_3464 v>> 32;
  vect_var_.946_3462 = vect_var_.946_3463 + vect_var_.946_3464;
  stmp_var_.944_3461 = BIT_FIELD_REF <vect_var_.946_3462, 32, 0>;
  init_value.7_795 = init_value;
  D.8606_796 = (int) init_value.7_795;
  D.8600_797 = D.8606_796 + 12345;
  D.8599_798 = D.8600_797 * 914237;
  D.8602_799 = D.8599_798 + 12332;
  D.8601_800 = D.8602_799 * 914237;
  D.8604_801 = D.8601_800 + 12332;
  D.8603_802 = D.8604_801 * 914237;
  D.8605_803 = D.8603_802 + -13;
  temp_804 = D.8605_803 * 8000;
  if (temp_804 != stmp_var_.944_3461)
    goto <bb 52>;
  else
    goto <bb 53>;


With GCC 4.6, OTOH, the above loop is not vectorized; only the (slow)
non-unrolled loop is.

<bb 49>:
  # result_622 = PHI <result_704(49), 0(63)>
  # ivtmp.852_1026 = PHI <ivtmp.852_842(49), ivtmp.852_844(63)>
  D.9283_3302 = (void *) ivtmp.852_1026;
  temp_801 = MEM[base: D.9283_3302, offset: 0B];
  D.8366_802 = temp_801 + 12345;
  D.8365_803 = D.8366_802 * 914237;
  D.8368_804 = D.8365_803 + 12332;
  D.8367_805 = D.8368_804 * 914237;
  D.8370_806 = D.8367_805 + 12332;
  D.8369_807 = D.8370_806 * 914237;
  temp_808 = D.8369_807 + -13;
  result_810 = temp_808 + result_622;
  temp_815 = MEM[base: D.9283_3302, offset: 4B];
  D.8381_816 = temp_815 + 12345;
  D.8382_817 = D.8381_816 * 914237;
  D.8378_818 = D.8382_817 + 12332;
  D.8379_819 = D.8378_818 * 914237;
  D.8376_820 = D.8379_819 + 12332;
  D.8377_821 = D.8376_820 * 914237;
  temp_822 = D.8377_821 + -13;
  result_824 = result_810 + temp_822;
  temp_788 = MEM[base: D.9283_3302, offset: 8B];
  D.8351_789 = temp_788 + 12345;
  D.8352_790 = D.8351_789 * 914237;
  D.8348_791 = 

[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #7 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-12 
10:11:51 UTC ---
Btw, when I run the benchmark with -march=native added (for me, that's
-march=corei7), GCC 4.7 performs better than 4.6:

4.6:

./t 10 

test   description   absolute   operations   ratio with
number   time   per second   test0

 0 int32_t for loop unroll 1   0.41 sec   1951.22 M 1.00
 1 int32_t for loop unroll 2   0.51 sec   1568.63 M 1.24
 2 int32_t for loop unroll 3   0.47 sec   1702.13 M 1.15
 3 int32_t for loop unroll 4   0.48 sec   1666.67 M 1.17
 4 int32_t for loop unroll 5   0.47 sec   1702.13 M 1.15
 5 int32_t for loop unroll 6   0.51 sec   1568.63 M 1.24
 6 int32_t for loop unroll 7   0.47 sec   1702.13 M 1.15
 7 int32_t for loop unroll 8   0.47 sec   1702.13 M 1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec

4.7:

./t 10 

test   description   absolute   operations   ratio with
number   time   per second   test0

 0 int32_t for loop unroll 1   0.39 sec   2051.28 M 1.00
 1 int32_t for loop unroll 2   0.40 sec   2000.00 M 1.03
 2 int32_t for loop unroll 3   0.39 sec   2051.28 M 1.00
 3 int32_t for loop unroll 4   0.39 sec   2051.28 M 1.00
 4 int32_t for loop unroll 5   0.38 sec   2105.26 M 0.97
 5 int32_t for loop unroll 6   0.41 sec   1951.22 M 1.05
 6 int32_t for loop unroll 7   0.37 sec   2162.16 M 0.95
 7 int32_t for loop unroll 8   0.36 sec   2222.22 M 0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec

The loop then looks like (the expected)

.L53:
movdqa  (%rax), %xmm4
paddd   %xmm3, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm1, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm1, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm2, %xmm4
paddd   %xmm4, %xmm6
movdqa  16(%rax), %xmm4
addq$32, %rax
cmpq$data32+32000, %rax
paddd   %xmm3, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm1, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm1, %xmm4
pmulld  %xmm0, %xmm4
paddd   %xmm2, %xmm4
paddd   %xmm4, %xmm5
jne .L53

Looks like pmulld is only available with SSE4.1; otherwise we fall back
to the define_insn_and_split *sse2_mulv4si3.  But that complexity is not
reflected in the vectorizer cost model (which needs improvement ...).


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #8 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-12 
10:27:15 UTC ---
Small testcase:

int a[256];
int b[256];

void foo (void)
{
  int i;
  for (i = 0; i < 256; ++i)
{
  b[i] = a[i] * 23;
}
}

you can see that we shuffle even the vector of constants around!  This does
not take into account the REG_EQUAL note, which is gone by split1 time,
removed by either loop2_invariant or loop2_unswitch.

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
(mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B]
])
(reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
 (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index:
ivtmp.20_9, offset: 0B] ])
(const_vector:V4SI [
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
]))
(expr_list:REG_DEAD (reg:V4SI 84)
(expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index:
ivtmp.20_9, offset: 0B] ])
(nil)))))


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #9 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-12 
10:39:19 UTC ---
And cprop fails to propagate

  (reg:V4SI 85) := (const_vector:V4SI [
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
])

but it at least re-adds the REG_EQUAL note; DSE, however, drops it again.  From

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
(mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B]
])
(reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
 (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index:
ivtmp.20_9, offset: 0B] ])
(const_vector:V4SI [
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
]))
(expr_list:REG_DEAD (reg:V4SI 85)
(expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index:
ivtmp.20_9, offset: 0B] ])
(nil)))))


we go to

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
(mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B]
])
(reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
 (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9,
offset: 0B] ])
(nil)))

Unfortunately there is no cprop pass after split1 to eventually clean things
up again (because of out-of-cfg-layout-mode ...).  If I force it to run
it cannot simplify

(insn 42 24 43 3 (set (subreg:V2DI (reg:V4SI 86) 0)
(mult:V2DI (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 83 [
MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
])))
(zero_extend:V2DI (vec_select:V2SI (reg:V4SI 85)
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
]))))) t.c:9 -1
 (nil))

either, though.


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 CC||stevenb.gcc at gmail dot
   ||com

--- Comment #10 from Richard Guenther rguenth at gcc dot gnu.org 2012-06-12 
11:57:20 UTC ---
Changing the insn_and_split to

(define_insn_and_split "*sse2_mulv4si3"
  [(set (match_operand:V4SI 0 "register_operand")
        (mult:V4SI (match_operand:V4SI 1 "register_operand")
                   (match_operand:V4SI 2 "nonmemory_vector_operand")))]
...

and defining

(define_predicate "nonmemory_vector_operand"
  (ior (match_operand 0 "register_operand")
       (match_code "const_vector")))

we ICE because when splitting

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
(mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B]
])
(const_vector:V4SI [
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
]))) t.c:9 1496 {*sse2_mulv4si3}
 (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9,
offset: 0B] ])
(nil)))

we don't even try to simplify when emitting the code.

But maybe allowing const_vector in (some of) the define_insn_and_split would
be the way to go ...


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread matt at use dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #11 from Matt Hargett matt at use dot net 2012-06-12 18:25:25 UTC 
---
Richard,

Thanks for the quick analysis! Sounds like a perfect storm of sorts :/

re: cprop failure: this may be indicated by another major regression in their
suite, in the simple constant-folding tests. In GCC 4.1-4.6, those tests all
take 0.0s, but in 4.7 they take tens of seconds. Let me know if you want me to
file a separate bug/reduced test case for that, and then have that new bug
depend on this one. Otherwise, I'll wait until this one sees some resolution
and then retest.

re: multiple passes: if you think that feature has enough merit to be revisited
now, I can look into re-proposing Maxim's patches from October/November 2011
that integrated your feedback at the time.

re: -march workaround: our deployment platform's minimum arch is nocona, and
enabling -march=nocona doesn't work around the issue. For grins, I tried
-march=amdfam10 (another deployment target, though it would require a separate
distributable binary), but that didn't work around the issue either.

I see a small improvement when using -fno-tree-vectorize, but not nearly as
dramatic as yours. For the int32_t for and while loop unrolling, the times go
from ~107s and ~105s to ~96s and ~95s, respectively. The do and goto loop
unrolling times get slightly worse (~2%), but it might be noise.

Let me know if there's any additional testing/footwork you'd like me to do.
Again, thanks for the quick turnaround on such a deep analysis!


[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

2012-06-12 Thread rth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #12 from Richard Henderson rth at gcc dot gnu.org 2012-06-12 
18:54:24 UTC ---
(In reply to comment #10)
> But maybe allowing const_vector in (some of) the define_insn_and_split would
> be the way to go ...

Maybe.  It certainly would ease some of the simplifications.
At the moment I don't think we can go from

  mem -> const -> simplify -> const -> newmem

On the other hand, for this particular test case, where all
of the vector_cst elements are the same, and a reasonably
small number of bits set, it would be great to be able to
leverage synth_mult.

The main complexity for sse2_mulv4si3 is due to the fact that
we have to decompose the operation into V8HImode multiplies.
Whereas if we decompose the multiply, we have the shifts and
adds in V4SImode.