https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951
--- Comment #11 from Jim Wilson <wilson at gcc dot gnu.org> ---
I've spent some time looking at solutions to this problem.

One way to solve it is to simply add a mulv2di3 pattern to the aarch64
port.  The presence of a multiply pattern means we will go through
expand_mult, and we get the shift/add sequence generation automatically.
The x86 port has a few patterns like this.  mulv2di3 can be implemented
with 3 v4si multiplies and 3 zip instructions, plus 2 adds that fold
into multiply-add, and a shift.  (A sketch of the per-lane arithmetic is
at the end of this comment.)

The downside of this solution is that if we aren't multiplying by a
constant, then we get this long mulv2di3 sequence.  On an APM/Mustang,
for a loop that only does a multiply, this sequence is much slower than
2 integer DImode multiplies, so it is better not to vectorize in this
case.  This is probably not a win in general.

Another way to solve it is to use the existing synth_mult code in
expmed.c.  We can easily share the code that generates the algorithm in
choose_mult_variant, but expand_mult_const is very rtl specific, so I
copied that part with a lot of modification to generate gimple.  Again,
testing on APM/Mustang, for a loop that only does a multiply, I found
that a 2-instruction shift/add sequence is a win, but a 3-instruction
shift/add sequence is a loss (illustrative sequences of each length are
at the end of this comment).  Since we already handle the 1-instruction
case trivially, this appears to be a lot of work for not much gain.
This is probably a better solution than the above one if the amount of
new code is OK.  This patch needs a bit more work to finish it, and will
likely need aarch64 rtx costs adjusted so that we get the best result
for all targets.  There may be wins in cases where a loop does more than
a simple multiply, and is not vectorized only because of the multiply.
It isn't clear how to quantify that.

For the original testcase, the constant 19594 requires 9 synth_mult
operations.  We can do that with 9 vector instructions, or 5 integer
instructions:

        add     x0, x1, x1, lsl 3
        add     x6, x0, x0, lsl 4
        add     x7, x1, x6, lsl 4
        add     x8, x1, x7, lsl 2
        lsl     x9, x8, 1

Since 5 integer instructions are likely more than twice as fast as 9
vector operations on all aarch64 parts, we still won't vectorize this
loop even if we have synth_mult support in the vectorizer.  We still get
an integer multiply instruction, as that is faster than the 5 integer
shift/add instructions, which in turn are faster than the 9 vector
shift/add instructions and the vector multiply via 3 v4si multiplies.

I've attached work-in-progress patches for the two solutions to the PR,
along with the testcases I'm using to verify the patches.
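A sketch of the per-lane arithmetic behind the mulv2di3 decomposition
described above (my illustration, not code from the patch): the lo*lo
product is a full widening multiply, the two cross products only matter
in their low 32 bits, and the hi*hi product never reaches the low 64
bits.  The zip instructions in the vector version just shuffle the
32-bit halves into position for the v4si multiplies.

#include <stdint.h>

/* One 64x64->64 lane: 3 32-bit multiplies, adds that can fold into
   multiply-add, and a shift.  */
uint64_t mul64_lane (uint64_t a, uint64_t b)
{
  uint32_t a_lo = (uint32_t) a, a_hi = (uint32_t) (a >> 32);
  uint32_t b_lo = (uint32_t) b, b_hi = (uint32_t) (b >> 32);

  uint64_t lo = (uint64_t) a_lo * b_lo;         /* widening multiply */
  uint32_t cross = a_lo * b_hi + a_hi * b_lo;   /* two more multiplies */

  return lo + ((uint64_t) cross << 32);         /* shift and add */
}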
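To make the 2-instruction win / 3-instruction loss boundary concrete,
here are hypothetical examples (the constants 10 and 22 are mine, not
taken from the measurements above); each shifted-operand add is a
single aarch64 instruction:

#include <stdint.h>

/* 2 instructions: add x0, x0, x0, lsl 2; lsl x0, x0, 1.  A win.  */
uint64_t mul10 (uint64_t x)
{
  uint64_t t = x + (x << 2);  /* 5*x */
  return t << 1;              /* 10*x */
}

/* 3 instructions: two adds and a shift.  A loss versus a multiply.  */
uint64_t mul22 (uint64_t x)
{
  uint64_t t = x + (x << 2);  /* 5*x */
  t = x + (t << 1);           /* 11*x */
  return t << 1;              /* 22*x */
}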
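The 5-instruction sequence above transcribes directly into C, which
makes it easy to check that it really computes 19594*x1 (the multiple
reached at each step is my annotation):

#include <assert.h>
#include <stdint.h>

uint64_t mul19594 (uint64_t x1)
{
  uint64_t x0 = x1 + (x1 << 3);  /* add x0, x1, x1, lsl 3 ->    9*x1 */
  uint64_t x6 = x0 + (x0 << 4);  /* add x6, x0, x0, lsl 4 ->  153*x1 */
  uint64_t x7 = x1 + (x6 << 4);  /* add x7, x1, x6, lsl 4 -> 2449*x1 */
  uint64_t x8 = x1 + (x7 << 2);  /* add x8, x1, x7, lsl 2 -> 9797*x1 */
  return x8 << 1;                /* lsl x9, x8, 1        -> 19594*x1 */
}

int main (void)
{
  assert (mul19594 (7) == 19594u * 7);
  return 0;
}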