https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348
--- Comment #2 from Jim Wilson <wilson at gcc dot gnu.org> ---
The testcase doesn't produce runnable code, and I'm not sure whether I have access to any Haswell parts, but I can make a few comments. The testcase requires -O3 -ftree-loop-distribution to reproduce.

Without my patch, the loop-distribution pass thinks there is a bi-directional (backward and forward) dependence between the first and second lines of the loop, which prevents the optimization. With my patch, the pass correctly computes that there is only the forward anti-dependence, which allows the optimization. Without the optimization, the inner loop is fully unrolled and vectorized using 128-bit vectors of two doubles. With the optimization, we get calls to the memmove and memset builtins.

There is the problem Richard Biener already mentioned in my patch review: the unoptimized code has 2 memory streams, but the optimized code has 3. This might account for some of the performance loss.

There is another problem here with load/store sizes. The memmove builtin does not get expanded inline, so we end up in libc, which appears to use 128-bit loads and stores. The memset, however, is expanded inline, and we get only 64-bit stores. The extra stores needed here may account for some of the performance loss.

For short-term solutions, we could look at adding a heuristic that tries to determine whether code is stream-limited, and prevent optimizations that would increase the number of streams in that case. Maybe something like PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO used in the prefetch code. Another short-term solution is to get the memset expander to use 128-bit stores.

Long term, loop distribution should only be performed when it enables some other optimization, such as vectorization, which suggests that loop distribution should be a library called by other passes instead of its own pass.
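The PR's actual testcase isn't reproduced here, but a loop of roughly this shape illustrates the kind of distribution being discussed (the function name and arrays are hypothetical):

```c
/* Hedged sketch, not the PR's testcase: with -O3 -ftree-loop-distribution,
   a loop of this shape can be split into a memmove (GCC cannot prove that
   a and b do not overlap) followed by a memset.  */
void
f (double *a, double *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = b[i];   /* reads b[i] ...                                   */
      b[i] = 0.0;    /* ... before this overwrites it: only a forward
                        anti-dependence, so distribution is legal.       */
    }
}
```

After distribution the loop becomes, roughly, memmove (a, b, n * sizeof (double)) followed by memset (b, 0, n * sizeof (double)); the memset revisits b as a separate pass, which is where the extra memory stream comes from.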