https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122234
--- Comment #2 from Manuel López-Ibáñez <manu at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> We do not implement this kind of prologue peeling on GIMPLE (only full loop
> peeling). I'm also not sure if doing this would be profitable on modern
> uarchs.

AVX should be able to subtract 4 doubles at a time with one instruction and
multiply the 4 differences with 2 instructions. AVX2 should be able to do 8
doubles at a time. Is that slower than the scalar loop?
