https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933
--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #1)
> Is that actually faster though? The original has shorter dependency
> chains. Or is this to avoid some LHS/SHL?

Yes, I tested it with one constructed case, the original version takes
18.20s while the optimized version takes 8.40s.

And yes, I guess it's due to LHS/SHL similar to the vec_insert issue
xionghu is working on.