https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111648
--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 56037
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56037&action=edit
Untested fix

The issue is that when a1 is a multiple of the vector length, we end up
creating the following encoding in the result:
{ base_elem, arg[0], arg[1], ... }
where arg is the chosen input vector, which is incorrect.

For the above case, the vectorizer pass creates
VEC_PERM_EXPR<arg0, arg1, sel> where:
arg0: { -16, -9, -10, -11 }
arg1: { -12, -5, -6, -7 }
sel = { 3, 4, 5, 6 }

arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
Since a1 = 4 and arg_len = 4, it ended up creating the result with the
following encoding:
res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
    = { -11, -12, -5 }

So for res[3], it used S = (-5) - (-12) = 7 and hence computed it as
-5 + 7 = 2 instead of arg1[2], i.e. -6, which is the difference we see in
the output at -O0 vs -O2.

The patch tweaks the constraints in valid_mask_for_fold_vec_perm_cst_p to
punt if a1 is a multiple of the vector length, so that a1 ... ae only
selects from the stepped part of the input vector, which seems to fix this
issue.

I will run a proper bootstrap+test and post it upstream.
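
The arithmetic above can be reproduced outside the compiler with a rough standalone sketch. This is not GCC code: the helper names (permute, expand_stepped, run_example) are hypothetical, and the encoding is modelled only for the npatterns = 1, nelts_per_pattern = 3 case described in this comment. It compares the element-wise permute with the vector obtained by keeping just the first three selected elements and extrapolating the fourth from their step.

```c
#include <assert.h>

#define NELTS 4

/* Element-wise permute: an index below NELTS selects from arg0,
   anything else selects from arg1.  */
void permute(const int arg0[NELTS], const int arg1[NELTS],
             const int sel[NELTS], int out[NELTS])
{
    for (int i = 0; i < NELTS; i++) {
        assert(sel[i] >= 0 && sel[i] < 2 * NELTS);
        out[i] = sel[i] < NELTS ? arg0[sel[i]] : arg1[sel[i] - NELTS];
    }
}

/* Expand a stepped encoding with a single pattern of three elements:
   elements 0 and 1 are taken as-is, and every later element continues
   from enc[2] with the step S = enc[2] - enc[1].  */
void expand_stepped(const int enc[3], int out[NELTS])
{
    int step = enc[2] - enc[1];
    for (int i = 0; i < NELTS; i++)
        out[i] = i < 2 ? enc[i] : enc[2] + (i - 2) * step;
}

/* Run the example from this comment.  EXPECTED receives the element-wise
   result; FOLDED receives what the three-element encoding expands to.  */
void run_example(int expected[NELTS], int folded[NELTS])
{
    const int arg0[NELTS] = { -16, -9, -10, -11 };
    const int arg1[NELTS] = { -12, -5, -6, -7 };
    const int sel[NELTS]  = { 3, 4, 5, 6 };

    permute(arg0, arg1, sel, expected);

    /* Keep only the first three selected elements as the encoding,
       i.e. { arg0[3], arg1[0], arg1[1] } = { -11, -12, -5 }.  */
    expand_stepped(expected, folded);
}
```

With these inputs the element-wise result is { -11, -12, -5, -6 }, while the expanded encoding is { -11, -12, -5, 2 }: the two agree on the first three elements and differ only in the fourth, matching the -O0 vs -O2 difference described above.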