https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069
--- Comment #15 from luoxhu at gcc dot gnu.org ---
In combine, the vec_select(vec_concat ...) and the following vec_select are combined into a single extract instruction, which seems reasonable for both LE and BE?

R146: 0 1 2 3
R141: 4 5 6 7
R150: 2 6 3 7  // vec_select(vec_concat(r146:V4SI,r141:V4SI),[2 6 3 7])
R151: R150[3]  // vec_select(r150:V4SI,3)
=>
R151: R141[3]  // vec_select(r141:V4SI,3)

Trying 21 -> 24:
   21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
      REG_DEAD r146:V4SI
      REG_DEAD r141:V4SI
   24: {r151:SI=vec_select(r150:V4SI,parallel);clobber scratch;}
Failed to match this instruction:
(parallel [
        (set (reg:SI 151)
            (vec_select:SI (reg:V4SI 141)
                (parallel [
                        (const_int 3 [0x3])
                    ])))
        (clobber (scratch:SI))
        (set (reg:V4SI 150)
            (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
                    (reg:V4SI 141))
                (parallel [
                        (const_int 2 [0x2])
                        (const_int 6 [0x6])
                        (const_int 3 [0x3])
                        (const_int 7 [0x7])
                    ])))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:SI 151)
            (vec_select:SI (reg:V4SI 141)
                (parallel [
                        (const_int 3 [0x3])
                    ])))
        (set (reg:V4SI 150)
            (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
                    (reg:V4SI 141))
                (parallel [
                        (const_int 2 [0x2])
                        (const_int 6 [0x6])
                        (const_int 3 [0x3])
                        (const_int 7 [0x7])
                    ])))
    ])
Successfully matched this instruction:
(set (reg:V4SI 150)
    (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
            (reg:V4SI 141))
        (parallel [
                (const_int 2 [0x2])
                (const_int 6 [0x6])
                (const_int 3 [0x3])
                (const_int 7 [0x7])
            ])))
Successfully matched this instruction:
(set (reg:SI 151)
    (vec_select:SI (reg:V4SI 141)
        (parallel [
                (const_int 3 [0x3])
            ])))
allowing combination of insns 21 and 24
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
      REG_DEAD r146:V4SI
deferring rescan insn with uid = 21.
modifying insn i3    24: {r151:SI=vec_select(r141:V4SI,parallel);clobber scratch;}
      REG_DEAD r141:V4SI
deferring rescan insn with uid = 24.
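To double-check the lane arithmetic behind the R146/R141/R150/R151 table, here is a minimal Python sketch of the RTL-level semantics, assuming vec_concat simply appends the lanes of its second operand after the first and vec_select picks lanes by RTL element index. The helper fold_extract is hypothetical and only illustrates the idea of the fold; it is not the actual simplify-rtx code in GCC.

```python
# Sketch of RTL vector semantics (assumption: lanes use RTL element numbering).
def vec_concat(a, b):
    # vec_concat appends the lanes of b after the lanes of a.
    return a + b

def vec_select(vec, sel):
    # vec_select picks lanes of vec by index, in the order given by sel.
    return [vec[i] for i in sel]

def fold_extract(sel, k, nunits=4):
    # Hypothetical helper: lane k of vec_select(vec_concat(op0, op1), sel)
    # comes from operand 0 or 1; return (operand, lane within that operand).
    idx = sel[k]
    return (0, idx) if idx < nunits else (1, idx - nunits)

r146 = [0, 1, 2, 3]
r141 = [4, 5, 6, 7]
sel = [2, 6, 3, 7]

r150 = vec_select(vec_concat(r146, r141), sel)   # insn 21
r151 = vec_select(r150, [3])[0]                  # insn 24
op, lane = fold_extract(sel, 3)                  # combined form
folded = vec_select([r146, r141][op], [lane])[0]

print(r150, r151, (op, lane), folded)  # → [2, 6, 3, 7] 7 (1, 3) 7
```

At the RTL level the fold to vec_select(r141, 3) is therefore endian-neutral; any LE/BE difference would come from how the matched pattern is later lowered to machine instructions.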
I guess the previous unspec implementation bypassed the LE + LE swap check, so now, in split2, should we generate vextuwlx instead of vextuwrx on little-endian?
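To reason about the vextuwlx/vextuwrx question, the following sketch models which GCC V4SI element each instruction returns for a word-aligned byte index. It assumes the Power ISA 3.0 definitions (vextuwlx extracts register bytes index..index+3, vextuwrx extracts bytes 12-index..15-index, with the index taken from the low four bits of RA and byte 0 being the leftmost byte) and the usual rs6000 little-endian layout in which GCC element 0 sits in the rightmost word. All helper names are hypothetical, and only indices 0, 4, 8, 12 are modeled.

```python
# Hypothetical model of Power ISA 3.0 word extracts; word-aligned indices only.
def element_at_byte(byte, endian):
    # GCC V4SI element occupying register byte `byte` (byte 0 = leftmost).
    # Assumes BE puts element 0 in the leftmost word, LE in the rightmost.
    word = byte // 4
    return word if endian == "be" else 3 - word

def vextuwlx(index, endian):
    # Left-indexed extract: bytes index..index+3.
    return element_at_byte(index & 0xF, endian)

def vextuwrx(index, endian):
    # Right-indexed extract: bytes 12-index..15-index.
    return element_at_byte(12 - (index & 0xF), endian)

for endian in ("be", "le"):
    for i in range(4):
        print(endian, i, "lx ->", vextuwlx(4 * i, endian),
              "rx ->", vextuwrx(4 * i, endian))
```

Under these assumptions a byte index of 4*i selects element i through the left-indexed form on BE and through the right-indexed form on LE. That alone does not settle what split2 should emit, since the answer also depends on which element numbering the index operand uses when the pattern is split.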