https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92265
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Oh, and there's also the case where VPINSR[BWDQ] takes a GPR (or memory) to insert into an XMM reg. PINSRW is available with SSE2, the B/D/Q variants with SSE4.1. It's also only the non-zero lane inserts that require an extra move if the above are not available. There's a memory move to the upper half for DImode memory sources as well.

Note the larger store might still be good to reduce the needed store bandwidth and to avoid later STLF (store-to-load forwarding) issues when a vector load follows. But some targets have non-trivial move cost between register files (not Intel though). So the question is whether your example makes a difference in practice.
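
For anyone who wants to look at the lane-insert cases in isolation, a minimal sketch using the GCC generic vector extensions might look like the following. The type and function names (`v4si`, `insert_lane0`, `insert_lane1`) are made up for illustration and are not taken from the testcase in this PR. Comparing the assembly from `-O2 -msse2` against `-O2 -msse4.1` should show whether a PINSRD or a longer sequence is chosen for the non-zero lane; the exact output depends on the GCC version and tuning, so no expected assembly is given here.

```c
#include <stdint.h>

/* GCC/Clang generic vector type: four 32-bit lanes in 16 bytes.
   The name v4si is an assumption for this sketch. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Insert a scalar (held in a GPR) into lane 0 of the vector. */
v4si insert_lane0(v4si v, int32_t x)
{
    v[0] = x;
    return v;
}

/* Insert a scalar into lane 1, i.e. a non-zero lane.  This is the
   kind of insert that may need an extra move when the SSE4.1
   PINSRD form is not available. */
v4si insert_lane1(v4si v, int32_t x)
{
    v[1] = x;
    return v;
}
```

Only the targeted lane changes; the other three lanes keep their previous values, which is what makes this an insert rather than a full vector construction.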