https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120456
Bug ID: 120456
Summary: __builtin_shuffle produces unnecessary vperm2i128
Product: gcc
Version: 14.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: adamant.pwn at gmail dot com
Target Milestone: ---

Consider the following snippet:

    auto test1(uint32_t bits) {
        auto bytes = u8x32(u32x8() + bits);
        u8x32 shuffler = {
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3
        };
        auto shuffle = __builtin_shuffle(bytes, shuffler);
        return shuffle;
    }

It produces an unnecessary vperm2i128 instruction in the output. The same happens if I use __builtin_shufflevector instead.

Also, if I change one of the shuffle indices to point into the second half of the vector, e.g. 3 to 31, it produces an unnecessary vpermq instead of vperm2i128.

See https://godbolt.org/z/eEnd673e6 for details, and for a comparison with a manual implementation of the same function using intrinsics.
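The snippet above does not show how u8x32 and u32x8 are declared. A self-contained version might look like the sketch below, assuming they are GCC vector_size(32) typedefs over uint8_t and uint32_t (this declaration is a guess, not taken from the report). The `reference` helper is hypothetical and only spells out what the shuffle is expected to compute: byte k of `bits`, in memory order, repeated eight times for k = 0..3.

```cpp
#include <cstdint>
#include <cstring>

// Assumed typedefs: the report does not show these declarations.
typedef uint8_t  u8x32 __attribute__((vector_size(32)));
typedef uint32_t u32x8 __attribute__((vector_size(32)));

// The function from the report, with an explicit return type instead of auto.
u8x32 test1(uint32_t bits) {
    // Broadcast bits into all eight 32-bit lanes, then view the vector as 32 bytes.
    u8x32 bytes = u8x32(u32x8() + bits);
    u8x32 shuffler = {
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3
    };
    // result[i] = bytes[shuffler[i]]
    return __builtin_shuffle(bytes, shuffler);
}

// Hypothetical scalar reference (not part of the report): out[i] is byte
// i / 8 of bits, taken in memory order, so it matches the vector version
// regardless of endianness.
void reference(uint32_t bits, uint8_t out[32]) {
    uint8_t src[4];
    std::memcpy(src, &bits, sizeof src);
    for (int i = 0; i < 32; ++i) {
        out[i] = src[i / 8];
    }
}
```

All indices in `shuffler` are below 16 and `bytes` holds the same 32-bit value in every lane, so both 128-bit halves of the source contain identical data. Building this with something like `g++ -O2 -mavx2 -S` should make the reported vperm2i128 visible in the generated assembly; the exact output depends on the compiler version.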