https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106704
Bug ID: 106704
Summary: avx intrinsic no generating vblendvps instruction with -mavx
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: mindmark at gmail dot com
Target Milestone: ---

I'm getting some really strange code generation on x86_64 when using the
_mm256_blendv_ps intrinsic in GCC 12 with just the -mavx flag.

https://godbolt.org/z/YbEK4nvEd

It should generate a single vblendvps instruction, but instead GCC 12 emits a
long sequence of branchy scalar code. GCC 11 doesn't appear to exhibit this
behavior. Also, using -mavx2 on GCC 12 generates the single instruction; only
-mavx is affected, for whatever reason.

    #include <immintrin.h>

    __m256 bend_stuff( __m256 a, __m256 b, __m256 mask)
    {
        return _mm256_blendv_ps(a, b, mask);
    }

Compiled with:

    gcc -O2 -mavx -c blend.c -masm=intel -S -o out.s

In GCC 12 it generates:

    bend_stuff:
            vmovd eax, xmm2
            vmovaps ymm3, ymm0
            vmovdqa xmm4, xmm2
            test eax, eax
            jns .L3
            vmovaps xmm0, xmm1
    .L3:
            vpextrd eax, xmm4, 1
            vshufps xmm9, xmm3, xmm3, 85
            test eax, eax
            jns .L5
            vshufps xmm9, xmm1, xmm1, 85
    .L5:
            vpextrd eax, xmm4, 2
            vunpckhps xmm5, xmm3, xmm3
            test eax, eax
            jns .L7
            vunpckhps xmm5, xmm1, xmm1
    .L7:
            vpextrd eax, xmm4, 3
            vshufps xmm7, xmm3, xmm3, 255
            test eax, eax
            jns .L9
            vshufps xmm7, xmm1, xmm1, 255
    .L9:
            vextractf128 xmm2, ymm2, 0x1
            vextractf128 xmm4, ymm3, 0x1
            vmovd eax, xmm2
            test eax, eax
            jns .L11
            vextractf128 xmm4, ymm1, 0x1
    .L11:
            vextractf128 xmm8, ymm3, 0x1
            vpextrd eax, xmm2, 1
            vshufps xmm8, xmm8, xmm8, 85
            test eax, eax
            jns .L13
            vextractf128 xmm8, ymm1, 0x1
            vshufps xmm8, xmm8, xmm8, 85
    .L13:
            vextractf128 xmm6, ymm3, 0x1
            vpextrd eax, xmm2, 2
            vunpckhps xmm6, xmm6, xmm6
            test eax, eax
            jns .L15
            vextractf128 xmm6, ymm1, 0x1
            vunpckhps xmm6, xmm6, xmm6
    .L15:
            vextractf128 xmm3, ymm3, 0x1
            vpextrd eax, xmm2, 3
            vshufps xmm3, xmm3, xmm3, 255
            test eax, eax
            jns .L17
            vextractf128 xmm1, ymm1, 0x1
            vshufps xmm3, xmm1, xmm1, 255
    .L17:
            vunpcklps xmm6, xmm6, xmm3
            vunpcklps xmm4, xmm4, xmm8
            vunpcklps xmm5, xmm5, xmm7
            vunpcklps xmm0, xmm0, xmm9
            vmovlhps xmm4, xmm4, xmm6
            vmovlhps xmm0, xmm0, xmm5
            vinsertf128 ymm0, ymm0, xmm4, 0x1
            ret

In GCC 11.3 it generates:

    bend_stuff:
            vblendvps ymm0, ymm0, ymm1, ymm2
            ret
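
For reference, here is a rough scalar model of what the blend is expected to compute, which may help when comparing the two outputs. It is only a sketch: blendv_ps_ref is a hypothetical helper name (it is not part of GCC or immintrin.h), and it assumes the usual vblendvps semantics, where each 32-bit lane of the result comes from b when the sign bit of the matching mask lane is set and from a otherwise. The repeated test/jns pairs in the GCC 12 output look like that same sign-bit check done one lane at a time.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference helper, not a GCC or immintrin.h function.
   Models the per-lane selection of an 8-lane float blend:
   a lane takes b[i] when the sign bit of mask[i] is set, else a[i]. */
void blendv_ps_ref(float dst[8], const float a[8],
                   const float b[8], const float mask[8])
{
    for (int i = 0; i < 8; i++) {
        uint32_t bits;
        /* Copy the raw bit pattern of the mask lane; only bit 31 matters. */
        memcpy(&bits, &mask[i], sizeof bits);
        dst[i] = (bits >> 31) ? b[i] : a[i];
    }
}
```

Note that only the sign bit is inspected, so a mask lane of -0.0f selects from b just like -1.0f does. A model like this could be used to check that the branchy GCC 12 sequence and the single-instruction GCC 11 output agree on results, which would make this a missed optimization rather than wrong code.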