https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122818
Bug ID: 122818
Summary: std::experimental::simd miss-optimize mask computation
of fixed_size_simd
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: stoventtas at gmail dot com
Target Milestone: ---
This use of std::experimental::simd computes a row of 8 pixels from a pattern
mask. The pattern is 8 bits wide: bit 7 is the leftmost pixel and bit 0 is the
rightmost pixel. A 0 bit means the pixel should be black and a 1 bit means it
should be of the given color.
The code below generates suboptimal assembly:
```cpp
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;
using fixed_simd_t = stdx::fixed_size_simd<uint32_t, 8>;
using fixed_mask_t = fixed_simd_t::mask_type;

static const fixed_simd_t FIXED_PATTERN{[](uint32_t i){
    return (1u << (7 - i));
}};

void simd_blend_fixed(uint32_t* dst, uint8_t pattern, uint32_t color)
{
    const fixed_simd_t patternSimd = pattern;
    const fixed_mask_t patternMask = (patternSimd & FIXED_PATTERN) != 0;
    fixed_simd_t pixel = 0x00'10'10'10; // Default black color level.
    stdx::where(patternMask, pixel) = color;
    pixel.copy_to(dst, stdx::element_aligned);
}
```
The -O3 -mavx2 assembly is:
```
"simd_blend_fixed(unsigned int*, unsigned char, unsigned int)":
        movzx esi, sil
        vpxor xmm1, xmm1, xmm1
        vmovd xmm2, edx
        vmovd xmm0, esi
        vpbroadcastd ymm2, xmm2
        vpbroadcastd ymm0, xmm0
        vpand ymm0, ymm0, YMMWORD PTR "FIXED_PATTERN"[rip]
        vpcmpeqd ymm0, ymm0, ymm1
        vpcmpeqd ymm0, ymm0, ymm1
        vmovmskps eax, ymm0
        movzx eax, al
        vmovd xmm0, eax
        mov eax, 1052688
        vpbroadcastd ymm0, xmm0
        vpand ymm0, ymm0, YMMWORD PTR .LC0[rip]
        vpcmpgtd ymm0, ymm0, ymm1
        vmovd xmm1, eax
        vpbroadcastd ymm1, xmm1
        vpblendvb ymm0, ymm1, ymm2, ymm0
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
```
The middle section right after the two vpcmpeqd seems useless. It packs the
pattern mask from ymm0 into eax with vmovmskps, which just yields the bits of
the original pattern argument in mirrored order. It then immediately
broadcasts that value back into ymm0 and does a vpand with .LC0, the mirrored
counterpart of the FIXED_PATTERN constant, followed by a vpcmpgtd. The result
is the same per-lane mask that the two vpcmpeqd had already produced, so the
whole round trip is effectively a no-op.
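To make the no-op claim concrete, here is a minimal scalar model of the round trip. It is only a sketch: the helper names are made up for illustration, and it assumes .LC0 holds {1, 2, 4, ..., 128} (the mirror of FIXED_PATTERN), which the listing above does not show.

```cpp
#include <array>
#include <cstdint>

// Model of the per-lane mask after the two vpcmpeqd:
// lane i is true when bit (7 - i) of the pattern is set.
std::array<bool, 8> direct_mask(uint8_t pattern)
{
    std::array<bool, 8> m{};
    for (unsigned i = 0; i < 8; ++i)
        m[i] = (pattern & (1u << (7 - i))) != 0;
    return m;
}

// Model of vmovmskps: pack one bit per lane, lane i goes to bit i.
uint8_t pack_mask(const std::array<bool, 8>& m)
{
    unsigned packed = 0;
    for (unsigned i = 0; i < 8; ++i)
        packed |= static_cast<unsigned>(m[i]) << i;
    return static_cast<uint8_t>(packed);
}

// Model of vpbroadcastd + vpand + vpcmpgtd.
// ASSUMPTION: lane i of .LC0 is (1 << i).
std::array<bool, 8> round_trip_mask(uint8_t pattern)
{
    const uint8_t packed = pack_mask(direct_mask(pattern));
    std::array<bool, 8> m{};
    for (unsigned i = 0; i < 8; ++i)
        m[i] = (packed & (1u << i)) > 0;
    return m;
}
```

Under that assumption, `round_trip_mask(p)` equals `direct_mask(p)` for every 8-bit `p`, and `pack_mask(direct_mask(p))` is `p` with its bits reversed.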
The native_simd version doesn't do the vmovmskps part.
Godbolt with GCC and clang comparison: https://godbolt.org/z/azhexj59Y
It looks like clang has the same issue, although it uses a left shift in place
of the vpand+vpcmpgtd pair.
In the Godbolt link I have included a hand-written AVX2 intrinsics version to
show what the simd code should compile to. It is nearly identical to the
output of the native_simd version, except for the double vpcmpeqd, which could
be optimized to a single vpcmpgtd.
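For readers who cannot open the Godbolt link, a hand-written version could look roughly like the sketch below. This is not the code from the link: the function name `avx2_blend` is hypothetical, and the `target("avx2")` attribute (a GCC/clang extension) is used only so the snippet compiles without -mavx2.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical hand-written equivalent of simd_blend_fixed.
__attribute__((target("avx2")))
void avx2_blend(uint32_t* dst, uint8_t pattern, uint32_t color)
{
    // Lane i tests bit (7 - i), so lane 0 is the leftmost pixel.
    const __m256i bits = _mm256_setr_epi32(128, 64, 32, 16, 8, 4, 2, 1);
    const __m256i pat  = _mm256_set1_epi32(pattern);

    // (pat & bits) is either 0 or a small positive value in each lane,
    // so one signed "greater than zero" compare yields the lane mask.
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_and_si256(pat, bits),
                                            _mm256_setzero_si256());

    const __m256i black = _mm256_set1_epi32(0x00101010);
    const __m256i fg    = _mm256_set1_epi32(static_cast<int>(color));

    // Mask lanes are all-ones or all-zeros, so a byte blend picks whole lanes:
    // fg where the mask is set, black elsewhere.
    const __m256i pixel = _mm256_blendv_epi8(black, fg, mask);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), pixel);
}
```

The mask goes straight from the compare into the blend, with no vmovmskps round trip through a general-purpose register.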