https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122818

            Bug ID: 122818
           Summary: std::experimental::simd miss-optimize mask computation
                    of fixed_size_simd
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: stoventtas at gmail dot com
  Target Milestone: ---

This code uses std::experimental::simd to compute a row of 8 pixels from a
pattern mask. Given an 8-bit pattern in which bit 7 is the leftmost pixel and
bit 0 is the rightmost pixel, a 0 bit means the pixel should be black and a 1
bit means it should be drawn in the given color.
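
For reference, the intended per-pixel behaviour can be written as a plain
scalar loop. This is only a sketch to pin down the semantics: the function name
`scalar_blend_reference` is hypothetical and not part of the report, and the
black level 0x00101010 is taken from the simd code in this report.

```cpp
#include <cstdint>

// Hypothetical scalar reference (name is illustrative only).
// Bit 7 of `pattern` selects dst[0] (leftmost), bit 0 selects dst[7].
void scalar_blend_reference(uint32_t* dst, uint8_t pattern, uint32_t color)
{
    constexpr uint32_t black = 0x00101010u; // Default black color level.

    for (unsigned i = 0; i < 8; ++i) {
        // Same per-lane test as (patternSimd & FIXED_PATTERN) != 0.
        const bool lit = (pattern & (1u << (7 - i))) != 0;
        dst[i] = lit ? color : black;
    }
}
```

Any vectorized version should produce the same 8 values for every
pattern/color pair.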

The code below generates suboptimal assembly:

```cpp
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

using fixed_simd_t  = stdx::fixed_size_simd<uint32_t, 8>;
using fixed_mask_t  = fixed_simd_t::mask_type;

static const fixed_simd_t FIXED_PATTERN{[](uint32_t i){
    return (1u << (7 - i));
}};

void simd_blend_fixed(uint32_t* dst, uint8_t pattern, uint32_t color)
{
    const fixed_simd_t patternSimd = pattern;

    const fixed_mask_t patternMask = (patternSimd & FIXED_PATTERN) != 0;

    fixed_simd_t pixel = 0x00'10'10'10; // Default black color level.

    stdx::where(patternMask, pixel) = color;

    pixel.copy_to(dst, stdx::element_aligned);
}
```

The -O3 -mavx2 assembly is:
```
"simd_blend_fixed(unsigned int*, unsigned char, unsigned int)":
        movzx   esi, sil
        vpxor   xmm1, xmm1, xmm1
        vmovd   xmm2, edx
        vmovd   xmm0, esi
        vpbroadcastd    ymm2, xmm2
        vpbroadcastd    ymm0, xmm0
        vpand   ymm0, ymm0, YMMWORD PTR "FIXED_PATTERN"[rip]
        vpcmpeqd        ymm0, ymm0, ymm1
        vpcmpeqd        ymm0, ymm0, ymm1
        vmovmskps       eax, ymm0
        movzx   eax, al
        vmovd   xmm0, eax
        mov     eax, 1052688
        vpbroadcastd    ymm0, xmm0
        vpand   ymm0, ymm0, YMMWORD PTR .LC0[rip]
        vpcmpgtd        ymm0, ymm0, ymm1
        vmovd   xmm1, eax
        vpbroadcastd    ymm1, xmm1
        vpblendvb       ymm0, ymm1, ymm2, ymm0
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
```
The middle section right after the two vpcmpeqd seems useless: it packs the
pattern mask from ymm0 into eax, which just mirrors the bits of the original
pattern argument, then immediately broadcasts it back into ymm0 and does a
vpand with the mirrored FIXED_PATTERN constant. The result is the same mask
the two vpcmpeqd had already produced, so the whole sequence is effectively a
no-op.
The native_simd version doesn't do the vmovmskps part.
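
To illustrate why the pack/unpack sequence looks redundant, here is a scalar
model of it. This is a sketch under assumptions: the helper names are
hypothetical, and the contents of .LC0 are assumed to be `1 << i` for lane i
(the mirror of FIXED_PATTERN), which is not shown in the listing above.

```cpp
#include <array>
#include <cstdint>

using lanes_t = std::array<bool, 8>;

// Mask after the two vpcmpeqd: lane i is set iff pattern bit (7 - i) is set.
lanes_t first_mask(uint8_t pattern)
{
    lanes_t m{};
    for (unsigned i = 0; i < 8; ++i)
        m[i] = (pattern & (1u << (7 - i))) != 0;
    return m;
}

// Model of vmovmskps: lane i becomes bit i of the packed result.
uint8_t pack_mask(const lanes_t& m)
{
    unsigned bits = 0;
    for (unsigned i = 0; i < 8; ++i)
        bits |= (m[i] ? 1u : 0u) << i;
    return static_cast<uint8_t>(bits);
}

// Model of vpbroadcastd + vpand .LC0 + vpcmpgtd against zero,
// assuming lane i of .LC0 holds (1 << i).
lanes_t unpack_mask(uint8_t bits)
{
    lanes_t m{};
    for (unsigned i = 0; i < 8; ++i)
        m[i] = (bits & (1u << i)) > 0;
    return m;
}

// True if packing and then unpacking returns the original mask
// for all 256 possible patterns.
bool round_trip_is_identity()
{
    for (unsigned p = 0; p < 256; ++p) {
        const lanes_t m = first_mask(static_cast<uint8_t>(p));
        if (unpack_mask(pack_mask(m)) != m)
            return false;
    }
    return true;
}
```

Under that assumption the round trip reproduces the input mask for every
pattern, so the vpblendvb could consume the result of the vpcmpeqd pair
directly.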

Godbolt with GCC and clang comparison: https://godbolt.org/z/azhexj59Y
It looks like clang has the same issue, except that it uses a left shift in
place of the vpand+vpcmpgtd.

I have included a hand-written AVX2 intrinsics version to show what the simd
code should compile to. The native_simd version is nearly identical to it,
except for its double vpcmpeqd, which could be optimized to a single vpcmpgtd.
