https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122664

            Bug ID: 122664
           Summary: Using a large fake length for a buffer enables
                    vectorizations
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nicula at nicula dot xyz
  Target Milestone: ---

Created attachment 62781
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62781&action=edit
Benchmark source file

Consider these two functions that return the index of the first non-zero
element:

    #include <optional>
    #include <memory>
    #include <numeric>

    using u8 = uint8_t;

    size_t v1(u8 *data) {
        data = std::assume_aligned<32>(data);
        for (size_t i = 0;; i++) {
            if (data[i] != 0) {
                return i;
            }
        }
    }

    __always_inline
    std::optional<size_t> v2_helper(u8 *data, size_t len) {
        data = std::assume_aligned<32>(data);
        for (size_t i = 0; i <= len; i++) {
            if (data[i] != 0) {
                return i;
            }
        }
        return std::nullopt;
    }

    size_t v2(u8 *data) {
        auto ret = v2_helper(data, std::numeric_limits<size_t>::max() - 1);
        if (!ret) {
            __builtin_unreachable();
        }
        return *ret;
    }

Assembly (`-std=c++23 -O3 -march=znver3`):

    v1(unsigned char*):
            xor     eax, eax
            cmp     BYTE PTR [rdi], 0
            jne     .L1
    .L2:
            inc     rax
            cmp     BYTE PTR [rdi+rax], 0
            je      .L2
    .L1:
            ret

    v2(unsigned char*):
            vmovdqa ymm2, YMMWORD PTR .LC0[rip]
            vmovdqa ymm0, YMMWORD PTR .LC1[rip]
            mov     rax, rdi
            lea     rdx, [rdi-32]
            vpbroadcastq    ymm5, QWORD PTR .LC4[rip]
            vpbroadcastq    ymm4, QWORD PTR .LC5[rip]
            vpxor   xmm3, xmm3, xmm3
    .L10:
            vpcmpeqb        ymm1, ymm3, YMMWORD PTR [rax]
            vpcmpeqb        ymm1, ymm1, ymm3
            vptest  ymm1, ymm1
            jne     .L15
            add     rax, 32
            vpaddq  ymm0, ymm0, ymm5
            vpaddq  ymm2, ymm2, ymm4
            cmp     rax, rdx
            jne     .L10
            mov     edx, 31
            mov     rax, -32
    .L9:
            add     rdx, rax
    .L12:
            cmp     BYTE PTR [rdi+rax], 0
            jne     .L7
            inc     rax
            cmp     rdx, rax
            jne     .L12
            xor     eax, eax
    .L7:
            vzeroupper
            ret
    .L15:
            vmovq   rdx, xmm2
            vmovq   rax, xmm0
            jmp     .L9
    .LC0:
            .quad   -1
            .quad   -2
            .quad   -3
            .quad   -4
    .LC1:
            .quad   0
            .quad   1
            .quad   2
            .quad   3
    .LC4:
            .quad   32
    .LC5:
            .quad   -32

`v1()` is not vectorized and goes byte-by-byte. `v2()` is vectorized. The only
difference between those functions is that `v1` can return `data + <any index>`
or loop infinitely (though would this infinite loop be UB?), while `v2()` can
only return `data + <any index except SIZE_MAX>` and is guaranteed to return.
If `SIZE_MAX` is passed to `v2_helper()` instead of `SIZE_MAX - 1`, then `v2()`
will not be vectorized and will go byte-by-byte just like `v1()`.

Can't `v1()` be vectorized like `v2()`? Also why do the vectorizations fail
when `SIZE_MAX` is passed instead of `SIZE_MAX - 1`? Maybe the key piece of
information that GCC uses to vectorize this is that the loop in `v2_helper()`
always finishes? Also, important to note that versions prior to GCC 15.1 are
not capable of vectorizing `v2()`.

Note:
* attaching benchmarking source file too;
* the code could be modified to search for the null byte instead, but I wanted
to avoid GCC figuring it out and calling `strlen()`.

Reply via email to