https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122664
Bug ID: 122664
Summary: Using a large fake length for a buffer enables
vectorizations
Product: gcc
Version: 15.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: nicula at nicula dot xyz
Target Milestone: ---
Created attachment 62781
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62781&action=edit
Benchmark source file
Consider these two functions that return the index of the first non-zero
element:
#include <optional>
#include <memory>
#include <numeric>
using u8 = uint8_t;
size_t v1(u8 *data) {
data = std::assume_aligned<32>(data);
for (size_t i = 0;; i++) {
if (data[i] != 0) {
return i;
}
}
}
__always_inline
std::optional<size_t> v2_helper(u8 *data, size_t len) {
data = std::assume_aligned<32>(data);
for (size_t i = 0; i <= len; i++) {
if (data[i] != 0) {
return i;
}
}
return std::nullopt;
}
size_t v2(u8 *data) {
auto ret = v2_helper(data, std::numeric_limits<size_t>::max() - 1);
if (!ret) {
__builtin_unreachable();
}
return *ret;
}
Assembly (`-std=c++23 -O3 -march=znver3`):
v1(unsigned char*):
xor eax, eax
cmp BYTE PTR [rdi], 0
jne .L1
.L2:
inc rax
cmp BYTE PTR [rdi+rax], 0
je .L2
.L1:
ret
v2(unsigned char*):
vmovdqa ymm2, YMMWORD PTR .LC0[rip]
vmovdqa ymm0, YMMWORD PTR .LC1[rip]
mov rax, rdi
lea rdx, [rdi-32]
vpbroadcastq ymm5, QWORD PTR .LC4[rip]
vpbroadcastq ymm4, QWORD PTR .LC5[rip]
vpxor xmm3, xmm3, xmm3
.L10:
vpcmpeqb ymm1, ymm3, YMMWORD PTR [rax]
vpcmpeqb ymm1, ymm1, ymm3
vptest ymm1, ymm1
jne .L15
add rax, 32
vpaddq ymm0, ymm0, ymm5
vpaddq ymm2, ymm2, ymm4
cmp rax, rdx
jne .L10
mov edx, 31
mov rax, -32
.L9:
add rdx, rax
.L12:
cmp BYTE PTR [rdi+rax], 0
jne .L7
inc rax
cmp rdx, rax
jne .L12
xor eax, eax
.L7:
vzeroupper
ret
.L15:
vmovq rdx, xmm2
vmovq rax, xmm0
jmp .L9
.LC0:
.quad -1
.quad -2
.quad -3
.quad -4
.LC1:
.quad 0
.quad 1
.quad 2
.quad 3
.LC4:
.quad 32
.LC5:
.quad -32
`v1()` is not vectorized and goes byte-by-byte. `v2()` is vectorized. The only
difference between those functions is that `v1` can return `data + <any index>`
or loop infinitely (though would this infinite loop be UB?), while `v2()` can
only return `data + <any index except SIZE_MAX>` and is guaranteed to return.
If `SIZE_MAX` is passed to `v2_helper()` instead of `SIZE_MAX - 1`, then `v2()`
will not be vectorized and will go byte-by-byte just like `v1()`.
Can't `v1()` be vectorized like `v2()`? Also why do the vectorizations fail
when `SIZE_MAX` is passed instead of `SIZE_MAX - 1`? Maybe the key piece of
information that GCC uses to vectorize this is that the loop in `v2_helper()`
always finishes? Also, important to note that versions prior to GCC 15.1 are
not capable of vectorizing `v2()`.
Note:
* attaching benchmarking source file too;
* the code could be modified to search for the null byte instead, but I wanted
to avoid GCC figuring it out and calling `strlen()`.