| Issue | 182371 |
| Summary | Missed vectorisation optimisation for `std::ranges::all_of` when testing whether a range with a compile-time-known size is all zero |
| Labels | new issue |
| Assignees | |
| Reporter | sharadhr |
In [this Reddit thread](https://www.reddit.com/r/cpp_questions/comments/1r7csxn/why_cant_the_compiler_optimise_the_more_idiomatic/) I present [two implementations](https://godbolt.org/z/zEj6n8M1h) of a function that tests whether a `std::array<int, 64>` is all zero. The simple naïve loop compiles to tight vectorised assembly, whereas the `std::ranges`-and-`std::bind` implementation does not: it falls back to a scalar loop.
[A comment](https://www.reddit.com/r/cpp_questions/comments/1r7csxn/why_cant_the_compiler_optimise_the_more_idiomatic/o665p9y/) in the thread says:
> The root cause is what not_a_novel_account identified — the compiler loses trip count information through the ranges abstraction — but it's worth understanding why mechanically.
>
> When the compiler sees the simple loop over `std::array<int const, 64>`, the trip count (64) is immediately visible as a constant. The optimizer can unroll it, prove the memory region is contiguous and bounded, and emit wide SIMD loads (`vpcmpeqd` + `vptest` on 256-bit or 512-bit vectors).
>
> With `std::ranges::all_of`, the iteration goes through a sentinel-based abstraction: `ranges::begin()` returns an iterator, `ranges::end()` returns a sentinel, and the loop increments the iterator and compares against the sentinel each step. The predicate goes through `std::invoke` with a projected function object. Each of these is a separate template instantiation that the optimizer must inline and then see through.
>
> The problem is that LLVM (and GCC) have inlining budgets — heuristics that limit how deep/wide they'll inline before giving up. Ranges tend to exhaust this budget before the optimizer reaches the point where it can recognize "this is just iterating over 64 contiguous ints." By the time the abstraction layers are (partially) peeled, the loop trip count is no longer a visible constant, so the vectorizer sees a generic loop with unknown bounds and either gives up or produces scalar code.
>
> This is fundamentally a phase ordering issue: loop vectorization runs after inlining, but if inlining didn't fully resolve the abstraction, the vectorizer doesn't have enough information.
>
> For the practical case — if this is a hot path and you want vectorized zero-check — the bitwise OR reduction is the most portable approach across all three compilers:
>
> ```cpp
> auto isAllZeros(std::array<int const, 64> const& a) {
>     int acc = 0;
>     for (auto v : a) acc |= v;
>     return acc == 0;
> }
> ```
>
> All three compilers vectorize this reliably because there's no early-exit control flow — it's a pure reduction, which is the easiest pattern for auto-vectorizers to recognize. It does evaluate every element (no short circuit), but for 64 ints that's a single cache line or two of loads with wide SIMD — the branch misprediction cost of early exit would likely be worse anyway.
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs