https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123244

            Bug ID: 123244
           Summary: Vectorized loop falls back to unvectorized loop
                    instead of using “count trailing zeros” instruction
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: me at manueljacob dot de
  Target Milestone: ---

The following C code:

const unsigned char *search_nonascii(const unsigned char *p,
                                     const unsigned char *e) {
  for (const unsigned char *s = p; s < e; s++) {
    if (*s & 0x80)
      return s;
  }
  return 0;
}

compiled with GCC 16.0.0 20251221 using options `-O3 -march=x86-64-v4` contains
the following vectorized loop:

        vpxor   xmm1, xmm1, xmm1
<...>
.L7:
        add     rax, 64
        cmp     rax, rcx
        je      <...>
.L8:
        vmovdqa64       zmm0, ZMMWORD PTR [rdx+rax]
        vpcmpb          k0, zmm0, zmm1, 1
        kortestq        k0, k0
        je      .L7
<jump to unvectorized loop>

Instead of falling back to the unvectorized loop, the code could move k0 into a
GPR (e.g. with kmovq) and use tzcnt to get the offset of the first matching
byte within the 64-byte block.
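
For illustration, here is a rough portable C sketch of the requested behaviour.
It is not GCC's output and not a proposed implementation. The names
nonascii_mask64 and search_nonascii_ctz are hypothetical, the 64-bit integer
mask stands in for the k0 mask register, and __builtin_ctzll (a GCC/Clang
builtin) stands in for tzcnt. The sketch also uses unaligned 64-byte blocks
plus a scalar tail, which is an assumption and differs from the aligned loads
in the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when byte i of the
   64-byte block has its high bit set. This models the mask that vpcmpb
   writes to k0 in the assembly above. */
static uint64_t nonascii_mask64(const unsigned char *block) {
  uint64_t mask = 0;
  for (int i = 0; i < 64; i++)
    mask |= (uint64_t)(block[i] >> 7) << i;
  return mask;
}

/* Same contract as search_nonascii, but a hit inside a 64-byte block is
   located by counting trailing zeros of the mask instead of rescanning
   the block byte by byte. */
const unsigned char *search_nonascii_ctz(const unsigned char *p,
                                         const unsigned char *e) {
  const unsigned char *s = p;

  /* Full 64-byte blocks. */
  for (; e - s >= 64; s += 64) {
    uint64_t mask = nonascii_mask64(s);
    if (mask != 0)
      /* The lowest set bit is the first non-ASCII byte in the block.
         The builtin is undefined for 0, hence the check above. */
      return s + __builtin_ctzll(mask);
  }

  /* Scalar tail for the remaining (fewer than 64) bytes. */
  for (; s < e; s++) {
    if (*s & 0x80)
      return s;
  }
  return 0;
}
```

The point of the sketch is only that, once the mask is known to be non-zero,
the position of the match is already encoded in it, so no second pass over
the block is needed.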
