[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

guojiufu at gcc dot gnu.org Mon, 01 Jun 2020 19:50:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398


--- Comment #36 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2
>         = (unsigned long long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2
>         = (unsigned long long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
> 
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

Unaligned loads are already supported on some platforms, and loading several
bytes at once (64 or 128 bits) costs far less than issuing several 8-bit
loads; in the extreme case, a doubleword load may take the same number of
cycles as a byte load.
As discussed above, there are still a few issues that need to be taken care
of.  I'm wondering whether we could introduce this as a compiler optimization
in some circumstances.
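
To illustrate the kind of wide-load comparison meant here, below is a minimal
hand-written sketch. It is not what GCC emits today and not a proposed
implementation. The name match_len is hypothetical, and it assumes the target
handles unaligned 64-bit loads cheaply; memcpy is used only to express such a
load without alignment or aliasing problems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of leading bytes equal in A and B,
   looking at no more than MAX bytes of either buffer.  */
static size_t
match_len (const unsigned char *a, const unsigned char *b, size_t max)
{
  size_t len = 0;

  assert (a != NULL && b != NULL);

  /* Compare 8 bytes per iteration.  The fixed-size memcpy calls are
     normally folded into single (possibly unaligned) loads on targets
     that support them.  */
  while (len + 8 <= max)
    {
      uint64_t x, y;
      memcpy (&x, a + len, sizeof x);
      memcpy (&y, b + len, sizeof y);
      if (x != y)
        break;          /* Mismatch is somewhere in this word.  */
      len += 8;
    }

  /* Byte loop: locate the mismatch inside the last word, or finish
     the tail that is shorter than 8 bytes.  */
  while (len < max && a[len] == b[len])
    ++len;

  return len;
}
```

Since the word loop only reads while len + 8 <= max, it never touches bytes
past max, and the trailing byte loop keeps the result independent of
endianness.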
