https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120598
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #8)
> (In reply to Jeevitha from comment #6)
> > The following dot_product function gets vectorized with the latest GCC
> > trunk and gcc 15.1.0:
> >
> > #include <cstdint>
> > #include <cstddef>
> > extern float dot_product(const int16_t *v1, const int16_t *v2, size_t len);
> > float dot_product(const int16_t *v1, const int16_t *v2, size_t len)
> > {
> >   int64_t d = 0;
> >   for (size_t i = 0; i < len; i++)
> >     d += int32_t(v1[i]) * int32_t(v2[i]);
> >   return static_cast<float>(d);
> > }
> >
> > I observed that -O2 was used during compilation.  However, for GCC versions
> > earlier than 15, vectorization of this loop requires -O3.  Since they are
> > using the -O2 flag, GCC 15 is necessary in this case.
>
> Is that what the original code does?  Or does it convert every number to
> float and then sum over that?

The above is from the preprocessed source.

> And, can you try to find out what patch to GCC 15 made this work at -O2?
> In case we want to backport anything, but also just to get a better grip
> on what is happening here :-)

With GCC 15 we allow peeling for niter at -O2; with GCC 14 and earlier, at
-O2 we effectively only ever vectorize loops with a constant number of
iterations (divisible by the vector size).

I'd say this is "fixed" (it was reported against GCC 15), but the function
is 'static' in the preprocessed sources and thus likely inlined.  I'll also
note that plain SSE2 is a bit inefficient for this loop.

So maybe the reporter can clarify "We’ve observed that while functions in
the PGVector library benefit from both loop unrolling and auto-vectorization
(even with earlier versions of GCC, like 13.3 and 11.5), the same does not
hold true for the dot_product function in the MariaDB library" - does this
mean the autovectorization makes the function slower?  That would mean our
cost model isn't good enough here.
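
[Editorial note, not part of the original comment: to illustrate the peeling-for-niter
distinction described above, here is a hypothetical variant of the reduction, not taken
from the MariaDB sources.  Because the trip count is a compile-time constant that is a
multiple of the vector length, GCC 14 and earlier should be able to vectorize it already
at -O2 without peeling, whereas the runtime-length dot_product above needs -O3 on those
releases, or -O2 with GCC 15.]

#include <cstdint>
#include <cstddef>

/* Hypothetical sketch: constant trip count (1024, divisible by the vector
   length), so no peeling for niter is required and -O2 vectorization
   should be possible even before GCC 15.  */
float dot_product_fixed(const int16_t *v1, const int16_t *v2)
{
  int64_t d = 0;
  for (size_t i = 0; i < 1024; i++)
    d += int32_t(v1[i]) * int32_t(v2[i]);
  return static_cast<float>(d);
}

[Compiling both variants with -O2 -fopt-info-vec on GCC 14 and GCC 15 should show
only the fixed-count variant being vectorized by GCC 14, and both by GCC 15.]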