https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

Cory Fields <lists at coryfields dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lists at coryfields dot com

--- Comment #11 from Cory Fields <lists at coryfields dot com> ---
Confirmed, seeing this as well, specifically in a vectorized ChaCha20
implementation that performs left-rotates of uint32_t lanes.
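
For context, these are the ChaCha20 quarter-round rotates (the amounts 16, 12,
8, and 7 are fixed by RFC 8439). A minimal sketch of the vectorized form, using
the same vector-extension type as the reproducer below:

using vec256 = unsigned __attribute__((__vector_size__(32)));

// Sketch only: one ChaCha20 quarter round, one uint32_t lane per parallel
// state. Valid for 0 < bits < 32, which covers all four call sites.
static inline vec256 rotl(vec256 v, unsigned bits)
{
    return (v << bits) | (v >> (32 - bits));
}

static void quarter_round(vec256& a, vec256& b, vec256& c, vec256& d)
{
    a += b; d ^= a; d = rotl(d, 16); // multiple of 8: a pure byte shuffle
    c += d; b ^= c; b = rotl(b, 12);
    a += b; d ^= a; d = rotl(d, 8);  // multiple of 8: a pure byte shuffle
    c += d; b ^= c; b = rotl(b, 7);
}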

Targeting AVX2, Clang optimizes the 8-bit/16-bit rotates to a vpshufb, which
performs significantly better than the vpsrld+vpslld sequence on my hardware.
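
For reference, a sketch of the byte-shuffle form with AVX2 intrinsics (the
shuffle mask is my own reconstruction for the rotl-by-8 case, not copied from
Clang's output):

#include <immintrin.h>

// Rotate each 32-bit lane left by 8 with a single byte shuffle: within every
// 4-byte group, pick bytes {3,0,1,2} (little-endian), i.e.
// (x << 8) | (x >> 24) per lane. vpshufb indexes within each 128-bit half,
// so the mask repeats.
static inline __m256i rotl8_shuffle(__m256i v)
{
    const __m256i mask = _mm256_setr_epi8(
        3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14,
        3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14);
    return _mm256_shuffle_epi8(v, mask); // vpshufb
}

// The shift/or pattern this bug is about, written out with intrinsics.
static inline __m256i rotl8_shifts(__m256i v)
{
    return _mm256_or_si256(_mm256_slli_epi32(v, 8),
                           _mm256_srli_epi32(v, 24)); // vpslld+vpsrld+vpor
}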

Minimal reproducer:

using vec256 = unsigned __attribute__((__vector_size__(32))); // 8 x 32-bit lanes

template <unsigned BITS>
void vec_rotl(vec256& vec)
{
    // Rotate every 32-bit lane left by BITS (valid for 0 < BITS < 32).
    vec = (vec << BITS) | (vec >> (32 - BITS));
}

template void vec_rotl<16>(vec256&); // byte-aligned: expressible as one vpshufb
template void vec_rotl<8>(vec256&);  // byte-aligned: expressible as one vpshufb
template void vec_rotl<7>(vec256&);  // not byte-aligned: needs the shift pair

godbolt: https://godbolt.org/z/85j544EEf
