https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918
Cory Fields <lists at coryfields dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lists at coryfields dot com
--- Comment #11 from Cory Fields <lists at coryfields dot com> ---
Confirmed seeing this as well, specifically in a vectorized ChaCha20
implementation that performs left-rotates of uint32_t values.
Targeting AVX2, Clang optimizes the 8-bit and 16-bit rotates to a single
vpshufb, which performs significantly better than vpsrld+vpslld on my
hardware.
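For reference, here's roughly what the vpshufb form looks like when written
by hand with AVX2 intrinsics (a sketch, not the exact code Clang emits; the
function names are mine, and the masks just follow from the little-endian
byte layout of the 32-bit lanes):

#include <immintrin.h>

// rotl(x, 8): dest bytes {0,1,2,3} of each dword take source bytes {3,0,1,2}
static inline __m256i rotl8_vpshufb(__m256i v)
{
    const __m256i mask = _mm256_setr_epi8(
        3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14,
        3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14);
    return _mm256_shuffle_epi8(v, mask); // single vpshufb
}

// rotl(x, 16): dest bytes {0,1,2,3} of each dword take source bytes {2,3,0,1}
static inline __m256i rotl16_vpshufb(__m256i v)
{
    const __m256i mask = _mm256_setr_epi8(
        2, 3, 0, 1,  6, 7, 4, 5,  10, 11, 8, 9,  14, 15, 12, 13,
        2, 3, 0, 1,  6, 7, 4, 5,  10, 11, 8, 9,  14, 15, 12, 13);
    return _mm256_shuffle_epi8(v, mask); // single vpshufb
}

The 7-bit rotate isn't byte-aligned, so it can't be expressed as a byte
shuffle and still needs the two-shift-plus-or sequence.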
Minimal reproducer:
using vec256 = unsigned __attribute__((__vector_size__(32))); // 8 x 32-bit lanes
template <unsigned BITS>
void vec_rotl(vec256& vec)
{
    // Rotate each 32-bit lane left by BITS
    vec = (vec << BITS) | (vec >> (32 - BITS));
}
template void vec_rotl<16>(vec256&);
template void vec_rotl<8>(vec256&);
template void vec_rotl<7>(vec256&);
godbolt: https://godbolt.org/z/85j544EEf