https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92492
Roger Sayle <roger at nextmovesoftware dot com> changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |roger at nextmovesoftware dot com
--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
While type promotion can be used to vectorize this loop (as LLVM does), it
reduces the number of lanes, which adversely affects performance. A better
approach is for the optimizers to transform the code into a form that can be
vectorized using vectors of QImode (8-bit) elements, such as:
#include <stdint.h>

static inline uint8_t x264_clip_uint8( int x )
{
    unsigned char tmp, hi, lo;
    /* tmp = (-x)>>7 for x in [0,255], using only 8-bit (QImode) ops: */
    signed char y = x;          /* low byte of x */
    hi = (y != 0) ? -1 : 0;     /* 0xff if the low byte is nonzero */
    signed char m1 = y - 1;
    lo = (m1 >= 0) ? -1 : 0;    /* 0xff if y-1 is non-negative as a signed char */
    tmp = (hi + hi) - lo;
    return x & (~63) ? tmp : x; /* x outside [0,63]?  clip via tmp */
}
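For context, a loop of the shape being vectorized might look like the sketch
below; the function name, buffer names, and trip count are illustrative, not
taken from the testcase:

/* Hypothetical driver loop: with the rewritten clip above in scope, every
   operation stays in 8-bit lanes, so a 128-bit vector processes 16 elements
   at a time rather than the 4 lanes left after promotion to 32 bits. */
void clip_buffer( uint8_t *dst, const uint8_t *src, int n )
{
    for( int i = 0; i < n; i++ )
        dst[i] = x264_clip_uint8( src[i] );
}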
This form (which produces identical results) can be vectorized on x86_64, but
reveals a number of missed optimization opportunities and poor code generation
choices.
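The "identical results" claim is easy to spot-check. Below is a minimal
sketch, assuming the form being replaced was x & (~63) ? (uint8_t)((-x)>>7) : x
(the commented-out line above suggests this) and that inputs are byte values;
clip_ref is a name introduced here purely for illustration:

#include <stdio.h>
#include <stdint.h>

/* Assumed reference form; that this matches the original testcase exactly is
   an assumption based on the commented-out line in the rewrite above. */
static inline uint8_t clip_ref( int x )
{
    return x & (~63) ? (uint8_t)((-x) >> 7) : x;
}

int main( void )
{
    /* Exhaustive comparison over byte-valued inputs, with the rewritten
       x264_clip_uint8 above in scope. */
    for( int x = 0; x < 256; x++ )
        if( clip_ref( x ) != x264_clip_uint8( x ) )
            printf( "mismatch at %d\n", x );
    return 0;
}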
Can someone investigate the performance impact on 525.x264_r? It would be good
to know that there's an observable improvement for a transformation that's
this complicated (it is worse on scalar architectures) and potentially rarely
encountered.