https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92492

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
While type promotion can be used to vectorize this loop (as LLVM does), it
reduces the number of lanes, which adversely affects performance.  A better
approach is for the optimizers to transform the code into a form that can be
vectorized using vectors of QImode, such as:

static inline uint8_t x264_clip_uint8( int x )
{
  unsigned char tmp, hi, lo;
  // tmp = (-x)>>31;
  signed char y = x >> 8;   // nonzero iff x is outside [0,255]
  hi = (y >= 0) ? -1 : 0;   // all-ones unless x is negative
  signed char m1 = y - 1;
  lo = (m1 >= 0) ? -1 : 0;  // all-ones when x > 255
  tmp = (hi + hi) - lo;     // 255 when x > 255, 0 when x < 0
  return x&(~255) ? tmp : x;
}
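For reference, here is a scalar equivalence check (not part of the original report): a byte-wise formulation along these lines, with y = x>>8 so the high bits drive the clip, matches the upstream x264 definition (return x&(~255) ? (-x)>>31 : x;) for every value that a sum of two uint8_t operands can produce, which is the input range in the 525.x264_r hot loop.

```c
#include <assert.h>
#include <stdint.h>

/* Reference implementation, as in x264's common utilities. */
static inline uint8_t clip_ref( int x )
{
  return x&(~255) ? (-x)>>31 : x;
}

/* Byte-wise formulation: all comparisons and arithmetic fit in
   8-bit lanes, so the loop can use vectors of QImode. */
static inline uint8_t clip_new( int x )
{
  unsigned char tmp, hi, lo;
  signed char y = x >> 8;   /* nonzero iff x is outside [0,255] */
  hi = (y >= 0) ? -1 : 0;   /* all-ones unless x is negative */
  signed char m1 = y - 1;
  lo = (m1 >= 0) ? -1 : 0;  /* all-ones when x > 255 */
  tmp = (hi + hi) - lo;     /* 255 when x > 255, 0 when x < 0 */
  return x&(~255) ? tmp : x;
}
```

Note this relies on y fitting in a signed char, which holds for the 16-bit intermediate values the benchmark loop produces but not for arbitrary int inputs.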

This form (which produces identical results) can be vectorized on x86_64, but
reveals a number of missed optimization opportunities and poor choices.

Can someone investigate the performance impact on 525.x264_r?  It would be good
to know whether there's an observable improvement from a transformation that's
this complicated (worse on scalar targets) and potentially rarely encountered.
