https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #6 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> I can see what the patch does to this testcase on x86_64 - it enables BB
> vectorization of the first two loops after unrolling.  I don't see anything
> suspicious here on x86_64 and 525.x264_r works fine for me.
> 
> Can you clarify whether test, ref or train inputs fail for you?  I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
> 
> Can you check whether the following reduced file produces the same assembly
> for add4x4_idct as in the complete benchmark?  If so it should be possible to
> generate a runtime testcase from it.  Please attach preprocessed source if
> that doesn't work out.
> 
> So far I do suspect we are hitting a latent target issue?
> 
> #include <stdint.h>
> static uint8_t x264_clip_uint8( int x )
> {
>   return x&(~255) ? (-x)>>31 : x;
> }
> void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
> {
>   int16_t d[16];
>   int16_t tmp[16];
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 =  dct[0*4+i]     +  dct[2*4+i];
>       int d02 =  dct[0*4+i]     -  dct[2*4+i];
>       int s13 =  dct[1*4+i]     + (dct[3*4+i]>>1);
>       int d13 = (dct[1*4+i]>>1) -  dct[3*4+i];
>       tmp[i*4+0] = s02 + s13;
>       tmp[i*4+1] = d02 + d13;
>       tmp[i*4+2] = d02 - d13;
>       tmp[i*4+3] = s02 - s13;
>     }
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 =  tmp[0*4+i]     +  tmp[2*4+i];
>       int d02 =  tmp[0*4+i]     -  tmp[2*4+i];
>       int s13 =  tmp[1*4+i]     + (tmp[3*4+i]>>1);
>       int d13 = (tmp[1*4+i]>>1) -  tmp[3*4+i];
>       d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
>       d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
>       d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
>       d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
>     }
>   for( int y = 0; y < 4; y++ )
>     {
>       for( int x = 0; x < 4; x++ )
>         p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
>       p_dst += 32;
>     }
> }

Yes, that produces similar code, and adding the following to it produces an
executable test that fails at -O3.

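#include <stdlib.h>  /* for abort() */

int main(void)
{
  uint8_t dst[128];
  int16_t dct[16];
  int i;

  /* Fill the coefficients and the destination with known values. */
  for (i = 0; i < 16; i++)
    dct[i] = i*10 + i;
  for (i = 0; i < 128; i++)
    dst[i] = i;

  add4x4_idct(dst, dct);

  /* Check the first two of the four output rows (row stride is 32). */
  if (dst[0] != 14 || dst[1] != 0 || dst[2] != 4 || dst[3] != 2
      || dst[32] != 28 || dst[33] != 35 || dst[34] != 33 || dst[35] != 35)
    abort();

  return 0;
}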
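
A possible variation on the test above, offered only as a sketch: rather than
hard-coding expected bytes, compare the kernel against a second copy that
should not be BB-vectorized. The names add4x4_idct_ref, butterfly, fill_inputs
and count_mismatches are hypothetical and not part of the benchmark or the
report; the volatile temporaries are an assumed, portable way to keep the
reference scalar.

```c
#include <stdint.h>

/* Kernel under test, copied from the reduced file quoted above. */
static uint8_t x264_clip_uint8(int x)
{
  return x & (~255) ? (-x) >> 31 : x;
}

void add4x4_idct(uint8_t *p_dst, int16_t dct[16])
{
  int16_t d[16];
  int16_t tmp[16];
  for (int i = 0; i < 4; i++) {
    int s02 =  dct[0*4+i]       +  dct[2*4+i];
    int d02 =  dct[0*4+i]       -  dct[2*4+i];
    int s13 =  dct[1*4+i]       + (dct[3*4+i] >> 1);
    int d13 = (dct[1*4+i] >> 1) -  dct[3*4+i];
    tmp[i*4+0] = s02 + s13;
    tmp[i*4+1] = d02 + d13;
    tmp[i*4+2] = d02 - d13;
    tmp[i*4+3] = s02 - s13;
  }
  for (int i = 0; i < 4; i++) {
    int s02 =  tmp[0*4+i]       +  tmp[2*4+i];
    int d02 =  tmp[0*4+i]       -  tmp[2*4+i];
    int s13 =  tmp[1*4+i]       + (tmp[3*4+i] >> 1);
    int d13 = (tmp[1*4+i] >> 1) -  tmp[3*4+i];
    d[0*4+i] = (s02 + s13 + 32) >> 6;
    d[1*4+i] = (d02 + d13 + 32) >> 6;
    d[2*4+i] = (d02 - d13 + 32) >> 6;
    d[3*4+i] = (s02 - s13 + 32) >> 6;
  }
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      p_dst[x] = x264_clip_uint8(p_dst[x] + d[y*4+x]);
    p_dst += 32;
  }
}

/* Hypothetical helper: one 4-point butterfly, same arithmetic as above. */
static void butterfly(int a, int b, int c, int d, int o[4])
{
  int s02 = a + c, d02 = a - c;
  int s13 = b + (d >> 1), d13 = (b >> 1) - d;
  o[0] = s02 + s13;
  o[1] = d02 + d13;
  o[2] = d02 - d13;
  o[3] = s02 - s13;
}

/* Hypothetical reference: volatile temporaries force scalar loads/stores. */
void add4x4_idct_ref(uint8_t *p_dst, const int16_t dct[16])
{
  volatile int16_t tmp[16], d[16];
  int o[4];
  for (int i = 0; i < 4; i++) {
    butterfly(dct[i], dct[4+i], dct[8+i], dct[12+i], o);
    for (int k = 0; k < 4; k++)
      tmp[i*4+k] = (int16_t)o[k];
  }
  for (int i = 0; i < 4; i++) {
    butterfly(tmp[i], tmp[4+i], tmp[8+i], tmp[12+i], o);
    for (int k = 0; k < 4; k++)
      d[k*4+i] = (int16_t)((o[k] + 32) >> 6);
  }
  for (int y = 0; y < 4; y++)
    for (int x = 0; x < 4; x++)
      p_dst[y*32+x] = x264_clip_uint8(p_dst[y*32+x] + d[y*4+x]);
}

/* Same input pattern as the test in this comment. */
void fill_inputs(uint8_t dst[128], int16_t dct[16])
{
  for (int i = 0; i < 16; i++)
    dct[i] = (int16_t)(i*10 + i);
  for (int i = 0; i < 128; i++)
    dst[i] = (uint8_t)i;
}

/* Returns the number of bytes where kernel and reference disagree. */
int count_mismatches(void)
{
  uint8_t got[128], want[128];
  int16_t dct[16];
  int bad = 0;
  fill_inputs(got, dct);
  fill_inputs(want, dct);
  add4x4_idct(got, dct);
  add4x4_idct_ref(want, dct);
  for (int i = 0; i < 128; i++)
    bad += got[i] != want[i];
  return bad;
}
```

A main() that calls abort() when count_mismatches() is nonzero would then
cover all 128 destination bytes, including the ones the kernel is not supposed
to touch.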

Continuing to debug further...
