> > Currently, costs of moves are also used for costs of RTL expressions.   This
> > patch:
> >
> > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> >
> > includes:
> >
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index e943d13..8409a5f 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> >    {4, 4, 4}, /* cost of loading integer registers
> >      in QImode, HImode and SImode.
> >      Relative to reg-reg move (2).  */
> > -  {6, 6, 6}, /* cost of storing integer registers */
> > +  {6, 6, 3}, /* cost of storing integer registers */
> >    2, /* cost of reg,reg fld/fst */
> >    {6, 6, 8}, /* cost of loading fp registers
> >      in SFmode, DFmode and XFmode */

Well, it seems that the patch was fixing things in the wrong spot - the
tables are intended to be mostly latency based.  I think we ought to
document divergences from that, including the benchmarks where the change
helped.  Otherwise it is very hard to figure out why an entry does not
match reality.
> >
> > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > register move.  It caused a regression:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> >
> > Since the cost for SImode store is also used to compute scalar_store
> > in ix86_builtin_vectorization_cost, it changed loop costs in
> >
> > void
> > foo (long p2, long *diag, long d, long i)
> > {
> >   long k;
> >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> >   while (i < k)
> >     diag[i++] = d;
> > }
> >
> > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > instead of 3.
> >
> > My patch separates costs of moves from costs of RTL expressions.  We have
> > a follow-up patch which restores the cost for SImode store back to 6 and
> > leaves the cost of scalar_store unchanged.  It keeps loop unrolling
> > unchanged and
> > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > data now.

I have seen the problem with scalar_store with AMD tuning as well.
It seems to make the SLP vectorizer happy about the idea of turning a
sequence of, say, integer stores into code which moves all the values
into an AVX register and then does one vector store.

The cost model basically compares the cost of N scalar stores to 1
vector store + vector construction.  Vector construction is then
N*sse_op + addss.
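
To make the arithmetic concrete, here is a standalone sketch of that
comparison (not GCC code; the cost values are placeholders for
illustration, not the real skylake_cost entries):

/* Sketch of the SLP store-group cost comparison; all numbers are
   made-up placeholders.  */
#include <stdio.h>

int
main (void)
{
  int n = 8;             /* scalar stores in the group */
  int scalar_store = 3;  /* placeholder cost of one scalar store */
  int vector_store = 6;  /* placeholder cost of one vector store */
  int sse_op = 1;        /* placeholder cost of a simple SSE op */
  int addss = 4;         /* placeholder cost of addss (the AVX256 combine) */

  int scalar_cost = n * scalar_store;
  /* Vectorized side: one vector store plus vec_construct.  */
  int vector_cost = vector_store + n * sse_op + addss;

  printf ("scalar %d vs. vectorized %d\n", scalar_cost, vector_cost);
  return 0;
}

Whether SLP looks profitable hinges directly on the scalar_store entry,
which is why the change above has such a visible effect here.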

With testcase:

short array[8];
void test (short a,short b,short c,short d,short e,short f,short g,short h)
{ 
  array[0]=a;
  array[1]=b;
  array[2]=c;
  array[3]=d;
  array[4]=e;
  array[5]=f;
  array[6]=g;
  array[7]=h;
}
int iarray[8];
void test2 (int a,int b,int c,int d,int e,int f,int g,int h)
{ 
  iarray[0]=a;
  iarray[1]=b;
  iarray[2]=c;
  iarray[3]=d;
  iarray[4]=e;
  iarray[5]=f;
  iarray[6]=g;
  iarray[7]=h;
}

I get the following codegen:


test:
        vmovd   %edi, %xmm0
        vmovd   %edx, %xmm2
        vmovd   %r8d, %xmm1
        vmovd   8(%rsp), %xmm3
        vpinsrw $1, 16(%rsp), %xmm3, %xmm3
        vpinsrw $1, %esi, %xmm0, %xmm0
        vpinsrw $1, %ecx, %xmm2, %xmm2
        vpinsrw $1, %r9d, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm0, %xmm0
        vpunpckldq      %xmm3, %xmm1, %xmm1
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
        vmovaps %xmm0, array(%rip)
        ret

test2:
        vmovd   %r8d, %xmm5
        vmovd   %edx, %xmm6
        vmovd   %edi, %xmm7
        vpinsrd $1, %r9d, %xmm5, %xmm1
        vpinsrd $1, %ecx, %xmm6, %xmm3
        vpinsrd $1, %esi, %xmm7, %xmm0
        vpunpcklqdq     %xmm3, %xmm0, %xmm0
        vmovd   16(%rbp), %xmm4
        vpinsrd $1, 24(%rbp), %xmm4, %xmm2
        vpunpcklqdq     %xmm2, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, iarray(%rip)
        vzeroupper
        ret

which is about 20% slower on my skylake notebook than the
non-SLP-vectorized variant.
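
(Just for reference, a minimal way to measure such a difference - a
sketch, not necessarily the exact setup used here - is to call the
function from a separate translation unit in a loop and build the
callee with and without -fno-tree-slp-vectorize:)

/* bench.c - hypothetical timing harness; test() is the short-store
   function above, compiled in a separate file so it is not inlined.  */
#include <stdio.h>
#include <time.h>

void test (short a, short b, short c, short d,
           short e, short f, short g, short h);

int
main (void)
{
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < 100000000; i++)
    test (i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6, i + 7);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("%.3f s\n",
          (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  return 0;
}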

I wonder if the vec_construct costs should be made more realistic.
It is computed as:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;

So it expects 8 simple SSE operations + one SSE FP arithmetic
operation, while the code above has 8 inter-unit moves + 3 SSE integer
operations to shuffle things around, not to mention the increased
register pressure.

I would say that for integer constructs it is a common case that
values need to be moved from the integer unit to the SSE unit.
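
Just as an illustration of that direction (an untested sketch; the
"gpr_to_sse" entry below is a placeholder name, not an existing
processor_costs field), the integer case could charge those moves
explicitly:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* Untested idea: for integer elements also charge the
             GPR->SSE moves needed to get the values into the vector
             unit ("gpr_to_sse" is a placeholder cost entry).  */
          if (!FLOAT_TYPE_P (TREE_TYPE (vectype)))
            cost += TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->gpr_to_sse;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;
        }

That would still say nothing about latencies, which is the deeper
problem below, but it at least reflects that integer element inserts
are not free SSE ops.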

Overall the problem is deeper, since the vectorizer may really need a
better idea of latencies and throughputs to estimate loop times more
realistically.

One may also want to account somewhat for the fact that stores are
often not part of the hot path and thus their latency is not too
critical, and, on the other hand, for the fact that vector stores
prevent later partial memory stalls...

Honza
